ACL 2024

Multi-Task Inference: Can Large Language Models Follow Multiple Instructions at Once?

Guijin Son, Sangwon Baek, Sangdae Nam, Ilgyun Jeong, Seungone Kim

Abstract

Large language models (LLMs) are typically prompted to follow a single instruction per inference call. In this work, we analyze whether LLMs can also handle multiple instructions simultaneously, which we denote as MULTI-TASK INFERENCE. For this purpose, we introduce the MTI BENCH (Multi-Task Inference Benchmark), a comprehensive evaluation benchmark encompassing 5,000 instances across 25 tasks. Each task in the MTI BENCH involves 2 to 3 sub-tasks. As expected, we first demonstrate that MULTI-TASK INFERENCE reduces total inference time by a factor of 1.46 on average, since it does not require multiple inference calls. Interestingly, contrary to the expectation that LLMs would perform better when tasks are divided, we find that state-of-the-art LLMs, such as LLAMA-2-CHAT-70B and GPT-4, show up to 7.3% and 12.4% improved performance, respectively, with MULTI-TASK INFERENCE compared to SINGLE-TASK INFERENCE on the MTI BENCH. We release the MTI BENCH dataset and our code at this link.
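
To make the distinction between the two settings concrete, the following minimal sketch contrasts them at the prompt level. It is an illustration under stated assumptions, not the prompt format used in the MTI BENCH: the function names `single_task_prompts` and `multi_task_prompt`, the prompt wording, and the example passage are all hypothetical.

```python
from typing import List


def single_task_prompts(context: str, instructions: List[str]) -> List[str]:
    """Single-task setting: one prompt per sub-task, so one inference call each."""
    return [f"{context}\n\nInstruction: {inst}" for inst in instructions]


def multi_task_prompt(context: str, instructions: List[str]) -> str:
    """Multi-task setting: all sub-tasks in one prompt, so a single inference call.

    The numbered-list wording below is an assumption chosen for illustration.
    """
    numbered = "\n".join(
        f"{i}. {inst}" for i, inst in enumerate(instructions, start=1)
    )
    return (
        f"{context}\n\n"
        f"Follow all instructions below and number your answers:\n{numbered}"
    )


# Hypothetical example with two sub-tasks over a shared context.
context = "Passage: The cat sat on the mat."
instructions = ["Summarize the passage.", "Translate the passage into French."]

single = single_task_prompts(context, instructions)
multi = multi_task_prompt(context, instructions)

# Number of model calls each setting would need for this instance.
print(f"single-task calls: {len(single)}, multi-task calls: 1")
```

In this sketch the single-task setting repeats the shared context in every call, whereas the multi-task setting sends it once, which is one plausible reason the single-call setting takes less total time.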