Foundations and Frontiers: Operations Research for Large Language Model Inference

Large Language Model (LLM) inference involves the computational techniques and strategies used to process input prompts and generate responses with a large language model. The field intersects significantly with online optimization, particularly in areas such as online batching, scheduling, and resource allocation, making it a topic of keen interest to the OR/OM community. In this talk, I will introduce a foundational model for managing computational tasks on a single GPU, focusing on reducing redundant computation through a memory-saving mechanism: the KV (key-value) cache, which temporarily stores information from each word the model processes so that it need not be recalculated repeatedly. However, as more words are processed, this storage can quickly reach its limit, and when it does, the system incurs substantial extra costs by reprocessing tasks. We optimize batching and scheduling strategies to manage KV cache memory usage and minimize inference latency, improving efficiency and sustainability.
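To make the cache dynamics concrete, here is a minimal toy sketch (not the talk's formal model) of how KV-cache memory grows as tokens are generated and why exceeding capacity triggers costly reprocessing. All class names, the capacity value, and the eviction rule are hypothetical choices for illustration only.

```python
# Toy illustration of KV-cache growth and the reprocessing cost of eviction.
# All parameters and the eviction policy are assumptions, not the talk's model.
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt_len: int          # tokens in the input prompt
    generated: int = 0       # tokens generated so far


@dataclass
class KVCache:
    capacity: int                       # token budget that fits in GPU memory
    active: list = field(default_factory=list)
    recompute_cost: int = 0             # tokens reprocessed after an eviction

    def used(self) -> int:
        # Each processed token keeps one key/value entry per layer; the
        # per-layer constant is folded into `capacity`, so we just count tokens.
        return sum(r.prompt_len + r.generated for r in self.active)

    def step(self) -> None:
        # Each active request generates one more token, growing its KV entries.
        for r in self.active:
            r.generated += 1
        # On overflow, evict a request; its whole context must be recomputed
        # later, which is the "substantial extra cost" described above.
        while self.used() > self.capacity and self.active:
            victim = self.active.pop()
            self.recompute_cost += victim.prompt_len + victim.generated


cache = KVCache(capacity=4096)
cache.active = [Request(prompt_len=1000), Request(prompt_len=1500)]
for _ in range(800):
    cache.step()
print(cache.used(), cache.recompute_cost)
```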

We address this challenge by first analyzing a semi-online model, where all prompts arrive upfront and must be processed sequentially. For this case, we develop a polynomial-time algorithm that achieves exact optimality. Next, we examine the fully online setting with sequential prompt arrivals. For adversarial arrival sequences, we show that no algorithm can achieve a constant competitive ratio. For stochastic arrivals, we present a fast algorithm that guarantees constant regret, proved via a novel framework based on compensated coupling. Finally, using the Vidur simulator on a public conversation dataset, we compare our algorithm against benchmark algorithms on two linked A100 GPUs running the Llama-70B model. After tuning the benchmarks' parameters, we find that in high-demand scenarios our algorithm's average latency grows only one-third as fast as that of the best benchmark, and in low-demand scenarios only one-eighth as fast.
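For readers who want a feel for the online decision being optimized, the sketch below simulates a simple first-come-first-served admission policy under a KV-cache token budget and reports average request latency. It is an illustrative baseline of the kind our algorithm is compared against, not the algorithm from the talk; the arrival times, sequence lengths, memory budget, and the one-token-per-time-step decoding assumption are all made up.

```python
# Baseline sketch: FCFS admission of requests into a memory-limited batch.
# Everything here (workload, budget, timing model) is a hypothetical example.
import heapq
from collections import deque

MEMORY_BUDGET = 8192           # KV-cache token budget (assumed)

# (arrival_time, prompt_len, output_len) for a few synthetic requests
requests = [(0, 1200, 300), (1, 800, 200), (2, 2000, 500), (3, 600, 150)]

queue = deque()
running = []                   # min-heap of (finish_time, memory_footprint)
latencies = []
time, i = 0, 0

while i < len(requests) or queue or running:
    # Move newly arrived requests into the waiting queue.
    while i < len(requests) and requests[i][0] <= time:
        queue.append(requests[i])
        i += 1
    # Release the memory of requests that have finished by now.
    while running and running[0][0] <= time:
        heapq.heappop(running)
    used = sum(mem for _, mem in running)
    # FCFS admission: start the oldest waiting request if its peak KV
    # footprint (prompt plus all generated tokens) fits in the budget.
    while queue:
        arrival, prompt_len, output_len = queue[0]
        footprint = prompt_len + output_len
        if used + footprint > MEMORY_BUDGET:
            break
        queue.popleft()
        finish = time + output_len          # one decoding step per output token
        heapq.heappush(running, (finish, footprint))
        used += footprint
        latencies.append(finish - arrival)  # end-to-end latency of this request
    time += 1

print("average latency:", sum(latencies) / len(latencies))
```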