
The Anatomy of Fast LLM Inference

Introduction

Large Language Models (LLMs) are everywhere. But as difficult as they are to train, they are just as challenging to run. Every time you ask a model a question (a process called inference), a massive battle for computation and memory is waged in the background.

The core problem that makes LLMs slow and expensive is that they are autoregressive: they generate one word (or token) at a time. To produce "beautiful" after the phrase "The weather is," the model must look at all the preceding words ("The," "weather," "is"). After "beautiful" is generated, it must again look at the entire new phrase ("The," "weather," "is," "beautiful") just to guess the next word.
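To make that concrete, here is a minimal sketch of naive greedy decoding in PyTorch. The `model` here is a hypothetical callable that maps a tensor of token IDs to next-token logits; the point is simply that every iteration feeds the entire sequence back through the model.

```python
import torch

def generate_naive(model, prompt_ids: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    """Naive autoregressive decoding: the whole sequence is re-run through
    the model at every step, so the work per step grows with length."""
    ids = prompt_ids                                          # shape: (1, prompt_len)
    for _ in range(max_new_tokens):
        logits = model(ids)                                   # full forward pass over ALL tokens so far
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy pick at the last position
        ids = torch.cat([ids, next_id], dim=1)                # grow the sequence and repeat
    return ids
```

Every token that has already been processed gets processed again on the next step, and that repeated work is exactly what the techniques below attack.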

This quadratic complexity is a nightmare for both computation and memory. Fortunately, engineers and researchers have developed an incredible stack of optimizations to solve this. Here are the critical techniques that make modern AI possible.

1. The Foundation: KV Cache (Key-Value Cache)

Let's start with the most fundamental optimization. The biggest problem with the autoregressive process is recalculating the entire past for every new word.

  • The Problem: When predicting the next word after "The weather is beautiful," it's a colossal waste to re-calculate the Key (K) and Value (V) vectors (core components of the Transformer's attention mechanism) for "The," "weather," and "is."
  • The Solution: After calculating these Key and Value vectors, we store them in a cache in GPU VRAM. When predicting the next word, we compute the K/V only for the newest token ("beautiful") and append it to the cache (see the sketch after this list).
  • The Result: The per-token computational load drops from $O(n^2)$ to $O(n)$; we no longer recompute the entire past. However, this creates a new problem: memory. The KV Cache consumes VRAM very quickly.
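Here is a minimal sketch of one decode step for a single attention head with a KV cache, in plain PyTorch. The projection matrices and cache tensors are illustrative stand-ins, not any particular model's parameters.

```python
import torch
import torch.nn.functional as F

def decode_step_with_cache(x_new, w_q, w_k, w_v, k_cache, v_cache):
    """One decode step for a single attention head.

    x_new:            (1, d_model)     hidden state of the newest token only
    w_q, w_k, w_v:    (d_model, d_head) projection matrices
    k_cache, v_cache: (t, d_head)      keys/values already computed for past tokens
    """
    q = x_new @ w_q                                       # project ONLY the newest token
    k_cache = torch.cat([k_cache, x_new @ w_k], dim=0)    # append its key...
    v_cache = torch.cat([v_cache, x_new @ w_v], dim=0)    # ...and value to the cache
    scores = (q @ k_cache.T) / k_cache.shape[-1] ** 0.5   # attend over the whole cache: (1, t+1)
    out = F.softmax(scores, dim=-1) @ v_cache             # weighted sum of cached values
    return out, k_cache, v_cache                          # caller keeps the grown cache
```

Only one row is projected per step; the rest is a lookup into the cache, which is why memory (not compute) becomes the new bottleneck.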

2. Reducing Memory Size: Grouped-Query Attention (GQA)

The KV Cache solved computation but created a memory monster, especially with the original Multi-Head Attention (MHA).

  • The Problem: In MHA, if you have 32 "heads," each head has its own Key (K) and Value (V) projection. That means 32 separate K/V caches per layer, 32 times the memory of a single shared K/V head (the Multi-Query Attention extreme).
  • The Solution: Grouped-Query Attention (GQA) is the "golden mean." Instead of giving every Query head its own K/V head, it groups them: for example, 8 K/V heads, each shared by a group of 4 Query (Q) heads (see the sketch after this list). This is the architecture used by models like Llama 2 (70B), Llama 3, and Mistral.
  • The Result: GQA drastically reduces the KV Cache size and memory bandwidth requirements with very little loss in model quality.
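A rough PyTorch sketch of the grouping, with made-up shapes (32 query heads, 8 K/V heads) chosen only to mirror the example above: the cache stores 8 K/V heads instead of 32, and each K/V head is broadcast to its group of 4 query heads at attention time.

```python
import torch
import torch.nn.functional as F

# Toy shapes only (not any real model's config): 32 query heads share 8 K/V heads.
n_q_heads, n_kv_heads, d_head, ctx_len = 32, 8, 128, 1024
group_size = n_q_heads // n_kv_heads           # 4 query heads per K/V head

q = torch.randn(n_q_heads, 1, d_head)          # queries for the newest token
k = torch.randn(n_kv_heads, ctx_len, d_head)   # the cache holds 8 K heads...
v = torch.randn(n_kv_heads, ctx_len, d_head)   # ...not 32, so it is 4x smaller than MHA

# Broadcast each K/V head to its group of 4 query heads for the attention math.
k_shared = k.repeat_interleave(group_size, dim=0)        # (32, ctx_len, d_head)
v_shared = v.repeat_interleave(group_size, dim=0)

scores = q @ k_shared.transpose(-1, -2) / d_head ** 0.5  # (32, 1, ctx_len)
out = F.softmax(scores, dim=-1) @ v_shared               # (32, 1, d_head)
print(out.shape)                                         # torch.Size([32, 1, 128])
```

The cache, and the memory bandwidth needed to stream it every step, shrinks by the ratio of query heads to K/V heads, here 4x.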

3. Managing Memory: PagedAttention

GQA shrank the cache, but we still waste memory managing it. Different user prompts have different lengths, leading to fragmentation.

  • The Problem: Traditional KV Caching allocates a large, contiguous block of VRAM for each request. If you reserve space for 4,000 tokens but the request only uses 50, the VRAM reserved for the other 3,950 token slots sits idle. This is internal fragmentation.
  • The Solution: PagedAttention (from the vLLM project) brings the concept of virtual memory from operating systems to the GPU. VRAM is divided into small, fixed-size "pages." A sequence's KV Cache no longer needs to live in one big block; it can be scattered across these pages, which are tracked by a per-sequence "page table" (a toy version of this bookkeeping is sketched after this list).
  • The Result: Memory utilization can exceed 95%. There is almost no waste. This allows the GPU to serve many more users at the same time (a much higher batch size), causing throughput to skyrocket.
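The core bookkeeping is easy to sketch. Below is a toy page-table allocator in Python, written in the spirit of PagedAttention but not vLLM's actual API; the class, block size, and method names are all illustrative.

```python
# Toy block allocator in the spirit of PagedAttention (not vLLM's real implementation).
BLOCK_SIZE = 16  # tokens per physical block (an assumption for this sketch)

class BlockAllocator:
    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))   # pool of free VRAM "pages"
        self.page_tables = {}                          # seq_id -> list of physical block ids

    def append_token(self, seq_id: str, token_pos: int) -> tuple[int, int]:
        """Return (physical_block, offset) where this token's K/V should be written."""
        table = self.page_tables.setdefault(seq_id, [])
        if token_pos % BLOCK_SIZE == 0:                # current block is full -> grab a new one
            table.append(self.free.pop())              # any free block will do, no contiguity needed
        return table[-1], token_pos % BLOCK_SIZE

    def release(self, seq_id: str):
        """When a request finishes, its blocks go straight back to the pool."""
        self.free.extend(self.page_tables.pop(seq_id, []))

alloc = BlockAllocator(num_physical_blocks=1024)
for pos in range(50):                                  # a 50-token sequence touches only 4 blocks
    block, offset = alloc.append_token("user-42", pos)
```

A 50-token sequence pins just 4 blocks of 16 token slots each, instead of a 4,000-token contiguous reservation; everything else stays in the free pool for other requests.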

*This post originally appeared on my Medium.*

Enjoyed this article? You can also read and engage with it on Medium:

Read on Medium