Welcome to Day 2 of the GPU Challenge! Before we dive into CUDA kernels and memory optimization, we need a clear view of the system we're building.
Large Language Models (LLMs) aren't just a single neural network. They're complex systems designed to run massive models efficiently and serve under different workloads' requirements. The goal: maximize throughput (tokens per second) and minimize latency (response time). But here's the thing: to understand where GPUs fit, we need to think like system architects from a birds' eyes overview, not just programmers.
Think of an LLM system as a three-story building. Each floor has a different purpose, different tools, and different optimization strategies. Our GPU programming journey will take us from the basement (raw hardware) to the penthouse (distributed systems). Let's tour the building.
An LLM system can be broken into three layers, each addressing a distinct challenge in running large models at scale.
| Layer | Operations | Key Focus | Representative Tools |
|---|---|---|---|
| Kernel | Scalar/Vector/Tile instructions | Micro-architecture optimization | CUDA C/C++, Triton, PTX, CUTLASS, Mojo |
| Graph | Tensor primitives | Model graph optimization | PyTorch, TensorRT, ONNX, JAX, TinyGrad |
| System | Sharding, Batching, Orchestration | Multi-GPU Coordination & Communication | SGLang, vLLM, TensorRT-LLM, DeepSpeed, Megatron-LM |
1. The Kernel Layer: Raw GPU Execution
This is the basement—where code meets silicon. A kernel is the smallest unit of work that runs directly on the single GPU's cores. Think matrix multiplication, attention computation, or element-wise operations. The focus is on maximizing performance at the micro-architecture level.
What's Happening? Kernels are optimized to achieve high arithmetic intensity (computations per byte of memory accessed). Since memory access is slower than computation, the goal is to perform as much math as possible per data fetch—just like we learned with our GPU monster yesterday.
The kernel layer uses three main optimization strategies:
| Technique | Description | Examples |
|---|---|---|
| Data Locality | Keep data close to compute units | Use registers, shared memory (including tiling), L1/L2 cache |
| Data Movement Efficiency | Optimize memory access patterns | Swizzling, coalescing, overlapping compute with memory |
| Special Instructions | Leverage hardware accelerators | TensorCore MMA, Hopper TMA, vectorized operations |
2. The Graph Layer: Model Optimization
This is the main floor—where we view the model as a computational graph. Instead of optimizing individual kernels, the focus is on streamlining the entire sequence of operations. Think of it as optimizing the recipe, not just the individual cooking techniques.
What's Happening? Frameworks analyze the graph and rewrite it for efficiency, reducing redundant computations and memory usage. They ask questions like: "Can we combine these operations? Can we eliminate this memory copy? Can we compute this once instead of three times?"
The graph layer employs several key optimization strategies:
| Technique | Description | Impact |
|---|---|---|
| Operator Fusion | Combine multiple operations into single kernels | Reduces memory I/O, eliminates intermediate writes |
| Merging | Mathematically combine operations (e.g., Conv+BatchNorm) | Reduces computation and memory usage |
| Quantization | Use lower precision (FP16, INT8, FP4) | Increases throughput, reduces memory |
| Sparsity | Skip zero computations (static/dynamic) | Reduces computation for sparse models |
| JIT Compilation | Convert dynamic graphs to optimized static graphs | Eliminates Python overhead, enables optimizations |
Personal Thoughts: In theory, a “smart-enough” compiler could make use of hardware characteristics and algorithmic requirements to auto‑generate high‑performance kernels, letting users work purely at the higher levels. But no such perfect compiler for GPU exists today, so researchers and engineers still need to work on handcraft kernels for the foreseeable future. I highly recommend reading Chris Lattner’s blog series Democratizing AI Compute at Modular for more context. Besides, cutting training and inference costs remains vital: GPU FLOPS keep outpacing HBM bandwidth growth, so manual effort to turn memory‑bound tasks into compute‑bound ones is still essential. Just a late-night thought I had, happy to hear any feedback or continue the discussion! Feel free to reach out.
3. The System Layer: Multi-GPU Orchestration
This is the penthouse—where we coordinate entire clusters of GPUs. Massive models like GPT-4 or Claude exceed the capacity of a single GPU. This layer treats the model as a distributed program running across hundreds or thousands of GPUs.
What's Happening? The system manages compute, memory, and communication at scale. It's like conducting an orchestra where each musician (GPU) must play their part perfectly, and the conductor must ensure they're all synchronized.
Key Techniques:
-
Parallelism Strategies
-
Intelligent Batching: Like vLLM uses PagedAttention to reduce KV cache fragmentation and SGLang uses RadixAttention to increase KV cache reuse. Both are sophisticated ways of promote efficient use of KV cache.
-
Communication Optimization: Using high-speed interconnects (NVLink, InfiniBand) to minimize data transfer overhead within nodes and networks. Other than that, frameworks like Triton-distributed is a good example of overlapping communication stages of aggregate operations.
The system layer coordinates multiple GPUs using various parallelism techniques:
| Parallelism Type | How It Works | Latency Impact | Memory Impact | Communication Cost |
|---|---|---|---|---|
| Data Parallel | Same model, different batches | ❌ No improvement | ❌ Full model per GPU | ✅ Low (inference) |
| Pipeline Parallel | Different layers on different GPUs | ❌ No improvement | ✅ Saves memory | ✅ Low |
| Tensor Parallel | Split weight matrices across GPUs | ✅ Improves latency | ✅ Saves memory | ❌ High |
| Expert Parallel | Distribute MoE experts across GPUs | ✅ Improves latency (large batch) | ✅ Saves memory | 🔶 Medium |
| Sequence Parallel | Split along sequence dimension | ✅ Improves latency (long context) | ✅ Saves memory | ❌ High |
Different layers optimize for different metrics:
| Layer | Primary Metrics | Secondary Metrics |
|---|---|---|
| Kernel | FLOPS, Memory Bandwidth Utilization | Latency per operation |
| Graph | Model FLOPS Utilization (MFU) | Memory efficiency, Compilation time |
| System | Tokens Per Second (TPS), Time To First Token (TTFT) | Time Per Output Token (TPOT) |
Here's how the three layers interact in a complete LLM system:
┌─────────────────────────────────────────────────────────────┐
│ SYSTEM LAYER │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ GPU 1 │ │ GPU 2 │ │ GPU N │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │ │
│ │ │ GRAPH │ │ │ │ GRAPH │ │ │ │ GRAPH │ │ │
│ │ │ LAYER │ │ │ │ LAYER │ │ │ │ LAYER │ │ │
│ │ │┌───────┐│ │ │ │┌───────┐│ │ │ │┌───────┐│ │ │
│ │ ││KERNELS││ │ │ ││KERNELS││ │ │ ││KERNELS││ │ │
│ │ │└───────┘│ │ │ │└───────┘│ │ │ │└───────┘│ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ └───────────────┼───────────────┘ │
│ Batching, Scheduling, Load Balancing, Communication │
└─────────────────────────────────────────────────────────────┘
Flow: Requests come in → System layer batches and distributes → Graph layer optimizes operations → Kernels execute on hardware → Results flow back up
Running an LLM efficiently requires optimizing all three layers. You can't just write a fast kernel and call it a day. Here's why:
- The System Layer ensures your GPUs are fully utilized across a cluster and requests are batched efficiently.
- The Graph Layer streamlines your model's operations and eliminates wasteful computations.
- The Kernel Layer squeezes every drop of performance from the hardware.
Miss any layer, and your system will have bottlenecks. It's like having a Ferrari engine in a horse-drawn carriage, meaning the potential is there, but the system design limits performance.
Our challenge starts in the kernel layer, where we'll learn to write high-performance GPU code. This high-level map gives us context for where our work fits in the bigger picture.
Understanding the trade-offs helps identify where to focus optimization efforts:
| Constraint | Kernel Layer | Graph Layer | System Layer |
|---|---|---|---|
| Compute Bound | Optimize arithmetic intensity | Use quantization, sparsity | Choose compute-optimal parallelism |
| Memory Bound | Improve data locality, fusion | Graph optimization, operator fusion | Use memory-efficient parallelism |
| Communication Bound | N/A (single GPU) | Minimize intermediate tensors | Optimize parallelism strategy |
| Latency Critical | Reduce kernel launch overhead | JIT compilation, CUDA graphs | Tensor parallelism, speculative decoding |
| Throughput Critical | Maximize occupancy | Batch operations | Data parallelism, continuous batching |
Recognizing these patterns helps diagnose performance issues:
| Symptom | Likely Layer | Common Causes | Solutions |
|---|---|---|---|
| Low GPU utilization (<50%) | Kernel | Poor memory access, low arithmetic intensity | Optimize memory patterns, use tensor cores |
| High GPU util, slow overall | Graph | Inefficient operators, poor fusion | Operator fusion, quantization |
| Good single GPU, poor scaling | System | Communication overhead, load imbalance | Better parallelism strategy, batching |
| High latency, good throughput | System | Poor request scheduling | Continuous batching, speculative decoding |
Let's test your understanding with a quick scenario:
Scenario: You have a perfectly optimized matrix multiplication kernel (Kernel Layer) running at 90% of theoretical peak performance. Your model uses the latest graph optimizations (Graph Layer) with perfect operator fusion. But your overall system throughput is still terrible.
Question: Which layer is likely the bottleneck, and what might be wrong?
Click to see the solution
Answer: The bottleneck is likely in the System Layer.
Even with perfect kernels and graph optimization, you can still have system-level issues:
- Poor Batching: Your system might be processing requests one at a time instead of batching them efficiently.
- GPU Idle Time: GPUs might be waiting for data or for other GPUs to finish their work.
- Communication Overhead: In multi-GPU setups, time spent transferring data between GPUs can dominate.
- Load Imbalance: Some GPUs might finish their work much earlier than others, leaving them idle.
This is why frameworks like SGLang or vLLM focus heavily on system-level optimizations like continuous batching and efficient memory management.
Some cutting-edge techniques span multiple layers:
| Technique | Layers Involved | Description |
|---|---|---|
| Speculative Decoding | Graph + System | Use small model to predict, verify with large model |
| MegaKernels | Kernel + Graph | Fuse entire transformer blocks into single kernels |
| Algorithm-Hardware Co-design | All layers | Design algorithms specifically for hardware constraints |
| Dataflow Architectures | All layers | Execute operations when data becomes available |
| Prefill-Decode Disaggregation | System | Separates prefill and decode phases onto different hardware for optimized latency and resource allocation |
| Flash Attention | Kernel | Optimized attention mechanism with reduced memory bandwidth and improved efficiency for long sequences |
Tomorrow, we'll roll up our sleeves and dive into the kernel layer. We'll write our first CUDA kernel and see how low-level optimizations can dramatically impact performance. Expect hands-on GPU programming, memory access patterns, and practical examples that build on our GPU monster analogy.
The foundation is set. Now let's start building.
- "Understanding LLM System with 3-layer Abstraction" by Huizi Mao: A concise breakdown of LLM system architecture that inspired this framework.
- "CSE-234: Data Systems for Machine Learning" by Hao Zhang Lecture Notes: Detailed insights into ML system design and optimization strategies.
Special Thanks: A huge shoutout to my friend Chun-Mao (Michael) Lai at UCSD for introducing me to the amazing and well-structured UCSD CSE234 course.
