- Your GPU is a Monster. Don't Let It Starve -> Max out every part of the GPU (napkin math first)
- A High-Level Overview of LLM Systems -> A broad overview without getting lost in technical details
- Writing Your First CUDA Kernel -> An introduction to GPU programming with a simple CUDA kernel
- The Art of Pointer Arithmetic -> The underlying memory layout of tensor representations
- Tiling and Shared Memory -> Dividing the matrix into blocks that fit within the cache
- Global Memory Coalescing -> Combining adjacent accesses into a single memory transaction
- RL in LLM Post-training -> How RL enables LLMs to reason
- RL Framework Design Space -> A discussion of RL infrastructure form factors
- Setting Up RL Infra -> My logs from playing with Slime
- Notes from VeRL Talk -> A write-up of Haibin Lin's introduction and Q&A on VeRL at the PyTorch webinar
- Don't Just .cast() -> Notes from the video: mxfp8, mxfp4, nvfp4 formats and applications in PyTorch - Vasily Kuznetsov & Driss Guessous, Meta
- The Missing 10 Bits -> Notes from the blog: Some Matrix Multiplication Engines Are Not As Accurate As We Thought