A curated list of resources for mastering GPU engineering, from architecture and kernel programming to large-scale distributed systems and AI acceleration.
- Programming Massively Parallel Processors: A Hands-on Approach – David B. Kirk & Wen-mei W. Hwu. The canonical introduction to CUDA, memory hierarchies, and parallel patterns. Amazon, notes: Abi's Concise Notes
- CUDA by Example – Jason Sanders & Edward Kandrot. A practical introduction to CUDA for beginners. Amazon
- The Ultra-Scale Playbook: Training LLMs on GPU Clusters – Hugging Face. Web Version
- CUDA – NVIDIA's proprietary GPU programming platform (a minimal kernel sketch follows this list).
- ROCm – AMD's open compute stack.
- OpenCL – Cross-platform parallel computing standard.
- SYCL / oneAPI – Intel's C++ abstraction for heterogeneous compute.
- Vulkan Compute – Low-level GPU compute API.
- Kompute – Higher-level general-purpose GPU compute framework built on Vulkan.
- Metal Performance Shaders – Apple's GPU framework.
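
Most of these platforms share the same core model: launch a kernel over a grid of threads, each handling one element. A minimal CUDA sketch of that model (the `vecAdd` kernel and sizes below are illustrative, not taken from any listed resource):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Each thread computes one element of c = a + b.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard against overrun
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified memory keeps the example short; explicit cudaMemcpy is more common.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int block = 256;                      // threads per block
    int grid  = (n + block - 1) / block;  // enough blocks to cover n
    vecAdd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);          // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Compile with `nvcc vecadd.cu -o vecadd`; the same grid/block/kernel structure carries over (with different syntax) to ROCm, OpenCL, and SYCL.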
- NVIDIA Nsight Systems – System-wide GPU profiler.
- Nsight Compute – Kernel-level performance analysis.
- Occupancy Calculator – NVIDIA spreadsheet for kernel configuration.
- CUTLASS – CUDA templates for linear algebra subroutines.
- TensorRT – High-performance deep learning inference.
- OpenAI Triton – Python DSL for writing high-performance GPU kernels.
- Roofline Model – Analytical model for reasoning about compute/memory bottlenecks (a sketch of the formula follows this list).
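
For quick back-of-envelope use, the roofline model bounds a kernel's attainable throughput by the smaller of peak compute and bandwidth-limited throughput; a sketch of the standard formulation:

```latex
% Arithmetic intensity I (FLOPs per byte moved) determines the bound:
%   P_peak : peak compute throughput (FLOP/s)
%   B_mem  : peak DRAM bandwidth (bytes/s)
P(I) = \min\!\left(P_{\text{peak}},\; I \cdot B_{\text{mem}}\right),
\qquad I = \frac{\text{FLOPs}}{\text{bytes moved}}
```

For example, at roughly 19.5 TFLOP/s FP32 peak and ~1.5 TB/s HBM bandwidth (published A100 figures), a kernel with I = 4 FLOP/byte is memory-bound at about 6 TFLOP/s.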
- NVIDIA Ampere Whitepaper
- AMD RDNA & CDNA Architectures
- SIMT execution and warp scheduling
- Memory hierarchy and coalescing
- Shared memory and cache optimization (a coalesced, tiled-transpose sketch follows this list)
- Warp divergence and thread occupancy
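
These concepts come together in the classic tiled-transpose pattern: stage a tile in shared memory so both the global read and the global write are coalesced, and pad the tile by one element to avoid shared-memory bank conflicts. A minimal sketch (illustrative, assuming 32×32 tiles and row-major storage):

```cuda
#include <cuda_runtime.h>

#define TILE 32

// Tiled matrix transpose: both the load from `in` and the store to `out`
// hit global memory coalesced; the +1 padding avoids bank conflicts.
__global__ void transposeTiled(const float* in, float* out,
                               int width, int height) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read

    __syncthreads();

    // Swap block coordinates so the write is also coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}
```

Launch with `dim3 block(TILE, TILE)` and a grid that covers the matrix; without the shared-memory staging, either the read or the write would be strided and uncoalesced.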
- NCCL – Multi-GPU communication primitives (a minimal all-reduce sketch follows this list).
- vLLM – High-throughput inference and serving engine for LLMs.
- Hugging Face Accelerate – Simple abstractions for distributed training.
- SGLang – Fast serving framework for LLMs and vision-language models.
- Prime Intellect – Decentralized, globally distributed AI training infrastructure.
- TensorRT-LLM – NVIDIA library for optimizing LLM inference.
- TGI by Hugging Face – Text Generation Inference server for deploying LLMs.
- Horovod – Distributed deep learning across GPUs.
- NVLink & PCIe Topology – GPU interconnects and bandwidth optimization.
- GPUDirect RDMA – Zero-copy GPU networking.
- Ray Train, DeepSpeed, Megatron-LM – Large-scale GPU orchestration frameworks.
- Iris by AMD – Open-source multi-GPU programming framework built for compiler-visible performance and optimized multi-GPU execution.
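
As a concrete taste of NCCL (referenced above), a single-process all-reduce across all visible GPUs; a minimal sketch assuming at most 8 devices, with error checking omitted for brevity:

```cuda
#include <nccl.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);            // assumes nDev <= 8 below

    const int count = 1024;
    ncclComm_t comms[8];
    float* buf[8];
    cudaStream_t streams[8];
    int devs[8];
    for (int i = 0; i < nDev; ++i) devs[i] = i;

    ncclCommInitAll(comms, nDev, devs);   // one communicator per GPU

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // Group the per-GPU calls so NCCL launches them as one collective.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("all-reduce complete on %d GPUs\n", nDev);
    return 0;
}
```

Real multi-node code typically uses one rank per GPU with `ncclCommInitRank` under MPI or a launcher like `torchrun` instead of the single-process form above.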
- CUDA C++ Programming Guide
- Triton Tutorials (OpenAI)
- CUDA in 12 hours by freeCodeCamp – Video and Repo
- Stanford CS149: Parallel Computing – Fall 2025
- CMU 15-418/618: Parallel Computer Architecture & Programming
- MIT 6.5940: TinyML and Efficient Deep Learning Computing
- GPU MODE video lecture series
- Red Hat vLLM Office Hours video series
- Optimization Techniques for GPU Programming – Hijma, Pieter, et al.
- Efficient Multi-GPU Programming in Python: Reducing Synchronization and Access Overheads – Oden, Lena, and Klaus Nölp
- Evolving GPU Architecture – Kirk & Hwu
- Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision – Wei Gao et al.
- Optimizing Machine Learning Models with CUDA: A Comprehensive Performance Analysis – Niteesh, L., and M. B. Ampareeshan
- NVIDIA Research Papers on Model Parallelism and Megatron-LM
- GPU Virtualization and Multi-Tenant Scheduling
- A Survey of Multi-Tenant Deep Learning Inference on GPU
- Efficient Performance-Aware GPU Sharing with Compatibility and Isolation through Kernel Space Interception
- nvprof, nvvp, Nsight Systems / Compute – NVIDIA profiling tools (a CUDA-event timing sketch follows this list).
- cuda-memcheck, compute-sanitizer – Memory and correctness tools.
- GPGPU-Sim, Accel-Sim – GPU simulation frameworks.
- Perfetto, Nsight UI – Visual profilers for tracing GPU workloads.
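
Alongside these profilers, CUDA events are the lightweight way to time a single kernel from host code. A minimal sketch (the `busyKernel` below is a hypothetical placeholder workload):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void busyKernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 24;
    float* x;
    cudaMalloc(&x, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                  // timestamp before launch
    busyKernel<<<(n + 255) / 256, 256>>>(x, n);
    cudaEventRecord(stop);                   // timestamp after launch
    cudaEventSynchronize(stop);              // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // GPU-side elapsed time
    printf("kernel took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    return 0;
}
```

Events measure device-side time between the two recorded points, so they avoid the host-clock noise that `std::chrono` around an asynchronous launch would pick up.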
- LeetGPU
- GPU MODE Discord
- GPU Glossary – A dictionary of terms related to programming GPUs
- PyTorch CUDA Extensions – Custom kernels for PyTorch.
- JAX + XLA – Compiler-based GPU vectorization.
- TensorFlow XLA Compiler – Ahead-of-time GPU graph compilation.
- FlashAttention, FlashConv – Kernel optimization techniques for transformers.
- DeepSpeed, FSDP, Megatron-LM – Distributed training systems.
- FlashAttention and PagedAttention – Attention-kernel and KV-cache memory-management techniques for LLM serving.
- Matmul Operations (a tiled-matmul kernel sketch follows this list)
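
To ground the matmul entry above, here is the textbook shared-memory tiled matmul that libraries like CUTLASS start from and then heavily optimize; a minimal sketch assuming square, row-major matrices:

```cuda
#include <cuda_runtime.h>

#define TILE 16

// Shared-memory tiled matmul: C = A * B, all matrices N x N, row-major.
// Each block computes one TILE x TILE output tile, streaming tiles of A
// and B through shared memory to cut global-memory traffic.
__global__ void matmulTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        // Stage one tile of A and one of B (zero-padded at the edges).
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] =
            (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < N && col < N) C[row * N + col] = acc;
}
```

Production kernels layer register blocking, double buffering, and tensor-core instructions on top of this same tiling idea.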
- GPU scheduling algorithms and runtime systems
- Memory oversubscription and unified memory models
- Resource allocation in GPU clusters
- GPU virtualization
- Kernel fusion and graph execution (a fusion sketch follows this list)
- Dataflow optimization
- Persistent threads model
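
To illustrate the kernel-fusion entry above: fusing two elementwise kernels into one halves the global-memory traffic, since each element is read and written once instead of twice. A minimal sketch with hypothetical `scale` and `bias` ops:

```cuda
#include <cuda_runtime.h>

// Unfused: two kernels, two full round trips through global memory.
__global__ void scaleKernel(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}
__global__ void biasKernel(float* x, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += b;
}

// Fused: one kernel, one read and one write per element. This is the
// transformation kernel-fusion passes in XLA, Triton, and torch.compile
// aim to apply automatically.
__global__ void scaleBiasFused(float* x, float s, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * s + b;
}
```

Graph execution (e.g. CUDA Graphs) attacks the complementary cost: per-launch CPU overhead, by recording and replaying the whole launch sequence.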
Contributions welcome!
Please read the contribution guidelines before submitting a pull request.
CC BY 4.0 – feel free to share and adapt with attribution.
Inspired by:
"GPU engineering is not just about writing kernels. It's about understanding how systems work." – Model Craft