hopper
cudnn_frontend provides a C++ wrapper for the cuDNN backend API, along with samples showing how to use it
A Python DSL for writing NVIDIA PTX for Hopper and Blackwell in JAX and PyTorch
Minimal GPU runtime for Python: high-performance CUDA kernels, memory management, and LLM inference without heavy dependencies
Cross-platform FlashAttention-2 Triton implementation for Turing+ GPUs with custom configuration mode