kernel-fusion
AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming
MLX + Metal implementation of mHC: Manifold-Constrained Hyper-Connections by DeepSeek-AI.
Fused Triton kernels for TurboQuant KV cache compression — 2-4 bit quantization with RHT rotation. Drop-in HuggingFace & vLLM integration. Up to 4.9x KV cache compression for Llama, Qwen, Mistral, and more.