tensor-cores
🤖 FFPA: Extends FlashAttention-2 (forward & backward) via Split-D for large headdim, delivering a 1.5~3× speedup over SDPA 🎉
INT8 Sparse Tensor Core GEMM kernels for PyTorch — built for Windows