Quantization Python Packages

ctranslate2

Fast inference engine for Transformer models

10.8M 5K 496

faster-whisper

Faster Whisper transcription with CTranslate2

9M 24K 2K

bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.

6.6M 8K 877

torchao

PyTorch native quantization and sparsity for training and inference

3.3M 3K 553

optimum

🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy-to-use hardware optimization tools

2M 3K 658

onnx2tf

A tool for converting ONNX files to LiteRT/TFLite/TensorFlow, PyTorch native code (nn.Module), TorchScript (.pt), state_dict (.pt), Exported Program (.pt2), and Dynamo ONNX. It also supports direct conversion from LiteRT to PyTorch.

1.4M 971 105

llamafactory

Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)

863K 73K 9K

nncf

Neural Network Compression Framework for enhanced OpenVINO™ inference

851K 1K 299

optimum-quanto

A PyTorch quantization backend for Optimum

283K 1K 91

llmcompressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM

175K 3K 562

sageattention

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized attention that achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.

144K 3K 441

tensorflow-model-optimization

A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.

87K 2K 348

auto-round

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

76K 2K 149

navec

Compact, high-quality word embeddings for the Russian language

66K 218 19

auto-gptq

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

66K 5K 539

mblt-model-zoo

Mobilint Model Zoo Project

62K 22 1

qonnx

QONNX: Arbitrary-Precision Quantized Neural Networks in ONNX

60K 191 59

turbovec

A vector index built on TurboQuant, written in Rust with Python bindings

44K 13K 1K

intel-extension-for-pytorch

A Python package that extends official PyTorch to improve performance on Intel platforms

38K 2K 318

aimet-torch

AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.

36K 3K 457

gptqmodel

LLM quantization (compression) toolkit with hardware acceleration support for Nvidia, AMD, and Intel GPUs and Intel/AMD/Apple CPUs via HF, vLLM, and SGLang.

34K 1K 191

neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, and ONNX Runtime

31K 3K 313

discrete-key-value-bottleneck-pytorch

Implementation of Discrete Key / Value Bottleneck, in PyTorch

31K 88 3

fms-model-optimizer

FMS Model Optimizer is a framework for developing reduced precision neural network models.

28K 21 20