GGUF Python Packages

auto-round

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

76K 2K 149

whichllm

Find the local LLM that actually runs and performs best on your hardware. Ranked by real, recency-aware benchmarks, not parameter count. One command, run it instantly.

53K 6K 293

lilbee

A local AI search engine: it runs and manages local AI models, searches your files and code, and crawls the web, all in one program. Cited answers, local-first, with an MCP server for your coding agent. TUI, CLI, REST API, and Python library. Works with Ollama and LM Studio.

28K 36 4

llmfit

Hundreds of models & providers. One command to find what runs on your hardware.

23K 29K 2K

auto-round-lib

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

23K 2K 149

auto-round-nightly

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

19K 2K 149

gguf-connector

GGUF (GPT-Generated Unified Format) connector

18K 57 12

soup-cli

Soup turns the pain of LLM fine-tuning into a simple workflow. One config, one command, done.

14K 72 21

hf-mem

A CLI to estimate inference memory requirements for Hugging Face models, written in Python.

13K 936 84

transcribe-cpp-native

ggml speech-to-text inference for 16+ model families

9K 95 5

omi-med-stt

On-device English medical speech-to-text — CLI for Omi Med STT v1 (MLX / NeMo / parakeet.cpp)

8K 5 0

llama-cpp-py-sync

Auto-synced CFFI ABI python bindings for llama.cpp with prebuilt wheels (CPU/CUDA/Vulkan/Metal).

7K 4 1

sawyer-core

Distributed MoE inference network — the load is split, friends help.

7K 0 0

jang

JANG — GGUF for MLX. YOU MUST USE JANG_Q RUNTIME. Adaptive Mixed-Precision Quantization + Runtime for Apple Silicon

6K 201 24

quantcpp

LLM inference with 7x longer context. Pure C, zero dependencies. Lossless KV cache compression + single-header library.

6K 395 43

outetts

Interface for OuteTTS models.

5K 1K 117

transcribe-cpp

ggml speech-to-text inference for 16+ model families

5K 95 5

opencode-llmstack

Cursor-Auto / Claude-tier-style serving for local GGUF models on Mac (M4 Max, 64 GB). FastAPI router fronts llama-swap + llama.cpp, classifying each request into a coder, planner, or uncensored-planner tier. OpenAI-compatible API, opencode integration, per-project subshell, one `llmstack` console-script.

2K 0 0

ovos-gguf-embeddings-plugin

DEPRECATED — folded into OpenVoiceOS/ovos-gguf-plugin (keeps plugin name 'ovos-gguf-embeddings-plugin'). Install ovos-gguf-plugin.

2K 0 0

ctransformer-core

solo connector core built on ctransformers

2K 0 0

gguf-core

a simple way to interact with llama using gguf

2K 5 1

llama-core

solo connector core built on llama.cpp

2K 1 1

auto-round-hpu

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

2K 2K 149

gguf-node

GGUF node for ComfyUI

2K 237 18