Kv Cache Python Packages

lmcache

LMCache: Supercharge Your LLM with the Fastest KV Cache Layer

77K 10K 1K

tardigrade-db

LLM-native database kernel — persistent KV cache memory for autonomous AI agents

12K 0 1

squish-ai

⚡️ The fastest way to run local LLMs on Apple Silicon — sub-second model loads, beats Ollama on throughput, tail latency, and full-response time. OpenAI/Ollama-compatible. No cloud, no API keys.

9K 10 0

quantcpp

LLM inference with 7x longer context. Pure C, zero dependencies. Lossless KV cache compression + single-header library.

6K 395 43

turboquant-mlx-full

Extreme weight + KV cache compression for LLMs on Apple Silicon (MLX implementation of Google's TurboQuant)

4K 56 11

llm-autotune

Zero-config local LLM optimization for Ollama, LM Studio, and Apple Silicon MLX. Reduces TTFT by 40%, wall time for local agents by 46%, and RAM usage by 3x.

3K 31 3

bigsmall

Lossless AI model compression - ~34% smaller with bit-identical weights; the autopilot profiles your machine, picks the highest fidelity that runs, and streams models bigger than your RAM.

3K 1 0

nexusquant-kv

Training-free KV cache compression via E8 lattice quantization; fits longer context in the same VRAM

3K 20 0

lmcache-cli

LMCache: Supercharge Your LLM with the Fastest KV Cache Layer

3K 10K 1K

turboquant-vllm

TurboQuant KV cache compression plugin for vLLM — asymmetric K/V, 8 models validated, consumer GPUs

2K 48 6

turboquant

First open-source TurboQuant KV cache compression for LLM inference. Drop-in for HuggingFace. pip install turboquant.

2K 41 8

mlx-quant-fidelity

Measure MLX quantization quality loss — KL divergence, perplexity, top-token agreement for KV cache and weights

2K 1 0

processfork

git for AI agents — snapshot, fork, and merge live LLM sessions in 8 ms. Drop-in for Claude Code, LangGraph, vLLM, SGLang.

2K 2 0

thaw-vllm

The fork primitive for LLM inference. Snapshot a running session — weights + KV cache + scheduler state — and hydrate it into N divergent children that skip prefill. For RL rollouts, parallel coding agents, agent branching. Supports vLLM and SGLang.

2K 6 1

avp

Python SDK for Agent Vector Protocol – transfer KV-cache between LLM agents instead of text

1K 24 2

cappr

Completion After Prompt Probability. Make your LLM make a choice

1K 82 3

kakeyalattice

Discrete Kakeya cover for LLM KV cache: D4/E8 nested-lattice quantisation realising a Kakeya-style tube-cover over the direction sphere. 2.4x-2.8x compression at <1% perplexity loss on Qwen3, Llama-3, DeepSeek, GLM-4, Gemma. Drop-in transformers.DynamicCache. pip install kakeyalattice.

1K 9 2

tqai

TurboQuant KV cache compression for local LLM inference

936 0 0

fade-kv

Frequency-Adaptive Decay Encoding: Attention-aware tiered KV cache compression for LLM inference.

863 0 0

kvboost

Make local LLM inference faster with chunk-level KV cache reuse

777 25 0

semblend-vllm-connector

Out-of-tree vLLM KVConnector for SemBlend semantic KV donor discovery

505 0 0

kvfleet

Production-grade, KV-cache-aware intelligent routing for self-hosted and hybrid LLM fleets.

475 0 0

prismkv

3-D stacked-plane KV cache quantization + RAG framework. Defensive prior-art publication extending TurboQuant to conditional 3-D polar cells with adaptive bit allocation.

421 1 1

fused-turboquant

Fused Triton kernels for TurboQuant KV cache compression — 2-4 bit quantization with RHT rotation. Drop-in HuggingFace & vLLM integration. Up to 4.9x KV cache compression for Llama, Qwen, Mistral, and more.

398 8 1