Inference Optimization Python Packages

tessera-hypernetwork

Generate per-session LoRA adapters in <1s for agentic inference efficiency.

26K 4 2

krasis

Krasis is a hybrid LLM runtime focused on running larger models efficiently on consumer-grade, VRAM-limited hardware.

6K 477 27

llm-autotune

Zero-config local LLM optimization for Ollama, LM Studio, and Apple Silicon MLX. Reduces TTFT by 40%, wall time for local agents by 46%, and RAM usage by 3x.

3K 31 3

turboquant-vllm

TurboQuant KV cache compression plugin for vLLM — asymmetric K/V, 8 models validated, consumer GPUs

2K 48 6

flashspec

Adaptive speculative-decoding inference engine with Triton-optimised verification and online bandit draft selection.

2K 11 0

sollol

Intelligent Load Balancer for Ollama Clusters - Adaptive Parallelism + VRAM-Aware GPU Routing + Ray/Dask/llama.cpp Distribution

1K 5 2

contextpilot

Accelerating Long Context LLM Inference with Accuracy-Preserving Context Optimization in SGLang, vLLM, llama.cpp, OpenClaw, RAG, and Agentic AI.

1K 120 5

wildedge-sdk

ML inference monitoring for Python: on-device and remote models

668 14 1

rabbitllm

Run 70B+ LLMs on a single 4GB GPU — no quantization required.

494 69 12

torch-quant

BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.

459 931 169

supercompress

SuperCompress — learned context compression for LLMs.

403 2 0

thinkrouter

Cut LLM reasoning-token costs by 60% with one line of code.

290 3 0

kito

Keras inference time optimizer

271 157 18

llm-autobatch

Turn single LLM calls into fast micro-batches. Rust core, Python API.

226 5 0

pavo-bench

PAVO-Bench: 50K-turn voice pipeline benchmark and 85K-param meta-controller for ASR->LLM->TTS routing.

157 1 0