vLLM Python Packages

smg-grpc-proto

Engine-agnostic LLM gateway in Rust. Full OpenAI & Anthropic API compatibility across vLLM, TRT-LLM, TokenSpeed, SGLang, OpenAI, Gemini & more. Industry-first gRPC pipeline, KV cache-aware routing, chat history, tokenization caching, Responses API, embeddings, WASM plugins, MCP, and multi-tenant auth.

1.2M 376 111

smg-grpc-servicer

887K 376 111

funasr

Industrial-grade speech recognition toolkit: 170x realtime, 50+ languages, speaker diarization, emotion detection, streaming, and OpenAI-compatible API.

422K 19K 2K

vllm-cpu

Wheels & Docker images for running vLLM on CPU-only systems, optimized for different CPU instruction sets

183K 8 0

mooncake-transfer-engine

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

176K 6K 926

kserve

Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes

123K 6K 2K

mooncake-transfer-engine-cuda13

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

79K 6K 926

lmcache

LMCache: Supercharge Your LLM with the Fastest KV Cache Layer

77K 10K 1K

auto-round

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

76K 2K 149

ai-dynamo-runtime

A Datacenter Scale Distributed Inference Serving Framework

50K 7K 1K

conch-triton-kernels

A "standard library" of Triton kernels.

42K 26 3

flama

The production framework for Predictive and Generative AI. Serve any model as an API in one line, with OpenAI/Anthropic/Ollama-compatible endpoints, a built-in chat UI, and native MCP.

38K 291 17

ai-dynamo

A Datacenter Scale Distributed Inference Serving Framework

38K 7K 1K

gptqmodel

LLM model quantization (compression) toolkit with HW acceleration support for Nvidia, AMD, Intel GPU and Intel/AMD/Apple CPU via HF, vLLM, and SGLang.

34K 1K 191

xinference

Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source, speech, and multimodal models on cloud, on-prem, or your laptop — all through one unified, production-ready inference API.

28K 9K 844

auto-round-lib

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

23K 2K 149

tokenspeed-smg-grpc-proto

22K 376 111

tokenspeed-smg-grpc-servicer

22K 376 111

terradev-cli

An imperative command-line interface for AI workload orchestration

20K 21 3

auto-round-nightly

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

19K 2K 149

tokenspeed-mooncake

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

12K 6K 926

tardigrade-db

LLM-native database kernel — persistent KV cache memory for autonomous AI agents

12K 0 1

funasr-onnx

Industrial-grade speech recognition toolkit: 170x realtime, 50+ languages, speaker diarization, emotion detection, streaming, and OpenAI-compatible API.

9K 19K 2K

ironclad-ai

Reliability for LLM agents through enforcement, not model size — Agent-Contract-Kernel + a fail-closed orchestration engine. Model-agnostic. 🇦🇪 Built in the UAE.

9K 3 0