LLM Serving Python Packages

ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

60.7M 43K 8K

vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

5.8M 86K 19K

skypilot

Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, Slurm, 20+ clouds, on-prem).

1.3M 10K 1K

bentoml

The easiest way to serve AI apps and models: build model inference APIs, job queues, LLM apps, multi-model pipelines, and more!

213K 9K 980

skypilot-nightly

Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, Slurm, 20+ clouds, on-prem).

151K 10K 1K

vllm-tpu

A high-throughput and memory-efficient inference and serving engine for LLMs

58K 86K 19K

mosec

A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to fully exploit your compute machine

39K 901 73

flama

The production framework for Predictive and Generative AI. Serve any model as an API in one line, with OpenAI/Anthropic/Ollama-compatible endpoints, a built-in chat UI, and native MCP.

38K 291 17

ant-ray-cpp-nightly

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

31K 43K 8K

ray-cpp

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

18K 43K 8K

vllm-cpu-nightly

A high-throughput and memory-efficient inference and serving engine for LLMs

16K 86K 19K

tensorrt-llm

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.

14K 14K 3K

ant-ray-nightly

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

9K 43K 8K

openllm

Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud.

9K 12K 820

vllm-ascend

Community-maintained hardware plugin for vLLM on Ascend

8K 2K 2K

lorax-client

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

6K 4K 322

openllm-core

Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud.

5K 12K 820

friendli-client

Friendli Suite Client

5K 50 7

trainy-skypilot-nightly

SkyPilot: An intercloud broker for the clouds

4K 10K 1K

ant-ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

4K 43K 8K

openllm-client

Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud.

3K 12K 820

omniback

Serving Inside PyTorch

2K 169 12

starlette-api

The production framework for Predictive and Generative AI. Serve any model as an API in one line, with OpenAI/Anthropic/Ollama-compatible endpoints, a built-in chat UI, and native MCP.

2K 291 17

superduper-sentence-transformers

Superduper: End-to-end framework for building custom AI applications and agents.

2K 5K 542