PyRank

LLM Serving Python Packages

Python packages tagged with the GitHub topic llm-serving, sorted by relevance. Each entry shows its GitHub stars and monthly downloads.
ray-project
ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

56.1M 43K 8K
vllm-project
vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

6.2M 81K 17K
skypilot-org
skypilot

Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, Slurm, 20+ clouds, on-prem).

1.8M 10K 1K
skypilot-org
skypilot-nightly

Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, Slurm, 20+ clouds, on-prem).

407K 10K 1K
bentoml
bentoml

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

186K 9K 965
vllm-project
vllm-tpu

A high-throughput and memory-efficient inference and serving engine for LLMs

170K 81K 17K
ray-project
ant-ray-cpp-nightly

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

43K 43K 8K
ray-project
ray-cpp

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

42K 43K 8K
skypilot-org
trainy-skypilot-nightly

Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, Slurm, 20+ clouds, on-prem).

17K 10K 1K
NVIDIA
tensorrt-llm

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.

16K 14K 2K
mosecorg
mosec

A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to fully exploit your compute machine

12K 900 72
bentoml
openllm

Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud.

11K 12K 811
predibase
lorax-client

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

8K 4K 312
vllm-project
vllm-ascend

Community maintained hardware plugin for vLLM on Ascend

7K 2K 1K
ray-project
ant-ray-nightly

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

7K 43K 8K
torchpipe
omniback

Serving Inside PyTorch

5K 170 13
friendliai
friendli-client

Friendli Suite Client

4K 50 7
ray-project
ant-ray

Ray provides a simple, universal API for building distributed applications.

4K 43K 8K
bentoml
openllm-core

Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud.

4K 12K 811
bentoml
openllm-client

Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud.

2K 12K 811
superduper-io
superduper-sentence-transformers

Superduper: End-to-end framework for building custom AI applications and agents.

2K 5K 538
superduper-io
superduper-framework

Superduper: End-to-end framework for building custom AI applications and agents.

2K 5K 538
PaddlePaddle
fastdeploy-python

High-performance Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle

2K 4K 744
superduper-io
superduper-transformers

Superduper: End-to-end framework for building custom AI applications and agents.

1K 5K 538
Data from PyPI, GitHub, ClickHouse, and BigQuery