LLM Inference Python Packages

Each entry gives the package name, its description, and three counts, which appear to be downloads, GitHub stars, and GitHub forks, in that order.

ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

60.7M 43K 8K

flashinfer-python

FlashInfer: Kernel Library for LLM Serving

4.5M 6K 1K

flashinfer-cubin

FlashInfer: Kernel Library for LLM Serving

3M 6K 1K

openvino

OpenVINO™ is an open source toolkit for optimizing and deploying AI inference

1.5M 10K 3K

bentoml

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

213K 9K 980

vllm-cpu

Wheels & Docker images for running vLLM on CPU-only systems, optimized for different CPU instruction sets

183K 8 0

openvino-dev

OpenVINO™ is an open source toolkit for optimizing and deploying AI inference

172K 10K 3K

kserve

Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes

123K 6K 2K

gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.

64K 77K 8K

ai-dynamo-runtime

A Datacenter Scale Distributed Inference Serving Framework

50K 7K 1K

yunchang

USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference

45K 678 80

ai-dynamo

A Datacenter Scale Distributed Inference Serving Framework

38K 7K 1K

ant-ray-cpp-nightly

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

31K 43K 8K

litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.

24K 13K 1K

prompt-poet

Streamlines and simplifies prompt design for both developers and non-technical users with a low-code approach.

20K 1K 95

terradev-cli

An imperative command-line interface for AI workload orchestration

20K 21 3

ray-cpp

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

18K 43K 8K

optillm

Optimizing inference proxy for LLMs

10K 4K 368

ant-ray-nightly

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

9K 43K 8K

openllm

Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud.

9K 12K 820

monocle-apptrace

Monocle is a framework for tracing GenAI app code. This repo contains the implementation of Monocle for GenAI apps written in Python.

7K 266 45

lorax-client

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

6K 4K 322

intel-extension-for-transformers

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel platforms ⚡

6K 2K 216

quantcpp

LLM inference with 7x longer context. Pure C, zero dependencies. Lossless KV cache compression + single-header library.

6K 395 43