MoE Python Packages

sglang

SGLang is a high-performance serving framework for large language models and multimodal models.

461.1M 30K 7K

vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

5.8M 85K 19K

flashinfer-python

FlashInfer: Kernel Library for LLM Serving

4.5M 6K 1K

nvidia-cudnn-frontend

cuDNN Frontend is NVIDIA's modern, open-source entry point to the cuDNN library and a growing collection of high-performance open-source kernels.

3.4M 862 197

flashinfer-cubin

FlashInfer: Kernel Library for LLM Serving

3M 6K 1K

llamafactory

Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)

863K 73K 9K

sglang-kernel

SGLang is a high-performance serving framework for large language models and multimodal models.

468K 30K 7K

sgl-kernel

SGLang is a high-performance serving framework for large language models and multimodal models.

280K 30K 7K

ms-swift

Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3.6, DeepSeek-V4, GLM-5.1, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Gemma4, Llava, Phi4, ...) (AAAI 2025).

104K 15K 2K

vllm-tpu

A high-throughput and memory-efficient inference and serving engine for LLMs

58K 85K 19K

vllm-cpu-nightly

A high-throughput and memory-efficient inference and serving engine for LLMs

16K 85K 19K

tensorrt-llm

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.

14K 14K 3K

sawyer-core

Distributed MoE inference network — the load is split, friends help.

7K 0 0

sglang-kt

SGLang is a high-performance serving framework for large language models and multimodal models.

3K 30K 7K

awex

A high-performance RL training-inference weight synchronization framework, designed to enable second-level parameter updates from training to inference in RL workflows

3K 160 18

switch-transformers

Implementation of Switch Transformers from the paper: "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity"

2K 142 18

llmtuner

Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)

2K 73K 9K

abliterix

Automated alignment adjustment for LLMs — direct steering, LoRA, and MoE expert-granular abliteration, optimized via multi-objective Optuna TPE.

2K 172 35

uccl

UCCL: Ultra and Unified CCL

1K 1K 159

thekaveh-nnx

Lightweight PyTorch toolkit for training, fine-tuning, and exporting modern neural nets. FFN + GNN + decoder-only LM + diffusion + JEPA + MoE; PEFT (LoRA/DoRA/IA3/Prefix/Prompt), PTQ/QAT, pruning, surgery; ONNX/GGUF/Ollama/HuggingFace Hub interop. Dataclass-configured runs with fluent builders, automatic checkpointing, and Plotly viz.

1K 2 0

lsglang

SGLang is a fast serving framework for large language models and vision language models.

1K 30K 7K

mlx-flash

Run AI models too large for your Mac's memory — at near-full speed. Intelligent expert caching, speculative execution, and 15+ research techniques for MoE inference on Apple Silicon.

771 5 0

lazyllm-llamafactory

Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)

768 73K 9K

ai-dynamo-vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

606 85K 19K