Vision Transformer Python Packages

mlx-vlm

MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.

883K 5K 659

mmdet

OpenMMLab Detection Toolbox and Benchmark

353K 33K 10K

mmcls

OpenMMLab Pre-training Toolbox and Benchmark

26K 4K 1K

mmpretrain

OpenMMLab Pre-training Toolbox and Benchmark

21K 4K 1K

pix2tex

pix2tex: Using a ViT to convert images of equations into LaTeX code.

14K 16K 1K

towhee

Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.

2K 3K 261

attention-and-transformers

Transformers goes brrr... Attention and Transformers from scratch in TensorFlow. Currently contains Vision Transformers, MobileViT-v1, MobileViT-v2, and MobileViT-v3.

2K 14 2

mambavision

[CVPR 2025] Official PyTorch Implementation of MambaVision: A Hybrid Mamba-Transformer Vision Backbone

2K 2K 145

vision-transformers

Vision Transformers for image classification, image segmentation, and object detection.

2K 69 10

towhee-models

Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.

2K 3K 261

mist-medical

MIST: A simple and scalable end-to-end framework for 3D medical imaging segmentation.

2K 55 16

fastervit

[ICLR 2024] Official PyTorch implementation of FasterViT: Fast Vision Transformers with Hierarchical Attention

2K 919 69

thepipe-api

Get clean data from tricky documents, powered by vision-language models ⚡

1K 2K 99

conformal-clip

Few-shot CLIP classification with conformal prediction, probability calibration, and reliability metrics.

1K 0 0

clipq

A simple CLIP implementation that splits an image into quadrants and then gets the embeddings for each quadrant.

1K 7 1

image-classification-jax

Image classification in JAX with ViT, ResNet, CIFAR-10, CIFAR-100, Imagenette, and ImageNet

1K 3 0

efficientvit-gml

Efficient vision foundation models for high-resolution generation and perception.

1K 3K 253

supreme-unlearning

A registry-based, multi-GPU framework for reproducible image-unlearning evaluation.

834 5 0

pai-easycv

An all-in-one toolkit for computer vision

813 2K 226

autodistill-vit

ViT module for use with autodistill.

729 4 0

vit-flax

Implementation of Vision Transformers in Flax

726 18 2

deepvision-toolkit

PyTorch and TensorFlow/Keras image models with automatic weight conversions and equal API/implementations - Vision Transformer (ViT), ResNetV2, EfficientNetV2, NeRF, SegFormer, MixTransformer, (planned...) DeepLabV3+, ConvNeXtV2, YOLO, etc.

673 42 7

vitaminp

VitaminP: a vision transformer-assisted multimodal integration network for pathology cell segmentation

634 12 1

clipcap

Using pretrained encoder and language models to generate captions from multimedia inputs.

572 101 14