Benchmark Python Packages

swebench

SWE-bench: Can Language Models Resolve Real-world GitHub Issues?

47.7M 5K 908

pytest-benchmark

pytest fixture for benchmarking code

15.6M 1K 133

mteb

MTEB: State-of-the-art evaluation of embeddings across languages and modalities

3.1M 3K 632

pytest-harvest

Store data created during your `pytest` test execution and retrieve it at the end of the session, e.g. for applicative benchmarking purposes.

642K 76 10

fjsplib

Python package to read and write instances for the flexible job shop problem.

545K 9 0

asv

Airspeed Velocity: A simple Python benchmarking tool with web-based reporting

467K 1K 207

evo

Python package for the evaluation of odometry and SLAM

240K 4K 792

motmetrics

Benchmark multiple object trackers (MOT) in Python

138K 1K 261

membrowse

Track and analyze binary size and memory footprint in embedded firmware

107K 27 1

agentdojo

A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents.

76K 646 163

google-benchmark

A microbenchmark support library

53K 10K 2K

evalplus

Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024

47K 2K 201

chessboard

CLI to solve combinatoric chess puzzles.

45K 8 4

beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.

43K 2K 247

picows

Ultra-fast websocket client and server for asyncio

43K 283 18

mmpose

OpenMMLab Pose Estimation Toolbox and Benchmark.

40K 8K 2K

optunahub

Python library to use and implement packages in OptunaHub

33K 57 14

medmnist

[pip install medmnist] 18x Standardized Datasets for 2D and 3D Biomedical Image Classification

32K 1K 212

pytest-django-queries

Generate performance reports from your Django database performance tests.

30K 84 2

holobench

A package for benchmarking the characteristics of arbitrary functions

26K 4 3

pyperformance

Python Performance Benchmark Suite

23K 1K 204

apebench

[NeurIPS 2024] A benchmark suite for autoregressive neural emulation of PDEs. (≥46 PDEs in 1D, 2D, 3D; Differentiable Physics; Unrolled Training; Rollout Metrics)

19K 103 2

lmms-eval

One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

18K 4K 611

causal-worlds

Generate fictional-but-coherent causal operations worlds (executable sim + time-series + ground-truth causal answer-key) from a natural-language description — for benchmarking causal-discovery and control agents.

17K 4 0