PyRank

Data Curation Python Packages

Python packages tagged with the GitHub topic data-curation, sorted by relevance and listed with GitHub stars and monthly downloads.
voxel51
fiftyone-db

Refine high-quality datasets and visual AI models

170K 11K 761
voxel51
fiftyone

Refine high-quality datasets and visual AI models

136K 11K 761
cleanlab
cleanlab

Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

62K 11K 893
visualdatabase
fastdup

fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.

7K 2K 87
cleanlab
cleanlab-studio

Client interface for all things Cleanlab Studio

4K 32 10
voxel51
fiftyone-db-ubuntu2204

Refine high-quality datasets and visual AI models

3K 11K 761
KenObata
distributed-curator

Partition-aware MinHash LSH deduplication for large-scale text data curation on Apache Spark

2K 1 0
Digital-Dermatology
selfclean

A holistic self-supervised data cleaning strategy to detect off-topic samples, near duplicates and label errors.

2K 37 2
voxel51
fiftyone-desktop

Refine high-quality datasets and visual AI models

1K 11K 761
Renumics
sliceguard

A library for detecting problematic data segments in structured and unstructured data with few lines of code.

1K 63 3
TieuLongPhan
synrbl

Rebalancing chemical reaction

1K 29 2
PennLINC
cubids

Curation of BIDS (CuBIDS): A sanity-preserving software package for processing BIDS datasets.

711 30 13
cleanlab
cleanlab-cli

Command line interface for all things Cleanlab Studio

647 32 10
voxel51
fiftyone-db-ubuntu2004

Refine high-quality datasets and visual AI models

482 11K 761
cleanlab
example-package-elisno

The standard package for data-centric AI, machine learning with label errors, and automatically finding and fixing dataset issues in Python.

344 11K 893
UAL-RE
ldcoolp-figshare

Python tool using the Figshare API for data curation

265 3 1
Docta-ai
docta-ai

Docta.ai

231 3K 256
aminnaghdloo
annotate-ez

High-throughput curation and visualization of large-scale single-cell microscopy images, in a lightweight GUI.

196 1 0
voxel51
fiftyone-db-debian9

Refine high-quality datasets and visual AI models

189 11K 761
NVIDIA
invisible-rabbit

Scalable data preprocessing and curation toolkit for LLMs

176 2K 267
bluestero
urlgenie

Python package to make URL extraction, generalization, validation, and filtration easy.

151 4 1
voxel51
fiftyone-db-rhel7

Refine high-quality datasets and visual AI models

144 11K 761
voxel51
fiftyone-db-ubuntu1604

Refine high-quality datasets and visual AI models

143 11K 761
pennlinc
cubids-bond-fork

BIDS On Disk Editor

132 30 13
Data from PyPI, GitHub, ClickHouse, and BigQuery