PyRank

Training Data Python Packages

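Python packages tagged with the GitHub topic training-data, sorted by relevance. Each entry gives the repository owner, the package name, a short description, and a line of three counts, which appear to be monthly downloads, GitHub stars, and forks.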
a-maliarov
amazoncaptcha

Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.

78K 490 91
snorkel-team
snorkel

A system for quickly generating training data with weak supervision

69K 6K 854
ydataai
ydata-synthetic

Synthetic data generators for tabular and time-series data

9K 2K 260
alteryx
composeml

A machine learning tool for automated prediction engineering. It allows you to easily structure prediction problems and generate labels for supervised learning.

9K 511 50
sparkfish
augraphy

Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes

8K 536 61
timbo4u1
s2s-certify

Physics-audit engine for Physical AI. 8 biomechanical laws certified in real-time — safety gate for robot training pipelines and prosthetics. pip install s2s-certify

4K 3 0
KennethEnevoldsen
augmenty

An augmentation library based on spaCy for joint augmentation of text and labels.

2K 156 9
MoonyFringers
ladon-crawl

A Python framework for building structured, resumable web crawlers — designed for domains where data quality matters.

1K 5 1
NorskRegnesentral
skweak

skweak: A software toolkit for weak supervision applied to NLP tasks

973 926 77
Ahmad8864
autosynth

Agentic synthetic-data generation framework inspired by Meta FAIR's Autodata / Agentic Self-Instruct.

739 0 0
liuxiaotong
ai-dataset-radar

Multi-source async competitive intelligence engine for AI training data ecosystems with watermark-driven incremental scanning & anomaly detection. CLI + MCP ready.

676 2 1
Data-Centric-AI-Community
fg-data-synthetic

Synthetic data generators for tabular and time-series data

628 2K 260
liuxiaotong
knowlyr-datasynth

Seed-to-scale LLM synthetic data engine with auto-detected templates, schema validation & quality-diversity optimization. CLI + MCP ready.

504 1 0
kcieslik
eq-insar

Lightweight, physics-based forward model for generating synthetic InSAR deformation data from earthquake sources

473 10 4
ychampion
codeclaw

Export your Claude Code and Codex conversations to Hugging Face as structured training data

384 10 2
stef41
datacruxai

Lightweight CPU-only data quality toolkit for LLM instruction tuning datasets.

345 1 0
stef41
datamix

Dataset mixing & curriculum optimizer — profile, blend, schedule, and budget training data. Zero deps.

318 1 0
stef41
castwright

Generate synthetic instruction-tuning data from seed examples. Simple API, built-in quality filtering, multi-provider.

282 1 0
mockloop
mockloop-mcp

Intelligent Model Context Protocol (MCP) server for AI-assisted API development. Generate mock servers from OpenAPI specs with advanced logging, performance analytics, and server discovery. Optimized for AI development workflows with comprehensive testing insights and automated analysis.

251 16 6
mikl0s
lg3k

Log Generator 3000 - The best log generation engine, with LLM training integration and guides.

220 1 0
phanii9
tidbit

Capture anything into structured Markdown notes and training-ready JSONL.

163 6 0
mockloop
iflow-mcp-mockloop-mockloop-mcp

Intelligent Model Context Protocol (MCP) server for AI-assisted API development. Generate mock servers from OpenAPI specs with advanced logging, performance analytics, and server discovery. Optimized for AI development workflows with comprehensive testing insights and automated analysis.

155 16 6
abinashmeher999
srtvoiceext

A command line interface to combine text information from subtitles with voice data in the video.

142 19 5
qubasehq
qudata

A comprehensive LLM data processing system designed to transform raw multi-format data into high-quality training datasets optimized for Large Language Models.

126 1 0
Data from PyPI, GitHub, ClickHouse, and BigQuery