PyRank

Data Preparation Python Packages

Python packages with the GitHub topic data-preparation. Sorted by relevance, with stars and monthly downloads.
skrub-data
skrub

Machine learning with dataframes

181K 2K 218
ironmussa
optimuspyspark

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

4K 2K 232
amphi-ai
jupyterlab-amphi

visual data prep powered by python

2K 1K 105
snowmuffin
convmerge

Merge heterogeneous chat/text sources into a single LLM training format (JSONL)

2K 0 1
sisinflab
datarec-lib

A Python Library for Standardized and Reproducible Data Management in Recommender Systems

2K 20 1
johanneskasser
hdsemg-select

A graphical user interface (GUI) application for selecting and analyzing HDsEMG channels. This tool helps identify and exclude faulty channels (e.g., due to electrode misplacement, corrosion or noisy channels) from HDsEMG recordings.

1K 2 0
tracebloc
tracebloc-ingestor

tracebloc data pipeline for training/test dataset setup

1K 8 0
amphi-ai
amphi-scheduler

Amphi Scheduler (JupyterLab extension + Python backend)

955 1K 105
sisinflab
datarec

Standardized & reproducible data management for recommender systems.

769 20 1
hi-primus
pyoptimus

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

620 2K 233
CyberCRI
refinedoc

Python library for extracting headers, footers and body from PDF

403 26 3
kozodoi
dptools

Data Preprocessing Tools

349 5 3
Florian-Katerndahl
forestiler

Create Image Tiles From Large Input Rasters According to a Classified Mask Vector File

319 0 0
developmentseed
label-maker

Data Preparation for Satellite Machine Learning

310 469 107
ixlan
machine-learning-data-pipeline

Pipeline module for parallel real-time data processing for machine learning models development and production purposes.

285 22 2
dataclr
dataclr

Feature selection for tabular datasets using advanced filter and wrapper methods

276 20 2
ved93
ml-express

A Python library for day-to-day data analysis and machine learning. It aims to make data building, cleaning, and machine learning much faster. A library of extension and helper modules for Python's data analysis and machine learning libraries.

218 3 1
asavinov
prosto

Data processing toolkit radically changing the way data is processed

180 93 5
NVIDIA
invisible-rabbit

Scalable data preprocessing and curation toolkit for LLMs

176 2K 267
maksymsur
spltr

`Spltr` is a simple PyTorch-based data loader and splitter. It can load arrays, matrices, Pandas DataFrames, or CSV files containing numerical data and then split them into train and test (validation) subsets in the form of PyTorch DataLoader objects.

161 1 0
alihanozz
daxpy

A pre-machine-learning model package

101 0 0
NVIDIA
invisible-unicorn

Scalable data preprocessing and curation toolkit for LLMs

85 2K 267
NVIDIA
lava-ray

Scalable data preprocessing and curation toolkit for LLMs

1 2K 267
    • Data from PyPI, GitHub, ClickHouse, and BigQuery