PyRank

Data Processing Python Packages

Python packages with the GitHub topic data-processing. Sorted by relevance, with stars and monthly downloads.
pandera-dev
pandera

A light-weight, flexible, and expressive statistical data testing library

8.9M 4K 397
lithops-cloud
lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀

181K 365 122
svenkreiss
pysparkling

A pure Python implementation of Apache Spark's RDD and DStream interfaces.

160K 270 45
NVIDIA
nvidia-nvimgcodec-cu12

The nvImageCodec library of GPU- and CPU-accelerated codecs featuring a unified interface

82K 149 16
datachain-ai
datachain

The Context Layer for unstructured data: typed, versioned datasets over S3, GCS, Azure

43K 3K 143
bytewax
bytewax

Python Stream Processing

42K 2K 109
run-house
kubetorch

Distribute and run AI workloads on Kubernetes magically in Python, like PyTorch for ML infra.

34K 1K 58
run-house
runhouse

Distribute and run AI workloads on Kubernetes magically in Python, like PyTorch for ML infra.

29K 1K 58
NVIDIA
nvidia-dali-cuda120

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.

28K 6K 665
allenai
dolma

Data and tools for generating and inspecting OLMo pre-training data.

25K 1K 189
python-bonobo
bonobo

Extract Transform Load for Python 3.5+

25K 2K 145
matthewdeanmartin
untruncate-json

Python library to repair truncated JSON. Translated directly from the original TypeScript version.

16K 5 0
crate
cratedb-toolkit

CrateDB Toolkit, an SDK for CrateDB and CrateDB Cloud.

15K 11 4
wq
itertable

⇔ IterTable is a Pythonic API for iterating through tabular data formats, including CSV, XLSX, XML, and JSON.

13K 53 11
pathwaycom
pathway

Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.

12K 63K 2K
NVIDIA
nvidia-nvimgcodec-cu11

The nvImageCodec library of GPU- and CPU-accelerated codecs featuring a unified interface

12K 149 16
NVIDIA
nvidia-nvimgcodec-cu13

The nvImageCodec library of GPU- and CPU-accelerated codecs featuring a unified interface

9K 149 16
tandav
pipe21

Simple functional pipes for python

7K 19 0
LiberTEM
libertem

Open pixelated STEM framework

7K 124 73
Elijas
redis-message-queue

Reliable Python message queuing with Redis and built-in deduplication. Publish once, process once, and recover from crashes across any number of producers and consumers.

6K 5 1
datajuicer
py-data-juicer

Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

5K 6K 371
polyaxon
haupt

Lineage metadata API, artifacts streams, sandbox, API, and spaces for Polyaxon

5K 452 207
kmatarese
glide

Easy ETL

4K 17 2
NVIDIA
nvidia-dali-cuda110

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.

4K 6K 665
    • Data from PyPI, GitHub, ClickHouse, and BigQuery