PyRank

Big Data Python Packages

Python packages tagged with the GitHub topic big-data, sorted by relevance and listed with their stars and monthly downloads.
cython
cython

The most widely used Python to C compiler

122.5M 11K 2K
apache
pyspark

Apache Spark - A unified analytics engine for large-scale data processing

51.2M 43K 29K
delta-io
delta-spark

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

36M 9K 2K
catboost
catboost

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

6.4M 9K 1K
graphframes
graphframes

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

3.2M 1K 268
graphframes
graphframes-py

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

2.2M 1K 268
Microsoft
synapseml

Simple and Distributed Machine Learning

2.1M 5K 861
apache
pyspark-client

Apache Spark - A unified analytics engine for large-scale data processing

1.7M 43K 29K
delta-io
delta-sharing

An open protocol for secure data sharing

1.4M 941 227
databricks
koalas

Koalas: pandas API on Apache Spark

1.4M 3K 370
scikit-hep
uproot

ROOT I/O in pure Python and NumPy.

1.3M 267 94
Eventual-Inc
daft

High-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale

810K 5K 474
nipy
nipype

Workflows and interfaces for neuroimaging packages

695K 825 545
starrocks
starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.

554K 12K 2K
feast-dev
feast

The Open Source Feature Store for AI/ML

474K 7K 1K
man-group
arcticdb

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.

282K 2K 181
scikit-hep
awkward0

Manipulate arrays of complex data structures as easily as NumPy.

262K 214 39
scikit-hep
uproot3

ROOT I/O in pure Python and NumPy.

262K 313 66
h2oai
h2o

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

235K 7K 2K
uxlfoundation
scikit-learn-intelex

Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application

186K 1K 186
lithops-cloud
lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀

181K 365 122
uxlfoundation
daal

oneAPI Data Analytics Library (oneDAL)

181K 646 224
IntelPython
daal4py

Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application

140K 1K 185
hazelcast
hazelcast-python-client

Hazelcast Python Client

116K 116 78
Data from PyPI, GitHub, ClickHouse, and BigQuery