PyRank

PySpark Python Packages

Python packages with the GitHub topic pyspark. Sorted by relevance, with stars and monthly downloads.
narwhals-dev
narwhals

Lightweight and extensible compatibility layer between dataframe libraries!

86.5M 2K 193
graphframes
graphframes

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

3.2M 1K 268
MrPowers
chispa

PySpark test helper methods with beautiful error messages

3.1M 764 80
capitalone
datacompy

Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!

2.6M 643 160
graphframes
graphframes-py

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

2.2M 1K 268
Microsoft
synapseml

Simple and Distributed Machine Learning

2.1M 5K 861
ibis-project
ibis-framework

the portable Python dataframe library

1.9M 7K 722
jupyter-incubator
hdijupyterutils

Jupyter magics and kernels for working with remote Spark clusters

1.6M 1K 447
jupyter-incubator
autovizwidget

Jupyter magics and kernels for working with remote Spark clusters

1.6M 1K 447
jelmerk
pyspark-hnsw

Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs

1.3M 303 59
JohnSnowLabs
spark-nlp

State of the Art Natural Language Processing

1.1M 4K 743
Nike-Inc
koheesio

Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.

879K 650 39
MrPowers
quinn

pyspark methods to enhance developer productivity 📣 👯 🎉

615K 687 95
debugger24
pyspark-test

Testing library for PySpark, inspired by the pandas testing module, to help users write unit tests.

336K 21 5
zero323
pyspark-stubs

Apache (Py)Spark type annotations (stub files).

320K 118 36
CamDavidsonPilon
tdigest

t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark

305K 407 53
asuiu
sparkorm

ORM for Apache Spark and DataFrames schema manager

299K 16 3
databrickslabs
dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated or synthetic data sets for tests, POCs, and other uses in Databricks environments, including in Delta Live Tables pipelines.

287K 465 94
uber
petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.

273K 2K 284
crflynn
pbspark

protobuf pyspark conversion

241K 23 6
canimus
cuallee

Possibly the fastest DataFrame-agnostic quality check library in town.

103K 246 22
zalmane
copybook

Python copybook parser

97K 33 10
G-Research
pyspark-extension

A library that provides useful extensions to Apache Spark and PySpark.

91K 237 30
h2oai
h2o-pysparkling-3-1

Sparkling Water provides H2O functionality inside Spark cluster

79K 977 361
    • Data from PyPI, GitHub, ClickHouse, and BigQuery