PyRank

Data Mining Python Packages

Python packages with the GitHub topic data-mining. Sorted by relevance, with stars and monthly downloads.
microsoft
lightgbm

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

16.2M 18K 4K
catboost
catboost

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

6.3M 9K 1K
RaRe-Technologies
gensim

Topic Modelling for Humans

4.9M 16K 4K
jaidedai
easyocr

Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

3M 29K 4K
yzhao062
pyod

A Python library for anomaly detection across tabular, time series, graph, text, and image data. 60+ detectors, benchmark-backed ADEngine orchestration, and an agentic workflow for AI agents.

2.7M 10K 1K
sktime
sktime

A unified framework for machine learning with time series

1.3M 10K 2K
hackingmaterials
matminer

Data mining for materials science

1.2M 593 211
rasbt
mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.

1.1M 5K 905
alan-turing-institute
clevercsv

CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.

442K 1K 81
deanmalmgren
textract

extract text from any document. no muss. no fuss.

408K 5K 678
barrust
pyprobables

Probabilistic data structures in Python. Documentation: http://pyprobables.readthedocs.io/en/latest/index.html

223K 123 12
WenjieDu
tsdb

A Python toolbox that loads 173 public time series datasets for machine/deep learning with a single line of code. Datasets come from multiple domains, including healthcare, finance, power, traffic, weather, etc.

126K 235 22
WenjieDu
pypots

A Python toolkit/library for reality-centric machine/deep learning & data mining on partially-observed time series, with 50+ SOTA neural network models for scientific analysis tasks (imputation, classification, clustering, forecasting, anomaly detection, cleaning) on incomplete industrial irregularly-sampled multivariate TS with NaN missing values

123K 2K 184
WenjieDu
pygrinder

PyGrinder: a Python toolkit for grinding data beans into the incomplete for real-world data simulation by introducing missing values with different missingness patterns, including MCAR (missing completely at random), MAR (missing at random), MNAR (missing not at random), subsequence missing, and block missing.

121K 66 6
process-intelligence-solutions
pm4py

Official public repository for PM4Py (Process Mining for Python) — an open-source library for exploring, analyzing, and optimizing business processes with Python.

101K 956 349
biolab
orange3

🍊 📊 💡 Orange: Interactive data analysis

100K 6K 1K
K0lb3
unitypy

UnityPy is a Python module that makes it possible to extract/unpack and edit Unity assets.

91K 1K 182
chuanconggao
prefixspan

The shortest yet efficient Python implementation of the sequential pattern mining algorithm PrefixSpan, closed sequential pattern mining algorithm BIDE, and generator sequential pattern mining algorithm FEAT.

79K 426 93
aeon-toolkit
aeon

A toolkit for time series machine learning and deep learning

62K 1K 268
KyleKing
textract-py3

Maintained fork of deanmalmgren/textract to replace '*' dependencies and other updates

56K 14 2
tommyod
efficient-apriori

An efficient Python implementation of the Apriori algorithm.

50K 347 61
catboost
catboost-dev

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

49K 9K 1K
tommyod
paretoset

Compute the Pareto (non-dominated) set, i.e., skyline operator/query.

38K 70 5
exaxorg
accelerator

The Accelerator is a tool for fast and reproducible processing of eBay-scale datasets on a single computer.

38K 4 2
Data from PyPI, GitHub, ClickHouse, and BigQuery.