PyRank

Deduplication Python Packages

Python packages tagged with the GitHub topic "deduplication", sorted by relevance and shown with stars and monthly downloads.
J535D165
recordlinkage

A powerful and modular toolkit for record linkage and duplicate detection in Python

4.6M 1K 153
moj-analytical-services
splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

782K 2K 236
MinishLab
semhash

Fast Multimodal Semantic Deduplication & Filtering

73K 924 56
borgbackup
borgbackup

Deduplicating archiver with compression and authenticated encryption.

70K 13K 846
iscc
fastcdc

FastCDC implementation in Python https://pypi.org/project/fastcdc/

56K 65 17
opensanctions
nomenklatura

Framework and command-line tools for integrating FollowTheMoney data streams from multiple sources

37K 242 44
benzsevern
goldenmatch

Polyglot entity-resolution + data-quality toolkit. Zero-config auto-config (negative-evidence + Path Y) hits DQbench composite 91.04 (T3 53.8% → 85.5%). Holds 0.96 DBLP-ACM, 0.94 Febrl3, 0.97 NCVR. GoldenCheck → GoldenFlow → GoldenMatch → GoldenPipe. MCP per package, multi-arch containers, Airflow DAGs, browser workbench.

7K 44 6
Elijas
redis-message-queue

Reliable Python message queuing with Redis and built-in deduplication. Publish once, process once, recover from crashes - across any number of producers and consumers.

6K 5 1
zinggAI
zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML

5K 1K 168
DBRetina
dbretina

DBRetina Python Package

4K 1 3
jolovicdev
cashet

A Python memoization cache with Redis, async support, and an HTTP server. Cache Python function results like git objects. Content-addressable, pipeline-friendly, and CLI-inspectable. Run once, reuse forever.

4K 0 0
Fallen-Breath
pyfastcdc

A high-performance FastCDC 2020 implementation written in Python + Cython

4K 2 0
LibreTranslate
removedup

Remove duplicates from parallel corpora

4K 7 1
lumen-argus
crossfire-rules

Regex rule overlap analyzer for DLP, secret scanning, SAST, and IDS tools

2K 0 0
fritshermans
deduplipy

End-to-end deduplication solution

2K 82 8
AI-team-UoA
pyjedai

An open-source library that leverages Python’s data science ecosystem to build powerful end-to-end Entity Resolution workflows.

2K 94 13
KenObata
distributed-curator

Partition-aware MinHash LSH deduplication for large-scale text data curation on Apache Spark

2K 1 0
benzsevern
goldenpipe

Polyglot entity-resolution + data-quality toolkit. Zero-config auto-config (negative-evidence + Path Y) hits DQbench composite 91.04 (T3 53.8% → 85.5%). Holds 0.96 DBLP-ACM, 0.94 Febrl3, 0.97 NCVR. GoldenCheck → GoldenFlow → GoldenMatch → GoldenPipe. MCP per package, multi-arch containers, Airflow DAGs, browser workbench.

2K 0 0
AshleyT3
atbu-pkg

ATBU package supports local/cloud backup/restore as well as local file integrity diff tool for helping in efforts to manage file integrity, duplication, and bitrot detection.

2K 1 0
benzsevern
goldenflow

Polyglot entity-resolution + data-quality toolkit. Zero-config auto-config (negative-evidence + Path Y) hits DQbench composite 91.04 (T3 53.8% → 85.5%). Holds 0.96 DBLP-ACM, 0.94 Febrl3, 0.97 NCVR. GoldenCheck → GoldenFlow → GoldenMatch → GoldenPipe. MCP per package, multi-arch containers, Airflow DAGs, browser workbench.

2K 1 0
zevatov
nra

Neural Ready Archive — stream, deduplicate, and train on ML datasets without downloading. Built in Rust.

2K - -
vaultah
replicat

Configurable and lightweight backup utility with deduplication and encryption

2K 5 0
NickCrews
mismo

The SQL/Ibis powered sklearn of record linkage.

1K 23 4
cobanov
semaclust

Clustering similar strings using sentence embeddings and agglomerative clustering

1K 5 0
Data from PyPI, GitHub, ClickHouse, and BigQuery