PyRank

Corpus Tools Python Packages

Python packages tagged with the GitHub topic corpus-tools, sorted by relevance. Each entry shows its stars and monthly downloads.
adbar
trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

8.9M 6K 371
adbar
simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

120K 199 15
flairNLP
fundus

A very simple news crawler with a funny name

8K 455 109
johentsch
ms3

A parser for annotated MuseScore 3 files.

4K 56 6
nickduran
align

Python library for extracting quantitative, reproducible metrics of multi-level alignment between speakers in naturalistic language corpora.

1K 54 17
Helsinki-NLP
opusfilter

Toolbox for filtering parallel corpora

913 115 26
opendatalab
mineru-html

MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean HTML bodies. Perfect for Deep Research Agents, RAG applications, and training data generation.

714 246 24
liao961120
concordancer

Extract concordance lines from corpus with CQL

649 19 3
ynop
audiomate

Audiomate is a library for working with audio datasets.

641 138 25
mshakirDr
mfte

MFTE (Multi Feature Tagger of English) Python is a Python version of Le Foll's MFTE, which was written in Perl. It is extended with semantic tags from Biber (2006) and Biber et al. (1999), along with other specific tags.

516 30 3
jonathandunn
corpus-similarity

Measure the similarity of text corpora for 74 languages

344 14 3
grammarly
ua-gec

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

300 270 23
rmalouf
treesearch-ud

High-performance toolkit for querying linguistic dependency parses

209 3 0
koskenni
betastr

An open source reimplementation of Benny Brodda's BETA in Python

194 63 2
edwardseley
lyricscorpora

An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and billboard charts

191 18 1
Data from PyPI, GitHub, ClickHouse, and BigQuery