PyRank

Tokenization Python Packages

Python packages with the GitHub topic tokenization. Sorted by relevance, with stars and monthly downloads.
explosion
spacy

💫 Industrial-strength Natural Language Processing (NLP) in Python

21.6M 34K 5K
WorksApplications
sudachipy

Sudachi in Rust 🦀 and new generation of SudachiPy

1.9M 444 53
AgentOps-AI
tokencost

Easy token price estimates for 400+ LLMs. TokenOps.

438K 2K 104
mysto
ff3

FPE - Format Preserving Encryption with FF3 in Python

265K 104 20
ScrapeGraphAI
toonify

Toonify: Compact data format reducing LLM token usage by 30-60%

121K 339 24
adbar
simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

120K 199 15
natasha
razdel

Rule-based token and sentence segmentation for the Russian language

102K 281 34
izikeros
count-tokens

Count tokens in a text file.

45K 13 0
OpenNMT
pyonmttok

Fast and customizable text tokenization library with BPE and SentencePiece support

44K 333 83
davidpirogov
toon-llm

Token-Oriented Object Notation (TOON) is an LLM-optimized data serialization format implemented in Python.

26K 9 3
lucidrains
h-net-dynamic-chunking

Implementation of the dynamic chunking mechanism in H-net by Hwang et al. of Carnegie Mellon

21K 76 2
OpenVoiceOS
quebra-frases

Chunks strings into byte-sized pieces

15K 1 3
nlpcloud
nlpcloud

NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keywords and keyphrases extraction, text generation, image generation, code generation, and more...

14K 86 8
vkcom
youtokentome

Unsupervised text tokenizer focused on computational efficiency

11K 978 108
PyThaiNLP
attacut

A Fast and Accurate Neural Thai Word Segmenter

8K 96 18
JpCurada
filipino-tokenizer

The first open-source, morphologically aware subword tokenizer for Philippine languages, combining structured linguistic affix data with BPE for language model pretraining.

6K 4 0
THUDM
icetk

A unified tokenization tool for Images, Chinese and English.

5K 153 17
explosion
spacy-streamlit

👑 spaCy building blocks and visualizers for Streamlit apps

4K 855 116
TI-Toolkit
tivars

A Python library for interacting with TI-(e)z80 (82/83/84 series) calculator files

4K 26 1
BitBadges
bitbadgespy-sdk

The most feature-rich tokenization standard ever built — TypeScript SDK for the BitBadges tokenization Cosmos SDK module

3K 0 0
kemingy
plane

A text processing tool that includes tag (HTML, URL, email) extraction and removal, punctuation normalization, simple segmentation, and so on.

3K 11 2
cbaziotis
ekphrasis

Ekphrasis is a text processing tool geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and 330 million English tweets from Twitter).

2K 675 94
rosette-api
rosette-api

Babel Street Analytics Client Library for Python

2K 38 37
cedricrupb
code-tokenize

Fast tokenization and structural analysis of any programming language

2K 62 10
Data from PyPI, GitHub, ClickHouse, and BigQuery