PyRank

Tokenizer Python Packages

Python packages with the GitHub topic tokenizer. Sorted by relevance, with stars and monthly downloads.
tusharsadhwani
pytokens

A fast, spec-compliant Python 3.14+ tokenizer that runs on older Pythons.

76.9M 4 7
hplt-project
sacremoses

Python port of Moses tokenizer, truecaser and normalizer

2.7M 495 59
polm
fugashi

A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.

591K 520 39
taishi-i
nagisa

A Japanese tokenizer based on recurrent neural networks

226K 417 23
lovit
soynlp

A Python library for Korean natural language processing. Provides word extraction, tokenization, part-of-speech tagging, and preprocessing.

155K 983 183
adbar
simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

120K 199 15
berkmancenter
sentence-splitter

Text-to-sentence splitter using a heuristic algorithm by Philipp Koehn and Josh Schroeder.

62K 258 34
mideind
tokenizer

A tokenizer for Icelandic text.

54K 30 8
natasha
natasha

Solves basic Russian NLP tasks, API for lower level Natasha projects

49K 1K 116
izikeros
count-tokens

Count tokens in a text file.

45K 13 0
OpenNMT
pyonmttok

Fast and customizable text tokenization library with BPE and SentencePiece support

44K 333 83
ngocjr7
sctokenizer

A Source Code Tokenizer

39K 13 6
ModelCloud
tokenicer

A (nicer) tokenizer you want to use for model inference and training, with all known preventable gotchas normalized or auto-fixed.

27K 11 4
davidpirogov
toon-llm

Token-Oriented Object Notation (TOON) is an LLM-optimized data serialization format implemented in Python.

26K 9 3
OpenPecha
botok

🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python

24K 80 15
OpenVoiceOS
quebra-frases

Chunks strings into byte-sized pieces

15K 1 3
PyThaiNLP
nlpo3

Thai natural language processing library in Rust, with Python and Node bindings.

12K 46 13
roshan-research
hazm

Persian NLP Toolkit

11K 1K 206
cereja-project
cereja

Cereja is a bundle of useful functions we don't want to rewrite and .. just pure fun!

10K 29 12
naturalness
javac-parser

Exposes OpenJDK's Java parser and scanner to Python

10K 7 4
krahd
modelito

Lightweight Python abstractions and connectors for LLMs (OpenAI, Claude, Gemini, Ollama)

7K 0 0
trag1c
crossandra

A fast and simple tokenization library for Python operating on enums and regular expressions, with a decent amount of configuration.

7K 9 1
cahya-wirawan
pyrwkv-tokenizer

A fast RWKV Tokenizer written in Rust

7K 54 5
tsproisl
somajo

A tokenizer and sentence splitter for German and English web and social media texts.

6K 153 21
    • Data from PyPI, GitHub, ClickHouse, and BigQuery