PyRank
  • Insights
  • PyPI
  • GitHub
  • Search
  • Compare
  • Advisories
  • Ecosystem
  • About

Bpe Python Packages

Python packages with the GitHub topic bpe. Sorted by relevance, with stars and monthly downloads.
OpenNMT
pyonmttok

Fast and customizable text tokenization library with BPE and SentencePiece support

44K 333 83
akretion
nfelib

nfelib - bindings Python para e ler e gerar XML de NF-e, NFS-e nacional, CT-e, MDF-e, BP-e

34K 192 69
rsennrich
subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation

26K 2K 474
gweidart
rs-bpe

A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust

18K 38 5
vkcom
youtokentome

Unsupervised text tokenizer focused on computational efficiency

11K 978 108
JpCurada
filipino-tokenizer

The first open-source, morphologically-aware subword tokenizer for Philippine languages, combining a structured linguistic affix data with BPE for language model pretraining.

6K 4 0
Systemcluster
kitoken

Fast tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and more. Supports BPE, Unigram and WordPiece tokenization in JavaScript, Python and Rust.

3K 49 2
soaxelbrooke
bpe

Byte pair encoding for graceful handling of rare words in NLP

1K 232 39
neluca
tinybpe

🐍This is a fast, lightweight, and clean CPython extension for the Byte Pair Encoding (BPE) algorithm, which is commonly used in LLM tokenization and NLP tasks.

1K 4 0
VihangaFTW
bytetok

A fast, minimal BPE tokenizer for NLP research.

887 2 0
Hk669
bpetokenizer

A Byte Pair Encoding (BPE) tokenizer, which algorithmically follows along the GPT tokenizer(tiktoken), allows you to train your own tokenizer. The tokenizer is capable of handling special tokens and uses a customizable regex pattern for tokenization(includes the gpt4 regex pattern). supports `save` and `load` tokenizers in the `json` and `file` format. The `bpetokenizer` also supports [pretrained](bpetokenizer/pretrained/) tokenizers.

678 3 1
crodriguez1a
bpe-summarizer

This summarizer attempts to leverage Byte Pair Encoding (BPE) tokenization and the Bart vocabulary to filter text by semantic meaningfulness.

471 3 1
zouharvi
tokenization-scorer

Simple-to-use scoring function for arbitrarily tokenized texts.

459 48 6
sefineh-ai
amharic-tokenizer

Syllable-aware BPE tokenizer for the Amharic language (αŠ αˆ›αˆ­αŠ›) – fast, accurate, trainable.

440 99 14
Thibault00
runtoken

A blazing-fast BPE tokenizer for LLMs. Drop-in tiktoken replacement, 20-80x faster.

331 2 0
thammegowda
nlcodec

nlcodec is a collection of encoding schemes for natural language sequences. nlcodec.db is a efficient storage and retrieval layer for integer sequences of varying lengths.

299 5 4
DVDAGames
pgn-tokenizer

PGN Tokenizer, a Byte Pair Encoding (BPE) tokenizer for Chess Portable Game Notiation (PGN).

289 0 0
TnsaAi
tokenize2

Official Repository of Tokenize2 Tokenizers by TNSA

176 0 0
shiningsunnyday
geobpe

Geometric Byte Pair Encoding of Protein Structure (ICLR 2026)

167 24 4
akretion
nfelib-xsdata

nfelib - bindings Python para e ler e gerar XML de NF-e, NFS-e nacional, CT-e, MDF-e, BP-e

3 192 69
    • Data from PyPI, GitHub, ClickHouse, and BigQuery