PyRank

Pretraining Python Packages

Python packages tagged with the GitHub topic "pretraining", sorted by relevance and shown with stars and monthly downloads.
MatthewK78
rose-opt

🌹 Rose: Range-Of-Slice Equilibration PyTorch optimizer. Stateless optimization through range-normalized gradient updates.

4K 60 4
alea-institute
alea-preprocess

Accessible, efficient data preprocessing library for pretrain and SFT datasets, including KL3M

906 1 0
lpalbou
forgellm

A comprehensive toolkit for end-to-end continued pre-training, fine-tuning, monitoring, testing and publishing of language models with MLX-LM

686 4 0
vincentzed
decontaminate

`decon`, but with Python API bindings.

649 3 0
loicgrobol
zeldarose

Train transformer-based models.

543 28 3
4thel00z
ccdown

A Rust-based, resumable downloader CLI and Python library for Common Crawl data

489 0 0
AI-sandbox
iltm

iLTM: Integrated Large Tabular Model

470 19 0
PaddlePaddle
fleet-x

No description available

370 479 165
kyo-takano
chinchilla

A toolkit for scaling law research ⚖

351 63 4
dean0x
autoevolve

Multi-agent research competition orchestrator for autoresearch

308 6 1
a-r-j
proteinworkshop

Benchmarking framework for protein representation learning. Includes a large number of pre-training and downstream task datasets, models and training/task utilities. (ICLR 2024)

272 273 22
dean0x
autojudge

Smarter experiment evaluation for autoresearch — replaces eyeballing val_bpb with statistical verdicts

245 6 1
dean0x
autosteer

Companion tools for Karpathy's autoresearch - smarter evaluation, guided steering, and multi-agent competitions for GPT pretraining

243 6 1
open-sciencelab
graphg

GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

237 1K 82
PaddlePaddle
fleet-lightning

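PaddlePaddle large-model development suite, providing an end-to-end development toolchain for large language models, cross-modal large models, biological computing large models, and other domains.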

179 479 165
anto18671
lumenspark

Lumenspark is a lightweight Linformer-based Language Model Trained from Scratch

169 1 0
adrien-lagesse
ngab

Benchmarking and generating PE for GNNs via the Graph Alignment task. Code for our paper: Graph Alignment for Benchmarking Graph Neural Networks and Learning Positional Encodings

141 0 0
marian-nmt
sotastream

A library for data streaming and augmentation

104 21 4
PaddlePaddle
paddle-fleet

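PaddlePaddle large-model development suite, providing an end-to-end development toolchain for large language models, cross-modal large models, biological computing large models, and other domains.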

64 479 165
    • Data from PyPI, GitHub, ClickHouse, and BigQuery