Document Intelligence Python Packages

kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or usable via CLI, REST API, or MCP server.

191K 9K 507

paddlenlp

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

37K 13K 3K

infinity-parser2

INF Tech's open-source MLLMs for SOTA visual-language understanding and advanced document intelligence.

14K 222 24

contextgem

ContextGem: Effortless LLM extraction from documents

13K 2K 158

tool-helpers

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

10K 13K 3K

ks-xlsx-parser

XLSX parser for LLMs, RAG, LangChain, LangGraph, CrewAI, Claude, MCP — turns Excel (.xlsx) into citation-ready JSON with formulas, charts, dependency graphs, and token-counted chunks. Open-source Python library (MIT).

10K 40 4

fast-dataindex

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

5K 13K 3K

docnest-ai

The document normalization engine RAG has always needed. Parse any document, understand its structure, build RAG that actually works.

2K 4 2

fast-tokenizer-python

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

2K 13K 3K

faster-tokenizer

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

2K 13K 3K

vectorless

Knowing by reasoning, not vectors.

1K 39 2

preocr

Fast document classification and OCR detection. Analyzes any file type to determine if OCR is needed, saving time and money on unnecessary processing.

1K 12 4

langchain-kreuzberg

LangChain document loader for Kreuzberg

839 5 1

docimprint

Document memory agents can prove — SDK for tamper-evident evidence bundles, citations, and on-chain attestation

788 0 0

xberg

785 9K 507

knwler

Knwler is a lightweight Python tool that extracts structured knowledge graphs from documents using AI. Feed it a PDF or text file and receive a richly connected network of entities, relationships, and topics — complete with an interactive HTML report and exports ready for your favorite graph analytics platform.

621 144 15

decompose-mcp

The missing cognitive primitive for AI agents. Decompose any text into classified semantic units — authority, risk, attention, entities. No LLM. Deterministic.

617 10 3

faster-tokenizers

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

530 13K 3K

longparser

Privacy-first document intelligence engine — parse PDFs, DOCX, PPTX, XLSX & CSV into AI-ready chunks for RAG pipelines. Includes HITL review, 3-layer memory chat, and a production FastAPI server.

442 29 2

paddle-pipelines

Paddle-Pipelines: An End-to-End Natural Language Processing Development Kit Based on PaddleNLP

376 13K 3K

kreuzberg-surrealdb

Kreuzberg-to-SurrealDB connector for document ingestion pipelines — schema management, content deduplication, chunk storage, and index configuration

372 16 1

llama-index-readers-kreuzberg

LlamaIndex reader for 88+ document formats powered by kreuzberg's Rust extraction engine

352 1 0

mseep-kreuzberg

259 9K 508

llama-index-node-parser-kreuzberg

Element-aware LlamaIndex node parser for kreuzberg-extracted documents

223 1 0