PyRank

Document Intelligence Python Packages

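Python packages tagged with the GitHub topic document-intelligence, sorted by relevance. Each entry gives the owner, the package name, a description, and a line of counts: monthly downloads, then GitHub stars, then a third figure that appears to be forks.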
kreuzberg-dev
kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or via CLI, REST API, or MCP server.

170K 8K 488
PaddlePaddle
paddlenlp

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

31K 13K 3K
shcherbak-ai
contextgem

ContextGem: Effortless LLM extraction from documents

13K 2K 156
PaddlePaddle
tool-helpers

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

10K 13K 3K
PaddlePaddle
fast-dataindex

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

9K 13K 3K
yuvaraj3855
preocr

Fast document classification and OCR detection. Analyzes any file type to determine if OCR is needed, saving time and money on unnecessary processing.

3K 10 4
vectorlessflow
vectorless

Knowing by reasoning, not vectors.

3K 33 2
knowledgestack
ks-xlsx-parser

XLSX parser for LLMs, RAG, LangChain, LangGraph, CrewAI, Claude, MCP — turns Excel (.xlsx) into citation-ready JSON with formulas, charts, dependency graphs, and token-counted chunks. Open-source Python library (MIT).

1K 17 2
infly-ai
infinity-parser2

INF Tech's open-source MLLMs for SOTA visual-language understanding and advanced document intelligence.

1K 149 14
ENDEVSOLS
longparser

Privacy-first document intelligence engine — parse PDFs, DOCX, PPTX, XLSX & CSV into AI-ready chunks for RAG pipelines. Includes HITL review, 3-layer memory chat, and a production FastAPI server.

1K 26 2
PaddlePaddle
fast-tokenizer-python

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

979 13K 3K
PaddlePaddle
faster-tokenizer

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

898 13K 3K
kreuzberg-dev
llama-index-readers-kreuzberg

LlamaIndex reader and node parser integrations for kreuzberg — 88+ format document extraction with element-aware splitting

752 0 0
kreuzberg-dev
langchain-kreuzberg

Langchain document loader for Kreuzberg

511 5 1
Orbifold
knwler

Knwler is a lightweight, single-file Python tool that extracts structured knowledge graphs from documents using AI. Feed it a PDF or text file and receive a richly connected network of entities, relationships, and topics — complete with an interactive HTML report and exports ready for your favorite graph analytics platform.

400 125 10
echology-io
decompose-mcp

The missing cognitive primitive for AI agents. Decompose any text into classified semantic units — authority, risk, attention, entities. No LLM. Deterministic.

346 9 2
PaddlePaddle
faster-tokenizers

PaddleNLP Faster Tokenizer Library written in C++

333 13K 3K
kreuzberg-dev
llama-index-node-parser-kreuzberg

LlamaIndex reader and node parser integrations for kreuzberg — 88+ format document extraction with element-aware splitting

249 0 0
PaddlePaddle
paddle-pipelines

Paddle-Pipelines: An End-to-End Natural Language Processing Development Kit Based on PaddleNLP

241 13K 3K
Goldziher
mseep-kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or via CLI, REST API, or MCP server.

160 8K 487
AiAgentKarl
document-intelligence-mcp

Local document intelligence MCP server — extract text, tables, metadata from PDF and DOCX. No API key needed.

147 0 0
kreuzberg-dev
kreuzberg-txtai

Kreuzberg integration for txtai — drop-in Textractor replacement and custom pipeline

144 3 0
kreuzberg-dev
kreuzberg-crewai

Extract text and metadata from 88+ document formats — PDF, DOCX, XLSX, HTML, images with OCR, and more — directly from your CrewAI agents.

109 1 0
    • Data from PyPI, GitHub, ClickHouse, and BigQuery