Pdf Extractor Rag Python Packages

paddleocr

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

2.5M 85K 11K

mineru

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

365K 73K 6K

magic-pdf

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

72K 73K 6K

mineru-next-dev

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

2K 73K 6K

langchain-paddleocr

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

2K 85K 11K

mineru-selfhosted-mcp

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

1K 73K 6K

fadoudou2

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

531 85K 11K

paddleocrwordleveldetection

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

249 85K 11K

ragtable-extract

PDF table extraction for RAG and LLM — convert PDF tables to clean HTML. Fast, local, no GPU. Handles merged cells, line-wrapped text, no serialization.

205 1 0

je-paddleocr

Awesome OCR toolkits based on PaddlePaddle(8.6M ultra-lightweight pre-trained model, support training and deployment among server, mobile, embedded and IoT devices)

205 85K 11K

ppocrlabel-japan

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

203 85K 11K

xh-pdf-parser

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

187 73K 6K

lazyllm-magic-pdf

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

142 73K 6K

pdf2struct

pdf2struct: extract structured JSON from PDFs (text, metadata, tables, OCR, invoice key-value fields).

121 1 0

paddleocr-fagougou

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

1 85K 11K

fadoudou

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

1 85K 11K