PyRank

PDF Extractor RAG Python Packages

Python packages with the GitHub topic pdf-extractor-rag. Sorted by relevance, with stars and monthly downloads.
PaddlePaddle
paddleocr

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

2.2M 78K 10K
opendatalab
mineru

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

305K 64K 5K
opendatalab
magic-pdf

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

76K 64K 5K
opendatalab
mineru-selfhosted-mcp

MCP bridge for a self-hosted MinerU API

5K 64K 5K
PaddlePaddle
fadoudou2

Awesome OCR toolkits based on PaddlePaddle (8.6M ultra-lightweight pre-trained model; supports training and deployment across server, mobile, embedded, and IoT devices)

481 78K 10K
Kubenew
pdf2struct

`pdf2struct` extracts structured JSON from PDF documents.

392 1 0
PaddlePaddle
paddleocrwordleveldetection

Awesome OCR toolkits based on PaddlePaddle (8.6M ultra-lightweight pre-trained model; supports training and deployment across server, mobile, embedded, and IoT devices)

296 78K 10K
PaddlePaddle
langchain-paddleocr

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

231 78K 10K
PaddlePaddle
je-paddleocr

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

231 78K 10K
PaddlePaddle
ppocrlabel-japan

PPOCRLabelv2 is a semi-automatic graphical annotation tool for OCR, with a built-in PP-OCR model that automatically detects and re-recognizes data. It is written in Python 3 and PyQt5 and supports rectangular box, table, irregular text, and key information annotation modes. Annotations can be used directly to train PP-OCR detection and recognition models.

163 78K 10K
ZhuJiaxin2
ragtable-extract

PDF table extraction for RAG — convert to clean HTML. Fast, local, no GPU.

147 1 0
opendatalab
xh-pdf-parser

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

141 64K 5K
opendatalab
lazyllm-magic-pdf

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

124 64K 5K
PaddlePaddle
paddleocr-fagougou

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

1 78K 10K
PaddlePaddle
fadoudou

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

1 78K 10K
    • Data from PyPI, GitHub, ClickHouse, and BigQuery