August 22, 2024

Install MinerU Locally to Create LLM Dataset from PDF Files

Fahd Mirza

This video shows how to install MinerU, an LLM-powered tool that converts PDFs into machine-readable formats (e.g., Markdown, JSON), making it easy to extract their content and build datasets.


Code:

git clone https://github.com/opendatalab/MinerU.git && cd MinerU

conda create -n MinerU python=3.10 && conda activate MinerU

pip install "magic-pdf[full]==0.7.0b1" --extra-index-url https://wheels.myhloli.com

magic-pdf --version

git lfs install

mkdir model
cd model
git clone https://huggingface.co/wanderkid/PDF-Extract-Kit

Edit magic-pdf.json: point models-dir at the downloaded models and set the device to cuda (the device-mode key).
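
The config edit above can also be scripted. This is a minimal sketch, assuming magic-pdf.json is a plain JSON file with `models-dir` and `device-mode` keys; check the key names against your own copy of the file. `update_magic_pdf_config` is a hypothetical helper written for this post, not part of MinerU.

```python
import json
from pathlib import Path


def update_magic_pdf_config(config_path, models_dir, device_mode="cuda"):
    """Set the model directory and device in a magic-pdf.json style file."""
    path = Path(config_path)
    # Start from the existing settings so unrelated keys are preserved.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    config["models-dir"] = str(models_dir)  # assumed key name
    config["device-mode"] = device_mode     # assumed key name; "cpu" or "cuda"
    path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config
```

Call it with the location of your magic-pdf.json and the folder inside the cloned PDF-Extract-Kit that holds the model weights; both depend on your setup, so they are left as parameters here.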

wget https://github.com/opendatalab/MinerU/raw/master/demo/small_ocr.pdf

magic-pdf -p small_ocr.pdf
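
Once the PDF has been converted, the Markdown output can be turned into dataset records. The following is a rough sketch only: it assumes you have already read the generated Markdown into a string, and the paragraph-based chunking, the `max_chars` limit, and the `source`/`text` field names are illustrative choices, not anything MinerU prescribes. Both functions are hypothetical helpers.

```python
import json


def markdown_to_records(markdown_text, source, max_chars=1000):
    """Group blank-line-separated paragraphs into chunks of roughly max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in markdown_text.split("\n\n") if p.strip()]
    records, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow the chunk: flush and restart.
            records.append({"source": source, "text": current})
            current = para
        else:
            current = candidate
    if current:
        records.append({"source": source, "text": current})
    return records


def write_jsonl(records, out_path):
    """Write one JSON object per line (JSONL)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

JSONL is used here because most fine-tuning and data-loading tools accept it directly, but the records could just as well be written to CSV or Parquet.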