How to use OCR Text Recognition

Upload a scanned or image-based PDF and select the primary text language.
Set the maximum number of pages to process. Each page takes 2-8 seconds depending on complexity.
Click Extract text and monitor the per-page progress bar.
Review the extracted text and the per-page confidence scores. Scores above 85% are reliable; below 70%, expect errors.
Download results as plain text. Pages with low confidence are flagged in the output.
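
The thresholds and timing in the steps above can be sketched in code. This is a minimal illustration, not the tool's own logic: the names triage, pagesToRecheck, and estimateSeconds are hypothetical, and the "review" label for scores between 70% and 85% is an assumption, since the steps only describe the two ends of the range.

```typescript
// Thresholds from the steps above. All names here are illustrative.
const RELIABLE_ABOVE = 85;
const ERRORS_BELOW = 70;

type Verdict = "reliable" | "review" | "expect-errors";

interface PageResult {
  page: number;       // 1-based page number
  confidence: number; // per-page confidence, 0-100
}

// Classify one page. The middle band ("review") is an assumption.
function triage(confidence: number): Verdict {
  if (confidence > RELIABLE_ABOVE) return "reliable";
  if (confidence < ERRORS_BELOW) return "expect-errors";
  return "review";
}

// Pages worth a second look: anything not classified as reliable.
function pagesToRecheck(results: PageResult[]): number[] {
  return results
    .filter((r) => triage(r.confidence) !== "reliable")
    .map((r) => r.page);
}

// Rough run time in seconds, using the 2-8 seconds per page figure.
function estimateSeconds(pages: number): { min: number; max: number } {
  return { min: pages * 2, max: pages * 8 };
}

const sample: PageResult[] = [
  { page: 1, confidence: 92 },
  { page: 2, confidence: 78 },
  { page: 3, confidence: 61 },
];

console.log(pagesToRecheck(sample)); // pages 2 and 3
console.log(estimateSeconds(20));    // min 40, max 160
```

A 20-page document would therefore take roughly 40 to 160 seconds, and only pages 2 and 3 of the sample would need a manual check.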

Tips

300 DPI scans produce dramatically better results than 150 DPI. If confidence is low, rescan at higher DPI before blaming the OCR.
The first run per language downloads a ~15MB Tesseract data file. Subsequent runs use the cached version, so nothing is downloaded again.
English is the default language model. For documents that mix Latin-script languages (French, Spanish, Italian), English still works well. Switch the language explicitly for non-Latin scripts (Chinese, Arabic, Korean).
Process a single test page first to check quality before running all pages. This saves time on bad scans.
Skewed or rotated scans reduce accuracy. Use the Rotate tool to fix orientation before running OCR.
If you reach your quota, wait for the monthly reset or upgrade for unlimited usage.
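
To check whether a scan meets the 300 DPI guidance in the tips above, divide the image width in pixels by the physical page width in inches. The sketch below assumes you know both values; effectiveDpi and shouldRescan are hypothetical helpers, not part of the tool.

```typescript
// Effective scan resolution: pixels across divided by inches across.
function effectiveDpi(pixelWidth: number, pageWidthInches: number): number {
  if (pixelWidth <= 0 || pageWidthInches <= 0) {
    throw new RangeError("dimensions must be positive");
  }
  return pixelWidth / pageWidthInches;
}

// Hypothetical helper: true when the scan falls short of the target DPI.
function shouldRescan(
  pixelWidth: number,
  pageWidthInches: number,
  targetDpi = 300,
): boolean {
  return effectiveDpi(pixelWidth, pageWidthInches) < targetDpi;
}

// A US Letter page is 8.5 inches wide.
console.log(effectiveDpi(2550, 8.5)); // 300
console.log(shouldRescan(1275, 8.5)); // true, since 1275 / 8.5 = 150 DPI
```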

Limitations

Tesseract.js runs in a Web Worker. Each page consumes ~50-100MB of RAM during processing. Documents over 50 pages may cause memory pressure on devices with less than 4GB free.
Handwritten text is poorly supported. Tesseract is designed for printed text. Expect less than 30% accuracy on handwriting.
Multi-column layouts are partially supported. Tesseract reads left-to-right by default and may interleave columns on complex layouts.
Tables are not preserved structurally. Cell contents are extracted as text, but row/column relationships are lost.
This tool does not replace legal, compliance, or incident-response workflows.
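
The memory figures above can be turned into a rough pre-flight check. The sketch below makes a pessimistic assumption that is not stated above: that per-page memory accumulates across a run instead of being released after each page. The names worstCaseMb and mayCauseMemoryPressure are hypothetical.

```typescript
// Upper end of the ~50-100MB per-page figure.
const MAX_MB_PER_PAGE = 100;

// Pessimistic estimate: assumes memory is NOT released between pages.
function worstCaseMb(pages: number): number {
  return pages * MAX_MB_PER_PAGE;
}

// True when the worst-case estimate exceeds the free memory available.
function mayCauseMemoryPressure(pages: number, freeMb: number): boolean {
  return worstCaseMb(pages) > freeMb;
}

const FOUR_GB_IN_MB = 4096;

console.log(mayCauseMemoryPressure(50, FOUR_GB_IN_MB)); // true  (5000 > 4096)
console.log(mayCauseMemoryPressure(40, FOUR_GB_IN_MB)); // false (4000 < 4096)
```

Under that assumption, a 50-page document can exceed 4GB of free memory in the worst case, which is in line with the 50-page guidance.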