How to use OCR Text Recognition
- Upload a scanned or image-based PDF and select the primary text language.
- Set the maximum number of pages to process. Each page takes 2-8 seconds depending on complexity.
- Click Extract text and monitor the per-page progress bar.
- Review the extracted text and per-page confidence scores. Scores above 85% are generally reliable; below 70%, expect errors.
- Download results as plain text. Pages with low confidence are flagged in the output.
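The timing and confidence figures in the steps above can be sketched in code. This is an illustration only, not the tool's actual implementation: the names `classifyConfidence`, `estimateSeconds`, `toPlainText`, and `PageResult` are hypothetical, as are the middle "review" band for scores between 70% and 85%, the choice to flag pages below 70%, and the output layout.

```typescript
// Thresholds come from the guidance above. The three-way split and all
// names here are illustrative assumptions, not the tool's real code.
const RELIABLE_ABOVE = 85;
const ERRORS_BELOW = 70;

type Quality = "reliable" | "review" | "low";

interface PageResult {
  page: number;       // 1-based page number
  text: string;       // extracted text for the page
  confidence: number; // per-page confidence, 0-100
}

// Map a confidence score to a quality band.
function classifyConfidence(confidence: number): Quality {
  if (confidence > RELIABLE_ABOVE) return "reliable";
  if (confidence < ERRORS_BELOW) return "low";
  return "review";
}

// Rough processing time at 2-8 seconds per page.
function estimateSeconds(pageCount: number): { min: number; max: number } {
  return { min: pageCount * 2, max: pageCount * 8 };
}

// Build a plain-text export and mark low-confidence pages.
// The header format is an assumption made for this sketch.
function toPlainText(results: PageResult[]): string {
  return results
    .map((r) => {
      const flag =
        classifyConfidence(r.confidence) === "low" ? " [LOW CONFIDENCE]" : "";
      return `--- Page ${r.page} (${r.confidence}%)${flag} ---\n${r.text}`;
    })
    .join("\n\n");
}
```

For example, `estimateSeconds(10)` gives a range of 20 to 80 seconds, and a page scored at 55% would carry the low-confidence marker in the export.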
Tips
- 300 DPI scans produce dramatically better results than 150 DPI. If confidence is low, rescan at higher DPI before blaming the OCR.
- The first run per language downloads a ~15MB Tesseract data file. Subsequent runs use the cached copy and skip the download.
- English is the default language model. For documents in other Latin-script languages (French, Spanish, Italian), the English model still works well. Switch language explicitly for non-Latin scripts (Chinese, Arabic, Korean).
- Process a single test page first to check quality before running all pages. This saves time on bad scans.
- Skewed or rotated scans reduce accuracy. Use the Rotate tool to fix orientation before running OCR.
- If you reach your quota, wait for the monthly reset or upgrade for unlimited usage.
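Two of the tips above, choosing a language model and checking one test page first, can be expressed as small decision helpers. This is a sketch under stated assumptions: `pickLanguageCode` and `shouldRunAllPages` are hypothetical names, the 70% cutoff for the test page is borrowed from the confidence guidance rather than documented tool behavior, and `chi_sim` (Simplified Chinese) stands in for "Chinese" even though Tesseract also ships `chi_tra` for Traditional Chinese.

```typescript
// Standard Tesseract language codes for the non-Latin scripts named above.
// "chinese" maps to Simplified Chinese here as an assumption.
const NON_LATIN_CODES: Record<string, string> = {
  chinese: "chi_sim",
  arabic: "ara",
  korean: "kor",
};

// Return the language code to load: an explicit model for non-Latin
// scripts, otherwise the English default.
function pickLanguageCode(documentLanguage: string): string {
  const key = documentLanguage.trim().toLowerCase();
  return NON_LATIN_CODES[key] ?? "eng";
}

// Decide whether a single test page looks good enough to justify
// processing the whole document. The default cutoff is an assumption.
function shouldRunAllPages(
  testPageConfidence: number,
  minConfidence: number = 70
): boolean {
  return testPageConfidence >= minConfidence;
}
```

With these helpers, a French document would stay on `eng`, a Korean document would load `kor`, and a test page scoring 62% would suggest rescanning or rotating before committing to a full run.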
Limitations
- Tesseract.js runs in a Web Worker. Each page consumes ~50-100MB of RAM during processing. Documents over 50 pages may cause memory pressure on devices with less than 4GB free.
- Handwritten text is poorly supported. Tesseract is designed for printed text. Expect less than 30% accuracy on handwriting.
- Multi-column layouts are partially supported. Tesseract reads left-to-right by default and may interleave columns on complex layouts.
- Tables are not preserved structurally. Cell contents are extracted as text, but row/column relationships are lost.
- It does not replace legal, compliance, or incident-response workflows.
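The memory figures above (more than 50 pages, less than 4GB free) suggest a simple pre-flight check before a long run. The sketch below is hypothetical: `memoryPressureLikely` and `MemoryCheck` are illustrative names, and the caller is assumed to supply its own estimate of free memory, since browsers do not expose an exact value.

```typescript
interface MemoryCheck {
  pageCount: number;    // pages queued for OCR
  freeMemoryGb: number; // caller's estimate of free RAM, in GB
}

// Page and memory limits taken from the limitation described above.
const PAGE_LIMIT = 50;
const FREE_MEMORY_LIMIT_GB = 4;

// True when a run matches the conditions described as risky:
// more than 50 pages on a device with less than 4GB free.
function memoryPressureLikely(check: MemoryCheck): boolean {
  return (
    check.pageCount > PAGE_LIMIT && check.freeMemoryGb < FREE_MEMORY_LIMIT_GB
  );
}
```

A caller could use this to suggest lowering the maximum page count before starting, for instance when `memoryPressureLikely({ pageCount: 80, freeMemoryGb: 2 })` returns true.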