OCR Text Recognition — How to

Step-by-step guide for OCR Text Recognition.

How to use OCR Text Recognition

  1. Upload a scanned or image-based PDF and select the primary text language.
  2. Set the maximum number of pages to process. Each page takes 2-8 seconds depending on complexity.
  3. Click Extract text and monitor the per-page progress bar.
  4. Review the extracted text and the per-page confidence scores. Scores above 85% are generally reliable; below 70%, expect errors.
  5. Download results as plain text. Pages with low confidence are flagged in the output.
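
As a rough illustration of steps 2, 4, and 5, the sketch below estimates run time from the 2-8 seconds-per-page figure and sorts pages by the confidence thresholds above. The PageResult shape, the function names, and the "review" label for scores between 70% and 85% are assumptions made for this example; they are not the tool's actual output format.

```typescript
// Thresholds from step 4 of the guide. Names are illustrative only.
const RELIABLE_ABOVE = 85;
const ERRORS_BELOW = 70;

type Reliability = "reliable" | "review" | "low";

// Hypothetical per-page result shape (an assumption, not the tool's format).
interface PageResult {
  page: number;       // 1-based page number
  text: string;       // extracted text for the page
  confidence: number; // per-page confidence, 0-100
}

// Step 2: 2-8 seconds per page gives a rough range for the whole job.
function estimateSeconds(pages: number): { min: number; max: number } {
  return { min: pages * 2, max: pages * 8 };
}

// Step 4: map a confidence score onto the guide's thresholds.
function classifyConfidence(confidence: number): Reliability {
  if (confidence > RELIABLE_ABOVE) return "reliable";
  if (confidence < ERRORS_BELOW) return "low";
  // The guide gives no verdict for 70-85, so this sketch asks for a manual check.
  return "review";
}

// Step 5: list the pages that would be flagged as low confidence.
function lowConfidencePages(results: PageResult[]): number[] {
  return results
    .filter((r) => classifyConfidence(r.confidence) === "low")
    .map((r) => r.page);
}
```

For a 10-page document, estimateSeconds(10) gives a range of 20 to 80 seconds.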

Tips

  • 300 DPI scans produce dramatically better results than 150 DPI. If confidence is low, rescan at higher DPI before blaming the OCR.
  • The first run per language downloads a ~15 MB Tesseract data file. Subsequent runs use the cached copy, so nothing is downloaded again.
  • English is the default language model. For documents that mix Latin-script languages (French, Spanish, Italian), the English model usually still works, although accented characters may be misrecognized. Select the language explicitly for non-Latin scripts (Chinese, Arabic, Korean).
  • Process a single test page first to check quality before running all pages. This saves time on bad scans.
  • Skewed or rotated scans reduce accuracy. Use the Rotate tool to fix orientation before running OCR.
  • If you reach your quota, wait for the monthly reset or upgrade for unlimited usage.
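
To make the DPI tip concrete: when a scan's resolution is unknown, it can be estimated from the image's pixel width and the physical page width. The helper names below are hypothetical, and the 300 DPI default target is taken from the tip above.

```typescript
// Effective DPI = pixel width / physical page width in inches.
// Hypothetical helper for illustration; not part of the tool.
function effectiveDpi(pixelWidth: number, pageWidthInches: number): number {
  if (pageWidthInches <= 0) {
    throw new RangeError("page width must be positive");
  }
  return Math.round(pixelWidth / pageWidthInches);
}

// True when the scan falls short of the target resolution (300 DPI by default).
function needsRescan(
  pixelWidth: number,
  pageWidthInches: number,
  targetDpi: number = 300
): boolean {
  return effectiveDpi(pixelWidth, pageWidthInches) < targetDpi;
}

// US Letter is 8.5 inches wide.
const lowRes = effectiveDpi(1275, 8.5);  // 150
const highRes = effectiveDpi(2550, 8.5); // 300
```

Here a 1275-pixel-wide Letter scan would be worth rescanning, while a 2550-pixel-wide one already meets the 300 DPI target.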

Limitations

  • Tesseract.js runs in a Web Worker. Each page consumes ~50-100 MB of RAM during processing. Documents over 50 pages may cause memory pressure on devices with less than 4 GB free.
  • Handwritten text is poorly supported. Tesseract is designed for printed text. Expect less than 30% accuracy on handwriting.
  • Multi-column layouts are partially supported. Tesseract reads left-to-right by default and may interleave columns on complex layouts.
  • Tables are not preserved structurally. Cell contents are extracted as text, but row/column relationships are lost.
  • This tool does not replace legal, compliance, or incident-response workflows.
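
The memory figures above can be turned into a quick estimate. The sketch below assumes the worst case, in which per-page memory is not released between pages; the 50-page and 4 GB figures suggest this but the guide does not say so outright. It also shows one possible workaround, splitting a long document into smaller page ranges before upload. Both function names are hypothetical, and the tool itself is not known to accept page ranges.

```typescript
// Per-page RAM range from the guide, in megabytes.
const MB_PER_PAGE_MIN = 50;
const MB_PER_PAGE_MAX = 100;

// ASSUMPTION: worst case, where memory accumulates across pages.
function estimatePeakMemoryMb(pages: number): { min: number; max: number } {
  return { min: pages * MB_PER_PAGE_MIN, max: pages * MB_PER_PAGE_MAX };
}

// Split pages 1..totalPages into consecutive inclusive ranges
// of at most batchSize pages each.
function planBatches(
  totalPages: number,
  batchSize: number
): Array<[number, number]> {
  if (!Number.isInteger(totalPages) || totalPages < 0) {
    throw new RangeError("totalPages must be a non-negative integer");
  }
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: Array<[number, number]> = [];
  for (let start = 1; start <= totalPages; start += batchSize) {
    batches.push([start, Math.min(start + batchSize - 1, totalPages)]);
  }
  return batches;
}
```

Under that assumption, a 50-page document would need roughly 2500 to 5000 MB, which is why the 4 GB threshold matters. planBatches(120, 50) yields the ranges 1-50, 51-100, and 101-120.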