OCR Text Recognition
Extract text from scanned or image-based PDF pages using in-browser Tesseract OCR.
Category: Convert
Processing: On-device
Quota bucket: Heavy
Open interactive tool: /tools/ocr
How to use OCR Text Recognition
- Upload a scanned or image-based PDF and select the primary text language.
- Set the maximum number of pages to process. Each page takes 2-8 seconds depending on complexity.
- Click Extract text and monitor the per-page progress bar.
- Review extracted text and per-page confidence scores. Scores above 85% are generally reliable; below 70%, expect errors.
- Download results as plain text. Pages with low confidence are flagged in the output.
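The confidence thresholds in the steps above can be sketched as a small helper. This is an illustration, not the tool's actual code: the names `PageResult`, `classifyConfidence`, and `flaggedPages` are hypothetical, the "uncertain" band for 70-85% is an assumption (the steps do not name that range), and flagging pages below 70% is likewise assumed.

```typescript
// Hypothetical shape of one page's OCR result; not the tool's real data model.
interface PageResult {
  page: number;       // 1-based page number
  text: string;       // extracted text for the page
  confidence: number; // confidence percentage, 0-100
}

type ConfidenceBand = "reliable" | "uncertain" | "low";

// Above 85% is treated as reliable and below 70% as low, per the steps above.
// The middle band ("uncertain") is an assumption made for this sketch.
function classifyConfidence(confidence: number): ConfidenceBand {
  if (confidence > 85) return "reliable";
  if (confidence < 70) return "low";
  return "uncertain";
}

// Returns the page numbers that would be flagged, assuming "low" means below 70%.
function flaggedPages(results: PageResult[]): number[] {
  return results
    .filter((r) => classifyConfidence(r.confidence) === "low")
    .map((r) => r.page);
}
```

For example, a three-page result with confidences 92, 64, and 78 would flag only page 2 under these assumptions.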
Tips
- 300 DPI scans produce dramatically better results than 150 DPI. If confidence is low, rescan at higher DPI before blaming the OCR.
- The first run per language downloads a ~15 MB Tesseract data file. Subsequent runs use the cached version, so nothing is downloaded again.
- English is the default language model. For documents mixing Latin-script languages (French, Spanish, Italian), English often still works well, though accented characters may be misrecognized; select the specific language if that matters. Switch language explicitly for non-Latin scripts (Chinese, Arabic, Korean).
- Process a single test page first to check quality before running all pages. This saves time on bad scans.
- Skewed or rotated scans reduce accuracy. Use the Rotate tool to fix orientation before running OCR.
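The single-page test pays off because total run time grows linearly with page count. A rough estimate, using the 2-8 seconds per page quoted in the steps, can be sketched as follows. The function name `estimateOcrTime` and its parameters are hypothetical, and real timings will depend on the device and the scan.

```typescript
interface TimeEstimate {
  minSeconds: number;
  maxSeconds: number;
}

// Multiplies the page count by the per-page range (2-8 s by default, from the steps above).
function estimateOcrTime(
  pageCount: number,
  minPerPage = 2,
  maxPerPage = 8,
): TimeEstimate {
  if (!Number.isInteger(pageCount) || pageCount < 0) {
    throw new RangeError("pageCount must be a non-negative integer");
  }
  return {
    minSeconds: pageCount * minPerPage,
    maxSeconds: pageCount * maxPerPage,
  };
}

// A 40-page scan: between 80 and 320 seconds under these assumptions.
const fortyPages = estimateOcrTime(40);
console.log(`${fortyPages.minSeconds}-${fortyPages.maxSeconds} s`);
```

A one-page test costs at most a few seconds, which is small next to a multi-minute run on a scan that turns out to be unusable.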
Privacy: Your files never leave your browser. All processing runs on-device.
Full privacy model
Frequently asked questions
What does OCR Text Recognition change?
It renders each PDF page to an image at 300 DPI and runs Tesseract.js OCR. Each page produces a text block with a confidence percentage. Results are concatenated with page markers.
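As an illustration of the last step, per-page text blocks could be joined with page markers along these lines. The marker format shown here is an assumption, since the exact format the tool writes is not documented above, and `PageText` and `joinPages` are hypothetical names.

```typescript
// Hypothetical per-page result; not the tool's real data model.
interface PageText {
  page: number;       // 1-based page number
  text: string;       // OCR text for the page
  confidence: number; // confidence percentage, 0-100
}

// Sorts pages by number and joins them, each preceded by an assumed marker line.
function joinPages(pages: PageText[]): string {
  return [...pages] // copy so the caller's array is not reordered
    .sort((a, b) => a.page - b.page)
    .map(
      (p) =>
        `--- Page ${p.page} (confidence ${Math.round(p.confidence)}%) ---\n` +
        p.text.trim(),
    )
    .join("\n\n");
}
```

Keeping the page number and confidence in the marker makes it possible to trace a questionable passage in the plain-text output back to its source page.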
Is OCR Text Recognition private by default?
All OCR processing happens in your browser using Tesseract.js (WASM). No page images leave your device. Language data files are cached in the browser after first download.
What does OCR Text Recognition not protect?
It extracts text only and does not create a searchable PDF layer. Output quality varies with scan resolution, language selection, and font clarity. Below 70% confidence, expect word-level errors.
Limitations
- Tesseract.js runs in a Web Worker. Each page consumes ~50-100 MB of RAM during processing. Documents over 50 pages may cause memory pressure on devices with less than 4 GB free.
- Handwritten text is poorly supported. Tesseract is designed for printed text. Expect less than 30% accuracy on handwriting.
- Multi-column layouts are partially supported. Tesseract reads left-to-right by default and may interleave columns on complex layouts.
- Tables are not preserved structurally. Cell contents are extracted as text, but row/column relationships are lost.
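The memory guidance in the first limitation can be expressed as a simple pre-flight check. This is a sketch under stated assumptions: `memoryPressureRisk` is a hypothetical helper, the 50-page and 4 GB figures come from the limitation above, and the free-memory value must be supplied by the caller because browsers do not expose it reliably.

```typescript
// Limits quoted in the limitation above.
const PAGE_LIMIT = 50;
const FREE_MEMORY_LIMIT_GB = 4;

// Returns true when a run matches the risky case described above:
// more than 50 pages on a device with less than 4 GB free.
function memoryPressureRisk(pageCount: number, freeMemoryGb: number): boolean {
  return pageCount > PAGE_LIMIT && freeMemoryGb < FREE_MEMORY_LIMIT_GB;
}

// Example: a 60-page scan on a device with about 3 GB free.
if (memoryPressureRisk(60, 3)) {
  console.log("Consider lowering the maximum number of pages to process.");
}
```

When the check returns true, lowering the maximum number of pages in the tool is the most direct way to reduce the load.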