What does OCR Text Recognition change?
It renders each PDF page to an image at 300 DPI and runs Tesseract.js OCR on that image. Each page produces a text block with a confidence percentage. The per-page results are then concatenated, separated by page markers.
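As a rough illustration of the last two steps, the sketch below shows how a 300 DPI target translates into a render scale (PDF page dimensions are expressed in points, 72 per inch) and how per-page results might be joined. The `PageResult` shape, the function names, and the marker format are all assumptions made for this example, not the tool's actual code or output. The OCR call itself is left out.

```typescript
// Hypothetical shape for one page's OCR output; field names are illustrative.
interface PageResult {
  pageNumber: number; // 1-based page index
  text: string;       // recognized text for the page
  confidence: number; // 0-100, as reported by the OCR engine
}

// PDF page sizes are in points (72 per inch), so rendering at a
// target DPI means scaling the page by dpi / 72.
function renderScale(dpi: number): number {
  return dpi / 72;
}

// Join per-page text blocks, each preceded by a marker line.
// The marker format is an assumption made for this sketch.
function joinPages(pages: PageResult[]): string {
  return pages
    .map(
      (p) =>
        `--- Page ${p.pageNumber} (confidence: ${p.confidence.toFixed(0)}%) ---\n` +
        p.text.trim()
    )
    .join("\n\n");
}
```

At 300 DPI the scale works out to about 4.17, so a page renders at roughly four times its nominal point dimensions in each direction.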
What this does not protect
- Tesseract.js runs in a Web Worker. Each page consumes roughly 50-100 MB of RAM during processing. Documents over 50 pages may cause memory pressure on devices with less than 4 GB free.
- Handwritten text is poorly supported. Tesseract is designed for printed text, so expect less than 30% accuracy on handwriting.
- Multi-column layouts are only partially supported. Tesseract reads left-to-right by default and may interleave text from adjacent columns on complex layouts.
- Tables are not preserved structurally. Cell contents are extracted as text, but row/column relationships are lost.
- It cannot fix compromised devices, accounts, or unsafe sharing channels.
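For a sense of scale behind the per-page memory figure in the list above, the uncompressed bitmap of a single rendered page can be estimated directly. This is a back-of-the-envelope sketch that assumes a US Letter page (8.5 x 11 in) and 4 bytes per pixel (RGBA). Neither assumption comes from the tool itself, and the function name is hypothetical.

```typescript
// Estimate the size of an uncompressed page bitmap.
// Assumes 4 bytes per pixel (RGBA) unless told otherwise.
function rawBitmapBytes(
  widthIn: number,
  heightIn: number,
  dpi: number,
  bytesPerPixel: number = 4
): number {
  const widthPx = Math.round(widthIn * dpi);
  const heightPx = Math.round(heightIn * dpi);
  return widthPx * heightPx * bytesPerPixel;
}

// US Letter at 300 DPI: 2550 x 3300 pixels.
const letterBytes = rawBitmapBytes(8.5, 11, 300);
const letterMiB = letterBytes / (1024 * 1024);
console.log(`${letterBytes} bytes (~${letterMiB.toFixed(1)} MiB)`);
```

Under these assumptions the raw image alone is about 32 MiB, which is consistent with a total in the 50-100 MB range once any additional working memory for OCR is included.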