What does OCR Text Recognition change?

Frequently asked question for OCR Text Recognition.

It renders each PDF page to an image at 300 DPI and runs Tesseract.js OCR. Each page produces a text block with a confidence percentage. Results are concatenated with page markers.
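The render-and-concatenate steps can be sketched as follows. This is a minimal illustration, not the tool's actual code: the `PageResult` shape, the function names, and the page marker format are all assumptions. The only fixed fact is that PDF page sizes are measured in points (72 per inch), so rendering at 300 DPI means scaling by 300/72. The OCR call itself is left out so the sketch stays self-contained.

```typescript
// Hypothetical shape of one page's OCR output; field names are assumptions.
interface PageResult {
  pageNumber: number; // 1-based page index
  text: string;       // recognized text for the page
  confidence: number; // confidence percentage, 0-100
}

// PDF user space is measured in points, 72 per inch.
const PDF_POINTS_PER_INCH = 72;

// Scale factor needed to render a page at the target DPI.
function renderScale(dpi: number): number {
  return dpi / PDF_POINTS_PER_INCH;
}

// Pixel dimensions of a page (given in points) rendered at the target DPI.
function pixelSize(
  widthPt: number,
  heightPt: number,
  dpi: number
): { width: number; height: number } {
  const scale = renderScale(dpi);
  return {
    width: Math.round(widthPt * scale),
    height: Math.round(heightPt * scale),
  };
}

// Join per-page text blocks, each preceded by a marker line.
// The marker format here is invented for illustration.
function concatenatePages(pages: PageResult[]): string {
  return pages
    .map(
      (p) =>
        `--- Page ${p.pageNumber} (confidence: ${Math.round(p.confidence)}%) ---\n` +
        p.text.trim()
    )
    .join("\n\n");
}
```

For example, a US Letter page (612 x 792 points) comes out at 2550 x 3300 pixels at 300 DPI, which helps explain why each page is memory-hungry during processing.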

Limitations and what this does not protect

  • Tesseract.js runs in a Web Worker. Each page consumes ~50-100MB of RAM during processing. Documents over 50 pages may cause memory pressure on devices with less than 4GB free.
  • Handwritten text is poorly supported. Tesseract is designed for printed text. Expect less than 30% accuracy on handwriting.
  • Multi-column layouts are partially supported. Tesseract reads left-to-right by default and may interleave columns on complex layouts.
  • Tables are not preserved structurally. Cell contents are extracted as text, but row/column relationships are lost.
  • It cannot fix compromised devices, accounts, or unsafe sharing channels.
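The memory note in the list above suggests checking page count before starting and working through long documents in small groups. A rough sketch of that idea follows, assuming hypothetical helper names and a caller-chosen batch size. The 50-page figure comes from the list; whether batching actually lowers peak memory depends on how the worker releases page images, which is not stated here.

```typescript
// Guideline figure from the limitations list; a rough estimate, not a hard limit.
const PAGE_WARNING_THRESHOLD = 50;

// True when a document is past the page count at which memory pressure
// may appear on devices with little free RAM.
function exceedsPageGuideline(pageCount: number): boolean {
  return pageCount > PAGE_WARNING_THRESHOLD;
}

// Split 1-based page numbers into consecutive batches so a caller can
// process one batch at a time. Hypothetical helper for illustration.
function pageBatches(pageCount: number, batchSize: number): number[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: number[][] = [];
  for (let start = 1; start <= pageCount; start += batchSize) {
    const end = Math.min(start + batchSize - 1, pageCount);
    const batch: number[] = [];
    for (let page = start; page <= end; page++) {
      batch.push(page);
    }
    batches.push(batch);
  }
  return batches;
}
```

A caller might warn the user when `exceedsPageGuideline(pageCount)` is true, then run OCR over `pageBatches(pageCount, 10)` one batch at a time.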