OCR and searchable PDFs: the step that turns "scanning" into a useful file
Many people scan documents and think they're "organized"... until they need to find a number and realize the PDF is just a photo. OCR solves that.
1) What is OCR and why is it a game changer?
OCR (Optical Character Recognition) converts the image of a document into machine-readable text. The result is a searchable PDF whose content you can find, select, and copy.
2) Setting #1 for OCR: 300 dpi (in most cases)
For office documents, 300 dpi is usually sufficient and recommended for good OCR accuracy, balancing quality and file size.
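If the original is of poor quality (old photocopies, very small text), you can increase the resolution (e.g., 400–600 dpi). The trade-off is file size: the pixel count grows with the square of the resolution, so a 600 dpi scan holds four times as many pixels as a 300 dpi scan.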
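To make the trade-off concrete, here is a rough sketch of the arithmetic. It assumes an A4 page (8.27 x 11.69 inches) and counts raw pixels only; the function names are illustrative, and real file sizes depend on compression.

```python
# Rough sketch: how the pixel count of a page grows with scan resolution.
# The A4 page size and the dpi values are illustrative assumptions.

A4_INCHES = (8.27, 11.69)  # width, height of an A4 sheet in inches


def pixel_dimensions(dpi, page_inches=A4_INCHES):
    """Return (width_px, height_px) for a page scanned at the given dpi."""
    width_in, height_in = page_inches
    return round(width_in * dpi), round(height_in * dpi)


def megapixels(dpi, page_inches=A4_INCHES):
    """Total pixels per page, in millions."""
    width_px, height_px = pixel_dimensions(dpi, page_inches)
    return width_px * height_px / 1_000_000


if __name__ == "__main__":
    for dpi in (300, 400, 600):
        w, h = pixel_dimensions(dpi)
        print(f"{dpi} dpi: {w} x {h} px, about {megapixels(dpi):.1f} MP")
```

Doubling the resolution doubles both the width and the height in pixels, which is why the total quadruples.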
3) Color modes: black and white vs. grayscale vs. color
- Black and white: lightweight file, great for clean text.
- Grayscale: great for receipts and documents with stamps or shadows.
- Color: for when you need to preserve fidelity (signatures, stamps, graphics).
The goal is to keep the text legible and help OCR: enough contrast to separate text from background, without "blowing out" the image.
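The size difference between the modes comes from how many bits each pixel needs. The sketch below compares the uncompressed size of one page in each mode. The bit depths are the usual ones (1, 8, and 24 bits per pixel), while the page dimensions and function name are assumptions for illustration. Compressed files are usually much smaller, but the ordering tends to hold.

```python
# Rough sketch: uncompressed size of one scanned page in each color mode.

BITS_PER_PIXEL = {
    "black_and_white": 1,  # one bit per pixel
    "grayscale": 8,        # 256 shades of gray
    "color": 24,           # 8 bits each for red, green, and blue
}


def raw_page_bytes(width_px, height_px, mode):
    """Uncompressed size in bytes of a width_px x height_px page."""
    bits = width_px * height_px * BITS_PER_PIXEL[mode]
    return (bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    # 2481 x 3507 px is roughly an A4 page at 300 dpi (assumed example).
    for mode in BITS_PER_PIXEL:
        size_mb = raw_page_bytes(2481, 3507, mode) / 1_000_000
        print(f"{mode}: about {size_mb:.1f} MB uncompressed")
```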
4) Regular PDF vs. PDF/A: when to use each
PDF/A is an ISO standard (ISO 19005) for long-term preservation, designed so that documents can still be rendered the same way in the future.
If you're building a company's archive and want a more conservative format, PDF/A is excellent.
5) Step-by-step: creating a high-quality searchable PDF
- Scan at 300 dpi (usually in grayscale mode).
- Apply OCR (in the scanner software or in a separate OCR tool).
- Save as a searchable PDF.
- (Optional) If the file is meant for long-term archiving, export as PDF/A.
- Validate: open the PDF and search for a term (name, CPF, value).
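If you prefer to script the OCR and validation steps, the outline below is one possible sketch rather than the only way to do it. It assumes the third-party packages OCRmyPDF and pypdf are installed (OCRmyPDF also needs Tesseract on the system). The function names, file names, and search term are hypothetical.

```python
import re


def run_ocr(scan_path, output_path, archival=False, languages=("eng",)):
    """Add a text layer to a scanned PDF with OCRmyPDF (assumed installed)."""
    import ocrmypdf  # third-party; imported here so the helpers below work without it

    ocrmypdf.ocr(
        scan_path,
        output_path,
        language=list(languages),
        deskew=True,  # straighten crooked pages before recognition
        output_type="pdfa" if archival else "pdf",
    )


def extract_page_texts(pdf_path):
    """Return the text of each page, read with pypdf (assumed installed)."""
    from pypdf import PdfReader

    return [page.extract_text() or "" for page in PdfReader(pdf_path).pages]


def normalize(text):
    """Lowercase the text and drop spaces and punctuation."""
    return re.sub(r"[\W_]+", "", text.lower())


def contains_term(page_texts, term):
    """True if the term appears on any page, ignoring case and punctuation."""
    needle = normalize(term)
    return bool(needle) and any(needle in normalize(text) for text in page_texts)


def make_searchable_and_check(scan_path, output_path, term, archival=False):
    """Run OCR, then confirm that a known term can be found in the result."""
    run_ocr(scan_path, output_path, archival=archival)
    return contains_term(extract_page_texts(output_path), term)
```

A call such as `make_searchable_and_check("scan.pdf", "scan_searchable.pdf", "123.456.789-09", archival=True)` would run the whole sequence. When calling it from a script, placing the call under an `if __name__ == "__main__":` guard is the safer choice, since OCRmyPDF uses multiple processes. Matching ignores punctuation, so a CPF typed with or without dots and dashes is found either way.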
6) Common mistakes that ruin OCR
- Very low DPI (jagged text).
- Excessive brightness or contrast (washes out thin strokes).
- Crooked document (fix it with the "deskew"/alignment option).
- Heavy shadows (especially on crumpled receipts).
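The first mistake is easy to check on an existing scan: divide the image width in pixels by the paper width in inches. The sketch below assumes an A4 page (8.27 inches wide) and uses the 300 dpi guideline as the threshold; the function names are illustrative.

```python
# Rough sketch: estimate the resolution of an existing scan from its width.

def effective_dpi(width_px, page_width_inches=8.27):
    """Resolution implied by the pixel width of a scan of a known paper width."""
    return round(width_px / page_width_inches)


def resolution_ok(width_px, page_width_inches=8.27, minimum_dpi=300):
    """True if the scan meets the minimum resolution (300 dpi by default)."""
    return effective_dpi(width_px, page_width_inches) >= minimum_dpi


if __name__ == "__main__":
    for width_px in (1240, 2481):  # example widths, assumed A4 scans
        verdict = "ok" if resolution_ok(width_px) else "too low, rescan"
        print(f"{width_px} px wide: about {effective_dpi(width_px)} dpi ({verdict})")
```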
FAQ
Does OCR work on thermal receipts?
Yes, but it works best when you scan in grayscale with good contrast and the receipt is still legible (thermal print fades over time).
Is PDF/A always better?
No. It is the better choice for preservation and archiving, but a regular PDF is usually enough for everyday use.
