I’m on a major decluttering tear. When I realised that the filing cabinet I bought three years ago would no longer close with all the papers stuffed into it, I knew something had to change. I’ve been shredding like it’s Houston in 2001. I have a duplex scanner to suck in the stuff I need to keep, and I’m moving to paperless wherever possible to stop it building up again.
My bank provides PDF statements. Of this I approve. PDF is almost perfect for this: it provides an electronic version of the page, but with searchable text and the potential for some level of security. Except, this is not the way that my bank does it. At first glance, the text looks pretty harmless:
Zoom in, and it gets a bit blocky:
Zoom right in:
Aargh! Blockarama! Did they really store text as bitmaps? Sure enough, pdftotext output from the files contains no text. Running pdfimages produces hundreds of tiny images; here’s just a few:
Dear oh dear. This format combines the worst of electronic storage with paper’s lack of computer indexability. The producer claims to be Xenos D2eVision. Smooth work there, Xenos.
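If you want to run the same check on one of your own PDFs, the two poppler-utils commands mentioned above can be combined into a quick test. This is only a sketch: `statement.pdf` is a placeholder filename, poppler-utils must be installed, and `pdfimages -list` needs a reasonably recent poppler.

```shell
#!/bin/sh
# Does this PDF have a real text layer, or is it images all the way down?
pdf=statement.pdf            # placeholder: substitute your own file

words=$(pdftotext "$pdf" - 2>/dev/null | wc -w)
if [ "$words" -eq 0 ]
then
    echo "$pdf: no text layer -- image-only"
else
    echo "$pdf: text layer present ($words words)"
fi

# An image-per-glyph PDF will list hundreds of tiny images here
pdfimages -list "$pdf" | head
```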
So, how can I fix this? It’s a bit of a pain to set this workflow up, but what I’ve done is:
- Convert the PDF to individual TIFF files at 300 dpi. Ghostscript is good for this:
gs -sDEVICE=tiffg4 -r300x300 -sOutputFile=file%03d.tif -dNOPAUSE -dBATCH -- file.pdf
- Run Tesseract OCR on the TIFF files to make hOCR output:
for f in file*.tif
do
    tesseract "$f" "$(basename "$f" .tif)" hocr
done
Update: Cuneiform seems to work better than Tesseract when feeding pdfbeads:
for f in file*.tif
do
    cuneiform -f hocr -o "$(basename "$f" .tif)".html "$f"
done
- Move the images and the hOCR files to a new folder; the filenames must correspond, so file001.tif needs file001.html, file002.tif needs file002.html, and so on.
- In the new folder, run
pdfbeads * > ../Output.pdf
The files are really small, and the text is recognized pretty well. It still looks pretty bad:
but at least the text can be copied and indexed.
This thread “Convert Scanned Images to a Single PDF File” got me up and running with PDFBeads. You might also have success using the method described here: “How to extract text with OCR from a PDF on Linux?”, which uses hocr2pdf to create single-page OCR’d PDFs, then joins them.
I work for Actuate, the company that owns the Xenos Doc Transformation (previously known as d2e Vision) software.
This is not a problem with the software or the way it functions, but rather with how the application was configured. The result you see is most likely due to an application incorrectly configured by an employee of the bank or another third party.
Based on your post, it appears the entire document and all its text have been “rasterized”. That is, the engine is producing a bitmap image of every character on the page from the metrics of the source document’s fonts. We see this all the time, for reasons such as (i) the people developing the applications don’t have enough knowledge of how to map fonts, or (ii) the application was developed in a hurry and this was the fastest way to get it online.
In order to resolve the problem, the application needs to be updated and an exercise we call “font mapping” needs to be done. This is the process of mapping the fonts in the input file (generally a print stream such as AFP, PCL or Metacode) to equivalent PDF fonts such as Helvetica, Arial, etc. Done correctly, this produces a fully searchable document using scalable vector fonts. It will also greatly improve the appearance when you zoom in, and reduce transform times if these are being run on demand by the bank when requested by their users.
Yes, something’s misconfigured. It’s the way that the PDF is made up of individual image fragments that’s weird. Incorrect font configuration was probably the cause of 95%+ of the prepress issues I ever had to deal with. You can have as lovely a PDF toolkit as you wish (as I’m sure Xenos is lovely), but if someone messes up the setup …
If pdfbeads crashes with a huge error starting:
/var/lib/gems/1.9.1/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:589:in `to_i': NaN (FloatDomainError)
you can fix it by following the “Followup” advice in the thread “Error with hOCR from tesseract”: change line 576 of pdfbuilder.rb to:
ratio = wwidth / (0.000000001 + @fdata.getLineWidth( ltxt,fsize ))
(rubyforge is gone, so the old link is dead)
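If you’d rather not edit the gem by hand, a sed one-liner can apply the same fix. This is a sketch: the path comes from the error message above, it assumes line 576 originally reads `ratio = wwidth / @fdata.getLineWidth( ltxt,fsize )`, and it leaves a `.bak` backup behind.

```shell
# Patch pdfbuilder.rb in place, adding the tiny offset that avoids the
# divide-by-zero NaN; a .bak copy of the original is kept.
sed -i.bak \
  's|ratio = wwidth / \(.*getLineWidth( ltxt,fsize )\)|ratio = wwidth / (0.000000001 + \1)|' \
  /var/lib/gems/1.9.1/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb
```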
Also, as of version 3.03, Tesseract supports going from a multipage TIFF directly to PDF.
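That direct route looks like this (`scan.tif` is a placeholder for a multipage TIFF; needs Tesseract 3.03 or later):

```shell
# One multipage TIFF in, one searchable PDF out (written as output.pdf)
tesseract scan.tif output pdf
```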
… and if you have a more texty PDF utility statement, ps2pdf (specifically, ps2pdf13) will cut it down to size while keeping it searchable.
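For example (a sketch; `big-statement.pdf` is a placeholder filename, and ps2pdf13 ships with Ghostscript):

```shell
# Re-distill to PDF 1.3, dropping bloat but keeping the text layer intact
ps2pdf13 big-statement.pdf smaller-statement.pdf
```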