My bank broke PDF … and how I used PDFBeads to fix it

I’m on a major decluttering toot. When I realised that the filing cabinet I bought three years ago would no longer close with all the papers stuffed in it, I knew something had to change. I’ve been shredding like it’s Houston in 2001. I have the duplex scanner to suck in the stuff I need to keep. I’m moving to paperless wherever possible to stop it building up again.

My bank provides PDF statements. Of this I approve. PDF is almost perfect for this: it provides an electronic version of the page, but with searchable text and the potential for some level of security. Except, this is not the way that my bank does it. At first glance, the text looks pretty harmless:

Zoom in, and it gets a bit blocky:

Zoom right in:

Aargh! Blockarama! Did they really store text as bitmaps? Sure enough, pdftotext output from the files contains no text. Running pdfimages produces hundreds of tiny images; here are just a few:

Dear oh dear. This format combines the worst of electronic documents with paper’s lack of computer indexability. The producer claims to be Xenos D2eVision. Smooth work there, Xenos.
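If you have a pile of statements to check, the pdftotext test can be scripted. This is only a rough sketch: the function names are my own invention, and it assumes pdftotext (from Poppler or Xpdf) is on your PATH. It simply counts the non-whitespace characters that pdftotext manages to extract.

```shell
# Count the non-whitespace characters arriving on stdin.
count_visible_chars() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

# Succeed if pdftotext extracts at least one visible character from the PDF "$1".
# (Hypothetical helper; "-" tells pdftotext to write to standard output.)
has_text_layer() {
    [ "$(pdftotext "$1" - | count_visible_chars)" -gt 0 ]
}
```

Something like `has_text_layer statement.pdf || echo "bitmaps only"` (where statement.pdf stands in for your own file) would then flag the image-only statements.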

So, how can I fix this? It’s a bit of a pain to set this workflow up, but what I’ve done is:

  1. Convert the PDF to individual TIFF files at 300 dpi. Ghostscript is good for this:
    gs -sDEVICE=tiffg4 -r300x300 -sOutputFile=file%03d.tif -dNOPAUSE -dBATCH -- file.pdf
  2. Run Tesseract OCR on the TIFF files to make hOCR output:
    for f in file*.tif; do
      tesseract "$f" "$(basename "$f" .tif)" hocr
    done

    Update: Cuneiform seems to work better than Tesseract when feeding pdfbeads:
    for f in file*.tif; do
      cuneiform -f hocr -o "$(basename "$f" .tif).html" "$f"
    done
  3. Move the images and the hOCR files to a new folder; the filenames must correspond, so file001.tif needs file001.html, file002.tif file002.html, etc.
  4. In the new folder, run pdfbeads * > ../Output.pdf
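Since step 3 is the easiest one to get wrong, a quick sanity check before running pdfbeads might help. The sketch below is not part of pdfbeads; check_hocr_pairs is a made-up helper that only assumes the naming scheme described above (fileNNN.tif alongside fileNNN.html).

```shell
# List every TIFF in directory "$1" that has no matching .html hOCR file.
# Returns non-zero if any are missing. (Hypothetical helper, not a pdfbeads feature.)
check_hocr_pairs() {
    dir=$1
    missing=0
    for tif in "$dir"/*.tif; do
        [ -e "$tif" ] || continue      # the glob matched nothing
        html="${tif%.tif}.html"        # file001.tif -> file001.html
        if [ ! -f "$html" ]; then
            echo "missing hOCR for $(basename "$tif")"
            missing=1
        fi
    done
    return "$missing"
}
```

Run it as `check_hocr_pairs .` from inside the new folder; no output means every image has its hOCR partner.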

The files are really small, and the text is recognized pretty well. It still looks pretty bad:

but at least the text can be copied and indexed.

This thread “Convert Scanned Images to a Single PDF File” got me up and running with PDFBeads. You might also have success using the method described here: “How to extract text with OCR from a PDF on Linux?” — it uses hocr2pdf to create single-page OCR’d PDFs, then joins them.

Open letter to Jason Farris

Jason Farris is President and CEO of Citizens Bank of Canada.

Dear Jason,

So you’ve decided to “no longer offer savings and loan products”. For a company called Citizens Bank, your new business plan sounds neither much like a bank, nor of much benefit to citizens.

I moved to your bank less than a year ago. I love the public ethical standards that you hold. I love the online banking facilities: they’re almost as good as my UK bank was offering back in 2001, so that means they’re stellar for Canada. I love the way that if you’re kept on hold for too long at Citizens Bank, the bank will call you back in five minutes or less, and actually does. I love the way that your employees go out of their way for clients: your Toronto account manager came to my house in the evening to help fill out the paperwork. (Never mind that you let him go a few months later when the “current economic conditions” hit.)

I moved to Citizens because my other bank holds the Canadian platinum-iridium standard for absolute bloody ineptitude (actually, I suspect they had it, but lost it somewhere). On the very rare occasions they can help, they charge you for it; even using their bank machines with their card will give you a monthly charge. I did look into a local alternative bank, but they were rude and unhelpful, rather more interested in tallying up and closing in half an hour than helping me with my enquiries.

You’re giving me the option to move to TD. This is my impressed face. What are they but yet another big downtown bank? What’s their ethical policy? Where’s their community reinvestment? Will they return my calls, or help me set up accounts out of hours? I think you know the answer, Jason.

I’m very disappointed, Jason. I’m also embarrassed, as I recommended your bank to many people, some of whom opened accounts, and will now have to close them. You’ve let me down badly, just when I thought I had found a bank I could trust.

All Good Things,


HSBC must really hate Linux

HSBC Bank Canada discriminates against Linux users. On April 18th, they “upgraded” their online banking facilities. Before this, they were slightly clunky, but worked just fine on almost any browser and computer I’d care to try.

Since Sunday, though, this is what I get when I try to access my bank details using Mozilla 1.6 on any of my Linux boxes:

To access internet banking, please use:
* Internet Explorer version 5.0 or above; or
* Netscape Communicator version 4.72 or above (version 6.x currently not supported)

So I mail them about this, and get this reply:

We apologize for the inconvenience; however effective April 18, 2004, when we launched our Personal Internet Banking update, the browsers that our Internet Banking will support are as follows: Internet Explorer 5.5 and up, Netscape 6.2.1 or 7.1.

I dutifully install Netscape 7.1 on my notebook, and what do I get?

To access internet banking, please use:
* Internet Explorer version 5.0 or above; or
* Netscape Communicator version 4.72 or above (version 6.x currently not supported)

And this is with the real bloated-as-life Netscape 7.1 browser [Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624 Netscape/7.1].

Things got really weird when I tried Mozilla 1.6 [Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.6) Gecko/20040113] under Windows 2000, and it worked just fine.

My usual browser identifies itself as [Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040406]. Looking at HSBC’s browser-sniffing code (eww!) I find that it’s looking for Windows or Mac more than it cares about the actual browser.
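To make the pattern concrete, here is a toy version of a platform-first check, written in shell. It is emphatically not HSBC's code (theirs runs in the browser, and I'm not reproducing it here); ua_allowed is a made-up function that just shows how a test keyed on the operating system gives exactly the results I saw with the three User-Agent strings above.

```shell
# Hypothetical platform-first sniff: accept a User-Agent only if it mentions
# Windows or Mac, whatever browser it claims to be. Illustration only.
ua_allowed() {
    case "$1" in
        *Windows*|*Mac*) return 0 ;;   # platform matches: let them in
        *)               return 1 ;;   # anything else (e.g. X11/Linux): refuse
    esac
}
```

With a check like this, Mozilla 1.6 on Windows 2000 gets in, while the very same Gecko engine on Linux, and the officially "supported" Netscape 7.1 on Linux, are both turned away.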

I’d best go tell Evan, who maintains the very useful Banks ‘n’ Browsers page, that HSBC must really hate Linux. They really don’t need to give me yet another reason to switch banks.