My bank broke PDF … and how I used PDFBeads to fix it

I’m on a major decluttering toot. When I realised that the filing cabinet I bought three years ago would no longer close with all the papers stuffed in it, I knew something had to change. I’ve been shredding like it’s Houston in 2001. I have the duplex scanner to suck in the stuff I need to keep. I’m moving to paperless wherever possible to stop it building up again.

My bank provides PDF statements. Of this I approve. PDF is almost perfect for this: it provides an electronic version of the page, but with searchable text and the potential for some level of security. Except, this is not the way that my bank does it. At first glance, the text looks pretty harmless:

Zoom in, and it gets a bit blocky:

Zoom right in:

Aargh! Blockarama! Did they really store text as bitmaps? Sure enough, pdftotext output from the files contains no text. Running pdfimages produces hundreds of tiny images; here are just a few:

Dear oh dear. This format is the worst of electronic, combined with paper’s lack of computer indexability. The producer claims to be Xenos D2eVision. Smooth work there, Xenos.
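To check a whole folder of statements for the same problem, the pdftotext test can be scripted. This is only a sketch: has_text is a made-up helper name, and it assumes pdftotext is installed. An image-only PDF typically yields nothing but whitespace and page breaks, so the helper simply looks for any non-whitespace character on its input.

```shell
# has_text: succeed if stdin contains at least one non-whitespace character.
# (Hypothetical helper name; feed it the output of pdftotext.)
has_text() {
    tr -d '[:space:]' | grep -q .
}

# Usage, assuming pdftotext is on the PATH:
#   pdftotext file.pdf - | has_text || echo "file.pdf: no text layer"
```

The same check is handy at the end, to confirm that a rebuilt PDF really does have text in it.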

So, how can I fix this? It’s a bit of a pain to set this workflow up, but what I’ve done is:

  1. Convert the PDF to individual TIFF files at 300 dpi. Ghostscript is good for this:
    gs -sDEVICE=tiffg4 -r300x300 -sOutputFile=file%03d.tif -dNOPAUSE -dBATCH -- file.pdf
  2. Run Tesseract OCR on the TIFF files to make hOCR output:
    for f in file*tif
    do
    tesseract "$f" "$(basename "$f" .tif)" hocr
    done

    Update: Cuneiform seems to work better than Tesseract when feeding pdfbeads:
    for f in file*tif
    do
    cuneiform -f hocr -o "$(basename "$f" .tif).html" "$f"
    done
  3. Move the images and the hOCR files to a new folder; the filenames must correspond, so file001.tif needs file001.html, file002.tif file002.html, etc.
  4. In the new folder, run pdfbeads * > ../Output.pdf

The files are really small, and the text is recognized pretty well. It still looks pretty bad:

but at least the text can be copied and indexed.
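For anyone repeating this every month, the four steps could be rolled into one script. What follows is a rough sketch rather than a tested tool: the function names missing_hocr and rebuild_pdf are invented for illustration, it assumes gs, cuneiform and pdfbeads are all on the PATH, and it uses the Cuneiform variant of step 2. The missing_hocr helper covers step 3 by naming any page image that has no matching hOCR file.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; gs, cuneiform and
# pdfbeads are assumed to be installed.

# Step 3 check: print the name of every missing hOCR file in a folder.
# Returns non-zero if any .tif lacks a matching .html.
missing_hocr() {
    dir=$1
    missing=0
    for tif in "$dir"/*.tif
    do
        [ -e "$tif" ] || continue        # folder has no TIFFs at all
        html="${tif%.tif}.html"
        if [ ! -e "$html" ]
        then
            basename "$html"
            missing=1
        fi
    done
    return $missing
}

# Steps 1 to 4: rasterise, OCR, check the pairs, build the PDF.
rebuild_pdf() {
    src=$1                               # e.g. file.pdf
    dest=$2                              # e.g. Output.pdf
    work=$(mktemp -d) || return 1        # left in place for inspection
    gs -sDEVICE=tiffg4 -r300x300 -sOutputFile="$work/file%03d.tif" \
       -dNOPAUSE -dBATCH -- "$src" || return 1
    for f in "$work"/file*.tif
    do
        [ -e "$f" ] || return 1          # Ghostscript produced no pages
        cuneiform -f hocr -o "${f%.tif}.html" "$f"
    done
    # Stop here if OCR failed to produce output for any page.
    missing_hocr "$work" || return 1
    ( cd "$work" && pdfbeads * ) > "$dest"
}

# Usage: rebuild_pdf file.pdf Output.pdf
```

Working in a temporary folder keeps the page images and hOCR files away from the original statement, which is the point of step 3.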

This thread “Convert Scanned Images to a Single PDF File” got me up and running with PDFBeads. You might also have success using the method described here: “How to extract text with OCR from a PDF on Linux?” — it uses hocr2pdf to create single-page OCR’d PDFs, then joins them.

The strange world of the 808 Car Keys Micro Camera

They have no viewfinder, no way of focusing, no controls beyond a power button and a multi-function shutter button (and two other seemingly useless buttons). They come with no manual, no readily identifiable manufacturer and you don’t really know what you’re going to get until you turn them on — yet they sell in their thousands. They are the 808 Car Keys Micro Camera.

I first heard about them from This Camera is an Adventure on MetaFilter, then someone suggested one as a solution to my Halfbakery idea “Tiny high quality digital camera”. So I bought two:

  • a #3 from ebay seller liangmin9888. Total cost $14.59 shipped from Hong Kong.
  • a #16 from ebay seller elehomegood. Total cost $40.99 shipped from Hong Kong.

I chose these sellers for their high reputation, and they didn’t disappoint. The cameras? They’re no Leicas.

The #3 is supposedly the best of the standard resolution cameras. They have a large yellow timestamp permanently inscribed in the corner of any image or video. The one I have is loaded with lens aberrations, and makes a Lomo look like a view camera. Still, I see some potential in it.

The #16 is a bit better. It still is miles behind my phone camera, and it only takes slightly soft 0.9 megapixel images. No video samples yet, but here’s a squinty picture I took in Lakefield today:

Lakefield, rather wonkily by 808 #16

I do feel a bit self-conscious about using such a covert camera, but I’ll see what I can do with them.

And that’s that …

There’s something satisfying when your computer tells you, “The software was installed definitely.” I’d forgotten how ropey the translations were on Epson software, and I got this as I installed my new Epson WorkForce WF-7520 printer.

Haven’t had enough time to really dig into it, but it seems quite a fun unit. Duplex printing and scanning up to A3/Tabloid. Wireless printing (including AirPrint direct from an iOS device). Scans to flash storage, which is available as a network share. All good stuff.

My very standard bicycle is not a standard bicycle to the city

I have had a nice BASIL basket on the back of my bike:

With that, it has had all three of Syd’s requirements. But there’s a problem; with the basket on, it doesn’t fit into my bike locker:

These Cycle-Safe lockers taper down to a narrow point, so basically anything other than a stripped-down bike won’t fit. The city says of the lockers:

Locker dimensions
The space inside of a locker is approximately: 1.2m (4 feet) high x 1.9m (6 feet, 5 inches) deep x 0.9m (3 feet) wide at the door and narrows toward the back of the locker. Most standard bicycles will fit inside. Longer bicycles such as tandem bikes or some recumbent bikes will not fit into the lockers.

“Most standard bicycles will fit inside”? Grah. If there’s something more standard than a Dutch bike with a basket on the back, I don’t know what it is. I have to go back to my makeshift solution, a too-tall basket lashed on with bungees, and deal with it biting my bum as I ride. Sigh.

TCA is online

Radio Amateurs of Canada may seem a bit slow at times, but they’ve quietly gone and put their magazine The Canadian Amateur online. It has a decent interface, definitely up there with Exact Editions’ work:

The files are downloadable as PDF, too. They look pretty decent on my e-reader:

(and yes, that is really an article about making a contact over 121km using a 5mW laser)

I don’t think any of the editions before 2012 will be going online. It would be nice, but RAC is severely limited in resources. The almost total lack of fanfare is a contrast to the ARRL’s digital QST, which is much announced but not actually available yet …

Ten years in Canada

A decade ago today, Catherine and I landed in our adopted home. There was snow on the ground. Late in the day, we checked into the Holiday Inn at Martin Grove and Dixon. We hadn’t brought clothes for snow.

The next day we went to stay at the meeting house. The day after I braved slush and the Warden bus for a job interview at Warden and Alden in Markham. There were still farms at Warden and Steeles.

Until we moved in here in late June, we house sat, couch-surfed, whatever you want to call it. We relied so much upon the kindness of then-strangers. So thank you to: Don and all the Bowyers, Jane Orion, Brett & Nancy, Lynn & Tam, Brydon & René; to Les for the first job at Gandalf, to Dave and the TREC crew for being there at the start of a new industry.

I didn’t blog back then, kept no journal, and took few photographs. The first few years were tough — early 2003 might be a special low point, with a bitter winter, a dreadful job and a flooded basement. Every tiny detail of the immigration process seemed so important at the time, but now barely registers. Getting a SIN card up on St Clair? Biggest deal ever, then.

So, thanks to everyone, here’s home now. I think it was the right move.