{"id":7414,"date":"2012-04-29T10:53:31","date_gmt":"2012-04-29T14:53:31","guid":{"rendered":"http:\/\/scruss.com\/blog\/?p=7414"},"modified":"2014-07-13T14:15:14","modified_gmt":"2014-07-13T18:15:14","slug":"my-bank-broke-pdf-and-how-i-used-pdfbeads-to-fix-it","status":"publish","type":"post","link":"https:\/\/scruss.com\/blog\/2012\/04\/29\/my-bank-broke-pdf-and-how-i-used-pdfbeads-to-fix-it\/","title":{"rendered":"My bank broke PDF &#8230; and how I used PDFBeads to fix it"},"content":{"rendered":"<p>I&#8217;m on a major decluttering toot. When I realised that the filing cabinet I bought three years ago would no longer close with all the papers stuffed in it, I knew something had to change. I&#8217;ve been shredding like it&#8217;s <a href=\"https:\/\/en.wikipedia.org\/wiki\/Enron\">Houston in 2001<\/a>. I have the <a href=\"http:\/\/www.epson.ca\/cgi-bin\/ceStore\/jsp\/Product.do?sku=C11CB58201\">duplex scanner<\/a> to suck in the stuff I need to keep. I&#8217;m moving to paperless wherever possible to stop it building up again.<\/p>\n<p>My <a title=\"HSBC\" href=\"http:\/\/hsbc.ca\/\">bank<\/a> provides PDF statements. Of this I approve. PDF is almost perfect for this: it provides an electronic version of the page, but with searchable text and the potential for some level of security. Except, this is not the way that my bank does it. 
At first glance, the text looks pretty harmless:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-7418\" title=\"Screen_Shot_2012-04-29_at_10.01.38_\" src=\"http:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/Screen_Shot_2012-04-29_at_10.01.38_.png\" alt=\"\" width=\"314\" height=\"166\" srcset=\"https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/Screen_Shot_2012-04-29_at_10.01.38_.png 314w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/Screen_Shot_2012-04-29_at_10.01.38_-160x84.png 160w\" sizes=\"auto, (max-width: 314px) 100vw, 314px\" \/><\/p>\n<p>Zoom in, and it gets a bit blocky:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-7417\" title=\"Screen_Shot_2012-04-29_at_10.02.13_\" src=\"http:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/Screen_Shot_2012-04-29_at_10.02.13_.png\" alt=\"\" width=\"335\" height=\"353\" srcset=\"https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/Screen_Shot_2012-04-29_at_10.02.13_.png 335w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/Screen_Shot_2012-04-29_at_10.02.13_-151x160.png 151w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/Screen_Shot_2012-04-29_at_10.02.13_-303x320.png 303w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/Screen_Shot_2012-04-29_at_10.02.13_-284x300.png 284w\" sizes=\"auto, (max-width: 335px) 100vw, 335px\" \/><\/p>\n<p>Zoom right in:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-7416\" title=\"Screen_Shot_2012-04-29_at_10.02.31_\" src=\"http:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/Screen_Shot_2012-04-29_at_10.02.31_.png\" alt=\"\" width=\"546\" height=\"491\" srcset=\"https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/Screen_Shot_2012-04-29_at_10.02.31_.png 546w, 
https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/Screen_Shot_2012-04-29_at_10.02.31_-160x143.png 160w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/Screen_Shot_2012-04-29_at_10.02.31_-320x287.png 320w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/Screen_Shot_2012-04-29_at_10.02.31_-333x300.png 333w\" sizes=\"auto, (max-width: 546px) 100vw, 546px\" \/><\/p>\n<p>Aargh! Blockarama! Did they really store text as bitmaps? Sure enough, pdftotext output from the files contains no text. Running pdfimages produces hundreds of tiny images; here&#8217;s just a few:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-7425\" title=\"plop-923\" src=\"http:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/plop-923.png\" alt=\"\" width=\"267\" height=\"24\" srcset=\"https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/plop-923.png 267w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/plop-923-160x14.png 160w\" sizes=\"auto, (max-width: 267px) 100vw, 267px\" \/><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-7423\" title=\"plop-921\" src=\"http:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/plop-921.png\" alt=\"\" width=\"16\" height=\"24\" \/><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-7424\" title=\"plop-922\" src=\"http:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/plop-922.png\" alt=\"\" width=\"52\" height=\"24\" \/><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-7422\" title=\"plop-905\" src=\"http:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/plop-905.png\" alt=\"\" width=\"1112\" height=\"36\" srcset=\"https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/plop-905.png 1112w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/plop-905-160x5.png 160w, 
https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/plop-905-320x10.png 320w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/plop-905-1024x33.png 1024w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/plop-905-500x16.png 500w\" sizes=\"auto, (max-width: 1112px) 100vw, 1112px\" \/><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-7421\" title=\"plop-904\" src=\"http:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/plop-904.png\" alt=\"\" width=\"17\" height=\"24\" \/><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-7420\" title=\"plop-903\" src=\"http:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/plop-903.png\" alt=\"\" width=\"34\" height=\"24\" \/><\/p>\n<p>Dear oh dear. This format is the worst of electronic, combined with paper&#8217;s lack of computer indexability. The producer claims to be <a href=\"http:\/\/www.xenos.com\/xe\/home\/\">Xenos D2eVision<\/a>. Smooth work there, Xenos.<\/p>\n<p>So, how can I fix this? It&#8217;s a bit of a pain to set this workflow up, but what I&#8217;ve done is:<\/p>\n<ol>\n<li>Convert the PDF to individual TIFF files at 300 dpi. 
<a href=\"http:\/\/www.ghostscript.com\/\">Ghostscript<\/a> is good for this:<br \/>\n<code>gs -sDEVICE=tiffg4 -r300x300 -sOutputFile=file%03d.tif -dNOPAUSE -dBATCH -- file.pdf<\/code><\/li>\n<li><del datetime=\"2012-05-09T13:35:12+00:00\">Run <a href=\"http:\/\/code.google.com\/p\/tesseract-ocr\/\">Tesseract OCR<\/a> on the TIFF files to make <a href=\"https:\/\/en.wikipedia.org\/wiki\/HOCR\">hOCR<\/a> output:<br \/>\n<code>for f in file*tif<br \/>\ndo<br \/>\ntesseract $f `basename $f .tif` hocr<br \/>\ndone<\/code><\/del><br \/>\n<strong>Update<\/strong>: <a href=\"https:\/\/launchpad.net\/cuneiform-linux\">Cuneiform<\/a> seems to work better than Tesseract when feeding pdfbeads:<br \/>\n<code>for f in file*tif<br \/>\ndo<br \/>\ncuneiform -f hocr -o `basename $f .tif`.html $f<br \/>\ndone<\/code><\/li>\n<li>Move the images and the hOCR files to a new folder; the filenames must correspond, so file001.tif needs file001.html, file002.tif needs file002.html, etc.<\/li>\n<li>In the new folder, run <code>pdfbeads * &gt; ..\/Output.pdf<\/code><\/li>\n<\/ol>\n<p>The files are really small, and the text is recognized pretty well. 
It still looks pretty bad:<\/p>\n<p><a href=\"http:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/Screen-Shot-2012-04-29-at-10.48.38-.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-medium wp-image-7426\" title=\"Screen Shot 2012-04-29 at 10.48.38\" src=\"http:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/Screen-Shot-2012-04-29-at-10.48.38--320x154.png\" alt=\"\" width=\"320\" height=\"154\" srcset=\"https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/Screen-Shot-2012-04-29-at-10.48.38--320x154.png 320w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/Screen-Shot-2012-04-29-at-10.48.38--160x77.png 160w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/Screen-Shot-2012-04-29-at-10.48.38--500x241.png 500w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2012\/04\/Screen-Shot-2012-04-29-at-10.48.38-.png 667w\" sizes=\"auto, (max-width: 320px) 100vw, 320px\" \/><\/a>but at least the text can be copied and indexed.<\/p>\n<p>This thread &#8220;<a title=\"Convert Scanned Images to a Single PDF File\" href=\"http:\/\/www.diybookscanner.org\/forum\/viewtopic.php?f=3&amp;t=683\" rel=\"nofollow\">Convert Scanned Images to a Single PDF File<\/a>&#8221; got me up and running with PDFBeads. You might also have success using the method described here: &#8220;<a href=\"http:\/\/superuser.com\/questions\/28426\/how-to-extract-text-with-ocr-from-a-pdf-on-linux\">How to extract text with OCR from a PDF on Linux?<\/a>&#8221;; it uses <a href=\"http:\/\/www.exactcode.de\/site\/open%5Fsource\/exactimage\/hocr2pdf\/\">hocr2pdf<\/a> to create single-page OCR&#8217;d PDFs, then joins them.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;m on a major decluttering toot. When I realised that the filing cabinet I bought three years ago would no longer close with all the papers stuffed in it, I knew something had to change. 
I&#8217;ve been shredding like it&#8217;s Houston in 2001. I have the duplex scanner to suck in the stuff I need [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[7],"tags":[1600,2496,1752,2495,1161,765,2497,807,2822],"class_list":["post-7414","post","type-post","status-publish","format-standard","hentry","category-computers-suck","tag-bank","tag-declutter","tag-hsbc","tag-ocr","tag-paper","tag-pdf","tag-pdfbeads","tag-scan","tag-tesseract"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/pQNZZ-1VA","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/posts\/7414","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/comments?post=7414"}],"version-history":[{"count":7,"href":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/posts\/7414\/revisions"}],"predecessor-version":[{"id":10877,"href":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/posts\/7414\/revisions\/10877"}],"wp:attachment":[{"href":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/media?parent=7414"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/categories?post=7414"},{"taxonomy":"post_tag","embeddab
le":true,"href":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/tags?post=7414"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}