pdf – We Saw a Chicken …

GTALUG Lightning Talk: Pages â†’ Searchable PDF (â†’ archive.org)

Here are my slides:

Pages_to_SearchablePDF-GTALUG_Lightning_Talk-20210713.odp Download

Or if you like PDF:

Pages_to_SearchablePDF-GTALUG_Lightning_Talk-20210713.pdf Download

cp2up.sh â€” fits the important part of Canada Post print labels two per sheet

blurred (for privacy) 2-up landscape page of Canada Post Tracked Package (to USA) shipping labels made by this script — no you will not read my 2-up shipping labels

If you need to ship things, you’re probably not too keen on queuing at the post office right now. Canada Post’s Ship Online service is pretty handy if you have a printer. The PDFs it produces are okay to print on plain paper, but if you’re using full-sheet labels like Avery 5165 you’re going to waste half a sheet of expensive labels.

If you’ve got two parcels to mail, this shell script will extract the right side of each page and create a single 2-up PDF with both your labels on the same page. You will need:

On my Ubuntu system, you can get good-enoughÂ¹ versions by doing this:

sudo apt install poppler-utils netpbm img2pdf

The code:

#!/bin/bash
# cp2up.sh - fits the important part of Canada Post print labels 2 per sheet
# scruss, 2021-05 - CC-BY-SA
# hard-coded input name (document.pdf)
# hard-coded output name (labels-2up.pdf)
# accepts exactly two labels (sorry)

dpi=600
width_in=11
height_in=8.5
# png intermediate format uses pixels per metre
dpm=$(echo "scale=3; $dpi * 1000 / 25.4" | bc)
# calculated pixel sizes, truncated to integer
half_width_px=$(echo "$width_in * $dpi / 2" | bc | sed 's/\..*$//')
height_px=$(echo "$height_in * $dpi" | bc | sed 's/\..*$//')

pdftoppm -mono -r "$dpi" -x "$half_width_px" -y 0 \
	 -W  "$half_width_px" -H "$height_px" document.pdf labels
pnmcat -lr labels-1.pbm labels-2.pbm |\
    pnmtopng -compression 9 -phys "$dpm" "$dpm" 1 > labels.png \
    && rm labels-1.pbm labels-2.pbm
# fix PDF time stamps
now=$(date --utc --iso-8601=seconds)
img2pdf -o labels-2up.pdf --creationdate "$now" --moddate "$now" \
	--pagesize "Letter^T" labels.png \
    && rm labels.png 

# saved from:
# history | tail | awk '{$1=""; print}' |\ 
#           perl -pwle 'chomp;s/^\s+//;' > cp2up.sh

It’s got a few hard-coded assumptions:

input name (document.pdf);
output name (labels-2up.pdf);
accepts exactly two labels (sorry).

Clever people could write code to work around these. Really clever people could modify this to feed a dedicated label printer.

Yes, I could probably have done all this with one ImageMagick command. When ImageMagick’s command line syntax begins to make sense, however, it’s probably time to retire to that remote mountain cabin and write that million-word thesis on a manual typewriter. Also, ImageMagick’s PDF generation is best described as pish.

One of the issues that this script avoids is aliasing in the bar-codes. For reasons known only to the anonymous PDF rendering library used by Canada Post, their shipping bar-codes are stored as smallish (780 Ã— 54 px) bitmaps that are scaled up to a 59 Ã— 19 mm print size. Most PDF viewers (and Adobe Viewer is one of these) will anti-alias scaled images, making them slightly soft. If you’re really unlucky, your printer driver will output these as fuzzy lines that no bar-code scanner could ever read. Rendering them to high resolution mono images may still render the edges a little roughly, but they’ll be crisply rough, and scanners don’t seem to mind that.

split image of simulated printed barcode: top image is five indistinct black-grey bars merging into a white background, bottom image is the same vertical lines, rendered crisply but showing some slightly rough edges — fuzzy vs crisply rough: scaled image (top) vs direct-rendered (bottom), at simulated 600 dpi laser print resolution

Â¹: Debian/Ubuntu’s netpbm package is roughly 20 years out of date for reasons that only a very few nerds care about, and the much better package is blocked by Debian’s baroque and gatekeepery packaging protocol. I usually build it from source for those times I need the new features.

TPUG talk tonight: Commodore Retro Gaming on the Raspberry Pi

I’m talking about RetroPie at TPUG tonight!

Big old scanned manuals to small old scanned manuals

It is good that there are so many scanned manuals for old computer systems out there. Every old system did things its very own special way, and life’s too short to guess. I mean, there’s not much out there on the SYM-1 I’m trying to get working again:

â€” not much except for 6502.org’s excellent Synertek SYM-1 Resources, that is.

Some manuals, though, while lovingly scanned, are just too large to download, browse or file. Take, for instance, AppleIIScans’ Apple II BASIC Programming With ProDOS. It’s a very faithful colour scan, but at 170 MB for 280 pages, it’s a bit unwieldy. I suspect it’s Adobe Acrobat Paper Capture’s fault: while it makes turning scans into readable files really easy, it doesn’t warn against using 600 dpi full colour for a book with only decorative use of colour.

Good old Ghostscript saves the day, though:

gs -sDEVICE=pdfwrite -sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray -dPDFSETTINGS=/ebook -dNOPAUSE -dBATCH -dSAFER -q -sOutputFile=1983-A2L2013-m-a2-bpwp-grey.pdf -- 1983-A2L2013-m-a2-bpwp.pdf

By downsampling the scanned images and converting everything to greyscale, the result’s only 16 MB. All text and indexing from Acrobat is left intact.

Fortran Coding Form

The original seems to have fallen off the web (though the Wayback machine might have it), but:

(PDF is linked under image)

Wikimedia has some metadata from the source, drawn by Anthony Atkielski:

By Agateller at English Wikipedia – Transferred fromÂ en.wikipediaÂ to CommonsÂ byÂ Sreejithk2000Â using CommonsHelper., Public Domain, Link

nerrrdy Bourgoin mini-zine

All fired up by Natalie Draz‘s presentation about The Transmitting Library at Make Change Conference last Sunday, I made a tiny zine using Natalie’s template:

Fold it, cut across the broken line in the middle, then re-fold so the front and back cover are on the outside. Colour it in!

I’ve supplied it in a couple of formats:

PDF: all vector, no jaggies — patterns_from_bourgoin-zine.pdf.
PNG: if you’re still all about the pixels — patterns_from_bourgoin-zine.png.

GTALUG Lightning Talk(s): Mostly Searchable and Mini Printers

My lightning talk for GTALUG seemed to go down quite well. Here are the slides. It’s mostly based on experience gleaned from My bank broke PDF â€¦ and how I used PDFBeads to fix it. I really must write this up properly â€¦ oh wait, I just did.

I also prepared â€” but didn’t get to use â€” notes on using Mini Printers and Linux. Again, this is fromÂ Thermal Printer driver for CUPS, Linux, and Raspberry Pi: zj-58 and Notes on mini-printers and Linux.

Screen pattern, Aga Khan Museum

I’m rather pleased with this, as it’s the first pattern I’ve worked out from a real example, a screen in the Aga Khan Museum:

Here’s the pattern as a full page PDF from Inkscape: akm-screen-tiled-strapwork.pdf. The pattern’s not much more than an 8-pointed star with a smaller 8-pointed star inside it, rotated 22Â½Â°. But it’s still kinda neat.

GSView â€¦ not to be confused with GSview

Artifex’s GSView is rather good. It describes itself as â€˜a user friendly viewer for Postscript, PDF, XPS, EPUB, CBZ, JPEG, and PNGâ€™, and it sure does those things. It’s currently bundled as Mac, Windows and Linux Intel-only binaries, but maybe we’ll see ARM distribution or source soon enough.

The name confused me a bit. Russell Lang of Ghostgum Software Pty Ltd has maintained a nice Windows-only Ghostscript front end called GSview for years. Note the huge difference in names: Artifex‘s release is GSView 6, while Ghostgum’s is GSview 5. Hmm.

Naming aside, GSView does make it very easy to convert its input files to PDF/A, the ISO standard archival PDF definition that is immune to Adobe’s format meddling. (Adobe have, with Acrobat Reader DC, maintained an unbroken tradition that their latest PDF reader software is more bloated and craptastic than the last.)

PDF/A defines several archival settings such as font embedding and colour management. It’s possible to do this on the Ghostscript command line, but it’s fiddly.Â GSView just needs you to point it at the colour standard files on your system. On Mac, these live in /System/Library/ColorSync/Profiles/, and in the image below, I’ve picked out the generic ones:

GSView Mac Colour Profiles

On Linux, these files will likely be somewhere predictable; for me, they are in /usr/share/ghostscript/9.15/iccprofiles/. I made copies in the GSView executable folder so they wouldn’t get lost if my system updates Ghostscript:

GSView Linux Colour Profiles

The PDF/A files you get can be considerably smaller than the originals. A 10 MB LibreOffice Impress slide deck from a presentation on OpenStreetMap that I gave last week shrunk down to 1.3 MB when saved by GSView, with only very minor JPEG gribblies visible in the slide background. The graphic above (modified from the Ghostscript example file â€˜golfer.epsâ€™; yay, Illustrator 1.0!) shrunk by â…“. These are handy savings, plus you get a standalone archival format that willÂ never change!

The Gadsden Flag Nouveau

Instagram filter used: Normal

View in Instagram â‡’

I made this after being inspired by this MetaFilter comment. There’s a PDF linked from the image below:

JPEG 2000 on Ubuntu â€” without anyone getting stabbed

Aargh! Ubuntu 16.10 has decided that ImageMagick doesn’t need JPEG 2000 support, and will quietly (and very very wrongly) write JP2s as JPEG.

(NB: JPEG 2000 images ~~still~~ maybe crash Ubuntu’s file browser in 14.10. My old installation didn’t like them, but my reinstall seems quite happy. Go figure.)

JPEG 2000 is a great image file format: well-defined, and able to store high quality photographic data in a very small space. It truly is the JPEG of the 2000s â€” except for its dismal support under Ubuntu.

The problem is the patents. An open library has been a long time coming, and lots of Linux software is built without JP2 support. This helped keep it away from my desktop.

Under Ubuntu 14.04, here’s what does and doesn’t support JP2 files:

Gimp â€” not supported. It appears to have a non-functioning plugin that tries to read the file, then gives up. This is annoying, as Gimp is defined as the system default viewer for JPEG 2000.
Image Viewer â€” does support JP2, but occasionally mis-renders pages. To make this the default, right-click on a JP2 file, and select Open with â†’ Other Application â€¦, then choose Image Viewer. It should work from then onwards.
Document Viewer â€” a bit rough when looking at JPEG 2000-encoded PDFs. Very slow, too.
GraphicsMagick â€” seems to be the most painless way of converting graphics files to JPEG 2000. My preferred method of invoking it is:
gm convert -define 'jp2:rate=0.008' in.png out.jp2
The rate option should be a small number; the smaller, the greater the compression, and the worse the image quality.
OpenJPEG â€” provides the image_to_j2k and j2k_to_image tools. Far more picky about input formats than it should be, and often fails on seemingly perfect input.
img2pdf â€” (built from source)Â is a tiny gem of a package. All it does is wrap various image formats into a PDF file. It doesn’t modify the image data in any way, so with a bit of ingenuity (and pdftk) you can use PDF as a true metafile archive. You can view the content on any platform, but get the source images out bit-for-bit perfect. We used to call files which could contain files metafiles, but that stopped being popular when TIFF started to be a baroque travesty of an image container back in the mid-1990s.
poppler â€” (for full features, build from source) has a tool, pdfimages, which can extract embedded image files from PDFs. Some of the metadata might get lost, but all of the image bits come through.

Since JPEG 2000 isn’t included in web browsers (grar), I’ve embedded a sample scanned JPEG into a PDF, and added a series of progressively more compressed JPEG 2000 versions: JPEG-2000Booklet [PDF].Â The booklet has notes showing the byte size of each page. The image still looks pretty good at 8% of the original file size!

Precisely what nobody wanted â€” the Hershey fonts as a huge great PDF

It’s impractically huge, but under the image link lives a table of all of theÂ Hershey fonts (well, the Western ones, at least). It’s interesting to note Dr Hershey’s preferences in this pre-ASCII table: almost every variant has degree, minute and second symbols, but none of them have â€˜\â€™. Many of them don’t have â€˜@â€™, either, so no e-mail addresses in Hershey Fraktur for you â€¦

ICQuestionBank2csv

ICQuestionBank2csv: A tool to extract both the Basic and Advanced Amateur Radio Examination guides from Industry Canada’s rather annoying two-column PDFs. Written for IC’s 2014-02 database updates.

See: Amateur Radio Exam Generator.

Written by Stewart C. Russell (aka scruss) / VA3PID – 2014-03-07.

Requirements

Perl, with Text::CSV_XS
xpdf tools
Bash
wget

Usage

Run either basic2csv.sh or advanced2csv.sh to download the source PDF and extract the data.

Licence

WTFPL (srsly).

For all your HP 7470a plotter manual needs

On the off chance you need to control a 30 year old graphics plotter, have I got something for you:

The image links to a scanned copy of the HP 7470A Graphics Plotter: Interfacing and Programming Manual which I found on the web, and cleaned up. The pages have been OCR’d, so it should be searchable.

Protext lives!

Protext screenshot (dosemu)

Oh man, Protext! For years, it was all I used: every magazine article, every essay at university (all two of them), my undergraduate dissertation (now mercifully lost to time: The Parametric Design of a Medium Specific Speed Pump Impeller, complete with spline-drawing code in HiSoft BASIC for the Amiga, is unlikely to be of value to anyone these days), letters â€” you name it, I used Protext for it.

I first had it on 16kB EPROM for the Amstrad CPC464; instant access with |P. I then ran it on the Amiga, snagging a cheap copy direct from the authors at a trade show. I think I had it for the PC, but I don’t really remember my DOS days too well.

The freeware version runs quite nicely under dosemu. You can even get it to print directly to PDF:

In your Linux printer admin, set up a CUPS PDF printer. Anything sent to it will appear as a PDF file in the folder ~/PDF.
Add the following lines to your ~/.dosemurc:
$_lpt1 = “lpr -l -P PDF”
$_printer_timeout = (20)
In Protext, configure your printer to be a PostScript printer on LPT1:

The results come out not bad at all:

Protext’s file import and export is a bit dated. You can use the CONVERT utility to create RTF, but it assumes Code page 437, so your accents won’t come out right. Adding \ansicpg437 to the end of the first line should make it read okay.

(engraving of Michel de Montaigne in mad puffy sleeves: public domain from Wikimedia Commons: File:Michel de Montaigne 1.jpg – Wikimedia Commons)

Wind Power, 1940s style

This is how wind turbines were supposed to look, at least in the 1940s. It’s the experimental Smith-Putnam 1.25 MW unit than ran for a short while on a hill near Rutland, VT. The picture’s from a rather falling-apart copy of Large Horizontal-axis Wind Turbines (Thresher, R. W., & Solar Energy Research Institute. (1982). Large horizontal-axis wind turbines: Proceedings of a workshop held in Cleveland, Ohio, July 28-30, 1981. Golden, Colo: Solar Energy Research Institute) that I rescued from Jim‘s recycling years ago.

The first part of these proceedings has a historical review of the Smith-Putnam turbine, including an excerpt from the S. Morgan Smith Company’s house organ on the project. As the rest of the book is pretty much all about the MOD series of turbines, it’s of less interest. I’ve scanned the bits about the Smith-Putnam turbine, and put them here: NASA_DOE-1981-large_horizontal_axis_wind_turbines-excerpt. If anyone wants the book, let me know. It’s very ratty, but readable.

I’ve written about this turbine before, but in relation to a packet of crayons. More awesome turbine pictures from Paul Gipe: Smith-Putnam Industrial Photos.

.awesome

Here are the complete 1988-vintage Sun manuals â€œUsing NROFF and TROFFâ€ and â€œFormatting Documentsâ€ scanned just for you. I’d scanned these in 2000, and they’d sat on a forgotten archive volume since then.

Update: there are better versions on the Internet Archive: Using NROFF and TROFF and Formatting Documents, all as part of the Sun Microsystems, Inc. manual collection.

(if you need to get your troff on, go to Ralph’s troff.org.)

pretty-printing Arduino sketches

I don’t often need it, but the code printing facility in the Arduino IDE is very weak. It has some colour highlighting, but no page numbering, no line numbering, and no headers at all.

a2ps will sort you right out here. Years back, it was a simple text to PostScript filter, but now it has many wonderful filters for pretty-printing code. The Wiring/Arduino language is basically C++, and a2ps knows how to deal with that. So, to create a PostScript file with a nice version of the the most basic Blink sketch:

a2ps --pro=color -C -1 -M letter -g --pretty-print='c++' -o ~/Desktop/Blink.ps Blink.ino

If you’re somewhere that uses sensible paper sizes (in other words, not North America), you probably don’t want theÂ -M letter option. ~~a2ps is supposed to have a PDF print option (-P pdf), but it doesn’t work on my installation, so I just splat the output through ps2pdf~~. You can’t use the -P Â«printerÂ» option combined with the -o Â«fileÂ» option, but cups-pdf is your friend if you need to print to a PDF. The results are linked below:

Not bad, eh?

(Update: think I must have written this post on a Mac with a case-insensitive filesystem. Using the --pretty-print='C++' option I had before failed on Linux.)

I guess people were, at the time

â€” an ad from the December 1976 edition of Byte, from the BYTE magazine scanning effort.

My bank broke PDF … and how I used PDFBeads to fix it

I’m on a major decluttering toot. When I realised that the filing cabinet I bought three years ago would no longer close with all the papers stuffed in it, I knew something had to change. I’ve been shredding like it’s Houston in 2001. I have the duplex scanner to suck in the stuff I need to keep. I’m moving to paperless wherever possible to stop it building up again.

My bank provides PDF statements. Of this I approve. PDF is almost perfect for this: it provides an electronic version of the page, but with searchable text and the potential for some level of security. Except, this is not the way that my bank does it. At first glance, the text looks pretty harmless:

Zoom in, and it gets a bit blocky:

Zoom right in:

Aargh! Blockarama! Did they really store text as bitmaps? Sure enough, pdftotext output from the files contains no text. Running pdfimages produces hundreds of tiny images; here’s just a few:

Dear oh dear. This format is the worst of electronic, combined with paper’s lack of computer indexability. The producer claims to be Xenos D2eVision. Smooth work there, Xenos.

So, how can I fix this? It’s a bit of a pain to set this workflow up, but what I’ve done is:

Convert the PDF to individual TIFF files at 300 dpi. Ghostscript is good for this:
gs -SDEVICE=tiffg4 -r300x300 -sOutputFile=file%03d.tif -dNOPAUSE -dBATCH -- file.pdf
Run Tesseract OCR on the TIFF files to make hOCR output:
for f in file*tif do tesseract $f `basename $f` hocr done
Update: Cuneiform seems to work better than Tesseract when feeding pdfbeads:
for f in file*tif do cuneiform -f hocr -o `basename $f .tif`.html $f done
Move the images and the hOCR files to a new folder; the filenames must correspond, so file001.tif needs file001.html, file002.tif file002.html, etc.
In the new folder, run pdfbeads * > ../Output.pdf

The files are really small, and the text is recognized pretty well. It still looks pretty bad:

but at least the text can be copied and indexed.

This thread â€œConvert Scanned Images to a Single PDF Fileâ€ got me up and running with PDFBeads. You might also have success using the method described here: â€œHow to extract text with OCR from a PDF on Linux?â€ â€” it uses hocr2pdf to create single-page OCR’d PDFs, then joins them.