It is good that there are so many scanned manuals for old computer systems out there. Every old system did things its very own special way, and life’s too short to guess. I mean, there’s not much out there on the SYM-1 I’m trying to get working again:
— not much except for 6502.org’s excellent Synertek SYM-1 Resources, that is.
Some manuals, though, while lovingly scanned, are just too large to download, browse or file. Take, for instance, AppleIIScans’ Apple II BASIC Programming With ProDOS. It’s a very faithful colour scan, but at 170 MB for 280 pages, it’s a bit unwieldy. I suspect it’s Adobe Acrobat Paper Capture’s fault: while it makes turning scans into readable files really easy, it doesn’t warn against using 600 dpi full colour for a book with only decorative use of colour.
Good old Ghostscript saves the day, though:
gs -sDEVICE=pdfwrite -sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray -dPDFSETTINGS=/ebook -dNOPAUSE -dBATCH -dSAFER -q -sOutputFile=1983-A2L2013-m-a2-bpwp-grey.pdf -- 1983-A2L2013-m-a2-bpwp.pdf
By downsampling the scanned images and converting everything to greyscale, the result’s only 16 MB. All text and indexing from Acrobat is left intact.
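If you have a whole folder of oversized scans, the same command wraps up neatly in a small function. This is only a sketch: the names grey_name and shrink_pdf and the -grey suffix are my own invention, and it assumes gs is on your PATH.

```shell
# Sketch only: grey_name and shrink_pdf are made-up helper names.

# Derive the output name: manual.pdf -> manual-grey.pdf
grey_name() {
    printf '%s-grey.pdf\n' "${1%.pdf}"
}

# Downsample one PDF and convert it to greyscale, using the same
# Ghostscript options as the one-off command.
shrink_pdf() {
    gs -sDEVICE=pdfwrite -sColorConversionStrategy=Gray \
       -dProcessColorModel=/DeviceGray -dPDFSETTINGS=/ebook \
       -dNOPAUSE -dBATCH -dSAFER -q \
       -sOutputFile="$(grey_name "$1")" -- "$1"
}
```

Call it as shrink_pdf 1983-A2L2013-m-a2-bpwp.pdf, or loop it over several files. If you loop over *.pdf, run it from a folder that doesn't already hold -grey.pdf output, or it will happily shrink those again.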
The original seems to have fallen off the web (though the Wayback machine might have it), but:
Wikimedia has some metadata from the source, drawn by Anthony Atkielski:
I’ve supplied it in a couple of formats:
My lightning talk for GTALUG seemed to go down quite well. Here are the slides. It’s mostly based on experience gleaned from My bank broke PDF … and how I used PDFBeads to fix it. I really must write this up properly.
I also prepared — but didn’t get to use — notes on using Mini Printers and Linux. Again, this is from Thermal Printer driver for CUPS, Linux, and Raspberry Pi: zj-58 and Notes on mini-printers and Linux.
I’m rather pleased with this, as it’s the first pattern I’ve worked out from a real example, a screen in the Aga Khan Museum:
Here’s the pattern as a full page PDF from Inkscape: akm-screen-tiled-strapwork.pdf. The pattern’s not much more than an 8-pointed star with a smaller 8-pointed star inside it, rotated 22½°. But it’s still kinda neat.
Artifex’s GSView is rather good. It describes itself as ‘a user friendly viewer for Postscript, PDF, XPS, EPUB, CBZ, JPEG, and PNG’, and it sure does those things. It’s currently bundled as Mac, Windows and Linux Intel-only binaries, but maybe we’ll see ARM distribution or source soon enough.
The name confused me a bit. Russell Lang of Ghostgum Software Pty Ltd has maintained a nice Windows-only Ghostscript front end called GSview for years. Note the huge difference in names: Artifex’s release is GSView 6, while Ghostgum’s is GSview 5. Hmm.
Naming aside, GSView does make it very easy to convert its input files to PDF/A, the ISO standard archival PDF definition that is immune to Adobe’s format meddling. (Adobe have, with Acrobat Reader DC, maintained an unbroken tradition that their latest PDF reader software is more bloated and craptastic than the last.)
PDF/A defines several archival settings such as font embedding and colour management. It’s possible to do this on the Ghostscript command line, but it’s fiddly. GSView just needs you to point it at the colour standard files on your system. On Mac, these live in /System/Library/ColorSync/Profiles/, and in the image below, I’ve picked out the generic ones:
On Linux, these files will likely be somewhere predictable; for me, they are in /usr/share/ghostscript/9.15/iccprofiles/. I made copies in the GSView executable folder so they wouldn’t get lost if my system updates Ghostscript:
The PDF/A files you get can be considerably smaller than the originals. A 10 MB LibreOffice Impress slide deck from a presentation on OpenStreetMap that I gave last week shrank down to 1.3 MB when saved by GSView, with only very minor JPEG gribblies visible in the slide background. The graphic above (modified from the Ghostscript example file ‘golfer.eps’; yay, Illustrator 1.0!) shrank by ⅓. These are handy savings, plus you get a standalone archival format that will never change!
The Gadsden Flag Nouveau
Aargh! Ubuntu 16.10 has decided that ImageMagick doesn’t need JPEG 2000 support, and will quietly (and very very wrongly) write JP2s as JPEG.
(NB: JPEG 2000 images may still crash Ubuntu’s file browser in 14.10. My old installation didn’t like them, but my reinstall seems quite happy. Go figure.)
JPEG 2000 is a great image file format: well-defined, and able to store high quality photographic data in a very small space. It truly is the JPEG of the 2000s — except for its dismal support under Ubuntu.
The problem is the patents. An open library has been a long time coming, and lots of Linux software is built without JP2 support. This helped keep it away from my desktop.
Under Ubuntu 14.04, here’s what does and doesn’t support JP2 files:
- Gimp — not supported. It appears to have a non-functioning plugin that tries to read the file, then gives up. This is annoying, as Gimp is defined as the system default viewer for JPEG 2000.
- Image Viewer — does support JP2, but occasionally mis-renders pages. To make this the default, right-click on a JP2 file, and select Open with → Other Application …, then choose Image Viewer. It should work from then onwards.
- Document Viewer — a bit rough when looking at JPEG 2000-encoded PDFs. Very slow, too.
- GraphicsMagick — seems to be the most painless way of converting graphics files to JPEG 2000. My preferred method of invoking it is:
gm convert -define 'jp2:rate=0.008' in.png out.jp2
The rate option should be a small number; the smaller, the greater the compression, and the worse the image quality.
- OpenJPEG — provides the image_to_j2k and j2k_to_image tools. Far more picky about input formats than it should be, and often fails on seemingly perfect input.
- img2pdf — (built from source) is a tiny gem of a package. All it does is wrap various image formats into a PDF file. It doesn’t modify the image data in any way, so with a bit of ingenuity (and pdftk) you can use PDF as a true metafile archive. You can view the content on any platform, but get the source images out bit-for-bit perfect. We used to call files which could contain files metafiles, but that stopped being popular when TIFF started to be a baroque travesty of an image container back in the mid-1990s.
- poppler — (for full features, build from source) has a tool, pdfimages, which can extract embedded image files from PDFs. Some of the metadata might get lost, but all of the image bits come through.
Since JPEG 2000 isn’t included in web browsers (grar), I’ve embedded a sample scanned JPEG into a PDF, and added a series of progressively more compressed JPEG 2000 versions: JPEG-2000Booklet [PDF]. The booklet has notes showing the byte size of each page. The image still looks pretty good at 8% of the original file size!
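A booklet like that can be put together with the tools in the list above. What follows is a rough sketch rather than the exact recipe I used: jp2_name and make_booklet are hypothetical names, and the list of rates is just an example.

```shell
# Sketch only: jp2_name and make_booklet are hypothetical names, and
# the rates in the loop are arbitrary examples.

# Record the rate in the output name: scan.jpg + 0.008 -> scan-r0.008.jp2
jp2_name() {
    printf '%s-r%s.jp2\n' "${1%.*}" "$2"
}

# Encode one source image at several rates with GraphicsMagick, then
# wrap the source plus every JP2, unmodified, into one PDF with img2pdf.
make_booklet() {
    src=$1
    out=$2
    set -- "$src"                      # page 1 is the original image
    for rate in 0.08 0.04 0.02 0.008; do
        jp2=$(jp2_name "$src" "$rate")
        gm convert -define "jp2:rate=$rate" "$src" "$jp2" || return 1
        set -- "$@" "$jp2"             # queue this page for the PDF
    done
    img2pdf -o "$out" "$@"
}
```

Run it as make_booklet scan.jpg booklet.pdf. To pull the embedded images back out, pdfimages -all booklet.pdf page should do it, assuming your poppler is new enough to have the -all option.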
It’s impractically huge, but under the image link lives a table of all of the Hershey fonts (well, the Western ones, at least). It’s interesting to note Dr Hershey’s preferences in this pre-ASCII table: almost every variant has degree, minute and second symbols, but none of them have ‘\’. Many of them don’t have ‘@’, either, so no e-mail addresses in Hershey Fraktur for you …
ICQuestionBank2csv: A tool to extract both the Basic and Advanced Amateur Radio Examination guides from Industry Canada’s rather annoying two-column PDFs. Written for IC’s 2014-02 database updates.
Written by Stewart C. Russell (aka scruss) / VA3PID – 2014-03-07.
- Perl, with Text::CSV_XS
- xpdf tools
advanced2csv.sh to download the source PDF and extract the data.
Oh man, Protext! For years, it was all I used: every magazine article, every essay at university (all two of them), my undergraduate dissertation (now mercifully lost to time: The Parametric Design of a Medium Specific Speed Pump Impeller, complete with spline-drawing code in HiSoft BASIC for the Amiga, is unlikely to be of value to anyone these days), letters — you name it, I used Protext for it.
I first had it on 16kB EPROM for the Amstrad CPC464; instant access with |P. I then ran it on the Amiga, snagging a cheap copy direct from the authors at a trade show. I think I had it for the PC, but I don’t really remember my DOS days too well.
The freeware version runs quite nicely under dosemu. You can even get it to print directly to PDF:
- In your Linux printer admin, set up a CUPS PDF printer. Anything sent to it will appear as a PDF file in the folder ~/PDF.
- Add the following lines to your ~/.dosemurc:
$_lpt1 = "lpr -l -P PDF"
$_printer_timeout = (20)
- In Protext, configure your printer to be a PostScript printer on LPT1:
The results come out not bad at all:
Protext’s file import and export is a bit dated. You can use the CONVERT utility to create RTF, but it assumes Code page 437, so your accents won’t come out right. Adding \ansicpg437 to the end of the first line should make it read okay.
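That edit is easy to script. A minimal sketch, assuming GNU sed (which understands \r in a pattern) and that the file CONVERT writes has DOS line endings; add_codepage is a made-up name:

```shell
# Sketch only: add_codepage is a made-up name. It reads RTF on stdin
# and writes the patched RTF to stdout. Assumes GNU sed.
add_codepage() {
    # Append \ansicpg437 to line 1 only, keeping any DOS carriage
    # return at the very end of that line.
    sed '1s/\r\{0,1\}$/\\ansicpg437&/'
}
```

Use it as add_codepage < essay.rtf > essay-fixed.rtf (both filenames are placeholders).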
(engraving of Michel de Montaigne in mad puffy sleeves: public domain from Wikimedia Commons: File:Michel de Montaigne 1.jpg – Wikimedia Commons)
This is how wind turbines were supposed to look, at least in the 1940s. It’s the experimental Smith-Putnam 1.25 MW unit that ran for a short while on a hill near Rutland, VT. The picture’s from a rather falling-apart copy of Large Horizontal-axis Wind Turbines (Thresher, R. W., & Solar Energy Research Institute. (1982). Large horizontal-axis wind turbines: Proceedings of a workshop held in Cleveland, Ohio, July 28-30, 1981. Golden, Colo: Solar Energy Research Institute) that I rescued from Jim’s recycling years ago.
The first part of these proceedings has a historical review of the Smith-Putnam turbine, including an excerpt from the S. Morgan Smith Company’s house organ on the project. As the rest of the book is pretty much all about the MOD series of turbines, it’s of less interest. I’ve scanned the bits about the Smith-Putnam turbine, and put them here: NASA_DOE-1981-large_horizontal_axis_wind_turbines-excerpt. If anyone wants the book, let me know. It’s very ratty, but readable.
I don’t often need it, but the code printing facility in the Arduino IDE is very weak. It has some colour highlighting, but no page numbering, no line numbering, and no headers at all.
a2ps will sort you right out here. Years back, it was a simple text to PostScript filter, but now it has many wonderful filters for pretty-printing code. The Wiring/Arduino language is basically C++, and a2ps knows how to deal with that. So, to create a PostScript file with a nice version of the most basic Blink sketch:
a2ps --pro=color -C -1 -M letter -g --pretty-print='c++' -o ~/Desktop/Blink.ps Blink.ino
If you’re somewhere that uses sensible paper sizes (in other words, not North America), you probably don’t want the -M letter option. a2ps is supposed to have a PDF print option (-P pdf), but it doesn’t work on my installation, so I just splat the output through ps2pdf. The results are linked below:
Not bad, eh?
(Update: think I must have written this post on a Mac with a case-insensitive filesystem. Using the --pretty-print='C++' option I had before failed on Linux.)
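The a2ps and ps2pdf steps can be joined with a pipe, so no intermediate PostScript file is left lying around. A sketch only: pdf_name and sketch_to_pdf are hypothetical helper names, and it relies on a2ps writing to standard output when given -o - and on ps2pdf reading standard input when given - as its input.

```shell
# Sketch only: pdf_name and sketch_to_pdf are hypothetical names.

# Blink.ino -> Blink.pdf
pdf_name() {
    printf '%s.pdf\n' "${1%.ino}"
}

# Pretty-print an Arduino sketch as C++ and pipe the PostScript
# straight into ps2pdf.
sketch_to_pdf() {
    a2ps --pro=color -C -1 -M letter -g --pretty-print='c++' \
         -o - "$1" | ps2pdf - "$(pdf_name "$1")"
}
```

sketch_to_pdf Blink.ino should leave Blink.pdf beside the sketch. Drop -M letter if your paper is A4.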
— an ad from the December 1976 edition of Byte, from the BYTE magazine scanning effort.
I’m on a major decluttering toot. When I realised that the filing cabinet I bought three years ago would no longer close with all the papers stuffed in it, I knew something had to change. I’ve been shredding like it’s Houston in 2001. I have the duplex scanner to suck in the stuff I need to keep. I’m moving to paperless wherever possible to stop it building up again.
My bank provides PDF statements. Of this I approve. PDF is almost perfect for this: it provides an electronic version of the page, but with searchable text and the potential for some level of security. Except, this is not the way that my bank does it. At first glance, the text looks pretty harmless:
Zoom in, and it gets a bit blocky:
Zoom right in:
Aargh! Blockarama! Did they really store text as bitmaps? Sure enough, pdftotext output from the files contains no text. Running pdfimages produces hundreds of tiny images; here’s just a few:
Dear oh dear. This format is the worst of electronic, combined with paper’s lack of computer indexability. The producer claims to be Xenos D2eVision. Smooth work there, Xenos.
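A quick way to check a pile of statements for this problem is to count the non-whitespace characters that pdftotext manages to find. This is a sketch with made-up function names, assuming pdftotext from xpdf or poppler is installed:

```shell
# Sketch only: count_text_chars and has_text_layer are made-up names.

# Count the non-whitespace characters arriving on stdin.
# (The page-break form feeds pdftotext emits count as whitespace.)
count_text_chars() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

# Succeed if pdftotext finds any text at all in the PDF.
has_text_layer() {
    [ "$(pdftotext "$1" - | count_text_chars)" -gt 0 ]
}
```

Then something like has_text_layer statement.pdf || echo "bitmap only" flags the offenders (statement.pdf being a placeholder name).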
So, how can I fix this? It’s a bit of a pain to set this workflow up, but what I’ve done is:
- Convert the PDF to individual TIFF files at 300 dpi. Ghostscript is good for this:
gs -sDEVICE=tiffg4 -r300x300 -sOutputFile=file%03d.tif -dNOPAUSE -dBATCH -- file.pdf
- Run Tesseract OCR on the TIFF files to make hOCR output:
for f in file*tif; do
tesseract $f `basename $f .tif` hocr
done
Update: Cuneiform seems to work better than Tesseract when feeding pdfbeads:
for f in file*tif; do
cuneiform -f hocr -o `basename $f .tif`.html $f
done
- Move the images and the hOCR files to a new folder; the filenames must correspond, so file001.tif needs file001.html, file002.tif file002.html, etc.
- In the new folder, run
pdfbeads * > ../Output.pdf
The files are really small, and the text is recognized pretty well. It still looks pretty bad:
This thread “Convert Scanned Images to a Single PDF File” got me up and running with PDFBeads. You might also have success using the method described here: “How to extract text with OCR from a PDF on Linux?” — it uses hocr2pdf to create single-page OCR’d PDFs, then joins them.
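For the record, the whole PDFBeads workflow can be strung together in one function. Treat this as a sketch under assumptions rather than the script I actually run: hocr_name and ocr_pdf are hypothetical names, the temporary work folder is my own choice, and it uses the Cuneiform variant of the OCR step.

```shell
# Sketch only: hocr_name and ocr_pdf are hypothetical names, and the
# temporary work folder is an assumption.

# file001.tif -> file001.html, the pairing pdfbeads expects
hocr_name() {
    printf '%s.html\n' "$(basename "$1" .tif)"
}

# Rasterise a PDF at 300 dpi, OCR each page, and rebuild it.
ocr_pdf() {
    in=$1
    out=$2
    work=$(mktemp -d) || return 1
    gs -sDEVICE=tiffg4 -r300x300 -sOutputFile="$work/file%03d.tif" \
       -dNOPAUSE -dBATCH -q -- "$in" || return 1
    (
        cd "$work" || exit 1
        for f in file*.tif; do
            # keep OCR chatter out of the PDF going to stdout
            cuneiform -f hocr -o "$(hocr_name "$f")" "$f" >&2 || exit 1
        done
        pdfbeads *
    ) > "$out"
}
```

ocr_pdf statement.pdf statement-ocr.pdf would be the usual call (placeholder names again). The work folder is left behind so you can inspect the TIFFs and hOCR files if the result looks odd.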
Radio Amateurs of Canada may seem a bit slow at times, but they’ve quietly gone and put their magazine The Canadian Amateur online. It has a decent interface, definitely up there with Exact Editions’ work:
I don’t think any of the editions before 2012 will be going online. It would be nice, but RAC is severely limited in resources. The almost total lack of fanfare is a contrast to the ARRL’s digital QST, which is much announced but not actually available yet …