The third magic technology for scanning is Optical Character Recognition — OCR for short. This is the application which converts your pictures of pages into a word processing document that — hopefully — contains more or less the same words in approximately the same order.
There are two types of OCR software: famous, ridiculously expensive programs that may work very well — but you’ll never know because you’ll never be able to afford them — and crapulous ones cobbled together by technical assistants on a slow day and given away free by the manufacturers of scanners and (occasionally) digital cameras. And there is Tesseract, which is a libre OCR system that runs under Linux as well as Windows. Unfortunately the Open Source Fairy has failed to cast its healing magic over Tesseract, and from the perspective of recognising scanned books it is pretty useless. The program I use — the only program which I can rely on to produce a good result — is the old, old Presto OCR Pro Version 4, which I bought for about $100 back in 2000, and have dug out and reinstalled on more versions of Windows — real and virtual — than I care to remember.
Yes, I am deeply embarrassed to be running a program — a Windows program, at that! — which is more than ten years old, and just as soon as the OCR wizards manage to come up with a better one at a reasonable price, I will rush out and buy it. But there seems to be some curious temporal anomaly affecting OCR software, whereby each new version or application is actually worse than the one before it. I have tried at least a dozen OCR programs since I bought Presto 4, and the results in comparison vary from atrocious to merely annoying. Whatever the OCR boffins are currently doing, they need to stop doing it and do the reverse: perhaps then they will make some progress.
In any event, the OCR software has become less important lately because of the fourth magic technology, which is simply bigger hard disks, faster internet connections, and more efficient ways of shifting large numbers of bits around in a hurry. In the early days I had to convert my scans to text files almost immediately, because I just didn’t have space to keep them. Now I can collate them into PDF files, tuck them into an unused partition, and back them up on a 16 GB memory card. Pretty soon I’ll be able to carry them around on a 1 TB memory stick on a lanyard around my neck, along with all my other important documents, so if my house burns down when I’m not there I can merely laugh: “Wife and children gone? What care I? I have my data!” They can bury me with it when I die.
So my current strategy is to scan at 600 dpi to a PDF file of page images, and hang the size! Except that I now remove colour and greyscale information by using the ‘text’ setting for scanning, which cuts the size per book down from about 80 MB on average to around 10 MB. So far I have scanned about a hundred books, and I have my system down to a fine art. Which brings us to…