Digitizing books with Linux

It turns out there are some great open-source tools for digitizing books on Linux that make the job pretty painless. It took me about half a day to turn a 150-page book into a high-quality DjVu with OCR. Here are the tools I used.

xsane

xsane just works; there's not much to be said about it. I used a flatbed scanner to scan each page of the book at 300 DPI in color. You can use grayscale, but I would not recommend B&W because you could lose some of the text near the spine. Save the files as either pnm or tiff (Scan Tailor doesn't work with pnm, but you can use ImageMagick's convert utility to convert between the two). Don't worry about getting the book absolutely flat; Scan Tailor can correct the page curling. xsane automatically increments the number in the filename, so when you're done you should have a directory full of numbered files ready for Scan Tailor.
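If the scans were saved as pnm, the conversion to tiff can be done in one pass. The following is a minimal sketch, assuming ImageMagick's convert is installed; the function name pnm_to_tiff and the ./scans directory are made up for illustration.

```shell
#!/bin/sh
# Convert every PNM scan in a directory to TIFF so Scan Tailor can open it.
# Assumes ImageMagick's `convert` is on the PATH. The function name and the
# directory in the usage line are hypothetical.
pnm_to_tiff() {
    dir=${1:-.}
    for f in "$dir"/*.pnm; do
        [ -e "$f" ] || continue           # no .pnm files: skip the literal pattern
        convert "$f" "${f%.pnm}.tif"      # page-0001.pnm -> page-0001.tif
    done
}

# Usage:
#   pnm_to_tiff ./scans
```

The original pnm files are left in place, so nothing is lost if a conversion goes wrong.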

Scan Tailor

Scan Tailor will split the pages in two (if you scanned two pages at once), rotate them so they're straight, remove noise, and correct for curled pages. It did all the heavy lifting and the results were excellent, even from less-than-stellar scans. I used the auto deskew algorithm with pretty good results, but a few pages needed some manual tweaking.

DjVu

I used the directions on Daniel's page (linked in the ocropus section below) to assemble the tiff images from Scan Tailor into a DjVu book. Note that the compression utility he uses (cjb2) is only for black-and-white text and images. If you have grayscale or color photos, you need to compress those pages with c44 instead. If you don't, you'll end up with black squares in place of your images in the final book.
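The cjb2/c44 split can be sketched as a small script. This is not the command sequence from the linked directions, just one way to do it under some assumptions: DjVuLibre (cjb2, c44, djvm) and ImageMagick (identify, convert) are installed, Scan Tailor wrote its black-and-white pages as 1-bit tiffs, and the DPI value matches Scan Tailor's output resolution. c44 reads PNM or JPEG rather than tiff, so photo pages take a detour through a temporary ppm. The function names and DPI default are illustrative.

```shell
#!/bin/sh
# Sketch: encode Scan Tailor's TIFF output page by page, then bundle it.
# Assumes DjVuLibre (cjb2, c44, djvm) and ImageMagick (identify, convert).
DPI=${DPI:-300}   # assumption: set this to the resolution Scan Tailor wrote

encode_page() {
    tif=$1
    out=${tif%.tif}.djvu
    depth=$(identify -format '%z' "$tif")    # bit depth; 1 means bitonal
    if [ "$depth" = "1" ]; then
        cjb2 -dpi "$DPI" "$tif" "$out"        # black-and-white text page
    else
        ppm=${tif%.tif}.ppm
        convert "$tif" "$ppm"                 # c44 does not read TIFF
        c44 -dpi "$DPI" "$ppm" "$out"         # grayscale or color page
        rm -f "$ppm"
    fi
}

build_book() {
    dir=$1
    book=$2       # write the book outside $dir so the glob below skips it
    for page in "$dir"/*.tif; do
        [ -e "$page" ] || continue
        encode_page "$page"
    done
    # Glob order is page order as long as the filenames are zero-padded.
    djvm -c "$book" "$dir"/*.djvu
}

# Usage:
#   build_book ./out mybook.djvu
```

Checking the bit depth is only a heuristic. If Scan Tailor was set to produce mixed output, it may be simpler to list the photo pages by hand and call c44 on those directly.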

ocropus

ocropus did the OCR on the DjVu book. I used the command string from Daniel's page http://www.danielstender.com/granthinam/564/ and it just worked.

Chapter Bookmarks

To add bookmarks I used the following command (the command on Daniel's page has the DjVu filename in the wrong place).

djvused -e 'set-outline mydjvu.outline' -s mydjvu.djvu
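The outline file itself is a small parenthesized list of titles and page references. The sketch below writes one with a shell function; the chapter titles and page numbers are made up for illustration, and write_outline is a hypothetical helper name.

```shell
#!/bin/sh
# Write a DjVu outline file in the format djvused's set-outline expects.
# Each entry is ("Title" "#page"); nested entries become sub-bookmarks.
# The titles and page numbers here are invented placeholders.
write_outline() {
    cat > "$1" <<'EOF'
(bookmarks
 ("Chapter 1" "#1")
 ("Chapter 2" "#23"
  ("Section 2.1" "#25"))
 ("Index" "#147"))
EOF
}

# Usage:
#   write_outline mydjvu.outline
#   djvused -e 'set-outline mydjvu.outline' -s mydjvu.djvu
```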

And Finished!

It was pretty easy. The most time-consuming part was scanning the book in by hand.