Edit: Jeffrey Ratcliffe, the very active developer of gscan2pdf, has released an update that fixes this bug. Ubuntu users can access it via his PPA (see below).
In this post the other day I talked about my relatively painless experience upgrading to Xubuntu 14.04. Since then, I have discovered a couple of bugs in some OCR software I use fairly regularly.
Here is a solution to a slightly annoying regression in gscan2pdf, an otherwise great little PDF scanning, clean-up and OCR solution.
In Ubuntu 14.04, gscan2pdf has a bug in its tesseract OCR support: it appears to OCR the document, but once it has finished no text is added to the OCR layer. The bug does not affect the gocr OCR engine, but tesseract (which was originally developed at HP Labs) is a much better engine and the one I prefer to use.
My first attempt at rectifying the problem was to upgrade gscan2pdf from 1.2.3-1 to the latest version, 1.2.4, which doesn’t seem to have made it into the Ubuntu 14.04 repos, a shame considering Trusty is an LTS release. On the upside, Jeffrey Ratcliffe, gscan2pdf’s developer, has a PPA that contains the latest version, so upgrading was relatively painless. The process is well documented here on the RCLUBLINUX blog.
Unfortunately, the bug is not fixed in gscan2pdf 1.2.4 so the upgrade didn’t fix my problem.
A little poking about on the gscan2pdf Sourceforge page, however, turned up this bug report and a patch to fix the problem, contributed by user tzieg (Thomas Zieg?).
After applying the patch and firing up gscan2pdf, I was glad to see tesseract working as expected again. Thanks, Thomas!
Problem: After upgrading to Xubuntu 14.04 the tesseract OCR engine no longer worked in gscan2pdf.
Solution: Patch gscan2pdf using the patch supplied by Thomas Zieg.
Procedure: Download a copy of the patch from gscan2pdf’s Sourceforge bugtracker.
Copy the patch to the gscan2pdf directory.
sudo cp Tesseract.pm.patch /usr/share/perl5/Gscan2pdf/
Change to the gscan2pdf directory.
cd /usr/share/perl5/Gscan2pdf/
Apply the patch.
sudo patch -p0 < Tesseract.pm.patch
OCR with tesseract should now work as expected. Easy.
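If you would rather not patch a system file blind, the same steps can be wrapped in a more cautious form. What follows is only a sketch: safe_patch is a hypothetical helper name of my own, and it assumes GNU patch and that the patch file refers to its target relative to the directory you pass in (as the -p0 call above implies). It does a dry run first, then applies the patch while keeping the original file as a .orig backup.

```shell
#!/bin/sh
# Sketch only: safe_patch is a hypothetical helper, not part of gscan2pdf or patch.
# Usage: safe_patch DIRECTORY PATCHFILE
safe_patch() {
    dir=$1        # directory containing the file to be patched
    patchfile=$2  # patch file, absolute or relative to $dir
    (
        cd "$dir" || exit 1
        # --dry-run reports what would change without touching any file;
        # stop here if the patch does not apply cleanly.
        patch -p0 --dry-run < "$patchfile" || exit 1
        # -b keeps the unpatched file as <name>.orig so it can be restored.
        patch -p0 -b < "$patchfile"
    )
}
```

From a root shell (the directory is not writable by ordinary users), with the patch already copied into place, this would be called as safe_patch /usr/share/perl5/Gscan2pdf Tesseract.pm.patch. If the patch turns out to cause trouble, copying the .orig file back over the patched one undoes it.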
Right, now to figure out why OCRFeeder crashes when exporting to PDF.