Editor's Note: This post originally appeared in Source, an OpenNews project designed to amplify the impact of journalism by connecting a network of developers, designers, journalists, and editors to collaborate on open technologies. It was originally written for journalists, but we found the piece so unique and useful to libraries that we're reposting a slightly shortened version. Find the original here.
Do you need to pay a lot of money to get reliable OCR results? Is Google Cloud Vision actually better than Tesseract? Are any of the cutting-edge, neural-network-based OCR engines worth the time it takes to set them up?
OCR, or optical character recognition, allows us to transform a scan or photograph of a letter or court filing into searchable, sortable text that we can analyze. One of our projects at Factful is to build tools that make state-of-the-art machine learning and artificial intelligence accessible to investigative reporters. We have been testing the components that already exist so we can prioritize our own efforts.
We couldn't find a single side-by-side comparison of the most accessible OCR options, so we ran a handful of documents through seven different tools and compared the results. Here they are.
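For readers who want to try a similar side-by-side comparison on their own documents, here is a minimal sketch of one way to score OCR output against a hand-typed reference transcript. It is not the method used for this post: the function names (`normalize`, `similarity`, `rank_tools`) and the tool labels are hypothetical, and the character-level ratio from Python's standard `difflib` module is only a rough stand-in for a proper error-rate metric.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse runs of whitespace so layout differences don't dominate the score."""
    return " ".join(text.split())


def similarity(reference: str, ocr_output: str) -> float:
    """Return a ratio between 0 and 1; 1.0 means the normalized texts are identical."""
    matcher = SequenceMatcher(
        None, normalize(reference), normalize(ocr_output), autojunk=False
    )
    return matcher.ratio()


def rank_tools(reference: str, outputs: dict[str, str]) -> list[tuple[str, float]]:
    """Score each tool's output against the reference, best match first."""
    scores = {name: similarity(reference, text) for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical inputs: a hand-typed transcript and two made-up OCR outputs.
    reference = "The quick brown fox"
    outputs = {
        "tool_a": "The quick brown fox",
        "tool_b": "Tne qu1ck brovvn fox",
    }
    for name, score in rank_tools(reference, outputs):
        print(f"{name}: {score:.3f}")
```

A single similarity number hides a lot, such as whether a tool garbles names and figures or merely drops punctuation, so it is best treated as a first pass before reading the outputs side by side.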