Building a summariser

Clients usually have very specific requests about how summaries or descriptions of articles should appear. Writing the summaries can become a bottleneck in producing daily media reports, so we try to automate the process where we can. This is how we put together a quick tool for a new client:

 

The Data: soft copies of English language newspaper articles

Endpoints

  1. 50-400 word summaries
  2. articles on the same topic are to be grouped
  3. include client-specified entities and quotes in summaries
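As a rough illustration, endpoints 1 and 3 can be turned into an automatic check that runs on every generated summary. This is only a sketch: the function name `check_summary` is made up for this post, and it assumes that every client-specified entity must appear in every summary, which may be stricter than a given client actually wants. Quotes and topic grouping are not checked here.

```python
def check_summary(summary, required_entities, min_words=50, max_words=400):
    """Return a list of problems; an empty list means the summary passes.

    Hypothetical helper: the 50-400 word limits come from endpoint 1,
    the entity check is one possible reading of endpoint 3.
    """
    problems = []

    # Endpoint 1: length measured in whitespace-separated words.
    word_count = len(summary.split())
    if not min_words <= word_count <= max_words:
        problems.append(
            f"length is {word_count} words, expected {min_words}-{max_words}"
        )

    # Endpoint 3 (assumption): each entity must appear, case-insensitively.
    lowered = summary.lower()
    for entity in required_entities:
        if entity.lower() not in lowered:
            problems.append(f"missing entity: {entity}")

    return problems
```

A summary that fails the check can be flagged for a human editor rather than sent straight into the daily report.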

 

Task 1 – turn the soft copies into text we can work with:

For this we’re going to use Tesseract with a Python wrapper.
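A minimal sketch of this step, assuming the wrapper is `pytesseract` together with Pillow (one common choice, not necessarily the one we will settle on) and that the soft copies are image files. The helper names `ocr_image` and `clean_ocr_text` are made up for illustration.

```python
import re


def ocr_image(path, lang="eng"):
    """Run Tesseract on one image file and return the raw text."""
    # Imported here so the cleanup helper below can be used (and tested)
    # on machines where Tesseract is not installed.
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang=lang)


def clean_ocr_text(raw):
    """Tidy raw OCR output into paragraphs of plain text."""
    # Rejoin words split across lines with a hyphen. This is a heuristic:
    # it will also merge genuine compounds that happen to break at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)

    # Blank lines mark paragraph breaks; single newlines are just line wraps.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)
```

Typical use would be `clean_ocr_text(ocr_image("article.png"))`, where `article.png` is a placeholder file name.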

 

To be continued …

OCR Engines

We use OCR extensively. It’s a vital first stage for textual analysis. For desktop applications we’ve found Foxit Phantom and ABBYY FineReader to perform well, but neither is ideal for an automated workflow. Ideally we would prefer a minimalist solution, either a command line tool or a software library (Python is our weapon of choice here). We like Tesseract for English language OCR, but we haven’t had satisfactory results for Chinese or Tamil script. What are our options here?
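For reference, this is roughly what the command line route looks like when driven from Python. It is a sketch only: the function names are made up, it assumes the `tesseract` binary is on the PATH, and it assumes the relevant language data is installed (`chi_sim` and `chi_tra` for Chinese, `tam` for Tamil). It shows how to switch languages, not how to fix the accuracy problems described above.

```python
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build the Tesseract command line for one image and one language."""
    # Passing "stdout" as the output base makes Tesseract print the
    # recognised text instead of writing a .txt file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_tesseract(image_path, lang="eng"):
    """Run Tesseract and return the recognised text as a string."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        encoding="utf-8",  # non-Latin scripts need an explicit encoding
        check=True,        # raise if Tesseract exits with an error
    )
    return result.stdout
```

Running `tesseract --list-langs` in a shell shows which language packs are actually installed.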