Building a summariser

Clients usually have very specific requirements for how summaries or descriptions of articles should appear. Producing the summaries can become a bottleneck when putting together daily media reports, so we try to automate the process where we can. This is how we put together a quick tool for a new client:


The Data: soft copies of English-language newspaper articles


The Requirements:

  1. summaries of 50-400 words
  2. articles on the same topic are to be grouped
  3. client-specified entities and quotes are to be included in the summaries.
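
Points 1 and 3 in the list above are easy to check mechanically, which is handy once summaries are being generated automatically. The following is only a rough sketch: the function name `check_summary` and the plain case-insensitive substring match are our own assumptions for illustration, not something the client specified, and grouping by topic (point 2) is not covered here.

```python
import re

MIN_WORDS = 50   # lower bound from point 1
MAX_WORDS = 400  # upper bound from point 1


def check_summary(summary, required_terms):
    """Return a list of problems found in a summary (an empty list means it passes).

    `required_terms` stands in for the client-specified entities and quotes;
    matching is a simple case-insensitive substring test (an assumption).
    """
    problems = []

    # Count whitespace-separated tokens as words.
    word_count = len(re.findall(r"\S+", summary))
    if not MIN_WORDS <= word_count <= MAX_WORDS:
        problems.append(f"word count {word_count} outside {MIN_WORDS}-{MAX_WORDS}")

    lowered = summary.lower()
    for term in required_terms:
        if term.lower() not in lowered:
            problems.append(f"missing required term: {term}")

    return problems
```

A check like this can run over every generated summary before it goes into the daily report, so a person only needs to look at the ones that fail.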


Task 1 – turn the soft copies into text we can work with:

For this we’re going to use Tesseract, an open-source OCR engine, with a Python wrapper.
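
As a first pass, the OCR step might look something like the sketch below. It assumes the wrapper is `pytesseract` (used together with Pillow) and that the soft copies are scanned page images; neither is settled above, so treat both as placeholders. The helper names `clean_ocr_text` and `image_to_text` are ours.

```python
import re


def clean_ocr_text(raw):
    """Tidy raw OCR output into plain paragraphs."""
    # Rejoin words split across a line break, e.g. "sum-\nmary" -> "summary".
    # Caveat: this also joins genuinely hyphenated words that happen to break
    # at a line end ("well-\nknown" -> "wellknown").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)

    # Treat blank lines as paragraph breaks; collapse all other whitespace
    # (including single newlines inside a paragraph) to single spaces.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = (" ".join(p.split()) for p in paragraphs)
    return "\n\n".join(p for p in cleaned if p)


def image_to_text(image_path, lang="eng"):
    """OCR one scanned page with Tesseract via pytesseract (assumed wrapper)."""
    # Imported here so clean_ocr_text can be used without the OCR dependencies.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        raw = pytesseract.image_to_string(img, lang=lang)
    return clean_ocr_text(raw)
```

With Tesseract and its English language data installed, a call such as `image_to_text("article_page1.png")` (a made-up file name) would return the page as cleaned text. Keeping the clean-up separate from the OCR call means it can be adjusted and tested without running Tesseract each time.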


To be continued …