Building a summariser

Clients usually have very specific requirements for how summaries or descriptions of articles should appear. Writing the summaries can become a bottleneck in producing daily media reports, so we try to automate the process where we can. This is how we put together a quick tool for a new client:

 

The Data: soft copies of English-language newspaper articles

Endpoints

  1. summaries of 50-400 words
  2. articles on the same topic grouped together
  3. client-specified entities and quotes included in the summaries
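
Endpoints 1 and 3 are easy to turn into an automated check, which helps when we tweak the summariser later. The sketch below is only an illustration: `check_summary` and its parameters are hypothetical names, the match on required terms is a plain case-insensitive substring test, and grouping (endpoint 2) is not covered.

```python
def check_summary(summary, required_terms, min_words=50, max_words=400):
    """Return a list of problems; an empty list means the summary passes."""
    problems = []

    # Endpoint 1: word count must fall inside the agreed range.
    n_words = len(summary.split())
    if not min_words <= n_words <= max_words:
        problems.append(
            f"length {n_words} words is outside {min_words}-{max_words}"
        )

    # Endpoint 3: every client-specified entity or quote must appear.
    # Assumption: a case-insensitive substring match is good enough here.
    lowered = summary.lower()
    for term in required_terms:
        if term.lower() not in lowered:
            problems.append(f"missing required term: {term}")

    return problems
```

Calling `check_summary(text, ["Acme Corp"])` on a draft returns the reasons it fails, so a report run can flag summaries that need a human look instead of silently shipping them.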

 

Task 1 – turn the soft copies into text we can work with:

For this we’re going to use Tesseract, an open-source OCR engine, with a Python wrapper.
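
As a rough idea of what this step can look like, here is a minimal sketch. It makes several assumptions that are not settled above: that the wrapper is pytesseract (with Pillow for loading images), that the soft copies are scanned page images such as PNG or JPEG, and that the Tesseract binary with English language data is installed. The names `ocr_image` and `clean_ocr_text` are our own placeholders.

```python
import re


def clean_ocr_text(raw):
    """Tidy raw OCR output: rejoin hyphenated words, collapse whitespace."""
    # Newspaper columns break words across lines ("min-\nister").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Treat blank lines as paragraph breaks; flatten everything else.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


def ocr_image(path, lang="eng"):
    """Run Tesseract on one page image and return cleaned text."""
    # Imported here so the cleaning helper works without OCR installed.
    # Assumption: pytesseract and Pillow are the chosen libraries.
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        raw = pytesseract.image_to_string(img, lang=lang)
    return clean_ocr_text(raw)
```

Something like `text = ocr_image("page_01.png")` (a made-up file name) would then give us plain text to feed into the later steps. Rejoining hyphenated words can also merge genuine hyphenated compounds that happen to break at a line end, so the cleaning rule is worth checking against real pages.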

 

To be continued …