Clients usually have very specific requests about how summaries or descriptions of articles should appear. Writing the summaries can become a bottleneck in producing daily media reports, so we try to automate the process where we can. This is how we put together a quick tool for a new client:
The Data : soft copies of English-language newspaper articles
The Requirements :
- summaries of 50-400 words
- articles on the same topic are to be grouped
- client-specified entities and quotes are to be included in the summaries
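As a rough illustration of how the length and entity/quote requirements above might be checked automatically, here is a minimal sketch. The function name `check_summary` and its parameters are hypothetical, the 50-400 range is read as a word count per summary, and matching is a plain case-insensitive substring test (an assumption; real entity matching would likely need more care). Topic grouping is not covered here.

```python
# Hypothetical helper: checks one summary against the client's requirements.
MIN_WORDS = 50
MAX_WORDS = 400


def check_summary(summary, required_entities=(), required_quotes=()):
    """Return a list of problems found; an empty list means the summary passes."""
    problems = []

    # Length requirement, counted in whitespace-separated words.
    n_words = len(summary.split())
    if not MIN_WORDS <= n_words <= MAX_WORDS:
        problems.append(
            f"length {n_words} words is outside {MIN_WORDS}-{MAX_WORDS}"
        )

    # Entities and quotes: simple case-insensitive substring match (assumption).
    lowered = summary.lower()
    for entity in required_entities:
        if entity.lower() not in lowered:
            problems.append(f"missing entity: {entity}")
    for quote in required_quotes:
        if quote.lower() not in lowered:
            problems.append(f"missing quote: {quote}")

    return problems
```

A summary that comes back with an empty list can go straight into the report; anything else gets flagged for an editor.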
Task 1 – turn the soft copies into text we can work with:
For this we’re going to use Tesseract with a Python wrapper.
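A minimal sketch of what this step could look like, assuming the soft copies are scanned images and that the wrapper is `pytesseract` used together with Pillow (the post does not name the wrapper, so both are assumptions). The helper names `ocr_image` and `clean_ocr_text` are hypothetical; the cleanup rules are just one plausible way to tidy OCR output before summarising.

```python
import re


def ocr_image(path, lang="eng"):
    """Run Tesseract on one image file and return the raw text."""
    # Imported here so the cleanup helper below works without OCR installed.
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def clean_ocr_text(raw):
    """Tidy raw OCR output into paragraphs of single-spaced text."""
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Blank lines separate paragraphs; collapse all other whitespace.
    paragraphs = re.split(r"\n\s*\n", text)
    paragraphs = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in paragraphs if p)
```

Usage would be along the lines of `clean_ocr_text(ocr_image("article.png"))`, where the file name is a placeholder. Note that the hyphen rule also joins genuinely hyphenated words that happen to fall at a line end, which is a trade-off worth checking on real articles.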
To be continued …
OK, the long-awaited follow-up to the tonality problem. To recap: a standard sentiment analysis was not giving us pertinent information for the client, so we needed a new measure. Or so we thought. The first thing we did was talk at length with the client to gain a better understanding of what she was hoping to do with the report. It turned out the client had been through a hostile press cycle the previous year, which had led to the request for tonality analysis as part of the media monitoring spec. So the motivation was simply to get a handle on tonality using our standard process. The fact that tone was uniformly neutral in the current data set was not a problem, since she was only looking to flag the outliers rather than extract information from the main body of data. The moral: our need for a ‘story’ in the data is not always the same as the client’s.
So here’s the scenario: we’re tasked with producing a quarterly media report which will ultimately end up in the hands of an Asian governmental client. I’ve assigned the team and they have started coding the data when they hit a snag. Our client has emphasised the importance of sentiment analysis, but thanks to a combination of journalistic style and media self-censorship in the region, everything comes up neutral on first parse. So what’s the problem? None from a data purist’s perspective, but plenty from the client’s point of view. She’s looking for differentiation, perspectives, some meat. The team is divided. The purists will have no truck with the idea of ‘tweaking’ the coding (reproducibility is a necessity, after all), while the pragmatists know they need more. And so we need a more useful metric. To be continued…