Building a summariser

Clients usually have very specific requests about how summaries or descriptions of articles should appear. Writing the summaries can become a bottleneck in producing daily media reports, so we try to automate the process where we can. This is how we put together a quick tool for a new client:


The Data: soft copies of English-language newspaper articles


The Requirements:

  1. summaries of 50-400 words
  2. articles on the same topic are to be grouped
  3. client-specified entities and quotes are to be included in the summaries.
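
A rough sketch of how the length and entity requirements above could be checked automatically before a summary goes into a report. The function name, the plain substring matching and the example entity are our own illustrative assumptions, not part of the client spec:

```python
MIN_WORDS, MAX_WORDS = 50, 400  # length limits taken from the list above


def check_summary(summary, required_entities):
    """Return a list of problems; an empty list means the summary passes.

    Hypothetical helper: entity matching is a simple case-insensitive
    substring test, which is an assumption made for illustration.
    """
    problems = []
    n_words = len(summary.split())
    if not MIN_WORDS <= n_words <= MAX_WORDS:
        problems.append(f"length {n_words} words, expected {MIN_WORDS}-{MAX_WORDS}")
    lowered = summary.lower()
    for entity in required_entities:
        if entity.lower() not in lowered:
            problems.append(f"missing entity: {entity}")
    return problems
```

Calling check_summary(text, ["Acme Corp"]) returns an empty list when the text is within the word limits and mentions the entity. Grouping by topic and quote extraction are not covered here.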


Task 1: turn the soft copies into text we can work with.

For this we’re going to use Tesseract with a Python wrapper.
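
A minimal sketch of what this step might look like. It assumes the pytesseract wrapper together with Pillow, and that the soft copies are image files (PDFs would need converting to images first). The helper names and the clean-up rules are our own illustration, and the Tesseract binary has to be installed separately:

```python
import re


def clean_ocr_text(raw):
    """Tidy raw OCR output so later steps see ordinary running text."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)  # rejoin words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)       # keep at most one blank line
    return text.strip()


def ocr_image(path, lang="eng"):
    """OCR a single image file and return cleaned text."""
    # Imported here so clean_ocr_text can be used without Tesseract installed.
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        return clean_ocr_text(pytesseract.image_to_string(img, lang=lang))
```

Something like ocr_image("article.png") would then return the article text as a string; the file name is just a placeholder. Note that the hyphen rule will also join genuinely hyphenated words that happen to break across lines.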


To be continued …





The tonality problem pt 2

OK… the long-awaited follow-up to the tonality problem. To recap: a standard sentiment analysis was not providing us with pertinent information for the client… we needed a new measure… or so we thought. The first thing we did was talk at length with the client to gain a better understanding of what she was hoping to do with the report. In this case the client had been through a hostile press cycle the previous year, which had led to a request for tonality analysis as part of the media monitoring spec. So the motivation was simply to get a handle on tonality using our standard process. The fact that tone was uniformly neutral in the current data set was not a problem, since she was only looking to flag the outliers rather than extract information from the main body of data. Moral: our need for a ‘story’ for the data is not always the same as the client’s.

OCR Engines

We use OCR extensively. It’s a vital first stage for textual analysis. For desktop applications we’ve found Foxit PhantomPDF and ABBYY FineReader to perform well, but neither is ideal for an automated workflow. Ideally we would prefer a minimalist solution, either a command line tool or a software library (Python is our weapon of choice here). We like Tesseract for English-language OCR… but we haven’t had satisfactory results for Chinese or Tamil script. What are our options here?

Divided by Language

remote worker [RW]: “Wait”

less-than-temperate analyst [LTTA]: “WTF, who does he think he is telling to wait? I am waiting! I shouldn’t be, that’s the problem”

It’s only 100ms as the ping flies but it’s a world away, and in the absence of other clues, communication when working to deadlines can be fraught with stress and misunderstandings.

“Be nice,” comes a voice of reason [VOR], “I’m sure he doesn’t mean to be rude. English isn’t his first language, you know”

And so it proves… the ‘wait’ in question is not so much a go-to-the-back-of-the-line-I’ll-deal-with-you-when-I-feel-like-it sort of wait as an I’m-on-it-and-you-are-my-top-priority sort of wait, as we soon see when the required stream of translations starts flowing through.

[LTTA]: “OK, well, I don’t see why he has to wind me up like that. I need a cuppa”


[LTTA]: “:-\”


“But it’s always neutral!”

So here’s the scenario: we’re tasked with producing a quarterly media report which will ultimately end up in the hands of an Asian governmental client. I’ve assigned the team and they have started work coding the data when they hit a snag: our client has emphasised the importance of sentiment analysis… but thanks to a combination of journalistic style and media self-censorship in the region, everything at first parse comes up neutral. So what’s the problem? None from a data purist’s perspective… but plenty from the client’s point of view. She’s looking for differentiation, perspectives… some meat. The team is divided: the purists will have no truck with the idea of ‘tweaking’ the coding (reproducibility is a necessity, after all), while the pragmatists know they need more. And so… we need a more useful metric. To be continued…


Welcome to boxedNews. We’ll be sharing our thoughts here on media monitoring and analysis, and keeping everyone apprised of our projects in the automation sphere.