Big Data in Manchester
Mike summarised big data and its potential far more eloquently than I could, so I defer to him in this instance. What is critically important to me is the collaboration that occurs in the wider community. We as developers entered this event with a number of things on our minds. First up was building broad links between ourselves, as a service provider, and public and private sector data analysts and academia. It’s important to us to listen to what people really want, and this was a great opportunity to engage. Next up: what do businesses need to succeed? How can we help them minimise set-up and operational costs, and make them more productive? We had a good time exploring all of this, which I shall delve into without further ado.
The day kicked off with a cup of tea and a series of introductions by the University of Manchester, which organised the event. After being divided into teams we were ushered into a creative space within The Landing at MediaCityUK. The task: using data sets provided by Manchester City Council, the University of Manchester and Aridhia (who also provided cloud-based data analytics software in the form of AnalytiXagility), we were to mine the data and look for insights that could aid the local community. It also provided ample opportunity to get a feel for how Big Data works and the challenges faced while working with it.
We opted to drill into a dataset of ~700,000 Twitter messages from a broad region encompassing Greater Manchester and Lancashire to see what could be derived. My initial instinct was to dive in with a custom C++ application to leverage the high performance on offer (besides, my knowledge of data mining is non-existent, so stick to what you know!). Fairly quickly we were able to extract longitude and latitude and use them to generate a heat map overlay, plotted using the Google Maps API. Everything looked much as you’d expect, with a fairly even distribution over built-up areas, tending to increase in the major urban centres. There were, however, some rather obvious anomalies that were worth further investigation.
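The heat map step described above boils down to counting tweets per grid cell. The original was a custom C++ application whose details are not shown here, so what follows is only a minimal Python sketch under stated assumptions: the coordinates have already been parsed into (latitude, longitude) pairs, and the function names `bin_coordinates` and `to_weighted_points`, along with the cell size, are hypothetical choices made for illustration.

```python
from collections import Counter


def bin_coordinates(points, cell_deg=0.01):
    """Count points per grid cell, where each cell is cell_deg degrees on a side.

    points is an iterable of (latitude, longitude) pairs (an assumed input
    format; the real dataset layout is not described in the post).
    """
    counts = Counter()
    for lat, lon in points:
        # Floor division maps a coordinate to its cell index and also
        # behaves sensibly for negative (western) longitudes.
        cell = (int(lat // cell_deg), int(lon // cell_deg))
        counts[cell] += 1
    return counts


def to_weighted_points(counts, cell_deg=0.01):
    """Convert cell counts into (lat, lon, weight) triples at each cell centre."""
    return [
        ((row + 0.5) * cell_deg, (col + 0.5) * cell_deg, n)
        for (row, col), n in counts.items()
    ]
```

The weighted points can then be handed to whatever heat map layer the mapping library provides. A cell whose count is far above its neighbours stands out as an isolated hot spot, the kind of anomaly worth a closer look.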
As it turns out, there are some prolific tweeters out there. We are talking on the order of thousands of messages in a single month. Oddly enough, most of the small red hot spots are actually single users; one user in particular had something of an infatuation with 5 Seconds of Summer! This highlights one of the challenges we discovered with Big Data in general: filtering out the irrelevant data that skews the picture, or the ‘veracity’ of data, as the parlance within IBM has it. So what was important to all these Twitter users? First we constructed a dictionary of all the words used, then allowed each individual user only a single vote towards a word’s popularity. This mitigated the impact #5SOS and @Luke5SOS had on the overall picture. Given that the data set covered June to July, it was unsurprising that #WorldCup2014 was on everyone’s mind. More surprisingly, the most popular celebrity was @GaryLineker. Further techniques such as sentiment analysis revealed that Gary is generally regarded in a positive light by Greater Manchester, as he’ll be pleased to know! To bring the day to a close we presented our tenuously useful findings to the participants. It was good to see the work of professional data analysts and what is truly possible with analytics.
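The one-vote-per-user word count described above can be sketched in a few lines. This is an illustrative Python sketch rather than the code written on the day: it assumes the tweets are available as (user, text) pairs, and the name `popular_terms` and the simple tokenising pattern are hypothetical.

```python
import re
from collections import Counter

# Assumed tokeniser: a word, optionally prefixed with '#' or '@'.
TOKEN = re.compile(r"[#@]?\w+")


def popular_terms(tweets):
    """Count how many distinct users used each term.

    tweets is an iterable of (user, text) pairs (an assumed input format).
    Each user contributes at most one vote per term, however often they
    repeat it.
    """
    seen = {}          # user -> set of terms that user has already voted for
    votes = Counter()  # term -> number of distinct users who used it
    for user, text in tweets:
        terms = {t.lower() for t in TOKEN.findall(text)}
        new_terms = terms - seen.setdefault(user, set())
        votes.update(new_terms)
        seen[user] |= new_terms
    return votes
```

Calling `popular_terms(tweets).most_common(10)` would then list the ten terms used by the most distinct users, so a single account repeating one hashtag thousands of times still counts only once.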
Big Data is difficult. First there is acquiring the data: there is a lot out there, but you need to gather enough that is relevant to your study. Then there is cleaning the data up to reduce noise caused by power users or automated services (one of the most prolific accounts was that of a ‘word of the day’ service). Finally there is the actual analysis: discovering semantics, links and trends. But it was educational to learn all this first hand.
A couple of interesting leads came out of the day. First, there is the Hadoop Manchester group, useful for anyone interested in how big data is done in local business, which I will try to attend. Secondly, and this links in nicely to my time in Italy where these technologies were discussed, there is a lot of interest in data flow analysis (e.g. Apache Spark) as a successor to the MapReduce paradigm, and in SQL on Hadoop. We’d be really interested to hear your thoughts on these, as they are definitely services we’d be eager to offer.