Data Data Everywhere

“…but let’s just stop and think” was the opening title for the panel debate earlier this month, which I and forty or so others attended at the Royal Society, sponsored by the Royal Statistical Society (as part of its Get Stats statistical literacy campaign) and the British Academy.

The panel session was titled “Speed Data-ing: the effects of the rapid rise of the data society. Is the public’s date with data heading for disaster, or could it be a match made in heaven?”

The opener was from David Hand, the current President of the Royal Statistical Society and Professor of Statistics at Imperial College London. The key message here was that some data collection is explicit and some implicit. The explicit includes government collection of data to help understand the needs and wishes of the population. The implicit includes all that personal and collective online purchasing information that is used to make quite targeted recommendations to us.

Perhaps the key message, though, was about the impact of joining all of the data up. Not all of it, in fact: linking just some of it will enable new insights. How long before the life insurance premium is informed by the information collected about personal food purchases….

Opening up and explaining the numbers behind the news was the message from Simon Rogers, the Guardian’s Datablog editor. Acting as the bridge between the data and the expert user is a key role for the Datablog, supporting the mutualisation of data. This has led to a significant flow of visualisations of the emerging data, and has also widened the scope of what might helpfully be visualised, such as mapping the locations in the latest round of Wikileaks releases. There have also been some frustrations along the way for the Datablog, including the difficulty of getting consistent high-level data from across government departments. But the Guardian still pulls together the most publicly accessible and comprehensive spending profile for government.

The benefits and risks of open data were the theme for David Spiegelhalter, Winton Professor of the Public Understanding of Risk at the University of Cambridge. The emphasis here was again on the added value of integrating data, and the ease with which this can be done, while also acknowledging the risk of over-interpretation. It was interesting to explore where things go wrong, including cases where the logic of linking data holds but the outcome is flawed or even nonsensical, especially when outcomes are statistically significant but the meaning is nonsensical. Simple tools can be quite enlightening, with the funnel plot providing easy insight in a range of cases.
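To make the funnel plot idea concrete, here is a minimal sketch (not from the talk; the function name `funnel_limits` and the 10% baseline rate are my own illustrative choices) of the 95% control limits that give a funnel plot its shape: limits around an overall rate that narrow as sample size grows, so a small unit has to be much further from the average before it looks unusual.

```python
import math

def funnel_limits(p, n, z=1.96):
    """95% control limits for a proportion p observed in a sample of size n."""
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Suppose the overall event rate across all units is 10%
baseline = 0.10
for n in (25, 100, 400, 1600):
    lo, hi = funnel_limits(baseline, n)
    print(f"n={n:5d}  limits: {lo:.3f} .. {hi:.3f}")
```

Plotting each unit's rate against its size, with these limits overlaid, yields the funnel: most of the "extreme" performers in a league table turn out simply to be small, sitting comfortably inside the wide end of the funnel.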

One of the key questions that emerged for me - and which I put to the panel - was the extent to which open access to data is really access for the expert public rather than the lay public, and how this might evolve. While the Guardian has a datablog, and other broadsheets have data experts, perhaps one measure of progress will be when red-top newspapers are also driving a data-enabled agenda. There was recognition of the work going on to help identify what works for the lay public, with data experts getting more involved in public documents. Value is also now being extracted from the numbers by folk who don’t need to crunch them in the ways that would have been necessary in the past, and we’re in the early days of open data and visualisation… It is also noteworthy that some of the media data products are already becoming definitive reference points - even for government - such as the Guardian’s visualisation of government departmental spend, as seen here…

So where might all this take us….

1. Road Map. This makes me wonder if we now need a new way to think about “data”. It seems we’re missing a macro and accessible way to organise and describe this emerging world, the way we have for our physical world. Perhaps the best analogy for me is “roads” (plural, the way data is the plural of datum). We all have a sense of the UK road structure (road numbers generally increase clockwise out of London) and a hierarchy of roads (M, A, B, minor) with standard characteristics which are generally predictable (which we know having travelled only some of them) but which are still locally unique. Just as a simple structure like ‘roads’ helps us understand and deal with the real world with some degree of confidence, we might need a framework for the “public understanding of data”. After all, roads get us to a destination the way data gets us to a message.

2. Data Stardom. Also, as the volume of data increases, some will reach stardom and others will fall by the wayside. Just as talent is only one factor in stardom, so usefulness will be only one factor for data; there will also be right-time, right-place factors. So a world where not all data are equal will be the norm, with some survival of the fittest - but because of those other factors, the survivors may not be the most ‘fit for purpose’.

3. So What Test. With the wonderful visualisations that are emerging on a daily basis, there’s a risk that these are seen as an end product. Some visualisations are an attempt to jazz up the standard data tools and, in doing so, create more complexity, because you have to understand how the visualisation works before you can work out what it says. Of course that’s still a necessity for the more traditional approaches, which are more generic but currently more familiar. The real challenge is about extracting the messages - the “so what” test - and engaging visualisations will be a big factor to that end.

4. Intuitive Insight. Something quite fundamental is emerging here about how the traditional heavyweight stats do not always provide a meaningful and engaging answer - sometimes even a nonsensical one, despite being statistically significant. I sense a trend here towards the visually intuitive rather than the statistically inductive. After all, the eye-and-brain team can see quite complex patterns, so a bit more right-side art brain and a bit less left-side science brain might be the new norm.
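A toy illustration of the "significant but meaningless" trap (entirely my own example; the numbers and the helper `two_sample_z` are invented for illustration): with enough data, even a practically irrelevant difference between two groups clears the conventional significance threshold.

```python
import math

def two_sample_z(p1, p2, n1, n2):
    """z statistic for the difference between two observed proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# 50.1% vs 50.0% - a difference nobody would act on - yet with
# five million observations per group it is "highly significant"
z = two_sample_z(0.501, 0.500, 5_000_000, 5_000_000)
print(round(z, 2))  # comfortably past the 1.96 cut-off for p < 0.05
```

The statistic says "real effect"; the eye, shown the two bars side by side, would rightly say "so what" - which is the point about intuition above.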

5. Data Rave. In the same way that two good ideas rubbed together can create a great idea, linking the right two datasets might just create great insights (a bit more ambitious than the whole merely being greater than the sum of the parts). And while there is a law of diminishing returns, the point at which value starts to tail off in the data world might be further along than we initially think. In public data terms we’re only starting to merge small numbers of data sets (even if they are big ones), so there are interesting times ahead. It’s more like a table for two over a glass of wine, soon to become the big party, then the club, then the rave. Roll on the data rave.

So, a great session teasing out some of the new dynamics in the world of data - but, like learning to ride a bike, there’s plenty of early wobbling and tumbling before we get to the stage of generating efficient speed and distance from the here and now.

British Academy’s audio coverage of the event: http://www.britac.ac.uk/events/2010/SpeedData-ing.cfm