Speaking Data: Meta 'Sweta' Data

The idea of meta data – data about data – is now relatively well established: information used to describe the attributes of a data set. This might include, for example, filenames, formats, dates of creation and version numbers. Equally, it can include attributes of the data variables themselves, such as variable name, description, format and validation limits.

As well as meta data, I now propose another quite distinct sort of “data about the data”: that which describes its usage. This I’ll call “sweta” data. While “meta” is derived from the Greek, “sweta” is derived from the more colloquial “sweat the asset” (or, here, the data). In keeping with the analogy, this relates to the degree of exertion: the extent to which data is ‘exercised’ or used.

So as well as the data itself, we have…

Meta Data… the information necessary to be able to make use of the data.

Sweta Data… the information about the extent of use of the data.

There are really two sorts of sweta data, macro and micro. Micro is about the use of individual fields within a data set. Macro is about the overall use of a data set, which might then be compared with other data sets.
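To make the micro and macro distinction concrete, here is a minimal sketch in Python. It assumes some kind of access log exists that records which data set was used and which fields were read; the log structure, data set names and field names are all hypothetical, invented purely for illustration.

```python
from collections import Counter

# Hypothetical access log: each entry records one use of a data set
# and the fields that were read. All names are illustrative only.
access_log = [
    {"dataset": "school_census", "fields": ["pupil_id", "age"]},
    {"dataset": "school_census", "fields": ["age"]},
    {"dataset": "hospital_episodes", "fields": ["admission_date"]},
]


def macro_sweta(log):
    """Macro level: count uses per data set."""
    return Counter(entry["dataset"] for entry in log)


def micro_sweta(log):
    """Micro level: count uses per (data set, field) pair."""
    counts = Counter()
    for entry in log:
        for field in entry["fields"]:
            counts[(entry["dataset"], field)] += 1
    return counts


macro = macro_sweta(access_log)
micro = micro_sweta(access_log)
```

The macro counts allow one data set to be compared with another, while the micro counts show which fields within a data set are actually being exercised.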

Use and Value

Having managed several annual cycles of national data collection requirements, I have found that usage of fields is a very strong practical driver of what stays and what goes to make way for the new. There’s an emerging opportunity to strengthen this right now. If we apply some of the more generic thinking about prioritisation of data, then ‘used and useful’ is a fair test. That ‘extent of use’ will be a proxy for value, although biased towards quantity rather than quality. Given that value can be such an elusive thing to measure, the more measurable usage at least provides a starting point for that debate. Usage is more a sense of output, and value more a sense of outcome.

With the overall transparency agenda, and more specifically with the national data portals, there’s a real opportunity to understand usage. www.data.gov.uk has over 5500 links to data sets after its first year. The web-based data hub of the Office for National Statistics provides a wide scope of national and regional level data on a daily basis. There are also increasing numbers of independent data consolidator sites that may well bring these sweta matters into sharper focus sooner, especially those with an eye to future financial viability. That would be a first port of call for the Public Data Corporation; after all, the first place to look for the biggest revenues will be the biggest uses.

And in an age of austerity, tougher prioritisation and decisions need a tighter evidence base, at both macro and micro data levels. So let’s put the collection cost beside the usage. A starting proposition might reasonably be that low-cost-high-use data sets are of more value than high-cost-low-use ones. And if something has to go (….). So some usage information – macro sweta data – on the extent to which specific data sets are more or less used will be more centre stage.

Apply some Web Analytics

Because the web has become the default portal for data, this might not be as difficult as it seems, as the web has inbuilt counting mechanisms. These used to be simply ‘hit counters’ but web analytics is now an emergent analytical domain in its own right. The really helpful starting point here is that web site usage statistics are generally prepared to industry standards (ABCe), the electronic equivalent of the established media circulation, viewing and listening figures. At the simplest level, the measures include the number of web pages viewed (in a specific period), the number of visits (each visit may view more than one page) and the number of unique visitors (as one visitor may make multiple visits). So plenty of potential for some easy usage assessment.
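As a rough illustration of those simplest measures, the sketch below counts page views, visits and distinct visitors from a list of page requests. The record layout, identifiers and page paths are hypothetical, and it assumes each request has already been assigned to a visit; real analytics tools and the industry standards define visits with more care (session timeouts, for instance) than this does.

```python
# Hypothetical page-request records for one reporting period:
# (visitor_id, visit_id, page). All values are illustrative only.
requests = [
    ("v1", "v1-a", "/data/population"),
    ("v1", "v1-a", "/data/births"),
    ("v1", "v1-b", "/data/population"),
    ("v2", "v2-a", "/data/population"),
]


def usage_measures(records):
    """Return the three simplest usage measures for a set of requests."""
    page_views = len(records)  # every request is one page viewed
    visits = len({visit_id for _, visit_id, _ in records})
    unique_visitors = len({visitor_id for visitor_id, _, _ in records})
    return {
        "page_views": page_views,
        "visits": visits,
        "unique_visitors": unique_visitors,
    }


measures = usage_measures(requests)
```

Here one visitor makes two visits and views three pages in total, which is why the three counts differ from one another.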

This also applies more broadly, not just to the data web portals. At the simplest level, a web site is a tool for communication, often the primary tool for many organisations or functions. Usage analytics provide a starting point for evaluating its extent and effectiveness, especially given their relative ease and availability. For example, www.direct.gov.uk (the established government portal which provides public information about government services) has 30 million visits a month. The government’s transparency programme is using the web as the prime means of publishing all sorts of government data for open use and challenge.

Clearly, with the web being used as a key channel of information provision to users, we might expect to see more on usage of public service web sites.

Official Statistics….about official statistics

Certainly in the world of UK official and national statistics there has been an increasing emphasis on understanding user engagement (UK Statistics Authority). Understanding in a quantitative way the degree of data set usage is a helpful starting point for understanding frequency of use and, to some extent, even user behaviour (repeat visitors, for example).

Perhaps more important, although less conspicuously driven, is the (public) value attributed to interpretation of the data (UK Statistics Authority). This is perhaps the most profound indicator of the importance of the output and value end of the data production chain. It is much closer to the purpose of having the data in the first place – to extract messages, meaning and implications – with the data being a means to that end rather than an end in itself. Usage is a window on that value end of the data production chain.

So that’s mainstreaming these other data related aspects, as opposed to just the data. Then it’s a simple leap to see how the quantification of usage is also something that is equally critical for the public domain and for public debate.

Now this raises the question of whether usage statistics for official statistics should be available in their own right… oh yes… official statistics about official statistics… that would make them meta statistics.

This usage information could easily be meshed with the costs of data collection to provide a simple, if crude, index of usage per pound spent… meta money statistics. With the consolidation of the national data asset through data.gov.uk, it won’t be long before these two dimensions start to be seen side by side, especially since a lot of public data is collected under contract, and the value of such contracts is becoming available.
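A crude usage-per-pound index of this kind could be sketched as follows. The data set names, costs and visit counts are made up for illustration and are not real figures; “visits” is just one assumed choice of usage measure, and any other count could be substituted.

```python
# Hypothetical figures for illustration only; not real costs or usage.
datasets = {
    "dataset_a": {"annual_cost_gbp": 50_000, "annual_visits": 200_000},
    "dataset_b": {"annual_cost_gbp": 400_000, "annual_visits": 100_000},
}


def usage_per_pound(cost_gbp, visits):
    """Crude index: units of usage per pound of collection cost."""
    if cost_gbp <= 0:
        raise ValueError("collection cost must be positive")
    return visits / cost_gbp


index = {
    name: usage_per_pound(d["annual_cost_gbp"], d["annual_visits"])
    for name, d in datasets.items()
}

# Rank data sets from most to least usage per pound spent.
ranked = sorted(index, key=index.get, reverse=True)
```

With these invented numbers the low-cost-high-use data set ranks first, which is the starting proposition described earlier. The index says nothing about the quality of that use, so it is a starting point for debate rather than a verdict.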

In fact the official web site for the Prime Minister (www.number10.gov.uk) has been publishing its usage statistics on a monthly basis, typically within one or two days of the end of the month. That’s the “official usage statistics for the official web site”. Here are the pages viewed per visit, a crude “interesting index”…

[Graphic: pages viewed per visit]

So ‘roll on’ better ‘sweta’ meta data.