Data Smoothie

Data is the plural of datum – the individual data items. So when we think about analysing data we are probably already thinking “data set” rather than about the constituent “datum”. Perhaps that’s why we’re quite comfortable accepting aggregate descriptions of the data – e.g. the average – as a starting point. So potentially we’re already one step removed from the underlying raw data.


Data smoothing is one way that’s been used to look for underlying patterns in those data. A typical approach is to smooth data over time – say, smoothing monthly data over a year – to see more macro, perhaps seasonal, patterns. However this can so easily also hide all sorts of interesting micro patterns.
 
The examples below show how a range of scenarios, when simply smoothed, hide the real patterns of interest. These examples use a 12 month rolling average (mean) and show this for a 12 month period. In each scenario, the real data repeats the same pattern over 12 months...

Perhaps less data smoothing, more datum smothering…..

And for a more helicopter view of the underlying patterns....



So as soon as we distance ourselves from that underlying data we are potentially distancing ourselves from the underlying questions and messages. Exploratory Data Analysis has been a champion of staying close to the underlying data, using approaches which try to reflect more of the original data in the calculations and visuals. And this is perhaps helpfully reborn in the current world of data visualisation, where we see a stronger sense of “datum” over “data”.

NOTES: The real data.

This is the matrix of data. For each series, the same annual pattern is followed each year. These graphs show the last 12 months’ data, the raw underlying data being the same as in the previous year. This is based on the principle that if the data point that drops out of the rolling average is the same value as the new data point that enters, the rolling average remains unchanged.
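Just to make that principle concrete, here’s a minimal Python sketch – the monthly figures are purely illustrative, not the series behind these graphs. Because the pattern repeats exactly every 12 months, the value dropping out of each 12-month window always equals the value entering it, so the rolling mean never moves:

```python
# One year's (illustrative) seasonal pattern, repeated over three years.
seasonal = [10, 12, 18, 25, 30, 42, 55, 48, 33, 24, 15, 11]
series = seasonal * 3

# 12-month rolling average over the full series.
rolling_means = [sum(series[i - 12:i]) / 12 for i in range(12, len(series) + 1)]

# Every window contains each month exactly once, so all the means are
# identical (~26.9): the seasonal swings from 10 to 55 vanish entirely.
print(rolling_means)
```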

For Whom the Buck Stops

In the world of open data we are only starting to see the mix/mash/mesh of the various data. Given that data.gov.uk already has over 5600 data sets, this might start with linking any two data sets and potentially linking up to all 5600. That’s where reality gets a grip. The scope is technically challenging – only some data will be effectively linkable to selected others – and the general capacity and incentive to do this is probably insufficient right now.

Also at the moment there’s a lot of data news, visualisations and apps. And like news, there’s a new data topic or two tomorrow, with a new visualisation and app to follow. But this is not the same as insightful analysis that moves the debate on by distilling and consolidating messages. So there are lots of opportunities for taking some proper analytical time, and getting into the key messages and questions, not just from a single set of data but also by considering some simple linkage.

Helpfully the Law of Diminishing Returns might suggest some early benefits from the more straightforward linking of simple key data, rather than tackling too many of the more complex data. Here’s one example of the opportunity to start to make some relevant connections – perhaps the simplest example of the basic need for that connection and context, and the extra insight that is possible.

The case in point is the publication of the salaries of those civil servants earning over £150k per year. There are 345 top players, who are collectively paid £58m. Looking overall, that £58m works out at an average £170k salary per year per head (with some part-timers in the mix, and based on the lower salary band). But is that good value?

But the list of salaries of the top public players does not give a sense of relative worth, just their relative pay. Helpfully, what’s clear from the Comprehensive Spending Review – and the range of departmental budgets – is that some jobs are definitely bigger than others. That means that with variable salaries and variable budget responsibilities, some might well have more responsibility for each £1000 of their salary than others.

Those 345 senior staff collectively manage annual spending of £696bn (Comprehensive Spending Review 2010/2011). So in simple average terms (with a quick arithmetic check sketched after the list):

- That’s about £2bn managed per person per year, or

- Every £1000 of salary comes with £12m of budget responsibility, or

- Each £1 of salary buys £12,000 of budget responsibility.
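Those averages are easy to verify – a quick sketch in Python using only the figures quoted above (345 staff, £58m of collective pay, £696bn managed):

```python
staff = 345
total_salary = 58e6   # £58m collective pay
total_budget = 696e9  # £696bn Total Managed Expenditure

print(f"Managed per person: £{total_budget / staff / 1e9:.1f}bn")                        # ~£2.0bn
print(f"Budget per £1000 of salary: £{total_budget / (total_salary / 1000) / 1e6:.0f}m") # £12m
print(f"Budget per £1 of salary: £{total_budget / total_salary:,.0f}")                   # £12,000
```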

Looked at another way, this is less about salary paid, more about the total risk being managed. In total, £58m is what we pay for the effective management of £696bn – £1000 of management salary for each £12m of spend. Is that about right?

However: (1) there are more senior and less senior folk in that pool of 345 top civil servants, and the key point here is that there is a smaller number of senior departmental staff – the Permanent Secretaries – with whom the buck eventually stops; (2) the spend per department is quite variable; (3) for a more complete picture there is also some revenue collection responsibility to take into account. So we really need to unpack these three dimensions to get a better sense of salary worth – basically, who’s having to work hardest.

1. Super Senior Staff

The buck would tend to stop with the Permanent Secretary of each department, as the most senior civil servant. However that does not necessarily mean that they are paid the most in their department. There’s a scattering of roles across the departments where the Perm Sec is paid less than others in the department. This includes the Chief Information Officer at the Department for Work and Pensions, several at the Department for Business, Innovation and Skills, the Finance Director at the Home Office and the Chief of the Defence Staff. In addition these 345 roles also cover the Arm’s Length Bodies (Quangos – Quasi-Autonomous Non-Governmental Organisations), where some senior salaries are higher than that of the Permanent Secretary of the department whose arms they are at the end of. All a function of the real world at play.

Of course this tends toward recognising a “top team” approach with collective shared responsibility. However there is still a responsibility and accountability hierarchy in place, given the importance of the roles and the public resources at stake. And that avoids the conundrum about four people named Everybody, Somebody, Anybody and Nobody:

There was an important job to be done and Everybody was sure that Somebody would do it.  Anybody could have done it, but Nobody did it.  Somebody got angry about that, because it was Everybody's job.  Everybody thought Anybody could do it, but Nobody realised that Everybody wouldn't do it.  It ended up that Everybody blamed Somebody when Nobody did what Anybody could have done.

So the focus here is on those 17 Permanent Secretaries from the departments, with whom the buck eventually stops.

2. Departmental Spending

The Comprehensive Spending Review lays out the government’s spending plans over the next five years, by government department. There’s a Total Managed Expenditure of £696bn in 2010/11, which is broken down (in the simplest terms) into (a) capital and resource spending and (b) externally committed spend such as benefits (called Annually Managed Expenditure) and more variable spending such as running costs (called Departmental Expenditure Limits).

The focus here is on the departmental spend, which is broken down by department and represents the substantive volume and variability. However there’s plenty of complexity around this, and several items are excluded here as they are generic rather than departmental spending, most notably: accounting adjustments (£22bn gross); depreciation (£16bn); reserves (£8bn); and debt interest payments (£43bn). This also excludes: £60.4bn of collective spend for Scotland, Wales and Northern Ireland; £39bn allocated to local government; £2.1bn for Small Independent Bodies; and HM Treasury, with a negative expenditure of £2.4bn in 2010/11. The Cabinet Office is still included: while it is one of the smallest spending departments, its head also functions as the Head of the Civil Service and is paid accordingly.

So the identified departmental spend amounts to £500.5bn, which is directly allocated to the 17 departments as follows:

So that £500bn spend is managed and delivered through the 17 most accountable senior staff. That makes for a much sharper financial responsibility than when spread across the collective 345 members of “top teams”. The collective salary of those 17 staff is £3.1m.

So again in simple average terms (again with a quick check sketched after the list):

- That’s £29bn managed per person

- Every £1000 of salary comes with £161m of budget responsibility (rather than £12m per £1000 when spread across the 345-strong top team)

- Each £1 of salary buys £161k of budget management.
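And mirroring the earlier sketch, the same arithmetic for the 17 Permanent Secretaries, again using only the figures quoted above (£500.5bn of departmental spend, £3.1m of collective salary):

```python
perm_secs = 17
salary_17 = 3.1e6    # £3.1m collective salary
budget_17 = 500.5e9  # £500.5bn departmental spend

per_person = budget_17 / perm_secs         # ~£29bn managed each
per_1000 = budget_17 / (salary_17 / 1000)  # budget per £1000 of salary

print(f"£{per_person / 1e9:.0f}bn managed per person")
print(f"£{per_1000 / 1e6:.0f}m per £1000 of salary")  # ~£161m, vs ~£12m across all 345
```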

The top 5 departments account for 80% (£400bn) of the spend and, at the other end of the scale, half of the departments have a budget of under £10bn. There’s more variation in the departmental budgets than there is in the salaries, so this clearly points to some having relatively more financial responsibility for the same level of salary.

So at one extreme, the Work and Pensions Perm Sec gets £195k and is responsible for £159bn of public funds. So each £1000 of salary brings with it a responsibility for £815m of public funds. Then comes the NHS, for which each £1000 of salary carries a financial responsibility for £404m. Then it’s Education with £337m of responsibility per £1000 of salary, HM Revenue and Customs at £236m, and then Defence at £221m.

Then there’s the majority pool of smaller spending departments (under c.£100bn). And at the lower extreme the Foreign and Commonwealth Office Perm Sec gets a £175k salary and is responsible for £1.7bn. So each £1000 of salary brings with it a responsibility for only £10m of public money.

So the financial responsibility for each £1000 of salary varies from £815m (Work and Pensions) to around £10m (FCO). That represents a factor of 80 – that’s 80 times more responsibility, opportunity or risk for each £1000 of salary, depending on your viewpoint.

In short, reward and risk are not evenly distributed, and are in fact hugely variable.

So to look at this another way: what if everyone was paid relative to their level of financial responsibility? First let’s use the FCO salary level of £175k for £1.7bn of responsibility as the benchmark. Then the Work and Pensions salary would rise from £195k to around £15m a year. Or if we standardise the other way around, and use the Work and Pensions benchmark of a £195k salary for £159bn of financial responsibility, then the FCO salary drops from £175k to around £2k per year....

3. Departmental Income

Just to complicate things further, some departments also generate revenue. While some of this might be indirect (every £1 spent on the British Library – an arm’s length body of the Department for Culture, Media and Sport – generates £4 for the UK economy), some of it is very direct revenue collection. The Budget 2010 shows that Government income (“net receipts”) totalled £548bn. Of this, over three quarters – £419bn – was raised through HMRC: through Income Tax (£150bn), National Insurance (£99bn), VAT (£81bn), Excise Duty (£46bn) and Corporation Tax (£43bn).


So while HMRC has a spend responsibility of £40bn, it also has a collection responsibility of £419bn – a collective annual financial responsibility of £459bn. So for HMRC that salary comes with even more responsibility, which means that the original £236m of financial responsibility per £1000 of salary at HMRC increases to £2448m. (Strangely, HMRC seems to have two Permanent Secretaries, so even if that £2448m gets shared, that’s still over £1220m per £1000 of salary.)

So comparing that HMRC £2448m of budget per £1000 of salary to the FCO’s £10m, that now represents a factor of nearly 250 times more financial responsibility per £1000 of salary. So definitely not all jobs are equal.

So….

So, a relatively straightforward data mix-up… take some senior civil servant salary data, mix in some headline Comprehensive Spending Review departmental spends, and add budget revenue data.

What represents a relatively small variation in absolute salary terms amounts to huge variability – up to 250 times more – in the level of financial responsibility for the same £1000 of salary. The question still begs: what is the right level of salary? Where financial responsibility is high relative to salary, does this mean we are getting the best value for money? Or alternatively, are we taking too high a level of risk in having such a low level of management cost relative to the level of budget responsibility?

This is really one of those “provocative-indicative” analyses. There are some simplifications and assumptions here, but not so many as to distract from an overall message about the relative disparity in salary and responsibility. Of course financial responsibility is really just a proxy for overall responsibility; there are plenty of other factors at play. The FCO has a management scope and complexity which is global in nature, quite unlike Work and Pensions. Similarly there will be certain responsibilities which are bigger than the financial ones – including outside the mainstream departments – and the Nuclear Decommissioning Authority might well be a case in point (Chief Executive salary of £365k). As usual the numbers – the things we can measure more directly – both ground and open the discussion.

NOTES: There are some imperfections in all of this. The reality of course is that there is some legwork in getting to the data and making the connections. Not only are the salaries in bands, but the data is provided as text rather than numeric fields. So we need to first mix in a few spreadsheet functions to convert text to numbers, and use the lower value of the banding. So it’s more of an approximation – although a consistent one – than an exact measure. Similarly, whether salaries have been consistently defined across departments (to include all benefits standardised in the same way) we cannot be certain.
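That band-to-number step could equally be scripted. A minimal sketch in Python, assuming a band published as text like “£195,000 - £199,999” (the exact format in the released files may differ):

```python
import re

def band_lower_bound(band: str) -> int:
    """Return the lower value of a salary band given as text."""
    figures = re.findall(r"£([\d,]+)", band)
    if not figures:
        raise ValueError(f"No salary figure found in: {band!r}")
    # Take the first (lower) figure, stripping the thousands separators.
    return int(figures[0].replace(",", ""))

print(band_lower_bound("£195,000 - £199,999"))  # 195000
```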



Impactometer

In the world of open data there is lots of it – at the last count, 5600 data sets through data.gov.uk. Lots of this and more gets usefully highlighted by the expert “brokers” (such as the Guardian). Some great apps emerge, and some lovely visualisations. But the depth of impact can be limited and the longevity short-lived. As someone who has been a deep user of data for strategic purposes – academic, policy, regulatory, operational – this can all feel a little superficial.

In short there’s plenty to publish, publicise and visualise…

1. Published. Not just through the main portals such as data.gov.uk and the Office for National Statistics but, in light of the UK Statistics Authority Code of Practice, there’s a greater consistency in the publication of statistical releases at the departmental level.

2. Publicised. Some sets of data get the big headlines, such as police crime maps. But of course there’s the rolling highlight from media champions… such as the Guardian and Straight Statistics.

3. Visualised. For some of this data we see some typically intricate, elaborate and attractive visualisations. See for example the Guardian’s Flickr pages, with nearly 500 visualisations to date and nearly 1000 members.

So we can get to something which is “pretty nice”; the question then begs, is it also “pretty useful”? So I’ve developed the data “Impactometer” as a simple tool to evaluate effectiveness. Let’s keep the published, publicised and visualised. Then add on some analysis (whether driven by hypothesis or data mining), synthesis of that analysis (to distil the key messages for insight and understanding), and then the subsequent impact. It’s that impact which can lead to improvement, change and even transformation.

It can seem like there’s lots of publish, publicise, visualise, and then it’s really on to the next item to publish, publicise and visualise… so this can feel quite perishable. And while these data sets are publicised, in some cases the comments section gets closed after a week or so. Given that knowledge builds over time, there’s no easy way to consolidate it at present.


So while we have a holding framework for the data, perhaps we need a holding framework for the analysis that has flowed from that data… so analysis.gov.uk? A bit like the way academia broadly ‘stores’ its analysis through the peer-reviewed journal process, so the new researcher can stand on the shoulders of the giants that have gone before. Certainly it should provide a home for more than the visualisations, to include the basic descriptive analysis and the basic messages – at least the more straightforward ones…

After all, any more advanced analysis will need to go through that basic descriptive stage anyway – important context for the deeper analysis – so maybe there’s an opportunity for a head start, or at least some early validation of the basics.

So, a home for more analysis, including the audit trail for the methodology of that analysis. I see all this in the form of a recipe card… the specific data ingredients, the techniques applied, the tools used and the output. Plus some lessons learned for next time… perhaps a little too sweet this time….


Transparency Timing

Web analytics is a growth sector, not least because much of the web traffic data is relatively available, being generally produced automatically behind the scenes for any web site.

I’ve used the web stats from the official web site of the Prime Minister’s Office, Number10.gov.uk, as an illustration of the first level of web stats (see: presentation to the Royal Statistical Society’s International Conference 2010: Web analytics – a new statistical domain). This has been a good example of the industry standard web stats (ABCe), and has been a great exemplar for speed of publication and transparency.


Not only are these industry standard data, but No.10 has published them within a day or two of the end of the month. There’s always a need to balance the speed of publication against the need to ensure its quality and consistency. But because this is simple data (“it is what it is”), collected automatically and to industry standard definitions, this timing is quite achievable and reasonable. Therein lies at least one of the secrets to prompt publication: sort as much as possible at stages higher up the data food chain.

So for No.10, the time from the end of the monthly data collection period (midnight on the last day) to publication is sometimes measurable in hours rather than days. In comparison (and from experience), after quality assurance work the mainstream official and national statistics data is often only available weeks, typically months, sometimes even years after the end of the data collection period.

While these No.10 web stats are arguably not official statistics (see the Statistics and Registration Service Act 2007), the UK Statistics Authority best practice guide (Code of Practice: Protocol 2, Practice 1) encourages release at the earliest opportunity: “release statistical reports as soon as they are judged ready, so that there is no opportunity, or perception of opportunity, for the release to be withheld or delayed.”

But now the flow of data has stopped.  The last published data was for November 2010.   It's just that the previous data shows steadily reducing page views per visit each month….

The three pieces of monthly web data provided are (a) page views, (b) visits and (c) unique visits. And here’s the data so far....



All three measures showed month-on-month reductions through June, July, August and September, then rose a little for October and November. Technically we would expect some variation here simply because the months do not all have the same number of days, and the summer months are typically quieter.

A simple and common enough web analytics calculation derived from these measures is page views per visit (which at the same time standardises for the length of the month) – in the simplest of terms, what I’ll call an “Interesting Index”. There are some implicit assumptions going on here. Most important (given that this is a ratio) is that the number of pages on the site has remained stable. After all, if there’s less content then there’s bound to be less staying power.
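As a minimal sketch of that derived measure – the raw monthly counts below are hypothetical placeholders, not the actual No.10 ABCe figures, though chosen to reproduce the ratios discussed next:

```python
# Hypothetical monthly counts (not the published ABCe data).
page_views = {"May": 4_200_000, "Jun": 2_700_000, "Nov": 1_150_000}
visits     = {"May": 1_000_000, "Jun":   900_000, "Nov":   500_000}

# Pages per visit: a ratio, so it also standardises for length of month.
pages_per_visit = {m: round(page_views[m] / visits[m], 1) for m in page_views}
print(pages_per_visit)  # {'May': 4.2, 'Jun': 3.0, 'Nov': 2.3}
```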



Sure enough, when web traffic was highest, in May 2010 just after the election, an average of 4.2 pages were viewed per visit. This has been steadily declining month on month, to 2.3 pages per visit in the latest data. Given that May was a peak, it’s worth looking at June, which averaged 3.0 pages per visit. So from June, in simple terms, the pages viewed per visit reduced by 0.7 over 6 months. At that rate – reducing by 0.7 every 6 months – it’s a little over two years until there are no page views per visit at all… (which technically would imply zero visits).
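For what it’s worth, here’s that straight-line extrapolation as a sketch – crude by design, since (as noted below) the real trend is almost certainly non-linear:

```python
start_value = 3.0            # pages per visit in June
decline_per_month = 0.7 / 6  # 0.7 lost over the six months June to November

# Naive linear extrapolation to zero pages per visit.
months_to_zero = start_value / decline_per_month
print(f"~{months_to_zero:.0f} months from June")  # ~26 months: a little over two years
```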

Of course, with what is a relatively small amount of data, this is all more indicative than insightful. The trend is almost certainly non-linear, tailing off to a stable level. We really need more data to tell, and there, of course, is the problem. Sure, this is not the most important data out there, and it may indeed stumble at the “so what” or “who cares” tests. But the whole open data movement is not about pre-judging potential uses or value; rather it’s about assuming data is “helpful until proven to be unhelpful”. That starts to become a relative proxy for value for money in relation to data collection costs.

So with an open data movement, a big transparency agenda, and the need to build trust in official public statistics, this surprisingly missing data – which was already pointing to declining web site usage – looks like an awkward oversight, especially so from the seat of government and a champion of transparency.