The Office for National Statistics (ONS) has published a review of data-linking practices across government and elsewhere, with the aim of making data more useful for government decision-making.
The guidance, Joined up data in government: the future of data linking methods, is part of a series, known as Data and Analysis Method Reviews, under the oversight of Ian Diamond as head of the analysis function at the ONS.
Diamond is the UK’s national statistician as chief executive of the UK Statistics Authority and head of the UK Government Statistical Service, and has become a familiar face on our TV screens during the Covid-19 pandemic.
Although the ONS review mentions the challenges of accessing and sharing data, these fall outside the scope of this methods review.
The guidance highlights data linkage work done during the pandemic as an example of what can be done to improve government decision-making. The guidance states: “The lack of ethnicity information on death registrations was overcome by linking death registrations with the 2011 census. This allowed for further research into the effects of the coronavirus pandemic on different ethnic groups.”
The review arrives at a time when greater centralisation of government data, in the name of treating data as a strategic priority, is the order of the day.
This has been a big theme in the thinking of Dominic Cummings, chief adviser to the prime minister.
There have been signs, small and large, of a consistent drive to join up data better. Before the pandemic set in, the Department for Digital, Culture, Media and Sport (DCMS) announced it was looking for consultants to undertake a short-term project to improve data sharing across government.
And, on a more ambitious scale, Boris Johnson announced, on the very day that Parliament was packing its bags for the summer recess, that responsibility for government use of data had been transferred from DCMS to the Cabinet Office.
That move followed swiftly on from the government’s announcement of the creation of a new analytical unit at Number 10, 10ds, aimed at driving change across Whitehall, using data science.
The ONS guidance review, published this week, says: “While there is a lot of data linkage taking place across government, this is often conducted in isolation with limited knowledge sharing. There needs to be a joined-up approach to ensure that data linkage is at the heart of improvements to official statistics.
“Furthermore, UK government linkage is falling behind other countries, especially those that have population registers and where ID numbers can be used for linkage.
“Therefore, time and investment are required for optimising and applying data linkage methods and ensuring that government has the skills required to link data optimally.”
The guidance describes data linkage as “the process of joining datasets through deciding whether two records, in the same or different datasets, belong to the same entity”.
It gives this example of data linkage: “The Ministry of Justice (MoJ) and the Department for Education (DfE) share data on childhood characteristics, educational outcomes and (re)-offending. This data share includes 20 DfE datasets, including data on academic achievement, pupil absence and pupil exclusions. It also includes 11 MoJ datasets, including data on offenders’ criminal histories, court appearances and time in prison. Each dataset has a unique ID variable that can be used to link across the datasets.”
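To make the idea of linking on a shared ID concrete, here is a minimal Python sketch of deterministic linkage between two toy datasets. The field names (`person_id`, `absence_days`, `court_appearances`) and all values are invented for illustration and are not taken from the DfE or MoJ data described in the guidance.

```python
# Hypothetical toy records; field names and values are invented for illustration.
education = [
    {"person_id": 1, "absence_days": 4},
    {"person_id": 2, "absence_days": 19},
    {"person_id": 3, "absence_days": 0},
]
justice = [
    {"person_id": 2, "court_appearances": 1},
    {"person_id": 4, "court_appearances": 3},
]


def link_on_id(left, right, key):
    """Deterministic linkage: pair records whose key values are equal."""
    # Index the right-hand dataset by key so each lookup is constant time.
    index = {}
    for rec in right:
        index.setdefault(rec[key], []).append(rec)

    linked = []
    for rec in left:
        for match in index.get(rec[key], []):
            # Merge the two records into one linked record.
            linked.append({**rec, **match})
    return linked


linked = link_on_id(education, justice, "person_id")
print(linked)
```

Only the record with `person_id` 2 appears in both toy datasets, so it is the only one linked. Where no reliable shared ID exists, linkage has to fall back on comparing fields such as names and dates of birth, which is where probabilistic methods come in.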
The review features a slew of expert and peer-reviewed essays on state-of-the-art data-linkage methods and applications from recognised experts.
However, it highlights the trade-off “between maintaining privacy of entities and linkage quality” as a challenge faced by government departments.
It also looks at the difficulties caused by the use of different software to link data. “Additionally, most open source software is not suitable for linking millions of records – a requirement for many government linkage projects,” it adds.
One linked review document describes Splink, the Ministry of Justice’s in-house open source software solution for linkage. “This is an application of the expectation-maximisation algorithm to the Fellegi-Sunter linkage model, run on Apache Spark,” it says. “The package has tested well on datasets containing 15 million records. Such software needs further testing to find solutions suitable for large-scale government linkage.”
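For readers unfamiliar with the approach named in that quote, the following is a toy sketch in plain Python of expectation-maximisation applied to a Fellegi-Sunter-style model. It is not Splink's API and does not use Apache Spark. It assumes each candidate record pair has been reduced to a tuple of binary agree/disagree flags, that fields are conditionally independent, and it uses invented comparison data and arbitrary starting values.

```python
import math


def clamp(p, eps=1e-6):
    """Keep a probability away from 0 and 1 so logs and divisions stay defined."""
    return min(max(p, eps), 1.0 - eps)


def likelihood(gamma, probs):
    """Probability of an agreement pattern, assuming independent fields."""
    out = 1.0
    for g, p in zip(gamma, probs):
        out *= p if g else (1.0 - p)
    return out


def fit_em(vectors, n_iter=50, lam=0.1, m_init=0.9, u_init=0.1):
    """Estimate lam (share of matches), m (P(agree | match)) and
    u (P(agree | non-match)) from unlabelled comparison vectors."""
    k = len(vectors[0])
    m = [m_init] * k
    u = [u_init] * k
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        w = []
        for g in vectors:
            pm = lam * likelihood(g, m)
            pu = (1.0 - lam) * likelihood(g, u)
            w.append(pm / (pm + pu))
        # M-step: re-estimate parameters from the weighted counts.
        total_w = sum(w)
        total_not = len(vectors) - total_w
        lam = clamp(total_w / len(vectors))
        m = [clamp(sum(wi * g[j] for wi, g in zip(w, vectors)) / total_w)
             for j in range(k)]
        u = [clamp(sum((1.0 - wi) * g[j] for wi, g in zip(w, vectors)) / total_not)
             for j in range(k)]
    return lam, m, u


def match_weight(gamma, m, u):
    """Sum of per-field log2 likelihood ratios for one agreement pattern."""
    total = 0.0
    for g, mk, uk in zip(gamma, m, u):
        total += math.log2(mk / uk) if g else math.log2((1.0 - mk) / (1.0 - uk))
    return total


# Invented comparison vectors: (name agrees, birth date agrees, postcode agrees).
vectors = ([(1, 1, 1)] * 8 + [(1, 1, 0)] * 2 + [(1, 0, 0)] * 5 +
           [(0, 1, 0)] * 5 + [(0, 0, 1)] * 5 + [(0, 0, 0)] * 75)

lam, m, u = fit_em(vectors)
print("estimated match share:", lam)
print("weight, all fields agree:", match_weight((1, 1, 1), m, u))
print("weight, no fields agree:", match_weight((0, 0, 0), m, u))
```

Pairs with a high match weight would be treated as links and those with a low weight as non-links. Choosing the thresholds, and scaling the comparison step to millions of records, is the harder part that the review says needs further testing.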
The guidance also flags the use of graph databases as a method for storing and processing data in linkage projects. “This allows data linkers to store relationships between records in the database, maintaining knowledge of their potential links,” it says. “This knowledge can inform subsequent linkage when more data is added or changed.
“Graph databases are a new approach for linkage projects and further research is needed to understand its robustness and utility in government.”
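The idea of keeping potential links as stored relationships can be illustrated without a graph database at all. The sketch below is an in-memory stand-in written in plain Python, not an example of any particular graph database product. The `LinkGraph` class, the record identifiers and the scores are all hypothetical.

```python
from collections import defaultdict


class LinkGraph:
    """Records are nodes; candidate links are scored, undirected edges."""

    def __init__(self):
        self.edges = defaultdict(dict)

    def add_link(self, a, b, score):
        # Store the link in both directions; re-adding a link updates its score.
        self.edges[a][b] = score
        self.edges[b][a] = score

    def clusters(self, threshold):
        """Group records connected by links scoring at or above the threshold."""
        seen = set()
        result = []
        for start in sorted(self.edges):
            if start in seen:
                continue
            stack, component = [start], set()
            while stack:
                node = stack.pop()
                if node in component:
                    continue
                component.add(node)
                for neighbour, score in self.edges[node].items():
                    if score >= threshold and neighbour not in component:
                        stack.append(neighbour)
            seen |= component
            result.append(sorted(component))
        return result


graph = LinkGraph()
graph.add_link("dfe:1", "moj:7", 0.95)   # strong candidate link
graph.add_link("moj:7", "hmrc:3", 0.40)  # weak link, kept rather than discarded

before = graph.clusters(threshold=0.8)

# New data later strengthens the weak link, so the stored edge is updated.
graph.add_link("moj:7", "hmrc:3", 0.90)
after = graph.clusters(threshold=0.8)

print(before)
print(after)
```

Because the weak link was retained rather than thrown away, later evidence can update it and merge the two groups of records, which is the kind of benefit the guidance attributes to storing relationships between records.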