HealthDataNavigator Share your ideas for improving health data and system analysis at the international level

Data Linkage

The importance of collecting and analysing information about episodes of illness for comprehensive evaluation of health system performance has long been recognised. Quality monitoring programs require data from different sources on episodes of care and care provision for individuals. Identifying and bringing together records from diverse data sets that contain information on the same person requires a systematic strategy, known as data linkage.

The following well-developed linkage systems in the health sector illustrate linkage processes, methods and prevalent issues.

The second part outlines key challenges and methods of data linkage to assist researchers in processing data from different sources. The lessons from the EuroREACH Diabetes Case Study in linking administrative person-level data from three countries may give valuable first-hand insights into data processing and linkage.

 Western Australia Data Linkage System (WADLS)*
Function

Performing linkage within and between core data collections held by the Department of Health, Western Australia (DoHWA), geocoding address information on health records, providing linked and geocoded data to support health planning, evaluation and research, and offering guidance on the use of the data system.

Data sources used

DoHWA (hospital discharges, midwives notifications, cancer registrations, mental health contacts, emergency presentations); Registry of Births, Deaths and Marriages; Electoral Commission, Western Australia. The electoral roll constitutes the primary population listing for the system.

Linkage & Privacy issues

Linkage is accomplished using identifiers only. Linked datasets have identifiers removed before they are made available to researchers. The data linkage process involves probabilistic methods to calculate the likelihood that two records belong to the same entity (person, family, event or location). To protect the privacy of persons in the system, the linkage and analysis tasks are performed separately.

Access to system

Applicants for data must submit an application form specifying the datasets and variables required for the project. Applications must be approved not only by the Department of Health and the WADLS, but also by the custodians of each of the data collections from which information is requested.  Data for each linkage project are encrypted.  Encryption is different for each project, such that it would be impossible to link datasets produced for two different projects.  Applicants must certify that they will not attempt to re-identify individuals in the dataset or to use a dataset created for a specific project for a new project.
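
The idea of project-specific encoding can be illustrated with a minimal sketch. The source does not describe how WADLS implements it, so the example below swaps in keyed hashing (HMAC-SHA256) as a stand-in for the encryption step. The function names `project_key` and `project_id` and the sample linkage ID are hypothetical.

```python
import hashlib
import hmac
import secrets


def project_key() -> bytes:
    """Generate a fresh random secret for one project (hypothetical helper)."""
    return secrets.token_bytes(32)


def project_id(linkage_id: str, key: bytes) -> str:
    """Derive a project-specific pseudonym from an internal linkage ID."""
    return hmac.new(key, linkage_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Two projects each receive their own secret key.
key_a = project_key()
key_b = project_key()

pid_a = project_id("person-000123", key_a)  # made-up internal ID
pid_b = project_id("person-000123", key_b)

# Within one project the same person always maps to the same pseudonym,
# so records can still be followed across that project's datasets.
assert pid_a == project_id("person-000123", key_a)

# Across projects the pseudonyms differ, so two extracts cannot be joined
# on this field without access to the keys.
assert pid_a != pid_b
```

Because the keys would stay with the linkage unit in such a design, a researcher holding extracts from two projects would see unrelated values for the same person.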

Research projects

Between 1995 and 2011, over 750 research projects were conducted using WADLS linked data.



Oxford Record Linkage Study (ORLS) *


ORLS began as a joint project between the National Health Service (Oxford Regional Health Authority (RHA)) and academics (University of Oxford) and was funded largely by the NHS. The rationale behind the ORLS was to maximize the value of existing data by making linkage possible.

After the abolition of the Regional Health Authorities in 1995, ORLS shifted to the University Unit and continued to function, collecting hospital data from individual health authorities within the former Oxford RHA.
Since 2005, the NHS National Information Centre (funded by the Department of Health) has held English national data on a variety of topics and has performed linkage. The Oxford group continues to do English national linkage with funding from the National Institute for Health Research, and continues to take the Oxford subset for ORLS.

Data sources used

All hospitalizations, all deaths and some birth certificates for the population of the jurisdictions covered by the administrative Health Authority of the Oxford Region (population 2.5 million), for the period 1963 to 2009. Mortality data were provided by the Office for National Statistics. Over the years of operation of the ORLS, other sources of information, including laboratory and cancer registry data, have been linked to the system for specific projects.


Data incorporated into the ORLS were those routinely collected by the NHS, along with names and addresses, which were used for linkage until the inclusion of the NHS number as an identifier on hospital records became more widespread. Personal identifiers are removed from the linked file prior to analysis.



General Practice Research Database (GPRD), UK

The GPRD is a primary care database containing information derived from electronic medical records for approximately 8% of the population of the United Kingdom.

Data sources used

Over 600 practices contribute data to the GPRD, and the system includes data for over 11 million persons (1). Participating practices record each episode of illness, new symptoms, patient encounters, diagnoses and abnormal laboratory test results, referrals to outpatient clinics and hospital admissions. The system also captures all prescriptions generated by the GP and other information recorded by practice staff, including vaccinations, weight and blood pressure measurements (2).


GPRD data have been linked to national death and hospitalization data and to disease registers at the person level, and to socioeconomic and census data at the small area level (1).


The Directory of Clinical Databases, created in 2001, provides descriptions and independent assessments of multicenter clinical databases in the UK that contain person-level information. The directory was compiled through structured interviews with data custodians, and data quality was assessed using a standardized instrument (3).



Population Health Research Network (PHRN), Australia

PHRN is a 21st century initiative to collaboratively build nationwide data linkage infrastructure for population health related research in Australia.


The objective is to access linkable de-identified data from a diverse and rich range of health datasets, across jurisdictions and sectors, in order to support nationally and internationally significant population-based research that will improve health and enhance the delivery of health care services in Australia. The PHRN develops technology to ensure the safe and secure linking of data collections while working to protect people's privacy.


To link data about the same person from several data collections, data linkers create unique Linkage IDs that allow the linkage of different data sources. 




Information Services Division (NHS), Scotland
Function

ISD Scotland manages a national database of administrative, management and clinical data. Information is drawn from a wide range of existing data collections belonging to ISD and to other organizations (4,5).

Data sources used

Data sources include NHS and private hospitals, vital records, general practitioners, pharmacists and hospices, linked using probability matching (6). An example of a linked database maintained by ISD is the linked acute dataset, which brings together hospital inpatient data (including diagnoses and procedures), cancer diagnoses, demographic and mortality data (7).


The data can be used alone or linked to other datasets using the ISD’s Record Linkage Service. ISD has a service contact point, the electronic Data Research and Innovation Service (eDRIS), to assist researchers in study design, approvals and data access in a secure environment.




SAIL (Secure Anonymised Information Linkage) Databank, Wales


The SAIL (Secure Anonymised Information Linkage) databank is maintained by the Health Information Research Unit (HIRU) of the School of Medicine at Swansea University (8). It aims to link the widest possible range of person-based data using robust anonymisation techniques for health-related research. It holds over 500 million records and continues to grow.

Data sources used

Data are drawn from a variety of sources, such as hospitals, general practices and social services authorities, including:

  • All Wales Injury Surveillance System
  • NHS Hospital Episode Statistics
  • NHS administrative register
  • ONS births and deaths
  • Breast and Cervical screening data
  • Educational attainment

Linkage

Linkage of datasets from different sources is accomplished using the 10-digit National Health Service (NHS) number. Demographic data for the covered population are drawn from the NHS Administrative Register (NHSAR), which lists all persons who have registered with or received care from the health services in Wales. A number of steps are taken to maintain the confidentiality of the data. These include splitting the data into clinical and demographic components at the level of the source organization, with the addition of a field to allow for later linkage, and assigning an anonymous linking field that substitutes for identifying data and facilitates linkage between files. One of the challenges of creating this data system was that social service data do not contain the NHS number. A probabilistic linkage strategy, using a series of matching variables (first name, last name, gender, postal code and birth date), was therefore required to link these data to health care files.
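
The split-file approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not SAIL's actual implementation: the three roles (source organisation, matching party, databank), the function names, the secret key, the use of HMAC to derive the anonymous linking field and the sample records are all hypothetical.

```python
import hashlib
import hmac
import uuid

SECRET = b"held-only-by-the-matching-party"  # hypothetical key


def split_record(record):
    """Source organisation: split a record into demographic and clinical
    parts that share a random join field."""
    join_key = uuid.uuid4().hex
    demographic = {"join_key": join_key, "nhs_number": record["nhs_number"]}
    clinical = {"join_key": join_key, "diagnosis": record["diagnosis"]}
    return demographic, clinical


def assign_alf(demographic):
    """Matching party: replace the identifier with an anonymous linking field."""
    digest = hmac.new(SECRET, demographic["nhs_number"].encode("utf-8"),
                      hashlib.sha256).hexdigest()
    return {"join_key": demographic["join_key"], "alf": digest[:16]}


def recombine(anonymised, clinical):
    """Databank: attach the anonymous field to the clinical part.
    No identifying data is present at this stage."""
    if anonymised["join_key"] != clinical["join_key"]:
        raise ValueError("parts do not belong to the same record")
    return {"alf": anonymised["alf"], "diagnosis": clinical["diagnosis"]}


# Made-up records for the same person from two sources.
hospital = {"nhs_number": "1234567890", "diagnosis": "E11"}
gp = {"nhs_number": "1234567890", "diagnosis": "I10"}

rows = []
for rec in (hospital, gp):
    demo, clin = split_record(rec)
    rows.append(recombine(assign_alf(demo), clin))

# Both rows carry the same anonymous linking field and no NHS number.
assert rows[0]["alf"] == rows[1]["alf"]
assert all("nhs_number" not in row for row in rows)
```

In this sketch the party that sees identifiers never sees clinical content, and the databank never sees identifiers; the join field only serves to reunite the two halves of one record.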

Key challenges of data linkage

Preparation of a linked data set involves identifying the sources and quality of the required data elements and establishing a method of combining them to create a more complete picture of the experience of individuals than could be obtained from any single data source. Data linkage requires not only a thorough understanding of the component databases to be linked, but also expertise in statistics and programming in order to establish a methodology for identifying matches between files, while minimizing errors (9). 

Linkage of data is simplest when all of the data sources use a common unique key to identify individual members.  The ideal identifier is unique, permanent, and applicable to the entire population of interest.  Unique identifiers assigned at birth exist in a number of countries, including Sweden, Norway, Denmark, and Israel (10, 11).  In practice, however, numbering systems are not universal, even within health systems (12).  Therefore, other identifying information, such as name, birth date, gender, and residence may be taken into consideration in order to identify matching records. In the event that the purpose of the project is to identify discrete episodes of care, dates of service may also be of importance in linking different services related to a single episode of care.

Gill et al. describe three steps in the linkage process:

  1. Ordering or blocking a file to make the process of searching for matches more efficient.  The ordering of the file will depend on the keys used to establish links (for example, identification number or surname).  In the event that names are used as a key, they may be translated into phonetic equivalents (such as a Soundex code) that allow for matches between similar names and control for the effects of misspellings.  
  2. Matching, or identification of pairs of records belonging to the same person.
  3. Linkage of matched records relating to a single person into a composite record (12).
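
The three steps can be sketched as a small pipeline. The example below is a simplified illustration, not the procedure used by any of the systems above: it blocks on the American Soundex code of the surname, treats two records in a block as a match when they come from different sources and share a birth date (an assumed rule chosen for brevity), and merges each matched pair into a composite record. The field names and sample records are invented.

```python
from collections import defaultdict
from itertools import combinations

# Soundex digit for each coded consonant.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character American Soundex code of a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (encoded + "000")[:4]


def block(records):
    """Step 1: group records by the Soundex code of the surname."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[soundex(rec["surname"])].append(rec)
    return blocks


def match(blocks):
    """Step 2: compare records only within a block."""
    pairs = []
    for candidates in blocks.values():
        for a, b in combinations(candidates, 2):
            if a["source"] != b["source"] and a["dob"] == b["dob"]:
                pairs.append((a, b))
    return pairs


def link(pairs):
    """Step 3: merge each matched pair into a composite record."""
    return [{"surname": a["surname"], "dob": a["dob"],
             "events": sorted([a["event"], b["event"]])} for a, b in pairs]


records = [
    {"source": "hospital", "surname": "Smith", "dob": "1950-03-01", "event": "admission 2001"},
    {"source": "deaths", "surname": "Smyth", "dob": "1950-03-01", "event": "death 2009"},
    {"source": "hospital", "surname": "Jones", "dob": "1950-03-01", "event": "admission 2003"},
]
composites = link(match(block(records)))
assert soundex("Smith") == soundex("Smyth") == "S530"
assert len(composites) == 1
```

Blocking keeps "Smith" and "Smyth" together despite the spelling difference, while "Jones" is never compared with them even though the birth date is the same.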

Linkage methods

Two basic methods exist for linkage of disparate data sets, deterministic and probabilistic (13):

  • Deterministic linkage requires an exact match between linkage variables (identity number, last name/first name, etc.).  In cases in which data entry errors, name changes and other factors have resulted in differences between linkage variables in the two files (incorrect coding of identity number, or the appearance of a maiden name in one file and a married name in the other), true matches between records will be missed. 
  • In probabilistic linkage, in contrast, less than exact matches may be accepted, according to a predetermined method that assigns a score to the level of the match.  The acceptable level of error depends on how crucial identification of a specific person is.  Different fields may be given different weights (a matching birth date may be more important than a matching spelling of the last name).  In cases in which the goal is to link data collected by multiple agencies, probabilistic matching may allow more complete matching of records related to the same individual, at the expense of an increased risk of incorrect matches.
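
The contrast between the two methods can be made concrete with a short sketch. The scoring below assumes agreement and disagreement weights in the style of the Fellegi-Sunter model, which the text above does not name; the m and u probabilities, the threshold and the sample records are invented for illustration only.

```python
import math

# For each field (all values are illustrative assumptions):
#   m = probability the field agrees when two records are a true match
#   u = probability the field agrees when two records are not a match
FIELDS = {
    "id_number": (0.98, 0.0001),
    "birth_date": (0.97, 0.003),
    "last_name": (0.90, 0.01),
    "first_name": (0.85, 0.02),
}
THRESHOLD = 10.0  # hypothetical cut-off for accepting a link


def deterministic_match(a, b, keys=("id_number",)):
    """Accept only if every key agrees exactly."""
    return all(a[k] == b[k] for k in keys)


def match_weight(a, b):
    """Sum log2 agreement and disagreement weights over all fields."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if a[field] == b[field]:
            total += math.log2(m / u)              # agreement adds evidence
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts
    return total


def probabilistic_match(a, b):
    return match_weight(a, b) >= THRESHOLD


hospital = {"id_number": "1234567", "birth_date": "1970-05-17",
            "last_name": "Cohen", "first_name": "Ruth"}
# Same person, but two digits of the identity number were transposed.
registry = {"id_number": "1234576", "birth_date": "1970-05-17",
            "last_name": "Cohen", "first_name": "Ruth"}
stranger = {"id_number": "7654321", "birth_date": "1982-11-02",
            "last_name": "Levi", "first_name": "Dan"}

assert not deterministic_match(hospital, registry)  # true match is missed
assert probabilistic_match(hospital, registry)      # recovered by the score
assert not probabilistic_match(hospital, stranger)
```

Lowering the threshold would recover more true matches with damaged fields but would also admit more false links, which is the trade-off described in the second bullet.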

Nicknames, misspellings, reversals of first and last name, and name changes (such as in the event of marriage) must be taken into consideration in assessing the quality of a match.  Linkage software facilitates the process, but must be tailored to the population under study. Two products in the public domain are Link Plus, developed by the US Centers for Disease Control, and The Link King, developed by the Washington State Division of Alcohol and Substance Abuse (14).
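
A name comparison that tolerates some of these problems might look like the sketch below. It does not reflect how Link Plus or The Link King work; the nickname table, the function names and the scoring rule are assumptions, and the similarity measure is the ratio from Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher

# Tiny illustrative nickname table; a real one must suit the population studied.
NICKNAMES = {"bill": "william", "liz": "elizabeth", "bob": "robert"}


def canonical(name):
    """Lower-case a name and replace a known nickname by its full form."""
    cleaned = name.strip().lower()
    return NICKNAMES.get(cleaned, cleaned)


def similarity(a, b):
    """String similarity between 0 and 1 that tolerates small misspellings."""
    return SequenceMatcher(None, canonical(a), canonical(b)).ratio()


def name_score(first_a, last_a, first_b, last_b):
    """Score a pair of names, also trying first and last name reversed."""
    straight = (similarity(first_a, first_b) + similarity(last_a, last_b)) / 2
    swapped = (similarity(first_a, last_b) + similarity(last_a, first_b)) / 2
    return max(straight, swapped)


# Nickname, reversal and misspelling still score highly.
assert name_score("Bill", "Thompson", "William", "Thompson") == 1.0
assert name_score("Thompson", "William", "William", "Thompson") == 1.0
assert name_score("William", "Thomson", "William", "Thompson") > 0.9
# Unrelated names score low.
assert name_score("Anna", "Lee", "William", "Thompson") < 0.5
```

Name changes such as a married name replacing a maiden name are not handled here; they would need additional fields (for example birth date) to carry the match.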

References