Data Linkage
The importance of collecting and analysing information about episodes of illness for comprehensive evaluation of health system performance has long been recognised. Quality monitoring programs require data from different sources on episodes of care and care provision for individuals. Identifying and bringing together records from diverse data sets that contain information on the same person requires a systematic strategy, known as data linkage.
The following well-developed linkage systems in the health sector illustrate linkage processes, methods and common issues:
- Western Australia Data Linkage System (WADLS)
- Oxford Record Linkage Study (ORLS)
- General Practice Research Database (GPRD)
- Population Health Research Network (PHRN)
- Information Services Division (ISD)
- SAIL Databank
The second part outlines key challenges and methods of data linkage to assist researchers in processing data from different sources. The lessons from the EuroREACH Diabetes Case Study, which linked administrative person-level data from three countries, may give valuable first-hand insights into data processing and linkage.
Western Australia Data Linkage System (WADLS)* | |
Function | Performing linkage within and between core data collections held by the Department of Health, Western Australia (DoHWA); geocoding address information on health records; providing linked and geocoded data to support health planning, evaluation and research; and offering guidance on the use of the data system |
Data sources used | DoHWA (hospital discharges, midwives notifications, cancer registrations, mental health contacts, emergency presentations); Registry of Births, Deaths and Marriages; Electoral Commission, Western Australia. The electoral roll constitutes the primary population listing for the system. |
Linkage & Privacy issues | Linkage is accomplished using identifiers only. Linked datasets have identifiers removed before they are made available to researchers. The data linkage process involves probabilistic methods to calculate the likelihood that two records belong to the same entity (person, family, event, and location). To protect the privacy of persons in the system, the linkage and analysis tasks are performed separately. |
Access to system |
Applicants for data must submit an application form specifying the datasets and variables required for the project. Applications must be approved not only by the Department of Health and the WADLS, but also by the custodians of each of the data collections from which information is requested. Data for each linkage project are encrypted. Encryption is different for each project, such that it would be impossible to link datasets produced for two different projects. Applicants must certify that they will not attempt to re-identify individuals in the dataset or to use a dataset created for a specific project for a new project. |
Research projects |
Between 1995 and 2011, over 750 research projects were conducted using WADLS linked data. |
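The per-project encryption described above can be illustrated with a keyed hash. This is a minimal sketch and not the WADLS implementation: the function name `project_pseudonym`, the example keys, the made-up linkage ID and the choice of HMAC-SHA256 are all assumptions made for illustration.

```python
import hashlib
import hmac


def project_pseudonym(linkage_id: str, project_key: bytes) -> str:
    """Derive a project-specific pseudonym from an internal linkage ID.

    Hypothetical helper: HMAC-SHA256 stands in for whatever encryption
    a linkage unit actually uses.
    """
    digest = hmac.new(project_key, linkage_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Illustrative keys; in practice each key would stay with the linkage unit.
key_project_a = b"secret-key-for-project-A"
key_project_b = b"secret-key-for-project-B"

person = "LINK-000123"  # made-up internal linkage ID
id_in_a = project_pseudonym(person, key_project_a)
id_in_b = project_pseudonym(person, key_project_b)

# Within one project the same person always receives the same pseudonym,
# so extracts released for that project can be joined to one another.
same_within_project = id_in_a == project_pseudonym(person, key_project_a)

# Across projects the pseudonyms differ, so two released datasets
# cannot be joined on this field without access to the keys.
differs_across_projects = id_in_a != id_in_b
```

Because each project has its own key, a researcher holding extracts from two projects sees unrelated values for the same person in this field.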
Oxford Record Linkage Study (ORLS) | |
Function |
ORLS began as a joint project between the National Health Service (Oxford Regional Health Authority (RHA)) and academics (University of Oxford) and was funded largely by the NHS. The rationale behind the ORLS was to maximize the value of existing data by making linkage possible. After the abolition of the Regional Health Authorities in 1995 ORLS shifted to the University Unit and continued to function, collecting hospital data from individual health authorities within the former Oxford RHA. |
Data sources used |
All hospitalizations, all deaths and some birth certificates for the population of the jurisdictions covered by the administrative Health Authority of the Oxford Region (population 2.5 million), for the period 1963-2009. Mortality data were provided by the Office for National Statistics. Over the years of operation of the ORLS, other sources of information, including laboratory and cancer registry data, have been linked to the system for specific projects. |
Linkage |
Data incorporated into the ORLS were those routinely collected by the NHS, along with names and addresses, which were used for linkage until the inclusion of the NHS number as an identifier on hospital records became more widespread. Personal identifiers are removed from the linked file prior to analysis. |
General Practice Research Database (GPRD), UK | |
Function |
The GPRD is a primary care database containing information derived from electronic medical records for approximately 8% of the population of the United Kingdom. |
Data sources used |
Over 600 practices contribute data to the GPRD, and the system includes data for over 11 million persons (1). Participating practices record each episode of illness, new symptoms, patient encounters, diagnoses and abnormal laboratory test results, referrals to outpatient clinics and hospital admissions. The system also captures all prescriptions generated by the GP and other information recorded by practice staff, including vaccinations, weight and blood pressure measurements (2). |
Linkage |
GPRD data have been linked to national death and hospitalization data and disease registers at the person level, and to socioeconomic and census data at the small area level (1). |
Others |
The Directory of Clinical Databases, created in 2001, provides descriptions and independent assessment of multicenter clinical databases containing person-level information in existence in the UK. The directory was compiled through structured interviews with data custodians and data quality was assessed using a standardized instrument (3). |
Population Health Research Network (PHRN), Australia | |
PHRN is a 21st century initiative to collaboratively build nationwide data linkage infrastructure for population health related research in Australia. |
|
Function |
The objective is to access linkable de-identified data from a diverse and rich range of health datasets, across jurisdictions and sectors, in order to support nationally and internationally significant population-based research that will improve health and enhance the delivery of health care services in Australia. The PHRN develops technology to ensure the safe and secure linking of data collections whilst working to protect people's privacy. |
Linkage |
To link data about the same person from several data collections, data linkers create unique Linkage IDs that allow the linkage of different data sources. |
Source | www.phrn.org.au |
Information Services Division (NHS), Scotland | |
Function | ISD Scotland manages a national database of administrative, management and clinical data. Information is drawn from a wide range of existing data collections belonging to ISD and to other organizations (4,5). |
Data sources used |
Data sources include NHS and private hospitals, vital records, general practitioners, pharmacists and hospices, linked using probability matching (6). An example of a linked database maintained by ISD is the linked acute dataset, which brings together hospital inpatient data (including diagnoses and procedures), cancer diagnoses, demographic and mortality data (7). |
Linkage |
The data can be used alone or linked to other datasets using the ISD’s Record Linkage Service. ISD has a service contact point, the electronic Data Research and Innovation Service (eDRIS), to assist researchers in study design, approvals and data access in a secure environment. |
Source | http://www.isdscotland.org/ |
SAIL (Secure Anonymised Information Linkage) Databank, Wales |
|
Function |
The SAIL (Secure Anonymised Information Linkage) databank is maintained by the Health Information Research Unit (HIRU) of the School of Medicine at Swansea University (8). It aims to link the widest possible range of person-based data using robust anonymisation techniques for health related research. It holds over 500 million records and continues to grow. |
Data sources used |
Data are drawn from a variety of sources, including hospitals, general practices, and social services authorities.
|
Linkage | Linkage of datasets from different sources is accomplished using the 10-digit National Health Service (NHS) number. Demographic data for the covered population are drawn from the NHS Administrative Register (NHSAR), which lists all persons who have registered with or received care from the health services in Wales. A number of steps are taken to maintain the confidentiality of the data. Data are split into clinical and demographic components at the level of the source organization, with the addition of a field to allow for later linkage, and an anonymous linking field is assigned to substitute for identifying data and facilitate linkage between files. One challenge in creating this data system was that social service data do not contain the NHS number. A probabilistic linkage strategy using a series of matching variables (first name, last name, gender, postal code and birth date) was therefore required to link these data to health care files. |
Source |
http://www.adls.ac.uk/secure-anonymised-information-linkage-databank/ |
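The split-file approach described for SAIL can be sketched roughly as follows. This is an illustration under stated assumptions, not SAIL's actual code: the function and field names (`split_records`, `join_key`, `anonymous_link`), the in-memory `link_table` and the sample records are all hypothetical.

```python
import secrets


def split_records(records):
    """Split source records into demographic and clinical components.

    A random per-record join key (hypothetical field 'join_key') is added
    to both components so they can be recombined later.
    """
    demographic, clinical = [], []
    for rec in records:
        join_key = secrets.token_hex(8)
        demographic.append({"join_key": join_key,
                            "nhs_number": rec["nhs_number"]})
        clinical.append({"join_key": join_key,
                         "diagnosis": rec["diagnosis"]})
    return demographic, clinical


def assign_anonymous_links(demographic, link_table):
    """Substitute an anonymous linking field for the identifying data.

    link_table maps NHS number -> anonymous link. It is extended as new
    people are seen and is assumed to stay with a trusted party.
    """
    anonymised = []
    for row in demographic:
        nhs = row["nhs_number"]
        if nhs not in link_table:
            link_table[nhs] = secrets.token_hex(8)
        anonymised.append({"join_key": row["join_key"],
                           "anonymous_link": link_table[nhs]})
    return anonymised


def recombine(anonymised, clinical):
    """Attach the anonymous link to clinical data using the join key."""
    by_key = {row["join_key"]: row["anonymous_link"] for row in anonymised}
    return [{"anonymous_link": by_key[c["join_key"]],
             "diagnosis": c["diagnosis"]} for c in clinical]


def anonymise(records, link_table):
    demographic, clinical = split_records(records)
    return recombine(assign_anonymous_links(demographic, link_table), clinical)


# Made-up records for the same person held by two source organizations.
hospital = [{"nhs_number": "1234567890", "diagnosis": "E11"}]
general_practice = [{"nhs_number": "1234567890", "diagnosis": "I10"}]

link_table = {}
hospital_out = anonymise(hospital, link_table)
gp_out = anonymise(general_practice, link_table)
```

In this sketch the two output files carry the same anonymous link for the person and no NHS number, so they can be joined for research without exposing the identifier.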
Key challenges of data linkage
Preparation of a linked data set involves identifying the sources and quality of the required data elements and establishing a method of combining them to create a more complete picture of the experience of individuals than could be obtained from any single data source. Data linkage requires not only a thorough understanding of the component databases to be linked, but also expertise in statistics and programming in order to establish a methodology for identifying matches between files, while minimizing errors (9).
Linkage of data is simplest when all of the data sources use a common unique key to identify individual members. The ideal identifier is unique, permanent, and applicable to the entire population of interest. Unique identifiers assigned at birth exist in a number of countries, including Sweden, Norway, Denmark, and Israel (10, 11). In practice, however, numbering systems are not universal, even within health systems (12). Therefore, other identifying information, such as name, birth date, gender, and residence may be taken into consideration in order to identify matching records. In the event that the purpose of the project is to identify discrete episodes of care, dates of service may also be of importance in linking different services related to a single episode of care.
Gill et al. describe three steps in the linkage process:
- Ordering or blocking a file to make the process of searching for matches more efficient. The ordering of the file will depend on the keys used to establish links (for example, identification number or surname). In the event that names are used as a key, they may be translated into phonetic equivalents (such as a Soundex code) that allow for matches between similar names and control for the effects of misspellings.
- Matching, or identification of pairs of records belonging to the same person
- Linkage of matched records relating to a single person into a composite record (12)
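The three steps can be sketched in a few lines. This is a simplified illustration, not the procedure used by Gill et al.: the record layout, the matching rule (same birth date and sex within a block) and the function names are assumptions, and the Soundex routine is a basic version of the standard code.

```python
from collections import defaultdict

SOUNDEX_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                  ("L", "4"), ("MN", "5"), ("R", "6"))
SOUNDEX_CODES = {ch: digit for letters, digit in SOUNDEX_GROUPS for ch in letters}


def soundex(name):
    """Return a four-character Soundex code (basic implementation)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with equal codes
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so repeated codes count again
    return (result + "000")[:4]


def block(records):
    """Step 1: group records by the phonetic code of the surname."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[soundex(rec["surname"])].append(rec)
    return blocks


def is_match(a, b):
    """Step 2: assumed matching rule, applied only within a block."""
    return a["birth_date"] == b["birth_date"] and a["sex"] == b["sex"]


def link(records):
    """Step 3: merge matched records into one composite record per person."""
    composites = []
    for members in block(records).values():
        groups = []
        for rec in members:
            for group in groups:
                if is_match(group[0], rec):
                    group.append(rec)
                    break
            else:
                groups.append([rec])
        for group in groups:
            composites.append({
                "surnames": sorted({r["surname"] for r in group}),
                "birth_date": group[0]["birth_date"],
                "sex": group[0]["sex"],
                "events": [r["event"] for r in group],
            })
    return composites


# Made-up records: the first two differ only in the spelling of the surname.
records = [
    {"surname": "Smith", "birth_date": "1950-01-02", "sex": "F",
     "event": "hospital admission"},
    {"surname": "Smyth", "birth_date": "1950-01-02", "sex": "F",
     "event": "cancer registration"},
    {"surname": "Jones", "birth_date": "1950-01-02", "sex": "F",
     "event": "death"},
]
people = link(records)
```

Blocking means records are only compared within the same phonetic group, so "Smith" and "Smyth" are compared with each other but never with "Jones".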
Linkage methods
Two basic methods exist for linkage of disparate data sets, deterministic and probabilistic (13):
- Deterministic linkage requires an exact match between linkage variables (identity number, last name/first name, etc.). In cases in which data entry errors, name changes and other factors have resulted in differences between linkage variables in the two files (incorrect coding of identity number, or the appearance of a maiden name in one file and a married name in the other), true matches between records will be missed.
- In probabilistic linkage, in contrast, less than exact matches may be accepted, according to a predetermined method that assigns a score to the level of the match. Level of acceptable error depends on how crucial identification of a specific person is. Different fields may be given different weight (matched birth date may be more important than matching spelling of last name). In cases in which the goal is to link data collected by multiple agencies, probabilistic matching may allow more complete matching of records related to the same individual, at the expense of an increased risk of incorrect matches.
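The contrast between the two methods can be shown with a small sketch. The field names, weights and threshold below are hypothetical values chosen for illustration. In practice weights are usually estimated from the data rather than set by hand, and real linkage software applies more refined comparisons than simple equality.

```python
FIELDS = ("id_number", "first_name", "last_name", "birth_date")

# Hypothetical agreement weights: a matching birth date counts for more
# than a matching last name, as suggested in the text.
WEIGHTS = {"id_number": 6.0, "first_name": 2.0,
           "last_name": 2.0, "birth_date": 4.0}
THRESHOLD = 10.0  # assumed cut-off for accepting a pair as a match


def deterministic_match(a, b):
    """Accept a pair only if every linkage variable agrees exactly."""
    return all(a[f] == b[f] for f in FIELDS)


def match_score(a, b):
    """Sum the weights of the fields on which the two records agree."""
    return sum(WEIGHTS[f] for f in FIELDS if a[f] == b[f])


def probabilistic_match(a, b, threshold=THRESHOLD):
    """Accept a pair if its score reaches the predetermined threshold."""
    return match_score(a, b) >= threshold


# Made-up records: the same woman under her maiden and married names.
hospital = {"id_number": "A100", "first_name": "ANNA",
            "last_name": "MEYER", "birth_date": "1970-03-15"}
registry = {"id_number": "A100", "first_name": "ANNA",
            "last_name": "SCHMIDT", "birth_date": "1970-03-15"}

# A different person who shares name and birth date but not the ID number.
stranger = {"id_number": "B200", "first_name": "ANNA",
            "last_name": "MEYER", "birth_date": "1970-03-15"}

missed_by_deterministic = not deterministic_match(hospital, registry)
found_by_probabilistic = probabilistic_match(hospital, registry)
stranger_rejected = not probabilistic_match(hospital, stranger)
```

With these assumed weights the name change costs the true pair only 2 points (score 12), so it is still linked, while the stranger scores 8 and is rejected. Lowering the threshold to 8 would accept the stranger as well, which is the increased risk of incorrect matches noted above.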
Nicknames, misspellings, reversals of first and last name, and name changes (such as in the event of marriage) must be taken into consideration in assessing the quality of a match. Linkage software facilitates the process, but must be tailored to the population under study. Two products in the public domain are Link Plus, developed by the US Centers for Disease Control, and The Link King, developed by the Washington State Division of Alcohol and Substance Abuse (14).
In probabilistic matching, a pair of records is assigned a score that represents the probability that they are a true match. A decision must be made whether it is more important to ensure that accepted matches are true (in which case the number of false negatives will be higher) or to maximize overall matches (in which case the number of false positives will be higher). While it may be tempting to choose a strategy that minimizes the sum of false positives and false negatives, this may in fact be the worst approach. Clerical review is an additional tool for establishing a match.
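The trade-off between false positives and false negatives can be made concrete with a small sketch. The scored pairs below are invented for illustration, with the true match status assumed known; the cut-off values and the function names `error_counts` and `classify` are likewise assumptions.

```python
def error_counts(scored_pairs, cutoff):
    """Count (false positives, false negatives) for a single cut-off.

    scored_pairs is a list of (score, is_true_match) tuples.
    """
    false_pos = sum(1 for score, truth in scored_pairs
                    if score >= cutoff and not truth)
    false_neg = sum(1 for score, truth in scored_pairs
                    if score < cutoff and truth)
    return false_pos, false_neg


def classify(score, lower, upper):
    """Use two cut-offs and send the uncertain middle band to clerical review."""
    if score >= upper:
        return "link"
    if score < lower:
        return "non-link"
    return "clerical review"


# Invented pairs: (match score, whether the pair truly is the same person).
pairs = [(12.0, True), (9.5, True), (8.0, False), (7.0, True),
         (6.5, False), (3.0, False), (1.0, False)]

strict = error_counts(pairs, cutoff=10.0)   # few false positives
lenient = error_counts(pairs, cutoff=6.0)   # few false negatives

decisions = [classify(score, lower=6.0, upper=10.0) for score, _ in pairs]
```

In this example the strict cut-off gives no false positives but misses two true matches, while the lenient cut-off finds every true match but accepts two false ones. With two cut-offs, the four pairs scoring between 6.0 and 10.0 go to clerical review rather than being decided automatically.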
A recent review (15) identified reasons for incomplete linkage between files that may impact on the findings of the research, and pointed out that no standard grading system exists for rating the quality of linked data. For example, linkage of a study population to a disease registry by patient name may result in fewer matches for women because of name changes, resulting in underestimation of the rate of the disease of interest in women compared to men.
The following issues must be taken into consideration in linking data from multiple sources for the creation of a research data set:
- Is use of health care data and linkage to other data sources for the purposes of research done with the express permission of the subjects, or with a waiver of consent?
- Who will be responsible for creating linked data sets? Where will the linked data set be located? Who will have access and under what circumstances? Will linked data be de-identified?
- An accurate listing of the underlying population is crucial for the calculation of rates of health events and mortality. Examples of linked national data systems that incorporate a population registry include those of Scandinavian countries, Canadian provinces, and the Western Australian registry, which was built from electoral roll registrations (16).
- How will the privacy of individuals in the data sets be protected? One option is to remove identifiers from the final file after linkage has been completed. Other options include single-coding (wherein the researcher holds a key to identify participants separately from the research file) and double-coding (wherein the data custodian, rather than the researcher using the data, holds the key to identifying participants) (17).
* Based on presentation by Prof. Michael Goldacre at EuroREACH expert panel meeting, Tel Aviv, Israel, May 23-24, 2011.
(1) General Practice Research Database [Internet]. http://www.cprd.com/intro.asp, last accessed 07 March 2013.
(2) Herrett E, Thomas SL, Schoonen WM, Smeeth L, Hall AJ. Validation and validity of diagnoses in the General Practice Research Database: a systematic review. British Journal of Clinical Pharmacology 2010;69(1):4-14.
(3) Black N, Barker M, Payne M. Cross sectional survey of multicenter clinical databases in the United Kingdom. Br Med J 2004, 328(7454):1478.
(4) ISD Scotland, http://www.isdscotland.org/About-ISD/Data-Collection/, last accessed 16 June 2011.
(5) NHS National Services Scotland, Information Services Division. Statement of Administrative Sources, Version 1.0. January 2010. http://www.isdscotland.org/About-ISD/Data-Collection/, last accessed 16 June 2011.
(6) ISD Scotland, Data Anonymisation. Managing Patient identifying Data: Best Practice Guidelines, Draft V 0.3. September 2003. http://www.isdscotland.org/About-ISD/Confidentiality/ISD_anon_guide.pdf, last accessed 16 June 2011.
(7) Morris, C. Linking health information in Scotland. NHS National Health Services Scotland, Information Services Division [presentation]. http://www.ihdln.org/wp-content/uploads/2008/12/sg_awareness_session_081208.pdf, last accessed 23 June 2011.
(8) Lyons RA, Jones KH, John G, Brooks CJ, Verplancke JP, Ford DV, Brown G, Leake K. The SAIL databank: linking multiple health and social care datasets. BMC Medical Informatics and Decision Making 2009, 9:3 doi:10.1186/1472-6947-9-3. http://www.biomedcentral.com/1472-6947/9/3, last accessed 23 June 2011.
(9) Bradley CJ, Penberthy L, Devers KJ, Holden DJ. Health Services Research and Data Linkages: Issues, Methods, and Directions for the Future. HSR: Health Services Research 2010: 45:5, Part II.
(10) Lunde AS. The birth number concept and record linkage. Am J Public Health. 1975; 65(11):1165-9.
(11) Lunde, A. S., Lundeborg, S., Lettenstrom, G.S., Thygesen, L., Huebner, J. Person-Number Systems of Sweden, Norway, Denmark and Israel. DHHS Publication No. (PHS) 80-1358 Series 2 No. 84.
(12) Gill L, Goldacre M, Simmons H, Bettley G, Griffith M. Computerized linking of medical records: methodological guidelines. Journal of Epidemiology and Community Health 1993; 47: 316-319.
(13) NCSIMG 2004. Statistical Data Linkage in Community Services Data Collection. Canberra: Australian Institute of Health and Welfare.
(14) Campbell KM, Deck D, Krupski A. Record linkage software in the public domain: a comparison of Link Plus, The Link King, and a ‘basic’ deterministic algorithm. Health Informatics Journal 2008; 14:5-15.
(15) Bohensky MA, Jolley D, Sundararajan V, Evans S, Pilcher DV, Scott I, Brand CA. Data Linkage: A powerful research tool with potential problems. BMC Health Services Research 2010, 10:346 doi:10.1186/1472-6963-10-346. http://www.biomedcentral.com/1472-6963/10/346, last accessed 30 December 2010.
(16) Cameron CM, Purdie DM, Kliewer EV, Wajda A, McClure RJ. Population health and clinical data linkage: the importance of a population registry. Australian and New Zealand Journal of Public Health 2007; 31:459-63.
(17) Canadian Institutes for Health Research. CIHR Best Practices for Protecting Privacy in Health Research. September 2005. http://www.cihr-irsc.gc.ca/e/documents/et_pbp_nov05_sept2005_e.pdf, last accessed 23 June 2011.