10 Re-use of registry data
Re-use of information means that information collected for a given purpose is used for another one. Re-use of clinical data for registry (and other public health) purposes is typically an abstraction process based on some sort of knowledge.
- There are several types of re-use of the data: internal re-use, international comparison (same purpose, different context), cross-registry comparison, comparison with information outside the health domain.
- Both aggregated data and micro data can be re-used, but aggregated data are much easier to re-use.
- Cross-border use of data for public health is well-known and used for cross-country data comparison and surveillance, outbreak alerting and communication, bioterrorism threat, identification of best and cost-effective practices and public health research.
- There are many different applications in the field of cross-border use for research purposes (risk factor studies, genetic research, clinical and therapeutic research).
- Issues such as compatibility, comparability and interoperability need to be taken into account.
- Before planning the re-use of data, the legal background and relevant policies need to be studied carefully at EU level, national level and institutional level.
- 10.1 Background
- 10.2 Why to re-use?
- 10.3 Is re-use possible?
- 10.4 Re-use of data
- 10.5 Types of re-use of registry data
- 10.6 Re-use of aggregations vs. re-use of elementary data
- 10.7 Definition of Possible Types of Data
- 10.8 Cross-border Use for Public Health
- 10.9 Cross-border Use for Research Purposes
- 10.10 Compatibility, comparability and interoperability
- 10.11 Interoperability Standards and Approaches for Data Exchange
- 10.11.1 General Concept
- 10.11.2 eHealth standards
- 10.11.3 Coding schemes, terminologies
- 10.11.4 Ontologies and data structures
- 10.11.5 Mobile health delivery, personalized medicine, and social media applications
- 10.12 Problem with populations
- 10.13 Examples of legal frameworks for data protection and data sharing
- 10.13.1 Policy on data submission, access, and use of data within TESSy
- 10.13.2 European Commission’s proposal for a General Data Protection Regulation
- 10.13.3 European Data Protection Board, General Data Protection Regulation
- 10.13.4 HIPAA Privacy and Security Rules for Public Health Data Exchange
- References
10.1 Background
Re-use of information is a current issue in informatics in general and in health informatics in particular. In 2012 the International Medical Informatics Association organised a summit in Brussels with the title “Trustworthy re-use of health data”. The title itself indicates that re-use of health data is a sensitive issue and that it is important to find ways in which re-use can be done in a trusted manner. The conclusions of the summit have been published in the article referenced above. The participants considered various scenarios of re-use, with a focus on re-use of EHR data. The following sections show that re-use can take place at different levels: all registries re-use clinical data in a certain sense, but at a higher level, and data stored in registries can be re-used again for further purposes.
But first, in order to avoid confusion it is necessary to clearly define what is meant by re-use of registry data.
10.1.1 Definition of re-use
According to information theory, information is “something about something”, i.e. a series of symbols that represents something else. For our purpose it is important to understand from this that all information is only an abstraction of the thing (event or phenomenon) that is represented. No representation can completely describe the represented entity. Due to the abstraction, some features of reality are neglected, and only the relevant attributes of the real-world entity are expressed. The best example of this is the use of identifiers to denote human beings. A “social security number” refers unambiguously to a real person, but such a series of digits or characters can express nothing, or only a very few attributes (e.g. gender, birthdate), of that person.
As a consequence: all reasonable representations are purpose dependent. For a given purpose some features are relevant while others are not. The effective use of the information depends on appropriate selection of relevant features. Naturally, the relevancy depends on the purpose. A very good example of this is the different kinds of maps about the same territory. Maps for touristic purposes will be totally different, for example, from maps for public administration and these differences explain why a map created for some purpose is difficult or even impossible to use for another.
Re-use of information means cases where some information recorded for a given purpose is to be used for another one.
10.1.2 Re-use in the context of patient registries
The fact that all information is purpose dependent seriously limits re-use. This does not mean, of course, that information can never be used for a purpose other than the one for which it was originally recorded.
Sometimes there is a temptation for purposeless data collection, i.e. trying to store everything without defining the goals and future usage of the data. As data acquisition and storage costs decrease, this temptation may grow. In the case of patient registries, privacy concerns prevent us from yielding to the temptation (moreover, in most European countries legislation makes it impossible). It is also important to note that purposeless data gathering is not good practice: it often leads to poor quality of the collected information.
Registries are often, and preferably, realisations of information re-use. Perhaps with the exception of registries created for public health purposes, it may be difficult to justify collecting data just for registry purposes if the data are not relevant or not needed for clinical purposes (this is especially true for hospital-based registries). In this case the primary reason for storing some patient data is the clinical need, and registries should store extracts and abstractions of clinical information. This requirement is addressed in section 10.4.
Summing up these considerations:
- Designing and operating registries should serve well defined purposes
- The normal way of using registry data is to serve the defined purpose
- Re-use of registry data is using data for any other purpose than originally planned for
The next two sections provide brief answers to the emerging questions on why to re-use data and whether re-use is possible.
10.2 Why to re-use?
One could think that if all data collection is purpose dependent, then any re-use of collected data is inappropriate. Sometimes this really is the case. For example, using ICD-coded data in a clinical context can be a misuse, since ICD-coded data are neither sufficiently detailed nor reliable enough to directly serve the care of individual patients. (The reproducibility of ICD codes is around 30%.) This does not mean that such a solution cannot be helpful in certain cases. Theoretically, while all representations of reality (all data about something) are abstractions (some detail is always lost), the remaining details can still convey much useful information that was not in the minds of the designers of the data collection. Beyond that, in the most exciting cases of re-use, data collected for a given specific purpose are merged and analysed with other data (see sections 10.5.3 and 10.5.4), which always gives added value to the data.
Practically, in many health systems a vast amount of information is collected and poorly utilised. Where re-use is possible, it is more advantageous than separate data collections for each purpose: it is much more cost-effective and straightforward.
10.3 Is re-use possible?
In spite of the above-mentioned concerns and limitations, re-use is possible in many cases; however, care is always needed. For example, data originally collected for health care reimbursement can often be used for quality assessment or capacity planning. But it is important to note that using data for financial purposes always induces some distortion. Indeed, all observations somewhat distort the phenomenon that we want to observe. (This is a basic law in quantum physics, but it also applies to many social phenomena.) It is important to measure, or at least estimate, how large the distortion is in order to draw correct conclusions from noisy data.
10.4 Re-use of data
10.4.1 Re-use of clinical data in registries
It is a critical success factor in designing and implementing registries that the administrative burden on health care providers is minimised. Data collection systems should be automated as much as possible. The proper way is to extract all relevant data for a registry from the clinical documentation without much human workload. But this “extraction” is not always straightforward. In Hungary there is a registry for premature newborn babies, and this registry stores information on the administration of surfactants. In the data model of the registry this is just a YES/NO rubric. Naturally, there is no such rubric in the patient records, but all drug administrations (including surfactants) are of course recorded. In order to automate data submission to the registry, an abstraction process has to be implemented that can determine which of the recorded drugs are surfactants.
Therefore, re-use of clinical data for registry (and other public health) purposes is typically an abstraction process based on some sort of knowledge.
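The surfactant example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the function name `surfactant_given` and the short list of surfactant names are hypothetical and are not taken from the Hungarian registry. The "knowledge" in the abstraction process is the set of drug names, which in practice would be maintained by domain experts or derived from a drug classification.

```python
# Hypothetical knowledge base: drug names that count as surfactants.
# Illustrative only; a real list would come from a maintained drug classification.
SURFACTANT_NAMES = {"poractant alfa", "beractant", "calfactant"}


def surfactant_given(administrations):
    """Abstract a list of drug administrations into the registry's YES/NO item."""
    for record in administrations:
        # Normalise the free-text drug name before looking it up.
        if record["drug"].strip().lower() in SURFACTANT_NAMES:
            return "YES"
    return "NO"


# Hypothetical extract of a patient record: every administration is logged individually.
patient_record = [
    {"drug": "Caffeine citrate", "date": "2015-03-01"},
    {"drug": "Poractant alfa", "date": "2015-03-01"},
]

# The registry stores only the abstraction, not the individual administrations.
registry_entry = {"surfactant": surfactant_given(patient_record)}
```

The clinical record keeps its full detail, while the registry receives only the abstracted item it was designed to hold.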
10.4.2 Re-use of spatial data
The use of geographic data in different application domains has resulted in large amounts of data stored in spatial databases. These spatial data can be re-used for health purposes: to share accurate geographic references for tracking communicable diseases by place and time, and to link geo-referenced environmental factors such as air pollution, traffic and the built environment with geo-referenced health outcome data in order to analyse potential associations and identify risk factors. Such spatial data have been used extensively in the health domain in recent years. However, spatial data collected outside the health domain still have enormous potential for re-use within it.
10.5 Types of re-use of registry data
10.5.1 Internal re-use
Whenever an authority establishes a patient registry, the tasks, roles and goals of the registry are defined. The data-model of the registry is ideally designed based on these tasks. It may happen however that later the collected data are used for further purposes. For example, if the original task of a cancer registry is to measure cancer incidence, but later on the same data are used within the registry to estimate cancer prevalence, then this is a case of internal re-use. The term ‘internal’ refers to the fact that the re-use happens in the same organisation operating the registry. Sometimes such internal re-use requires additional data from different sources. In the mentioned example this could be cancer mortality data.
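As a rough illustration of this kind of internal re-use, the sketch below counts prevalent cases on an index date from incidence records combined with death dates. The record fields, the function name `point_prevalence` and the simple counting rule are assumptions made for the example. Real prevalence estimation in cancer registries also has to handle incomplete follow-up, migration and multiple tumours per person.

```python
from datetime import date


def point_prevalence(cases, index_date):
    """Count persons diagnosed on or before index_date and still alive after it.

    Assumes one record per person; a death on the index date itself is
    treated as no longer prevalent (a convention chosen for this sketch).
    """
    prevalent = 0
    for case in cases:
        diagnosed = case["diagnosis_date"] <= index_date
        died = case["death_date"] is not None and case["death_date"] <= index_date
        if diagnosed and not died:
            prevalent += 1
    return prevalent


# Incidence records (registry data) enriched with death dates (mortality data).
cases = [
    {"diagnosis_date": date(2010, 5, 1), "death_date": None},
    {"diagnosis_date": date(2011, 2, 1), "death_date": date(2012, 1, 1)},
    {"diagnosis_date": date(2016, 1, 1), "death_date": None},
]

prevalent_2015 = point_prevalence(cases, date(2015, 1, 1))
```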
10.5.2 International comparison (same purpose, different context)
Patient registries for the same disease (same purpose) have been set up in different countries. Cancer registries are the best example, since most countries operate some sort of cancer registry. There is an evident benefit in cross-country comparison of their data. However, due to the lack of standardisation, such comparison is often not easy. This applies not only to the standardisation of data structures, but also to the aim, scope and organisation of the registries. For example, data from population-based registries are difficult to compare with data from hospital-based registries. Comparing data from a national registry (one single registry for the whole country) with country-level data aggregated from regional registries may raise methodological problems.
10.5.3 Cross-registry comparison (correlation between diseases)
Morbidity patterns are evergreen research topics. Correlation between disease incidences, from either a genetic or a geographic aspect, is the subject of a tremendous number of studies. Patient registry data can be used for this purpose at individual or aggregated level.
Cross-registry comparison of registry data at individual level implies the possibility of merging data about the same person from different registries. However, this does not necessarily require the use of personal data: such investigations can also be performed on pseudo-anonymous data. Different scenarios are possible. Consider two registries for two different diseases. If a comparison is to be made between them, the following options emerge:
- When two registries use personalised data based on the same identifier (e.g. social security number), one way to make a comparison without infringing privacy is to have both registries remove the IDs from the records, replace them with an artificial identifier (pseudonym), and merge the records by this artificial identifier.
- Another option is to aggregate the data in the two registries separately and compare them at aggregate level. This method necessarily has some limitations.
- Datasets with common identifiers can be merged on a secure server with encrypted data transfers, and a de-identified dataset is generated on the server and provided back to researchers.
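The first option above can be sketched as follows. A keyed hash (HMAC-SHA256) is used here as one possible way of deriving the artificial identifier; the text does not prescribe a method, so this choice, the field names (`ssn`, `pseudonym`) and the placeholder key are all assumptions of the example. Both registries must use the same secret key, which would have to be managed by a trusted party.

```python
import hashlib
import hmac


def pseudonymise(identifier, secret_key):
    """Derive a stable pseudonym from an identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def strip_and_pseudonymise(records, secret_key):
    """Remove the 'ssn' field from each record and add a pseudonym instead."""
    result = []
    for record in records:
        cleaned = {key: value for key, value in record.items() if key != "ssn"}
        cleaned["pseudonym"] = pseudonymise(record["ssn"], secret_key)
        result.append(cleaned)
    return result


def merge_by_pseudonym(registry_a, registry_b):
    """Inner join on the pseudonym (assumes at most one record per person in registry_b)."""
    index_b = {record["pseudonym"]: record for record in registry_b}
    merged = []
    for record in registry_a:
        match = index_b.get(record["pseudonym"])
        if match is not None:
            merged.append({**record, **match})
    return merged


SHARED_KEY = b"placeholder-key-for-illustration-only"

# Two hypothetical registries keyed by the same (fictitious) identifier.
diabetes = [
    {"ssn": "111-11-1111", "diabetes_type": 2},
    {"ssn": "222-22-2222", "diabetes_type": 1},
]
stroke = [{"ssn": "111-11-1111", "stroke_year": 2014}]

linked = merge_by_pseudonym(
    strip_and_pseudonymise(diabetes, SHARED_KEY),
    strip_and_pseudonymise(stroke, SHARED_KEY),
)
```

Anyone holding the key and a list of identifiers can recompute the pseudonyms, so the result remains pseudonymised rather than anonymous data.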
10.5.4 Comparison with information outside the health domain
With large amounts of environmental, economic, social and spatial data generated and available in different databases and registries, these data can be linked with health data, making secondary data analysis and comparisons possible. Interactions of these factors could provide useful information for researchers and policy makers in both health and non-health domains.
10.6 Re-use of aggregations vs. re-use of elementary data
Patient registries typically store data about individual patients and create statistics from the individual data. Such statistical data can be used in many research or policy planning activities, and can be integrated with other statistical data (e.g. comparing morbidity data with economic or social data). Detailed studies, however, need to process the elementary data when matching data from various sources is not possible at an aggregated level. Re-use of elementary (individual) data is, of course, much more sensitive and problematic from the privacy perspective. It is therefore essential to understand the nature of the various kinds of elementary and aggregated information.
10.7 Definition of Possible Types of Data
10.7.1 Aggregated Data (Indicator Compilation)
Data about a single entity (legal or natural person, institution, etc.) are called individual data. Data aggregation is a process in which data and information are searched, gathered and presented in a report-based, summarised format that is meaningful and useful for the end user or application. In statistics, aggregated data denotes data combined from several measurements. When data are aggregated, groups of observations are replaced with summary statistics based on those observations. Data aggregation may be performed manually or through specialised software.
Aggregated data are usually calculated from individual data by summing or averaging the values of some attribute over a set of individuals (population). For example, “body weight of John Smith is 76 kg” is an individual piece of data. “The average body weight of adult citizens of London is 76 kg” is an aggregated piece of data.
Health indicators such as community, public health, or occupational health indicators are typically aggregated data. Using aggregated data, various reports can be generated containing a compilation of selected indicators measuring health status, non-medical determinants of health, health system performance, and finally community and health system characteristics. Patient registries can serve as a valuable source for health indicators such as morbidity and mortality rates.
Aggregated data are generally considered harmless from a privacy perspective and hence can be used without any legal restriction in most cases, provided that appropriate data disclosure control techniques have been used. Most statistics work by counting the total amount of some phenomenon and then dividing it along some attributes. For example, the total number of deaths in a country is counted first and then divided according to gender, age group, geography or cause of death. By combining divisions along different attributes we often get very small numbers and run the risk of possible identification of some individuals. For that reason, legislation in most countries restricts the publication of aggregated data where fewer individuals than a certain limit are behind a number. This limit typically varies between three and five. It is reasonable, however, to make a distinction between publication (making data available to everybody, without any control of further use) and use of such data, for example, for research purposes. In the latter case it is possible to control the proper use of data, for example through supervision by an ethical committee.
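The small-number rule described above can be illustrated with a minimal sketch. The function name `suppress_small_cells`, the table layout and the choice of `None` as the suppression marker are assumptions of the example. Whether empty cells also need to be suppressed depends on local rules; here they are treated like any other small cell.

```python
def suppress_small_cells(counts, threshold=3):
    """Return a copy of the table with every count below the threshold replaced by None."""
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}


# Hypothetical death counts divided by gender and age group.
deaths = {
    ("male", "0-14"): 2,
    ("male", "15-64"): 48,
    ("female", "0-14"): 5,
    ("female", "15-64"): 1,
}

published = suppress_small_cells(deaths, threshold=3)
```

Suppressing individual cells is not always sufficient on its own: if row or column totals are published as well, a suppressed value can sometimes be recovered by subtraction, so real disclosure control usually needs additional (complementary) suppression.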
An increasing number of global patient registries have been established in recent years, which could be especially valuable for rare health conditions to help biomedical research. One example of a global patient registry comes from the US National Institutes of Health (NIH), National Center for Advancing Translational Sciences (NCATS):
“The goal of the NIH/NCATS Global Rare Diseases Patient Registry Data Repository (GRDR) program is to serve as a central web-based global data repository that aggregates coded patient information and clinical data to be available to investigators to conduct various biomedical studies, including clinical trials. The aim of the program is to advance research for many rare diseases and apply to common diseases as well.
Data are collected and aggregated from rare disease registries in a standardized manner, linking the registry data to Common Data Elements (CDEs) using nationally accepted standards and standard terminologies. The aim is that through standardization, registries will be interoperable to enable exchange and sharing of data. Each registry will be free to develop its own survey questions according to patient preference and the nature of the disease.” 
10.7.2 Anonymised Data
Anonymisation is a procedure to completely remove any information from the data that could lead to an individual being identified. The Oxford Radcliffe Hospitals Confidentiality Guidelines state: “[Anonymous] data concerning an individual from which the identity of the individual cannot be determined”.
A Bristol University ethical document defines the following anonymous data types:
"Anonymised data are data prepared from personal information but from which the person cannot be identified by the recipient of the information.
Linked anonymised data are anonymous to the people who receive and hold it (e.g. a research team) but contain information or codes that would allow the suppliers of the data, such as Social Services, to identify people from it.
Unlinked anonymised data contain no information that could reasonably be used by anyone to identify people. The link to individuals must be irreversibly broken. As a minimum, unlinked anonymised data must not contain any of the following, or codes traceable by you for the following:
- name, address, phone/fax number, email address, full postcode
- NHS number, any other identifying reference number
- photograph, names of relatives"
The main difference between anonymous and pseudo-anonymous data is that the former does not contain any key to merge or collect different data about the same individuals. Both data are individual, i.e. contain information about a single person. For example, if all personal identifiers are stripped out from a death certificate (name, birth date, home address, social security number etc.) it is still about a single individual. However, such a document cannot be merged anymore with other (either anonymous, pseudo-anonymous or personal) data about the same person.
Using fully anonymised data is relatively safe from a privacy perspective. However, privacy concerns emerge if one is in possession of additional personal data that allow anonymous and personal data to be joined with reasonable effort.
On the other hand, usability of anonymous data is limited if multiple recording and counting is possible. If there is any chance of having more than one record about the same individual, then calculations will be incorrect (e.g., if we have salary data without personal identifier and one person can have multiple employments, then average incomes cannot be calculated). This is the main reason to use pseudo-anonymised (pseudonymised) data.
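The salary example can be made concrete with a small sketch. The records, field names and pseudonyms below are invented for illustration. Without a person-level key, only an average per record (per employment) can be computed; with a pseudonym, the records of one person can be added up before averaging.

```python
from collections import defaultdict


def mean_per_record(records):
    """Average over records: all that is possible when records cannot be linked."""
    return sum(r["salary"] for r in records) / len(records)


def mean_per_person(records):
    """Average over persons: sum each person's salaries first, using the pseudonym."""
    totals = defaultdict(float)
    for r in records:
        totals[r["pseudonym"]] += r["salary"]
    return sum(totals.values()) / len(totals)


# Person p1 holds two jobs, person p2 holds one.
records = [
    {"pseudonym": "p1", "salary": 1000},
    {"pseudonym": "p1", "salary": 500},
    {"pseudonym": "p2", "salary": 2000},
]

average_per_employment = mean_per_record(records)  # 3500 / 3
average_per_person = mean_per_person(records)      # (1500 + 2000) / 2
```

If the `pseudonym` field is removed, the two records of p1 become indistinguishable from records of two different people and the per-person average can no longer be recovered.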
10.7.3 Pseudo-anonymised Data
Generally speaking, pseudo-anonymisation (or pseudonymisation) is a procedure to break the link to the data subject by replacing the most identifying fields within a data record with one or more artificial identifiers, or pseudonyms. Pseudonymisation is not a method of anonymisation. It merely reduces the linkability of a dataset with the original identity of a data subject, and is accordingly a useful security measure.
Pseudo-anonymised (or pseudonymised) data means that information is represented in a way that allows collecting all data corresponding to the same person without the possibility to identify the real person. Such data cannot include personal identifiers such as names and addresses of the person.
However, there is disagreement regarding the interpretation of what ‘possibility’ means. For example, according to recent Hungarian legislation, the possibility of re-identification exists if the handler of the data is in possession of the technical tools necessary to re-identify the person. There are much stricter interpretations in some European countries, which say that if there is any chance of re-identification (e.g. by using additional information) then the data should be treated as personal. Yet other regulations consider the effort necessary to recognise the real persons, saying that data should be treated as personal only if reasonable effort is enough to re-identify.
There are other definitions of pseudo-anonymisation. For example, the National Health Service (NHS) in UK uses the following definition:
“The technical process of replacing person identifiers in a dataset with other values (pseudonyms) available to the data user, from which the identities of individuals cannot be intrinsically inferred, for example replacing an NHS number with another random number, replacing a name with a code or replacing an address with a location code.” 
This definition interprets the possibility of re-identification in yet another way. It says that the data are pseudonymous if the real individuals cannot be "intrinsically" inferred, i.e. just by using the data. If the data need to be merged with any other (extrinsic) information in order to refer to real persons, then they are not personal data.
Regardless of which definition is accepted, it is clear that the use of such data is extremely important and unavoidable for health research and evidence-based health policy. On the other hand, it is clear that using such data requires special regulation. For example, current Hungarian legislation says that any data handled by governmental bodies are either public or personal. Pseudo-anonymous data are not mentioned in the legislation. Only the law on statistics mentions that statistical bodies must not publish data with fewer than three entities in any given cell presented. However, publication of data (i.e. making data available to everybody) and using data for research purposes are different.
A European directive on using pseudonymous data that defines this type of data and the conditions of use of them would be welcome.
10.7.4 Personal Data
Several laws and regulations around the world include a definition of personal data. For the purposes of EU Directive 95/46/EC, personal data are defined as follows:
Article 2a: 'personal data' shall mean any information relating to an identified or identifiable natural person ('data subject'); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity.
Personal data (personally identifiable information) may be categorised into two main groups:
- Personal data, which are often used to identify the individual such as full name, home address, date of birth, birth place, national identification number, genetic information, telephone number, e-mail address, vehicle registration plate number, credit card numbers, biometric records, etc.
- Personal data, which may be shared by many people and may identify the individual. Examples include city, county, state, country of residence, age, race/ethnicity, gender, salary, job position, etc.
However, it is important to keep in mind that sometimes multiple pieces of information, none sufficient by itself to uniquely identify an individual, may uniquely identify a person when combined.
Because a very rare disease itself could be personally identifiable information, collecting and publishing information about rare diseases in patient registries requires careful consideration.
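The point about combinations of attributes can be checked mechanically by counting how many records share each combination of values. The sketch below uses invented records and hypothetical function names. No single attribute singles anyone out, but the combination of city, age and gender does.

```python
from collections import Counter


def group_sizes(records, attributes):
    """Count how many records share each combination of the given attributes."""
    return Counter(tuple(r[a] for a in attributes) for r in records)


def unique_records(records, attributes):
    """Return the records whose attribute combination occurs exactly once."""
    sizes = group_sizes(records, attributes)
    return [r for r in records if sizes[tuple(r[a] for a in attributes)] == 1]


people = [
    {"city": "Ljubljana", "age": 34, "gender": "F"},
    {"city": "Ljubljana", "age": 51, "gender": "M"},
    {"city": "Budapest", "age": 34, "gender": "M"},
    {"city": "Budapest", "age": 51, "gender": "F"},
    {"city": "Budapest", "age": 51, "gender": "F"},
]

# Each attribute on its own is shared by at least two people ...
unique_by_city = unique_records(people, ["city"])
unique_by_age = unique_records(people, ["age"])
unique_by_gender = unique_records(people, ["gender"])

# ... but the combination of all three isolates three of the five records.
unique_by_all = unique_records(people, ["city", "age", "gender"])
```

The same count applied to a registry extract gives a quick indication of which records would need generalisation or suppression before release. A rare diagnosis behaves like an additional attribute that makes groups even smaller.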
10.8 Cross-border Use for Public Health
There are several initiatives and examples of cross-border use of various data, including patient registry data, for both public health and research purposes. Information sharing and data exchange across borders could serve several purposes.
10.8.1 Cross-country Data Comparison, Surveillance
Data exchange and information sharing across borders would allow cross-country surveillance, monitoring and comparison of data. For example, disease rates and trends could be compared by various demographic and clinical characteristics. EUROCAT (European Surveillance of Congenital Anomalies), which collects data on birth defects from several regional and national birth defects programmes to generate trends, is a good example of this, as is the European Network of Cancer Registries (ENCR), which collects and regularly disseminates information on cancer incidence and mortality in the European Union and Europe. The European Surveillance System (TESSy) is a highly flexible metadata-driven system for collection, validation, cleaning, analysis and dissemination of data on communicable diseases. Its key aims are data analysis and production of outputs for public health action.
10.8.2 Outbreak Alerting and Communication
Sharing cross-border information on communicable (infectious) diseases is of great significance for alerting about outbreaks and potential pandemics at EU or international level. Several infectious diseases spread from human to human, and these do not respect country borders. Therefore, prompt information sharing and data reporting are extremely important for effectively tracking and preventing an outbreak, or at least minimising its consequences to the extent possible. An example of this is the novel H1N1 influenza virus outbreak in recent years. However, these emerging diseases are usually not covered by patient registries. Nevertheless, this information may be linked to special patient registries (such as those for vulnerable patient groups), which could help in alerting these patients and in better understanding the course and treatment of the disease. In this highly globalised and mobile world, transmission of many diseases is more frequent and more likely than ever before in recorded history.
10.8.3 Bioterrorism Threat
Sharing data among specific patient registries could even be helpful in the case of a bioterrorism threat, to inform and protect vulnerable patients and groups (e.g. patients with immune deficiencies) in a timely manner. The anthrax threat and infections in the United States a few years ago showed the potential danger and the need to set up harmonised reporting systems. Patient registries could also benefit from sharing information if a functional cross-border data exchange system were in place.
10.8.4 Identification of Best and Cost-effective Practices
Data sharing could help in searching for and identifying the best and most cost-effective practices of health care providers, such as timely diagnosis and treatment, and professional recommendations. For example, identifying best practices for reducing hospital readmissions could lead other health care providers to implement them, which could significantly reduce costs and avoidable hospital readmissions.
10.8.5 Referral to Services, Establishing New Services
Mapping the distribution of patients by well-defined smaller geographical units could help to refer these patients to the available services at European level. At the same time, a lack of services in certain geographical areas can be identified and new services may be established. Taking travel time and distance into account is very important from both the service providers' and the patients' point of view. The shorter the travel time and distance the better, especially in urgent care, both to save lives and to reduce costs.
10.8.6 Public Health Research
Data exchange could provide information for basic and applied research, and also help to understand various demographic and clinical characteristics, long-term outcomes of specific diseases, comorbidities, and effective prevention and intervention efforts at European or global level.
It is important to differentiate ad hoc, irregular cross-border data sharing and communication, which can also have significant public health value, from public health surveillance, which is by definition an ongoing, systematic and timely data collection.
10.9 Cross-border Use for Research Purposes
The use of registry data for public health and research purposes in cross-organisational and cross-border settings is becoming more and more important. For example:
- increasing mobility increases the risk of cross-country infections,
- for rare conditions setting up international databases or exchanging data is crucial to establish large enough cohorts to study a specific population or specific rare conditions such as genetic disorders, congenital malformations, and metabolic conditions.
Harmonisation of registry data could lead to a reduced cost of managing and using these data, and better quality data would be available for analyses and various indicators.
10.9.2 Risk Factor Studies
Registry data could provide valuable information for epidemiologic studies that analyse potential risk factors for diseases. Sociodemographic data such as race/ethnicity, gender and age can help to understand whether there is an increased risk among certain groups of people. Data on environmental factors, such as air pollutants or pesticide exposure from agricultural activities, can be linked and associations can be analysed. Natural disasters and neighbourhood effects on health can also be studied. Data on medication/drug use and adverse outcomes could be valuable information for drug safety studies.
10.9.3 Genetic Research
Registry data may include information on genetic analysis (molecular or cytogenetic), or the registry data may be linked with biobanks or biological samples that allow further genetic analyses. Gene mutations may be identified for rare genetic conditions. Registries could potentially also contribute to gene-environment correlation studies. Several genetic research initiatives are under way in Europe, and researchers look for data from different sources, including patient registries.
10.9.4 Clinical and Therapeutic Research
Registry data could also help clinical research studies to look at treatment options, and may include data from clinical trials for new medications and medical devices. Using the available data, researchers can analyse clinical parameters, effectiveness and outcomes. Inequalities and disparities in health outcomes by country or other factors could prompt the establishment of new or improved clinical guidelines and recommendations, and inform policy makers.
10.9.5 Some additional information
During 2011-2015 a major FP7 project, “Data without Boundaries – DwB”, took place. The project's mission was to support equal and easy access to the rich resources of official microdata for the European Research Area, within a structured framework in which responsibilities and liability would be equally shared. During its four-year lifespan DwB worked towards preparing a comprehensive European service with better and more user-friendly metadata, more harmonised transnational accreditation and a secure infrastructure allowing transnational access to highly detailed and confidential microdata, both national and European, so that the European Union would be able to continuously produce cutting-edge research and reliable policy evaluations. Most of the results of DwB could also be applied to the re-use of patient registries for research purposes.
Several important and relevant issues were analysed and a few tools were developed during the lifespan of DwB:
- Researchers’ ideas and expectations regarding the re-use of data for research purposes: the most important issues are search strategy, quick overview, good documentation, comparability, information about procedures and user generated context (see http://www.dwbproject.org/export/sites/default/promotion/dissemination_material/dwb_factsheet_user-requirements-def.pdf).
- The state of the art of remote access to data systems was analysed and a Database on National Accreditation & Data Access Conditions was prepared (http://www.dwbproject.org/access/accreditation_db.html).
- Legal frameworks for data re-use for research purposes were analysed, and the results can be browsed via an online visualisation tool, http://fryford.uk/wp-content/visuals/europe/european.html, which presents the possibilities for accessing data according to type of data and type of access.
- Several software tools have been developed: Synthetic Data Tools, CTA (Controlled Tabular Adjustment), Enhanced Controlled Tabular Adjustment - ECTA - & Cell Suppression Free Open Solver software, and Record Linkage tool.
Many more results of the project can be found on its web page. Researchers who are planning to re-use registry data, including in a cross-border setting, can find a lot of important information and tools there.
10.10 Compatibility, comparability and interoperability
10.10.1 Data compatibility
The integration of multiple data sets from different sources requires that they be compatible. Methods used to create the data should be considered early in the process, to avoid problems later during attempts to integrate data sets.
“Compatibility is the capacity for two systems to work together without having to be altered to do so. Compatible software applications use the same data formats. For example, if word processor applications are compatible, the user should be able to open their document files in either product”.
“Another factor that should be considered is the compatibility of existing data sets. Frequently, a data search may reveal multiple sources of similar data types, but the metadata may reveal that the individual data sets are not compatible, as the data have not been collected in a consistent manner …”.
For registries this means that data created by one registry can be imported into another without manual data manipulation. Such a scenario is reasonable and necessary when, for example, data collection in a country is carried out at regional level and regional registry data are used to build up a national registry. The same applies when a European registry is built on Member State registries. Data compatibility is usually considered at the technical level (same data structure and format, character coding etc.), as in the word processor example above. In the case of patient registries, however, the issue is more complex. If a national registry is to be compiled from regional ones, technical compatibility is only a prerequisite and is far from sufficient. Such compilation can be done at the level of elementary data (e.g. the data of the patients registered in each regional registry are sent to the national registry), but it can also be done at aggregated level, where only sums and (weighted) averages are sent. In both cases it is important to be sure, for example, that each patient is registered in one regional registry only, so that double counting is excluded. It is also important that there are no definitional or methodological differences among the regional registries, or at least that such differences are known.
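As a minimal sketch of compilation at aggregated level, assume each regional registry reports only a case count and a mean value. The figures and the function name `pooled_mean` are hypothetical and serve only to illustrate why the regional means have to be weighted by the regional counts:

```python
from typing import List, Tuple


def pooled_mean(regional: List[Tuple[int, float]]) -> float:
    """Combine (case_count, regional_mean) pairs into one national mean."""
    total_cases = sum(count for count, _ in regional)
    if total_cases == 0:
        raise ValueError("no cases reported by any regional registry")
    # Weight every regional mean by the number of cases behind it.
    return sum(count * mean for count, mean in regional) / total_cases


# Hypothetical aggregates: (registered patients, mean age at diagnosis)
regions = [(200, 40.0), (800, 50.0)]

national_mean = pooled_mean(regions)
# Averaging the regional means directly ignores the different region sizes.
naive_mean = sum(mean for _, mean in regions) / len(regions)
print(national_mean, naive_mean)
```

With these figures the weighted result is 48.0, whereas the unweighted average of the two regional means would be 45.0. The weighted figure is itself only valid if no patient is counted in more than one regional registry.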
Summing up, compatibility of registry data has the following requirements:
- Technical compatibility of data (identical or convertible data structures, formats, coding schemes etc.)
- Comparability (see section 10.10.2)
- Double counting exclusion (see the problem of populations in section 10.12)
Comparability is different from compatibility. Colloquially speaking, comparability means that one has to be sure to compare apples with apples and not with peaches. Whenever data from different registries are compared, it is important to be sure that the observed differences are attributable to real differences in the thing being measured and not to artefacts of external or irrelevant circumstances. Full comparability is the exception: raw registry data are rarely directly comparable.
The more common situation is that the differences that make raw data incomparable are known, and ways can be found to resolve them. The most typical example is standardised death rates. Raw mortality figures of different populations are practically never comparable, because the populations have different age structures. Standardisation projects the raw data onto a standard age distribution, which enables mortality data from very different countries to be compared.
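The idea can be sketched with direct age standardisation. The example below is illustrative only: the age groups, the death and population counts and the standard population are invented, and the function names are hypothetical. The two countries are constructed to have identical age-specific death rates but different age structures.

```python
from typing import Dict


def crude_rate(deaths: Dict[str, int], population: Dict[str, int],
               per: int = 100_000) -> float:
    """Deaths per `per` inhabitants, ignoring the age structure."""
    return per * sum(deaths.values()) / sum(population.values())


def directly_standardised_rate(deaths: Dict[str, int],
                               population: Dict[str, int],
                               standard: Dict[str, int],
                               per: int = 100_000) -> float:
    """Weight each age-specific rate by the standard population's share."""
    standard_total = sum(standard.values())
    rate = 0.0
    for age_group, standard_count in standard.items():
        age_specific_rate = deaths[age_group] / population[age_group]
        rate += age_specific_rate * standard_count / standard_total
    return per * rate


# Hypothetical standard population (not an official one).
standard = {"0-39": 50_000, "40-64": 30_000, "65+": 20_000}

# Country A has an older population than country B.
pop_a = {"0-39": 400_000, "40-64": 300_000, "65+": 300_000}
deaths_a = {"0-39": 40, "40-64": 300, "65+": 3_000}
pop_b = {"0-39": 600_000, "40-64": 300_000, "65+": 100_000}
deaths_b = {"0-39": 60, "40-64": 300, "65+": 1_000}

crude_a = crude_rate(deaths_a, pop_a)
crude_b = crude_rate(deaths_b, pop_b)
dsr_a = directly_standardised_rate(deaths_a, pop_a, standard)
dsr_b = directly_standardised_rate(deaths_b, pop_b, standard)
print(crude_a, crude_b, dsr_a, dsr_b)
```

The crude rate of country A is more than twice that of country B, purely because A has more elderly inhabitants. After standardisation the two rates coincide, which reflects the fact that the risk of death within each age group is the same in both countries.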
In other cases comparability problems arise from different definitions and categorisations. Entities such as ‘hospital’, ‘hospital bed’, ‘long term care’ and ‘community care’ are often interpreted differently, and data built on such entities are sometimes hard to compare. Unlike in the standardised death rate example, the problem cannot always be fully resolved in such cases, and sometimes relative comparability has to be accepted. For example, suppose it is known that ‘number of hospital beds’ in country A covers more kinds of bed (e.g. incubators for new-born babies are included) than in country B. If country A nevertheless reports fewer hospital beds than country B, it is certain that there is a real difference; if country A reports more, no such conclusion can be drawn.
The most important thing is to be aware of comparability issues. To achieve the best possible comparability, the following conditions have to be met:
- Sufficiently detailed metadata should be available. Metadata should describe what is counted in a registry, with what exceptions, how the measured entities are defined, what data collection methodology was applied etc.
- Additional data necessary for standardisation should be available. If there are known external or irrelevant factors that influence the thing to be measured (e.g. as age distribution influences mortality) then these data must be available in order to eliminate these effects.
There is an extensive literature on interoperability, and it is not the aim of this study to give a comprehensive overview of the various approaches, definitions and theories behind it. The various definitions often divide interoperability into layers such as technical, functional and semantic. One such division is described in detail in chapters 3 and 5, where legal, organisational/process, semantic and technical interoperability are considered.
In this section we restrict ourselves to the technical (functional) and semantic aspects, because these are the levels of interoperability at which IT standards can be used to find solutions. Briefly and generally speaking, interoperability from an IT perspective is the ability of systems to work together (section 10.11 explains the technical aspects of interoperability in detail).
Semantic interoperability between registries implies that the recipient system is able not only to handle the received information but also to interpret it automatically. Two registries that collect data on the same disease may use different disease coding systems. In such a case, functional interoperability implies that the disease codes can be imported, but it does not imply that semantically identical codes are recognised or that codes from one scheme are converted to the other (see the problem of mapping in subchapter 10.11.3.2).
Semantic interoperability comes into question only if at least one of the systems is able to process information semantically, i.e. it makes inferences or takes actions that depend entirely on the meaning of the information and not on its syntax. Such semantic functions are hard to imagine without some sort of ontology.
10.11 Interoperability Standards and Approaches for Data Exchange
10.11.1 General Concept
The concept of functional interoperability is to permit one system (sender) to transmit data to another system (receiver) to accomplish a specific communication in a precise and unambiguous manner. To achieve this, both systems have to know the format and content, and understand the terminology used. Using standard terminology can help database and system developers, and can facilitate exchange of data among various systems.
The recognition of the need to interconnect health related applications and exchange data led to the development and enforcement of interoperability standards. The following sections explain the standards used for structuring and encoding data.
10.11.2 eHealth standards
Seamless exchange of data in the health care domain is becoming critically important. Considerable effort has gone into developing standards in this area, which also have obvious economic benefits. Here are a few examples of current standards developed and used for data exchange (see also chapter in 220.127.116.11).
- Health Level 7 (HL7): HL7 and its members provide a framework (and related standards) for the exchange, integration, sharing, and retrieval of electronic health information. These standards define how information is packaged and communicated from one party to another, setting the language, structure and data types required for seamless integration between systems. HL7 standards were originally developed to exchange data among hospital computer systems. HL7 standards support clinical practice and the management, delivery, and evaluation of health services, and are recognized as the most commonly used in the world.
- The National Council for Prescription Drug Programs (NCPDP): This US council created data-interchange standards, for example for drug claims, for the pharmacy services sector of the health care industry.
- Data Interchange Standards for Bioinformatics: These standards were developed to support data exchange among various databases in bioinformatics and have gained popularity.
- Health Informatics Service Architecture: The European Committee for Standardization (CEN) Standard Architecture for Healthcare Information Systems (ENV 12967), Health Informatics Service Architecture or HISA is a standard that provides guidance on the development of modular open information technology (IT) systems in the healthcare sector.
- openEHR: It is a virtual community working on interoperability and computability in e-health. Its main focus is electronic patient records (EHRs) and systems. The openEHR Foundation has published a set of specifications defining a health information reference model, a language for building 'clinical models', or archetypes, which are separate from the software, and a query language. The architecture is designed to make use of external health terminologies, such as SNOMED CT, LOINC and ICDx. Components and systems conforming to openEHR are 'open' in terms of data (they obey the published openEHR XML Schemas), models (they are driven by archetypes, written in the published ADL formalism) and APIs. They share the key openEHR innovation of adaptability, due to the archetypes being external to the software, and significant parts of the software being machine-derived from the archetypes. The essential outcome is systems and tools for computing with health information at a semantic level, thus enabling true analytic functions like decision support, and research querying.
- EN/ISO 13606 - Electronic Health Record Communication: This European and ISO standard defines the means to communicate a part or all of the Electronic Health Record (EHR) of a single subject of care. The standard can be seen as a harmonisation of openEHR and HL7.
- ESRI developed spatial interoperability standards for public health and health care delivery.
- Extensible Markup Language (XML) is one of the most widespread markup languages used for data exchange. It defines a set of rules for encoding data structures (including documents) in a textual format that is both human-readable and machine-readable. It is defined by the World Wide Web Consortium's (W3C) XML 1.0 Specification.
- The Resource Description Framework (RDF) and RDF Schema (RDFS) are W3C recommendations used as a general method for the conceptual description or modelling of information in web resources. They support a variety of syntax notations and data serialisation formats, of which XML is the most widely used. RDF is also used in knowledge management applications.
- The Web Ontology Language (OWL) is a family of knowledge representation languages for representing ontologies. The OWL languages extend RDF with constructs that allow formal semantics to be represented. OWL 1 was extended with additional features in 2009, becoming OWL 2. Both languages are supported by Protégé and by DL reasoners such as FaCT++, HermiT, Pellet and RacerPro. OWL and RDF have attracted significant academic, medical and commercial interest.
- Simple Knowledge Organization System (SKOS) is a W3C recommendation designed for the representation of thesauri, classification schemes, taxonomies, or any other type of structured controlled vocabulary. SKOS is part of the Semantic Web family of standards built upon RDF and RDFS, and its main objective is to enable easy publication and use of such vocabularies as linked data.
- Common Terminology Services, Release 2 (CTS2) is a Health Level 7 (HL7) and Object Management Group (OMG) specification for representing, accessing and disseminating terminological content. It is an extension of HL7 Version 3 Standard: Common Terminology Services, Release 1.
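To illustrate what an agreed exchange format buys at the functional level, the following minimal sketch encodes a registry record as XML on the sender side and checks and decodes it on the receiver side. The element names (`registryCase`, `caseId` and so on) and the record content are hypothetical; they are not taken from HL7, openEHR or any other standard listed above.

```python
import xml.etree.ElementTree as ET

# Fields that sender and receiver have agreed on (hypothetical format).
REQUIRED_FIELDS = ("caseId", "diagnosisCode", "codingSystem", "yearOfDiagnosis")


def encode_case(case: dict) -> str:
    """Sender side: serialise one record into the agreed XML structure."""
    root = ET.Element("registryCase")
    for field in REQUIRED_FIELDS:
        ET.SubElement(root, field).text = str(case[field])
    return ET.tostring(root, encoding="unicode")


def decode_case(xml_text: str) -> dict:
    """Receiver side: parse the message and reject incomplete records."""
    root = ET.fromstring(xml_text)
    if root.tag != "registryCase":
        raise ValueError(f"unexpected root element: {root.tag}")
    record = {}
    for field in REQUIRED_FIELDS:
        node = root.find(field)
        if node is None or node.text is None:
            raise ValueError(f"missing field: {field}")
        record[field] = node.text  # values arrive as text
    return record


case = {"caseId": "A-0001", "diagnosisCode": "E70.0",
        "codingSystem": "ICD-10", "yearOfDiagnosis": 2013}
message = encode_case(case)
received = decode_case(message)
print(message)
print(received)
```

The receiver can import the diagnosis code because the structure is agreed, but nothing in this exchange tells it what the code means. Naming the coding system in the message is what makes a later semantic step possible.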
In the United States the “Public Health Data Standards Consortium was invited by the Integrating the Healthcare Enterprise (IHE) to start a Public Health Domain at IHE. IHE is a collaborative of clinicians, administrators, standard development organizations and health information technology (HIT) vendors that drives the adoption of standards to address specific clinical needs through the development of the technical specifications for the software applications. PHDSC and IHE are collaborating to enable interoperability across clinical and public health enterprises.”
10.11.3 Coding schemes, terminologies
The idea of representing certain entities by codes instead of natural language descriptors goes back many centuries. The original motivation for using codes was twofold. One was the need for unambiguity, either across or within languages. The other was to represent the entirety of a domain by a limited number of concepts in order to conduct statistical studies. In the modern age, computational tractability became a further consideration.
Most coding systems are based on some classification: entities of the given domain are arranged into a – usually hierarchical – structure. One of the earliest problems with classification was the problem of multiple hierarchies. For example, diseases can be classified by location (according to the primarily affected organ), by aetiology (infectious, acquired, hereditary etc.), by epidemiology (sporadic, epidemic, etc.), or by pathology (neoplastic, metabolic disorder, etc.). A given disease can therefore be a member of many different, partially overlapping classes. The problem of multiple hierarchies is ubiquitous: it applies to nearly all large classifications, not only in the healthcare domain.
Which dimension should be treated as the main axis of classification depends, again, on the purpose. This is one of the most important reasons why more than one classification exists for most medical domains. There are other reasons, of course, such as differences in granularity, content coverage and availability in different languages.
For public health purposes, however, the International Classification of Diseases (ICD) is perhaps the most frequently used classification system, although different versions of it are in use.
The terms terminology, nomenclature and vocabulary are often used interchangeably, but they differ in meaning. A terminology can be defined as a set of terms representing the system of concepts of a particular subject field. A nomenclature is a system of terms that is elaborated according to pre-established naming rules. A vocabulary is a dictionary containing the terminology of a subject field.
10.11.3.1 Most important terminologies
There are various terminologies used in the health domain. Here is a partial list of terminologies widely accepted and used either globally or by many countries.
- International Classification of Diseases and its clinical modifications: this is one of the best known terminologies, which was first published in 1893, and has been revised at roughly 10-year intervals, by WHO. The most recent version is the 10th revision (ICD-10). WHO has been working on the 11th revision for a few years. In the United States the National Center for Health Statistics published a clinical modification of ICD-9 and now ICD-10 by adding an extra digit to the codes to provide an extra level of detail (ICD-9-CM; ICD-10-CM). The Royal College of Paediatrics and Child Health (formerly British Paediatric Association) also created a modified and extended version of ICD-9 and ICD-10 codes for birth defects (congenital anomalies).
- International Classification of Primary Care: This classification includes over 1000 diagnostic concepts that are partially mapped into ICD.
- Medical Dictionary for Regulatory Activities (MedDRA) is an international medical terminology dictionary used by regulatory authorities and the pharmaceutical industry during the regulatory process, and is also used for adverse event classification. It has been translated into several languages and is used in the EU, Japan and the USA.
- Systematized Nomenclature of Medicine (SNOMED): Originally called SNOP (Systematized Nomenclature of Pathology), it was developed by the College of American Pathologists to describe pathological findings using topographic (anatomic), morphologic, etiologic and functional terms. The current version, SNOMED CT (SNOMED Clinical Terms), was created in 1999 by the merger, expansion and restructuring of SNOMED RT (SNOMED Reference Terminology) and the Clinical Terms Version 3 (formerly known as the Read codes), developed by the National Health Service of the United Kingdom. Since 2007, SNOMED CT has been maintained by the IHTSDO (International Health Terminology Standards Development Organisation).
- GALEN and GALEN-In-Use projects in Europe: the aim was to develop standards for representing coded patient information. The consortium developed the GRAIL concept modelling language and the structure and content of the GALEN Common Reference Model. It also created tools to enable the further development, scaling-up and maintenance of the model.
- Logical Observation Identifiers Names and Codes (LOINC) in the US, and the similar EUCLIDES work in Europe: LOINC was created to represent laboratory tests and observations, and was later extended to non-laboratory observations such as vital signs. EUCLIDES is a comparable effort carried out in Europe.
- WHO Drug Dictionary, ATC codes: The Drug Dictionary is an international classification of drugs by name, ingredient, and chemical substance. It is used by pharmaceutical companies, clinical trial organizations and drug regulatory authorities for identifying drug names in spontaneous ADR reporting (and pharmacovigilance) and in clinical trials. The dictionary was created in 1968 and is regularly updated. Since 2005 there have been major developments in the form of a WHO Drug Dictionary Enhanced (with considerably more fields and data entries) and a WHO Herbal Dictionary, which covers traditional and herbal medicines. Drugs are classified according to the Anatomical-Therapeutic-Chemical (ATC) classification.
- Unified Medical Language System (UMLS): started by the US National Library of Medicine in 1986, it is a quarterly updated compendium (Metathesaurus) of biomedical terminologies that provides a mapping structure among these vocabularies and thus allows transcoding among various terminology systems. Altogether, it contains over a million concepts and 5 million terms, which stem from over 100 incorporated terminologies. Each concept in the Metathesaurus is assigned one or more semantic types, and concepts are linked with one another through semantic relationships. The Semantic Network provides these types and relations: there are 135 semantic types and 54 relationships in total. UMLS can be used to enhance or develop applications such as electronic health records, classification tools, dictionaries and language translators. It can also be used for information retrieval, data mining, public health statistics reporting, and terminology research.
10.11.3.2 Mapping between classification systems
Whenever we are faced with the Babel of classification and coding systems, an obvious idea is (automated) mapping, i.e. conversion, from one to another. At first sight this can be done easily, for example by a simple cross-reference table that contains the corresponding code pairs (triplets, etc.). However, coding systems are not just sets of code values: as mentioned, most of them are built on a classification, so the matter is not so easy. Usually the categories of one classification do not fit entirely into the categories of the other. Unless the underlying classifications are totally identical, no mapping is possible between two coding systems without distortion. Theoretically, a special case is also possible: if one classification is a mere subset of another, then there is an unambiguous mapping from the former to the latter, but not vice versa.
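The limits of a cross-reference table can be seen in a small sketch. The two coding schemes and all codes below are invented for illustration (they do not correspond to ICD, SNOMED CT or any real scheme), and `map_code` and `invert` are hypothetical helper names.

```python
from typing import Dict, List, Tuple

# Hypothetical cross-reference table from scheme A to scheme B.
CROSSWALK: Dict[str, List[str]] = {
    "A10": ["B100"],          # exactly one counterpart
    "A20": ["B200", "B210"],  # category A20 is split into two in scheme B
    "A30": [],                # scheme B has no corresponding category
}


def map_code(code: str, crosswalk: Dict[str, List[str]]) -> Tuple[str, List[str]]:
    """Return the mapping status of a code together with its candidate targets."""
    if code not in crosswalk:
        return ("unknown", [])
    targets = crosswalk[code]
    if not targets:
        return ("unmappable", [])
    if len(targets) == 1:
        return ("exact", targets)
    return ("ambiguous", targets)


def invert(crosswalk: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Build the reverse table (scheme B to scheme A)."""
    reverse: Dict[str, List[str]] = {}
    for source, targets in crosswalk.items():
        for target in targets:
            reverse.setdefault(target, []).append(source)
    return reverse


for code in ("A10", "A20", "A30"):
    print(code, map_code(code, CROSSWALK))
print(invert(CROSSWALK))
```

Only the first code converts automatically. For `A20` the table cannot decide between the two finer categories without additional information about the case, and `A30` cannot be converted at all. In the reverse direction both `B200` and `B210` map to `A20`, so the conversion is unambiguous but the finer distinction is lost.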
10.11.4 Ontologies and data structures
Computer-based patient records can be improved by the use of ontologies. “An ontology specifies the conceptualization of a domain and is often comprised of definitions of a hierarchy of concepts in the domain and restrictions on the relationships between them.” 
An ontology representing the content of an electronic patient record may include (among others) the following:
- Clinical acts (health care flow, surgical and other procedures, etc.)
- Clinical findings
- Disease manifestation, etiology, pathophysiology
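As a toy illustration of a concept hierarchy and of an inference that depends on meaning rather than syntax, the sketch below stores is-a relations in a plain dictionary and derives subsumption from them. It is not an OWL ontology and uses no reasoner; the concept names and the helper functions are chosen for illustration only.

```python
from typing import Dict, List, Set

# Each concept lists its direct parents; a concept may have several
# (compare the problem of multiple hierarchies in section 10.11.3).
IS_A: Dict[str, List[str]] = {
    "Phenylketonuria": ["Metabolic disorder", "Hereditary disease"],
    "Metabolic disorder": ["Disease"],
    "Hereditary disease": ["Disease"],
    "Disease": [],
}


def ancestors(concept: str, is_a: Dict[str, List[str]]) -> Set[str]:
    """Collect all direct and indirect parents of a concept."""
    seen: Set[str] = set()
    stack = list(is_a.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def is_subsumed_by(concept: str, candidate: str,
                   is_a: Dict[str, List[str]]) -> bool:
    """True if `concept` is the same as, or a kind of, `candidate`."""
    return concept == candidate or candidate in ancestors(concept, is_a)


print(sorted(ancestors("Phenylketonuria", IS_A)))
print(is_subsumed_by("Phenylketonuria", "Metabolic disorder", IS_A))
```

A query for all cases of a metabolic disorder can then retrieve records coded as phenylketonuria, although the two labels share no text. A real ontology would add formally defined relationships and restrictions on top of such a hierarchy.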
10.11.5 Mobile health delivery, personalized medicine, and social media applications
Mobile technology, social media, personalised medicine and remote diagnostics could transform health care. The number of e-health applications available for mobile devices is steadily increasing. It is timely and important to develop communication standards for information and communication technologies that facilitate interoperability among systems and devices, provide privacy and security, and address the needs of the developing world.
Personalised medicine, “A form of medicine that uses information about a person’s genes, proteins, and environment to prevent, diagnose, and treat disease”, is a new area of e-health in which personalised medical records are generated.
Social media applications related to health are on the rise. Patients often consult medical information online and turn to social media communities for peer-to-peer support and information. A lot of information can be obtained in this way, but it has to be filtered carefully to identify what is useful.
10.12 Problem with populations
10.12.1 Definition of population
Comparability of data of population-based registries requires clear definition of the given population. Without such a clear definition it cannot be certain, for example, that there is no overlap between the populations of the registries. This is especially true within the EU, where free mobility of people increases the probability that the same person is registered in different registries.
The definition of population in general is in itself not without difficulties. Most often, “population” is defined as a group or collection of individuals inhabiting a certain territory or forming an interbreeding community. There is a proposed definition of population especially for public health purposes saying that “A population (in public health) is a group of persons sharing a common resource.” 
10.12.2 Inclusion and exclusion criteria
Generating comparable data at population level requires the same set of inclusion and exclusion criteria (e.g. residency status, socio-demographic data, geographic area), so that data from two or more systems or registries can be interpreted in a uniform fashion. For example, the definition of stillbirth (the gestational age cut-off point) varies by country, so collecting information on the stillbirth population and comparing characteristics and prevalence across countries could lead to false interpretation of the data. When rates from population-based registries are compared, the residency status criterion, i.e. whether non-resident persons living in a defined geographic area are included or excluded, is very important.
Free mobility within and across borders makes it challenging to establish population-based registries (especially for a smaller geographical area) and to compare data between registries without the risk that the same person is recorded in two or more databases. National, EU-level or global registries could help to eliminate this problem. Communication between systems and regular linking of data could also help to find duplicate records and make data comparable.
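A minimal sketch of such duplicate detection is given below. It assumes deterministic (exact-match) linkage on a salted hash of a few normalised identifying fields. The field names, registry names, records and salt are all hypothetical, and practical linkage usually also has to cope with spelling variants and missing values, which this sketch ignores.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set


def linkage_key(record: Dict[str, str], salt: str) -> str:
    """Derive a pseudonymous key from normalised identifying fields."""
    parts = [
        record["family_name"].strip().lower(),
        record["given_name"].strip().lower(),
        record["birth_date"].strip(),
        record["sex"].strip().lower(),
    ]
    raw = salt + "|".join(parts)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def find_duplicates(registries: Dict[str, List[Dict[str, str]]],
                    salt: str) -> Dict[str, List[str]]:
    """Return the keys that occur in more than one registry."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for registry_name, records in registries.items():
        for record in records:
            seen[linkage_key(record, salt)].add(registry_name)
    return {key: sorted(names) for key, names in seen.items() if len(names) > 1}


# Invented records; the first person appears in both registries.
registries = {
    "region_north": [
        {"family_name": "Example", "given_name": "Ana",
         "birth_date": "1980-02-01", "sex": "F"},
        {"family_name": "Sample", "given_name": "Ben",
         "birth_date": "1975-11-30", "sex": "M"},
    ],
    "region_south": [
        {"family_name": " example", "given_name": "ANA",
         "birth_date": "1980-02-01", "sex": "f"},
    ],
}

duplicates = find_duplicates(registries, salt="demo-salt")
print(duplicates)
```

Only the hashed keys need to be compared, so the registries do not have to exchange names in clear text. Whether such linkage is permitted at all depends on the legal frameworks discussed in section 10.13.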
10.12.4 Socio-demographic, genetic factors
Variations in socio-demographic and genetic factors, such as ethnicity or genetic mutations specific to certain populations, can make it difficult or even nearly impossible to compare some specific data among populations.
10.13 Examples of legal frameworks for data protection and data sharing
Data exchange is sometimes a complex process, and organisations, registries, and data providers have to ensure compliance with cross-border restrictions, privacy and confidentiality rules. All member countries of the EU impose restrictions on the sharing of personal information outside the EU. Organisations sharing personal information collected in the EU with service providers based outside the EU need to find ways to comply with these laws.
Privacy generally applies to people, while confidentiality applies to information. There are many important reasons to protect privacy and confidentiality.
Privacy is the control over the extent, timing, and circumstances of sharing oneself (physically, behaviorally, or intellectually) with others. For example, persons may not want to be seen entering a place that might stigmatize them, such as a pregnancy counselling centre clearly identified by signs on the front of the building. The evaluation of privacy also involves consideration of how the researcher accesses information from or about potential participants.
Confidentiality pertains to the treatment of information that an individual has disclosed in a relationship of trust and with the expectation that it will not be divulged to others in ways that are inconsistent with the understanding of the original disclosure.
Maintaining privacy and confidentiality helps to protect participants from potential harms, including psychological harm such as embarrassment or distress; social harms such as loss of employment or damage to one's financial standing; and criminal or civil liability. Especially in social/behavioral research, the primary risk to subjects is often an invasion of privacy or a breach of confidentiality.
The next sections present a few examples of data sharing policies and regulations related to health information in Europe and in the United States.
10.13.1 Policy on data submission, access, and use of data within TESSy
The European Centre for Disease Prevention and Control (ECDC) created the European Surveillance System (TESSy) to collect, analyse and disseminate surveillance data on notifiable infectious diseases in Europe. A procedure with a set of rules was developed for data submission, data storage and custody, data use and data access, and data protection. Relevant forms and notes are also available:
- Request for TESSy Data for Research Purposes
- Declaration Regarding Confidentiality and Data Use
- ECDC Data Disclaimer
- Conditions for Publishing Note
- Sample Agreement for Agencies and third parties
- Declaration on Data Destruction
10.13.2 European Commission’s proposal for a General Data Protection Regulation
The European Patients’ Forum, which is a not-for-profit, independent organisation and umbrella representative body for patient organisations throughout Europe, wrote a position statement on general data protection regulation, and made recommendations to the European Commission, the European Parliament and Member States to:
- Ensure that the Regulation protects patients’ rights as data subjects and as owners of their health and genetic data, and contains measures to enable patients to benefit from these rights effectively (e.g. access to data, data portability, right to information and transparency). Any restriction due to the special nature of the data processed or legitimate reasons for processing of such data should be justified and limited to what is necessary for public health, or the patients’ vital interests.
- Make the necessary adaptations to the Regulation in order not to hamper provision of care, the conduct of research and public health activities, including patient registries and activities carried out by patient organisations to advance research and patients’ rights, with clear and explicit provisions to ensure the good implementation of this Regulation in the health sector.
- Put in place effective cooperation measures between Member States and minimum security requirements to ensure an equivalent level of protection of personal data shared by patients for healthcare and research purposes across the European Union, and facilitate cross-border healthcare and research.
- Involve patient organisations in decision-making and activities at policy and programme level for questions that relate to the processing and sharing of patients’ personal data, transparency towards patients and informed consent, to ensure the processing is carried out ethically and in a transparent manner throughout the European Union.
10.13.3 European Data Protection Board, General Data Protection Regulation
The European Commission plans to unify data protection within the European Union (EU) with a single law, the General Data Protection Regulation (GDPR). The current EU Data Protection Directive 95/46/EC does not sufficiently consider important aspects such as globalisation and technological developments like social networks and cloud computing. New guidelines for data protection and privacy are required to address these issues, and a proposal for a regulation was therefore released in 2012. Numerous amendments have subsequently been proposed in the European Parliament and the Council of Ministers. The European Council aimed for adoption of the GDPR in late 2014, and the regulation is planned to take effect after a transitional period of two years.
10.13.4 HIPAA Privacy and Security Rules for Public Health Data Exchange
In the United States, the Privacy, Security and Breach Notification Rules were developed under the Health Insurance Portability and Accountability Act of 1996 (HIPAA).
“The Office for Civil Rights enforces the HIPAA Privacy Rule, which protects the privacy of individually identifiable health information; the HIPAA Security Rule, which sets national standards for the security of electronic protected health information; the HIPAA Breach Notification Rule, which requires covered entities and business associates to provide notification following a breach of unsecured protected health information; and the confidentiality provisions of the Patient Safety Rule, which protect identifiable information being used to analyze patient safety events and improve patient safety.” 
- https://grdr.ncats.nih.gov/ last visited 20/07/2014
- http://confidential.oxfordradcliffe.net/anondata last visited 26/05/2014
- http://www.bristol.ac.uk/Depts/DeafStudiesTeaching/ethics/resource/anonymise.pdf last visited 26/05/2014
- http://www.jiscdigitalmedia.ac.uk/clinical-recordings/storage_anonymisation.html last visited 26/05/2014
- http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:31995L0046:EN:HTML last visited 20/07/2014
- http://www.ecdc.europa.eu/en/activities/surveillance/tessy/Pages/TESSy.aspx last visited 20/07/2014
- WhatIs.com Tech definitions http://whatis.techtarget.com/definition/compatibility
- Development of a framework for Mapping European Seabed Habitats (MESH) http://www.searchmesh.net/default.aspx?page=1826
- http://www.esri.com/library/whitepapers/pdfs/hl7-spatial-interoperability.pdf last visited 20/07/2014
- http://www.w3.org/TR/REC-xml/ (XML, a free open standard)
- http://www.w3.org/TR/REC-rdf-syntax/, http://www.w3.org/TR/rdf-schema/
- http://www.w3.org/TR/owl-features/, http://www.w3.org/TR/owl2-overview/
- http://www.phdsc.org/health_info/ihe-task-force.asp last visited 20/07/2014
- A framework ontology for computer-based patient record systems http://ceur-ws.org/Vol-833/paper28.pdf last visited 20/07/2014
- http://www.cancer.gov/dictionary?cdrid=561717 last visited 20/07/2014
- https://itunews.itu.int/en/2472-E8209health-standards-and-interoperability.note.aspx last visited 20/07/2014
- Surjan G. Ontological definition of population for public health databases. Stud Health Technol Inform. 2005; 116:941-5.
- http://media.mofo.com/files/Uploads/Images/130729-BNA-Cross-Border-Information-Sharing-for-Effective-Services.pdf last visited 20/07/2014
- http://www.ecdc.europa.eu/en/activities/surveillance/tessy/documents/tessy-policy-data-submission-access-and-use-of-data-within-tessy-2011%20revision.pdf last visited 20/07/2014
- http://www.eu-patient.eu/Documents/Policy/Data-protection/Data-protection_Position-statement_10-12-2012.pdf last visited 20/07/2014
- http://www.cdc.gov/phin/resources/standards/data_interchange.html last visited 20/07/2014
- http://www.hhs.gov/ocr/privacy/index.html last visited 20/07/2014