8 Running a registry


8.1 Sequential Processes
8.1.1 Collecting data
      Modes of Data Collection
      Case Report Form
      Data entry/import
      Patient/Data Provider Recruitment and Retention
8.1.2 Data Linkage
8.1.3 Controlling and Cleaning the Data
8.1.4 Storing Data
8.1.5 Analysis of Registry Data
      Data Analysis Plan
      Statistical Analysis
      Analytical considerations
8.1.6 Data Dissemination
8.2 Overarching Processes
8.2.1 Data Quality Assurance
8.2.2 Data Quality Assessment
8.2.3 Evaluation and Improvement of Registry Service
8.2.4 Governance
8.2.5 Auditing
8.2.6 Continuous Development
8.2.7 Information System Management

Running a registry is not a simple procedure. It requires technical knowledge, scientific skill, and rigorous execution of the previously established plan. A multitude of aspects have to be considered, and both sequential and overarching processes have to be followed when running a patient registry.

Key principles:

  • The mode of data collection and the case report form (CRF) are crucial. Electronic methods are preferable.
  • A plan to review each data source must be established, and the processes for controlling and cleaning the data have to be systematized.
  • Storing data involves technical and legal aspects, especially for cross-border use (security, access permission, anonymization of the personal data stored).
  • A data analysis plan has to exist and be executed, considering the characteristics of the registry data.
  • The process of data dissemination has to be designed with all interested audiences and stakeholders in mind.
  • What is to be measured and controlled has to be defined in order to assure and assess data quality.
  • The structure (steering committee, scientific advisory board) and the responsibilities, duties, and roles of the people in charge of the registry have to be established for registry governance.
  • A plan for audit (internal or external) is necessary to validate all the processes.
  • A registry is always in a continuous process of updating. The development of the registry has to be tested continuously and periodically.
  • Technical problems have to be considered regarding information system management: access management, security, backup, archiving.

8.1 Sequential Processes

8.1.1 Collecting data

Data collection is defined as the ongoing systematic collection, analysis, and interpretation of the health data necessary for the patient registry. Data collection can be considered in terms of two major domains: data source and data provider (see chapter 6.2.4 ’Data collection procedure’).

The AMIA (the American Medical Informatics Association) has summarized the “Guiding principles for clinical data capture and documentation”, which can be used to orient the implementation of clinical data collection in a registry.

Modes of Data Collection

The way of collecting data for a registry is crucial, because it determines the registry's feasibility. Regarding data capture there are two main modes: paper based and electronic.

In the past, paper based models were predominant, but nowadays electronic methods are the mainstay. However, paper can still play a core role in a registry.

Different paper based methods are listed and discussed in chapter . Their important characteristic is that they are inexpensive and easy to create and develop, but over the registry’s whole process they imply a substantial cost, because the data need to be transcribed into electronic form and there is no easy and cheap way to do that. Existing paper based processes are being adapted to an electronic environment, with the risk that the paradigm for electronic data capture will be determined by the historical model of paper based documentation.

Electronic methods are the present and most probably the future (though almost half of EU registries are still based on paper-and-pen mode). They can be computer based or based on mobile devices (smartphones or tablets), but the main focus has to be that the data captured are accurate, relevant, confidential, reliable, valid, and complete. Sometimes electronic methods focus on integrating several clinical data sources and producing a new electronic form with the outcomes of the integration (see chapter 8.1.2 ‘Data Linkage’).

Traditionally, a distinction was made between “passive” and “active” methods of data collection: in the passive mode the registry relies on notification, whereas in the active mode registry personnel visit the various sources to identify and collect the data. Nowadays registries use a mixture of methods.

Case Report Form

A case report form (or CRF) is a paper or electronic questionnaire on which individual patient data, required by the registry, are recorded. The terminology is widely used in clinical trial research.

The CRF must include the common data elements planned in the design phase, and it has to use standard definitions of items and variables (according to international recommendations). The principles of a good CRF are: easy and friendly to use, standards based, short, understandable, and connected (where possible) with other potential sources.

Obviously, paper based CRFs are less flexible and usable. An electronic CRF allows greater functionality: data entry control, coherence validation, automatic error correction, and help for the user.
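
Such data entry control can be sketched as follows (a minimal Python sketch; the field names, formats and rules are illustrative assumptions, not a prescribed CRF):

```python
from datetime import datetime

def _is_iso_date(value):
    """Accept only ISO-formatted dates (YYYY-MM-DD) -- an assumed convention."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

# Hypothetical per-field entry rules for an electronic CRF.
RULES = {
    "sex": lambda v: v in {"M", "F"},
    "age": lambda v: v.isdigit() and 0 <= int(v) <= 120,
    "visit_date": _is_iso_date,
}

def validate_entry(entry):
    """Return the names of the fields that fail their entry rule."""
    return [field for field, rule in RULES.items()
            if field in entry and not rule(entry[field])]
```

For example, `validate_entry({"sex": "Male", "age": "34", "visit_date": "2015-08-01"})` flags `"sex"`, because the rule expects the coded value ‘M’ rather than the literal ‘Male’.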

An example of data to be included in a CRF can be found in the book “Cancer Registration: Principles and Methods” (available from http://www.iarc.fr/en/publications/pdfs-online/epi/sp95/). The EPIRARE project has worked to identify the common data elements for rare disease registries across Europe, and a questionnaire about it can be accessed at http://www.epirare.eu/del.html.

Data entry/import

The data flow in a registry may include data entry (either paper or electronic based) or data capture, or the registry may import patients’ data from clinical databases.

In both cases it is important to address the following items:

  • Who will enter the data?
  • Does the data entry program allow certain data items to be entered automatically, or is the data recorder able to make any changes?
  • Does the data entry program effectively validate the data?

Paper based:

If the CRF is paper based, direct data entry can be used: a computer keyboard is used to enter data from the paper CRF into the registry database. It is the easiest way; however, it requires personnel specifically dedicated to recording data. Another option is to capture the data from the paper CRF by using a scanner together with special software to extract the data. In this case, specific CRF forms are required to avoid errors.

Electronic based:

The data entry can be carried out in a local computerized database, though usually this is an option only for localized registries with a few patients. It is more common to use central database servers using web based data entry forms. In this way the data entry for the registry can be shared in several places.

Mobile devices (smartphones, tablets) can also be used as data entry tools, which is especially useful when registry personnel travel to the clinical source.

Finally, a registry can get the data directly from clinical databases. In this case, the data are captured or imported and require a data linkage process (see chapter 8.1.2) with specific decision algorithms.

Patient/Data Provider Recruitment and Retention

A patient registry does not pursue completeness as a main goal; however, it is important to recruit enough patients to reach its objectives. Hence there is a need to develop a source study to learn where the data about patients are held and which types of data could be used. A plan to review each data source must be established (periodicity of review, type of data source, way to get the data, permissions needed…). Sometimes it will be necessary to contact patients face to face and offer them the opportunity to include their data in the registry. An informed consent form has to be ready to use.

There are various incentives to recruit patients to the registry, but the most effective are the prestige and outcomes of the registry itself. If a registry is scientifically well regarded, patients will be more willing to participate. If there are additional advantages, such as access to specific health care processes or increased visibility of certain diseases (especially important in the rare diseases field), patients will be willing to collaborate with the registry.

The transparency and the reputation of the registry are especially important: any problem regarding data protection vulnerability, for example, will imply the loss of patients’ confidence and will entail problems for their recruitment and retention.

If the cases are regularly followed up, it will be possible to produce outcomes like remission or survival. For this reason, a registry has to prepare strategies to get the patients’ status data regularly. It will be important to maintain an updated registry database with the date of each review. An active follow up process may be established by scanning different sources (mortality, treatment or drug prescriptions).

8.1.2 Data Linkage

The data linkage, or record linkage, process refers to the task of identifying records in one or several datasets that correspond to the same individual or entity. This process may seem trivial if an identification code (ID), or a similar variable unique for every entity, is available in the dataset(s) to be linked. Nevertheless, this setting is less usual in practice than might be expected or desired.

Although it may seem obvious, it is worth mentioning the importance of a cleaning/purging phase on the dataset(s) of interest before proceeding to link them. This process should be done with particular attention to the variables used to link the databases. Dates in different formats or categorical variables with different codifications, such as {Male, Female} and {Man, Woman} for sex, are simple examples where this kind of issue may produce record linkage methods to fail dramatically. The pre-processing phase will also have to pay attention to string variables where different naming variants or nicknames could be used, such as Jim and James, and unify those variants into a single term.
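
The pre-processing just described can be sketched as follows (a minimal Python sketch; the code tables, nickname list and field names are illustrative assumptions):

```python
# Harmonize the fields used for linkage before linking the databases.
SEX_MAP = {"male": "M", "man": "M", "m": "M",
           "female": "F", "woman": "F", "f": "F"}
NICKNAMES = {"jim": "james", "bill": "william"}  # assumed variant table

def normalize_record(rec):
    """Return a copy of `rec` with the linkage fields in canonical form."""
    out = dict(rec)
    # Unify the different codifications of sex ({Male, Female}, {Man, Woman}, ...)
    out["sex"] = SEX_MAP.get(rec["sex"].strip().lower(), "?")
    # Unify naming variants and nicknames into a single term (Jim -> James)
    name = rec["first_name"].strip().lower()
    out["first_name"] = NICKNAMES.get(name, name)
    # Reformat DD/MM/YYYY dates into a single ISO format (YYYY-MM-DD)
    d, m, y = rec["birth_date"].split("/")
    out["birth_date"] = f"{y}-{m}-{d}"
    return out
```

After this step, records such as {sex: ‘Woman’, first_name: ‘Jim’} and {sex: ‘F’, first_name: ‘James’} compare equal on the linkage fields, which they would not have done before cleaning.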

When dealing with just a single dataset, it is very frequent that, even when an ID variable is included in the database, it is empty for a considerable number of records. This is particularly frequent in health care registries, where sometimes urgent attention is required and not having access to the ID of the patient is not sufficient reason to deny the attention requested. This problem is particularly prevalent among foreigners who do not have an ID in the corresponding health system, either because they require attention during a temporary visit to that country or because they are still in the process of getting their ID. In that case the ID corresponding to that record is forced to remain empty, with the problems that this may cause for identifying records corresponding to unique entities. A second problem when dealing with a single dataset may come from records corresponding to children. In some health systems children do not have their own ID, and they are recorded either with a missing ID or with the ID of one of their parents. This may cause some records corresponding to children to be linked to records of their parents (sometimes one of them and sometimes the other), altering the results of analyses subsequently made from that dataset. This may be particularly frequent for newborn babies, where administrative delays in getting an ID may make this setting the general rule for this group. Lastly, the ID code of some of the records in the database may have been wrongly introduced due to typing errors or other reasons. All these circumstances will make naïve record linkage methods fail and will make the use of more sophisticated methods necessary.

When dealing with more than one dataset this problem is even worse. Besides the already mentioned problems which will also be present in this case, the record linkage of two or more databases has some new particularities that should also be borne in mind.

Special care should be taken to ensure that the linking fields of the databases are of exactly the same type and of the same length, since otherwise the linking process of the datasets could miss some records that should be matched. This is a particularly frequent setting when the databases to be linked come from different providers or institutions.

It is also a very common occurrence that the databases used in the linking process were not specifically devised to be linked and were designed for very different aims. Therefore, it is not rare to find that both databases do not share a common ID field that allows linking of their records. This is also a very frequent situation when linking databases of different administrations, such as the health and economic authorities, since the identification codes used for any of them are usually different. Specific record linkage methods have been developed for these settings making use of several fields in the database instead of just one.

Record linkage methods can be divided into two sets: deterministic and probabilistic methods. Deterministic methods are used when the databases to be linked lack a common ID field univocally identifying their individuals. However, if the datasets to be linked contain a set of variables whose combination could act as an approximate ID, that combination can be used to link them. For example, the set of variables name, surname, date of birth and city of residence could be merged into a unique code univocally identifying any individual in the dataset; record linkage could then be carried out based on that code. Nevertheless, errors in the information recorded in these fields, or simply missing values in some of them, would make this procedure fail to detect some matching records. To make deterministic record linkage methods more robust against these scenarios, it is usual to include as many fields as possible in the linking process and to match only those records where the percentage of matching fields is above some threshold.
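
The threshold rule just described can be sketched as follows (a minimal Python sketch; the field names and the 75% threshold are illustrative assumptions):

```python
def match_fraction(rec_a, rec_b, fields):
    """Fraction of linking fields on which two records agree.
    Empty/missing values never count as agreement."""
    agreed = sum(1 for f in fields
                 if rec_a.get(f) and rec_a.get(f) == rec_b.get(f))
    return agreed / len(fields)

def deterministic_link(rec_a, rec_b, fields, threshold=0.75):
    """Declare a match when enough fields agree, which tolerates
    errors or missing values in a minority of the linking fields."""
    return match_fraction(rec_a, rec_b, fields) >= threshold
```

With a 0.75 threshold over four fields, two records still match if one field is missing or mistyped, but not if two disagree.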

The second set of record linkage methods are those relying on probabilistic decision rules. In contrast to deterministic methods, these acknowledge that not every field, sex on one hand and date of birth on the other, has the same probability of agreeing by chance between two records. Probabilistic methods take those probabilities into account to decide whether or not two records belong to the same entity. It is common in probabilistic methods to build, for every pair of records, a score summarizing the probability of observing as many matching fields as they have, and to compare it with a fixed threshold that separates scores resulting from mere chance from those coming from records of a common entity.
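
A minimal sketch of such a scoring rule follows (in the spirit of the classical Fellegi-Sunter weights; the m- and u-probabilities, field names and threshold are assumptions for illustration only):

```python
import math

# Agreement weight per field: log(m/u), where m is the probability that the
# field agrees for true matches and u the probability that it agrees by
# chance. Date of birth rarely agrees by chance, so it weighs far more
# than sex. All values below are assumed for the sketch.
WEIGHTS = {
    "sex":        math.log(0.95 / 0.50),
    "birth_date": math.log(0.95 / 0.001),
    "surname":    math.log(0.90 / 0.01),
}

def match_score(rec_a, rec_b):
    """Sum the agreement weights over the fields where the records agree."""
    return sum(w for f, w in WEIGHTS.items() if rec_a.get(f) == rec_b.get(f))

def probabilistic_link(rec_a, rec_b, threshold=5.0):
    """Declare a match when the score exceeds a fixed threshold."""
    return match_score(rec_a, rec_b) >= threshold
```

Two records agreeing on sex, date of birth and surname score far above the threshold; agreeing on sex alone scores well below it, reflecting that such agreement arises by chance half the time.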

Data linkage can be done with two main purposes: merging the records of several datasets from different providers (e.g. hospitals) into a unique dataset, or enhancing the information of the records in a dataset with fields coming from a second dataset. In the first case, record linkage will identify which records in the different databases to be merged correspond to the same individual. This avoids counting those individuals more times than it should, making it possible to derive reliable rates that would not be reliable at all if repeated records were not excluded from the analysis. In the second case, inaccurate record linkage methods will leave the resulting database riddled with missing data coming from unlinked records, making subsequent analyses of that database either unreliable or more difficult to carry out.

Data linkage is one of the most important topics regarding the legal aspects of anonymization, because an ID is needed, which is an obvious piece of personal data. The individual right to integrity and protection of personal data has to be reconciled with the possibility of doing data linkage. There are several options to achieve this from a legal point of view, and a new regulation is currently under discussion in Europe. The perspectives of the new European regulation are mentioned in chapter 6.1.4.

8.1.3 Controlling and Cleaning the Data

Data control and cleaning on patient registries involve the process by which erroneous data are removed or fixed and missing data are filled.

Three different phases in the cleaning process can be distinguished: screening, diagnosis and editing. All of them shall be applied not just as an independent step of the process, but also during the collection, linkage and analysis of the data.

The screening phase involves any action carried out to detect anomalies in the data. Several types of oddities can be found when screening data and each of them should be taken into account.

  • Lack of data can be disguised when data sources use internal codes to declare a missing value, like filling a date field with ‘99/99/9999’ or even literals like ‘missing’ or ‘unknown’. A chart of these internal codes must be built and used as a filter.
  • Duplicates can be detected by a redundant identification code of the patient or by a match in other identification variables such as name, date of birth, sex or external identification codes. Algorithms of approximate matching can detect non-exact duplicates.
  • Format incoherence shall be scanned, detecting values that are incompatible with the preset format of the variable (if Sex is defined as ‘M’ or ‘F’, a field filled with ‘Male’ is erroneous, and shall be recoded).
  • The nature of variables offers ranges of values that are improbable or impossible (Age must be a non-negative number and is unlikely to be greater than 100). Thresholds must be defined to screen inconsistent and outlier values.
  • Joint distributions of variables present different and more restrictive improbable or impossible joint values, like some pathologies combined with sex or age (Sex=‘Female’ and Disease=‘Testicle Cancer’ are incompatible, though each value is coherent by itself). A particular case of this screening is the chronological coherence that dates and ages must have.
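
The screening checks above can be sketched in a few lines (a minimal Python sketch; the internal missing-value codes, variable names and thresholds are illustrative assumptions):

```python
# Chart of internal codes that disguise missing values (assumed examples).
MISSING_CODES = {"99/99/9999", "missing", "unknown", ""}

def screen_record(rec):
    """Return the list of screening flags raised by one record."""
    flags = []
    # Lack of data disguised by internal codes
    if rec.get("birth_date", "").lower() in MISSING_CODES:
        flags.append("missing birth_date")
    # Format incoherence against the preset format of the variable
    if rec.get("sex") not in {"M", "F"}:
        flags.append("format: sex")
    # Improbable or impossible single-variable values
    age = rec.get("age")
    if age is not None and not (0 <= age <= 110):
        flags.append("range: age")
    # Impossible joint values (each value coherent by itself)
    if rec.get("sex") == "F" and rec.get("disease") == "testicular cancer":
        flags.append("joint: sex/disease")
    return flags
```

Each flag then feeds the diagnosis phase, where it is classified as erroneous, correct or dubious.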

The diagnosis phase classifies each oddity detected as erroneous, correct or dubious. A ‘hard cutoff’ filters out logically or biologically impossible data values, which are automatically classified as erroneous. Improbable but not impossible values are filtered through ‘soft cutoffs’ and declared dubious; these should be cross-checked against external databases (such as censuses or other registries) or verified with the primary sources.

Modification of the database in the editing phase can be done automatically or manually over erroneous data. Redundant data shall be merged or deleted. Erroneous values can be corrected or deleted. Linking more external databases provides a source to fill or correct missing or erroneous values. Special codes or flag variables can be set to distinguish corrected fields.

Proper documentation and transparency is required for good practice in data management. Procedures, criteria and actual modifications shall be documented. A good way to keep track of the modifications is to record in a different database the original entries of data before modification.

A cleaning process can provide feedback to collection and linkage processes, so that future errors are prevented. It is important to encourage data users to report any anomalies they may find in the data, to improve the controlling and cleaning process.

8.1.4 Storing Data

Storing and retrieval of data are among the IT services giving support to registry operations. In addition to the general considerations about running these types of services, some specific remarks are worth mentioning here (Refer to 8.2.7 Information System Management to complete the picture).

Data privacy is a major concern in European countries. At the time of writing (August, 2015) the legal framework of reference in this subject is still the Directive 95/46/EC, on the protection of individuals with regard to the processing of personal data and on the free movement of such data, and their different national implementations. In 2012 the European Commission announced a reform of this legal framework. After a lot of work and discussion, that reform is about to be completed.

Personal data about health are among the most sensitive issues. Accordingly, ethics, good corporate governance (transparency, responsibility, accountability, due diligence...) and regulations pose important restrictions on the processing and free movement of these data. Some restrictions have a direct impact on data storage and retrieval. Fines for noncompliance with regulatory requirements may be substantial.

Access control (before). Procedures for proper user identification and authentication, as well as for granting and revoking access privileges have to be established. This also includes technical staff.

Access control (after). Logging procedures must keep track of every single access, even if it is only an attempt. Access logs must be kept safely, as they may become evidence, and be periodically examined. Any irregular event must be further investigated.

Data input/output. Any data input/output operation involving systems or facilities not under direct control of the registry owner must be previously approved and then recorded. Once again, these records must be periodically examined. These operations range from copying data to external devices to provide some sort of mobility, to data exports (or backups) to external facilities in order to provide data or operations recovery in case of disaster.

Cloud storage. Even when IT services based on cloud computing look attractive, they might not be appropriate at all. The registry owner and any potential provider of IT services (cloud based or not) must previously sign a detailed agreement. The following parts must be present in this document (among others):

  • the provider has to declare and assure its knowledge of, willingness, and ability to fulfil all requirements posed by the aforementioned legal framework;
  • what the service provider has to do, what it is not allowed to do, and what it must do when the engagement with the registry comes to an end;
  • the procedures or evidence available to the registry owner to reassure it that the service provider is running everything according to the terms of the agreement.

Many cloud services are provided outside the EU, where the legal framework mentioned above cannot be enforced (see also the Safe Harbor framework developed by the U.S. Department of Commerce). Besides, most big providers of cloud services have their own set of terms of service and operate on a “take it or leave it” basis. Either of these two handicaps may be sufficient reason to discard a provider.

Data integrity and availability. Power shortages, disk crashes, roof leaks, floods, fires, human errors... These things happen. Whether it is acceptable that they have an impact on the registry operations (or rather how much impact can be acceptable) is something to be determined by the registry owner, who will have to enable adequate countermeasures. Backup procedures should be conducted according to data recovery objectives and business continuity plans. The ability to recover from the backups is not something to take for granted, but to be tested on a regular basis.

Anonymization. For those purposes (e.g. research) where patient identity is not of primary relevance, health data must be dissociated from identity data. Privacy restrictions do not apply to data that cannot be traced back to the identity of the patient. Therefore, adequate dissociation processes should be made available as an option for data retrieval. These processes may provide either one way dissociation (anonymization sensu stricto) or two way dissociation (reversible dissociation). The difference is that, with one way dissociation, it is virtually impossible to trace back to the identity of the patient, whereas with reversible dissociation the keys and procedures to unveil patient identity must be kept under strict control.
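
The two kinds of dissociation can be sketched with keyed hashing (a minimal Python sketch, not a complete anonymization scheme; key management, and the destruction of the key for strict anonymization, are assumed to be handled elsewhere):

```python
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # must be kept under strict control

def one_way_pseudonym(patient_id):
    """One-way dissociation: a keyed hash of the identifier. Without the
    key (destroyed for anonymization sensu stricto) the pseudonym cannot
    feasibly be traced back to the patient identity."""
    return hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()

# Two-way (reversible) dissociation: a lookup table, itself kept under
# strict control, allows authorized re-identification.
_pseudonym_table = {}

def two_way_pseudonym(patient_id):
    code = one_way_pseudonym(patient_id)
    _pseudonym_table[code] = patient_id
    return code

def unveil(code):
    """Authorized reversal of a two-way pseudonym."""
    return _pseudonym_table[code]
```

In the reversible variant, controlling access to `_pseudonym_table` and `SECRET_KEY` is precisely the "keys and procedures" requirement mentioned above.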

8.1.5 Analysis of Registry Data

The analysis of registry data presents as much variety as can be found in the purpose and objectives of registries. Ideally, a detailed data analysis plan should be established beforehand, but flexibility is needed to deal with situations that registry planners could not originally foresee. Situations that call for unplanned analysis will often arise under two different circumstances: first, to address unexpected findings that can lead to new research questions, and second, to give answers to special requests set up by stakeholders. A planned analysis meets researchers’ objectives, whereas the foundation of a study based on unexpected findings is developed after making the observation; on the other hand, ad hoc analyses are directed to satisfy a registry user’s specific needs.

Closely linked to the data analysis plan, statistical methods should be stated in as much detail as possible. Researchers need to be cautious when interpreting registry data, which often have inherent biases. Potential sources of bias should be addressed in advance and, to the extent possible, so should the procedures for handling missing data and controlling for confounding.

Data Analysis Plan

The data analysis plan depends on the registry objectives, but registry planners should be aware that some relevant research questions could arise over time and may not be defined a priori.

Registry-based studies can be descriptive or analytical, but most of the time registries have aims that are primarily descriptive. Descriptive studies focus on disease frequency, distribution patterns (by examining person, place, and time in relation to health events), clinical features of patients and the natural history of diseases; descriptive studies can suggest risk factors and can help to generate all kinds of hypotheses that could later be tested by analytical studies.

In the case of rare diseases, patient registries are often a first step to try to understand the number of people affected and the characteristics of the disease and the patients, though the scope of these registries may evolve over time.

Disease-specific health indicators (morbidity, mortality and disability indicators) should be made available for the total studied population and for age and sex subgroups. Absolute numbers, as well as crude and age-standardised rates should be calculated. To ensure comparability, standardization should be based on the European standard population. [note 1]

The main measures of disease frequency are: incidence rate, cumulative incidence, point prevalence, period prevalence, lifetime prevalence and (for congenital diseases) prevalence at birth.

Incidence, often considered the most important measure in epidemiology, is usually expressed as incidence rate, which provides a measure of the occurrence of new disease cases per person-time unit; when incidence rate refers to one year, the denominator is the number of persons under surveillance. High mortality rate diseases, such as some cancers, are better measured in terms of incidence.

Point prevalence can be practically defined as the proportion of the population that has a given disease at some specific point in time, while period prevalence is the probability that an individual in a population will be a case at any time during a given period, often one year. Prevalence indicators are crucial in rare diseases, as prevalence itself constitutes the main criterion for defining a disease as rare.
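
The two basic frequency measures reduce to simple ratios, shown here as a short sketch (the numbers in the comments are illustrative, not real registry data):

```python
def incidence_rate(new_cases, person_years):
    """New disease cases per unit of person-time (here, per person-year)."""
    return new_cases / person_years

def point_prevalence(existing_cases, population):
    """Proportion of the population with the disease at a point in time."""
    return existing_cases / population

# Illustrative figures:
#   40 new cases over 20,000 person-years -> 2 per 1,000 person-years
#   150 prevalent cases in a population of 100,000 -> 1.5 per 1,000
```

Note the different denominators: incidence divides by person-time under surveillance, whereas prevalence divides by the population count at the reference point.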

Mortality indicators, such as mortality rate and case fatality rate, provide a good measure of the burden of disease. Other health status indicators include premature mortality, measured by Years of Potential Life Lost (YPLL); the disability-adjusted life year (DALY), a time-based measure that combines years of life lost due to premature mortality and years of life lost due to disability; and the quality-adjusted life year (QALY), based on the number of quality years of life that would be added by an intervention.

Analytical studies, such as cohort studies and case-control studies, focus on examining causal associations between exposures and outcomes, or between characteristics of patients and treatment, and health outcomes of interest. Data quality requirements in analytical studies are much higher than in descriptive studies.

For analytical studies, the association between a risk factor and outcome may be expressed as attributable risk, relative risk, odds ratio, or hazard ratio, depending on the nature of the data collected, the duration of the study, and the frequency of the outcome. Attributable risk is defined as the proportion of disease incidence that can be attributed to a specific exposure, and it may be used to indicate the impact of a particular exposure at a population level.
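
From a 2x2 exposure-outcome table, the measures just listed can be computed directly (a sketch; here attributable risk is computed as the risk difference in the exposed, one common operational definition):

```python
def risk_measures(a, b, c, d):
    """Association measures from a 2x2 table:
         exposed:   a cases, b non-cases
         unexposed: c cases, d non-cases
    Returns (relative risk, odds ratio, attributable risk in the exposed)."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    relative_risk = risk_exposed / risk_unexposed
    odds_ratio = (a * d) / (b * c)
    # Risk difference: excess risk among the exposed attributable to exposure
    attributable_risk = risk_exposed - risk_unexposed
    return relative_risk, odds_ratio, attributable_risk
```

For instance, with 30/100 cases among the exposed and 10/100 among the unexposed, the relative risk is 3 and the attributable risk is 0.2, i.e. 20 excess cases per 100 exposed.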

For economic analysis, although not very common in registry-based studies, the analytic approaches encountered are cost-effectiveness analysis and cost-utility studies.

Statistical Analysis

Statistical analysis is used to summarize and transform the data stored in registries into knowledge. This knowledge is the end product of a registry, since it allows us to characterize the population covered by the registry and, where appropriate, to compare it with the general population. Beyond this aim, registries may also serve a purely administrative purpose, keeping account of the registered people.

It is not easy to single out a particular set of statistical tools of particular use in health registries, since registries are devised for very different purposes and, depending on those purposes, different statistical tools will be needed. The first set of tools to be used in the analysis of health registries are descriptive tools, which summarize the, sometimes overwhelming, information stored in these registries. For this aim, graphical tools, either depicting the distribution of the values of a single variable or relating the values of two or more of them, are of particular use. Descriptive statistics are also often used for summarizing the information in the databases; thus, the mean, median and standard deviation are typical statistics used to summarize variables. If, instead, a measure of the dependence between two variables in a dataset is sought, Pearson’s correlation coefficient is the most widespread tool.

In addition to the descriptive aims above, it will often be interesting to make inference about (learn about) some features of the population covered by the registry. A first step is to contrast specific hypotheses in one’s own dataset; in that case, one should resort to statistical tests. There is a huge number of statistical tests available for very different purposes, and it is beyond the scope of this section to describe their use even briefly. Nevertheless, it is worth highlighting the chi-square test and the t-test as the most common tests in data analysis. The t-test is usually an appropriate choice for comparing the means of two different groups in the population, although it requires the variable under study to be Normally distributed. If this condition is not met, an alternative non-parametric test should be used, such as the Wilcoxon rank-sum (Mann-Whitney) test. The chi-square test, on the other hand, is used to assess the dependency between two categorical variables.
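
The two statistics behind these tests are easy to compute by hand, as this sketch shows (test statistics only; obtaining p-values would require the reference distributions, which a statistical package provides):

```python
import math

def two_sample_t(x, y):
    """Pooled two-sample t statistic (equal-variance form)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    sx2 = sum((v - mx) ** 2 for v in x) / (nx - 1)  # sample variances
    sy2 = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp2 = ((nx - 1) * sx2 + (ny - 1) * sy2) / (nx + ny - 2)  # pooled variance
    return (mx - my) / math.sqrt(sp2 * (1 / nx + 1 / ny))

def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows),
    summing (observed - expected)^2 / expected over all cells."""
    total = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for row in table:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            stat += (observed - expected) ** 2 / expected
    return stat
```

In practice one would use a statistical package rather than these hand computations, but they make explicit what the tests actually measure.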

Instead of testing a particular hypothesis, it may be interesting to assume a statistical model for one's dataset and to learn about the parameters governing that model. Thus, as an example, a linear relationship could be assumed between two variables and one could try to learn about the parameters defining that relationship. There are several tools for achieving this goal: linear models (assuming a Normal outcome) are usually used for continuous variables, but if the outcome variable cannot be assumed to be Normal, Generalized Linear Models are the usual way to model such settings. Logistic regression and Poisson regression models are particular cases of Generalized Linear Models.
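A minimal worked example of the simplest of these models, a linear model with a single covariate, can be sketched as follows; the dose-response data are invented, and a statistical package would in practice also return standard errors and diagnostics:

```python
from statistics import mean

def ols_fit(x, y):
    """Ordinary least squares for a single covariate: y = a + b*x."""
    mx, my = mean(x), mean(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

# Hypothetical registry data: dose (mg) vs measured response
dose = [10, 20, 30, 40, 50]
response = [14.8, 20.1, 24.9, 30.2, 35.0]
intercept, slope = ols_fit(dose, response)
print(f"response = {intercept:.2f} + {slope:.3f} * dose")
```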

Survival analysis should also be mentioned as a statistical technique of particular use in health registries. Survival analysis studies the time taken for an individual to develop an event of interest, such as the time survived before dying or before developing a metastasis. The particularity of this kind of analysis is that many individuals in the dataset do not show the event of interest, either because they are never going to develop it, because they have not developed it yet (although they will in the future) or because they have simply left the study. This means the variable of interest is sometimes only partially known, and the analysis of such data requires particular treatment. If a descriptive tool is wanted in this context, Kaplan-Meier curves are the usual choice, while Cox regression models are the most widespread tool for modelling the effect of covariates on survival time.
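The Kaplan-Meier estimator mentioned above can be sketched in a few lines. The follow-up times below are invented; censored patients (those who have not shown the event) reduce the number at risk without contributing an event:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times  - follow-up time for each patient
    events - True if the event occurred, False if censored
    Returns a list of (time, survival probability) steps.
    """
    s = 1.0
    curve = []
    for t in sorted(set(ti for ti, e in zip(times, events) if e)):
        at_risk = sum(1 for ti in times if ti >= t)
        died = sum(1 for ti, e in zip(times, events) if ti == t and e)
        s *= 1 - died / at_risk
        curve.append((t, s))
    return curve

# Hypothetical follow-up data (months); False means censored
times = [6, 7, 10, 15, 19, 25]
events = [True, True, False, True, False, True]
for t, s in kaplan_meier(times, events):
    print(f"t={t:>2}  S(t)={s:.3f}")
```

Dedicated survival packages (e.g. the R `survival` package) additionally provide confidence bands and Cox model fitting.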

Finally, it is convenient to mention some available tools for carrying out these statistical analyses. Although this list of statistical packages is not intended to be comprehensive, SAS, Stata and SPSS stand out as the commercial packages most frequently used in the health sciences in general. Any of these packages is perfectly suitable for carrying out the above-mentioned analyses in the context of health registries. Nevertheless, R is nowadays an open-source alternative with widespread use well beyond the health sciences. R is often criticized for being a bit rough on non-statistical users; nevertheless, some specific R packages, such as Rcmdr, are intended to make R easier for non-statistical users, making it available to a wide community. The main advantage of R is that it is likely to have specific state-of-the-art packages for almost any task that could arise in a registry, such as record linkage, dealing with confounding by indication, or missing values.

There are many textbooks covering the statistical methods mentioned above; in fact, specific monographs have been published for most of these methods. The most appropriate book for any user will probably be the one that illustrates its examples with the software habitually used for the statistical analysis at hand. Thus, depending on the software used, appropriate textbook choices could be: Le (2003) for SAS users, Cleophas and Zwinderman (2010) for SPSS users, Hills and De Stavola (2012) for Stata users and Lewis (2009) for R users. Once again, this is not intended to be a comprehensive list of possibilities, but just a collection of useful textbooks.

Analytical considerations

When analysing the information stored in health registries there are a series of issues that deserve particular attention and should always be borne in mind. Some of those issues are described below to make the reader aware of their existence and their effects.

Potential sources of bias

There are numerous potential sources of bias when dealing with data coming from health registries. Four specific sources of bias in observational studies in general are listed here: selection bias, non-response bias, information bias and recall bias.

One of these sources is selection bias, which results from the selection mechanism used to include individuals in a registry. As an example, assume that a diabetic patient registry is composed of patients recruited during their visits to a hospital. By definition, only those patients who have visited the hospital have the opportunity to be included in the registry. Regrettably, patients visiting the hospital are not a random sample of all diabetic patients; on the contrary, they tend to be patients with severe problems who have possibly had a complication related to their disease. This will make the results drawn from the registry non-representative of the diabetic patients in the whole population.
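This selection mechanism is easy to reproduce in a small simulation. The sketch below assumes a hypothetical severity score and a hospital-visit probability that grows with severity; the registry mean then overestimates the population mean:

```python
import random

random.seed(42)

# Hypothetical simulation: disease severity score in the full diabetic
# population vs. the subset captured by a hospital-based registry, where
# the probability of a hospital visit grows with severity.
population = [random.gauss(50, 10) for _ in range(100_000)]
registry = [s for s in population if random.random() < s / 100]

pop_mean = sum(population) / len(population)
reg_mean = sum(registry) / len(registry)
print(f"population mean severity: {pop_mean:.1f}")
print(f"registry mean severity:   {reg_mean:.1f}")  # noticeably higher
```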

A related bias is non-response bias, in which all candidates for inclusion in the registry may have been recorded, but some of them show missing values for specific fields. These missing values can be rather innocuous if they are produced at random. Regrettably, quite often the presence of missing values responds to a non-random mechanism, making those fields in the database unrepresentative of the population and, therefore, biasing the results if this potential bias is not taken into account during the statistical analysis.

A further source of bias worth mentioning is information bias: the bias coming from inconsistencies in the way information is introduced into the registry. Some artefact in the process of retrieving or coding the information into the database could make that information reflect not the reality but a biased and distorted image of it. This could be the case for a variable reflecting the vaccination status of the individuals in the database. By default, this variable could be set to "non-vaccinated" and changed to "vaccinated" when appropriate. Nevertheless, as the default value always goes in the same direction, it can often happen that vaccinated individuals are registered in the database as non-vaccinated simply because the person who administered the vaccine did not record it. This systematic error could introduce problems and further bias into later analyses of the information in the database.

Finally, recall bias should also be borne in mind when working with health registries, mainly when part or all of the database is retrieved from interviews or questionnaires. This source of bias is produced by differences in the accuracy with which the people included in the registry recall information from their past. For example, people with a family history of cancer may recall cancer-related events more thoroughly than people without such connections, so the information provided by these two kinds of people could be systematically biased in different directions, simply because of their particular circumstances.

These biases are usually incorporated into the database from the very moment information is introduced. Registry professionals should be very aware of them, so that they are prevented from the design phase of the registry onwards and, when possible, corrected by means of appropriate statistical analysis.

Confounding by indication

When analysing data coming from health registries it is quite common to study a variable as a function of some covariates. However, the distribution of the values of the covariates in registry data is not assigned at random or following a specific, controlled design. On the contrary, in observational studies in general, and health registries in particular, these values are the result of factors that are not registered and are outside the control of the study. For example, the decision to administer a medicine to a patient may be taken by a practitioner as the result of a general assessment of the patient's health. As a consequence, patients with a worse general health status will take the medicine and those who are better will not. When assessing the effect of the medicine on a final outcome, such as dying in the following year, we could conclude that taking the medicine increases the probability of dying, when this is really an effect of the previous health status of the patients. This effect is known as confounding by indication, and it may lead us to draw wrong conclusions about the effect of a variable because it is simply confused with other uncontrolled variable(s).
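The sketch below reproduces this phenomenon with invented numbers: a drug with no real effect looks harmful in the crude comparison because it is mostly given to patients in poor health, while stratifying by health status (used here as a simple stand-in for a full propensity-score adjustment) removes the spurious difference:

```python
import random

random.seed(1)

# Hypothetical simulation of confounding by indication: the medicine is
# mostly given to patients in poor health, and poor health itself drives
# mortality. The drug has no real effect here, yet the crude comparison
# suggests it is harmful.
patients = []
for _ in range(50_000):
    poor_health = random.random() < 0.5
    treated = random.random() < (0.8 if poor_health else 0.2)
    p_death = 0.30 if poor_health else 0.05  # drug has no effect
    died = random.random() < p_death
    patients.append((poor_health, treated, died))

def death_rate(rows):
    return sum(d for _, _, d in rows) / len(rows)

crude_treated = death_rate([p for p in patients if p[1]])
crude_untreated = death_rate([p for p in patients if not p[1]])
print(f"crude: treated {crude_treated:.2f} vs untreated {crude_untreated:.2f}")

# Within each health stratum the spurious difference disappears.
for stratum in (True, False):
    t = death_rate([p for p in patients if p[0] == stratum and p[1]])
    u = death_rate([p for p in patients if p[0] == stratum and not p[1]])
    print(f"poor_health={stratum}: treated {t:.2f} vs untreated {u:.2f}")
```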

When interpreting the results of health-registry-based analyses this potential problem should very much be borne in mind. If it is suspected that it could have influenced the estimation of the effect of a variable in the study, statistical techniques designed to control that effect should be used, for example the inclusion of propensity scores in the analysis. Propensity scores are auxiliary variables included in the analysis to control for the non-random mechanism that assigned the values of the covariate of interest (such as receiving a treatment) to the individuals in the study.

Missing data

Health registries quite often contain missing data for some of their variables. Missing data are a real problem for data analysis and should be treated with care to avoid introducing bias.

It is very convenient to know why the missing data are produced. The best, although least likely, scenario is that missing values occur at random, that is, no relationship can be found between their occurrence and any known variable. In that case missing data are not very harmful, although they introduce some difficulties into the data analysis. If the dataset at hand is large, the individuals containing missing values could simply be removed from the analysis without expecting big changes in the results. If, on the contrary, there is knowledge, or evidence from the data, that the missing data have not been produced at random, this should be borne in mind because they could be much more harmful in the data analysis phase. In that case, removing these individuals from the analysis would mean removing a particular part of the whole population, which could produce small or large biases depending on how particular that subsample is. Therefore, a naïve removal of these individuals from the analysis is not an option. Imputing the missing values is the main alternative, although the imputation should take into account the mechanism generating the missing data. For example, if individuals with particularly large (or low) values of one variable tend to show missing values in a second variable, and these two variables are correlated, the value of the first variable should be used to impute the values of the second one, instead of imputing completely at random.
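The difference between naïve and conditional imputation can be sketched as follows, with invented data in which weight is missing precisely for the tallest patients:

```python
from statistics import mean

# Hypothetical sketch of conditional imputation: weight is missing for
# the taller patients, so imputing the overall mean understates their
# weight, while a regression on height does not.
height = [150, 155, 160, 165, 170, 175, 180, 185, 190]
weight = [55, 58, 62, 65, 70, None, None, None, None]  # missing not at random

observed = [(h, w) for h, w in zip(height, weight) if w is not None]
obs_h = [h for h, _ in observed]
obs_w = [w for _, w in observed]

# Fit weight = a + b*height on the observed pairs (least squares).
mh, mw = mean(obs_h), mean(obs_w)
b = sum((h - mh) * (w - mw) for h, w in observed) / \
    sum((h - mh) ** 2 for h in obs_h)
a = mw - b * mh

naive = [w if w is not None else mw for w in weight]
conditional = [w if w is not None else a + b * h
               for h, w in zip(height, weight)]
print("mean imputation:       ", [round(w, 1) for w in naive])
print("conditional imputation:", [round(w, 1) for w in conditional])
```

Proper multiple-imputation procedures (available, for example, in R packages) additionally propagate the uncertainty of the imputed values into the analysis.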

8.1.6 Data Dissemination

Well-established, multicentre or population-based registries holding large data collections can be a rich source of information for many different users, while small, locally held registries have a limited number of potential users. In both cases, data should be made accessible to ensure that all information is used to the maximum benefit of the population the registry serves.

Data should be disseminated in different ways, depending on the addressee of the data. Thus, three different points of view should be taken into account concerning registry data dissemination: 1) registry holders or owners, 2) patients and the general public, and 3) decision makers and researchers.

Patients and service users, researchers, health professionals and policymakers, as well as other stakeholders and even the general public, should have access to valid and properly presented information in order to make choices and decisions. By making outcome data transparent to stakeholders, well-managed registries enable medical professionals to engage in continuous learning and to identify and share best clinical practices. To identify potential stakeholders, it is important to consider to whom the research questions matter. It is useful to identify these stakeholders at an early stage of the registry planning process, as they may have a key role in disseminating the results of the registry.

Registry-based information can be made available in many different ways, such as periodical reports, extracts on request and specific tools provided to allow users to access the data themselves via online portals. The principles of good dissemination of data have to be considered. An example would be the United Nations Good Practices on National Official Statistics [note 2].

Written reports, presentations, tables, graphs and maps can all be used to show registry outcomes. Understandability is the main principle, and it is very important to use the right type of tool for presenting the information. If a particular dissemination tool (a table, graph or map, for example) does not add to or support the analysis, it should be left out.

Dissemination reports may contain only information, only data, or both data and information (data accompanied by a text explaining those data). Depending on the addressee of the data, a good dissemination strategy (fixed in a dissemination plan) should consider the following:

  1. Registry holders or owners: dissemination requires actions to acknowledge the people involved in the registry process, such as data providers, clinicians or managers.
  2. Patients and the general public: it will be necessary to disseminate basic indicators, mainly in the form of simple tables, as these are more easily understood; graphics, maps and other kinds of representations are also needed.
  3. Decision makers and researchers: dissemination has to be done in the form of aggregated data, but it is also important to prepare individual anonymized data for researchers.

Every original finding and all scientifically significant information generated by disease registries should be communicated to the scientific community and finally rendered as scientific publications (on paper, online or both). Indeed, publishing results is inherently linked to the purpose of most, if not all, patient registries, as proper publishing can be considered an integral part of the scientific method.

Long-term population-based registries, an essential tool for public health surveillance, typically produce periodic descriptive analyses of data to be distributed to all potential users, and especially to the health professionals providing the data, as this feedback enhances subsequent cooperation.

In clinical registries, data on disease progression or other long-term patient outcomes may not be available for many years, but safety data can be examined periodically over time. Studies based on patient registries, even short-term registries, may conduct interim analyses before all patients have been enrolled or all data collection has been completed, in order to document and monitor the progress of the project.

As the paradigm of health information goes, registry data should be collected once and used many times. Timeliness is key.

8.2 Overarching Processes

8.2.1 Data Quality Assurance

Electronic Health Records are generally designed for their primary use. As a consequence, when their data are collected for secondary (reuse) purposes, such as the construction of research repositories, their Data Quality (DQ) may not be optimal. Research repositories generally have higher levels of DQ, mostly due to manual curation and data profiling processes; however, DQ problems are still present. These can lead to suboptimal research processes, or even to inaccurate or wrong hypotheses. In order to ensure the highest levels of quality, continuously improve DQ processes and avoid further DQ problems, organizational DQ Assurance protocols should be established.

DQ Assurance protocols combine activities at different levels, from the design of the information system and user training in DQ to continuous DQ control. To this end, many research and industrial DQ Assurance proposals have been related to the Total Quality Management Six Sigma process improvement methodology. Concretely, the DMAIC model can be used to improve DQ and its related processes, involving the following cycle of steps: Define, Measure, Analyze, Improve and Control. Defining what to measure and how to measure it is the basis of DQ Assurance, these being the initial steps of any DQ improvement. These steps, along with DQ control, can be defined under a DQ Assessment framework.

8.2.2 Data Quality Assessment

Data Quality (DQ) Assessment is managed according to DQ dimensions: attributes that represent a single aspect or construct of DQ. Dimensions can conform to data definitions or to user expectations. Thus, DQ Assessment concepts and methods can be defined according to specific domains or problems. A set of DQ dimensions can be established to assess the DQ of cross-border patient registries based on different studies (see chapter 4 'Quality dimensions of Registries').

There are other dimensions which, rather than being measured on the data themselves, can be measured on their related stakeholders. Data Availability refers to the degree of accessibility of the data to users. Data Security refers to their degree of privacy and confidentiality. Finally, data Reliability refers to the degree of reputation and trust of the stakeholders and institutions involved in their acquisition.

DQ problems may affect single or combined variables within an individual patient record, e.g. an inconsistent combination of variable values. Alternatively, DQ problems may affect a sample composed of a set of records, e.g. a biased sample mean. For that reason, according to the purpose, methods should be considered that are also applicable to large-scale big data repositories.

To conclude, it is of utmost importance for DQ Assessment to formally define what is to be measured and controlled according to the aforementioned dimensions. Based on that, strategies can be defined to correct or prevent DQ problems. DQ processes can be applied to off-line research datasets; however, continuously controlling DQ indicators (based on on-line methods or multi-site audits) within a DQ Assurance cycle, and using the resulting feedback to improve processes, is the recommended strategy to continuously reduce DQ problems and optimize resources.
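As a hypothetical illustration of such formally defined measurements, the sketch below checks two DQ dimensions on a toy registry extract: completeness of mandatory fields and consistency of a combined pair of variables (the field names and rules are invented for the example):

```python
# Hypothetical record-level DQ checks over a patient registry extract:
# completeness of mandatory fields and consistency between combined
# variables (here, sex vs a pregnancy flag).
records = [
    {"id": 1, "sex": "F", "birth_year": 1980, "pregnant": True},
    {"id": 2, "sex": "M", "birth_year": None, "pregnant": False},
    {"id": 3, "sex": "M", "birth_year": 1975, "pregnant": True},  # inconsistent
]

MANDATORY = ("sex", "birth_year")

def dq_report(records):
    """Return (completeness of mandatory fields, list of issues)."""
    issues = []
    for r in records:
        for field in MANDATORY:
            if r.get(field) is None:
                issues.append((r["id"], f"missing {field}"))
        if r.get("sex") == "M" and r.get("pregnant"):
            issues.append((r["id"], "inconsistent sex/pregnant combination"))
    missing = len([i for i in issues if "missing" in i[1]])
    completeness = 1 - missing / (len(records) * len(MANDATORY))
    return completeness, issues

completeness, issues = dq_report(records)
print(f"completeness of mandatory fields: {completeness:.0%}")
for patient_id, problem in issues:
    print(f"patient {patient_id}: {problem}")
```

In a DQ Assurance cycle such indicators would be computed continuously and fed back into the Improve and Control steps.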

8.2.3 Evaluation and Improvement of Registry Service

Quality assessment of a registry should be a continuous process integrated into the registry's operation. The dimensions needed to measure it (completeness, validity, timeliness, …) are common to different types of registries, but methods and indicators depend on the type of registry. Population-based cancer registries are one of the most advanced examples.

Depending on their complexity and cost, some methods can be implemented routinely, while others - such as contrasting the registry against an independent series of cases, one of the most widely used methods to assess completeness - should only be used sporadically.

An external audit of the registry is nevertheless a good idea, even though external audit and accreditation - used in the health sector for decades and considered useful for promoting high-quality products and services with efficacy and reliability - are less developed in the registry field, except in the United States of America.

For those registries in which the health administration is both the data supplier and the data user (client), there is a need to actively incorporate the opinion of health planning and health management professionals.

An example is the REDEPICAN (Latin America Network for Cancer Information Systems) Guide for the External Evaluation of Population-based Cancer Registries, used in several Spanish and Latin American cancer registries. It is a new tool inspired by accreditation principles: a voluntary process, standard and defined criteria, self-assessment, an external verification process, and a report by an independent body. The Guide assesses several dimensions (Structure, Procedures Manual, Registry Method, Comparability, Completeness, Validity, Outcome Dissemination, and Confidentiality and Ethical Aspects) through 68 criteria with three standard levels, allowing assessment of the traditional indicators as well as the procedures needed to make the necessary changes in order to offer maximum efficiency. The final score, and the criteria with a low score, identify problems to be solved in the registry, with concrete objectives for improvement. An external audit with a homogeneous measurement tool is useful as a starting point to measure quality improvements and to compare registries.

8.2.4 Governance

Patient registries’ governance comprises the systems and procedures by which a registry is directed and managed. It refers to guidance and high-level decision making, including concept, funding, execution, and dissemination of information.

Good governance must include:

- Compliance with regulations (regional, national, international). In some countries prior approval for the operation of a registry by professional or health authorities is needed. The support and approval of the institution in which the registry is located is fundamental. The ethics committee's approval is also needed.

- Principles on which the registry's action is based. Some of them are: transparency, participation, accuracy, security and data protection.

- Definition of operating rules. This is a document that specifies the rules, the case definitions used, and the codes and classifications used (assuring semantic interoperability). All the operating procedures have to be elaborated and released to all participants in the registry. The way in which the data of the registry may be accessed has to be clearly defined. A document covering consent and its procedures has to exist (see chapter 4.1 'Governance').

- The structure of the governance board (and its role and responsibilities). According to the governance plan (see chapter 6.1.9 ‘Governance, Oversight and Registry Teams’) the governance board can be structured in several ways:

  • Preferably, there should be a project management team, a scientific committee and a quality assurance committee.
  • A scientific committee or expert group can be formed to guide the development of the registry and to ensure its scientific basis. Its role is that of a consultant group.
  • The project management team can also be developed as a steering committee. It has to ensure that the registry is running according to the principles and objectives planned. Its composition should take into account the institution in which the registry is based, the organization that funds the registry, the professionals involved, the health authorities, the academic or scientific institutions related to the subject of the registry, and the patients and families affected. Its role is to assume responsibility for the registry, with the Chair of the steering committee assuming final responsibility.

An example of a registry governance document is that of the National Cancer Registry of Ireland [1].

8.2.5 Auditing

According to the Dictionary of Epidemiology, an audit is an examination or review that establishes the extent to which a condition, process or performance conforms to predetermined standards or criteria. In a registry, audits may be carried out on the quality of data or completeness of records. Depending on the purpose of the registry, several types of audit can be performed. The audit can assess: enrolment of eligible patients, data completeness, selection bias, or data quality. An example of quality assessment is shown in chapter 8.2.3. The audit can be conducted either on the whole set of data of the registry, or just for a selected (random or systematic) sample of patients, using sampling techniques.
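Drawing the audit sample can be sketched as follows; both a simple random sample and a systematic sample (every k-th record after a random start) are common choices, and the record count and sample size below are invented for illustration:

```python
import random

# Hypothetical sketch of drawing audit samples from a registry: a simple
# random sample and a systematic sample (every k-th record after a
# random start).
patient_ids = list(range(1, 501))  # 500 registry records (illustrative)

random.seed(7)
random_sample = random.sample(patient_ids, 25)

k = len(patient_ids) // 25  # sampling interval
start = random.randrange(k)
systematic_sample = patient_ids[start::k]

print("random sample size:    ", len(random_sample))
print("systematic sample size:", len(systematic_sample))
```

A systematic sample is simpler to draw from ordered files, but a random sample avoids any bias from periodic patterns in the record order.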

For example, the Spanish National Rare Diseases Registry performed an audit in one Spanish region to assess the validity of diagnoses of aplastic anaemia coded with International Classification of Diseases codes in hospital discharge data and the mortality registry, in order to detect cases to be included in the rare diseases registry. After obtaining the data from both databases, the patient medical records were reviewed to confirm true aplastic anaemia cases. Only 15% of the cases were confirmed [note 3].

The audit can be internal or external. Internal audit is carried out by the registry staff, using a concrete plan and specific indicators to assess the most significant sources of error as regards the purpose of the registry. External audit is performed by external personnel, in accordance with pre-established criteria.

8.2.6 Continuous Development

A registry is always in a continuous process of updating. For example, the way data are collected can undergo changes due to technological innovations, organizational modifications or new legal rules.

For that reason, a registry should be flexible and adaptive in all the facets of the registry process:

  • For paper-based registries, it is crucial to move on to electronic based ones.
  • New data elements could be added (new treatments or new disease stage for example).
  • Definitions can be modified according to improved knowledge.
  • Revisions of the classification systems happen and the registry has to be ready to be adapted to new ones.
  • It is necessary to foresee any legal modifications regarding ethical and data protection rules; the personal identification number can also change or may need to be encrypted.
  • The technological innovations affect the way in which a registry operates.
  • The methods of data quality processes should be adapted to the results achieved.
  • The reports and the diffusion mechanisms need to be flexible, because new data users can be incorporated and the stakeholders’ concerns may change.

The development of the registry has to be continuously and periodically tested, in order to progress and adapt to the potential changes.

All modifications have to be made while ensuring the quality and integrity of the data, and the date on which they take effect has to be planned.

8.2.7 Information System Management

Running a registry requires dealing with a number of stakeholders (patients, providers, clients, partners, regulatory authorities…). It also takes a good deal of IT. A registry owner will therefore be interested in building trust among the stakeholders, as well as in getting the most value from its information systems. IT governance can provide both.

None of the IT activities within the registry should take place on an improvised, contingent or ad-hoc basis, but within an adequate governance and management framework [note 4]. This is the best way to:

  • Maintain high-quality information to support business decisions.
  • Achieve strategic goals and realize business benefits through the effective and innovative use of IT.
  • Achieve operational excellence through reliable, efficient application of technology.
  • Maintain IT-related risk at an acceptable level.
  • Optimize the cost of IT services and technology.
  • Support compliance with relevant laws, regulations, contractual agreements and policies.
  • Provide trust to all stakeholders.

Among IT activities there should be management processes related to delivery, service and support. In that area, the following processes have to be considered:

  • Manage operations.
  • Manage service requests and incidents.
  • Manage problems.
  • Manage continuity.
  • Manage security services.
  • Manage business process controls.

Other processes should be run to monitor, evaluate and assess performance and conformance, the system of internal controls, and compliance with external requirements. Key indicators are essential in this context, as they are a main source of knowledge and allow measuring variables like cost, risk, disruption, improvement, and others.

All of the above should discourage anyone from pretending to "take care of all the IT stuff" around the registry without adequate knowledge or tools. Being proficient at making scrambled eggs at home does not qualify one as a chef.


  1. Available at: http://ec.europa.eu/eurostat/documents/3859598/5926869/KS-RA-13-028-EN.PDF/e713fa79-1add-44e8-b23d-5e8fa09b3f8f. Also WHO’s World Standard Population is defined; more on that can be found at: www.who.int/healthinfo/paper31.pdf.
  2. http://unstats.un.org/unsd/dnss/gp/fundprinciples.aspx
  3. Ruiz E, Ramalle-Gómara E, Quiñones C, Rabasa P, Pisón C. Validation of diagnosis of aplastic anaemia in La Rioja (Spain) by International Classification of Diseases codes for case ascertainment for the Spanish National Rare Diseases Registry. Eur J Haematol. 2014 Aug 18. doi: 10.1111/ejh.12432. [Epub ahead of print]
  4. ISACA's COBIT 5 is a comprehensive business framework for the governance and management of enterprise IT. It provides good practice and guidance from the knowledge and experience of a large (>115,000 members) community of IT audit, security, risk and governance professionals worldwide. Alternatively, different sets of ISO standards address some of the main issues (e.g. ISO/IEC 38500, ISO/IEC 20000, ISO/IEC 27000…). More at www.isaca.org/cobit/pages/default.aspx and www.iso.org.


  1. National Cancer Registry of Ireland: http://www.ncri.ie/sites/ncri/files/documents/GovernanceFrameworkfortheNationalCancerRegistry24September2010.pdf
  1. National Health Information Management Group. Minimum Guidelines for Health Registers for Statistical and Research Purposes. Australian Institute of Health and Welfare. 2001.
  2. Cruz-Correia RJ, Pereira Rodrigues P, Freitas A, Canario Almeida F, Chen R, Costa-Pereira A. Data Quality and Integration Issues in Electronic Health Records. In: Information Discovery On Electronic Health Records. V. Hristidis (ed.); 2010. p.55–96.
  3. Cusack CM, Hripcsak G, Bloomrosen M, Rosenbloom ST, Weaver CA, Wright A, Vawdrey DK, Walker J, Mamykina L. The future state of clinical data capture and documentation: a report from AMIA's 2011 Policy Meeting J Am Med Inform Assoc. 2013;20(1):134-40.
  4. ENERCA (European Network for Rare and Congenital Anaemias):http://www.enerca.org
  5. EPIRARE (Deliverable 1.4 Statistical Analysis of the EPIRARE survey data):http://www.epirare.eu/_down/del/D1.4_StatisticalAnalysisofRegistrySurveyData.pdf
  6. EUROCISS (Cardiovascular Indicators Surveillance Set): http://ec.europa.eu/health/ph_projects/2000/monitoring/fp_monitoring_2000_frep_10_en.pdf
  7. Gliklich R, Dreyer N, editors. Registries for Evaluating Patient Outcomes: A User’s Guide. 3rd ed. Rockville, MD: Agency for Healthcare Research and Quality; 2012 (Draft released for peer review).
  8. Health information and Quality Authority. Guiding Principles for National Health and Social Care Data Collections. Dublin: Health information and Quality Authority; 2013.
  9. Karr AF, Sanil AP, Banks DL. Data quality: A statistical perspective. Statistical Methodology. 2006;3:137-173.
  10. Larsson S, Lawyer P, Garellick G, Lindahl B, Lundström M. Use Of 13 Disease Registries In 5 Countries Demonstrates The Potential To Use Outcome Data To Improve Health Care’s Value. Health Aff. 2012;31:1220-227.
  11. Lee YW, Strong DM, Kahn BK, Wang RY. AIMQ: a methodology for information quality assessment. Information & Management. 2002. 40, 133-146.
  12. MacLennan R (1991). Items of patient information which may be collected by registries. In: Jensen OM, Parkin DM, MacLennan R, Muir CS, Skeet RG, editors. Cancer Registration: Principles and Methods. Lyon: International Agency for Research on Cancer (IARC Scientific Publications, No. 95); pp. 43–63. Available from http://www.iarc.fr/en/publications/pdfs-online/ epi/sp95/index.php.
  13. McMurry AJ, Murphy SN, MacFadden D, Weber G, Simons WW, Orechia J, et al. SHRINE: Enabling Nationally Scalable Multi-Site Disease Studies. PLoS ONE. 2013;8(3):e55811.
  14. Navarro C, Molina JA, Barrios E, Izarzugaza I, Loria D, Cueva P, Sánchez MJ, Chirlaque MD, Fernández L. [External evaluation of population-based cancer registries: the REDEPICAN Guide for Latin America]. Rev Panam Salud Publica. 2013 Nov;34(5):336–42.
  15. Newton J, Gardner S. Disease registers in England. Oxford: Institute of Health Sciences, University of Oxford; 2002.
  16. Posada de la Paz M, Villaverde-Hueso A, Alonso V, János S, Zurriaga O, Pollán M, Abaitua-Borda I. Rare Diseases Epidemiology Research. In: Posada de la Paz M, Groft S, editors. Rare Diseases Epidemiology. Heidelberg, London, New York: Springer Science+Business Media B.V.; 2010.
  17. Richesson R, Vehik K. Patient Registries: Utility, Validity and Inference. In: Posada de la Paz M, Groft S, editors. Rare Diseases Epidemiology. Heidelberg, London, New York: Springer Science+Business Media B.V.; 2010.
  18. Röthlin M. Management of Data Quality in Enterprise Resource Planning Systems. Josef Eul Verlag; 2010.
  19. Sáez C, Martínez-Miranda J, Robles M, García-Gómez JM. Organizing data quality assessment of shifting biomedical data. Stud Health Technol Inform. 2012; 180:721–725.
  20. Sáez C, Robles M, García-Gómez JM. Stability metrics for multi-source biomedical data based on simplicial projections from probability distribution distances. Under review. 2014a.
  21. Sáez C, Robles M, García-Gómez JM. Probabilistic change detection and visualization methods for the assessment of temporal stability in biomedical data quality. Under review. 2014b.
  22. Sebastian-Coleman L. Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework. Morgan Kaufmann; 2013.
  23. Van den Broeck J, Argeseanu Cunningham S, Eeckels R, Herbst K. Data cleaning: detecting, diagnosing, and editing data abnormalities. PLoS Med. 2005;2(10):e267.
  24. Wang RY, Strong DM. Beyond accuracy: what data quality means to data consumers. J Manage Inf Syst. 1996;12(4):5–33.
  25. Wang RY. A Product Perspective on Total Data Quality Management. Communications of the ACM. 1998;41(2):58–65.
  26. Weiskopf NG, Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc. 2013 Jan;20(1):144–151.
  27. Zurriaga O, Bosch A, García-Blasco MJ, Clèries M, Martínez-Benito MA, Vela E. [Methodological aspects of the registries for renal patients in replacement therapy]. Nefrología. 2000;20 Suppl 5:S23–31.