Introduction

Welcome to the CALIBER data portal, which contains definitions of research variables derived from the data sources in CALIBER. The portal is updated regularly, as new variables are added or as we identify issues with the variables defined here.

Each chapter relates approximately to an ICD-10 chapter. Within a chapter there may be groups of variables related to a particular condition (e.g. stable angina). Base variables may define demographic information, diagnoses, clinical measurements, test results or prescriptions, with information derived from a single source dataset. Composite variables are defined using a combination of variables which may draw information from more than one data source (e.g. defining hypertension using diagnostic codes and blood pressure measurements).

Principles for defining variables

Base variables defined by Read or ICD-10 codes (diagnoses, procedures, symptoms etc.) classify a record into one of a number of categories according to the code. Wherever possible, ICD-10 and Read code definitions have corresponding categories, although specific categories may not be available in one or the other dictionary. Read codes are mapped to 'medcode' in the CALIBER General Practice Research Database (GPRD) tables.

The base variables are generally suffixed by the name of the source dataset, and have the same stem for corresponding variables in different data sources (e.g. hepatitis_gprd and hepatitis_hes). Where a dataset is not specified, the variable comes from GPRD or is a composite variable. The variable definitions have been agreed by both clinical and non-clinical researchers.
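The naming convention above can be illustrated with a small helper. This is only a sketch: the function name split_variable_name and the list of suffixes are assumptions made for illustration (only "gprd" and "hes" appear in the text), and it is not part of the portal itself.

```python
from typing import Optional, Tuple

# Source suffixes mentioned in the text; any other suffixes would be an assumption.
KNOWN_SOURCES = ("gprd", "hes")


def split_variable_name(name: str) -> Tuple[str, Optional[str]]:
    """Split a variable name into (stem, source suffix).

    Returns None as the source when no known suffix is present, in which
    case the variable comes from GPRD or is a composite variable.
    """
    stem, sep, suffix = name.rpartition("_")
    if sep and suffix in KNOWN_SOURCES:
        return stem, suffix
    return name, None
```

Corresponding variables from different sources share a stem, so split_variable_name("hepatitis_gprd") and split_variable_name("hepatitis_hes") both yield the stem "hepatitis".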

Variables based on additional information in GPRD

GPRD contains additional structured information apart from Read codes, such as numerical test results or categorisation of smoking status. Clinical signs or test results can be recorded either as a medcode (e.g. O/E blood pressure raised) or as additional data linked to specific entity codes (e.g. for entity code 1, the data1 field contains the diastolic blood pressure reading and the data2 field contains the systolic blood pressure reading). In general we have aimed to give due weight to the data and not to include codes or definitions where it is not clear how they have been used. Some entity codes are highly specific, for example "1 - blood pressure". Other entity codes are less specific, such as "288 - other laboratory tests". Where entity codes are highly specific, we have not also required a medcode in order to include the data; where the entity codes are less specific, we have required them to be linked with a relevant medcode before including the data.
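The inclusion rule for additional data can be sketched as follows. The record layout, the function name include_record and the medcode values are hypothetical and chosen only for illustration; entity codes 1 and 288 are the examples given in the text.

```python
from dataclasses import dataclass
from typing import Optional

# Entity codes treated as specific enough to stand alone.
# 1 = blood pressure (example from the text).
SPECIFIC_ENTITY_CODES = {1}

# Less specific entity codes mapped to the medcodes accepted for a variable.
# 288 = other laboratory tests (example from the text); the medcodes
# 1001 and 1002 are placeholders, not real codes.
RELEVANT_MEDCODES = {288: {1001, 1002}}


@dataclass
class AdditionalRecord:
    enttype: int               # entity code
    medcode: Optional[int]     # linked medcode, if any
    data1: Optional[float]     # e.g. diastolic reading for entity code 1
    data2: Optional[float]     # e.g. systolic reading for entity code 1


def include_record(rec: AdditionalRecord) -> bool:
    """Return True if the record's data should be used."""
    if rec.enttype in SPECIFIC_ENTITY_CODES:
        # Highly specific entity code: no medcode is required.
        return True
    # Less specific entity code: require a relevant linked medcode.
    return rec.medcode in RELEVANT_MEDCODES.get(rec.enttype, set())
```

Under these assumptions a blood pressure record is included with or without a medcode, whereas an "other laboratory tests" record is included only when its medcode is in the accepted list.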

For some categorical variables, the response categories from the Read code list and the response categories from the associated entity code could theoretically contradict each other. Where the two sources of data conflict, we have usually added a separate response category for the conflict, so that researchers can decide for themselves how to use this information. To ensure that we identified all the relevant recorded clinical signs or test results, we searched for all entity codes relating to the coding lists for our variables and added the relevant ones to our variable definitions.
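A minimal sketch of the conflict handling described above is shown here. The category label "conflict", the function name and the behaviour when only one source has a value are assumptions made for illustration; the actual labels differ by variable.

```python
from typing import Optional


def combine_categories(read_category: Optional[str],
                       entity_category: Optional[str]) -> Optional[str]:
    """Combine the category from the Read code with the category from
    the linked entity code for a single record."""
    if read_category is None:
        # Only the entity code (or neither source) provides a category.
        return entity_category
    if entity_category is None:
        return read_category
    if read_category == entity_category:
        return read_category
    # The two sources disagree: keep the record, but flag the conflict
    # so that the researcher can decide how to handle it.
    return "conflict"
```

Keeping conflicting records in their own category, rather than dropping them or choosing one source, leaves the decision to the researcher.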

Values for observations of clinical signs or test results are only used in the CALIBER variable definitions if they are specific (i.e. equal to a given value) and if the units recorded are those stipulated in our variable definition or are missing. In general, we have also excluded data if there is a contradiction between the Read code used (e.g. "total cholesterol: HDL ratio") and the name of the linked entity code ("HDL/LDL ratio") because we have no way of knowing which is correct and whether this is the same for all records.
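The filtering rules in this paragraph can be sketched as below. The field names, the use of an operator field to represent "equal to a given value", the unit string "ratio" and the consistent pair are all assumptions made for illustration. The contradictory pair is the example given in the text.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass
class Measurement:
    value: Optional[float]
    operator: Optional[str]   # assumed encoding: "=", "<", ">" etc.
    unit: Optional[str]       # None when the unit is missing
    read_term: str            # term of the Read code used
    entity_name: str          # name of the linked entity code


# Hypothetical list of (Read term, entity name) pairs considered consistent.
CONSISTENT_PAIRS: Set[Tuple[str, str]] = {
    ("total cholesterol: HDL ratio", "total cholesterol/HDL ratio"),
}


def is_usable(m: Measurement, expected_unit: str,
              consistent_pairs: Set[Tuple[str, str]]) -> bool:
    """Apply the three checks: specific value, acceptable unit,
    and no contradiction between Read term and entity name."""
    # The value must be specific, i.e. recorded as equal to a given value.
    if m.value is None or m.operator != "=":
        return False
    # The unit must be the stipulated one or missing.
    if m.unit is not None and m.unit != expected_unit:
        return False
    # The Read term and the entity name must not contradict each other.
    return (m.read_term, m.entity_name) in consistent_pairs
```

With these assumptions, a record coded "total cholesterol: HDL ratio" but linked to an entity named "HDL/LDL ratio" is rejected, because the pair is not in the list of consistent pairs.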