Monitoring early diagnosis is a priority of cancer policy in England. However, information on stage has at times been unavailable for a large proportion of patients, which may bias temporal comparisons. We previously estimated that early-stage diagnosis of colorectal cancer rose from 32% to 44% during 2008–2013, using multiple imputation. Here we examine the underlying assumptions of multiple imputation for missing stage using the same dataset.
Individually linked cancer registration, Hospital Episode Statistics (HES), and audit data were examined. Six imputation models including different interaction terms, post-diagnosis treatment, and survival information were assessed, and comparisons were drawn with the a priori optimal model. Models were further tested by setting stage values to missing for some patients under one plausible mechanism, then comparing actual and imputed stage distributions for these patients. Finally, a pattern-mixture sensitivity analysis was conducted.
Data from 196,511 colorectal cancer patients were analysed, with 39.2% missing stage. Inclusion of survival time increased the accuracy of imputation: the odds ratio for change in early-stage diagnosis during 2008–2013 was 1.7 (95% CI: 1.6–1.7) with survival to 1 year included, compared to 1.9 (95% CI: 1.9–2.0) with no survival information. Imputation estimates of stage were accurate in one plausible simulation. Pattern-mixture analyses indicated that the conclusions of our previous analysis would only change materially if stage were misclassified for 20% of the patients who had it categorised as late.
Multiple imputation models can substantially reduce bias from missing stage, but data on patients’ one-year survival should be included for highest accuracy.
- Missing stage data can bias assessments of early diagnosis trends.
- Multiple imputation analyses utilising one year’s survival data reduce bias.
- In one setting, including survival beyond one year did not improve accuracy.
- We demonstrate a pattern-mixture sensitivity analysis to validate trends.
There have been improvements in the range and completeness of information recorded in cancer registration in England over recent decades. Recording of the disease stage at diagnosis improved greatly, from 37% of patients diagnosed in 2008–2009 to 84% in 2012–2013, with higher percentages in more recent years. This improvement makes accurate evaluations of temporal trends in early-stage diagnosis increasingly viable. However, patients without stage recorded in their registration (either because it was not ascertained at the hospital or because it was not recorded) need to be accounted for. These patients have poorer outcomes on average than patients with stage recorded, partly because it may have been considered clinically inappropriate to complete staging for the patient, or because they died before staging was completed. Current practice for routine surveillance is to exclude these patients. However, with appropriate methods to account for the missing stage data, robust evaluations of the impact of key events on early diagnosis can be done: for example, nationwide colorectal cancer screening from 2007, the implementation of the Health and Social Care Act 2012, or the COVID-19 pandemic.
One statistical approach to handling missing values is multiple imputation (MI). MI uses auxiliary information on the patients whose stage is missing to estimate, via a statistical model, a likely distribution of stage for each such patient, and then samples repeatedly from this distribution to create m datasets in which stage is complete for every patient. Estimates of the parameter of interest are calculated using each of the m complete datasets and then combined. Imputation has been shown to be less biased than either a complete-case analysis or a ‘missing value’ analysis, because it makes more plausible assumptions about the missing data. However, the approach assumes that stage is missing at random conditional only on the auxiliary variables used in the imputation model (“missing at random” or MAR), and that the relationships between variables specified in the imputation model match the actual (unknown) relationships in the data. If these assumptions are violated, some residual bias may remain.
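As an illustration of how estimates from the m completed datasets are typically combined, the sketch below applies Rubin's rules, the standard pooling approach for MI. It is an assumption that this matches the pooling used in our analysis; the function name `pool_rubin` and all numbers are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m per-dataset estimates and their squared standard errors."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates and one variance per estimate")
    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical log odds ratios from m = 5 imputed datasets (illustrative only).
log_ors = [0.50, 0.54, 0.52, 0.49, 0.55]
squared_ses = [0.0004] * 5
pooled_log_or, total_var = pool_rubin(log_ors, squared_ses)
```

The between-imputation term is what distinguishes this from a single imputation: it widens the variance to reflect uncertainty about the values that were never observed.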
We previously estimated stage trends during 2008–2013 in England for colorectal cancer using MI. Here we empirically examine the two assumptions made: that the imputation model is correctly specified, and that the data are MAR. For the first assumption, we compare estimates between models of differing complexity that include different auxiliary variables. We then test whether the models correctly estimate stage trends under a plausible MAR scenario. For the MAR assumption itself, we perform a pattern-mixture sensitivity analysis, estimating the proportion of missing values that would have to have been imputed incorrectly for the conclusions of our previous analysis to change. We consider the plausibility of this with reference to likely mechanisms for missing data. The implications of our findings for surveillance are then considered with reference to the early diagnosis targets set in the NHS Long Term Plan (2019).
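The logic of such a sensitivity analysis can be sketched in a deliberately simplified form: reassign a fraction `delta` of the patients imputed as late stage to early stage in a baseline period, then recompute the odds ratio for the later period against that baseline. This is not the pattern-mixture model used in our analysis; the function names, the restriction to imputed-late patients, and all counts below are hypothetical assumptions made for illustration.

```python
def early_share_after_shift(n_early, n_late_observed, n_late_imputed, delta):
    """Early-stage proportion if a fraction delta of imputed-late cases were early."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie between 0 and 1")
    total = n_early + n_late_observed + n_late_imputed
    return (n_early + delta * n_late_imputed) / total


def odds_ratio(p_later, p_baseline):
    """Odds ratio comparing the early-stage proportion in two periods."""
    return (p_later / (1 - p_later)) / (p_baseline / (1 - p_baseline))


# Hypothetical baseline-period counts and later-period proportion.
baseline = {"n_early": 400, "n_late_observed": 300, "n_late_imputed": 300}
later_share = 0.5

# Track how the estimated trend weakens as more imputed-late cases are shifted.
for delta in (0.0, 0.1, 0.2):
    p0 = early_share_after_shift(**baseline, delta=delta)
    print(delta, round(odds_ratio(later_share, p0), 2))
```

Scanning over `delta` in this way shows how large a departure from the imputed values would be needed before the estimated trend moves towards an odds ratio of 1.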