Data Propagation (LOCF)
Detects suspiciously long runs of identical consecutive values in a column, consistent with copy-paste or last-observation-carried-forward fabrication.
Technical description
Last-observation-carried-forward (LOCF) imputation, and the manual copy-paste that fabricators substitute for it, replaces a missing follow-up value with the participant's previous value and so leaves long runs of identical consecutive values in columns that should be independently measured at each visit. In genuine longitudinal data even high-autocorrelation measurements such as blood pressure or weight change slightly between visits because of biological variation and measurement noise, so a stretch of identical values is implausible as real repeated measurement. D24 measures, for each numeric column of the individual-patient data, the fraction of adjacent row pairs that match within tolerance and the longest unbroken run of matching values, and corrects the match rate for the rate expected by chance given the column's value distribution.
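As a rough illustration of the two quantities measured here, the following Python sketch computes the adjacent-match rate and the longest run for a single column. It is not taken from the D24 implementation: the function name `adjacent_match_stats`, the default tolerance argument, and the convention of counting a run in values rather than in pairs are assumptions made for the example.

```python
def adjacent_match_stats(values, tol=0.001):
    """Return (match_rate, longest_run) for one column in row order.

    Assumptions: two values match when |a - b| < tol, and a run is
    counted in values, so k matching pairs in a row give a run of k + 1.
    """
    if len(values) < 2:
        return 0.0, len(values)
    matches = 0
    longest = current = 1
    for prev, cur in zip(values, values[1:]):
        if abs(cur - prev) < tol:
            matches += 1
            current += 1
            longest = max(longest, current)
        else:
            current = 1  # the run is broken, start counting again
    return matches / (len(values) - 1), longest


# Hypothetical blood pressure column: the first value is carried forward twice.
sbp = [120.0, 120.0, 120.0, 121.5, 119.0]
rate, run = adjacent_match_stats(sbp)
print(rate, run)  # 0.5 3
```

Two of the four adjacent pairs match, giving a rate of 0.5, and the three leading values form the longest run.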
How it works
Layer 2 (contextual): requires at least three numeric columns and at least fifteen complete rows. Two values match when their absolute difference is below 0.001. For each column whose standard deviation exceeds 0.01, the indicator computes two quantities: the consecutive-same rate, which is the fraction of adjacent row pairs that match, and the maximum run length, which is the longest unbroken sequence of matching values. It then subtracts a chance baseline equal to the column's Simpson collision probability, that is, the sum of squared relative frequencies of its values after clustering on the matching grid. This is the expected adjacent-match rate under a random ordering of the rows. The corrected rate is the observed rate minus this baseline, floored at zero. The mean corrected rate across columns adds 3.0 when it exceeds thirty percent, 2.0 when it exceeds fifteen percent, or 1.0 when it exceeds eight percent. The longest run across columns adds 1.5 when it reaches ten or 0.5 when it reaches five, and a further 0.5 is added when more than half the columns carry a run of at least three. Independently, under random ordering the count of adjacent-matching pairs in a column of n rows is modelled as binomial with n minus one trials and success probability equal to that column's collision probability; if the smallest one-sided upper-tail probability across columns falls below one in a million, a further 0.5 is added. The total is capped at 5.0, and the minimum binomial tail probability is recorded in the metadata.
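The scoring rules above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the D24 implementation. The names `column_summary`, `binomial_upper_tail`, and `d24_score` are hypothetical, as is the input shape. The sketch also assumes population standard deviation, binning by rounding to multiples of the tolerance, reading "at ten" and "at five" as "at least", applying the "more than half" rule to scored columns only, and returning `None` when the data requirements are not met.

```python
import math
from collections import Counter
from statistics import mean, pstdev

TOL = 0.001  # match tolerance stated in the text


def column_summary(values, tol=TOL):
    """Return (matching pairs, longest run in values, collision probability)."""
    matches, longest, current = 0, 1, 1
    for prev, cur in zip(values, values[1:]):
        if abs(cur - prev) < tol:
            matches += 1
            current += 1
            longest = max(longest, current)
        else:
            current = 1
    # Simpson collision probability on a grid of width tol (assumed binning).
    counts = Counter(round(v / tol) for v in values)
    collision = sum((c / len(values)) ** 2 for c in counts.values())
    return matches, longest, collision


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), by direct summation (small n only)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def d24_score(columns, tol=TOL):
    """columns: dict of name -> equal-length lists of floats, complete rows only.

    Returns (score, min_tail), or None when there are fewer than three
    columns or fifteen rows (an assumed convention for "not applicable").
    """
    n_rows = min((len(v) for v in columns.values()), default=0)
    if len(columns) < 3 or n_rows < 15:
        return None
    corrected, runs, tails = [], [], []
    for values in columns.values():
        if pstdev(values) <= 0.01:  # near-constant column, excluded
            continue
        matches, longest, collision = column_summary(values, tol)
        pairs = len(values) - 1
        corrected.append(max(0.0, matches / pairs - collision))
        runs.append(longest)
        tails.append(binomial_upper_tail(matches, pairs, collision))
    if not corrected:  # every column was near-constant (assumed handling)
        return 0.0, None
    score = 0.0
    mean_rate = mean(corrected)
    if mean_rate > 0.30:
        score += 3.0
    elif mean_rate > 0.15:
        score += 2.0
    elif mean_rate > 0.08:
        score += 1.0
    if max(runs) >= 10:
        score += 1.5
    elif max(runs) >= 5:
        score += 0.5
    if sum(r >= 3 for r in runs) > len(runs) / 2:
        score += 0.5
    min_tail = min(tails)
    if min_tail < 1e-6:
        score += 0.5
    return min(score, 5.0), min_tail


# Hypothetical data: three columns whose first ten rows repeat one value.
propagated = {
    name: [base] * 10 + [base + 1.0 + i for i in range(10)]
    for name, base in [("sbp", 120.0), ("dbp", 80.0), ("weight", 70.0)]
}
score, min_tail = d24_score(propagated)
print(score)  # 4.0
```

In this example each column has 9 matching pairs out of 19 and a collision probability of 0.275, so the corrected rate is about 0.20 (2.0 points), the longest run is ten (1.5 points), and every column carries a run of at least three (0.5 points). Direct summation of the binomial tail is adequate for a few dozen rows; for long columns a numerically stable survival-function routine would be a safer choice.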
Why this matters
Fabricators generating longitudinal data sometimes copy values from the baseline visit to later visits rather than simulating plausible temporal variation, producing step-function patterns that hold constant then jump. LOCF imputation leaves the same artefact even in honest analyses, and large language models generating visit-by-visit data show it when they run out of ways to vary a measurement. Detecting propagated and duplicated values is a recognised part of the central statistical monitoring that uncovers data fraud missed by on-site review.
Score thresholds
- 0-1: Run lengths and repeat rates consistent with natural variation
- 2-3: Some unusually long runs or an elevated repeat rate above chance
- 4-5: Long consecutive identical runs and a high corrected repeat rate, consistent with LOCF or copy-paste fabrication
Limitations
Assumes the rows carry a meaningful order, so that adjacency corresponds to successive visits or records. With arbitrary order, only genuine ties that happen to be adjacent appear as runs, which the chance correction discounts but cannot fully remove. Truly stable attributes repeated across rows can produce legitimate runs, so a flag prompts inspection of provenance rather than proving fabrication. Columns with standard deviation below 0.01 are treated as constant and excluded, and limited-resolution instruments can yield legitimate short runs. The 0.001 tolerance fixes what counts as identical, and the rate and run thresholds are heuristic. Whole-row duplication is covered by indicator D23 and within-table value duplication by indicator S14; D24 focuses on consecutive single-column propagation in the individual-patient data (IPD).
References
- Lachin JM. (2016). Fallacies of last observation carried forward analyses. Clinical Trials
- George SL, Buyse M. (2015). Data fraud in clinical trials. Clinical Investigation
- Carlisle JB. (2021). False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia
- Al-Marzouki S, Evans S, Marshall T, Roberts I. (2005). Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ 331(7511):267-270
- Carlisle JB. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia 72(8):944-952
- Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
- Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
- Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380