ResAIKit
Research Integrity Toolkit
D24 | Statistical analysis | Fabrication Extended | Layer 2 (Contextual)

Data Propagation (LOCF)

Looks for stretches of a variable that stay exactly the same from one participant row to the next. Genuine measurements drift between people and between visits because of biological variation and measurement noise, so a column that holds one value across many consecutive rows is a signature of last-observation-carried-forward (LOCF) imputation or of copy-paste data generation. The indicator measures how often adjacent rows repeat a value and how long the longest unbroken run is, correcting for how often repeats would arise by chance in that column. It works on the individual-patient data (IPD).

Technical description

D24 is a contextual screen for value propagation in individual-patient data, where last-observation-carried-forward (LOCF) means a missing or unrecorded follow-up value is replaced by the same participant's previous value. LOCF imputation, and the manual copy-paste that fabricators use in its place, leaves long runs of identical consecutive values in columns that should be independently measured. The indicator selects the numeric columns, drops rows with missing entries, and requires at least three columns and at least fifteen rows. For each column whose standard deviation exceeds a small floor it computes two quantities: the consecutive-same rate, that is the fraction of adjacent row pairs whose values match within tolerance, and the maximum run length, that is the longest unbroken sequence of values that match within tolerance. The consecutive-same rate is then corrected by subtracting the rate expected under a random ordering of the same values, so that columns with few distinct values do not register as propagation merely because chance matches are common in them. The score combines the mean corrected rate across columns with the longest run found and the share of columns carrying long runs.
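The per-column quantities described above can be sketched in a few lines of Python. This is a minimal illustration, not the toolkit's own code: the function names `run_stats`, `collision_baseline`, and `corrected_rate` are hypothetical, and snapping values to a grid with `round(v / tol)` is an assumed reading of how values are grouped before the chance baseline is computed.

```python
from collections import Counter

TOL = 0.001  # matching tolerance stated in this entry


def run_stats(values, tol=TOL):
    """Return (consecutive-same rate, maximum run length) for one column."""
    if len(values) < 2:
        return 0.0, len(values)
    matches = 0
    longest = current = 1
    for prev, cur in zip(values, values[1:]):
        if abs(cur - prev) < tol:
            # Adjacent pair matches: extend the current run.
            matches += 1
            current += 1
            longest = max(longest, current)
        else:
            # Run is broken: start counting again from this value.
            current = 1
    return matches / (len(values) - 1), longest


def collision_baseline(values, tol=TOL):
    """Sum of squared relative frequencies of the grid-snapped values.

    Assumption: values are grouped by round(v / tol); the entry only says
    they are clustered onto the matching tolerance grid.
    """
    counts = Counter(round(v / tol) for v in values)
    n = len(values)
    return sum((c / n) ** 2 for c in counts.values())


def corrected_rate(values, tol=TOL):
    """Observed consecutive-same rate minus the chance baseline, floored at zero."""
    raw, _ = run_stats(values, tol)
    return max(0.0, raw - collision_baseline(values, tol))


# Illustrative column: six carried-forward readings followed by four distinct ones.
column = [5.0] * 6 + [1.0, 2.0, 3.0, 4.0]
raw_rate, max_run = run_stats(column)
excess = corrected_rate(column)
```

In this made-up column five of the nine adjacent pairs match and the longest run is six, while the baseline of 0.40 removes most of the raw rate. A strictly alternating binary column has a raw rate of zero and therefore a corrected rate of zero, however coarse its values are.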

How it works

A value matches its neighbour when the absolute difference is below 0.001. For each qualifying column the consecutive-same rate is the fraction of the n minus one adjacent pairs that match under this tolerance, and the maximum run length is the length of the longest stretch of consecutive matching values. Columns whose standard deviation falls below 0.01 are treated as constant and skipped, and if every column is constant the indicator skips with that reason.

The chance baseline for a column is the collision probability of its value distribution, the sum over distinct values of the squared relative frequency; values are first clustered onto the same tolerance grid used for matching. This is the expected adjacent-match rate when values are drawn independently from the column's distribution, and for columns of realistic length it closely approximates the rate expected when the same values are placed in random order. The corrected rate is the observed rate minus this baseline, floored at zero, and the indicator scores on the mean corrected rate across columns.

That mean adds 3.0 when it exceeds thirty percent, 2.0 when it exceeds fifteen percent, or 1.0 when it exceeds eight percent; these bands are mutually exclusive, so only the highest one reached applies. The longest run across all columns adds 1.5 when it reaches ten or 0.5 when it reaches five, and a further 0.5 is added when more than half the columns carry a run of at least three.

Independently of these heuristic bands, the count of adjacent-matching pairs in a column is modelled as binomial with n minus one trials and a success probability equal to that column's collision probability, and the one-sided upper-tail probability of the observed count is evaluated for every column. When the smallest such tail across columns falls below one in a million, the run structure is too clustered to be a chance product of the column's value distribution, and a further 0.5 is added. The total is capped at 5.0.

The metadata records the mean raw rate, the mean corrected rate, the maximum run, the share of columns with long runs, the names of those columns, the number of qualifying columns, the minimum binomial tail probability across columns, and the row count.
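The scoring rules in this section can be written out as a short function. The sketch below assumes the four summary statistics have already been computed; the function name `d24_score` and its parameter names are hypothetical, and "above" is read as a strict inequality while "reaches" is read as greater than or equal.

```python
def d24_score(mean_corrected_rate, max_run, long_run_share, min_binom_tail):
    """Combine the summary statistics into a 0 to 5 score.

    mean_corrected_rate: mean chance-corrected repeat rate across columns (0..1)
    max_run:             longest run found in any column
    long_run_share:      share of columns carrying a run of at least three (0..1)
    min_binom_tail:      smallest one-sided binomial tail probability across columns
    """
    score = 0.0

    # Mutually exclusive bands on the mean corrected rate.
    if mean_corrected_rate > 0.30:
        score += 3.0
    elif mean_corrected_rate > 0.15:
        score += 2.0
    elif mean_corrected_rate > 0.08:
        score += 1.0

    # Longest run across all columns.
    if max_run >= 10:
        score += 1.5
    elif max_run >= 5:
        score += 0.5

    # More than half the columns carry a run of at least three.
    if long_run_share > 0.5:
        score += 0.5

    # Binomial significance bonus: tail below one in a million.
    if min_binom_tail < 1e-6:
        score += 0.5

    # The total is capped at 5.0.
    return min(score, 5.0)


worst_case = d24_score(0.35, 12, 0.6, 1e-9)   # every component fires; sum 5.5 is capped
mild_case = d24_score(0.10, 5, 0.2, 0.5)      # lowest rate band plus a run of five
```

Because the components sum to 5.5 when all of them fire, the cap only matters in that one combination; every other path through the function stays at or below 5.0 on its own.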

Score thresholds

Score     Meaning
0 to 1    Run lengths and repeat rates consistent with natural variation.
2 to 3    Some unusually long runs or an elevated repeat rate above chance.
4 to 5    Long consecutive identical runs and a high corrected repeat rate, consistent with LOCF or copy-paste fabrication.

Why this matters

In genuine longitudinal data even highly stable measurements such as blood pressure or weight change a little between visits because of biological variation and measurement error, so a column that holds one value across many consecutive rows is not a plausible record of repeated measurement. Lachin's review of LOCF shows that the method imputes missing follow-up values with the participant's last recorded value and that the resulting flat segments are a known and questionable artefact of the imputation rather than observed data [1]. The same flat-run pattern arises when fabricators copy a baseline value forward instead of simulating plausible change, and George and Buyse describe the detection of such duplicated and propagated values as part of the central statistical monitoring that uncovers data fraud not caught by on-site review [2]. Carlisle's examination of trials submitted with individual-patient data found recycled and repeated records among the features that exposed fabricated datasets [3]. Correcting the repeat rate for the chance baseline matters because a binary or coarse ordinal column repeats often by accident, and only the excess over that expectation, together with the clustering of repeats into long runs, distinguishes propagation from the ordinary repetition of a discrete variable. Recent forensic re-analyses, scoping reviews, and trustworthiness instruments place carried-forward and duplicated-value checks among the standard screens for fabricated and machine-generated data [4, 5, 6, 7, 8].

Limitations

The check assumes the rows carry a meaningful order, so that adjacency corresponds to successive visits or records; a dataset whose row order is arbitrary can show propagation-like runs only where genuine ties happen to be adjacent, which the chance correction is designed to discount but cannot fully remove. Truly stable attributes recorded once and repeated, such as a persistent marker, can produce legitimate runs, so a flag is a prompt to inspect provenance rather than proof of fabrication. Columns with standard deviation below 0.01 are treated as constant and excluded, and an instrument with limited resolution can produce legitimate short runs that lift the rate without indicating propagation. The tolerance of 0.001 fixes what counts as identical and can merge values that differ only in the last reported place. The thresholds on the corrected rate and the run length are heuristic. Whole-row duplication across many variables is indicator D23 and within-table value duplication is indicator S14, so D24 focuses on consecutive single-column propagation in the individual-patient data.

Theoretical background

D24 rests on the contrast between the temporal structure of real measurement and the flat segments left by imputation or copying. A genuinely measured longitudinal series, even one with strong autocorrelation, almost never repeats a continuous value exactly because measurement noise perturbs every reading, so the probability of a long exact run falls steeply with its length. LOCF replaces that noise with literal repetition, producing step functions that stay constant until the next observed value and then jump, which is the pattern the run-length statistic captures. The chance correction supplies the null the rate is judged against: under a random ordering of a column's values the expected fraction of adjacent matches is, to a close approximation, the Simpson collision probability, the sum of squared value frequencies. That quantity is near zero for a continuous all-distinct column, equals one half for a balanced binary column, and rises towards one as a discrete column becomes more unbalanced. Subtracting this baseline isolates the excess repetition that ordering, rather than the value distribution, is responsible for, so that propagation in a continuous column registers while the routine ties of a discrete column do not. Long fabricated runs raise both the corrected rate and the maximum run length together, and the joint use of the two statistics makes the indicator sensitive to propagation that is concentrated in a few long stretches as well as to propagation spread thinly across many short ones. Treating that same Simpson collision probability as the success rate of a binomial over the n minus one adjacent pairs turns the chance correction from a single subtraction into a significance test, so a column whose repeat count is individually improbable under its own value distribution is flagged even when its corrected rate sits inside the heuristic bands.
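A small worked example may make the binomial step concrete. The sketch below uses only the Python standard library; the function names `collision_probability` and `binom_upper_tail` are hypothetical, the collision probability is computed on exact values without any tolerance grid, and treating adjacent pairs as independent trials is the modelling assumption of the test rather than an exact property of the data.

```python
from collections import Counter
from math import comb


def collision_probability(values):
    """Simpson collision probability: sum of squared relative frequencies."""
    n = len(values)
    return sum((c / n) ** 2 for c in Counter(values).values())


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Baselines for two extreme column types of 40 rows each.
p_binary = collision_probability([0, 1] * 20)          # balanced binary column
p_distinct = collision_probability(list(range(40)))    # all values distinct

# Balanced binary column in which all but one adjacent pair repeats.
tail_20_rows = binom_upper_tail(18, 19, 0.5)   # 20 rows, 19 pairs, 18 matches
tail_40_rows = binom_upper_tail(38, 39, 0.5)   # 40 rows, 39 pairs, 38 matches

flag_20 = tail_20_rows < 1e-6
flag_40 = tail_40_rows < 1e-6
```

The balanced binary column has a baseline of one half and the all-distinct column a baseline of 1/40. With a success probability of one half, 18 matches out of 19 pairs gives a tail of 20/2^19, about 3.8e-5, which stays above the one-in-a-million threshold, whereas 38 matches out of 39 pairs gives 40/2^39, about 7.3e-11, which falls below it. The same near-total repetition therefore becomes significant only once the column is long enough.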

References

  1. Lachin JM. Fallacies of last observation carried forward analyses. Clinical Trials. 2016;13(2):161-168. DOI: 10.1177/1740774515602688
  2. George SL, Buyse M. Data fraud in clinical trials. Clinical Investigation. 2015;5(2):161-173. DOI: 10.4155/cli.14.116
  3. Carlisle JB. False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia. 2021;76(4):472-479. DOI: 10.1111/anae.15263
  4. Al-Marzouki S, Evans S, Marshall T, Roberts I. Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ. 2005;331(7511):267-270. DOI: 10.1136/bmj.331.7511.267
  5. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
  6. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
  7. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
  8. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861