ResAIKit
Research Integrity Toolkit
D4 · Statistical analysis · Fabrication Detection · Layer 2 (Contextual)

Inlier Clustering

Looks at each numeric variable in the raw data and checks whether its values huddle too tightly around the average. People who invent data pick numbers near the middle and shy away from the extremes that real measurement produces, so a fabricated column is often too peaked, with too many values close to the mean and an implausibly narrow spread. The indicator runs three peakedness checks per column and flags a column when at least two agree.

Technical description

A contextual screen for the inlier-clustering signature of manual fabrication. Three checks run on each numeric column of individual-patient data that has at least thirty non-missing values. Excess kurtosis (unbiased estimator, normal = 0) is flagged above 1. The inlier proportion (the fraction of values within half a standard deviation of the mean) is flagged above 0.55, well above the ~0.38 expected of normal data. The studentized range ((max - min)/SD) is flagged below 3, since genuine, roughly normal data of this size typically spans four to six standard deviations. A column is flagged when at least two of the three checks trigger, and the score reflects the fraction of columns flagged. A constant column is treated as non-peaked for kurtosis, fully clustered for the inlier check, and undefined for the range check.
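The three checks can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the toolkit's own code: the function names and the returned dictionary are hypothetical, the SD is assumed to be the sample SD (n - 1 denominator), "within half a standard deviation" is read as inclusive, and the kurtosis estimator is assumed to be the usual bias-corrected G2 statistic.

```python
import math
import statistics

# Thresholds quoted in the description above.
MIN_VALUES = 30
KURTOSIS_LIMIT = 1.0   # trigger when excess kurtosis is above this
INLIER_LIMIT = 0.55    # trigger when the inlier proportion is above this
RANGE_LIMIT = 3.0      # trigger when (max - min) / SD is below this


def excess_kurtosis(xs):
    """Bias-corrected sample excess kurtosis (G2); 0 for a normal population."""
    n = len(xs)
    mean = statistics.fmean(xs)
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    g2 = m4 / m2 ** 2 - 3.0
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)


def check_column(values):
    """Return the three check results for one column, or None if it is too short."""
    xs = [float(v) for v in values if v is not None and not math.isnan(v)]
    if len(xs) < MIN_VALUES:
        return None
    if min(xs) == max(xs):
        # Constant column: non-peaked for kurtosis, fully clustered for the
        # inlier check, undefined (None) for range. Only one check fires.
        return {"kurtosis": False, "inlier": True, "range": None, "flagged": False}
    mean = statistics.fmean(xs)
    sd = statistics.stdev(xs)  # sample SD; the choice of denominator is an assumption
    inlier_prop = sum(abs(x - mean) <= 0.5 * sd for x in xs) / len(xs)
    checks = {
        "kurtosis": excess_kurtosis(xs) > KURTOSIS_LIMIT,
        "inlier": inlier_prop > INLIER_LIMIT,
        "range": (max(xs) - min(xs)) / sd < RANGE_LIMIT,
    }
    # Two of the three checks must agree before the column is flagged.
    checks["flagged"] = sum(checks.values()) >= 2
    return checks
```

For example, a column of thirty values with twenty-four at 50 and three each at 49 and 51 fires the kurtosis and inlier checks but not the range check, so it is flagged; thirty evenly spaced values fire none.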

How it works

Layer 2 (contextual): each qualifying column is summarised by its mean and SD. The kurtosis check adds a trigger when unbiased excess kurtosis exceeds 1, the inlier check when the proportion within half an SD exceeds 0.55, and the range check when the studentized range falls below 3. A fourth trigger comes from a binomial test of the inlier count against the 0.3829 normal baseline, which fires when its upper-tail probability is below one in a million. Two or more triggers flag the column. The fraction of flagged columns maps to the score: below 0.20 gives 0, 0.20 to 0.50 gives 2.0, and above 0.50 gives 4.0, with the score capped at 5.0. Each flagged column yields a finding naming which checks fired, with severity rising above the mild band. Metadata records columns_tested, columns_flagged, avg_inlier_pct, avg_range_sd_ratio, expected_inlier_pct_normal (the 0.383 normal baseline against which the 0.55 inlier threshold is read), and min_inlier_binomial_p.
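The binomial trigger and the score mapping can be sketched as follows. The function names are hypothetical, the boundary handling (0.20 and 0.50 both falling in the middle band) is one reading of the bands above, and whatever raises a score from 4.0 towards the 5.0 cap is not specified here, so the sketch applies only the cap.

```python
import math

NORMAL_INLIER_P = 0.3829   # P(|Z| < 0.5) for a standard normal
BINOMIAL_ALPHA = 1e-6      # "one in a million"
SCORE_CAP = 5.0


def binomial_upper_tail(k, n, p=NORMAL_INLIER_P):
    """P(X >= k) for X ~ Binomial(n, p), with each term built in log space."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


def binomial_trigger(inlier_count, n):
    """Fourth trigger: far more inliers than normal data would produce."""
    return binomial_upper_tail(inlier_count, n) < BINOMIAL_ALPHA


def column_flagged(triggers):
    """triggers: iterable of booleans, with None for an undefined check."""
    return sum(1 for t in triggers if t) >= 2


def score_from_fraction(flagged_fraction):
    """Map the fraction of flagged columns to the base score."""
    if flagged_fraction < 0.20:
        base = 0.0
    elif flagged_fraction <= 0.50:
        base = 2.0
    else:
        base = 4.0
    return min(base, SCORE_CAP)
```

The one-in-a-million bar is strict: with thirty values, 24 inliers (80%) clears the 0.55 proportion threshold but does not fire the binomial trigger, whereas 80 inliers out of 100 does.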

Why this matters

A consistent finding in the study of fabricated data is that people cannot imitate the spread of real measurement: inventing numbers, they gravitate to the centre and under-produce extremes. Mosimann and colleagues showed directly that individuals asked to generate data fail to reproduce the variability and tail behaviour of genuine distributions. Simonsohn turned this into a detection method, catching fabrication from implausibly low variability, and Al-Marzouki and colleagues used the spread and shape of trial variables to separate fabricated from genuine data. D4's three checks capture complementary faces of this single failure (the sharp peak, the over-density near the mean, the missing tails); requiring two of three before flagging, and weighing how many columns are affected, guards against the occasional naturally peaked variable.

Score thresholds

0: Fewer than a fifth of columns show inlier clustering, consistent with real spread.
2: Between a fifth and a half of columns are unnaturally clustered.
4-5: More than half of columns are unnaturally clustered, a signature of manual fabrication.

Limitations

Requires individual-patient data with numeric columns of at least thirty values, so a small or summary-only study is out of scope. The statistics rest on the mean, standard deviation, and fourth moment, which are sensitive to the very outliers whose absence is being probed, so a single extreme value can raise kurtosis and inflate the standard deviation in interacting ways; requiring two triggers mitigates but does not remove this. Genuinely peaked or heavy-tailed real distributions (such as a Laplace-shaped biomarker) can legitimately exceed the kurtosis and inlier thresholds, so a flag is a prompt for inspection, not proof of fabrication. The studentized-range threshold of 3 is applied regardless of sample size even though the expected range grows with n, which makes the check conservative (harder to trigger) for large columns. The thresholds are directional. The dataset-level too-clean signals on summary statistics are covered by S15 and S16.
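The sample-size point can be illustrated with a small simulation. This is a sketch only: the function name, replication count, and seed are arbitrary choices made for the example, and the data are standard normal draws rather than anything from a real trial.

```python
import math
import random


def mean_studentized_range(n, reps=200, seed=0):
    """Average (max - min) / SD over simulated standard-normal samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(xs) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
        total += (max(xs) - min(xs)) / sd
    return total / reps


for n in (30, 100, 1000):
    print(n, round(mean_studentized_range(n), 2))
```

The printed averages rise with n, so a fixed cut-off of 3 sits further below the typical genuine value as columns get longer.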

References

  1. Mosimann JE, Wiseman CV, Edelman RE. (1995). Data fabrication: Can people generate random digits? Accountability in Research 4(1):31-55
  2. Simonsohn U. (2013). Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science 24(10):1875-1888
  3. Al-Marzouki S, Evans S, Marshall T, Roberts I. (2005). Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ 331(7511):267-270
  4. Carlisle JB. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia 72(8):944-952
  5. Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380
  6. Carlisle JB. (2021). False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia 76(4):472-479
  7. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
  8. Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
  9. Knepper D, Lindblad AS, Sharma G, et al. (2016). Statistical Monitoring in Clinical Trials: Best Practices for Detecting Data Anomalies Suggestive of Fabrication or Misconduct. Therapeutic Innovation & Regulatory Science 50(2):144-154