Inlier Clustering
This indicator looks at each numeric variable in the raw data and checks whether its values huddle too tightly around the average. People who invent data tend to pick numbers near the middle and shy away from the extremes that real measurement produces, so a fabricated column is often too peaked, with too many values close to the mean and an implausibly narrow spread. The indicator runs three peakedness checks per column and flags a column when at least two agree, then scores the dataset by how many columns are affected. It runs only when individual-patient data are available.
Technical description
D4 is a contextual screen for the inlier-clustering signature of manually fabricated data, in which values pack around the centre and the natural tails are missing. For each numeric column of the individual-patient data with at least thirty non-missing values, it runs three checks. The excess-kurtosis check, computed with the unbiased estimator so that a normal distribution sits at zero, triggers on a value above 1, indicating a distribution more peaked than normal. The inlier-proportion check, on the fraction of values lying within half a standard deviation of the mean, triggers on a value above 0.55, well above the roughly 0.38 expected for normal data. The studentized-range check, on the span from minimum to maximum divided by the standard deviation, triggers on a value below 3, since genuine data of this size typically spreads over four to six standard deviations. A column is flagged when at least two of the three checks trigger, and the score reflects the fraction of columns flagged.
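The three statistics described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the indicator's actual code: the function name `column_checks` and the dictionary keys are hypothetical, the unbiased kurtosis estimator is assumed to be the usual bias-corrected sample excess kurtosis, the standard deviation is assumed to be the sample (n - 1) version, and "within half a standard deviation" is read as inclusive. The constant-column handling follows the "How it works" section.

```python
import math
from statistics import fmean, stdev

MIN_N = 30  # minimum non-missing values per column, as stated in the text


def column_checks(values):
    """Return the three peakedness statistics for one column, or None if too short."""
    xs = [float(v) for v in values if v is not None]
    xs = [x for x in xs if not math.isnan(x)]
    n = len(xs)
    if n < MIN_N:
        return None
    if min(xs) == max(xs):
        # Constant column: non-peaked kurtosis, fully clustered, range undefined.
        return {"n": n, "excess_kurtosis": 0.0, "inlier_count": n,
                "inlier_prop": 1.0, "studentized_range": None}

    mean = fmean(xs)
    sd = stdev(xs)  # assumption: sample standard deviation (n - 1 denominator)

    # Biased central moments, then the bias-corrected excess kurtosis
    # (assumed form of "the unbiased estimator"; zero for a normal distribution).
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    g2 = m4 / (m2 * m2) - 3.0
    excess_kurtosis = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)

    # Values within half a standard deviation of the mean (inclusive, an assumption).
    inlier_count = sum(1 for x in xs if abs(x - mean) <= 0.5 * sd)

    return {"n": n, "excess_kurtosis": excess_kurtosis,
            "inlier_count": inlier_count, "inlier_prop": inlier_count / n,
            "studentized_range": (max(xs) - min(xs)) / sd}
```

For example, `column_checks(range(30))` describes a flat, evenly spaced column: its excess kurtosis is negative and its inlier proportion is below the normal baseline, so none of the three thresholds would trigger.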
How it works
Each qualifying column is summarised by its mean and standard deviation, and the three checks are evaluated. A constant column, having no spread, is treated as non-peaked for the kurtosis check, counts as fully clustered for the inlier check, and has no defined studentized range. The kurtosis check adds a trigger when excess kurtosis exceeds 1, the inlier check when the proportion within half a standard deviation exceeds 0.55, and the range check when the studentized range falls below 3. Beyond these three, a fourth trigger comes from a binomial test of the inlier count against the normal baseline. Under normality, the number of values within half a standard deviation follows a binomial distribution with success probability 0.3829, so an upper-tail probability below one in a million marks a count too high to be a normal sample. The cutoff is deliberately strict so that it fires only on genuine over-clustering. Two or more triggers flag the column.
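The trigger logic above can be sketched as follows. The names `binom_upper_tail`, `column_triggers`, and `is_flagged` are hypothetical, and the upper tail is assumed to be P(X >= k), which the text does not specify. The binomial tail is summed in log space with `math.lgamma` so that long columns do not overflow.

```python
import math

KURTOSIS_MAX = 1.0        # excess kurtosis above this triggers
INLIER_PROP_MAX = 0.55    # inlier proportion above this triggers
RANGE_MIN = 3.0           # studentized range below this triggers
NORMAL_INLIER_P = 0.3829  # normal baseline for |z| <= 0.5, from the text
BINOM_ALPHA = 1e-6        # "one in a million" cutoff, from the text


def binom_upper_tail(k, n, p=NORMAL_INLIER_P):
    """P(X >= k) for X ~ Binomial(n, p); the >= convention is an assumption."""
    if k <= 0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    terms = (
        math.exp(math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                 + i * log_p + (n - i) * log_q)
        for i in range(k, n + 1)
    )
    return min(1.0, math.fsum(terms))


def column_triggers(excess_kurtosis, inlier_prop, studentized_range, n, inlier_count):
    """Return the names of the triggers that fire for one column."""
    fired = []
    if excess_kurtosis > KURTOSIS_MAX:
        fired.append("kurtosis")
    if inlier_prop > INLIER_PROP_MAX:
        fired.append("inlier_proportion")
    # A constant column has no defined range, so this check is skipped for it.
    if studentized_range is not None and studentized_range < RANGE_MIN:
        fired.append("studentized_range")
    if binom_upper_tail(inlier_count, n) < BINOM_ALPHA:
        fired.append("inlier_binomial")
    return fired


def is_flagged(fired):
    """Two or more triggers flag the column."""
    return len(fired) >= 2
```

A column of 100 values with 80 of them inside half a standard deviation fires both the proportion trigger and the binomial trigger, which is already enough to flag it, while a column with 40 of 100 inliers, modest kurtosis, and a range of 4.5 standard deviations fires nothing.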
The fraction of flagged columns maps to the score: below 0.20 scores 0, from 0.20 up to 0.50 scores 2.0, and above 0.50 scores 4.0, capped at 5.0. Each flagged column produces a finding naming which checks fired and their values, with severity rising once the score exceeds the mild band. The metadata records the columns tested, the columns flagged, the average inlier proportion, the average studentized range, the normal baseline inlier proportion of 0.383 against which the 0.55 threshold is read, and the smallest inlier binomial tail probability across columns.
Score thresholds
| Score | Meaning |
|---|---|
| 0 | Fewer than a fifth of columns show inlier clustering, consistent with real spread. |
| 2 | Between a fifth and a half of columns are unnaturally clustered. |
| 4 to 5 | More than half of columns are unnaturally clustered, a signature of manual fabrication. |
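The mapping from the fraction of flagged columns to the score can be sketched as below. The function name `dataset_score` is hypothetical, a fraction of exactly 0.50 is read as falling in the middle band (the text says "above 0.50" for the top band), and returning 0 when no column qualifies is an assumption.

```python
SCORE_CAP = 5.0  # upper cap stated in the text


def dataset_score(n_flagged, n_tested):
    """Map the fraction of flagged columns to the D4 score bands."""
    if n_tested == 0:
        return 0.0  # assumption: nothing to test means nothing to score
    fraction = n_flagged / n_tested
    if fraction < 0.20:
        score = 0.0
    elif fraction <= 0.50:
        score = 2.0
    else:
        score = 4.0
    return min(score, SCORE_CAP)
```

With ten tested columns, one flagged column scores 0, two to five score 2.0, and six or more score 4.0.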
Why this matters
A consistent finding in the study of fabricated data is that people cannot imitate the spread of real measurement: when inventing numbers they gravitate to the centre and under-produce extremes. Mosimann and colleagues demonstrated this directly, showing that individuals asked to generate data fail to reproduce the variability and tail behaviour of genuine distributions [1]. Simonsohn turned the same idea into a detection method, catching fabrication from summary statistics whose variability was implausibly low for real sampling [2], and Al-Marzouki and colleagues used the spread and shape of trial variables to separate fabricated from genuine datasets [3]. The three checks in D4 capture complementary faces of this single failure: excess kurtosis sees the sharp central peak, the inlier proportion sees the over-density near the mean, and the studentized range sees the missing tails. Requiring two of the three to agree before flagging a column, and then weighing how many columns are affected, guards against the occasional naturally peaked variable while remaining sensitive to a dataset that is clustered throughout. Recent forensic re-analyses, scoping reviews, and trustworthiness instruments place insufficient-variability and inlier-clustering checks among the standard screens for fabricated data [4, 5, 6, 7, 8], and Knepper and colleagues name inlier detection explicitly among the best-practice statistical-monitoring methods for flagging data anomalies suggestive of fabrication in clinical trials [9].
Limitations
The check requires individual-patient data with numeric columns of at least thirty values, so a small or summary-only study is outside its scope. The three statistics are built on the mean and standard deviation and on the fourth moment, which are sensitive to the very outliers whose absence the indicator is probing, so a single extreme value can simultaneously raise kurtosis and inflate the standard deviation in ways that interact; requiring two triggers mitigates but does not remove this. Genuinely peaked or heavy-tailed real distributions, such as a Laplace-shaped biomarker, can legitimately exceed the kurtosis and inlier thresholds, so a flag is a prompt to inspect rather than proof. The studentized-range threshold of 3 is a single value applied regardless of sample size, even though the expected range grows with the number of observations, making the check conservative for large columns. The thresholds are directional. Dataset-level too-clean signals on summary statistics are covered by indicators S15 and S16, which also check the absence of outliers as one facet, so D4 stays on per-column inlier clustering in the raw data.
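The point about the fixed range threshold can be illustrated numerically. The sketch below computes the expected range of n independent standard-normal draws, in units of the population standard deviation, from the standard identity E[range] = ∫ (1 - Φ(x)^n - (1 - Φ(x))^n) dx, using a simple trapezoid rule. It is only an approximation to the studentized range D4 uses, which divides by the sample standard deviation, and the function names are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def expected_range_in_sd(n, lo=-10.0, hi=10.0, steps=4000):
    """Expected max - min of n standard-normal draws, by trapezoid integration."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        phi = normal_cdf(lo + i * h)
        f = 1.0 - phi ** n - (1.0 - phi) ** n
        # Endpoints get half weight in the trapezoid rule.
        total += f if 0 < i < steps else 0.5 * f
    return total * h


if __name__ == "__main__":
    for n in (30, 100, 1000):
        print(n, round(expected_range_in_sd(n), 2))
```

The expected range is roughly 4.1 standard deviations at n = 30 and roughly 5.0 at n = 100, and it keeps growing with n, so a range below 3 becomes progressively rarer in genuine data as columns get longer.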
Theoretical background
D4 rests on the way real measurement fills out a distribution that fabrication leaves hollow. A genuine variable arises from many small influences and from sampling, so its values spread across several standard deviations, place a predictable minority near the mean, and occasionally reach the tails; for a roughly normal variable, about thirty-eight percent of values fall within half a standard deviation of the mean and the full range covers four to six standard deviations once the sample is moderately large. Hand-fabricated data departs from this because a person choosing plausible numbers anchors on the central value and rarely commits to an extreme, so the resulting column is leptokurtic, over-dense at the centre, and short in the tails. The three checks measure these departures through different moments and order statistics: kurtosis is a fourth-moment measure of peakedness, the inlier proportion is a direct count of central density, and the studentized range is an order-statistic measure of spread. Because each can be triggered occasionally by a real but unusual variable, the indicator treats them as a small panel, requiring agreement between at least two before flagging a column, and then asks the dataset-level question of whether clustering is pervasive, since a fabricator's habit shows up across many columns at once rather than in one. Reading the inlier proportion as a binomial draw against the normal central density of 0.3829 turns the over-density check into a significance test, so a column whose count within half a standard deviation is improbable under normality is flagged on statistical grounds rather than against a fixed percentage alone.
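The 0.3829 baseline quoted above is the probability that a standard normal value lies within half a standard deviation of its mean, which can be checked with the error function from the standard library. The function name `normal_inlier_baseline` is hypothetical.

```python
import math


def normal_inlier_baseline(half_width_sd=0.5):
    """P(|Z| <= half_width_sd) for a standard normal Z, equal to erf(c / sqrt(2))."""
    return math.erf(half_width_sd / math.sqrt(2.0))


if __name__ == "__main__":
    p = normal_inlier_baseline()
    print(round(p, 4))       # normal baseline for the inlier proportion
    print(round(30 * p, 1))  # expected inlier count in a 30-value normal column
```

Against this baseline, the 0.55 threshold corresponds to about 16 or 17 inliers in a column of thirty, where a normal sample would be expected to show about 11 or 12.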
References
- [1] Mosimann JE, Wiseman CV, Edelman RE. Data fabrication: Can people generate random digits? Accountability in Research. 1995;4(1):31-55. DOI: 10.1080/08989629508573866
- [2] Simonsohn U. Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science. 2013;24(10):1875-1888. DOI: 10.1177/0956797613480366
- [3] Al-Marzouki S, Evans S, Marshall T, Roberts I. Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ. 2005;331(7511):267-270. DOI: 10.1136/bmj.331.7511.267
- [4] Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
- [5] Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
- [6] Carlisle JB. False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia. 2021;76(4):472-479. DOI: 10.1111/anae.15263
- [7] Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
- [8] Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
- [9] Knepper D, Lindblad AS, Sharma G, et al. Statistical Monitoring in Clinical Trials: Best Practices for Detecting Data Anomalies Suggestive of Fabrication or Misconduct. Therapeutic Innovation & Regulatory Science. 2016;50(2):144-154. DOI: 10.1177/2168479016630576