ResAIKit
Research Integrity Toolkit
D13 | Statistical analysis | Fabrication Detection | Layer 2 (Contextual)

Heteroscedasticity Absent

Checks whether a variable's spread stays suspiciously constant across the low, middle, and high parts of its value range. Real measurements usually spread out more in some parts of their range than others, but values generated from a single simple distribution, a common machine-fabrication shortcut, tend to be uniformly spread throughout. The indicator sorts each numeric column, splits it into thirds by value, compares the variance within each third, and flags columns whose variance is nearly identical across the thirds. It works on the individual-patient data.

Technical description

D13 is a contextual screen for an implausibly uniform spread across a variable's value range, a univariate proxy for the absence of the variance heterogeneity that real data usually shows. It requires at least twenty rows and at least three numeric columns, each with a non-trivial standard deviation and at least ten distinct values; the distinct-value requirement keeps the variance comparison from being dominated by ties in binary or ordinal-coded columns. For each qualifying column it sorts the values, splits them into three equal-count thirds by value, computes the variance within each third, and forms the ratio of the largest to the smallest of the three variances. A column whose ratio is below 1.5, meaning the spread is nearly the same in every third, is classed as suspiciously homogeneous. The proportion of columns so classed maps to the score, with a high proportion suggesting data drawn from a fixed simple distribution rather than measured.
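The eligibility rules and the per-column ratio described above can be sketched as follows. This is a minimal illustration, not the toolkit's own code: the function names, the `table` structure (a dict of column name to list of numbers), the cut-off used for a "non-trivial" standard deviation, the variance floor, and the handling of row counts not divisible by three are all assumptions.

```python
from statistics import pstdev, variance

MIN_ROWS = 20            # stated in the text
MIN_COLUMNS = 3          # stated in the text
MIN_DISTINCT = 10        # stated in the text
RATIO_THRESHOLD = 1.5    # stated in the text
MIN_SD = 1e-9            # assumption: the text says only "non-trivial"
VARIANCE_FLOOR = 1e-12   # assumption: guards against a zero denominator


def qualifies(values):
    """Return True if a numeric column is eligible for the check."""
    return (
        len(values) >= MIN_ROWS
        and len(set(values)) >= MIN_DISTINCT
        and pstdev(values) > MIN_SD
    )


def thirds_variance_ratio(values):
    """Largest over smallest sample variance of the sorted, equal-count thirds."""
    ordered = sorted(values)
    k = len(ordered) // 3  # assumption: leftover rows fall into the top third
    thirds = [ordered[:k], ordered[k:2 * k], ordered[2 * k:]]
    variances = [variance(t) for t in thirds]
    return max(variances) / max(min(variances), VARIANCE_FLOOR)


def homogeneous_columns(table):
    """Return (checked, flagged) column names for a hypothetical table.

    `table` maps column name -> list of numbers. If fewer than MIN_COLUMNS
    columns qualify, the check does not run and both lists are empty.
    """
    checked = [name for name, col in table.items() if qualifies(col)]
    if len(checked) < MIN_COLUMNS:
        return [], []
    flagged = [
        name for name in checked
        if thirds_variance_ratio(table[name]) < RATIO_THRESHOLD
    ]
    return checked, flagged
```

In this sketch a binary column is dropped by the distinct-value rule before any variance is computed, which is the tie-avoidance behaviour the description refers to.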

How it works

All values of each qualifying column are sorted and divided into low, middle, and high thirds. The within-third variances are computed with the sample formula, and the homogeneity ratio is the largest variance divided by the smallest, with the denominator floored at a small positive constant so that a zero variance cannot cause a division by zero. A ratio below 1.5 marks the column as suspiciously homogeneous. The median-centred Levene test of Brown and Forsythe is also computed across the thirds, and its median p-value across columns is reported [4]. The proportion of homogeneous columns maps to the score: above 0.90 scores 4.0, above 0.75 scores 3.0, above 0.60 scores 2.0, above 0.45 scores 1.0, and otherwise 0. When five or more columns are checked and every one is homogeneous, a further half point is added. When the median Brown-Forsythe p-value across columns exceeds 0.95, so that equality of the within-third variances is not rejected for the typical column, another half point is added in corroboration. The total is capped at 5.0. A finding is raised once the proportion exceeds 0.45, listing the affected columns. The metadata records the columns checked, the number classed homogeneous, the proportion, and the median Brown-Forsythe p-value across columns.
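The score mapping can be written out as a short function. The sketch below follows the thresholds given in the text; the function name, the returned dictionary, and the choice to apply the two half-point additions regardless of the base score are assumptions, since the text does not say whether they depend on it.

```python
def d13_score(n_checked, n_homogeneous, median_bf_p=None):
    """Map the proportion of homogeneous columns to a D13 score (sketch).

    n_checked      -- number of qualifying columns examined
    n_homogeneous  -- number of those with a thirds-variance ratio below 1.5
    median_bf_p    -- median Brown-Forsythe p-value across columns, if known
    """
    if n_checked == 0:
        return {"score": 0.0, "proportion": 0.0, "finding": False}

    proportion = n_homogeneous / n_checked

    # Base score from the proportion, using the strict thresholds in the text.
    if proportion > 0.90:
        score = 4.0
    elif proportion > 0.75:
        score = 3.0
    elif proportion > 0.60:
        score = 2.0
    elif proportion > 0.45:
        score = 1.0
    else:
        score = 0.0

    # Half point when five or more columns are checked and all are homogeneous.
    if n_checked >= 5 and n_homogeneous == n_checked:
        score += 0.5

    # Half point of corroboration from a very high median p-value.
    # Assumption: applied independently of the base score.
    if median_bf_p is not None and median_bf_p > 0.95:
        score += 0.5

    return {
        "score": min(score, 5.0),       # total is capped at 5.0
        "proportion": proportion,
        "finding": proportion > 0.45,   # finding raised above 0.45
    }
```

For example, ten checked columns that are all homogeneous with a median p-value of 0.99 reach the cap of 5.0, while eight of ten with an unremarkable p-value score 3.0.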

Score thresholds

Score     Meaning
0 to 1    Variance differs across the value range as in most real data.
2 to 3    Many columns show suspiciously uniform spread across their range.
4 to 5    Almost all columns have near-identical variance across thirds, consistent with a fixed generating distribution.

Why this matters

Real measured variables rarely have the same spread everywhere along their range: physiological and behavioural quantities tend to fan out at higher values, pile up against natural limits, or mix subpopulations, so the local variability changes across the range. Micceri's survey of real datasets documented exactly this departure from clean, idealised shapes [1]. Data generated by sampling each value independently from one fixed distribution, the default behaviour of a model asked to produce a dataset, carries that distribution's uniform structure across its whole range, so the spread is homogeneous by construction. Taloni and colleagues showed that a model can fabricate a clinical dataset whose variables are far too regular to be real [2], and the broader literature treats implausible statistical regularity as a fabrication signal [3]. A dataset in which nearly every column has constant spread across its range is therefore more consistent with synthetic generation than with measurement. Recent forensic re-analyses, scoping reviews, and trustworthiness instruments include implausibly regular variance structure among the standard screens for fabricated and machine-generated data [5, 6, 7, 8].

Limitations

This is a univariate proxy, not a test of true conditional heteroscedasticity, which concerns how the variance of a response changes with a predictor and requires a paired predictor that a single column does not provide. What the indicator actually measures is whether the within-range dispersion of one variable is constant across its own sorted value range, which detects uniform-like or single-distribution data but does not correspond to the regression notion the name evokes. The thirds-variance ratio is sensitive near its 1.5 threshold, so the classification of a borderline column can hinge on the exact split. A genuinely uniformly distributed real variable, or a bounded score, will be classed as homogeneous even though it is real, so a flag is a screening signal rather than proof. The check needs at least twenty rows and excludes constant and low-cardinality columns. The thresholds are heuristic. Related too-clean distributional signals are handled by indicators D2 and D28, so D13 is restricted to constancy of spread across the value range.
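To make concrete what the thirds-variance ratio responds to, the sketch below applies it to three idealised columns built from evenly spaced quantiles, so the result does not depend on a random seed. The helper name, the column size of 300, the variance floor, and the choice of shapes are illustrative assumptions, not part of the indicator.

```python
from math import exp
from statistics import NormalDist, variance


def thirds_variance_ratio(values, floor=1e-12):
    """Largest over smallest sample variance of the sorted, equal-count thirds."""
    ordered = sorted(values)
    k = len(ordered) // 3
    variances = [
        variance(ordered[:k]),
        variance(ordered[k:2 * k]),
        variance(ordered[2 * k:]),
    ]
    return max(variances) / max(min(variances), floor)


n = 300
probs = [(i + 0.5) / n for i in range(n)]           # evenly spaced in (0, 1)
z = [NormalDist().inv_cdf(p) for p in probs]        # standard normal quantiles

columns = {
    "uniform-shaped": probs,
    "normal-shaped": z,
    "log-normal-shaped": [exp(v) for v in z],
}

for name, col in columns.items():
    ratio = thirds_variance_ratio(col)
    label = "homogeneous" if ratio < 1.5 else "not homogeneous"
    print(f"{name:18s} ratio = {ratio:8.2f}  ({label})")
```

Under these assumptions the evenly spaced column gives a ratio of about 1 and is classed as homogeneous, whereas the normal-shaped and log-normal-shaped columns give ratios well above 1.5. This matches the point that the indicator responds to uniform-like spread within a single column rather than to conditional heteroscedasticity, and that a genuinely uniform real variable would be flagged.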

Theoretical background

D13 rests on how the local spread of a variable behaves across its range. For a variable whose values are independent draws from a single distribution, any contiguous slice of the value range that is wide enough to contain many points reflects the same underlying generating process, and for some distributions, notably the uniform, the variance of equal-count slices is nearly constant. Real measured variables more often arise from mixtures and from bounded or multiplicative processes, so their density and local variability change across the range, and the variance of a low slice differs from that of a high slice. Sorting the column and partitioning it into equal-count thirds approximates these slices, and the ratio of the largest to the smallest within-third variance is a coarse measure of how much the spread changes. A ratio near one indicates a flat, single-distribution structure, while a larger ratio indicates the changing spread typical of real data. The method is deliberately simple and operates on one column at a time, which is its strength for screening and its limitation in rigour, since it cannot see the predictor-conditional variance that the term heteroscedasticity properly denotes. Excluding low-cardinality columns keeps the variance estimates meaningful rather than tie-dominated. The Brown-Forsythe test recasts the same thirds comparison as a formal, median-robust test of equal variance. An implausibly high median p-value across columns, meaning that equality of variance is not rejected for the typical column, adds to the score and supplements the ratio heuristic with a formal homogeneity statement.
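The Brown-Forsythe statistic for the thirds comparison can be sketched with the standard library alone. It is a one-way ANOVA F statistic computed on absolute deviations from each group's median [4]. The function names and the thirds split are assumptions for illustration; the p-value needs the upper tail of an F distribution with k - 1 and N - k degrees of freedom, which the standard library does not provide. If SciPy is available, `scipy.stats.levene` called with `center="median"` returns both the statistic and the p-value.

```python
from statistics import mean, median


def brown_forsythe_statistic(groups):
    """Brown-Forsythe W for equality of variance across groups.

    Raises ZeroDivisionError if every group has identical absolute
    deviations (no within-group variation in the deviations).
    """
    # Absolute deviations of each value from its own group's median.
    deviations = []
    for g in groups:
        m = median(g)
        deviations.append([abs(x - m) for x in g])

    k = len(deviations)
    n_total = sum(len(d) for d in deviations)
    group_means = [mean(d) for d in deviations]
    grand_mean = sum(sum(d) for d in deviations) / n_total

    between = sum(
        len(d) * (gm - grand_mean) ** 2
        for d, gm in zip(deviations, group_means)
    )
    within = sum(
        (v - gm) ** 2
        for d, gm in zip(deviations, group_means)
        for v in d
    )
    return (n_total - k) / (k - 1) * between / within


def thirds(values):
    """Sorted equal-count thirds; leftover rows fall into the top third."""
    ordered = sorted(values)
    k = len(ordered) // 3
    return [ordered[:k], ordered[k:2 * k], ordered[2 * k:]]


# Evenly spaced column: identical spread in every third, so W is zero.
flat = brown_forsythe_statistic(thirds(range(1, 31)))

# Spread grows with the value: W is clearly positive.
fanning = brown_forsythe_statistic(thirds([x ** 2 for x in range(1, 31)]))
```

A statistic near zero corresponds to a p-value near one, which is the situation the corroborating half point is meant to capture.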

References

  1. Micceri T. The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin. 1989;105(1):156-166. DOI: 10.1037/0033-2909.105.1.156
  2. Taloni A, Scorcia V, Giannaccare G. Large Language Model Advanced Data Analysis Abuse to Create a Fake Data Set in Medical Research. JAMA Ophthalmology. 2023;141(12):1174-1175. DOI: 10.1001/jamaophthalmol.2023.5162
  3. Simonsohn U. Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science. 2013;24(10):1875-1888. DOI: 10.1177/0956797613480366
  4. Brown MB, Forsythe AB. Robust tests for the equality of variances. Journal of the American Statistical Association. 1974;69(346):364-367. DOI: 10.1080/01621459.1974.10482955
  5. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
  6. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
  7. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
  8. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861