D31 | Statistical analysis | Fabrication Extended | Layer 2 (Contextual)

Carlisle-Stouffer IPD

Applies the Carlisle-Stouffer method of baseline balance testing to the full individual-patient data (IPD), not just to the baseline p-values reported in the paper.

Technical description

Under genuine randomisation, comparing two trial arms on each baseline variable gives p-values that are approximately Uniform[0,1], because any differences between the arms arise by chance. Fabrication betrays itself in two opposite ways: arms drawn from clearly different populations give too many significant differences, and arms whose means were forced to match give p-values bunched near 1. D31 applies Carlisle's baseline-distribution testing, with Stouffer's pooling of p-values, to the individual-patient data (IPD) rather than to the printed baseline table. It locates a group column, runs a Welch t-test between the arms on each numeric baseline column, and checks the collected p-values with the Kolmogorov-Smirnov test against Uniform[0,1] and with Stouffer's combined Z. A Cramer-von Mises test against Uniform[0,1], more sensitive than Kolmogorov-Smirnov in the tails, is also applied, and the more significant of the two uniformity tests drives the uniformity score.
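As an illustration of the pipeline described above, here is a minimal sketch using NumPy and SciPy. It is not D31's actual code: the function names, the dict-of-arrays input, the order-of-appearance rule for picking the first two levels, the clipping constant eps, and the sign convention of the Stouffer Z are all assumptions made for the example.

```python
import numpy as np
from scipy import stats


def baseline_pvalues(columns, groups):
    """Welch t-test p-value per numeric baseline column (hypothetical helper).

    columns: dict mapping column name -> 1-D numeric array (one value per patient)
    groups:  1-D array of arm labels, same length as each column
    """
    groups = np.asarray(groups)
    # Assumption: "first two levels" means the first two labels in order of appearance.
    levels = list(dict.fromkeys(groups.tolist()))[:2]
    in_a, in_b = groups == levels[0], groups == levels[1]
    pvals = {}
    for name, values in columns.items():
        x = np.asarray(values, dtype=float)
        a, b = x[in_a], x[in_b]
        a, b = a[~np.isnan(a)], b[~np.isnan(b)]
        if a.size < 2 or b.size < 2:
            continue  # too few observations for a t-test
        if np.ptp(a) == 0 and np.ptp(b) == 0:
            continue  # constant in both arms: skipped
        p = stats.ttest_ind(a, b, equal_var=False).pvalue  # Welch's t-test
        if np.isfinite(p):
            pvals[name] = float(p)
    return pvals


def uniformity_summary(pvals, eps=1e-10):
    """KS and Cramer-von Mises tests against Uniform[0,1], plus Stouffer's Z."""
    # Clip away from 0 and 1 so the inverse-normal transform stays finite.
    p = np.clip(np.asarray(list(pvals), dtype=float), eps, 1 - eps)
    ks = stats.kstest(p, "uniform")
    cvm = stats.cramervonmises(p, "uniform")
    # Assumed sign convention: p-values near 1 give positive Z.
    stouffer_z = stats.norm.ppf(p).sum() / np.sqrt(p.size)
    return {
        "n": int(p.size),
        "ks_stat": float(ks.statistic),
        "ks_p": float(ks.pvalue),
        "cvm_stat": float(cvm.statistic),
        "cvm_p": float(cvm.pvalue),
        "uniformity_p": float(min(ks.pvalue, cvm.pvalue)),  # more significant test
        "stouffer_z": float(stouffer_z),
    }
```

A call such as uniformity_summary(baseline_pvalues(columns, groups).values()) would then return the quantities that feed the score. Under the sign convention assumed here, forced balance (p-values near 1) shows up as a large positive Z and an excess of significant differences as a large negative Z.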

How it works

Layer 2 (contextual). A group column is detected from keywords (group, treatment, arm, allocation), and its first two levels define the arms. If no such column is found, the rows are split into halves by index and a one-point proxy penalty applies. A Welch t-test on each numeric baseline column yields one p-value; columns that are constant in both arms are skipped, and at least five p-values are required.

The score is built from the following components:

Uniformity: the more significant of the Kolmogorov-Smirnov and Cramer-von Mises tests against Uniform[0,1] adds 2.5 if its p-value is below 0.01, or 1.5 if it is below 0.05.
Stouffer's combined Z: the sum of the inverse-normal transforms of the p-values divided by the square root of their count, with p-values clipped away from 0 and 1 so the transform stays finite. It adds 1.5 when its absolute value exceeds 3.
Significant proportion: a proportion of significant p-values above 30% adds 1.0; a proportion below 0.1% across at least ten tests adds 1.5.
Mean: a mean p-value more than 0.20 away from 0.5 adds 0.5.

The total is capped at 5.0 after the proxy penalty is applied.

Metadata records the p-value count, the proportions of significant and of high p-values, the mean, the Kolmogorov-Smirnov statistic and p-value, the Stouffer Z, the Cramer-von Mises statistic and p-value, the group column, and whether the proxy split was used.
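The detection and scoring rules above can be sketched as follows. This is a hedged illustration, not the indicator's implementation. The names find_group_column and d31_score are hypothetical, and several details are assumptions the text does not settle: keywords are matched as case-insensitive substrings, "significant" means p below 0.05, the proxy penalty is subtracted from the score, the score is floored at 0, and fewer than five p-values returns None.

```python
import numpy as np

GROUP_KEYWORDS = ("group", "treatment", "arm", "allocation")


def find_group_column(column_names):
    """Return the first column whose name contains a group keyword, else None.

    Assumption: case-insensitive substring match, so a name like "pharmacy"
    would also match "arm" (the wrong-column risk noted under Limitations).
    """
    for name in column_names:
        lowered = str(name).lower()
        if any(keyword in lowered for keyword in GROUP_KEYWORDS):
            return name
    return None


def d31_score(pvals, ks_p, cvm_p, stouffer_z, used_proxy=False):
    """Combine the components into a 0-5 score (hypothetical sketch)."""
    p = np.asarray(pvals, dtype=float)
    if p.size < 5:
        return None  # assumption: too few comparisons means "not scored"

    score = 0.0

    # Uniformity: the more significant of the two tests drives this component.
    uniformity_p = min(ks_p, cvm_p)
    if uniformity_p < 0.01:
        score += 2.5
    elif uniformity_p < 0.05:
        score += 1.5

    # Stouffer's combined Z, in either direction.
    if abs(stouffer_z) > 3:
        score += 1.5

    # Proportion of significant p-values (assumed threshold: p < 0.05).
    sig_prop = float(np.mean(p < 0.05))
    if sig_prop > 0.30:
        score += 1.0
    elif sig_prop < 0.001 and p.size >= 10:
        score += 1.5

    # Mean p-value far from one half.
    if abs(float(p.mean()) - 0.5) > 0.20:
        score += 0.5

    # Assumption: the one-point proxy penalty lowers the score.
    if used_proxy:
        score -= 1.0

    return float(min(max(score, 0.0), 5.0))  # cap applied after the penalty
```

For example, ten p-values spread between 0.90 and 0.99 with both uniformity tests rejecting and a Stouffer Z above 3 would trigger the uniformity, Stouffer, low-significant-proportion and mean components at once, and the sum is then capped at 5.0.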

Why this matters

Baseline-comparison behaviour is one of the most established forensic signals for trial integrity: Carlisle's distribution testing helped expose a large body of fabricated trials, and applying it to the raw IPD cannot be evaded by selective reporting of baseline p-values. Stouffer's method pools the comparisons into a single standard-normal Z that sensitively detects systematic directional shift, catching both an excess of significant differences (non-randomised or differently drawn arms) and a suspicious absence of them (forced balance).

Score thresholds

0-1: Baseline comparisons scatter as randomisation would produce
2-3: The baseline p-values depart from uniform in one or more ways
4-5: Strong non-uniformity (an excess of significant differences or of forced non-significance), consistent with non-randomised or fabricated arms

Limitations

The test depends on correctly identifying the arms, and the row-index proxy used when no group column is found can pair unrelated halves, which is why the proxy penalty applies. With few baseline columns the uniformity tests have low power. Stratified or matched designs legitimately reduce baseline scatter, and genuine baseline imbalance can trip the significant-proportion check. The Welch t-test compares means only, so fabrication that matches means while distorting variances is not directly caught. The keyword match can select the wrong column if a baseline variable has a treatment-like name. Reported baseline p-values in the paper text are handled by the table-level baseline indicators; D31 focuses on the comparisons reconstructed from the IPD.
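The means-only limitation can be seen in a small constructed example. The two synthetic arms below share the same mean but differ threefold in spread, so the Welch t-test sees nothing while a variance test does. Levene's test is used here purely for illustration; it is not described as part of D31, and the data are invented for the demonstration.

```python
import numpy as np
from scipy import stats

# Two synthetic arms with identical means (70) but very different spread.
arm_a = 70.0 + np.linspace(-1.0, 1.0, 50)
arm_b = 70.0 + np.linspace(-3.0, 3.0, 50)

# Welch's t-test compares means only, so it reports no difference.
welch_p = stats.ttest_ind(arm_a, arm_b, equal_var=False).pvalue

# Levene's test (illustrative, not part of D31) compares spread instead.
levene_p = stats.levene(arm_a, arm_b).pvalue

print(f"Welch p = {welch_p:.3f}, Levene p = {levene_p:.2e}")
```

Here the Welch p-value sits near 1 while the Levene p-value is very small, which is the pattern a means-only baseline check would miss.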