Temporal Anomalies
Detects impossible or implausible temporal patterns in longitudinal individual-patient data (IPD): enrolment spikes, perfectly regular visit intervals, or date sequences that cannot occur.
Technical description
Genuine clinical data spreads dates across weekdays and months with natural irregularity, while fabricated dates often land mostly on weekends, bunch into one week, fall on a single day, sit at a perfectly even interval, or lie in the future or before 1900. D33 identifies candidate date columns of the individual-patient data (IPD) by name keyword, parses each into timestamps (skipping numeric columns, whose values would be misread as nanosecond epochs), and requires at least ten parsed dates. It then computes the weekend fraction, the uniformity of the day-of-week distribution by a chi-square goodness-of-fit test, week-window clustering, presence of impossible dates, perfectly uniform spacing, and single-day concentration.
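The column-selection step described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the indicator's actual implementation: the function names, the case-insensitive substring match, the list of accepted formats in `FORMATS`, and the rule that a column counts as numeric when every non-missing value is a number are all assumptions made for the example.

```python
from datetime import datetime

# Keywords listed in the description; case-insensitive substring matching is assumed.
DATE_KEYWORDS = ("date", "time", "visit", "enrolled", "admission",
                 "discharge", "dob", "birth")
MIN_PARSED_DATES = 10
# Hypothetical set of accepted text formats, chosen only for this sketch.
FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y-%m-%d %H:%M:%S")


def is_date_candidate(column_name):
    """True when the column name contains one of the date keywords."""
    name = column_name.lower()
    return any(keyword in name for keyword in DATE_KEYWORDS)


def parse_column(values):
    """Return the parsed datetimes, or [] if the column is numeric or too sparse."""
    non_missing = [v for v in values if v is not None]
    # Skip numeric columns: a generic parser would read them as epoch offsets.
    if non_missing and all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in non_missing
    ):
        return []
    parsed = []
    for v in non_missing:
        if isinstance(v, datetime):
            parsed.append(v)
            continue
        if not isinstance(v, str):
            continue
        for fmt in FORMATS:
            try:
                parsed.append(datetime.strptime(v.strip(), fmt))
                break
            except ValueError:
                continue
    # Require at least ten parsed dates before any check is run.
    return parsed if len(parsed) >= MIN_PARSED_DATES else []
```

Note that a keyword such as time also matches unrelated names (for example a column called lifetime), which is one reason the subsequent parsing step matters.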
How it works
Layer 1 (deterministic): a column is a date candidate when its name contains a keyword such as date, time, visit, enrolled, admission, discharge, dob, or birth. Values are parsed with the datetime parser only when the column is textual or already datetime; a numeric column is skipped because the parser reads its numbers as nanoseconds since 1970. A column needs at least ten parsed dates. Each check then adds to the score, which is capped at 5.0:
- Weekend rate: adds 2.5 when above one half, or 1.5 when above thirty percent (mutually exclusive).
- Day-of-week uniformity: with at least twenty dates, a chi-square goodness-of-fit test of the seven day-of-week counts against a uniform expectation adds 1.5 if the distribution is indistinguishable from uniform (p above 0.10) while the weekend share is between twenty and thirty percent, so the weekend check has not fired.
- Clustering: more than half the dates falling within a seven-day window adds 2.0.
- Impossible dates: a future date and a date before 1900 each add 1.0.
- Single-day concentration: all dates on one calendar day adds 3.0.
- Uniform spacing: otherwise, perfectly uniform spacing (all consecutive intervals equal within one day) adds 1.5.
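The scoring rules above can be sketched as a single function. This is an illustrative reading of the description, not the indicator's source code. The names `temporal_score` and `chi2_sf_df6` are hypothetical, and three details are assumptions: a seven-day window is taken to mean a span of at most seven days between the first and last date, "equal within one day" is taken to mean that all gaps are the same number of calendar days, and the twenty-to-thirty-percent band is treated as inclusive. The chi-square p-value uses the exact closed form of the upper tail for 6 degrees of freedom, so no statistics library is needed.

```python
import math
from collections import Counter
from datetime import datetime, timedelta


def chi2_sf_df6(x):
    """Upper-tail probability of a chi-square variable with 6 degrees of freedom."""
    h = x / 2.0
    return math.exp(-h) * (1.0 + h + h * h / 2.0)


def temporal_score(dates, now=None):
    """Score one parsed date column (list of datetimes) on the 0-5 scale."""
    now = now or datetime.now()
    n = len(dates)
    if n < 10:  # too few parsed dates: no checks are run
        return 0.0
    ordered = sorted(dates)
    score = 0.0

    # Weekend rate (Saturday = 5, Sunday = 6); the two tiers are mutually exclusive.
    weekend = sum(d.weekday() >= 5 for d in dates) / n
    if weekend > 0.5:
        score += 2.5
    elif weekend > 0.3:
        score += 1.5

    # Day-of-week uniformity, only when the weekend check has not fired.
    if n >= 20 and 0.2 <= weekend <= 0.3:
        counts = Counter(d.weekday() for d in dates)
        expected = n / 7.0
        stat = sum((counts.get(k, 0) - expected) ** 2 / expected for k in range(7))
        if chi2_sf_df6(stat) > 0.10:
            score += 1.5

    # Clustering: largest number of dates inside any seven-day span (assumed inclusive).
    window = timedelta(days=7)
    best, j = 0, 0
    for i in range(n):
        while ordered[i] - ordered[j] > window:
            j += 1
        best = max(best, i - j + 1)
    if best > n / 2:
        score += 2.0

    # Impossible dates: future and pre-1900 are scored separately.
    if any(d > now for d in dates):
        score += 1.0
    if any(d.year < 1900 for d in dates):
        score += 1.0

    # Single-day concentration, otherwise perfectly uniform spacing.
    if len({d.date() for d in dates}) == 1:
        score += 3.0
    else:
        gaps = {(b.date() - a.date()).days for a, b in zip(ordered, ordered[1:])}
        if len(gaps) == 1:  # assumption: every gap is the same number of days
            score += 1.5

    return min(score, 5.0)
```

Under these assumptions, twelve visits exactly one week apart on Mondays score 1.5 (uniform spacing only), while twelve records stamped on the same Saturday reach the 5.0 cap through the weekend, clustering, and single-day checks together.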
Why this matters
Dates are metadata that fabrication frequently neglects. Structural impossibilities in individual-patient records are among the features that expose fabricated datasets, and central statistical monitoring of trial data quality relies on implausible patterns invisible to a casual reader, including the timing and clustering of records. Genuine trials leave incidental temporal structure (the weekday rhythm of clinic visits, the gradual accrual of enrolment, the natural jitter of appointment intervals) that a fabricator typing or generating a table rarely reproduces.
Score thresholds
- 0-1: Dates spread across weekdays and time as real scheduling produces
- 2-3: One strong anomaly, such as heavy weekend scheduling or week-long clustering
- 4-5: Several anomalies, or all dates on one day, consistent with programmatically generated dates
Limitations
The checks are heuristic, and a flag prompts inspection rather than proving fabrication. Some designs legitimately violate the assumptions: weekend or single-session recruitment, fixed-interval intensive protocols, and batch-entered registries can each raise a signal without fraud. Only columns whose names carry a date keyword are examined, so unlabelled date columns are missed, and numeric date encodings such as year integers or spreadsheet serials are deliberately skipped to avoid misreading measurements as dates. Parsing depends on recognisable formats, and ambiguous day-month orders can be misread. The clustering and spacing checks need enough dates to be meaningful. Implausible demographic dates relative to age are covered by indicator D03; D33 focuses on the standalone temporal structure of the IPD date columns.
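The day-month ambiguity mentioned above is easy to reproduce. The short example below (the date string is made up for illustration) parses one value under both conventions with the standard library and shows that the two readings fall on different days of the week, so a misread format can change the weekend fraction.

```python
from datetime import datetime

raw = "03/04/2021"  # illustrative value; could be 3 April or 4 March

day_first = datetime.strptime(raw, "%d/%m/%Y")    # read as 3 April 2021
month_first = datetime.strptime(raw, "%m/%d/%Y")  # read as 4 March 2021

# weekday(): Monday = 0 ... Sunday = 6, so 5 and 6 are the weekend.
print(day_first.date(), "weekend" if day_first.weekday() >= 5 else "weekday")
print(month_first.date(), "weekend" if month_first.weekday() >= 5 else "weekday")
```

Read day-first, the value is a Saturday; read month-first, it is a Thursday. Checking the source's stated date convention before interpreting a weekend flag avoids this kind of false signal.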
References
- Carlisle JB. (2021). False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia
- George SL, Buyse M. (2015). Data fraud in clinical trials. Clinical Investigation
- Buyse M, George SL, Evans S, et al. (1999). The role of biostatistics in the prevention, detection and treatment of fraud in clinical trials. Statistics in Medicine
- Carlisle JB. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia
- Bordewijk EM, Wang R, Askie LM, et al. (2020). Data integrity of 35 randomised controlled trials in women's health. European Journal of Obstetrics & Gynecology and Reproductive Biology
- Bordewijk EM, Li W, van Eekelen R, et al. (2021). Methods to assess research misconduct in health-related research: a scoping review. Journal of Clinical Epidemiology
- Parker L, Boughton S, Lawrence R, Bero L. (2022). Experts identified warning signs of fraudulent research: a qualitative study to inform a screening tool. Journal of Clinical Epidemiology
- Grey A, Bolland MJ, Avenell A, Klein AA, Gunsalus CK. (2020). Check for publication integrity before misconduct. Nature