ResAIKit
Research Integrity Toolkit
G2-img | Image forensics | Chart Analysis | Layer 1 (Deterministic)

Typographic Coherence

Screens chart and figure images for typographic and geometric incoherence that a chart drawn by real plotting software would not show: mixed or pasted-in text, barely legible labels typical of generated figures, axes that are not perpendicular, unevenly spaced tick marks, and Cyrillic look-alike characters smuggled into Latin labels. It works from optical character recognition (OCR) and pixel geometry alone, with no model.

Technical description

G2 is a deterministic, generator-agnostic screen for charts whose text and axis geometry do not match the clean, regular output of genuine plotting software. A chart produced by a plotting library renders every label in one font at one quality, places its axes at a right angle, spaces its tick marks evenly, and keeps each label in a single script. A chart that was hand-edited, assembled from several sources, or synthesised by a text-to-image generator departs from one or more of those regularities in a measurable way. The indicator runs OCR over the figure to obtain text regions with per-region confidence, detects the axis lines and tick positions, and sums five signals (font consistency, garbled-text legibility, axis perpendicularity, tick equidistance, and homoglyph contamination) into a score capped to the range 0 to 5. It requires the image to be at least 32 by 32 pixels.

OCR confidence here is the Tesseract per-word confidence on a 0 to 100 scale; the shared extractor keeps only regions scoring above 30, so the regions G2 reasons over already sit in the 30 to 100 band, and clean vector-rendered labels typically score in the high 80s or 90s.

How it works

The indicator runs deterministically at Layer 1 on the OCR output of extract_text_regions and the axis geometry of detect_axes. Each OCR text region carries a Tesseract confidence c on a 0 to 100 scale, and because the shared extractor keeps only regions with c > 30, every value the signals reason over lies in the 30 to 100 band.

The font-consistency signal measures the dispersion of legibility across labels. When at least two text regions are present with confidences c_1, ..., c_k and mean c̄ = (1/k) Σ c_i, the variance is V = (1/k) Σ (c_i − c̄)². Mixed fonts, or text pasted from another source, leave some labels crisp and others not, and so raise V. When V exceeds the threshold of 400 (equivalently, a confidence standard deviation above sqrt(400) = 20), the signal contributes min(2.0, V/400) at warning severity.
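The variance test can be sketched in a few lines. This is a minimal illustration, not the toolkit's own code: the function name font_consistency_score and the constant names are hypothetical, and the input is assumed to be a plain list of per-region confidences that have already passed the c > 30 filter.

```python
from statistics import fmean

# Values taken from the description above; the names are illustrative.
VARIANCE_THRESHOLD = 400.0
FONT_SIGNAL_CAP = 2.0


def font_consistency_score(confidences):
    """Return (contribution, variance) for a list of OCR confidences."""
    k = len(confidences)
    if k < 2:
        # The signal is defined only when at least two regions are present.
        return 0.0, 0.0
    mean = fmean(confidences)
    # Population variance, V = (1/k) * sum((c_i - mean)^2).
    variance = sum((c - mean) ** 2 for c in confidences) / k
    if variance <= VARIANCE_THRESHOLD:
        return 0.0, variance
    return min(FONT_SIGNAL_CAP, variance / VARIANCE_THRESHOLD), variance


# Uniformly crisp labels stay silent; two crisp plus two weak labels fire.
clean_score, clean_var = font_consistency_score([95, 92, 96, 94])
spliced_score, spliced_var = font_consistency_score([95, 95, 40, 40])
```

Because the signal fires only when V > 400, any non-zero contribution is already above 1.0, and it saturates at 2.0 once V reaches 800.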

The garbled-text signal measures uniformly low legibility rather than dispersion. With at least three regions, let f = #{i : c_i < 55} / k be the fraction whose absolute confidence falls below the legibility floor of 55. A high f is the signature of diffusion-rendered labels that OCR can barely read, so when f > 0.5 the signal contributes min(1.5, f / 0.5) at warning severity. This is orthogonal to the variance signal: the variance fires on heterogeneous (spliced) text, the fraction on uniformly garbled text.
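A minimal sketch of the garbled-fraction test follows, under the same assumption that the input is a plain list of per-region confidences. The function and constant names are hypothetical.

```python
# Values taken from the description above; the names are illustrative.
LEGIBILITY_FLOOR = 55.0
GARBLED_FRACTION_THRESHOLD = 0.5
GARBLED_SIGNAL_CAP = 1.5


def garbled_text_score(confidences):
    """Return (contribution, fraction_below_floor) for OCR confidences."""
    k = len(confidences)
    if k < 3:
        # The signal is defined only when at least three regions are present.
        return 0.0, 0.0
    low_count = sum(1 for c in confidences if c < LEGIBILITY_FLOOR)
    fraction = low_count / k
    if fraction <= GARBLED_FRACTION_THRESHOLD:
        return 0.0, fraction
    return min(GARBLED_SIGNAL_CAP, fraction / GARBLED_FRACTION_THRESHOLD), fraction


# Three of four labels fall below the floor, so f = 0.75 and the cap is reached.
mostly_garbled = garbled_text_score([40, 45, 50, 90])
# One of three labels falls below the floor, so the signal stays silent.
mostly_clean = garbled_text_score([90, 92, 40])
```

With these constants the contribution rises from just above 1.0 at f slightly over 0.5 to the 1.5 cap at f = 0.75.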

The axis-perpendicularity signal compares the orientations of the two detected axes. For an axis line with endpoints (x1, y1) and (x2, y2) the orientation is θ = atan2(y2 − y1, x2 − x1) in degrees, and with θ_x and θ_y the orientations of the x and y axes the deviation from a right angle is Δ = | |θ_x − θ_y| − 90 |. Genuine plotting software places axes at exactly 90 degrees, so when Δ > 3 degrees the signal contributes min(1.5, (Δ/3) · 0.5) at warning severity.
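The angle test can be illustrated as follows. This sketch applies the formula exactly as written above; the function names are hypothetical, each axis is assumed to be given as an (x1, y1, x2, y2) tuple in image coordinates, and both axes are assumed to be listed from the shared origin outward, since atan2 is sensitive to endpoint order.

```python
import math

# Values taken from the description above; the names are illustrative.
DEVIATION_THRESHOLD_DEG = 3.0
AXIS_SIGNAL_CAP = 1.5


def orientation_deg(x1, y1, x2, y2):
    """Orientation of a line segment in degrees, via atan2."""
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def perpendicularity_score(x_axis, y_axis):
    """Return (contribution, deviation_deg) for two axis segments."""
    theta_x = orientation_deg(*x_axis)
    theta_y = orientation_deg(*y_axis)
    # Deviation from a right angle: | |theta_x - theta_y| - 90 |.
    deviation = abs(abs(theta_x - theta_y) - 90.0)
    if deviation <= DEVIATION_THRESHOLD_DEG:
        return 0.0, deviation
    return min(AXIS_SIGNAL_CAP, (deviation / DEVIATION_THRESHOLD_DEG) * 0.5), deviation


# A horizontal x axis with a vertical y axis, then with a sheared y axis.
x_axis = (50, 400, 450, 400)
square = perpendicularity_score(x_axis, (50, 400, 50, 50))
sheared = perpendicularity_score(x_axis, (50, 400, 80, 50))
```

In the sheared case the y axis leans by roughly 4.9 degrees, which gives a contribution of about 0.82; the cap of 1.5 is reached at a deviation of 9 degrees.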

The tick-equidistance signal tests the spacing regularity of an axis with at least three detected ticks at coordinates p_1, ..., p_m. The inter-tick gaps are g_i = p_{i+1} − p_i, and their coefficient of variation is CV = σ(g) / |μ(g)|, where μ(g) and σ(g) are the mean and standard deviation of the gaps. Evenly spaced ticks give CV near 0, so when CV > 0.15 the axis contributes min(0.75, (CV/0.15) · 0.25), and the two axes together are capped at 1.5.
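The spacing test can be sketched like this. The function names are hypothetical, tick coordinates are assumed to arrive in axis order as a list of numbers, and the population standard deviation is an assumption, since the description does not say whether σ(g) is the population or the sample form.

```python
from statistics import fmean, pstdev

# Values taken from the description above; the names are illustrative.
CV_THRESHOLD = 0.15
PER_AXIS_CAP = 0.75
BOTH_AXES_CAP = 1.5


def tick_cv(positions):
    """Coefficient of variation of inter-tick gaps, or None if undefined."""
    if len(positions) < 3:
        return None  # fewer than three ticks: the axis is not scored
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    mean_gap = fmean(gaps)
    if mean_gap == 0:
        return None  # degenerate input; skipping it is an assumption
    # Population standard deviation (an assumption, see the lead-in).
    return pstdev(gaps) / abs(mean_gap)


def axis_tick_score(positions):
    """Contribution of a single axis."""
    cv = tick_cv(positions)
    if cv is None or cv <= CV_THRESHOLD:
        return 0.0
    return min(PER_AXIS_CAP, (cv / CV_THRESHOLD) * 0.25)


def tick_equidistance_score(x_ticks, y_ticks):
    """Combined contribution of both axes, capped at 1.5."""
    return min(BOTH_AXES_CAP, axis_tick_score(x_ticks) + axis_tick_score(y_ticks))


even = axis_tick_score([0, 50, 100, 150])    # gaps 50, 50, 50
uneven = axis_tick_score([0, 50, 120, 150])  # gaps 50, 70, 30
combined = tick_equidistance_score([0, 50, 120, 150], [0, 10, 100, 110])
```

For the uneven axis the gaps have mean 50 and standard deviation of about 16.3, so CV is about 0.33 and the axis contributes about 0.54. A single axis saturates at 0.75 once CV reaches 0.45.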

The homoglyph signal scans each region's string for the simultaneous presence of a Latin letter and a Cyrillic character drawn from a fixed table of eleven Cyrillic-to-Latin look-alikes (for example Cyrillic А, В, С, Е, О, Р mapping to Latin A, B, C, E, O, P). Each mixed-script region is reported at error severity. With H the total count of such Cyrillic characters across all mixed-script regions, the signal contributes min(2.0, 0.5 · H).
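A sketch of the mixed-script scan follows. It is deliberately partial: the table below holds only the six pairs named in the text, not the full eleven-entry table, and treating "Latin letter" as an ASCII letter is an assumption. The function and constant names are hypothetical.

```python
# Partial table: only the six look-alikes named in the description.
# The indicator's real table has eleven entries, not reproduced here.
CYRILLIC_LOOKALIKES = {
    "\u0410": "A",  # Cyrillic capital A
    "\u0412": "B",  # Cyrillic capital Ve
    "\u0421": "C",  # Cyrillic capital Es
    "\u0415": "E",  # Cyrillic capital Ie
    "\u041e": "O",  # Cyrillic capital O
    "\u0420": "P",  # Cyrillic capital Er
}
HOMOGLYPH_SIGNAL_CAP = 2.0


def is_latin_letter(ch):
    """Assumption: a Latin letter means an ASCII alphabetic character."""
    return ch.isascii() and ch.isalpha()


def homoglyph_score(region_texts):
    """Return (contribution, total_count, flagged_regions)."""
    total = 0
    flagged = []
    for text in region_texts:
        count = sum(1 for ch in text if ch in CYRILLIC_LOOKALIKES)
        has_latin = any(is_latin_letter(ch) for ch in text)
        # Only regions mixing Latin letters with look-alikes are counted.
        if count > 0 and has_latin:
            total += count
            flagged.append(text)
    return min(HOMOGLYPH_SIGNAL_CAP, 0.5 * total), total, flagged


# "RATE" with a Cyrillic A in second position, next to a clean label.
result = homoglyph_score(["R\u0410TE", "Count"])
```

A purely Cyrillic label is not flagged by this rule, because the test requires both scripts in the same region. Four look-alike characters are enough to reach the 2.0 cap.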

The five contributions are summed and the score is reported as min(5.0, total). The metadata records the number of text regions, the confidence variance V, the garbled fraction f and count, the axis deviation Δ, the per-axis tick coefficients of variation, and the homoglyph count H.
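The aggregation step is a capped sum. The sketch below assumes the five contributions have already been computed and are passed in as a dictionary; the key names and the function name are hypothetical.

```python
SCORE_CAP = 5.0


def total_score(contributions):
    """Sum per-signal contributions and cap the result at 5.0."""
    return min(SCORE_CAP, sum(contributions.values()))


# Every signal at its own maximum sums to 8.5, which the cap reduces to 5.0.
saturated = total_score({
    "font_consistency": 2.0,
    "garbled_text": 1.5,
    "axis_perpendicularity": 1.5,
    "tick_equidistance": 1.5,
    "homoglyph": 2.0,
})

# A single mild axis deviation on an otherwise clean chart.
mild = total_score({
    "font_consistency": 0.0,
    "garbled_text": 0.0,
    "axis_perpendicularity": 0.8,
    "tick_equidistance": 0.0,
    "homoglyph": 0.0,
})
```

Since the individual caps add up to 8.5, the overall cap is reachable only when several signals fire together, for example the font, garbled-text, and axis signals all at their maxima.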

Score thresholds

Score     Meaning
0 to 1    Uniform, legible labels in one script, perpendicular axes, evenly spaced ticks. Consistent with a chart drawn once by genuine plotting software.
2 to 3    One signal present: mixed or pasted text, broadly low legibility, a slightly sheared axis, or uneven ticks.
4 to 5    Several signals together, or a homoglyph contamination. Consistent with a hand-edited, multiply-sourced, or generator-rendered figure.

Why this matters

Generated and edited chart typography is a documented, measurable failure mode. Diffusion text-to-image generators render image-embedded text that is frequently jumbled, misspelled, or visually fragmented, and the stress-test benchmark STRICT traces the cause to a locality bias that prevents the model from maintaining glyph consistency over the span of a label [1]. The underlying mechanism has been characterised directly: text hallucination arises from local generation bias, where individually plausible strokes are assembled into nonsensical words [2]. The problem is severe enough across state-of-the-art systems that an empirical evaluation measuring generation quality through OCR found text rendering to be a persistent weak point [4], and that dedicated post-processing exists purely to retouch the typos these models leave behind [3]. Reading OCR confidence across a figure therefore turns a generator weakness into a detector: uniformly low legibility is what generated labels look like to a recogniser, and heterogeneous legibility is what spliced labels look like, a cue that OCR-graph manipulation detection exploits to localise document tampering [7].

The geometric signals rest on equally firm ground. Chart elements such as axes, tick marks, and labels are well-defined objects that detection systems are built to localise [5], which means their structural regularities, a right angle between axes and even spacing between ticks, are exactly the regularities a forged or carelessly rendered chart breaks, and recent work on detecting misleading and manipulated visualisations treats such anomalies as integrity signals [6].

Homoglyph contamination closes the set: substituting Cyrillic look-alikes for Latin letters is a Unicode confusable technique with its own deep-learning detection literature [8], and in a chart label it is a deliberate, high-confidence tell.

As AI-generated figures move into academic publishing and prompt explicit editorial policy and tooling [9], a deterministic typographic screen gives a fast, model-free first pass.

Limitations

G2 depends on OCR and on axis detection, so the signals it can raise depend on what those stages recover: a figure with no readable text yields no font, garbled, or homoglyph signal and is judged on axis geometry alone, and a figure without clearly drawn axes or with fewer than three ticks per axis is judged on its text alone. The thresholds were calibrated against typical chart imagery and are directional rather than exact; small or stylised fonts can lower OCR confidence without manipulation, and a legitimately rotated or broken-axis design can raise the geometry signals. The homoglyph table covers the eleven most common Cyrillic-to-Latin confusables and the Latin-versus-Cyrillic case; Greek and other confusable scripts are out of its current scope. Body-text Unicode anomalies and homoglyphs in the document text are handled in the text-module E-series, while G2 reads homoglyphs in the rendered chart through OCR, a different surface. Scale and axis-semantics incoherence, such as truncated or non-linear scales and axis labels inconsistent with the plotted data, is the domain of the sibling chart indicator, so G2 stays on local typographic and geometric regularity to avoid duplicating it; raster-quality degradation stays in the resolution indicator.

Theoretical background

G2 combines three lines of evidence. The first is the rendering weakness of generative text: diffusion models exhibit a locality bias that breaks long-range glyph coherence, so generated labels are often garbled and barely legible to OCR, a property quantified by text-rendering benchmarks and by OCR-based evaluations of generator quality. G2 reads this through two complementary statistics of OCR confidence, its dispersion across labels (the splicing and mixed-font case) and the share of labels below a legibility floor (the uniformly generated case). The second is chart structure: a chart is a constrained graphical object whose axes meet at a right angle and whose ticks are evenly spaced, regularities that element-detection research takes as given and that misleading-visualisation detection treats as testable; G2 measures the angle between detected axes and the coefficient of variation of inter-tick gaps as direct tests of those regularities. The third is the Unicode confusable channel: characters from different scripts can be visually identical, and substituting them is a recognised spoofing and steganography technique; G2 flags any chart label that mixes Latin with a Cyrillic look-alike. Each signal is a physical or structural trace of how the figure was produced, not a learned fingerprint of a particular generator, which keeps the screen robust as generators change.

References

  1. Zhang T, Wang X, Tai Z, Li L, Chi J, Tian J, He H, Wang S. STRICT: Stress Test of Rendering Images Containing Text. arXiv preprint arXiv:2505.18985. 2025. https://arxiv.org/abs/2505.18985
  2. Lu R, Wang R, Lyu K, Jiang X, Huang G, Wang M. Towards Understanding Text Hallucination of Diffusion Models via Local Generation Bias. arXiv preprint arXiv:2503.03595. 2025. https://arxiv.org/abs/2503.03595
  3. Shimoda W, Inoue N, Haraguchi D, Mitani H, Uchida S, Yamaguchi K. Type-R: Automatically Retouching Typos for Text-to-Image Generation. arXiv preprint arXiv:2411.18159. 2024. https://arxiv.org/abs/2411.18159
  4. Zhang P, Xu H, Zhang J, Xu G, Zheng X, Yang Z, Liu J, Zhang Y. Aesthetics is Cheap, Show me the Text: An Empirical Evaluation of State-of-the-Art Generative Models for OCR. arXiv preprint arXiv:2507.15085. 2025. https://arxiv.org/abs/2507.15085
  5. Yan P, Ahmed S, Doermann D. Context-Aware Chart Element Detection. In: International Conference on Document Analysis and Recognition (ICDAR). 2023. arXiv:2305.04151. https://arxiv.org/abs/2305.04151
  6. Tonglet J, Zimny J, Tuytelaars T, Gurevych I. Is this chart lying to me? Automating the detection of misleading visualizations. In: Proceedings of the Association for Computational Linguistics (ACL). 2026. arXiv:2508.21675. https://arxiv.org/abs/2508.21675
  7. Joren H, Gupta O, Raviv D. OCR Graph Features for Manipulation Detection in Documents. arXiv preprint arXiv:2009.05158. 2020. https://arxiv.org/abs/2009.05158
  8. Deng P, Linsky C, Wright M. Weaponizing Unicodes with Deep Learning: Identifying Homoglyphs with Weakly Labeled Data. In: IEEE International Conference on Intelligence and Security Informatics (ISI). 2020. arXiv:2010.04382. https://arxiv.org/abs/2010.04382
  9. Chen D. AI-Generated Figures in Academic Publishing: Policies, Tools, and Practical Guidelines. arXiv preprint arXiv:2603.16159. 2026. https://arxiv.org/abs/2603.16159