ResAIKit
Research Integrity Toolkit
M3 · Image forensics · Microscopy · Layer 1 (Deterministic)

Inpainting Detection

Detects regions that have been filled, erased, or painted over. Inpainting synthesises a patch by smooth interpolation or by copying nearby texture, and either way it suppresses the fine sensor noise that a genuine capture carries everywhere. The indicator estimates the local noise floor robustly from the diagonal wavelet coefficients and flags blocks where that floor collapses far below the rest of the image, and separately flags blocks whose intensity range is unnaturally narrow. A locally suppressed noise floor and a flat local histogram are the fill-and-erase signatures it scores. It works on the pixels alone, with no model.

Technical description

M3 is a deterministic, generator-agnostic screen for localized fill and erasure. A genuine image carries a roughly uniform high-frequency noise floor from the sensor, present in every region regardless of content. Inpainting breaks that uniformity locally: diffusion-based fill solves a smoothing equation that produces a region with almost no high-frequency energy, and exemplar-based fill copies texture that lacks the independent noise of a fresh capture, so a retouched block reads as a hole in the noise floor. M3 decomposes the grayscale image with a single-level Haar wavelet transform, estimates the noise standard deviation on small blocks of the diagonal detail band with the robust median absolute deviation, and looks for blocks whose noise floor is suppressed far below the image norm. In parallel it measures the local intensity range and flags blocks whose range is unnaturally narrow, the histogram signature of a uniform fill. The image must be at least 64 by 64 pixels, and the wavelet grid must contain at least four blocks, or the indicator returns a zero score and records that it was skipped.

How it works

The image is transformed once with the Haar wavelet, giving the diagonal detail band cD, whose coefficients are dominated by noise because the diagonal high-pass cancels smooth content. The band is tiled into blocks, and the noise level of each block is estimated with the median absolute deviation:

sigma = median(|cD - median(cD)|) / 0.6745,

where 0.6745 is the factor that relates the median absolute deviation to the standard deviation of a normal distribution. Using the median rather than the variance makes the estimate robust: a few strong coefficients from an edge do not inflate it, so it tracks the noise floor rather than the texture. Let the per-block estimates be sigma_1 to sigma_n, with mean sigma_bar and standard deviation s; the coefficient of variation CV = s / sigma_bar summarizes how uniform the noise floor is.
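The estimate above can be sketched in plain Python. This is a minimal illustration, not the toolkit's implementation: the function names, the 8-coefficient tile size in the wavelet band, the dropping of partial edge tiles, and the use of the population standard deviation for s are all assumptions made for the example. The Haar step is written out by hand to keep the sketch dependency-free; sign and normalization conventions for the diagonal band differ between wavelet libraries.

```python
import random
import statistics


def haar_diagonal(gray):
    """Diagonal detail band cD of a single-level 2D Haar transform.

    gray is a list of equal-length rows; a trailing odd row or column is dropped.
    """
    rows = len(gray) - len(gray) % 2
    cols = len(gray[0]) - len(gray[0]) % 2
    return [
        [(gray[y][x] - gray[y][x + 1] - gray[y + 1][x] + gray[y + 1][x + 1]) / 2.0
         for x in range(0, cols, 2)]
        for y in range(0, rows, 2)
    ]


def mad_sigma(values):
    """sigma = median(|v - median(v)|) / 0.6745."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values]) / 0.6745


def block_sigmas(band, block=8):
    """Per-tile noise estimates; block=8 is an assumed tile size, edge remainders dropped."""
    out = []
    for y in range(0, len(band) - block + 1, block):
        for x in range(0, len(band[0]) - block + 1, block):
            tile = [band[yy][xx]
                    for yy in range(y, y + block)
                    for xx in range(x, x + block)]
            out.append(mad_sigma(tile))
    return out


def coefficient_of_variation(sigmas):
    """CV = s / sigma_bar, using the population standard deviation (an assumption)."""
    mean = statistics.fmean(sigmas)
    return statistics.pstdev(sigmas) / mean if mean > 0 else 0.0


# Synthetic check: a noisy 64 x 64 frame with one 16 x 16 region painted flat.
rng = random.Random(0)
image = [[128 + rng.gauss(0, 5) for _ in range(64)] for _ in range(64)]
for y in range(16, 32):
    for x in range(16, 32):
        image[y][x] = 128.0

sigmas = block_sigmas(haar_diagonal(image))
# The flat region maps onto exactly one tile (index 5), whose estimate is 0.
```

In this synthetic frame the painted tile has a noise estimate of exactly zero, because every diagonal coefficient of a constant patch vanishes, while the untouched tiles keep an estimate near the simulated noise level.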

Three signals are read per block, two from the wavelet estimates and one from the image itself. A block is a wavelet outlier when its noise floor departs from the mean by more than one standard deviation in either direction, |sigma_i - sigma_bar| > s. A block is critical when its noise floor nearly vanishes relative to the image, sigma_i < 0.05 sigma_bar, evaluated only when the mean noise floor sigma_bar exceeds 2.0 so that a genuinely clean image is not carved into false fills; a critical block is the direct signature of a region that was filled with a uniform value or a smooth gradient. Separately, on the full-resolution image, each block's intensity range is measured as the 95th minus the 5th percentile, and a block is range-suspicious when its range falls below 0.15 of the median block range, evaluated only when the median range exceeds 20 so that a low-contrast image is not over-flagged.
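The three tests can be written down directly from the thresholds stated above. The sketch below is illustrative only: the function names are hypothetical, the standard deviation is taken as the population form, and the percentile uses linear interpolation, none of which the description specifies.

```python
import statistics


def wavelet_flags(sigmas, critical_frac=0.05, min_mean_sigma=2.0):
    """Return (outlier indices, critical indices) for per-block noise estimates."""
    mean = statistics.fmean(sigmas)
    s = statistics.pstdev(sigmas)  # population form, an assumption
    outliers = [i for i, v in enumerate(sigmas) if abs(v - mean) > s]
    critical = []
    if mean > min_mean_sigma:  # gate: skip the test on a genuinely clean image
        critical = [i for i, v in enumerate(sigmas) if v < critical_frac * mean]
    return outliers, critical


def percentile(sorted_values, q):
    """Percentile with linear interpolation, q in [0, 100] (assumed convention)."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def block_range(pixels):
    """Intensity range of one block: 95th minus 5th percentile."""
    ordered = sorted(pixels)
    return percentile(ordered, 95) - percentile(ordered, 5)


def range_flags(ranges, frac=0.15, min_median_range=20.0):
    """Indices of range-suspicious blocks, gated on the median block range."""
    med = statistics.median(ranges)
    if med <= min_median_range:  # gate: low-contrast image, test not evaluated
        return []
    return [i for i, r in enumerate(ranges) if r < frac * med]
```

Keeping the two gates inside the functions means a caller cannot accidentally apply the critical or range test to an image that is too clean or too flat for it to be meaningful.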

The score combines the signals. With total_blocks the number of wavelet blocks and ratio the fraction that are outliers, the wavelet score is min(4.0, 8 CV + 5 ratio), the range score is min(3.0, 20 times the fraction of range-suspicious blocks), and the critical bonus is min(2.0, 0.5 times the number of critical blocks). The final score is min(5.0, max(wavelet_score, range_score) + critical_bonus), so the noise and the histogram evidence are combined by taking the stronger, and a confirmed fill adds on top. Findings are emitted critical first, then wavelet outliers ordered by deviation, then range blocks, each carrying the block's bounding box, capped at eight. The metadata records the noise coefficient of variation, the outlier and range-suspicious block counts, the total block count, and the mean noise floor.
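The combination rule is small enough to state as a function. This is a sketch under assumptions: the function and argument names are hypothetical, and the range fraction is taken over a separate count of range blocks, since the description does not say whether the image grid and the wavelet grid contain the same number of blocks.

```python
def m3_score(cv, outlier_count, critical_count, range_count,
             total_blocks, total_range_blocks):
    """Combine the wavelet, range, and critical signals into a 0 to 5 score."""
    if total_blocks == 0 or total_range_blocks == 0:
        return 0.0
    ratio = outlier_count / total_blocks
    wavelet_score = min(4.0, 8 * cv + 5 * ratio)
    range_score = min(3.0, 20 * range_count / total_range_blocks)
    critical_bonus = min(2.0, 0.5 * critical_count)
    # The stronger of the two evidence streams, plus the bonus, capped at 5.
    return min(5.0, max(wavelet_score, range_score) + critical_bonus)


# Example: CV of 0.2, 2 outliers and 1 range-suspicious block out of 16, 1 critical block.
example = m3_score(0.2, 2, 1, 1, 16, 16)
```

For the example values the wavelet score is 8 × 0.2 + 5 × 0.125 = 2.225, the range score is 20 × 1/16 = 1.25, and the critical bonus is 0.5, so the final score is 2.225 + 0.5 = 2.725.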

Score thresholds

| Score | Meaning |
|---|---|
| 0 to 1 | The noise floor and local contrast are uniform, consistent with a single unedited capture. |
| 2 to 3 | Some blocks deviate in noise level or local range, a possible local edit or natural texture variation. |
| 4 to 5 | One or more blocks have a collapsed noise floor or a flat histogram. Consistent with a filled, erased, or inpainted region. |

Why this matters

Erasing an inconvenient feature or painting one in is among the most damaging forms of image manipulation because, unlike a duplication, it leaves no second copy to compare against. Journals warn explicitly that adjustments which obscure or fabricate content are misconduct even when they are locally seamless [4]. The forensic handle is that synthesis cannot reproduce the sensor noise floor: diffusion-based inpainting governed by a smoothing partial differential equation leaves a region whose high-frequency content is abnormally low, and localization methods built specifically for inpainting exploit exactly that suppressed local variance to map the filled area [1]. The same logic underlies blind noise forensics in general, where a region whose noise statistics differ from the host image is exposed as foreign [2]. M3 measures the noise floor with the robust median estimator on the finest wavelet band, sigma = median(|cD - median(cD)|) / 0.6745, following the estimator introduced by Donoho and Johnstone, whose resistance to edges and texture is what lets a block's true noise level be read [3]. Reading a local collapse of that floor, rather than the global spread, is what makes M3 a fill-and-erase detector rather than a generic noise-consistency test.

Limitations

A suppressed noise floor is characteristic of inpainting but not unique to it. Genuinely smooth content, such as a saturated highlight, a uniform background, or an out-of-focus region, carries little high-frequency energy and can read as a critical or range-suspicious block, which is why both signals are gated on the image having a meaningful noise floor and contrast. Strong denoising applied to the whole image flattens the floor everywhere and erases the contrast between filled and captured regions that the screen depends on. JPEG compression suppresses the noise and imposes its own block structure on it, which can both hide a real fill and mimic one along the compression grid. The 16-pixel block bounds the spatial resolution, so a fill smaller than a block is averaged with its surroundings. The thresholds are indicative rather than exact. M3 deliberately performs no error level analysis, which is the recompression-based screen of indicator I1, and it does not test tonal or contrast response continuity, which is indicator I9. It also differs from the global noise-consistency indicator I3 by looking for a one-sided local collapse of the noise floor rather than the overall spread, so the indicators can corroborate one another on a suspected edit.

Theoretical background

M3 rests on the difference between captured and synthesised pixels. A sensor adds an independent, roughly stationary noise component to every pixel, so the high-frequency residual of a genuine image has a noise floor that is present everywhere and varies little across the frame. Inpainting replaces a region with values computed from its neighbourhood: diffusion methods solve a Laplace or heat equation that is by construction smooth and therefore noise-free, while exemplar methods paste texture whose noise is correlated with its source rather than freshly and independently sampled. In both cases the synthesised region cannot carry the original noise floor, so its high-frequency energy collapses. The diagonal wavelet band isolates that high-frequency residual, and the median absolute deviation reads its level while rejecting the sparse large coefficients that real structure contributes, so a hole in the noise floor stands out against the surrounding capture. The intensity-range signal adds an independent, photometric view of the same event, because a uniform fill also flattens the local histogram. Reading a local suppression of the noise floor turns the physics of capture into a test that depends on whether a region was photographed or computed.
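A short worked derivation shows why the diagonal band is a clean probe of the noise floor. It assumes the orthonormal Haar normalization of 1/2 on each 2 by 2 cell (sign and ordering conventions vary between libraries), a locally linear intensity ramp with hypothetical coefficients p, q, and r, and independent pixel noise n of equal variance.

```latex
\begin{aligned}
&\text{Diagonal coefficient of one } 2 \times 2 \text{ cell:} \\
&\qquad d = \tfrac{1}{2}\bigl(I_{00} - I_{10} - I_{01} + I_{11}\bigr) \\[4pt]
&\text{Smooth ramp } I_{xy} = p + qx + ry,\quad x, y \in \{0, 1\}: \\
&\qquad d = \tfrac{1}{2}\bigl(p - (p + q) - (p + r) + (p + q + r)\bigr) = 0 \\[4pt]
&\text{Ramp plus independent noise } n_{xy},\ \operatorname{Var}(n_{xy}) = \sigma^2: \\
&\qquad d = \tfrac{1}{2}\bigl(n_{00} - n_{10} - n_{01} + n_{11}\bigr),
\qquad \operatorname{Var}(d) = \tfrac{1}{4}\cdot 4\sigma^2 = \sigma^2
\end{aligned}
```

Under these assumptions any constant or linearly varying content cancels exactly, while the noise passes through with its standard deviation unchanged. A region computed by smooth interpolation therefore contributes coefficients near zero, and a captured region contributes coefficients at the sensor noise level.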

References

  1. Li H, Luo W, Huang J. Localization of Diffusion-Based Inpainting in Digital Images. IEEE Transactions on Information Forensics and Security. 2017;12(12):3050-3064. DOI: 10.1109/TIFS.2017.2730822
  2. Mahdian B, Saic S. Using noise inconsistencies for blind image forensics. Image and Vision Computing. 2009;27(10):1497-1503. DOI: 10.1016/j.imavis.2009.02.001
  3. Donoho DL, Johnstone IM. Ideal spatial adaptation by wavelet shrinkage. Biometrika. 1994;81(3):425-455. DOI: 10.1093/biomet/81.3.425
  4. Rossner M, Yamada KM. What's in a picture? The temptation of image manipulation. The Journal of Cell Biology. 2004;166(1):11-15. DOI: 10.1083/jcb.200406019