How SeeThrough works
SeeThrough combines deep visual analysis with temporal frequency features in a multi-signal ensemble to detect AI-generated video content.
Overview
Modern AI video generators (Sora, Runway, Pika, Kling, etc.) produce increasingly realistic output, but they leave subtle traces. SeeThrough is designed to detect these traces by analyzing videos through multiple complementary lenses.
Rather than relying on a single detection method, SeeThrough fuses deep visual features extracted from a vision foundation model with temporal frequency analysis that examines how frames change over time. This ensemble approach makes the system more robust across different generators and content types.
The system produces a three-way classification — real, uncertain, or AI-generated — alongside attribution heatmaps that highlight which regions of each frame contributed to the decision.
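The three-way verdict can be sketched as a simple thresholding of the ensemble's probability score. The cutoff values below are illustrative assumptions, not SeeThrough's production calibration:

```python
def classify(prob_ai: float,
             lower: float = 0.35,
             upper: float = 0.65) -> str:
    """Map an ensemble AI-probability to a three-way verdict.

    `lower` and `upper` are placeholder thresholds; the real system
    would calibrate them on validation data.
    """
    if prob_ai >= upper:
        return "ai-generated"
    if prob_ai <= lower:
        return "real"
    return "uncertain"
```

Scores in the middle band map to "uncertain" rather than being forced into a binary call, which is what lets the system abstain on ambiguous content.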
Detection Pipeline
Frame Extraction
The video is split into uniformly sampled frames. Sampling density adapts to video length to balance speed with coverage.
Frames are preprocessed and normalized before being fed to the feature extraction stage.
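One way to realize length-adaptive uniform sampling is to target roughly one frame per second of video, clamped between a floor and a ceiling. The clamp values and the one-per-second target are assumptions for illustration:

```python
def sample_indices(total_frames: int, fps: float,
                   min_samples: int = 16, max_samples: int = 64) -> list[int]:
    """Return uniformly spaced frame indices, adapting count to length.

    Targets ~1 sampled frame per second, clamped to
    [min_samples, max_samples]; these bounds are illustrative.
    """
    duration_s = total_frames / fps
    n = max(min_samples, min(max_samples, round(duration_s)))
    n = min(n, total_frames)  # never request more frames than exist
    step = total_frames / n
    return [int(i * step) for i in range(n)]
```

A short clip thus still yields enough frames for the temporal branch, while a long video is capped so analysis time stays bounded.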
Deep Visual Feature Analysis
A vision foundation model extracts multi-layer semantic features from each frame, capturing both low-level textures and high-level scene understanding.
By analyzing features across multiple network depths, the system can detect subtle artifacts that single-layer approaches miss — from pixel-level noise patterns to semantic inconsistencies.
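A minimal sketch of multi-depth fusion, assuming the extractor exposes per-layer activation maps: globally average-pool each layer's map, then concatenate the pooled vectors into one frame descriptor. The layer choice and pooling scheme here are assumptions, not the actual extractor's design:

```python
import numpy as np

def multi_depth_descriptor(activations: list[np.ndarray]) -> np.ndarray:
    """Pool activation maps from several depths into one descriptor.

    Each entry is a (channels, height, width) feature map from a
    different network depth. Global average pooling per layer, then
    concatenation, is one simple fusion scheme.
    """
    pooled = [a.mean(axis=(1, 2)) for a in activations]  # (C,) per layer
    return np.concatenate(pooled)

# Toy activations standing in for a shallow and a deep layer.
shallow = np.random.rand(64, 56, 56)   # texture-level features
deep = np.random.rand(256, 14, 14)     # semantic-level features
descriptor = multi_depth_descriptor([shallow, deep])
```

The point of the concatenation is that the downstream classifier sees pixel-level and semantic evidence side by side, rather than only the final layer's summary.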
Temporal Frequency Analysis
A dedicated module examines how visual signals change over time, looking for unnatural temporal patterns that AI generators struggle to produce consistently.
This captures frequency-domain features, frame-to-frame consistency metrics, motion gradients, and compression block patterns — signals that are hard for generators to fake across many frames.
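As a simplified stand-in for this module, one can reduce each frame pair to a motion-energy scalar (mean absolute frame difference) and inspect that 1-D signal in the frequency domain, where unnaturally regular motion shows up as sharp spectral peaks:

```python
import numpy as np

def temporal_spectrum(frames: np.ndarray) -> np.ndarray:
    """Frequency-domain view of frame-to-frame change.

    `frames` is a (T, H, W) grayscale stack. The mean absolute
    difference between consecutive frames gives a motion-energy
    signal; the magnitude of its (mean-removed) FFT exposes periodic
    temporal patterns. A toy version of the module described above.
    """
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))  # (T-1,)
    return np.abs(np.fft.rfft(diffs - diffs.mean()))
```

The real module would add per-block compression statistics and motion gradients, but the core idea, turning temporal change into spectral features, is the same.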
Ensemble Fusion
Separate classifiers trained on visual and temporal features are combined through a weighted ensemble, producing a final probability score.
Combining multiple independent signals makes the system more robust than any single detector — each signal catches different types of generation artifacts.
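A weighted ensemble of two calibrated branch probabilities can be as simple as a convex combination. The weights below are placeholders; in practice they would be tuned on a held-out validation set:

```python
def fuse(visual_prob: float, temporal_prob: float,
         w_visual: float = 0.6, w_temporal: float = 0.4) -> float:
    """Weighted average of per-branch AI probabilities.

    Assumes both inputs are calibrated probabilities in [0, 1] and
    that the weights sum to 1, so the output is itself a probability.
    """
    return w_visual * visual_prob + w_temporal * temporal_prob
```

Because the branches fail in different ways (a generator may fool the visual branch on a single frame yet still betray itself temporally), the fused score degrades more gracefully than either branch alone.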
Metadata & Heuristic Checks
Beyond neural analysis, the system inspects video metadata for provenance signals like C2PA content credentials and other authenticity markers.
These heuristics provide additional context, especially useful for videos that carry digital provenance information from their source platform.
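A toy version of such a heuristic pass might scan parsed container metadata for provenance markers. The key names below are hypothetical, and note that real C2PA credentials live in a cryptographically signed manifest that requires full verification, not a string match:

```python
def provenance_hints(metadata: dict) -> list[str]:
    """Collect coarse provenance hints from parsed video metadata.

    Key names ("c2pa_manifest", "encoder") are hypothetical; a real
    implementation would verify the C2PA manifest signature rather
    than merely noting its presence.
    """
    hints = []
    if "c2pa_manifest" in metadata:
        hints.append("c2pa-credentials-present")
    encoder = str(metadata.get("encoder", "")).lower()
    if any(tag in encoder for tag in ("sora", "runway", "pika")):
        hints.append("generator-encoder-tag")
    return hints
```

These hints feed into the final decision as context rather than as hard evidence, since metadata is trivially stripped or forged.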
Attribution Heatmaps
For each analyzed frame, gradient-based attribution maps highlight which spatial regions contributed most to the model's decision.
Warm regions in the heatmap indicate areas the model considers synthetic. This gives users a visual explanation of why the model made its prediction — not just a score.
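The mechanics of gradient-based attribution can be illustrated with a toy linear score, score(x) = sum(x * weights), whose gradient with respect to the frame is just `weights`; taking |gradient x input| and normalizing yields the heatmap. The real system backpropagates through the full network, so this is only a shape-level sketch:

```python
import numpy as np

def saliency_map(frame: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Gradient-times-input attribution for a toy linear score.

    For score(x) = sum(x * weights), d(score)/d(frame) == weights.
    |gradient * input|, scaled to [0, 1], plays the role of the
    per-frame heatmap.
    """
    grad = weights                 # analytic gradient of the toy score
    heat = np.abs(grad * frame)
    return heat / (heat.max() + 1e-8)

frame = np.random.rand(8, 8)
weights = np.random.rand(8, 8)
heat = saliency_map(frame, weights)
```

Pixels where both the input and the gradient are large light up, which is exactly the "which regions drove the score" question the heatmap answers.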
Limitations & Honest Assessment
No detection system is perfect, and SeeThrough is no exception. We believe in transparency about what our system can and cannot do.
- Rapidly evolving generators: New models produce fewer artifacts. Detection must continuously adapt.
- Compression artifacts: Heavy re-encoding (common on social platforms) can mask or mimic generation artifacts, causing false positives.
- Short clips: Videos under 3 seconds provide limited temporal data for reliable analysis.
- Hybrid content: Videos mixing real footage with AI-generated segments are harder to classify cleanly.
- Adversarial evasion: Motivated adversaries can deliberately craft content to evade detection.
SeeThrough is a research prototype and screening tool — not a forensic instrument. Always combine automated analysis with human judgement for high-stakes decisions.