Audio & Voice Detection

Understand how Reality Defender detects AI-generated voice content.

Written by Emily Essig
Updated this week

Learn the difference between Text-to-Speech (TTS) and Voice Conversion (VC) methods, how our spectrogram-based neural models work, how the ensemble decision rule determines the final classification, and which factors affect results, such as noise, speaker count, and clip length.

How RD Detects Audio Deepfakes

Reality Defender analyzes two primary types of AI-generated voice content:

  • Text-to-Speech (TTS):
    Converts written text into speech using a target speaker’s voice. Modern TTS systems can convincingly clone a voice from just a few seconds of recorded speech.

  • Voice Conversion (VC):
    Takes speech from a source speaker and alters it to sound like a target speaker — essentially impersonation. VC models also need only a few seconds of the target’s voice for convincing synthesis.

In some rare cases, TTS and VC can be combined, though these hybrid approaches are typically less efficient.

Our Detection Methodology

Reality Defender’s detection models convert input audio into spectrograms — image-like representations showing how frequency content changes over time.

  • Each deepfake generation method (TTS vs VC) leaves unique artifacts or irregularities within these spectrograms.

  • Our custom neural network architecture identifies and interprets these artifact patterns.

  • The model outputs a confidence score (1–99%), indicating the likelihood that a given audio sample is fake or manipulated.

This probabilistic scoring gives users greater transparency and interpretability than a simple “real/fake” binary label.
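To make the spectrogram idea concrete, here is a minimal sketch using the open-source SciPy library to turn a waveform into a time-frequency representation. It illustrates the general technique only; the window length, overlap, and scaling below are assumed example values, not the transform or parameters Reality Defender actually uses.

```python
# Illustrative only: how a waveform becomes a spectrogram, the image-like
# time-frequency representation described above. The parameters (window
# length, overlap) are arbitrary example values, not Reality Defender's.
import numpy as np
from scipy.signal import spectrogram

sample_rate = 16_000                       # 16 kHz mono audio (assumed)
t = np.linspace(0, 2.0, 2 * sample_rate)   # 2 seconds of synthetic audio
waveform = np.sin(2 * np.pi * 440 * t)     # stand-in for real speech

# STFT-based spectrogram: rows = frequency bins, columns = time frames.
freqs, times, power = spectrogram(
    waveform,
    fs=sample_rate,
    nperseg=512,      # ~32 ms analysis window
    noverlap=256,     # 50% overlap between windows
)

# A log scale makes subtle synthesis artifacts easier for a model to see.
log_spec = 10 * np.log10(power + 1e-10)
print(log_spec.shape)  # (frequency bins, time frames)
```

A detection model then treats this two-dimensional array much like an image, looking for the artifact patterns each generation method leaves behind.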


How Final Audio Classification Is Determined

Reality Defender’s audio detection pipeline uses an ensemble system of two specialized neural models.

Step 1: Segmentation

  • Audio files are divided into segments for efficient parallel processing.

  • Each segment is independently evaluated by both models.
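As a simple illustration of the segmentation step (not Reality Defender's actual implementation), the sketch below splits a waveform into fixed-length, non-overlapping segments so each one can be scored independently. The 4-second segment length is an assumption chosen for the example.

```python
# Illustrative only: split a waveform into fixed-length segments so each
# segment can be scored independently. The segment length is an assumed value.
import numpy as np

def split_into_segments(waveform: np.ndarray, sample_rate: int,
                        segment_seconds: float = 4.0) -> list[np.ndarray]:
    """Return non-overlapping segments; a short trailing remainder is dropped."""
    segment_len = int(segment_seconds * sample_rate)
    n_segments = len(waveform) // segment_len
    return [waveform[i * segment_len:(i + 1) * segment_len]
            for i in range(n_segments)]

# Example: 10 seconds of audio at 16 kHz -> two full 4-second segments.
audio = np.zeros(10 * 16_000)
segments = split_into_segments(audio, 16_000)
print(len(segments))  # 2
```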

Step 2: Independent Scoring

  • Each model outputs its own fake/real probability score for each segment.

Step 3: Ensemble Decision Rule

  • The two model outputs are combined by an ensemble model that makes the final classification.

  • To classify an entire audio file as fake, at least two consecutive segments must be marked as fake.

  • This rule helps minimize false positives, so genuine audio is rarely misclassified.
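To show how the consecutive-segment rule behaves, here is a minimal sketch. It assumes two per-segment score lists (one per model), a simple averaging combiner, and a 0.5 threshold; all three are assumptions for illustration and do not describe Reality Defender's actual ensemble.

```python
# Illustrative only: combine two models' per-segment fake probabilities and
# flag the whole file as fake only if at least two consecutive segments are
# flagged. The averaging combiner and the 0.5 threshold are assumed values.
def classify_file(scores_model_a: list[float],
                  scores_model_b: list[float],
                  threshold: float = 0.5) -> bool:
    """Return True if the file is classified as fake."""
    consecutive_fake = 0
    for a, b in zip(scores_model_a, scores_model_b):
        ensemble_score = (a + b) / 2           # simple averaging combiner (assumed)
        if ensemble_score >= threshold:
            consecutive_fake += 1
            if consecutive_fake >= 2:          # two consecutive fake segments
                return True
        else:
            consecutive_fake = 0
    return False

# One isolated suspicious segment does not flip the whole file...
print(classify_file([0.2, 0.9, 0.3], [0.1, 0.8, 0.2]))  # False
# ...but two consecutive suspicious segments do.
print(classify_file([0.2, 0.9, 0.8], [0.1, 0.8, 0.9]))  # True
```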

Important: The current system is not optimized to detect partial fakes (where only part of an audio clip is manipulated). The classifier determines authenticity for the entire file based on overall signal consistency.


Related FAQs

  • What types of audio deepfakes are detected?
    TTS (text-to-speech) and VC (voice conversion).

  • How confident are detections?
    Each model produces a 1–99% confidence score; the ensemble aggregates them for the final classification.

  • Can you detect overlapping speakers?
    Not yet — the current system focuses on single-speaker scenarios.

  • Can you tell whether a fake is TTS or voice conversion?
    No, differentiation between TTS and VC isn’t yet available.

  • Do you create voice prints or verify identity?
    No. Reality Defender only detects whether content is synthetic — not who produced it.
