
Audio Detection: How Results Are Returned

Reality Defender’s audio detection uses multiple specialized models to assess synthetic speech and returns per-model scores to support interpretation.

Written by Wen Huang

Reality Defender's audio detection analyzes speech content using multiple models to identify signs of AI generation or manipulation. Each audio file receives a single overall judgement and score, produced by combining signals across models that ran on that content.


What's included in audio analysis

Audio detection includes signals from the following individual models, combined by an ensemble:

  • rd-everest-aud (Advanced) — uses general-purpose embeddings to discriminate synthetic from authentic speech

  • rd-marconi-aud (Foundational) — uses embeddings extracted from an internal proprietary model trained to expose a variety of generative artifacts in synthetic speech

  • rd-slim-aud (Generalizable) — detects mismatches in style and linguistic patterns that are otherwise unique to real human speech

All of these models contribute to a single rd-aud-ensemble score, which represents the combined assessment of manipulation likelihood for the file.


What you'll see in the API response

Each model returns its own result in the models[] array, including a status, finalScore, and predictionNumber. The top-level resultsSummary contains the overall verdict and aggregated score.

Models that are not applicable to a given media type (for example, video or image models on an audio file) will return "status": "NOT_APPLICABLE" with null score fields. This is expected behavior.
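As a rough illustration of the response shape described above, the following Python sketch filters out models that were not applicable before reading their scores. The sample payload is invented for illustration: the "name" key, the status strings other than "NOT_APPLICABLE", the score values and their scale, and the non-audio model name are all assumptions and are not taken from real API output.

```python
from typing import Any

# Illustrative payload only. Field values, the "name" key, and every status
# string except "NOT_APPLICABLE" are assumptions made for this sketch.
sample_response: dict[str, Any] = {
    "resultsSummary": {},  # contents omitted; see the API reference for the real shape
    "models": [
        {"name": "rd-everest-aud", "status": "MANIPULATED", "finalScore": 0.91, "predictionNumber": 0.91},
        {"name": "rd-marconi-aud", "status": "AUTHENTIC", "finalScore": 0.22, "predictionNumber": 0.22},
        {"name": "rd-slim-aud", "status": "MANIPULATED", "finalScore": 0.78, "predictionNumber": 0.78},
        {"name": "rd-aud-ensemble", "status": "MANIPULATED", "finalScore": 0.85, "predictionNumber": 0.85},
        # Hypothetical non-audio model: not applicable to an audio file.
        {"name": "hypothetical-video-model", "status": "NOT_APPLICABLE", "finalScore": None, "predictionNumber": None},
    ],
}


def applicable_models(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Return only the model entries that actually ran on this media type."""
    return [
        m
        for m in response.get("models", [])
        if m.get("status") != "NOT_APPLICABLE" and m.get("finalScore") is not None
    ]


for model in applicable_models(sample_response):
    print(f'{model["name"]}: {model["status"]} ({model["finalScore"]})')
```

Filtering on both the status and a null score keeps downstream code from attempting arithmetic on missing values.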


How to interpret audio model results

The individual models use different approaches to detect synthetic speech. Divergence between them is expected: one model may detect manipulation that another does not, depending on how the audio was generated. The ensemble accounts for this by combining their signals into a single calibrated score.

Audio models analyze content in 3-second segments, with an aggregation model producing the final decision for the full file. The score you receive reflects the aggregated assessment across the entire audio.
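To make the segment-then-aggregate idea concrete, here is a minimal Python sketch. It is not Reality Defender's implementation: the treatment of a trailing partial segment is an assumption, and a simple mean stands in for the actual aggregation model, whose details are not described here.

```python
import math
from statistics import mean

SEGMENT_SECONDS = 3.0  # segment length stated in the article


def segment_bounds(duration_s: float, segment_s: float = SEGMENT_SECONDS) -> list[tuple[float, float]]:
    """Split [0, duration_s) into consecutive windows of segment_s seconds.

    Assumption: a shorter final window is kept rather than dropped or padded.
    """
    count = math.ceil(duration_s / segment_s) if duration_s > 0 else 0
    return [
        (i * segment_s, min((i + 1) * segment_s, duration_s))
        for i in range(count)
    ]


def aggregate(segment_scores: list[float]) -> float:
    """Stand-in aggregator: a plain mean, NOT the real aggregation model."""
    if not segment_scores:
        raise ValueError("at least one segment score is required")
    return mean(segment_scores)


# A 10-second file yields three full windows and one 1-second remainder.
print(segment_bounds(10.0))
```

The point of the sketch is only that the file-level score summarizes many per-segment assessments, so a short manipulated passage in a long recording can still influence the final result.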

For most use cases, the overall audio judgement and score in resultsSummary should be used as the primary decision signal. Individual model results are available for debugging, auditing, or deeper investigation.
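One way to apply this guidance in client code is sketched below in Python: the decision is read from resultsSummary, and the per-model entries are carried along only as an audit trail. The key names inside resultsSummary ("status" and "finalScore" here) and the "name" key on model entries are hypothetical placeholders, so they are passed as parameters; check the API reference for the actual field names.

```python
from typing import Any


def primary_decision(
    response: dict[str, Any],
    verdict_key: str = "status",     # hypothetical key name inside resultsSummary
    score_key: str = "finalScore",   # hypothetical key name inside resultsSummary
) -> dict[str, Any]:
    """Use resultsSummary as the decision signal; keep model results for auditing."""
    summary = response.get("resultsSummary") or {}
    audit = [
        {"name": m.get("name"), "status": m.get("status"), "finalScore": m.get("finalScore")}
        for m in response.get("models", [])
        if m.get("status") != "NOT_APPLICABLE"
    ]
    return {
        "verdict": summary.get(verdict_key),
        "score": summary.get(score_key),
        "audit": audit,  # for debugging or review, not for the decision itself
    }


# Invented example payload; the values are illustrative only.
example = {
    "resultsSummary": {"status": "MANIPULATED", "finalScore": 0.85},
    "models": [
        {"name": "rd-everest-aud", "status": "MANIPULATED", "finalScore": 0.91},
        {"name": "hypothetical-image-model", "status": "NOT_APPLICABLE", "finalScore": None},
    ],
}
decision = primary_decision(example)
print(decision["verdict"], decision["score"])
```

Keeping the audit list separate from the verdict makes it harder to accidentally branch on a single model's output when the ensemble result is what should drive the workflow.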
