How does Reality Defender detect deepfaked audio?

Our team builds neural network models that help distinguish fake audio from real audio. On the back end, our models convert the input speech signal into image representations called spectrograms, which expose how the signal varies across both time and frequency. State-of-the-art (SOTA) text-to-speech (TTS) and voice conversion (VC) methods leave different types of artifacts in the spectrograms, which we capture using a custom neural network architecture.
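To illustrate what a spectrogram is, here is a minimal sketch of a magnitude spectrogram computed with a short-time Fourier transform, using only the Python standard library. It is not Reality Defender's pipeline: the function name, frame length, hop size, and Hann window are all assumptions chosen for illustration, and the production models may use a different representation entirely.

```python
import cmath
import math


def spectrogram(samples, frame_len=64, hop=32):
    """Return one list of frequency-bin magnitudes per time frame.

    Hypothetical helper for illustration; frame_len and hop are assumed values.
    """
    # A Hann window tapers each frame to reduce spectral leakage at its edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        magnitudes = []
        # Naive DFT over the non-negative frequency bins of this frame.
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# A 1 kHz tone sampled at 8 kHz: each bin spans 8000 / 64 = 125 Hz,
# so the energy should concentrate in bin 1000 / 125 = 8.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

Each row of `spec` is a time frame and each column a frequency bin, which is what lets the result be treated as an image. A real system would typically use an optimized FFT library rather than this naive transform.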

Our models output a probability score between 1% and 99%, with higher scores denoting a higher chance that the input speech is fake or manipulated. By reporting a probability score instead of a simple yes or no, we can convey how confident we are in our decision on whether an audio file is fake or real.

How is the final audio file classification determined?

The audio deepfake detector currently uses two distinct audio models to classify audio files as fake or real. The two models run independently on the audio file, and their results are then processed by an ensemble model that ultimately classifies the file as real or fake.

Audio files are broken up into six-second segments, and each segment is run through the models to get a result. It’s important to note that although we process audio files in increments, our models are not optimized to detect partial fakes, where some portions of the audio file are real and others are fake.
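The segmentation step can be sketched as follows. This is a simplified illustration, not the production code: the function name is hypothetical, and keeping a shorter trailing segment is an assumption, since the handling of leftover audio is not described here.

```python
SEGMENT_SECONDS = 6  # segment length stated in the text


def split_into_segments(samples, sample_rate, segment_seconds=SEGMENT_SECONDS):
    """Split a sequence of audio samples into consecutive fixed-length segments.

    Hypothetical helper. Assumption: a shorter final segment is kept as-is;
    a real system might pad it, drop it, or mark it unusable.
    """
    segment_len = sample_rate * segment_seconds
    return [samples[i:i + segment_len]
            for i in range(0, len(samples), segment_len)]


# 15 seconds of silence at a toy sample rate of 100 Hz.
audio = [0.0] * (15 * 100)
segments = split_into_segments(audio, sample_rate=100)
segment_lengths = [len(s) for s in segments]
```

Here `segment_lengths` holds two full six-second segments followed by one three-second remainder.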

The audio deepfake detector bases the final audio file classification on the results of the ensemble model. For an audio file to be classified as fake, the ensemble model must detect two consecutive segments as fake. Note that these only need to be two consecutive segments of usable audio; unusable segments in between are skipped. For example, if a segment is classified as fake, the following segment is classified as unusable, and the next segment is classified as fake, the audio file will still be classified as fake.
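The consecutive-segment rule can be expressed as a short sketch. The label strings and the function name are assumptions made for illustration, and returning "real" when no qualifying pair exists (including when every segment is unusable) is also an assumption, since that case is not specified here.

```python
def classify_file(segment_labels):
    """Classify a file from per-segment ensemble labels.

    Hypothetical helper. Labels are assumed to be "fake", "real", or "unusable".
    """
    # Unusable segments are skipped, so the fake segments on either side
    # of an unusable one count as consecutive.
    usable = [label for label in segment_labels if label != "unusable"]
    for previous, current in zip(usable, usable[1:]):
        if previous == "fake" and current == "fake":
            return "fake"
    # Assumption: anything without two consecutive fake segments is "real".
    return "real"


example_with_gap = classify_file(["fake", "unusable", "fake"])
example_interrupted = classify_file(["fake", "real", "fake"])
example_single = classify_file(["real", "fake"])
```

The first example matches the case described above and yields "fake". In the second, a real segment separates the two fake ones, so the file is classified as real, and a single isolated fake segment in the third is likewise not enough.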

Our model behaves this way so that a file is classified as fake only when the signal is very strong. This means the model is less likely to classify a real audio file as fake, although some fake audio files may be classified as real.

We are constantly updating the ensemble model and individual models to improve accuracy and performance.