What kind of AI-generated audio does Reality Defender look at?

There are two main types of voice generation methods today: text-to-speech (TTS) and voice conversion (VC).

TTS methods take as input a piece of text (e.g., a sentence or paragraph) and generate a speech signal from it by giving it the voice of a target speaker. Today, TTS methods need only a few seconds of the target speaker’s voice to produce good-quality fake speech.

VC methods take as input speech from a source speaker and aim to impersonate a target speaker. That is, they generate a speech signal in which the content spoken by the source speaker is delivered in the target speaker’s voice. Like TTS, VC methods today need only a few seconds of the target speaker’s voice to impersonate that speaker.

It is also possible to apply VC on top of TTS output, but such cases are rare and the approach is generally less efficient.
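
The difference between the two methods comes down to where the spoken content comes from. The Python sketch below is purely illustrative and processes no audio: the names VoiceSample, Speech, text_to_speech and voice_conversion are hypothetical and do not correspond to any real library or to Reality Defender's tooling. It only models which input supplies the content and which supplies the voice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceSample:
    """A short reference clip of a speaker (hypothetical type)."""
    speaker: str
    seconds: float


@dataclass(frozen=True)
class Speech:
    """Speech described only by what is said and whose voice says it."""
    content: str
    voice: str


def text_to_speech(text: str, target: VoiceSample) -> Speech:
    # TTS: the content comes from text, the voice from the target speaker.
    return Speech(content=text, voice=target.speaker)


def voice_conversion(source: Speech, target: VoiceSample) -> Speech:
    # VC: the content comes from the source speech, the voice from the target.
    return Speech(content=source.content, voice=target.speaker)


# A few seconds of the target speaker's voice (illustrative value).
target = VoiceSample(speaker="target", seconds=5.0)

tts_out = text_to_speech("Hello there.", target)
vc_out = voice_conversion(Speech(content="Hello there.", voice="source"), target)

# VC on top of TTS: synthesize with some other voice first, then convert it.
stock_voice = VoiceSample(speaker="stock", seconds=5.0)
intermediate = text_to_speech("Hello there.", stock_voice)
chained_out = voice_conversion(intermediate, target)
```

In this toy model all three routes end with the same content in the target speaker's voice; the chained route simply takes two generation steps instead of one.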