Deepfakes are becoming significantly harder to detect without proactive detection platforms like Reality Defender. For instance, you can now create a convincing sound clip or generate text that sounds and reads as entirely authentic.
That said, below are some tips from our team on how to train yourself in spotting potentially AI-generated or -manipulated materials.
What to look for when listening to suspected manipulated or fake audio:
“Warped” or warbly audio sounds
Hard to put into words
Almost like a “tremolo” effect
Kind of like the “breath” is taken out of the audio only on certain syllables, seemingly at random
Pacing is too monotone
Most text-conditioned speech synthesis algorithms rely on prosody/duration prediction, using tools and models such as MFA or PL-BERT
This means each letter (or phoneme) will have a certain, predefined duration
The end result is an audio sample that sounds too “straightforward” in the speech pattern.
There may be diversity in the length of words, but it won’t be enough
Length of letters/phonemes occurs in quantized chunks
If the duration of every letter/phoneme is roughly a multiple of one common unit, this is a dead giveaway for text-conditioned speech synthesis
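As a rough illustration of this cue (and not a Reality Defender tool), the sketch below checks whether a list of phoneme durations all land near multiples of a single frame length. The durations are assumed to come from a forced aligner such as MFA, and the frame length and tolerance values are purely illustrative.

```python
# Sketch: check whether phoneme durations look "quantized", i.e. all close
# to multiples of one common frame length. Durations (in seconds) are assumed
# to come from a forced aligner such as MFA; frame and tolerance are illustrative.
import numpy as np

def quantization_score(durations, frame=0.0116, tolerance=0.002):
    """Fraction of durations within `tolerance` of a multiple of `frame`.

    frame=0.0116 s roughly matches a 256-sample hop at 22.05 kHz, a common
    neural-TTS setting (an assumption, not a constant of any one system).
    """
    durations = np.asarray(durations, dtype=float)
    remainder = np.abs(durations / frame - np.round(durations / frame)) * frame
    return float(np.mean(remainder < tolerance))

# Suspiciously regular durations score close to 1.0; natural speech scores lower.
print(quantization_score([0.0464, 0.0928, 0.0580, 0.1160, 0.0696]))
```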
“Raw”/”Raspy” sounding voice
In voice-to-voice systems, the output often ends up sounding raspy
Original sample rate (SR) is a “known” value
The original SR may be 16 kHz, 22.05 kHz, or 24 kHz
The SR may be changed, but the change will be obvious, and it’ll be easy to find the original SR
Doesn’t always work - DM produces audio at 48 kHz (the HD audio standard for TV/film)
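One rough way to act on this cue is to look at where the audio’s energy stops in the frequency domain: a file delivered at 48 kHz whose content dies out around 8 kHz was likely produced at 16 kHz (Nyquist = 8 kHz) and upsampled afterwards. The sketch below is a minimal, assumption-laden version of that check; the -60 dB floor is an arbitrary illustrative threshold.

```python
# Sketch: estimate the "effective" bandwidth of a clip to guess the original
# sample rate before any resampling. The -60 dB floor is illustrative.
import numpy as np

def effective_bandwidth_hz(samples, sample_rate, floor_db=-60.0):
    samples = np.asarray(samples, dtype=float)
    spectrum = np.abs(np.fft.rfft(samples * np.hanning(len(samples))))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    db = 20.0 * np.log10(spectrum / (spectrum.max() + 1e-12) + 1e-12)
    above_floor = np.where(db > floor_db)[0]
    return float(freqs[above_floor[-1]]) if len(above_floor) else 0.0

# A 48 kHz clip with an effective bandwidth of ~8 kHz or ~11 kHz points to an
# original SR of 16 kHz or 22.05 kHz respectively.
```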
Inspect the MelSpec
Always look at the MelSpec and waveform while listening to the audio; never rely on your ears alone.
You’re looking for the same things in the MelSpec, but using eyes and ears together is far more powerful
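Here is one minimal way to put the waveform and MelSpec side by side while you listen, assuming librosa (a recent version with waveshow) and matplotlib are available; the file name and parameters such as n_mels are placeholders.

```python
# Sketch: show the waveform and mel spectrogram together so the clip can be
# inspected visually while listening. Assumes librosa >= 0.9 and matplotlib;
# the file name and n_mels are illustrative placeholders.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("suspect_clip.wav", sr=None)  # keep the file's own SR
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

fig, (ax_wave, ax_mel) = plt.subplots(2, 1, figsize=(10, 6))
librosa.display.waveshow(y, sr=sr, ax=ax_wave)
ax_wave.set_title("Waveform")
img = librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel", ax=ax_mel)
ax_mel.set_title("Mel spectrogram")
fig.colorbar(img, ax=ax_mel, format="%+2.0f dB")
plt.tight_layout()
plt.show()
```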
Elements that suggest the audio is NOT fake
Large emotions like laughing, crying, yelling - these are hard for most systems to fake for now
Large diversity of prosody/speech patterns
Large diversity in pitch
Large diversity in volume
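As a crude proxy for the pitch and volume diversity cues above, the sketch below measures how much the fundamental frequency and short-term energy vary across a clip. The pyin frequency range is an assumption, and very flat numbers are only a hint, never proof.

```python
# Sketch: rough proxies for pitch and volume diversity. Very flat statistics
# are consistent with (but never proof of) synthetic speech. The pyin range
# and any threshold applied to these numbers are assumptions for illustration.
import numpy as np
import librosa

def prosody_diversity(path):
    y, sr = librosa.load(path, sr=None)
    f0, _, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    f0 = f0[~np.isnan(f0)]                 # keep voiced frames only
    rms = librosa.feature.rms(y=y)[0]
    return {
        "pitch_std_hz": float(np.std(f0)) if len(f0) else 0.0,
        "volume_std_db": float(np.std(20.0 * np.log10(rms + 1e-8))),
    }
```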
Things that don’t help
Background noise - it’s trivial to add background noise to AI-generated speech. The presence or absence of background noise doesn’t mean anything
What to look for when watching a suspected manipulated or fake video:
Focus on the boundary of the face to detect faceswap-type deepfakes, e.g., a skin color mismatch in regions next to the face boundary (see the sketch after this list).
Unnatural face muscle movements are a sign of synthetic emotions on a real image.
If the whole face is still but only the lips/eyes are moving, it could be a face-reenactment-type deepfake.
Repetition of only a small set of expressions/gestures is a sign of a fully synthetic person video generated using a GAN or diffusion model.
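To illustrate the face-boundary cue from the first item above, here is a rough sketch that detects a face with OpenCV’s bundled Haar cascade and compares the average color just inside the detected box with a thin band just outside it. The margin width and the use of a simple mean-color gap are illustrative assumptions, not a production faceswap detector.

```python
# Sketch: compare the average color inside a detected face box with a thin
# band just outside it, as a crude stand-in for "skin color mismatch at the
# face boundary". Margin and the mean-color gap metric are illustrative.
import cv2
import numpy as np

def boundary_color_gaps(frame_bgr, margin=10):
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    h_img, w_img = frame_bgr.shape[:2]
    gaps = []
    for (x, y, w, h) in faces:
        inside = frame_bgr[y : y + h, x : x + w]
        x0, y0 = max(x - margin, 0), max(y - margin, 0)
        x1, y1 = min(x + w + margin, w_img), min(y + h + margin, h_img)
        expanded = frame_bgr[y0:y1, x0:x1]
        mask = np.ones(expanded.shape[:2], dtype=bool)
        mask[y - y0 : y - y0 + h, x - x0 : x - x0 + w] = False  # drop the face region
        outside = expanded[mask]
        if inside.size == 0 or outside.size == 0:
            continue
        gap = np.linalg.norm(
            inside.reshape(-1, 3).mean(axis=0) - outside.reshape(-1, 3).mean(axis=0)
        )
        gaps.append(float(gap))  # larger = bigger inside/outside color difference
    return gaps
```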
What to look for when inspecting a suspected manipulated or fake image:
Eye symmetry and eye-reflection symmetry
Patterns of “tentacles” coming out of the head. This usually happens with hair. If you look at the boundary between the hair and the background, you can sometimes see a pattern of individual hairs that protrude from the head into the background.
Overly-smooth images that look ethereal or too good to be true, like a highly edited image. This is common for diffusion images.
GAN-generated images often have distinctive backgrounds, e.g., blurred, smeared, or warped scenery behind the subject
Subtle boundary edges between the face features and the rest of the head. This can indicate faceswaps.
Lighting/shadow/texture inconsistencies (a simple way to surface these is sketched after this list)
Inconsistencies in the details of the face.
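One simple way to make subtle boundary seams and texture inconsistencies easier to see is to look at a high-pass residual of the image: regions pasted in from a different source often carry a different noise texture. The Gaussian blur kernel size below is an illustrative choice, and the file names are placeholders.

```python
# Sketch: visualize the high-frequency residual of an image. Spliced or
# swapped regions often carry a different noise/texture pattern, which can
# make boundary seams easier to spot by eye. Kernel size is illustrative.
import cv2
import numpy as np

def noise_residual(image_bgr, ksize=5):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    lowpass = cv2.GaussianBlur(gray, (ksize, ksize), 0)
    residual = gray - lowpass
    # rescale to 0-255 so the residual can be viewed as an image
    residual = cv2.normalize(residual, None, 0, 255, cv2.NORM_MINMAX)
    return residual.astype(np.uint8)

# Usage with placeholder file names:
# cv2.imwrite("residual.png", noise_residual(cv2.imread("suspect.jpg")))
```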
What to look for when reading suspected manipulated or fake text:
For pre-ChatGPT models:
Repetition, redundancy (a simple way to quantify this is sketched at the end of this section)
Logical inconsistencies, commonsense errors
Incoherence
For LLMs released after ChatGPT:
All of the patterns/artifacts above still apply to post-ChatGPT LLMs. However, they are significantly less present. In most cases, there may not be enough human-identifiable surface-level artifacts present that are strongly indicative of LLM-generated text.
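For the repetition and redundancy cue listed under the pre-ChatGPT patterns, here is a small sketch that counts repeated word n-grams in a passage. The n-gram length and any “too repetitive” cutoff you apply on top of it are illustrative assumptions.

```python
# Sketch: count repeated word n-grams as a crude redundancy signal for
# (mostly pre-ChatGPT) machine-generated text. The n-gram length and any
# "too repetitive" threshold are illustrative assumptions.
from collections import Counter

def repeated_ngram_fraction(text, n=3):
    words = text.lower().split()
    ngrams = [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

# Higher values mean more of the passage is built from n-grams that occur
# more than once; human prose of similar length usually scores lower.
```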