
How to Spot Deepfakes

Written by Diana Hsieh
Updated over a year ago

Deepfakes are becoming far harder to detect without proactive detection platforms like Reality Defender. For instance, it is now possible to create a convincing voice clip or generate text that sounds and looks entirely authentic.

That said, below are some tips from our team on how to train yourself to spot potentially AI-generated or -manipulated materials.

What to look for when listening to suspected manipulated or fake audio:

  • “Warped” or warbly audio sounds

    • Hard to put into words

    • Almost like a “tremolo” effect

    • Almost as if the “breath” has been taken out of the audio only on certain syllables, seemingly at random

  • Pacing is too monotone

    • Most text-conditioned voice-synthesis algorithms use prosody prediction (e.g., MFA or PL-BERT)

    • This means each letter (or phoneme) will be given a certain, predefined duration

    • The end result is an audio sample that sounds too “straightforward” in the speech pattern.

    • There may be diversity in the length of words, but it won’t be enough

  • Length of letters/phonemes occurs in quantized chunks

    • If the lengths of all letters/phonemes are multiples of some common duration, that is a dead giveaway for text-conditioned speech synthesis

  • “Raw”/”Raspy” sounding voice

    • In voice-to-voice systems, the end result will often end up sounding raspy

  • Original sample rate (SR) is a “known” value

    • The original SR may be 16 kHz, 22,050 Hz, or 24 kHz

    • The SR may be changed, but the change will be obvious, and the original SR will be easy to find (see the sample-rate sketch after this list)

    • This doesn’t always work; DM produces audio at 48 kHz (the HD audio standard for TV/film)

  • Inspect the mel spectrogram (MelSpec)

    • Always look at the MelSpec and waveform while listening to the audio; never use your ears alone (a plotting sketch follows this list).

    • You’re looking for the same things in the MelSpec, but using eyes and ears together is far more powerful

  • Elements that suggest the audio is NOT fake

    • Strong emotions like laughing, crying, or yelling; these are hard for most systems to fake for now

    • Large diversity of prosody/speech patterns

    • Large diversity in pitch

    • Large diversity in volume

  • Things that don’t help

    • Background noise: it’s trivial to add background noise to AI-generated speech, so the presence or absence of background noise doesn’t mean anything
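
For the sample-rate check described above, here is a minimal sketch (assuming Python with librosa and numpy installed; the filename "suspect.wav" and the -60 dB threshold are illustrative placeholders) that reports a file's stated sample rate and estimates where its spectral energy actually cuts off. A cutoff near 8 kHz, ~11 kHz, or 12 kHz in a file stored at 44.1/48 kHz hints that the audio was generated at 16 kHz, 22,050 Hz, or 24 kHz and then upsampled.

```python
import numpy as np
import librosa

# Load with sr=None so librosa keeps the file's native sample rate.
y, sr = librosa.load("suspect.wav", sr=None)  # "suspect.wav" is a placeholder
print(f"Reported sample rate: {sr} Hz (Nyquist = {sr / 2:.0f} Hz)")

# Average the magnitude spectrum over time.
n_fft = 2048
spec = np.abs(librosa.stft(y, n_fft=n_fft))
mean_spec = spec.mean(axis=1)
freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)

# Find the highest frequency with energy above -60 dB relative to the peak
# (the threshold is an arbitrary, illustrative choice).
threshold = mean_spec.max() * 10 ** (-60 / 20)
above = np.where(mean_spec > threshold)[0]
cutoff = freqs[above[-1]] if above.size else 0.0
print(f"Approximate energy cutoff: {cutoff:.0f} Hz")

# A cutoff near 8000, 11025, or 12000 Hz in a 44.1/48 kHz file hints that the
# audio was generated at 16k/22050/24k and resampled afterwards.
```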
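
And for the MelSpec tip, a minimal sketch (again assuming Python with librosa and matplotlib; "suspect.wav" is a placeholder) that plots the waveform and mel spectrogram together so you can use eyes and ears at the same time:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("suspect.wav", sr=None)  # "suspect.wav" is a placeholder

fig, (ax_wave, ax_mel) = plt.subplots(2, 1, figsize=(10, 6), sharex=True)

# Waveform: watch for syllables where the "breath" seems to drop out.
librosa.display.waveshow(y, sr=sr, ax=ax_wave)
ax_wave.set_title("Waveform")

# Mel spectrogram: watch for warbly harmonics, overly regular phoneme lengths,
# and an unnaturally hard high-frequency cutoff.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)
img = librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel", ax=ax_mel)
ax_mel.set_title("Mel spectrogram")
fig.colorbar(img, ax=ax_mel, format="%+2.0f dB")

plt.tight_layout()
plt.show()
```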

What to look for when watching a suspected manipulated or fake video:

  • Focus on the boundary of the face to detect faceswap-type deepfakes, e.g., a skin-color mismatch in regions next to the face boundary.

  • Unnatural facial muscle movements are a sign of synthetic emotions applied to a real image.

  • If the whole face is still but only the lips/eyes are moving, it could be a face-reenactment-type deepfake.

  • Repetition of only a small set of expressions/gestures is a sign of a fully synthetic person video generated using a GAN or diffusion model.

What to look for when inspecting a suspected manipulated or fake image:

  • Eye symmetry and eye-reflection symmetry

  • Patterns of “tentacles” coming out of the head. This usually happens with hair: if you look at the boundary between the hair and the background, you can sometimes see a pattern of single hairs that protrude from the head into the background.

  • Overly smooth images that look ethereal or too good to be true, like a highly edited photo. This is common for diffusion-generated images.

  • GAN-generated images often have distinctive backgrounds, e.g., blurred or warped surroundings that don’t resolve into real objects

  • Subtle boundary edges between the facial features and the rest of the head. These can indicate faceswaps.

  • Lighting/shadow/texture inconsistencies

  • Inconsistencies in the details of the face.

What to look for when reading suspected manipulated or fake text:

For pre-ChatGPT models:

  • Repetition and redundancy (see the sketch after this list for one simple way to surface this)

  • Logical inconsistencies, commonsense errors

  • Incoherence
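
As one illustrative way to surface the repetition/redundancy artifact, here is a minimal sketch in plain Python (the trigram size, threshold-free output, and example sentence are arbitrary choices, not a calibrated detector) that measures what share of word trigrams occur more than once:

```python
from collections import Counter

def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Share of word n-grams that appear more than once (0.0 to 1.0)."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

# Hypothetical example text, not real model output.
sample = ("the market is growing and the market is growing "
          "because the market is growing")
print(f"Repeated trigram ratio: {repeated_ngram_ratio(sample):.2f}")
# A high ratio is a hint worth a closer read, not proof of machine-written text.
```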

For LLMs released after ChatGPT:

  • All of the patterns/artifacts above still apply to post-ChatGPT LLMs, but they are significantly less prevalent. In most cases, there may not be enough human-identifiable surface-level artifacts to strongly indicate LLM-generated text.
