We use state-of-the-art neural networks (CNNs, Transformers, and ViTs, including large foundation models) to learn discriminative features that differentiate generative media from real media. We combine spatial, temporal, and frequency-domain analysis with domain-specific feature losses (e.g., losses targeting known generation artifacts in images) to model the detection problem. Further, we support multi-model and multi-modal ensembles across the supported modalities.
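As a minimal sketch of the frequency-domain analysis mentioned above (the function name and the plain-NumPy pipeline are illustrative assumptions, not our production implementation), one common approach is to compute an azimuthally averaged FFT power spectrum of a frame: upsampling layers in many generators leave periodic artifacts that surface as anomalies in the high-frequency tail of this profile, which a downstream classifier can pick up.

```python
import numpy as np

def radial_power_spectrum(image: np.ndarray, n_bins: int = 64) -> np.ndarray:
    """Azimuthally averaged FFT power spectrum of a grayscale image.

    Illustrative sketch: generative upsampling artifacts often appear
    as deviations in the high-frequency bins of this 1-D profile.
    """
    # 2-D FFT, shifted so the zero frequency sits at the center.
    fft = np.fft.fftshift(np.fft.fft2(image))
    power = np.abs(fft) ** 2
    h, w = power.shape
    cy, cx = h // 2, w // 2
    # Distance of every pixel from the spectrum center.
    y, x = np.indices(power.shape)
    r = np.hypot(y - cy, x - cx)
    # Average the power within concentric radial bins.
    bins = np.linspace(0.0, r.max(), n_bins + 1)
    which = (np.digitize(r.ravel(), bins) - 1).clip(0, n_bins - 1)
    profile = np.bincount(which, weights=power.ravel(), minlength=n_bins)
    counts = np.bincount(which, minlength=n_bins)
    return np.log1p(profile / np.maximum(counts, 1))

# Usage with a stand-in frame (random noise in place of real pixels):
rng = np.random.default_rng(0)
features = radial_power_spectrum(rng.random((256, 256)))
```

In practice a profile like this would be one input among many; the spatial and temporal branches feed the same downstream classifiers.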
We have created a diverse in-house dataset consisting of videos, images, audio (including telephone-quality recordings), and text.
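To make the ensemble idea above concrete, here is a hedged sketch of score-level (late) fusion across per-modality detectors; the detector names and the weighted-average rule are assumptions for illustration, not our actual fusion strategy.

```python
import numpy as np

def fuse_scores(scores: dict[str, float],
                weights: dict[str, float] | None = None) -> float:
    """Late fusion: weighted average of per-detector scores in [0, 1],
    where higher means "more likely generated"."""
    if weights is None:
        weights = {name: 1.0 for name in scores}  # equal weights by default
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total

# Hypothetical outputs from three detectors on one piece of media.
example = {"spatial_cnn": 0.91, "freq_transformer": 0.74, "audio_model": 0.62}
print(fuse_scores(example))  # single ensemble verdict score
```

Weighted averaging is only one fusion choice; a learned fusion head over the concatenated scores or embeddings is a common alternative.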