publications | Soumik Mukhopadhyay

2024

Do text-free diffusion models learn discriminative visual representations?

Soumik Mukhopadhyay*, Matthew Gwilliam*, Yosuke Yamaguchi, and 5 more authors

In Proceedings of the European Conference on Computer Vision (ECCV), Sep 2024

While many unsupervised learning models focus on one family of tasks, either generative or discriminative, we explore the possibility of a unified representation learner: a model which addresses both families of tasks simultaneously. We identify diffusion models, a state-of-the-art method for generative tasks, as a prime candidate. Such models involve training a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high-fidelity, diverse, novel images. We find that the intermediate feature maps of the U-Net are diverse, discriminative feature representations. We propose a novel attention mechanism for pooling feature maps and further leverage this mechanism as DifFormer, a transformer feature fusion of features from different diffusion U-Net blocks and noise steps. We also develop DifFeed, a novel feedback mechanism tailored to diffusion. We find that diffusion models are better than GANs, and, with our fusion and feedback mechanisms, can compete with state-of-the-art unsupervised image representation learning methods for discriminative tasks: image classification with full and semi-supervision, transfer for fine-grained classification, object detection and segmentation, and semantic segmentation.

Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization

Soumik Mukhopadhyay, Saksham Suri, Ravi Teja Gadde, and 1 more author

In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Jan 2024

The task of lip synchronization (lip-sync) seeks to match the lips of human faces with different audio. It has various applications in the film industry as well as for creating virtual avatars and for video conferencing. This is a challenging problem as one needs to simultaneously introduce detailed, realistic lip movements while preserving the identity, pose, emotions, and image quality. Many previous methods for this problem suffer from image quality degradation due to a lack of complete contextual information. In this paper, we present Diff2Lip, an audio-conditioned diffusion-based model which is able to do lip synchronization in-the-wild while preserving these qualities. We train our model on VoxCeleb2, a video dataset containing in-the-wild talking face videos. Extensive studies show that our method outperforms popular methods like Wav2Lip and PC-AVS on the Fréchet inception distance (FID) metric and on users' Mean Opinion Scores (MOS). We show results on both reconstruction (same audio-video inputs) and cross (different audio-video inputs) settings on the VoxCeleb2 and LRW datasets. Video results and code can be accessed from our project page ( https://soumik-kanad.github.io/diff2lip ).

2023

Diffusion Models Beat GANs on Image Classification

Soumik Mukhopadhyay*, Matthew Gwilliam*, Vatsal Agarwal, and 5 more authors

arXiv preprint, Jan 2023

While many unsupervised learning models focus on one family of tasks, either generative or discriminative, we explore the possibility of a unified representation learner: a model which uses a single pre-training stage to address both families of tasks simultaneously. We identify diffusion models as a prime candidate. Diffusion models have risen to prominence as a state-of-the-art method for image generation, denoising, inpainting, super-resolution, manipulation, etc. Such models involve training a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high fidelity, diverse, novel images. The U-Net architecture, as a convolution-based architecture, generates a diverse set of feature representations in the form of intermediate feature maps. We present our findings that these embeddings are useful beyond the noise prediction task, as they contain discriminative information and can also be leveraged for classification. We explore optimal methods for extracting and using these embeddings for classification tasks, demonstrating promising results on the ImageNet classification task. We find that with careful feature selection and pooling, diffusion models outperform comparable generative-discriminative methods such as BigBiGAN for classification tasks. We investigate diffusion models in the transfer learning regime, examining their performance on several fine-grained visual classification datasets. We compare these embeddings to those generated by competing architectures and pre-trainings for classification tasks.

2021

Deep learning based needle tracking in prostate fusion biopsy

Soumik Mukhopadhyay, Praful Mathur, Aditya Bhardwaj, and 5 more authors

In Medical Imaging 2021: Image-Guided Procedures, Robotic Interventions, and Modeling, Jan 2021

Fusion of pre-operative Magnetic Resonance Imaging (MRI) and Trans-Rectal Ultrasound (TRUS) guided biopsy (Fusion Biopsy) has proven to be more effective than cognitive biopsy for the detection of prostate cancer. Detecting the biopsy needle used during the ultrasound procedure has multiple applications, such as reporting, repeat biopsy planning, and therapy planning. Earlier methods for this problem used only image processing techniques such as the Hough Transform or Graph-Cut. These techniques lack robustness because a purely image-based solution cannot handle the large variability in the data or the problem of the needle going out of plane. Recent deep learning (DL) based solutions for needle detection have high latency and do not exploit the temporal information present in TRUS imaging. In this paper, we propose a method to automatically detect the short-lived needle trigger events and the needle position using temporal context incorporated into a DL model termed the Samsung Multi-Decoder Network (S-MDNet). The proposed solution has been tested on 8 patients and yields high sensitivity (96%) and specificity (95%) for the detection of the needle trigger event.
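For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how sensitivity and specificity are conventionally computed from binary labels. The function name and the toy labels are hypothetical illustrations, not the paper's evaluation code or data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (truthy = event present).

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # events correctly detected
    fn = sum(1 for t, p in pairs if t and not p)      # events missed
    tn = sum(1 for t, p in pairs if not t and not p)  # non-events correctly rejected
    fp = sum(1 for t, p in pairs if not t and p)      # false alarms

    # Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy example (made-up labels): 4 frames with a trigger, 4 without.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
preds = [1, 1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(labels, preds)
```

In this reading, a sensitivity of 96% means 96% of true needle trigger events were detected, and a specificity of 95% means 95% of non-trigger cases were correctly rejected.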

2020

Rigid and deformable corrections in real-time using deep learning for prostate fusion biopsy

Aditya Bhardwaj, Jun-Sung Park, Soumik Mukhopadhyay, and 4 more authors

In Medical Imaging 2020: Image-Guided Procedures, Robotic Interventions, and Modeling, Jan 2020

Fusion biopsy reduces false negative rates in prostate cancer detection compared to systematic biopsy. However, accuracy in biopsy sampling depends upon the quality of alignment between pre-operative 3D MR and intra-operative 2D US. During live biopsy, the US-MR alignment may be disturbed by rigid motion of the prostate or the patient. Further, the prostate gland deforms due to probe pressure, which adds error to biopsy sampling. In this paper, we describe a method for real-time 2D-3D multimodal registration, utilizing deep learning, to correct for rigid and deformable errors. Our method does not require an intermediate 3D US and works in real time with an average runtime of 112 ms for both rigid and deformable corrections. On data from 12 patients, our method reduces mean trans-registration error (TRE) from 8.890±5.106 mm to 2.988±1.513 mm, comparable in accuracy to other state-of-the-art methods.
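As a rough illustration of how a "mean ± standard deviation" registration error of this kind can be summarized, here is a minimal sketch. It assumes the error is the Euclidean distance between corresponding landmark points in the two aligned images, which is a common convention; the abstract does not state the paper's exact definition, and the function names and coordinates below are hypothetical.

```python
import math
import statistics


def registration_errors(reference_points, registered_points):
    """Per-landmark Euclidean distances (same units as the input, e.g. mm).

    Assumes landmark i in one list corresponds to landmark i in the other.
    """
    if len(reference_points) != len(registered_points):
        raise ValueError("landmark lists must have the same length")
    return [math.dist(p, q) for p, q in zip(reference_points, registered_points)]


def summarize(errors):
    """Return (mean, population standard deviation) of the error list."""
    return statistics.mean(errors), statistics.pstdev(errors)


# Made-up landmark coordinates in mm, for illustration only.
reference = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
registered = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
mean_err, std_err = summarize(registration_errors(reference, registered))
```

Whether the reported spread is a population or sample standard deviation is not stated in the abstract; `statistics.stdev` would give the sample version.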