KDD2020

Multimodal Machine Learning for Video and Image Analysis

Shalini Ghosh

Abstract

In this talk, we will first discuss multimodal ML for video content analysis. Videos typically have data in multiple modalities, such as audio, video, and text (captions). Understanding and modeling the interaction between different modalities is key for video analysis tasks such as categorization, object detection, and activity recognition. However, data modalities are not always correlated, so it is crucial to learn when modalities are correlated and to use that to guide the influence of one modality on the other. Another salient feature of videos is the coherence between successive frames due to the continuity of video and audio, a property that we refer to as temporal coherence. We show how using non-linear guided cross-modal signals and temporal coherence can improve the performance of multimodal ML models for video analysis tasks such as categorization. We also created a hierarchical taxonomy of categories internally. Our experiments on the large-scale YouTube-8M dataset show that our approach significantly outperforms a state-of-the-art multimodal ML model for video categorization using our taxonomy, and that it generalizes well to an internal dataset of video segments from actual TV programs. The next part of the talk will briefly discuss our work on the explainability of multimodal ML models. We will conclude the talk by outlining other multimodal ML applications, such as incremental object detection and visual dialog, and by discussing potential applications of multimodal ML to various domains.
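
To make the two ideas in the abstract concrete, here is a minimal, illustrative sketch in plain Python. It is not the model presented in the talk: the function names, the gating formula (a sigmoid over the per-dimension agreement between modalities), and the squared-difference penalty are all assumptions chosen for brevity. It shows one way a non-linear gate could scale how much an audio feature influences a video feature, and one way temporal coherence between successive frames could be measured.

```python
import math


def sigmoid(x):
    """Standard logistic function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def guided_fusion(video, audio, gate_weights, gate_bias):
    """Fuse per-frame video and audio features with a non-linear gate.

    video, audio: lists of per-frame feature vectors with matching shapes.
    gate_weights, gate_bias: hypothetical gate parameters (assumed, not
    taken from the talk); in practice these would be learned.

    The gate grows when the two modalities agree in a dimension
    (their product is large and positive), so audio influences the fused
    feature more when it is correlated with video.
    """
    fused = []
    for v_frame, a_frame in zip(video, audio):
        frame = []
        for v, a, w in zip(v_frame, a_frame, gate_weights):
            gate = sigmoid(w * v * a + gate_bias)  # value in (0, 1)
            frame.append(v + gate * a)
        fused.append(frame)
    return fused


def temporal_coherence_penalty(frames):
    """Mean squared distance between successive frame vectors.

    A small value means neighbouring frames have similar features,
    which is one simple way to express temporal coherence.
    """
    if len(frames) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(frames, frames[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(frames) - 1)


if __name__ == "__main__":
    # Toy inputs: three frames with two feature dimensions each.
    video = [[1.0, 0.2], [0.9, 0.1], [1.1, 0.3]]
    audio = [[0.8, -0.5], [0.7, -0.4], [0.9, -0.6]]
    fused = guided_fusion(video, audio, gate_weights=[2.0, 2.0], gate_bias=0.0)
    print(fused)
    print(temporal_coherence_penalty(fused))
```

In a trained model the penalty would typically be added to the classification loss as a regularizer, with the gate parameters learned jointly; both choices are assumptions of this sketch rather than details stated in the abstract.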