CVPR 2024

Rethinking Multi-View Representation Learning via Distilled Disentangling

Guanzhou Ke, Bo Wang, Xiaoli Wang, Shengfeng He

Abstract

We used the Mutual Information Neural Estimator (MINE) [3] to independently estimate the mutual information between the view-consistent and view-specific representations learned by CONAN [11], DVIB [2], Multi-VAE [25], and our approach. To ensure a fair comparison, we fixed the representation dimension of all compared methods to 10. The MINE estimator is a fully connected network with Rectified Linear Unit (ReLU) activations and a 20-100-100-100-1 architecture. We trained it with Adam for 500 epochs, using a learning rate of $1 \times 10^{-4}$ and a batch size of 128. To mitigate randomness, we ran the MINE procedure 10 times and report the average result.

B. Related Work

Multi-view Representation Learning. The goal of MvRL is to extract both shared and view-specific information from multiple data sources and to integrate them into a cohesive representation that benefits predictive tasks [5, 13, 16]. Existing approaches generally fall into three categories: statistic-based, deep learning-based, and hybrid methods. Statistic-based methods, which employ techniques such as canonical correlation analysis [6, 15], non-negative matrix factorization [14, 23], and subspace methods [4, 22], excel at deriving interpretable models. However, they struggle with high-dimensional or large-scale datasets. In contrast, deep learning-based methods have gained prominence, especially in unsupervised settings, where generative models such as autoencoders [1, 21, 27] and generative adversarial networks [29] are used to learn latent representations. Although effective, these methods face the challenge