Learning Multi-modal Fusion for RGB-D Salient Object Detection
Abstract
Salient objects are the conspicuous objects within an image that stand out prominently from their surroundings. Saliency detection can be applied in many fields, such as object recognition, image compression, and video surveillance. However, detecting salient objects from RGB images alone is challenging in conditions such as low illumination or bad weather. Depth maps (D) are therefore commonly used as supplementary inputs to RGB images for Salient Object Detection (SOD), namely RGB-D SOD.
Although many RGB-D SOD methods have explored various multi-modal fusion strategies for boosting saliency performance, several problems remain unsolved. In this thesis, we focus on three of them. The first is low-quality depth: because depth maps are acquired by diverse sensors, such as infrared detectors and stereo cameras, their quality is inconsistent, and low-quality depth introduces noise that can seriously reduce detection accuracy. The second is the limited receptive field of conventional Convolutional Neural Networks (CNNs): CNN-based multi-modal fusion strategies fail to extensively model the correlation between the two modalities (appearance information from the RGB image and geometric information from the depth data). The third is ineffective feature fusion for multi-modal data: selecting helpful information from each modality is essential for effective fusion. In this thesis, we introduce three approaches to tackle these problems.
Firstly, we propose a triple attention framework based on a 3D CNN. Triple attention leverages multi-modal features at three levels: modalities, channels, and spatial positions. The modality attention learns quality factors from the overall modal features, the channel attention highlights features along the channel dimension, and the patch-level spatial attention establishes long-range dependencies across the entire image. Together, these attention mechanisms enlarge the limited receptive field of CNNs and yield a comprehensive understanding of the context. Existing SOD studies neglect quality assessment of depth maps. To address this gap and enable the evaluation of SOD methods across diverse depth quality levels, we propose a novel quality assessment criterion for depth maps and use it to re-categorize existing RGB-D datasets into three levels: high-quality, mid-quality, and low-quality.
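To make the three attention levels concrete, below is a minimal PyTorch sketch. It is an illustrative approximation, not the thesis's architecture: the 3D-CNN backbone is omitted, all layer sizes and the weighted-sum fusion are assumptions, and patch tokens are formed by simple average pooling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripleAttentionFusion(nn.Module):
    """Illustrative fusion of RGB and depth features at three levels:
    modality, channel, and patch-level spatial attention."""

    def __init__(self, channels: int, patch: int = 4, heads: int = 4):
        super().__init__()
        # Modality attention: learn a quality weight per modality from
        # globally pooled features of both streams.
        self.modality_fc = nn.Linear(2 * channels, 2)
        # Channel attention: squeeze-and-excitation style gating.
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, channels), nn.Sigmoid())
        # Patch-level spatial attention over non-overlapping patches,
        # modeling long-range dependencies across the whole image.
        self.patch = patch
        self.spatial_attn = nn.MultiheadAttention(
            channels, heads, batch_first=True)  # channels % heads == 0

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        b, c, h, w = rgb.shape  # h, w assumed divisible by self.patch
        # Modality attention: softmax weights reflect per-modality quality.
        pooled = torch.cat([rgb.mean(dim=(2, 3)), depth.mean(dim=(2, 3))], 1)
        w_mod = torch.softmax(self.modality_fc(pooled), dim=1)       # (B, 2)
        fused = (w_mod[:, 0, None, None, None] * rgb
                 + w_mod[:, 1, None, None, None] * depth)
        # Channel attention: reweight the fused channels.
        w_ch = self.channel_fc(fused.mean(dim=(2, 3)))               # (B, C)
        fused = fused * w_ch[:, :, None, None]
        # Patch-level spatial attention: average-pool each patch into a
        # token, attend across all tokens, broadcast the context back.
        p = self.patch
        tokens = fused.unfold(2, p, p).unfold(3, p, p).mean(dim=(-2, -1))
        tokens = tokens.flatten(2).transpose(1, 2)                   # (B, N, C)
        ctx, _ = self.spatial_attn(tokens, tokens, tokens)
        ctx = ctx.transpose(1, 2).reshape(b, c, h // p, w // p)
        ctx = F.interpolate(ctx, size=(h, w), mode='nearest')
        return fused + ctx
```

For instance, `TripleAttentionFusion(64)` maps two `(2, 64, 32, 32)` feature tensors to one fused `(2, 64, 32, 32)` tensor, with per-modality quality weights applied before the channel and spatial stages.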
Secondly, motivated by the great success of transformer networks in visual tasks, we propose a transformer-based multi-modal fusion module for saliency detection. Traditional CNNs process images hierarchically, whereas a Transformer treats an image as a sequence of tokens and captures long-range dependencies and relations across the entire image. Self-attention can enhance the distinctive features of each modality, while cross-attention can effectively model the correlation between the RGB image and the depth map. By combining self-attention with cross-attention in our proposed parallel structure, our model can effectively exploit the contribution of each modality to RGB-D saliency detection.
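The sketch below illustrates the parallel self-/cross-attention idea in PyTorch. Everything here, including the token dimension, the residual combination, and the linear merge, is an assumed simplification rather than the module proposed in the thesis.

```python
import torch
import torch.nn as nn

class ParallelFusionBlock(nn.Module):
    """Each modality runs self-attention on its own tokens and
    cross-attention against the other modality, in parallel."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_dep = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_dep = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_dep = nn.LayerNorm(dim)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, rgb: torch.Tensor, dep: torch.Tensor) -> torch.Tensor:
        # rgb, dep: (B, N, dim) token sequences from the two backbones.
        # Self-attention enhances each modality's own distinctive features.
        s_r, _ = self.self_rgb(rgb, rgb, rgb)
        s_d, _ = self.self_dep(dep, dep, dep)
        # Cross-attention models the RGB-depth correlation: queries come
        # from one stream, keys and values from the other.
        c_r, _ = self.cross_rgb(rgb, dep, dep)
        c_d, _ = self.cross_dep(dep, rgb, rgb)
        # The parallel branches are combined per modality, then merged.
        rgb_out = self.norm_rgb(rgb + s_r + c_r)
        dep_out = self.norm_dep(dep + s_d + c_d)
        return self.merge(torch.cat([rgb_out, dep_out], dim=-1))
```

Running the self- and cross-attention branches in parallel, rather than in sequence, lets the block weigh each modality's own evidence and the cross-modal correlation independently before merging them.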
Thirdly, we focus on a variant of the task, stereo image pair-based saliency detection, in which depth is implicitly encoded in the stereo image pair. Compared with RGB-D models, this setting suffers from neither the modality gap nor the low-quality depth issue. Instead of using a depth map explicitly, we propose to model the correlation between the two views of the stereo pair so that the network captures geometric information implicitly. We introduce channel-weighted attention and depth-aware feature grouping for stereo feature fusion.
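As a rough illustration of implicit geometry modeling (assumed, not the thesis's exact design), the sketch below correlates left and right features over a few horizontal shifts, forming a lightweight cost volume, and applies channel-weighted attention for fusion; depth-aware feature grouping is not shown, and the disparity range and layer shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StereoCorrelationFusion(nn.Module):
    """Illustrative stereo fusion: a small horizontal correlation volume
    encodes disparity (hence depth) implicitly, and channel-weighted
    attention selects helpful channels before fusion."""

    def __init__(self, channels: int, max_disp: int = 8):
        super().__init__()
        self.max_disp = max_disp
        in_ch = 2 * channels + max_disp
        # Channel-weighted attention over the concatenated features.
        self.channel_gate = nn.Sequential(
            nn.Linear(in_ch, channels), nn.Sigmoid())
        self.proj = nn.Conv2d(in_ch, channels, kernel_size=1)

    def forward(self, left: torch.Tensor, right: torch.Tensor) -> torch.Tensor:
        b, c, h, w = left.shape
        # Correlation volume: similarity between left features and right
        # features shifted by d pixels, for d = 0 .. max_disp - 1.
        corr = []
        for d in range(self.max_disp):
            shifted = F.pad(right, (d, 0))[..., :w]  # shift right view by d
            corr.append((left * shifted).mean(dim=1, keepdim=True))
        corr = torch.cat(corr, dim=1)                # (B, max_disp, H, W)
        feats = torch.cat([left, right, corr], dim=1)
        # Channel-weighted attention gates the fused representation.
        gate = self.channel_gate(feats.mean(dim=(2, 3)))  # (B, channels)
        return self.proj(feats) * gate[:, :, None, None]
```

The key point the sketch captures is that no explicit depth map ever enters the network: the horizontal correlation between the two views stands in for the geometric cue that RGB-D models read from depth.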
In summary, we propose three SOD approaches. Comparisons with state-of-the-art methods demonstrate the effectiveness of each proposed approach, and extensive ablation studies validate the contribution of each component of our work.