Cross-Aligned Fusion for Multimodal Understanding

Feb 1, 2025

Abhishek Rajora · Shubham Gupta · Suman Kundu
Abstract
Recent multimodal frameworks often grapple with semantic misalignment and noise that impede effective integration of diverse modalities. To address this problem, this study presents CaMN (Cross-aligned Multimodal Network), a framework designed to enhance multimodal understanding through a robust cross-alignment mechanism. Unlike conventional fusion methods, our framework aligns features extracted from images, text, and graphs via a tailored loss function, enabling seamless integration and exploitation of complementary information. Leveraging Abstract Meaning Representation (AMR), we extract intricate semantic structures from textual data, enriching the multimodal representation with contextual depth. Furthermore, to enhance robustness, we employ a masked autoencoder to simulate a noise-independent feature space. Through comprehensive evaluation on the CrisisMMD dataset, CaMN demonstrates superior performance on crisis event classification tasks, highlighting its potential for advancing multimodal understanding across diverse domains.
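The abstract describes the cross-alignment mechanism only at a high level. As a rough illustration of what aligning image, text, and graph features via a tailored loss can look like, the sketch below implements a generic pairwise contrastive alignment over three modality embeddings in PyTorch. The function names, the temperature value, and the symmetric InfoNCE-style objective are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch: pairwise cross-alignment over three modality
# embeddings (image, text, graph). All names and the InfoNCE-style
# objective are illustrative assumptions, not CaMN's exact loss.
import torch
import torch.nn.functional as F


def cross_alignment_loss(img: torch.Tensor,
                         txt: torch.Tensor,
                         gra: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Align three (batch, dim) modality embeddings pairwise."""
    def pair_loss(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        a = F.normalize(a, dim=-1)
        b = F.normalize(b, dim=-1)
        logits = a @ b.t() / temperature  # (batch, batch) similarity matrix
        targets = torch.arange(a.size(0), device=a.device)
        # Symmetric cross-entropy: each sample should match its own pair.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Sum the alignment objective over all modality pairs.
    return (pair_loss(img, txt) +
            pair_loss(img, gra) +
            pair_loss(txt, gra))


if __name__ == "__main__":
    # Toy usage: random features standing in for encoder outputs.
    batch, dim = 8, 256
    img = torch.randn(batch, dim)  # e.g. vision-encoder features
    txt = torch.randn(batch, dim)  # e.g. text-encoder features
    gra = torch.randn(batch, dim)  # e.g. AMR-graph-encoder features
    print(cross_alignment_loss(img, txt, gra).item())
```

Under this reading, minimizing the loss pulls each sample's image, text, and graph embeddings toward a shared representation while pushing apart mismatched pairs, which is one common way to realize the cross-modal alignment the abstract refers to.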
Type
Publication
Proceedings of the Winter Conference on Applications of Computer Vision (WACV 2025)
Authors
Suman Kundu
Assistant Professor
My research interests lie in the intersection of Graph Algorithms and AI, including graph representation learning, social network analysis, network data science, streaming algorithms, information retrieval, big data, and data visualization.