Abstract. Video summarization condenses video content into concise and informative summaries. To enhance the quality of video summaries, emerging multimodal approaches integrate visual and textual information, thereby outperforming conventional methods that rely solely on visual cues. However, existing methods fail to simultaneously model the temporal dynamics within each modality and ensure semantic alignment across modalities. In this paper, we introduce an unsupervised method for learning unified representations of video and text to address these challenges. Specifically, we propose a novel dual attentive network framework, which enhances the video representation through conditional text-derived information at a local scale and models long-term cross-modal dependencies at a global scale, thereby leveraging temporal information across different scales of data. To further refine this model, we incorporate a hard-negatives loss function within a contrastive learning framework, which learns to identify irrelevant visual-textual representation pairs that closely resemble the relevant ones. Additionally, we propose a Dynamic Time Warping-based temporal alignment loss to maintain coherent sequential constraints over time within the same modality, addressing intra-modal dynamics. To evaluate our approach, we conduct extensive experiments on standard video summarization datasets. The experimental results not only highlight the superiority of our approach but also emphasize its potential for practical applications in various domains.
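The abstract mentions two training objectives: a hard-negatives contrastive loss between video and text representations, and a DTW-based temporal alignment term. The sketch below illustrates one plausible form of each in PyTorch. It is not the authors' released implementation; the function names, margin, temperature, and gamma values are illustrative assumptions, and the soft-DTW routine is a generic building block for the kind of sequential alignment the abstract describes.

# Hypothetical sketch (not the authors' code) of the two losses named in the
# abstract: a hard-negative-aware contrastive loss between paired video and
# text segment embeddings, and a soft-DTW distance usable as a temporal
# alignment term. Embeddings are assumed to be L2-normalized.
import torch
import torch.nn.functional as F


def hard_negative_contrastive_loss(video_emb: torch.Tensor,
                                   text_emb: torch.Tensor,
                                   temperature: float = 0.07,
                                   hard_weight: float = 1.0) -> torch.Tensor:
    """video_emb, text_emb: (B, D) tensors where row i of each is a matched pair."""
    # Cosine similarities between every video/text pair in the batch.
    sim = video_emb @ text_emb.t() / temperature            # (B, B)
    batch_size = sim.size(0)
    labels = torch.arange(batch_size, device=sim.device)

    # Standard symmetric InfoNCE terms (video-to-text and text-to-video).
    loss_v2t = F.cross_entropy(sim, labels)
    loss_t2v = F.cross_entropy(sim.t(), labels)

    # Hard negatives: for each video, the most similar non-matching text
    # (and vice versa). A margin penalty pushes apart negatives that
    # closely resemble the positive pair.
    mask = torch.eye(batch_size, dtype=torch.bool, device=sim.device)
    neg_sim = sim.masked_fill(mask, float('-inf'))
    hardest_text = neg_sim.max(dim=1).values                # per video
    hardest_video = neg_sim.max(dim=0).values                # per text
    pos_sim = sim.diagonal()
    margin = 0.2 / temperature                               # illustrative value
    hard_loss = (F.relu(hardest_text - pos_sim + margin).mean()
                 + F.relu(hardest_video - pos_sim + margin).mean())

    return loss_v2t + loss_t2v + hard_weight * hard_loss


def soft_dtw_alignment_loss(seq_a: torch.Tensor,
                            seq_b: torch.Tensor,
                            gamma: float = 0.1) -> torch.Tensor:
    """Soft-DTW distance between two feature sequences of shape (T1, D) and (T2, D)."""
    cost = torch.cdist(seq_a, seq_b) ** 2                    # (T1, T2) pairwise costs
    t1, t2 = cost.shape
    R = torch.full((t1 + 1, t2 + 1), float('inf'), device=cost.device)
    R[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            # Soft-min over the three DTW moves (match, insertion, deletion).
            prev = torch.stack([R[i - 1, j - 1], R[i - 1, j], R[i, j - 1]])
            softmin = -gamma * torch.logsumexp(-prev / gamma, dim=0)
            R[i, j] = cost[i - 1, j - 1] + softmin
    return R[t1, t2]

In training, the two terms would typically be combined as a weighted sum, e.g. total_loss = hard_negative_contrastive_loss(v, t) + lambda_dtw * soft_dtw_alignment_loss(v_seq, v_seq_shifted); the weighting and the exact sequences fed to the DTW term are assumptions here, not details given in the abstract.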
@inproceedings{ying2024enhancing,
  title={Enhancing Multimodal Video Summarization via Temporal and Semantic Alignment},
  author={Ying, Fangli and Luo, Ziyue and Phaphuangwittayakul, Aniwat},
  booktitle={International Conference on Neural Information Processing},
  pages={17--31},
  year={2024},
  organization={Springer}
}