Wednesday 28 May 2025
12:00 PM – 2:00 PM
Location: WC2R 2LS
To register for this event, please follow this link.
Andrew Gilbert (University of Surrey), DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description
Abstract
Audio Description is a narrated commentary designed to aid vision-impaired audiences in perceiving key visual elements in a video. While short-form video understanding has advanced rapidly, a solution for maintaining coherent long-term visual storytelling remains unresolved. I’ll introduce existing methods that rely solely on frame-level embeddings, effectively describing object-based content but lacking contextual information across scenes. Then, I’ll introduce DANTE-AD, an enhanced video description model leveraging a dual-vision Transformer-based architecture to address this gap. DANTE-AD sequentially fuses both frame and scene-level embeddings to improve long-term contextual understanding. I’ll explore the evaluation of the approach on a broad range of key scenes from well-known movie clips, across traditional NLP metrics and LLM-based evaluations. Finally, I’ll look at where the field of AI-assisted audio description is heading.
About the speaker
Andrew Gilbert is an Associate Professor in Machine Learning at the University of Surrey, where he co-leads the Centre for Creative Arts and Technologies (C-CATS). His research lies at the intersection of computer vision, generative modelling, and multimodal learning, with a particular focus on building interpretable and human-centred AI systems. His work aims to develop machines that see and recognise the world and understand and creatively respond to it.
Dr. Gilbert has made significant contributions to the fields of video understanding, long-form video captioning, visual image style modelling, and AI-driven story understanding. A distinctive feature of his research is its integration into the creative industries, applying technical advances to domains such as media production, performance capture, and digital arts. This includes training models to classify genres from movie trailers and designing systems that can generate synthetic images and narrative content.