Speaker: Ms. Akshita Gupta
From: TU Darmstadt
Abstract
In this talk, I will discuss advances in Temporal Action Localization (TAL), with a focus on two key contributions: Efficient Large Model Adaptation and Open-Vocabulary Recognition in Videos.
The first part of the talk introduces the Long-Short-range Adapter (LoSA), a memory-efficient backbone adapter designed for untrimmed videos. LoSA adapts intermediate backbone layers over short and long temporal ranges to enhance video features, enabling end-to-end adaptation of billion-parameter models such as VideoMAEv2. This keeps training memory low enough to make state-of-the-art video models practical on long, untrimmed videos.
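To give a flavor of how such an adapter can stay memory-efficient, here is a minimal PyTorch sketch of lightweight short- and long-range temporal adapters attached to a frozen backbone's intermediate features. The class names, kernel sizes, and fusion scheme are illustrative assumptions for this writeup, not the actual LoSA implementation.

```python
# Illustrative sketch only: names, kernel sizes, and the fusion scheme
# are assumptions, not the published LoSA code.
import torch
import torch.nn as nn

class RangeAdapter(nn.Module):
    """A lightweight temporal adapter over one intermediate feature map.

    A depthwise temporal convolution whose kernel size sets the temporal
    range it mixes over (small kernel = short range, large = long range).
    """
    def __init__(self, dim, kernel_size):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, time, dim) features from one frozen backbone layer
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.proj(torch.relu(h))

class LongShortAdapterHead(nn.Module):
    """Fuses short- and long-range adapter outputs across backbone layers."""
    def __init__(self, num_layers, dim):
        super().__init__()
        self.short = nn.ModuleList([RangeAdapter(dim, 3) for _ in range(num_layers)])
        self.long = nn.ModuleList([RangeAdapter(dim, 31) for _ in range(num_layers)])
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, inter_feats):
        # inter_feats: list of (batch, time, dim) tensors, one per layer,
        # computed under torch.no_grad() from the frozen backbone.
        w = self.layer_weights.softmax(dim=0)
        fused = 0
        for i, f in enumerate(inter_feats):
            fused = fused + w[i] * (self.short[i](f) + self.long[i](f))
        return fused  # enhanced features for the downstream TAL head
```

Because the backbone runs under torch.no_grad() and gradients touch only the small adapters, the training memory footprint stays far below full fine-tuning, which is what makes end-to-end adaptation of billion-parameter backbones feasible.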
The second part of the talk explores OVFormer, a framework for open-vocabulary TAL. OVFormer uses a language model to generate rich class descriptions and aligns them with video features via cross-attention. A two-stage training strategy enables generalization to novel categories, extending recognition beyond a predefined label set.
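As a rough illustration of the alignment step, the sketch below cross-attends embeddings of language-model-generated class descriptions to video features; the module names, dimensions, and similarity head are assumptions, not OVFormer's actual interface.

```python
# Minimal sketch, assuming generic description/video encoders upstream;
# this is not OVFormer's real code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DescriptionVideoAligner(nn.Module):
    """Cross-attention alignment of class descriptions with video features.

    Embeddings of LLM-generated class descriptions act as queries; video
    snippet features act as keys/values, so each class query gathers
    evidence from the temporal locations that support it.
    """
    def __init__(self, text_dim, video_dim, dim=256, heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)
        self.video_proj = nn.Linear(video_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, desc_emb, video_feats):
        # desc_emb:    (batch, num_classes, text_dim)
        # video_feats: (batch, time, video_dim)
        q = self.text_proj(desc_emb)
        kv = self.video_proj(video_feats)
        attended, _ = self.cross_attn(q, kv, kv)
        # Per-snippet class scores: cosine similarity between the
        # video-grounded class queries and the snippet features.
        logits = torch.einsum('bcd,btd->bct',
                              F.normalize(attended, dim=-1),
                              F.normalize(kv, dim=-1))
        return logits  # (batch, num_classes, time)
```

Since classes enter only through their description embeddings, unseen categories can be scored at inference simply by supplying new descriptions; the talk covers how the two-stage training makes that generalization reliable.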
Additionally, I will briefly discuss my internship work at Apple, where I worked on generating speech from videos of people paired with their transcripts.