Advancing Temporal Action Localization: Efficient Large Model Adaptation and Open-Vocabulary Recognition in Videos

Friday, February 7, 2025 2 p.m. to 3 p.m.

Speaker: Ms. Akshita Gupta

From: TU Darmstadt

Abstract

In this talk, I will discuss recent advances in Temporal Action Localization (TAL), focusing on two key innovations: Efficient Large Model Adaptation and Open-Vocabulary Recognition in Videos.

The first part of the talk introduces the Long-Short-range Adapter (LoSA), a memory-efficient backbone adapter designed for untrimmed videos. LoSA adjusts intermediate-layer features over both short- and long-range temporal windows to enhance video representations, enabling end-to-end adaptation of billion-parameter backbones such as VideoMAEv2. This makes it practical to exploit state-of-the-art video models despite the length and complexity of untrimmed video data.
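To make the adapter idea concrete, here is a minimal, purely illustrative sketch (not the paper's code): per-frame features from a frozen backbone are updated residually with a short-range term (a local window average) and a long-range term (a global average over the untrimmed video). The function name, window size, and fixed scales are assumptions for illustration; in LoSA these adjustments would be learned.

```python
def adapt_features(frames, window=2, short_scale=0.5, long_scale=0.5):
    """Hypothetical long/short-range residual adapter.

    frames: list of per-frame feature vectors (lists of floats).
    Returns features of the same shape, each frame augmented with
    local (short-range) and global (long-range) temporal context.
    """
    n, d = len(frames), len(frames[0])
    # Long-range context: mean feature over the entire untrimmed video.
    global_ctx = [sum(f[j] for f in frames) / n for j in range(d)]
    out = []
    for t, f in enumerate(frames):
        # Short-range context: mean over a small temporal window around t.
        lo, hi = max(0, t - window), min(n, t + window + 1)
        local = [sum(frames[i][j] for i in range(lo, hi)) / (hi - lo)
                 for j in range(d)]
        # Residual update: original feature plus scaled context terms,
        # so the frozen backbone's features pass through unchanged.
        out.append([f[j] + short_scale * local[j] + long_scale * global_ctx[j]
                    for j in range(d)])
    return out
```

The residual form is what keeps adaptation memory-efficient in spirit: the backbone stays frozen, and only the lightweight context terms would be trained.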

The second part of the talk explores OVFormer, a framework for Open-Vocabulary TAL. OVFormer leverages a language model to generate rich class descriptions and aligns these descriptions with video features using cross-attention. A two-stage training strategy enables generalization to novel categories, extending recognition beyond a fixed set of predefined actions.
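The alignment step can be sketched as plain scaled dot-product cross-attention, with class-description embeddings as queries and per-frame video features as keys and values. This is an assumption-laden illustration of the mechanism, not OVFormer's actual API; all names and shapes are hypothetical.

```python
import math

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (single head, no projections).

    queries: class-description embeddings from a language model.
    keys/values: per-frame video features.
    Returns one attended video feature per class description.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this class description to every frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # Numerically stable softmax over frames.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of frame features for this class.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

Because the class side is free text from a language model, novel action names can be scored at test time without retraining the video branch, which is the open-vocabulary appeal of this design.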

Additionally, I will briefly discuss my internship work at Apple, where I worked on generating speech from videos of people and their transcripts.



Locations:

Research 1: 101

Calendar:

CS/CRCV Seminars

Category:

Speaker/Lecture/Seminar

Tags:

UCFCRCV