Advancing Temporal Action Localization: Efficient Large Model Adaptation and Open-Vocabulary Recognition in Videos

Friday, February 7, 2025 2 p.m. to 3 p.m.

Speaker: Ms. Akshita Gupta

From: TU Darmstadt

Abstract

In this talk, I will discuss recent advances in Temporal Action Localization (TAL), focusing on two key innovations: Efficient Large Model Adaptation and Open-Vocabulary Recognition in Videos.

The first part of the talk introduces the Long-Short-range Adapter (LoSA), a memory-efficient backbone adapter designed for untrimmed videos. LoSA adjusts intermediate-layer features over both short- and long-range temporal windows to enhance video representations, enabling end-to-end adaptation of billion-parameter backbones such as VideoMAEv2. This makes it practical to exploit state-of-the-art video models despite the length and complexity of untrimmed video data.
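To make the adapter idea concrete, here is a minimal, purely illustrative sketch (not the paper's code): per-frame features from a frozen backbone are updated residually with a short-range term (a local window average) and a long-range term (a global average over the untrimmed video). The function name, window size, and fixed scales are assumptions for illustration; in LoSA these adjustments would be learned.

```python
def adapt_features(frames, window=2, short_scale=0.5, long_scale=0.5):
    """Hypothetical long/short-range residual adapter.

    frames: list of per-frame feature vectors (lists of floats).
    Returns features of the same shape, each frame augmented with
    local (short-range) and global (long-range) temporal context.
    """
    n, d = len(frames), len(frames[0])
    # Long-range context: mean feature over the entire untrimmed video.
    global_ctx = [sum(f[j] for f in frames) / n for j in range(d)]
    out = []
    for t, f in enumerate(frames):
        # Short-range context: mean over a small temporal window around t.
        lo, hi = max(0, t - window), min(n, t + window + 1)
        local = [sum(frames[i][j] for i in range(lo, hi)) / (hi - lo)
                 for j in range(d)]
        # Residual update: original feature plus scaled context terms,
        # so the frozen backbone's features pass through unchanged.
        out.append([f[j] + short_scale * local[j] + long_scale * global_ctx[j]
                    for j in range(d)])
    return out
```

The residual form is what keeps adaptation memory-efficient in spirit: the backbone stays frozen, and only the lightweight context terms would be trained.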

The second part of the talk explores OVFormer, a framework for Open-Vocabulary TAL. OVFormer leverages a language model to generate rich class descriptions and aligns these descriptions with video features using cross-attention. A two-stage training strategy enables generalization to novel categories, extending recognition beyond a fixed set of predefined actions.
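The alignment step can be sketched as plain scaled dot-product cross-attention, with class-description embeddings as queries and per-frame video features as keys and values. This is an assumption-laden illustration of the mechanism, not OVFormer's actual API; all names and shapes are hypothetical.

```python
import math

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (single head, no projections).

    queries: class-description embeddings from a language model.
    keys/values: per-frame video features.
    Returns one attended video feature per class description.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this class description to every frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # Numerically stable softmax over frames.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of frame features for this class.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

Because the class side is free text from a language model, novel action names can be scored at test time without retraining the video branch, which is the open-vocabulary appeal of this design.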

Additionally, I will briefly discuss my internship work at Apple, where I worked on generating speech from videos of people and their transcripts.



Locations:

Research 1: 101

Calendar:

CS/CRCV Seminars

Category:

Speaker/Lecture/Seminar

Tags:

UCFCRCV