General Chair

Yun (Raymond) Fu
Northeastern University, USA

Workshop Chairs

Joseph P. Robinson
Meta, USA
Ming Shao
UMass Lowell, USA
Sheng Li
University of Virginia, USA
Zhiqiang Tao
Rochester Institute of Technology, USA
Yu Yin
Case Western Reserve University, USA

Webmaster

Liang Shi
Northeastern University, USA

Program Schedule

October 20th

8:30 AM Opening Remarks
Session I – Motion and Action Understanding
8:45 AM Generating Tennis Action Instruction Based on a Large Language Model 👤 Siyu Xia (Southeast Univ.)
9:00 AM A Generalized Two-stage Approach to Motion Style Transfer 👤 Siyu Xia (Southeast Univ.)
9:15 AM Attend and Replay: Efficient Action Understanding in Long Videos via Mechanistic Interpretability 👤 Di Huang (Beihang Univ.)
9:30 AM ☕ Coffee Break
10:00 AM Invited Talk I: Dr. Umar Iqbal, Senior Research Scientist at NVIDIA Research
Session II – Diffusion Models and Generation
11:00 AM FreqCross: A Multi-Modal Frequency-Spatial Fusion Network for Robust Detection of Stable Diffusion 3.5 Generated Images 👤 Guang Yang (UC Berkeley)
11:15 AM Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images 👤 Emanuel Garbin (Meta)
11:30 AM High-Fidelity Character Animation: Generating Coherent and Controllable Motion Videos from Static Images 👤 Yongming Huang (Southeast Univ.)
12:00 PM Lunch Break
12:00 – 1:30 PM Poster Session
Session III – Multimodal Intelligence in Medicine
1:45 PM Foundational Multi-Task Multimodal Model for Upper GI Endoscopy 👤 Yu Cao (UMass Lowell)
Session IV – Diffusion Models and Generation (Continued)
2:00 PM Fitting Image Diffusion Models on Video Datasets 👤 Simon Woo (Sungkyunkwan Univ.)
2:30 PM ☕ Coffee Break
3:00 PM Invited Talk II: Dr. Tim K. Marks, Senior Principal Research Scientist at MERL
3:45 PM Invited Talk III: Dr. Xiaoming Liu, MSU Foundation Professor and Anil and Nandita Jain Endowed Professor at MSU
4:30 PM Open Discussion, Best Paper Voting, Closing Remarks

Keynote Speeches

Dr. Umar Iqbal
Senior Research Scientist at NVIDIA Research
From Videos to Motion: Understanding Humans in the Wild

As Physical AI and humanoid robotics advance, the ability to learn human behaviors directly from the vast collection of human videos available online is becoming increasingly crucial. In this talk, I will share our recent progress on understanding and modeling human motion and geometry from in-the-wild videos. I will highlight two key directions: first, developing generalist motion models that unify motion estimation and generation within a single framework, enabling controllable and multimodal human motion synthesis; and second, designing accurate and temporally consistent models for reconstructing detailed 3D human geometry from monocular videos.
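
The "generalist motion model" idea can be made concrete with a small sketch: a single sequence model over per-frame pose vectors, where motion estimation corresponds to conditioning on observed frames and generation corresponds to masking everything and infilling. This is a minimal illustration of the unification concept only; the architecture, dimensions, and masking scheme below are assumptions, not the speaker's actual models.

```python
import torch
import torch.nn as nn

class UnifiedMotionModel(nn.Module):
    """Toy unified estimation/generation model: one transformer over a
    pose sequence, with a learned mask token standing in for frames
    that are not observed.  Illustrative only; all sizes are assumed."""
    def __init__(self, pose_dim=63, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        self.in_proj = nn.Linear(pose_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, pose_dim)

    def forward(self, poses, observed):
        # poses: (B, T, pose_dim); observed: (B, T) boolean mask.
        x = self.in_proj(poses)
        # Replace unobserved frames with the mask token, then infill.
        x = torch.where(observed[..., None], x, self.mask_token.expand_as(x))
        return self.out_proj(self.encoder(x))

# Estimation: most frames observed, the model infills the gaps.
# Generation: observed = all False, the same network synthesizes motion.
```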

Dr. Tim K. Marks
Senior Principal Research Scientist at MERL
Your Face is a Window to your Heart: Modular Methods for Imaging Photoplethysmography

Camera-based monitoring of vital signs, also known as imaging photoplethysmography (iPPG), involves sensing the underlying cardiac pulse from video of the skin and estimating vital signs such as the heart rate or a full pulse waveform. iPPG enables health monitoring for situations in which contact-based devices are either not available, too intrusive, or too expensive, with wide-ranging applications from driver monitoring to healthcare. MERL has developed several approaches to iPPG that break down the problem into three modules: face and landmark detection, time series extraction, and pulse signal/pulse rate estimation. Unlike many deep-learning methods for iPPG that make use of a single black-box model that maps directly from input video to output signal or heart rate, our modular approach enables each of the three parts of the pipeline to be interpreted individually. For pulse signal denoising, MERL has developed three different methods, each more accurate than the last. The first approach, AutoSparsePPG, uses adaptive sparse spectrum estimation to isolate the quasi-periodic heartbeat signal. The second approach, TURNIP, uses a temporal U-Net deep network to enable more robust recovery of the pulse signal. Our most recent iPPG solution utilizes deep unrolling and deep equilibrium models to denoise the heart rate signal, achieving state-of-the-art performance on standard benchmark datasets.
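
As a rough illustration of the three-module decomposition described above, here is a minimal classical baseline: spatially average a face region per frame, bandpass-filter to plausible heart-rate frequencies, and read off the dominant spectral peak. This sketches the generic iPPG pipeline, not AutoSparsePPG, TURNIP, or the deep-unrolling method from the talk; the face ROIs are assumed to come from an upstream detector (module 1), and all function names are hypothetical.

```python
import numpy as np
from scipy.signal import butter, filtfilt, welch

def extract_time_series(face_rois):
    """Module 2: spatially average the green channel of each face ROI.
    `face_rois` is a list of HxWx3 RGB arrays produced by an upstream
    face/landmark detector (module 1, assumed given here)."""
    return np.array([roi[..., 1].astype(float).mean() for roi in face_rois])

def bandpass(signal, fs, lo=0.7, hi=4.0, order=3):
    """Keep only plausible pulse frequencies (roughly 42-240 bpm)."""
    b, a = butter(order, [lo, hi], btype="band", fs=fs)
    return filtfilt(b, a, signal - signal.mean())

def estimate_heart_rate(signal, fs):
    """Module 3: take the dominant spectral peak as the pulse rate."""
    freqs, power = welch(signal, fs=fs, nperseg=min(len(signal), 256))
    band = (freqs >= 0.7) & (freqs <= 4.0)
    return 60.0 * freqs[band][np.argmax(power[band])]  # beats per minute

# Usage with 30 fps face crops:
#   raw = extract_time_series(face_rois)
#   bpm = estimate_heart_rate(bandpass(raw, fs=30.0), fs=30.0)
```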

Dr. Xiaoming Liu
MSU Foundation Professor and Anil and Nandita Jain Endowed Professor at MSU
Human Recognition in the Era of Foundation Models

Human recognition aims to identify a person from imagery or video, one of the most fundamental tasks that computer vision researchers have pursued over the past decades. With the recent emergence of large foundation models, be they LLMs or VLMs, researchers are actively studying how to advance visual recognition in this new era. Common questions include: 1) how to leverage pre-trained foundation models for visual recognition; 2) how to continue innovating on vision transformers in light of downstream tasks; and 3) how to design and learn one's own foundation model for a specific task. This talk will shed some light on how we have been answering these questions in recent years.
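
To make question 1) concrete, the sketch below shows a common pattern for leveraging a frozen pre-trained foundation model for recognition: embed gallery and probe images with the pretrained encoder, match by cosine similarity, and reject low-similarity probes as unknown (open-set). The `embed` callable and the threshold value are illustrative assumptions, not a specific model from the talk.

```python
import numpy as np

def l2_normalize(x, eps=1e-9):
    return x / (np.linalg.norm(x) + eps)

class GalleryMatcher:
    """Open-set identification on top of frozen foundation-model features.
    `embed` is any pretrained encoder mapping an image to a 1-D feature
    vector (hypothetical here, e.g. a ViT backbone's pooled output)."""
    def __init__(self, embed, threshold=0.35):
        self.embed = embed
        self.threshold = threshold  # cosine-similarity acceptance cutoff
        self.ids, self.feats = [], []

    def enroll(self, identity, image):
        self.ids.append(identity)
        self.feats.append(l2_normalize(self.embed(image)))

    def identify(self, image):
        probe = l2_normalize(self.embed(image))
        sims = np.stack(self.feats) @ probe  # cosine similarity to gallery
        best = int(np.argmax(sims))
        # Below the threshold, report "unknown" (open-set rejection).
        return self.ids[best] if sims[best] >= self.threshold else None
```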

๐Ÿ† Best Paper Awards

🥇

Best Paper Award

"A Generalized Two-stage Approach to Motion Style Transfer"

Tong Guo, Zhiqian Xia, Feng Yu, Haifeng Xia, and Siyu Xia

Southeast University

🥈

Best Paper (Honorable Mention)

"Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images"

Emanuel Garbin, Guy Adam, Oded Krams, Zohar Barzelay, Eran Guendelman, Michael Schwarz, Matteo Presutto, Moran Vatelmacher, Yigal Shenkman, Eli Peker, Itai Druker, Uri Patish, Yoav Blum, Max Bluvstein, Junxuan Li, Rawal Khirodkar, Shunsuke Saito

Meta

Congratulations to all award recipients for their exceptional contributions to the field!
