In conjunction with ICCV 2025
October 19 or 20 (TBD)
Motivation
The domains of face, gesture, and cross-modal recognition have experienced tremendous progress fueled by deep learning and large-scale annotated datasets. From the early days of AlexNet to today's transformer-based architectures, performance across public benchmarks has improved dramatically. However, this success has come at a cost: decreased model explainability, limited generalization to unconstrained environments, and a growing dependence on opaque, pre-trained systems.
Despite saturated performance on traditional benchmarks, real-world applications often expose critical weaknesses: accuracy degrades under extreme pose variation, poor lighting, partial occlusion, or unpredictable subject behavior. Addressing these limitations requires robust training strategies, better data representations, and deeper, more interpretable models.
Meanwhile, multimodal learning has surged, integrating signals such as voice, face, and gesture to power applications in social media, HCI, surveillance, and affective computing. The next generation of systems must go beyond recognition: they must reason, adapt, and function reliably in complex, real-world conditions.
Topics of Interest
Core Vision Tasks
- 2D and 3D tracking of faces, hands, bodies, and actions across time
- Robust recognition across pose, occlusion, age, illumination, and resolution
- Segmentation and parsing of face and body parts for fine-grained analysis
Generative & Neural Rendering
- Neural rendering of expressions, dance, or gestures in AR/VR environments
- Controllable diffusion models and GANs for person-specific synthesis
- Cross-domain generative modeling (e.g., sketch-to-photo, video-to-avatar)
Learning Paradigms
- Few-shot, zero-shot, continual, and domain-adaptive learning techniques
- Vision-language foundation models for face/gesture understanding
- Large Vision Models (LVMs) and large language-vision models (LLVMs) as foundation or fine-tuned systems
- AutoML and architecture search for face/gesture pipelines
Soft Biometrics & Identity Understanding
- Emotion, personality, attention, fatigue, and sentiment analysis
- Social signal processing and behavioral trait inference
- Explainable and trustworthy models for identity-related inference
Multimodal and Cross-Modal Analysis
- Multimodal transformers for joint face-body-speech analysis
- Cross-modal generation: text-to-face, speech-to-gesture, etc.
- Alignment and synchronization across modalities (e.g., lip-sync)
Applications, Benchmarks, and Analysis
- Deployment studies and case reports in real-world scenarios
- Failure analysis, uncertainty estimation, and model auditing
- Interactive and interruptible AI systems for decision support
Nature-Inspired & Cognitive Systems
- Vision systems for ethology, animal behavior, and neuroscience
- Cognitive modeling of gaze, micro-expressions, and attentional cues
- Integrating affective computing with behavioral science
Ethics, Fairness, and Society
- Interpretability and transparency in face/gesture pipelines
- Regulatory frameworks and societal impacts of face technologies
- AI for accessibility, assistive tech, and inclusive HCI
Related Workshops
The first AMFG was held in conjunction with ICCV 2003 in Nice, France, and the workshop has since been held successfully ten times. Links to the most recent editions are below:
Past AMFG Workshops
Face and gesture (hand) modeling are long-standing problems in the computer vision community. While many related workshops have emerged, AMFG retains a distinct focus. Other workshops with complementary emphases include:
Complementary Workshops
- Face recognition with a security focus: ChaLearn2020@ECCV, ChaLearn2021@ICCV, MFR2021@ICCV
- Hand modeling for action understanding: HANDS2022@ECCV, HBHA2022@ECCV
- Face and gesture modeling in VR/AR: WCPA2022@ECCV, CV4ARVR2022@CVPR
- Recognizing Families In the Wild (RFIW): RFIW2020, RFIW2019, RFIW2018, RFIW2017
AMFG continues to provide theoretical and technical depth in face and gesture research. Its impact spans broader domains, including human-computer interaction, multimodal learning, egocentric vision, AI ethics, and robotics.
Important Dates
- Submission Deadline: 07/04/2025
- Notification: 07/10/2025
- Camera-Ready Due: 08/18/2025
Submissions are handled via the workshop's OpenReview page.
Follow the official ICCV 2025 guidelines:
ICCV Submission Guidelines
- 8 pages (excluding references)
- Anonymous submission
- Use the ICCV LaTeX templates (a minimal preamble sketch follows below)
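For orientation only, here is a minimal sketch of what an anonymized submission preamble typically looks like. It assumes the `iccv` style file and its `review` option follow the pattern of recent ICCV author kits; the actual class file, options, and macros must come from the official ICCV 2025 kit, not from this sketch.

```latex
% Minimal sketch of an anonymized ICCV-style submission (NOT the official kit).
% ASSUMPTION: the `iccv` style file and its `review` option follow the pattern
% of recent ICCV author kits; replace with the official ICCV 2025 template.
\documentclass[10pt,twocolumn,letterpaper]{article}
\usepackage[review]{iccv} % review mode: anonymized, line-numbered draft
\usepackage{times}
\usepackage{graphicx}
\usepackage{amsmath}

\title{Your AMFG 2025 Paper Title}
\author{Anonymous ICCV submission} % no identifying information during review

\begin{document}
\maketitle

\begin{abstract}
One-paragraph summary of the contribution.
\end{abstract}

\section{Introduction}
Main text is limited to 8 pages; references do not count toward the limit.

{\small
\bibliographystyle{ieee_fullname} % style shipped with recent ICCV/CVPR kits
\bibliography{refs}               % refs.bib holding your BibTeX entries
}
\end{document}
```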