About

I am a Member of Technical Staff at Physical Intelligence, working on π-family VLM and omni-model pre-training.

Previously, I was a Principal Applied Scientist at Amazon AGI, where I led the team that built the multimodal understanding capabilities of the Nova family. Earlier, I was a Staff Research Scientist at ByteDance and a Senior Applied Scientist at AWS AI, where I shipped multimodal and video models to production. I received my Ph.D. from Rutgers University (2018) and my B.Eng. from UESTC (2013).

My research centers on multimodal understanding across domains, with a deep focus on video understanding and a strong bias toward real-world impact.

Highlights
π0.7 · Physical Intelligence

A Steerable Model with Emergent Capabilities

Generalist robot foundation model. Read the blog →

TubeR · CVPR 2022 Oral

Tubelet Transformer for Video Action Detection

End-to-end spatiotemporal action detection. Paper →

News
  • 2026 · Blog · π0.7 released: steerable generalist robot foundation model with emergent compositional capabilities. Blog
  • 2026 · CVPR · STORM: unified MLLM for referring multi-object tracking; ships with STORM-Bench. Paper · Code
  • 2026 · WACV · Compact Video Representations for efficient long-form video understanding in LMMs. Paper
  • 2025 · Nova · Nova 2 and Nova Multimodal Embedding released at Amazon AGI. Nova 2 · MM-Embed
  • 2025 · ICCV · SemiVisBooster: text-guided semi-supervised learning for fine-grained classification. Paper
  • 2024 · NeurIPS · Video Token Merging: efficient token reduction for long-form video. Paper
  • 2024 · ECCV · Text-Guided Video MAE: masked video pretraining guided by language. Paper

See all work on the Publications page or Google Scholar.