About
I am a Member of Technical Staff at Physical Intelligence, working on π-family VLM and omni-model pre-training.
Previously, I was a Principal Applied Scientist at Amazon AGI, where I led the team that built the multimodal understanding capabilities of the Nova family. Earlier, I was a Staff Research Scientist at ByteDance and a Senior Applied Scientist at AWS AI, where I shipped multimodal and video models to production. I received my Ph.D. from Rutgers University (2018) and my B.Eng. from UESTC (2013).
My research centers on multimodal understanding across domains, with a deep focus on video understanding and a strong bias toward real-world impact.
A Steerable Model with Emergent Capabilities
Generalist robot foundation model. Read the blog →
Tubelet Transformer for Video Action Detection
End-to-end spatiotemporal action detection. Paper →
- 2026 · Blog · π0.7 released: steerable generalist robot foundation model with emergent compositional capabilities. Blog
- 2026 · CVPR · STORM: unified MLLM for referring multi-object tracking; ships with STORM-Bench. Paper · Code
- 2026 · WACV · Compact Video Representations for efficient long-form video understanding in LMMs. Paper
- 2025 · Nova · Nova 2 and Nova Multimodal Embedding released at Amazon AGI. Nova 2 · MM-Embed
- 2025 · ICCV · SemiVisBooster: text-guided semi-supervised learning for fine-grained classification. Paper
- 2024 · NeurIPS · Video Token Merging: efficient token reduction for long-form video. Paper
- 2024 · ECCV · Text-Guided Video MAE: masked video pretraining guided by language. Paper
See all work on the Publications page or Google Scholar.