About
I am a Member of Technical Staff (MTS) at Physical Intelligence, working on vision-language-action (VLA) models and omni-models.
Previously, I was a Principal Applied Scientist at Amazon AGI, where I led multimodal understanding for the Nova model family. Before that, I was a Staff Research Scientist at ByteDance and a Senior Applied Scientist at AWS AI, leading multimodal and video modeling efforts deployed in production.
Updates
Model Release
- Nova 2 Family: Multimodal reasoning and generation models. Technical Report
- Nova Multimodal Embedding: State-of-the-art multimodal embeddings for agentic RAG and semantic search across video, image, document, and audio. Technical Report
- Nova 1 Family: Amazon’s first-generation multimodal foundation models. Nova 1 and Nova 1 Premier
Publications
- ICCV25: Easy and accurate semi-supervised learning with text guidance. SemiVisBooster
- WACV25: Build VLM datasets using captions of multiple semantic granularities. Scalable Multi-grained Video-Language Learning (GEXIA)
- WACV25: Generate audio descriptions with in-video contextual awareness. Context-Aware Automatic Audio Description
- NeurIPS24: Efficient token reduction for long-form video understanding. Video Token Merging
- ECCV24: Masked video pretraining guided by language supervision. Text-Guided Video Masked Autoencoder
- ICCV23: Align and distill multimodal cues for cinematic video segmentation. MEGA
- CVPR23: Rethink multimodal contrastive learning from patches to discrete tokens. Revisiting Multimodal Representation
