About Me
I am a Principal Applied Scientist at Amazon AGI, leading the development of large-scale multimodal foundation models in the Nova family. My work spans multimodal encoders and embeddings as well as M-LLM training and evaluation, with a focus on video, cross-modal reasoning, and unified omni-model architectures.
Previously, I was a Staff Research Scientist at ByteDance and a Senior Applied Scientist at AWS AI, leading multimodal and video modeling efforts deployed in production.
I received my Ph.D. from Rutgers University in 2018 and my B.S. from the University of Electronic Science and Technology of China in 2013.
Updates
Model Releases
- Nova 2 Family: Multimodal reasoning and generation models. Technical Report
- Nova Multimodal Embedding: State-of-the-art multimodal embeddings for agentic RAG and semantic search across video, image, document, and audio. Technical Report
- Nova 1 Family: Amazon’s first-generation multimodal foundation models. Nova 1 and Nova 1 Premier
Publications
- WACV26: Learn compact video representations via reconstruction for efficient long-form video understanding. Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models
- ICCV25: Easy and accurate semi-supervised learning with text guidance. SemiVisBooster
- WACV25: Build VLM datasets using captions of multiple semantic granularities. Scalable Multi-grained Video–Language Learning (GEXIA)
- WACV25: Generate audio descriptions with in-video contextual awareness. Context-Aware Automatic Audio Description
- NeurIPS24: Efficient token reduction for long-form video understanding. Video Token Merging
- ECCV24: Masked video pretraining guided by language supervision. Text-Guided Video Masked Autoencoder
- ICCV23: Motion-aware masking for spatiotemporal representation learning. Motion-Guided Masking
- ICCV23: Align and distill multimodal cues for cinematic video segmentation. MEGA
- CVPR23: Rethink multimodal contrastive learning from patches to discrete tokens. Revisiting Multimodal Representation
- ICLR23: Unsupervised video learning via nearest-neighbor inter–intra contrast. NN Inter–Intra Contrastive Learning
- ICASSP23: Causal transformer architectures for audio classification. CAT
- WACV23: Image modeling directly in the frequency domain. Discrete Cosin Transformer
- CVPR22 (Oral): Transformer-based tubelet modeling for video action detection. TubeR
- CVPR22 (Oral): Memory-efficient training for large video models. Temporal Gradient Dropout
- CVPR22 (Oral): Joint semantic and spatial reasoning for human–object interaction. Semantic & Spatial Refined Transformer
- CVPR22: Annotation-free learning for re-ID. Id-Free Person Similarity Learning
- NeurIPS21 (Spotlight): Online action detection with long short-term transformers. LSTR
- ICCV21: One of the first transformer architectures for video understanding. VidTr
- ICCV21: Feature compression for efficient activity recognition inference. Selective Feature Compression
- CVPR21: Modeling label correlations for multi-label activity recognition. Multi-Label Activity Recognition
- CVPR21: Siamese transformer framework for multi-object tracking. SiamMOT
