About

I am a Member of Technical Staff at Physical Intelligence, working on vision-language-action (VLA) models and omni-models.

Previously, I was a Principal Applied Scientist at Amazon AGI, where I led multimodal understanding for the Nova model family. Before that, I was a Staff Research Scientist at ByteDance and a Senior Applied Scientist at AWS AI, leading multimodal and video modeling efforts deployed in production.

I received my Ph.D. from Rutgers University in 2018 and my bachelor’s degree from the University of Electronic Science and Technology of China in 2013.

Updates

Model Release

Nova 2 Family: Multimodal reasoning and generation models. Technical Report
Nova Multimodal Embedding: State-of-the-art multimodal embeddings for agentic RAG and semantic search across video, image, document, and audio. Technical Report
Nova 1 Family: Amazon’s first-generation multimodal foundation models. Nova 1 and Nova 1 Premier

Publications

ICCV25: Easy and accurate semi-supervised learning with text guidance. SemiVisBooster
WACV25: Build VLM datasets using captions of multiple semantic granularities. Scalable Multi-grained Video-Language Learning (GEXIA)
WACV25: Generate audio descriptions with in-video contextual awareness. Context-Aware Automatic Audio Description
NeurIPS24: Efficient token reduction for long-form video understanding. Video Token Merging
ECCV24: Masked video pretraining guided by language supervision. Text-Guided Video Masked Autoencoder
ICCV23: Align and distill multimodal cues for cinematic video segmentation. MEGA
CVPR23: Rethink multimodal contrastive learning from patches to discrete tokens. Revisiting Multimodal Representation

Xinyu Li
