About
I am a Member of Technical Staff (MTS) at Physical Intelligence, working on vision-language-action (VLA) models and omni-models.
Previously, I was a Principal Applied Scientist at Amazon AGI, where I led multimodal understanding for the Nova model family. Before that, I was a Staff Research Scientist at ByteDance and a Senior Applied Scientist at AWS AI, leading multimodal and video modeling efforts deployed in production.
Updates
Model Release
- Nova 2 Family: Multimodal reasoning and generation models. Technical Report
- Nova Multimodal Embedding: State-of-the-art multimodal embeddings for agentic RAG and semantic search across video, image, document, and audio. Technical Report
- Nova 1 Family: Amazon’s first-generation multimodal foundation models. Nova 1 and Nova 1 Premier
Publications
- ICCV25: Easy and accurate semi-supervised learning with text guidance. SemiVisBooster
- WACV25: Build VLM datasets using captions of multiple semantic granularities. Scalable Multi-grained Video-Language Learning (GEXIA)
- WACV25: Generate audio descriptions with in-video contextual awareness. Context-Aware Automatic Audio Description
- NeurIPS24: Efficient token reduction for long-form video understanding. Video Token Merging
- ECCV24: Masked video pretraining guided by language supervision. Text-Guided Video Masked Autoencoder
- ICCV23: Align and distill multimodal cues for cinematic video segmentation. MEGA
- CVPR23: Rethink multimodal contrastive learning from patches to discrete tokens. Revisiting Multimodal Representation
