Publications

Selected work on VLM/M-LLM, multimodal understanding, and efficient learning systems. For the complete list (including co-authored and earlier work), see my Google Scholar.

Technical Reports

Nova 2 — Multimodal reasoning and generation foundation models.
Nova 2 Technical Report
Nova Multimodal Embedding — Multimodal embeddings for agentic RAG and semantic search across video, image, document, and audio.
Nova MM-embedding Technical Report
Nova 1 / Nova 1 Premier — Earlier-generation multimodal foundation models.
Nova 1 · Nova 1 Premier

VLM / M-LLM

ICCV 2025 — Semi-supervised learning with pseudo-label semantic guidance for fine-grained classification.
Boosting Semi-Supervised Learning for Fine-Grained Classification through Pseudo-Label Semantic Guidance
WACV 2025 — Multi-grained video–text learning via dataset granularity expansion and scalable alignment modeling.
Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning
WACV 2025 — End-to-end vision-to-text audio description: detect AD events + generate scripts with long-context modeling.
Context-Aware Automatic Audio Description
ECCV 2024 — Video MAE with text-guided masking + joint video-text contrastive learning for better representations.
Text-Guided Video Masked Autoencoder
NeurIPS 2024 — Learnable token merging for long-form video transformers, reducing memory use and improving throughput.
Video Token Merging
ICCV 2023 — Motion-aware masking for spatiotemporal representation learning.
Motion-Guided Masking for Spatiotemporal Representation Learning
CVPR 2023 — CLIP-style pretraining with finite discrete tokens to close granularity gaps between vision and language.
Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens

Multimodal Understanding

ICCV 2023 — Multimodal alignment and fusion for long-form cinematic video segmentation.
Multimodal Alignment Aggregation and Distillation for Cinematic Video Segmentation
ICASSP 2023 — Audio modeling with causal audio transformers + multi-resolution features.
Causal Audio Transformer for Audio Classification
CVPR 2022 (Oral) — Tubelet transformer for spatiotemporal action detection (end-to-end tube detection).
Tubelet Transformer for Video Action Detection
CVPR 2022 — Identity-free person similarity learning.
Id-Free Person Similarity Learning
CVPR 2022 (Oral) — Semantic+spatial refined transformer for human–object interaction detection.
SSRT for HOI
WACV 2022 — Action recognition by focusing on important temporal segments.
Non-uniform Temporal Aggregation for Action Recognition
NeurIPS 2021 (Spotlight) — Online action detection with long/short-term temporal modeling.
Long Short-Term Transformer for Online Action Detection
ICCV 2021 — Video transformer without convolutions (video understanding architecture).
Video Transformer Without Convolutions
CVPR 2021 — Siamese multi-object tracking.
SiamMOT
ICCV 2021 — Video–language contrastive learning using global context.
Video Contrastive Learning with Global Context
WACV 2022 — Unsupervised temporal action segmentation via self-supervised co-occurrence parsing.
Self-supervised Co-occurrence Action Parsing for Unsupervised Temporal Action Segmentation
ECCV 2020 — Making convolutions temporally aware.
Directional Temporal Modeling

Learning Methods, Efficiency & Others

WACV 2023 — Frequency-domain image modeling.
Discrete Cosine Transformer
CVPR 2022 (Oral) — Memory-efficient training for video models (stochastic backprop / gradient dropout).
Stochastic Backpropagation / Temporal Gradient Dropout
ICCV 2021 — Efficient inference via selective feature compression.
Selective Feature Compression for Efficient Activity Recognition Inference
CVPR 2021 — Multi-label activity recognition with label correlations.
Multi-Label Activity Recognition
Pre-print — A survey of recent action recognition methods, written for the GluonCV release.
A Comprehensive Study of Deep Video Action Recognition
GluonCV
Open Source (2021) — Multimodal research toolkit.
GluonMM

Xinyu Li
