Publications
Selected work on vision–language–action (VLA), multimodal foundation models, video understanding, and efficient learning systems. For the complete list (including co-authored and earlier work), see my Google Scholar.
Models & Technical Reports
π0.7 (Physical Intelligence, 2026) — Steerable generalist robot foundation model with emergent compositional capabilities across dexterous manipulation tasks and robot platforms. [π0.7 Blog]
Nova 2 (Amazon, 2025) — Multimodal reasoning and generation foundation models. [Technical Report]
Nova Multimodal Embedding (Amazon, 2025) — Multimodal embeddings for agentic RAG and semantic search across video, image, document, and audio. [Technical Report]
Nova 1 / Nova 1 Premier (Amazon, 2024–2025) — First-generation Nova multimodal foundation models. [Nova 1 · Nova 1 Premier]
Vision–Language Models
STORM: End-to-End Referring Multi-Object Tracking in Videos (CVPR 2026) — End-to-end referring multi-object tracking with a unified MLLM that jointly performs grounding and tracking; introduces STORM-Bench.
Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models (WACV 2026) — Compact video representations that let large multimodal models efficiently handle long-form video while preserving comprehension quality.
SemiVisBooster: Boosting Semi-Supervised Learning for Fine-Grained Classification through Pseudo-Label Semantic Guidance (ICCV 2025) — Semi-supervised learning with pseudo-label semantic guidance for fine-grained classification.
GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video–Language Learning (WACV 2025) — Multi-grained video–text learning via dataset granularity expansion and scalable alignment modeling.
Context-Aware Automatic Audio Description (WACV 2025) — End-to-end vision-to-text audio description: detects AD events and generates scripts with long-context modeling.
Text-Guided Video Masked Autoencoder (ECCV 2024) — Video MAE with text-guided masking and joint video–text contrastive learning.
Video Token Merging (NeurIPS 2024) — Learnable token merging for long-form video transformers, with substantial memory and throughput gains.
Motion-Guided Masking for Spatiotemporal Representation Learning (ICCV 2023) — Motion-aware masking for spatiotemporal representation learning.
Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens (CVPR 2023) — CLIP-style pretraining with finite discrete tokens to close the granularity gap between vision and language.
Multimodal Understanding
MEGA: Multimodal Alignment, Aggregation and Distillation for Cinematic Video Segmentation (ICCV 2023) — Multimodal alignment and fusion for long-form cinematic video segmentation.
Causal Audio Transformer for Audio Classification (ICASSP 2023) — Causal audio transformers with multi-resolution features.
Tubelet Transformer for Video Action Detection (CVPR 2022, Oral) — Tubelet transformer for end-to-end spatiotemporal action detection.
SSRT for HOI (CVPR 2022, Oral) — Semantic and spatial refined transformer for human–object interaction detection.
Id-Free Person Similarity Learning (CVPR 2022) — Identity-free person similarity learning.
Long Short-Term Transformer for Online Action Detection (NeurIPS 2021, Spotlight) — Online action detection with long/short-term temporal modeling.
Video Transformer Without Convolutions (ICCV 2021) — Convolution-free video transformer architecture.
Video Contrastive Learning with Global Context (ICCV 2021) — Video–language contrastive learning using global context.
SiamMOT (CVPR 2021) — Siamese multi-object tracking.
NUTA: Non-uniform Temporal Aggregation for Action Recognition (WACV 2022) — Action recognition by focusing on important temporal segments.
SSCAP: Self-Supervised Co-Occurrence Action Parsing for Unsupervised Temporal Action Segmentation (WACV 2022) — Unsupervised temporal action segmentation via self-supervised co-occurrence parsing.
Directional Temporal Modeling (ECCV 2020) — Making convolutions temporal-aware.
Learning Methods, Efficiency & Open Source
Discrete Cosine Transformer (WACV 2023) — Frequency-domain image modeling.
Stochastic Backpropagation for Video Models (CVPR 2022, Oral) — Memory-efficient video training via stochastic backprop / temporal gradient dropout.
Selective Feature Compression for Efficient Activity Recognition Inference (ICCV 2021) — Efficient inference via selective feature compression.
Multi-Label Activity Recognition (CVPR 2021) — Multi-label activity recognition with label correlations.
A Comprehensive Study of Deep Video Action Recognition (Pre-print) — Survey of deep video action recognition, released with GluonCV.
GluonMM (Open Source, 2021) — Multimodal research toolkit.