Publications
Selected work on VLM / M-LLM, multimodal understanding, and efficient learning systems.
For the full list, please see my Google Scholar.
Technical Reports
Nova 2 — Multimodal reasoning and generation foundation models.
Nova 2 Technical Report
Nova Multimodal Embedding — Multimodal embeddings for agentic RAG and semantic search across video, image, document, and audio.
Nova MM-embedding Technical Report
Nova 1 / Nova 1 Premier — Earlier-generation multimodal foundation models.
Nova 1 · Nova 1 Premier
VLM / M-LLM
ICCV 2025 — Semi-supervised learning with pseudo-label semantic guidance for fine-grained classification.
Boosting Semi-Supervised Learning for Fine-Grained Classification through Pseudo-Label Semantic Guidance
WACV 2025 — Multi-grained video–text learning via dataset granularity expansion and scalable alignment modeling.
Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning
WACV 2025 — End-to-end vision-to-text audio description: detect AD events + generate scripts with long-context modeling.
Context-Aware Automatic Audio Description
ECCV 2024 — Video MAE with text-guided masking + joint video-text contrastive learning for better representations.
Text-Guided Video Masked Autoencoder
NeurIPS 2024 — Learnable token merging for long-form video transformers (big memory/throughput wins).
Video Token Merging
ICCV 2023 — Motion-aware masking for spatiotemporal representation learning.
Motion-Guided Masking for Spatiotemporal Representation Learning
CVPR 2023 — CLIP-style pretraining with finite discrete tokens to close granularity gaps between vision and language.
Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens
Multimodal Understanding
ICCV 2023 — Multimodal alignment/fusion for long-form cinematic video segmentation across modalities.
Multimodal alignment aggregation and distillation for cinematic video segmentation
ICASSP 2023 — Audio modeling with causal audio transformers + multi-resolution features.
Causal Audio Transformer for Audio Classification
CVPR 2022 (Oral) — Tubelet transformer for spatiotemporal action detection (end-to-end tube detection).
Tubelet Transformer for Video Action Detection
CVPR 2022 — Identity-free person similarity learning.
Id-Free Person Similarity Learning
CVPR 2022 (Oral) — Semantic+spatial refined transformer for human–object interaction detection.
SSRT for HOI
WACV 2022 — Action recognition by focusing on important temporal segments.
Non-uniform Temporal Aggregation for Action Recognition
NeurIPS 2021 (Spotlight) — Online action detection with long/short-term temporal modeling.
Long Short-Term Transformer for Online Action Detection
ICCV 2021 — Video transformer without convolutions (video understanding architecture).
Video Transformer Without Convolutions
CVPR 2021 — Siamese multi-object tracking.
SiamMOT
ICCV 2021 — Video–language contrastive learning using global context.
Video Contrastive Learning with Global Context
WACV 2022 — Unsupervised temporal action segmentation via self-supervised co-occurrence parsing.
Self-supervised Co-occurrence Action Parsing for Unsupervised Temporal Action Segmentation
ECCV 2020 — Making convolutions temporally aware.
Directional Temporal Modeling
Learning Methods, Efficiency & Others
WACV 2023 — Frequency-domain image modeling.
Discrete Cosin Transformer
CVPR 2022 (Oral) — Memory-efficient training for video models (stochastic backprop / gradient dropout).
Stochastic Backpropagation / Temporal Gradient Dropout
ICCV 2021 — Efficient inference via selective feature compression.
Selective Feature Compression for Efficient Activity Recognition Inference
CVPR 2021 — Multi-label activity recognition with label correlations.
Multi-Label Activity Recognition
Pre-print — A survey of recent action recognition methods for the GluonCV release.
A Comprehensive Study of Deep Video Action Recognition
GluonCV
Open Source (2021) — Multimodal research toolkit.
GluonMM
