Publications

For the full publication list, please see my Google Scholar profile
* denotes equal contribution

2024

Text-Guided Video Masked Autoencoder
David Fan, Jue Wang, Shuai Liao, Zhikang Zhang, Vimal Bhat, Xinyu Li.
ECCV 2024 Paper Link



2023

Motion-Guided Masking for Spatiotemporal Representation Learning
David Fan, Jue Wang, Leo Liao, Yi Zhu, Vimal Bhat, Hector Santos, Xinyu Li.
ICCV 2023 Paper Link



MEGA: Multimodal alignment aggregation and distillation for cinematic video segmentation
Najmeh Sadoughi, Xinyu Li, Avijit Vajpayee, David Fan, Bing Shuai, Hector Santos-Villalobos, Vimal Bhat.
ICCV 2023 Paper Link



Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens
Yuxiao Chen, Jianbo Yuan, Yu Tian, Shijie Geng, Xinyu Li, Ding Zhou, Dimitris N. Metaxas, Hongxia Yang.
CVPR 2023 Paper Link

CAT: Causal Audio Transformer for Audio Classification
Xiaoyu Liu, Hanlin Lu, Jianbo Yuan, Xinyu Li.
ICASSP 2023 Paper Link

Nearest-Neighbor Inter-Intra Contrastive Learning from Unlabeled Videos
David Fan, Deyu Yang, Xinyu Li, Vimal Bhat, Rohith MV.
ICLR 2023 Workshop Paper Link




Discrete Cosin TransFormer: Image Modeling From Frequency Domain
Xinyu Li*, Yanyi Zhang, Jianbo Yuan, Hanlin Lu, Yibo Zhu.
WACV 2023 Paper Link





2022

TubeR: Tubelet Transformer for Video Action Detection
Jiaojiao Zhao*, Yanyi Zhang*, Xinyu Li*, Hao Chen, Bing Shuai, Mingze Xu, Chunhui Liu, Kaustav Kundu, Yuanjun Xiong, Ivan Marsic, Cees G.M. Snoek, Joseph Tighe.
CVPR 2022 Oral Paper Link

Id-Free Person Similarity Learning
Bing Shuai, Xinyu Li, Kaustav Kundu, Joseph Tighe.
CVPR 2022 Paper Link

What to Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-Object Interactions
A S M Iftekhar, Hao Chen, Kaustav Kundu, Xinyu Li, Joseph Tighe, Davide Modolo.
CVPR 2022 Oral Paper Link

Stochastic Backpropagation: A Memory Efficient Strategy for Training Video Models
Feng Cheng, Mingze Xu, Yuanjun Xiong, Hao Chen, Xinyu Li, Wei Li, Wei Xia.
CVPR 2022 Oral Paper Link

NUTA: Non-uniform Temporal Aggregation for Action Recognition
Xinyu Li*, Chunhui Liu*, Bing Shuai, Yi Zhu, Hao Chen, Joseph Tighe.
WACV 2022 Paper Link

SSCAP: Self-supervised Co-occurrence Action Parsing for Unsupervised Temporal Action Segmentation
Zhe Wang, Hao Chen, Xinyu Li, Chunhui Liu, Yuanjun Xiong, Joseph Tighe, Charless Fowlkes.
WACV 2022 Link

2021

Long Short-Term Transformer for Online Action Detection
Mingze Xu, Yuanjun Xiong, Hao Chen, Xinyu Li, Wei Xia, Zhuowen Tu, Stefano Soatto.
NeurIPS 2021 Spotlight Paper Link

VidTr: Video Transformer Without Convolutions
Xinyu Li*, Yanyi Zhang*, Chunhui Liu, Bing Shuai, Yi Zhu, Hao Chen, Ivan Marsic, Joseph Tighe.
ICCV 2021 Paper Link


Selective Feature Compression for Efficient Activity Recognition Inference
Chunhui Liu*, Xinyu Li*, Hao Chen, Joseph Tighe.
ICCV 2021 Paper Link

Video Contrastive Learning with Global Context
Haofei Kuang, Yi Zhu, Zhi Zhang, Xinyu Li, Joseph Tighe, Sören Schwertfeger, Cyrill Stachniss, Mu Li.
ICCV 2021 Workshop Paper Link



Multi-Label Activity Recognition using Activity-specific Features and Activity Correlations
Yanyi Zhang, Xinyu Li*, Ivan Marsic.
CVPR 2021 Paper Link



SiamMOT: Siamese Multi-Object Tracking
Bing Shuai, Andrew G. Berneshawi, Xinyu Li, Davide Modolo, Joseph Tighe.
CVPR 2021 Paper Link

2020

A Comprehensive Study of Deep Video Action Recognition
Yi Zhu, Xinyu Li, Chunhui Liu, Mohammadreza Zolfaghari, Yuanjun Xiong, Chongruo Wu, Zhi Zhang, Joseph Tighe, R. Manmatha, and Mu Li.
Pre-print

Directional temporal modeling for action recognition
Xinyu Li, Bing Shuai, and Joseph Tighe.
ECCV 2020 Paper Link

Application of Multi-Object Tracking with Siamese Track-RCNN to the Human in Events Dataset
Bing Shuai, Andrew G. Berneshawi, Manchen Wang, Chunhui Liu, Davide Modolo, Xinyu Li, Joseph Tighe.
ACM MM 2020 Paper Link

2019

Speech Audio Super-Resolution for Speech Recognition
Xinyu Li, Venkata Chebiyyam, Katrin Kirchhoff.
INTERSPEECH 2019 Paper Link

Multi-stream Network With Temporal Attention For Environmental Sound Classification
Xinyu Li, Venkata Chebiyyam, Katrin Kirchhoff.
INTERSPEECH 2019 Paper Link



Mutual Correlation Attentive Factors In Dyadic Fusion Networks For Speech Emotion Recognition
Yue Gu, Xinyu Lyu, Weijia Sun, Weitian Li, Shuhong Chen, Xinyu Li, and Ivan Marsic.
ACM MM 2019 Paper Link

2018

Human conversation analysis using attentive multimodal networks with hierarchical encoder-decoder
Yue Gu, Xinyu Li, Kaixiang Huang, Shiyu Fu, Kangning Yang, Shuhong Chen, Moliang Zhou, and Ivan Marsic.
ACM MM 2018 Paper Link

Hybrid Attention based Multimodal Network for Spoken Language Classification
Yue Gu, Kangning Yang, Shiyu Fu, Shuhong Chen, Xinyu Li, and Ivan Marsic.
COLING 2018 Paper Link

2017

Region-based Activity Recognition Using Conditional GAN
Xinyu Li, Yanyi Zhang, Jianyu Zhang, Yueyang Chen, Huangcan Li, Ivan Marsic, and Randall S. Burd.
ACM MM 2017 Paper Link



Progress Estimation And Phase Detection For Sequential Processes
Xinyu Li, Yanyi Zhang, Jianyu Zhang, Moliang Zhou, Shuhong Chen, Yue Gu, Yueyang Chen, Ivan Marsic.
UBICOMP 2017 Paper Link

2016 and Before

Deep Learning for RFID-based Activity Recognition
Xinyu Li, Yanyi Zhang, Ivan Marsic, Aleksandra Sarcevic, and Randall S. Burd.
SenSys 2016

Activity Recognition For Medical Teamwork Based on Passive RFID
Xinyu Li, Dongyang Yao, Xuechao Pan, Jonathan Johannaman, JaeWon Yang, Rachel Webman, Aleksandra Sarcevic, Ivan Marsic, and Randall S. Burd.
IEEE RFID 2016

A Novel Single Image Dehazing Method
Yanjing Yang, Zhizhong Fu, Xinyu Li, Chang Shu, and Xiaofeng Li.
ICCP 2013