Deep learning models have become the backbone of modern computer vision systems, achieving striking success on tasks ranging from image to video understanding. Yet while the benchmark performance of deep neural networks never seems to saturate as they are fed more data and compute, real-world applications of these models often fall short of expectations. This is due to biases in the data, which misrepresent the underlying distribution and lead to poor generalization and discriminatory outcomes. We investigate the problem of bias in vision and multimodal learning systems, proposing methods to identify, measure, and mitigate biases in both the data and the models.
This project aims to characterize and quantify representation bias in video action recognition datasets. We measure the bias of popular action datasets towards object, scene, and person representations, and collect a new dataset (Diving48) with explicit control of static biases, using it to evaluate the temporal modeling abilities of action recognition models.
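As a rough illustration of this measurement, the sketch below estimates a dataset's bias toward a given representation as the log-ratio of a linear probe's accuracy over chance, in the spirit of the RESOUND metric; the probe choice, train/test split, and function names are assumptions for illustration, not the paper's exact protocol.

import numpy as np
from sklearn.linear_model import LogisticRegression

def representation_bias(features, labels, n_classes, seed=0):
    # Estimate dataset bias toward a representation as the log-ratio of a
    # linear probe's accuracy over chance (RESOUND-style; the probe and
    # split protocol here are illustrative assumptions).
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    split = int(0.8 * len(labels))
    train, test = idx[:split], idx[split:]

    probe = LogisticRegression(max_iter=1000)
    probe.fit(features[train], labels[train])
    acc = probe.score(features[test], labels[test])
    return np.log(acc / (1.0 / n_classes))

For example, with features extracted by an ImageNet-pretrained network from video frames and labels given by the action classes, a large value would signal that the dataset's actions can be predicted from objects alone, i.e., strong object bias.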
This project studies the mitigation of representation bias in vision datasets by resampling. We propose an adversarial resampling scheme that minimizes the bias of any given dataset, and demonstrate its effectiveness in reducing color bias on a synthetic Colored MNIST dataset, as well as static bias in action recognition. We show that representation debiasing is crucial to improving model generalization.
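The sketch below captures the adversarial game in the spirit of REPAIR: a linear bias classifier minimizes a weighted loss over fixed per-example features, while per-example weights are learned to maximize that same loss, down-weighting the examples that make the biased representation predictive. The optimizer settings and the exact objective are illustrative assumptions, not the published formulation.

import torch
import torch.nn as nn

def repair_weights(features, labels, n_classes, steps=1000, lr=0.1):
    # features: fixed per-example representation vectors (e.g. color or
    # static-frame embeddings); labels: class indices.
    n, d = features.shape
    clf = nn.Linear(d, n_classes)
    w_logits = torch.zeros(n, requires_grad=True)   # per-example weight logits
    opt_clf = torch.optim.SGD(clf.parameters(), lr=lr)
    opt_w = torch.optim.SGD([w_logits], lr=lr)
    ce = nn.CrossEntropyLoss(reduction="none")

    for _ in range(steps):
        w = torch.sigmoid(w_logits)
        loss = (w * ce(clf(features), labels)).sum() / w.sum()
        opt_clf.zero_grad()
        opt_w.zero_grad()
        loss.backward()
        opt_clf.step()            # classifier descends on the weighted loss
        w_logits.grad.neg_()      # weights ascend: make the dataset harder
        opt_w.step()

    return torch.sigmoid(w_logits).detach()   # resampling probabilities

Sampling the dataset with the returned probabilities yields a debiased version; on Colored MNIST, for instance, one would expect low weights to concentrate on digits whose color alone reveals the class.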
This project explores model debiasing as a means to improve transfer learning and generalization in video understanding. We quantify how dynamic a video model is by the difference between its predictions and those of a purely spatial model, and propose a dynamic representation learning (DRL) framework that improves transfer by optimizing these dynamic scores.
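A minimal sketch of such a score follows, using KL divergence as one plausible instantiation of the "difference between predictions"; the paper's exact definition may differ.

import torch
import torch.nn.functional as F

def dynamic_score(video_logits, spatial_logits):
    # Score how far a video model's posterior departs from that of a purely
    # spatial reference model (e.g. a frame-averaged 2D network) on the
    # same clip.
    log_p_video = F.log_softmax(video_logits, dim=-1)
    p_spatial = F.softmax(spatial_logits, dim=-1)
    # KL(spatial || video): near zero when temporal cues add nothing beyond
    # static appearance, large when they change the prediction.
    return F.kl_div(log_p_video, p_spatial, reduction="none").sum(-1)

In a DRL-style framework, encouraging high scores during training pushes the video model to rely on motion rather than static appearance, which is the property that transfers across datasets.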
This project addresses model debiasing from an architectural perspective. We observe that video-language models suffer from static biases similar to those of video action recognition, which prevent them from learning longer-term temporal dependencies. We propose a sparse architecture that models long clips efficiently, along with a temporal expansion curriculum that facilitates learning beyond static frames.
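The sketch below illustrates the general shape of such a curriculum: each stage doubles the clip length while keeping a smaller fraction of visual tokens, so that attention cost stays roughly constant as the temporal span grows. The schedule, constants, and the train_one_stage call are hypothetical, not SViTT's published configuration.

def temporal_curriculum(stage, base_frames=4, keep_decay=0.7):
    # Illustrative temporal expansion schedule: longer clips, sparser tokens.
    n_frames = base_frames * 2 ** stage        # 4 -> 8 -> 16 frames
    keep_ratio = keep_decay ** stage           # 1.0 -> 0.7 -> 0.49 of tokens
    return n_frames, keep_ratio

for stage in range(3):
    n_frames, keep_ratio = temporal_curriculum(stage)
    print(f"stage {stage}: {n_frames} frames, keep {keep_ratio:.2f} of tokens")
    # train_one_stage(model, n_frames, keep_ratio)  # hypothetical training call

Starting from short clips lets the model first master static grounding, while the growing temporal span forces it to pick up longer-range dependencies as sparsity keeps the computation tractable.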
Holistic Bias Mitigation in Computer Vision and Beyond
Yi Li
Ph.D. Thesis, University of California San Diego,
2024.
SViTT: Temporal Learning of Sparse Video-Text Transformers
Yi Li, Kyle Min, Subarna Tripathi and Nuno Vasconcelos
IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR),
2023.
Improving Video Model Transfer with Dynamic Representation Learning
Yi Li and Nuno Vasconcelos
IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR),
2022.
REPAIR: Removing Representation Bias by Dataset Resampling
Yi Li and Nuno Vasconcelos
IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR),
2019.
RESOUND: Towards Action Recognition without Representation Bias
Yingwei Li, Yi Li and Nuno Vasconcelos
European Conf. on Computer Vision (ECCV),
2018.