Throughout my academic journey, I have been driven by a vision: building an intelligent system that interacts with the world across multiple modalities. I am fortunate to pursue this passion under the supervision of Prof. Hsuan-Tien Lin and Prof. Yu-Chiang Frank Wang at National Taiwan University, and Prof. Ming-Hsuan Yang at the University of California, Merced. My research in computer vision and AI, particularly domain adaptation, continual learning for vision-language models, and 3D object articulation synthesis, has resulted in four publications [, , , ], including three first-author publications. Below, I detail my past research experiences and outline prospective directions for my Ph.D. career.
Domain Adaptation. Traditional learning theory guarantees that test risk can be bounded when the training and test data are drawn i.i.d. from the same distribution. However, in real-world applications, this assumption rarely holds. During my master's studies under the supervision of Prof. Hsuan-Tien Lin, I began addressing this domain shift problem. I observed that source data does not necessarily need to be well-classified when training an ideal target model. Based on this insight, I reformulated Semi-Supervised Domain Adaptation (SSDA) as a noisy label learning problem, where the source labels are treated as noisy and the target labels as clean. I proposed a novel approach that leverages a prototypical network to correct the noisy source labels and re-trains the model with these cleaned labels, yielding significant improvements over state-of-the-art methods on various domain adaptation benchmarks. This work was accepted at CVPR 2023 and served as the basis of my master's thesis. It also earned the Master Thesis Award from the Taiwanese Association for Artificial Intelligence (TAAI), ranking first out of roughly 50 submissions.
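To make the idea concrete, a minimal sketch of the prototype-based label-cleaning step is shown below; the tensor names, the cosine-similarity choice, and the toy usage are illustrative simplifications rather than the exact implementation of the published method.

```python
# Illustrative sketch only: prototype-based cleaning of (noisy) source labels for SSDA.
# Tensor names and shapes are hypothetical; the published method contains further components.
import torch
import torch.nn.functional as F

def build_prototypes(target_feats, target_labels, num_classes):
    """Average the labeled target features per class to obtain class prototypes."""
    protos = torch.stack([target_feats[target_labels == c].mean(dim=0)
                          for c in range(num_classes)])
    return F.normalize(protos, dim=1)

def clean_source_labels(source_feats, prototypes):
    """Treat source labels as noisy: reassign each source sample to its nearest prototype."""
    sims = F.normalize(source_feats, dim=1) @ prototypes.t()  # cosine similarity to each class
    return sims.argmax(dim=1)                                 # cleaned pseudo-labels

# Toy usage with random features (every class assumed present in the labeled target set):
num_classes, dim = 10, 256
target_labels = torch.arange(num_classes).repeat(5)           # 5 labeled target samples per class
target_feats = torch.randn(len(target_labels), dim)
source_feats = torch.randn(200, dim)
prototypes = build_prototypes(target_feats, target_labels, num_classes)
cleaned_labels = clean_source_labels(source_feats, prototypes)  # used to re-train the model
```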
Continual Learning for Vision-Language Models. As large-scale vision-language models (VLMs) gained prominence, I realized that scaling up training data also enables zero-shot adaptation, mitigating domain shift from another direction. However, naively fine-tuning VLMs on specific tasks suffers from the well-known catastrophic forgetting problem. After discussing this limitation with Prof. Yu-Chiang Frank Wang, I pursued a new project on continual learning for VLMs, since I believed truly intelligent AI systems must learn continuously, as humans do. With guidance from Prof. Wang, I quickly identified the major challenge: preserving previously acquired knowledge without access to prior data. Through an empirical study, I discovered that the feature distance between the original and the fine-tuned VLM serves as a good indicator of whether data has been learned before. Guided by this key insight, I designed a dual-teacher knowledge distillation framework that preserves past knowledge while maintaining the original zero-shot capabilities of a fine-tuned VLM, achieving state-of-the-art performance on continual learning benchmarks. This work, accepted at ECCV 2024, further reinforced my dedication to developing a continual multi-modal AI learner that adapts to real-world applications.
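A simplified sketch of the per-sample routing between the two teachers is shown below; the threshold-based routing and the temperature-scaled KL loss are illustrative assumptions, not the exact recipe of the published framework.

```python
# Illustrative sketch only: per-sample routing between a frozen zero-shot teacher and a
# fine-tuned teacher, guided by feature distance to the original VLM. The threshold and
# loss form are simplifying assumptions, not the exact published recipe.
import torch
import torch.nn.functional as F

def feature_distance(current_feats, original_feats):
    """Cosine distance to the original (frozen) VLM features; a larger distance suggests
    the sample resembles data learned after pre-training."""
    return 1 - F.cosine_similarity(current_feats, original_feats, dim=-1)

def dual_teacher_kd_loss(student_logits, zeroshot_logits, finetuned_logits,
                         current_feats, original_feats, tau=2.0, threshold=0.3):
    """Distill from the fine-tuned teacher on 'previously learned' samples and from the
    original zero-shot teacher otherwise, retaining both kinds of knowledge."""
    dist = feature_distance(current_feats, original_feats)        # (B,)
    use_finetuned = (dist > threshold).float().unsqueeze(-1)      # per-sample routing mask
    teacher_logits = use_finetuned * finetuned_logits + (1 - use_finetuned) * zeroshot_logits
    return F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                    F.softmax(teacher_logits / tau, dim=-1),
                    reduction="batchmean") * tau ** 2
```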
Temporal Reasoning for Multi-Modal Models. Under the supervision of Prof. Wang at the Vision Learning Lab, I also extended my experience with vision-language models to the more challenging task of video temporal reasoning. With the additional temporal dimension, we observed that even state-of-the-art video reasoning models struggle when a question and its corresponding answer span different video segments. To properly evaluate these advanced temporal reasoning capabilities of current multi-modal large language models, we developed an automated pipeline that generates temporal reasoning question-answer pairs. Leveraging my expertise in multi-modal models, I collaborated with fellow graduate students to structure this pipeline, curating a benchmark with approximately 10,000 machine-generated training samples and approximately 3,000 carefully vetted validation samples. This project, accepted at NeurIPS 2024, deepened my understanding of the limitations of current multi-modal models and their potential for temporal reasoning applications.
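The sketch below shows one way such a cross-segment question-answer pair could be constructed; `llm_generate` and the prompt wording are hypothetical placeholders, and the actual pipeline additionally filters generations, with the validation split vetted by humans.

```python
# Illustrative sketch only: constructing a question-answer pair that spans two different
# video segments. `llm_generate` is a hypothetical stand-in for the language model used.
import random

def make_cross_segment_qa(segment_captions, llm_generate):
    """Sample two distinct segments so the question is grounded in one segment while the
    answer requires evidence from the other, forcing genuine temporal reasoning."""
    i, j = random.sample(range(len(segment_captions)), 2)
    prompt = (f"Write a question about the event '{segment_captions[i]}' whose answer "
              f"requires the separate event '{segment_captions[j]}'. "
              f"Return the question and its answer.")
    return llm_generate(prompt), (i, j)   # QA text plus the segment indices it spans
```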
Open-Domain 3D Articulation Synthesis. In addition to my work on 2D cross-modal visual understanding, I also saw the potential to integrate 2D and 3D vision models, which aligns closely with my long-term research objectives. During my visit to Prof. Ming-Hsuan Yang's lab at the University of California, Merced, I started a new project on 3D object articulation synthesis, which is currently under review. In this work, I proposed a novel setup: synthesizing plausible interactions between open-domain rigged objects and a given 3D scene. This setup is particularly challenging due to the absence of training data and of prior topological assumptions about open-domain rigs. To tackle these challenges, we developed a systematic pipeline that leverages 2D inpainting models to generate realistic object postures from multiple viewpoints in 2D space. We then extracted diffusion features to compute semantic correspondences between rendered and inpainted images, aligning object postures with multi-view guidance. Our framework represents a pioneering attempt to synthesize static open-domain 3D object interactions, with promising potential to extend these ideas to the more complex realm of 4D dynamic interaction synthesis.
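A minimal sketch of the dense correspondence step is shown below; random tensors stand in for real diffusion features, and the subsequent step that lifts 2D matches into 3D pose optimization is omitted.

```python
# Illustrative sketch only: nearest-neighbor semantic correspondence between per-pixel
# feature maps of a rendered view and an inpainted view. Shapes and the cosine-similarity
# matching are simplifying assumptions, not the full pipeline.
import torch
import torch.nn.functional as F

def dense_correspondence(feats_rendered, feats_inpainted):
    """Match each rendered pixel to its most similar inpainted pixel in feature space.
    Both inputs have shape (C, H, W); returns (H*W, 2) matched (row, col) coordinates."""
    C, H, W = feats_rendered.shape
    a = F.normalize(feats_rendered.reshape(C, -1), dim=0)   # (C, H*W)
    b = F.normalize(feats_inpainted.reshape(C, -1), dim=0)  # (C, H*W)
    sim = a.t() @ b                                         # pairwise cosine similarities
    match = sim.argmax(dim=1)                               # best inpainted pixel per rendered pixel
    return torch.stack([match // W, match % W], dim=1)

# Toy usage with random features in place of diffusion activations:
rendered, inpainted = torch.randn(128, 32, 32), torch.randn(128, 32, 32)
matches = dense_correspondence(rendered, inpainted)
```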
I imagine a future where every individual has their own intelligent assistant that seamlessly interacts with them: understanding their preferences, answering their questions, acting as their best friend, and even preserving and replaying their cherished personal memories, such as moments with beloved pets or family members. All of my past research experiences, spanning domain adaptation, continual learning, multi-modal video reasoning, and 3D articulation synthesis, contribute to realizing this ambitious vision. In my future Ph.D. career, I plan to focus on two key directions: (1) continual learning across advanced applications such as question answering, video understanding, and generative modeling, and (2) 4D open-domain dynamic object interaction synthesis.