Hello, I am Zhu Fangrui from China. I am currently a third-year master's student supervised by Prof. Yanwei Fu in the School of Data Science, Fudan University. My hometown is Beijing, a city of beautiful scenery and rich cultural heritage.
My current research concentrates on deep-learning-based computer vision. I hope to bring insights from neuroscience and psychology into computer vision tasks, exploring how people think, make predictions, and take actions.
M.S. in Data Science, Statistics, 2018 - 2021
Fudan University, Shanghai, China.
B.Eng. in Software Engineering, 2014 - 2018
Tongji University, Shanghai, China.
The objective of this paper is self-supervised representation learning, with the goal of solving semi-supervised video object segmentation (a.k.a. dense tracking). We make the following contributions: (i) we improve the existing self-supervised approach with a simple yet more effective memory mechanism for long-term correspondence matching, which resolves the challenge caused by the disappearance and reappearance of objects; (ii) by augmenting the self-supervised approach with an online adaptation module, our method alleviates tracker drift caused by spatial-temporal discontinuity, e.g. occlusions, dis-occlusions, and fast motion; (iii) we explore the efficiency of self-supervised representation learning for dense tracking and, surprisingly, show that a powerful tracking model can be trained with as few as 100 raw video clips (a total duration of about 11 minutes), indicating that low-level statistics are already effective for tracking tasks; (iv) we demonstrate state-of-the-art results among self-supervised approaches on DAVIS-2017 and YouTube-VOS, surpassing most methods trained with millions of manual segmentation annotations and further bridging the gap between self-supervised and supervised learning.
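The abstract does not spell out the memory mechanism, but the core idea of propagating masks via feature correspondence against a memory bank can be illustrated with a minimal NumPy sketch. Everything below (function name, toy features, temperature value) is illustrative, not the paper's actual implementation:

```python
import numpy as np

def propagate_labels(query_feat, memory_feats, memory_masks, temperature=0.07):
    """Propagate segmentation labels from memory frames to the query frame.

    query_feat:   (N, C) L2-normalized per-pixel features of the query frame
    memory_feats: (M, C) features of all pixels stored in the memory bank
    memory_masks: (M, K) one-hot object labels for the memory pixels
    Returns (N, K) soft label predictions for the query pixels.
    """
    # Cosine-similarity affinity between query and memory pixels.
    affinity = query_feat @ memory_feats.T / temperature       # (N, M)
    # Softmax over the memory dimension (numerically stabilized).
    affinity -= affinity.max(axis=1, keepdims=True)
    weights = np.exp(affinity)
    weights /= weights.sum(axis=1, keepdims=True)
    # Each query pixel takes a weighted vote over the memory labels.
    return weights @ memory_masks                              # (N, K)

# Toy usage: 4 memory pixels, 2 objects, query features identical to memory.
feats = np.eye(4)                                       # unit-norm rows
masks = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
pred = propagate_labels(feats, feats, masks)
print(pred.argmax(axis=1))                              # [0 0 1 1]
```

Keeping memory frames from far in the past is what lets a reappearing object match back to its old features instead of drifting onto the background.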
Person re-identification (Re-ID) aims to match a target person across camera views at different locations and times. In this work, we focus on a much more difficult yet practical setting where person matching is conducted over long durations, e.g., over days and months, and therefore inevitably under the new challenge of changing clothes. This problem, termed Long-Term Cloth-Changing (LTCC) Re-ID, is much understudied due to the lack of large-scale datasets. The first contribution of this work is a new LTCC dataset containing people captured over a long period of time with frequent clothing changes. As a second contribution, we propose a novel Re-ID method specifically designed to address the cloth-changing challenge. Specifically, we consider that under cloth changes, soft biometrics such as body shape would be more reliable. We therefore introduce a shape embedding module as well as a cloth-elimination shape-distillation module, aiming to eliminate the now unreliable clothing appearance features and focus on the body shape information. Extensive experiments show that superior performance is achieved by the proposed model on the new LTCC dataset.
Building an intelligent defect segmentation system for textured images has attracted increasing attention in both the research and industrial communities, due to its significant value in practical applications of industrial inspection and quality control. It is desirable to learn a general deep-learning-based representation for the automatic segmentation of defects. Furthermore, relatively little work has studied efficiently extracting deep features in the frequency domain, which is nevertheless very important for understanding the patterns of textured images. In this paper, we propose a novel defect segmentation deep network – the Main-Secondary Network (MS-Net). Our MS-Net is trained to model features from both the spatial domain and the frequency domain, where the wavelet transform is utilized to extract discriminative information from the frequency domain. Extensive experiments show the effectiveness of our MS-Net.
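The abstract does not specify which wavelet MS-Net uses; as a rough, self-contained illustration of how a wavelet transform exposes frequency-domain structure, here is a single-level 2D Haar decomposition in NumPy (the function name and sub-band labeling convention are my own, not from the paper):

```python
import numpy as np

def haar2d(img):
    """Single-level 2D Haar wavelet transform.

    img: (H, W) array with even H and W.
    Returns the (LL, LH, HL, HH) sub-bands, each of shape (H/2, W/2).
    """
    s2 = np.sqrt(2.0)
    # Average/difference adjacent rows (vertical direction).
    lo = (img[0::2, :] + img[1::2, :]) / s2
    hi = (img[0::2, :] - img[1::2, :]) / s2
    # Average/difference adjacent columns (horizontal direction).
    ll = (lo[:, 0::2] + lo[:, 1::2]) / s2   # low-pass: coarse approximation
    lh = (lo[:, 0::2] - lo[:, 1::2]) / s2   # horizontal detail
    hl = (hi[:, 0::2] + hi[:, 1::2]) / s2   # vertical detail
    hh = (hi[:, 0::2] - hi[:, 1::2]) / s2   # diagonal detail
    return ll, lh, hl, hh

img = np.random.rand(64, 64)
bands = haar2d(img)
# The Haar transform is orthonormal, so the total energy is preserved.
energy = sum((b ** 2).sum() for b in bands)
assert np.isclose(energy, (img ** 2).sum())
```

The detail sub-bands highlight exactly the kind of periodic texture and its local disruptions that matter for defect segmentation; a natural design is to stack such sub-bands as extra input channels for a frequency branch.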
Calibrating narrow-field-of-view soccer cameras is challenging because there are very few field markings in the image. Unlike previous solutions, we propose a two-point method, which requires only two point correspondences given prior knowledge of the base location and orientation of a pan-tilt-zoom (PTZ) camera. We deploy this new calibration method to annotate pan-tilt-zoom data from soccer videos. The collected data are used as references for new images. We also propose a fast random forest method to predict pan-tilt angles without image-to-image feature matching, leading to an efficient calibration method for new images. We demonstrate our system on synthetic data and two real soccer datasets. Our two-point approach achieves superior performance over the state-of-the-art method.
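The paper's exact two-point solver is not reproduced here, but the counting argument can be sketched: with the camera center and base orientation known, only pan, tilt, and focal length remain unknown, and two point correspondences supply four equations for those three unknowns. The NumPy sketch below recovers them by Gauss-Newton on reprojection error; the rotation convention, the assumed identity base rotation, and the initial guess (seeded near the operating point, as a previous frame's estimate would be in practice) are my assumptions, not the paper's closed-form method:

```python
import numpy as np

def project(P, cam_center, pan, tilt, f, pp):
    """Project a world point through a PTZ camera (base rotation = identity)."""
    cp, sp, ct, st = np.cos(pan), np.sin(pan), np.cos(tilt), np.sin(tilt)
    Ry = np.array([[cp, 0, -sp], [0, 1, 0], [sp, 0, cp]])   # pan about y
    Rx = np.array([[1, 0, 0], [0, ct, st], [0, -st, ct]])   # tilt about x
    q = Rx @ Ry @ (P - cam_center)                          # camera coords
    return np.array([f * q[0] / q[2] + pp[0], f * q[1] / q[2] + pp[1]])

def calibrate_two_points(pts3d, pts2d, cam_center, pp, init):
    """Recover (pan, tilt, focal) from two 3D-2D point correspondences."""
    x = np.array(init, dtype=float)

    def residual(x):
        return np.concatenate([
            project(P, cam_center, x[0], x[1], x[2], pp) - uv
            for P, uv in zip(pts3d, pts2d)])

    for _ in range(50):                      # Gauss-Newton iterations
        r = residual(x)
        J = np.empty((r.size, 3))            # numeric Jacobian
        for j in range(3):
            dx = np.zeros(3)
            dx[j] = 1e-6 * max(1.0, abs(x[j]))
            J[:, j] = (residual(x + dx) - r) / dx[j]
        x += np.linalg.lstsq(J, -r, rcond=None)[0]
    return x

# Synthetic check: project two field points with known parameters, then recover.
C = np.array([0.0, 20.0, -40.0])             # known camera base location
pp = (640.0, 360.0)                          # principal point
true = (0.12, 0.35, 3000.0)                  # pan, tilt, focal (ground truth)
pts3d = [np.array([5.0, 0.0, 10.0]), np.array([8.0, 0.0, 30.0])]
pts2d = [project(P, C, *true, pp) for P in pts3d]
est = calibrate_two_points(pts3d, pts2d, C, pp, init=(0.0, 0.3, 2500.0))
assert np.allclose(est, true, rtol=1e-4, atol=1e-4)
```

With noise-free synthetic correspondences the solver recovers the true pan, tilt, and focal length; real images add noise, which is where the four-equations-for-three-unknowns redundancy helps.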