In this paper, we focus on a realistic yet challenging task, Single Domain Generalization Object Detection (S-DGOD), where only one source domain’s data can be used for training object detectors, which must nevertheless generalize to multiple distinct target domains. In S-DGOD, both high-capacity fitting and generalization abilities are needed due to the task’s complexity. Differentiable Neural Architecture Search (NAS) is known for its high capacity for complex data fitting, and we propose to leverage Differentiable NAS to solve S-DGOD. However, it may confront severe over-fitting issues due to the feature imbalance phenomenon, where parameters optimized by gradient descent are biased to learn from the easy-to-learn features, which are usually non-causal and spuriously correlated to ground truth labels, such as the features of background in object detection data. Consequently, this leads to serious performance degradation, especially when generalizing to unseen target domains that differ greatly from the source domain. To address this issue, we propose the Generalizable loss (G-loss), an OoD-aware objective that prevents NAS from over-fitting by using gradient descent to optimize parameters not only on a subset of easy-to-learn features but also on the remaining predictive features for generalization. The overall framework is named G-NAS. Experimental results on the S-DGOD urban-scene datasets demonstrate that the proposed G-NAS achieves SOTA performance compared to baseline methods.
ICLR
Object Detection with OOD Generalizable Neural Architecture Search
Fan Wu, Kaican Li, Jinling Gao, and 5 more authors
To improve Out-of-Distribution (OOD) generalization in object detection, we present a Neural Architecture Search (NAS) framework guided by feature orthogonalization. We believe that the failure to generalize on OOD data is due to spurious correlations between category-related features and context-related features. The category-related features describe the causal information for predicting the target objects, such as “a car with four wheels”, while the context-related features describe the non-causal information, such as “a car driving at night”. However, due to the distinct data distributions of the training and testing sets, the context-related features are often mistaken for causal information. To address this, we aim to automatically discover an optimal architecture that can disentangle the category-related features and the context-related features with a novel weight-based detector head. Both theoretical and experimental results show that the proposed scheme can achieve disentanglement and better performance on both IID and OOD data.
Journal
AGNet: Automatic generation network for skin imaging reports
Fan Wu, Haiqiong Yang, Linlin Peng, and 5 more authors
Medical imaging has been increasingly adopted in the process of medical diagnosis, especially for skin diseases, where diagnoses based on skin pathology are extremely accurate. The diagnostic reports of skin pathology images have the distinguishing features of extreme repetitiveness and rigid formatting. However, reports written by inexperienced radiologists and pathologists can have a high error rate, and even experienced clinicians can find the reporting task both tedious and time-consuming. To address this challenge, this paper studies the automatic generation of diagnostic reports based on images of skin pathologies. We propose a novel deep learning-based image captioning framework, the automatic generation network (AGNet), which is an effective network for the automatic generation of skin imaging reports. The proposed AGNet consists of four parts: (1) the image model that extracts features and classifies images; (2) the language model that encodes data and generates words using comprehensible language; (3) the attention module that connects the “tail” of the image model and the “head” of the language model, and computes the relationship between images and captions; (4) the embedding and labeling module that processes the input caption data. In a case study, AGNet is verified on a skin pathological image dataset and compared with several state-of-the-art models. The results show that AGNet achieves the highest scores on the image captioning evaluation metrics among all compared models, demonstrating the promising performance of the proposed method.
KBS
A robust end-to-end deep learning framework for detecting Martian landforms with arbitrary orientations
Shancheng Jiang, Fan Wu, Kai-Leung Yung, and 4 more authors
With increasingly massive amounts of high-resolution images of Mars, automated detection of geological landforms on Mars has received widespread interest. It is significant for acquiring knowledge of distant planetary surfaces and processes, and for manifold onboard applications such as spacecraft motion estimation and obstacle avoidance. This is a challenging task, not only because of the multiple sizes of targets and complex image backgrounds, but also because of the various orientations of some bar-shaped landforms in satellite images captured from the top view. The existing methods for directed landform detection require several pre- or post-processing operations to extract possible regions of interest and final detection results with orientation, which are very time-consuming. In this paper, a new end-to-end deep learning framework is developed for detecting arbitrarily-directed landforms. This framework, named Rotated-SSD (Single Shot MultiBox Detector, SSD), can locate and identify different landforms on Mars in one pass by using a rotatable anchor-box-based mechanism. To enhance its robustness against angle variation of the targets and complex backgrounds, a new efficient match strategy is proposed for anchoring default boxes to ground truth boxes in the model training process, and an autoencoder-based unsupervised pre-training operation is introduced to improve both the model training and inference performance. The proposed framework is tested for detection of bar-shaped buttes and dark slope streaks on satellite images. The detection results show that our framework can significantly contribute to onboard motion estimation systems. The comparative results demonstrate that the proposed match strategy outperforms other state-of-the-art match strategies with regard to model training efficiency and prediction accuracy. The pre-training strategy can facilitate the training of deep architectures when available training data are limited.