Abstract
Remote sensing cross-modal text-image retrieval (RSCTIR) is a flexible, human-centered approach to retrieving rich information across modalities, and it has attracted considerable attention in recent years. It remains challenging because current methods usually ignore the varying difficulty of different sample pairs, which stems from the large differences in image distribution and the high similarity among texts in the remote sensing (RS) field. Therefore, in this article, we propose an innovative hypersphere-based visual semantic alignment (HVSA) network via curriculum learning (CL). Specifically, we first design an adaptive alignment strategy based on CL, which aligns RS image-text pairs from easy to hard. Sample pairs with different levels of difficulty are treated unequally, and we obtain a better embedding representation when projecting the features onto the unit hypersphere. Then, to measure the robustness of cross-modal feature alignment on the unit hypersphere, we introduce the feature uniformity strategy, which reduces the occurrence of mismatching cases and improves generalization performance. Finally, we design the key-entity attention (KEA) mechanism to alleviate the problem of information imbalance among different modalities. KEA extracts information about the key entity that is aligned with the textual information. Despite its conciseness, our framework achieves state-of-the-art performance on classical datasets of RSCTIR tasks while enjoying faster inference. On the RSICD and RSITMD datasets, HVSA reaches a summed recall of 120.97 and 198.94, which is 2.50 and 10.49 points ahead of the current best methods, respectively. Extensive experiments demonstrate the competitiveness of our method. The code has been released at https://github.com/ZhangWeihang99/HVSA.
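To make the terms "unit hypersphere" and "summed recall" concrete, the following is a minimal NumPy sketch and not the released HVSA code. It rests on assumptions the abstract does not state: that summed recall is the sum of R@1, R@5, and R@10 in both retrieval directions, that similarity is the cosine of L2-normalized embeddings, and that each image has exactly one matching caption at the same index. The function names `l2_normalize`, `recall_at_k`, and `summed_recall` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Project each row vector onto the unit hypersphere (L2 norm of 1)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)  # eps guards against zero vectors


def recall_at_k(sim, k):
    """Percentage of queries whose ground-truth item ranks in the top k.

    sim[i, j] is the similarity between query i and candidate j.
    Assumption: the ground-truth candidate for query i is candidate i.
    """
    ground_truth = np.diag(sim)[:, None]
    # Rank = number of candidates scoring strictly higher than the ground truth.
    ranks = (sim > ground_truth).sum(axis=1)
    return 100.0 * np.mean(ranks < k)


def summed_recall(img_emb, txt_emb, ks=(1, 5, 10)):
    """Sum of R@k over both directions (assumed definition of summed recall)."""
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    sim = img @ txt.T  # cosine similarity, since both sides have unit norm
    image_to_text = [recall_at_k(sim, k) for k in ks]
    text_to_image = [recall_at_k(sim.T, k) for k in ks]
    return sum(image_to_text) + sum(text_to_image)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))
    # Toy captions: noisy copies of the image embeddings.
    captions = images + 0.1 * rng.normal(size=(8, 16))
    print(summed_recall(images, captions))
```

Under this assumed definition the maximum score is 600, reached when every query ranks its own match first in both directions. The benchmark datasets pair several captions with each image, so a faithful evaluation would need a ground-truth mapping rather than the diagonal used here.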
Original language | English (US) |
---|---|
Article number | 5621815 |
Journal | IEEE Transactions on Geoscience and Remote Sensing |
Volume | 61 |
State | Published - 2023 |
Keywords
- Adaptive alignment strategy
- curriculum learning (CL)
- feature uniformity
- hypersphere
- key-entity attention (KEA)
- remote sensing cross-modal text-image retrieval (RSCTIR)
ASJC Scopus subject areas
- Electrical and Electronic Engineering
- General Earth and Planetary Sciences