The encoder-decoder framework is prevalent in existing remote-sensing image captioning (RSIC) models, and the introduction of attention mechanisms has brought significant improvements. However, current attention-based captioning models only model the relationships between local features, without introducing the global visual feature or removing redundant feature components. As a result, they tend to generate descriptive sentences that are only weakly related to the scene in the image. To address these problems, this article proposes a global visual feature-guided attention (GVFGA) mechanism. First, GVFGA introduces the global visual feature and fuses it with the local visual features to model the relationships between them. Second, GVFGA includes an attention gate that uses the global visual feature to filter out redundant components in the fused image features and provide more salient image features. In addition, to relieve the burden on the hidden state, a linguistic state (LS) is proposed that is dedicated to providing textual features, so that the hidden state only guides the visual-textual attention process. Moreover, to further refine the fusion of visual and textual features, an LS-guided attention (LSGA) mechanism is proposed, which also uses an attention gate to filter out irrelevant information in the fused visual-textual feature. Experimental results show that the proposed image captioning model achieves better results on three RSIC datasets: UCM-Captions, Sydney-Captions, and RSICD.
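To make the idea of global-feature-guided attention with a gate more concrete, the following is a minimal sketch in plain Python. It is not the article's formulation: the abstract gives no equations, so the mean-pooled global feature, the additive fusion, the scaled dot-product scoring, the per-dimension sigmoid gate, and the names `global_guided_attention` and `gate_weights` are all assumptions made for illustration.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def global_guided_attention(local_feats, hidden, gate_weights):
    """Hypothetical sketch of gated, global-feature-guided attention."""
    k = len(local_feats)
    d = len(local_feats[0])
    # Assumption: the global feature is the mean of the local features.
    global_feat = [sum(v[j] for v in local_feats) / k for j in range(d)]
    # Assumption: fusion is element-wise addition of global and local features.
    fused = [[v[j] + global_feat[j] for j in range(d)] for v in local_feats]
    # Attention weights: scaled dot product between fused features and hidden state.
    scores = [dot(f, hidden) / math.sqrt(d) for f in fused]
    alphas = softmax(scores)
    # Attended feature: attention-weighted sum of the fused features.
    context = [sum(a * f[j] for a, f in zip(alphas, fused)) for j in range(d)]
    # Assumption: the gate is a per-dimension sigmoid driven by the global feature.
    gate = [sigmoid(gate_weights[j] * global_feat[j]) for j in range(d)]
    # The gate scales down feature components, standing in for redundancy filtering.
    gated = [g * c for g, c in zip(gate, context)]
    return alphas, gated


# Toy usage: three 2-dimensional local features (made-up values).
local_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, gated = global_guided_attention(
    local_feats, hidden=[0.5, -0.5], gate_weights=[1.0, 1.0]
)
```

In a trained model the fusion, scoring, and gate would use learned projection matrices rather than the fixed operations shown here; the sketch only shows the order of the steps: fuse, attend, then gate.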
|Original language|English (US)|
|Journal|IEEE Transactions on Geoscience and Remote Sensing|
|State|Published - Jan 1 2022|