Global Visual Feature and Linguistic State Guided Attention for Remote Sensing Image Captioning

Zhengyuan Zhang, Wenkai Zhang, Menglong Yan, Xin Gao, Kun Fu, Xian Sun

Research output: Contribution to journal › Article › peer-review


Abstract

The encoder-decoder framework is prevalent in existing remote-sensing image captioning (RSIC) models, and the introduction of attention mechanisms has brought significant improvements. However, current attention-based captioning models only build relationships among local features, without introducing the global visual feature or removing redundant feature components. This causes captioning models to generate descriptive sentences that are only weakly related to the scene of an image. To solve these problems, this article proposes a global visual feature-guided attention (GVFGA) mechanism. First, GVFGA introduces the global visual feature and fuses it with the local visual features to build the relationships between them. Second, an attention gate driven by the global visual feature is proposed in GVFGA to filter out redundant components of the fused image features and provide more salient image features. In addition, to relieve the burden on the hidden state, a linguistic state (LS) is proposed to provide textual features specifically, so that the hidden state only guides the visual-textual attention process. Furthermore, to refine the fusion of visual and textual features, an LS-guided attention (LSGA) mechanism is proposed; it likewise filters out irrelevant information in the fused visual-textual feature with the help of an attention gate. Experimental results show that the proposed image captioning model achieves better results on three RSIC datasets: UCM-Captions, Sydney-Captions, and RSICD.
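To make the gated-attention idea concrete, the sketch below shows one plausible reading of it in PyTorch: local region features are fused with a pooled global feature, attention weights are computed over the fused features, and a sigmoid gate conditioned on the global feature suppresses redundant components. All layer names, dimensions, and the exact gating formulation are illustrative assumptions, not the paper's actual GVFGA implementation.

```python
# A minimal sketch of global-feature-guided gated attention, assuming a
# PyTorch setup. Layer sizes, names, and the gating form are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalGuidedAttention(nn.Module):
    """Fuse a global visual feature with local features, then apply an
    attention gate driven by the global feature (a rough analogue of the
    GVFGA mechanism described in the abstract)."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.local_proj = nn.Linear(feat_dim, hidden_dim)
        self.global_proj = nn.Linear(feat_dim, hidden_dim)
        self.attn_score = nn.Linear(hidden_dim, 1)
        self.gate = nn.Linear(hidden_dim, hidden_dim)  # attention gate

    def forward(self, local_feats, global_feat):
        # local_feats: (B, N, feat_dim) region-level features
        # global_feat: (B, feat_dim) pooled scene-level feature
        fused = torch.tanh(
            self.local_proj(local_feats)
            + self.global_proj(global_feat).unsqueeze(1)
        )                                                # (B, N, hidden_dim)
        weights = F.softmax(self.attn_score(fused).squeeze(-1), dim=1)
        attended = (weights.unsqueeze(-1) * fused).sum(dim=1)  # (B, hidden_dim)
        # Sigmoid gate conditioned on the global feature filters out
        # redundant components of the attended feature.
        g = torch.sigmoid(self.gate(self.global_proj(global_feat)))
        return g * attended
```

The abstract's LSGA mechanism could follow the same pattern, with the linguistic state taking the role of the guiding feature over fused visual-textual features.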
Original language: English (US)
Journal: IEEE Transactions on Geoscience and Remote Sensing
Volume: 60
DOIs
State: Published - Jan 1 2022
Externally published: Yes

