TY - JOUR
T1 - Global Visual Feature and Linguistic State Guided Attention for Remote Sensing Image Captioning
AU - Zhang, Zhengyuan
AU - Zhang, Wenkai
AU - Yan, Menglong
AU - Gao, Xin
AU - Fu, Kun
AU - Sun, Xian
N1 - Generated from Scopus record by KAUST IRTS on 2023-09-21
PY - 2022/1/1
Y1 - 2022/1/1
N2 - The encoder-decoder framework is prevalent in existing remote-sensing image captioning (RSIC) models, and the introduction of attention mechanisms has brought significant improvements. However, current attention-based caption models only model the relationships among local features, without introducing the global visual feature or removing redundant feature components. This causes caption models to generate descriptive sentences that are only weakly related to the image scene. To address these problems, this article proposes a global visual feature-guided attention (GVFGA) mechanism. First, GVFGA introduces the global visual feature and fuses it with the local visual features to model the relationships between them. Second, an attention gate utilizing the global visual feature is proposed in GVFGA to filter out redundant components of the fused image features and provide more salient image features. In addition, to relieve the burden on the hidden state, a linguistic state (LS) is proposed to provide textual features specifically, so that the hidden state only guides the visual-textual attention process. Furthermore, to refine the fusion of visual and textual features, an LS-guided attention (LSGA) mechanism is proposed, which also filters out irrelevant information in the fused visual-textual feature with the help of an attention gate. Experimental results show that the proposed image captioning model achieves better results on three RSIC datasets: UCM-Captions, Sydney-Captions, and RSICD.
AB - The encoder-decoder framework is prevalent in existing remote-sensing image captioning (RSIC) models, and the introduction of attention mechanisms has brought significant improvements. However, current attention-based caption models only model the relationships among local features, without introducing the global visual feature or removing redundant feature components. This causes caption models to generate descriptive sentences that are only weakly related to the image scene. To address these problems, this article proposes a global visual feature-guided attention (GVFGA) mechanism. First, GVFGA introduces the global visual feature and fuses it with the local visual features to model the relationships between them. Second, an attention gate utilizing the global visual feature is proposed in GVFGA to filter out redundant components of the fused image features and provide more salient image features. In addition, to relieve the burden on the hidden state, a linguistic state (LS) is proposed to provide textual features specifically, so that the hidden state only guides the visual-textual attention process. Furthermore, to refine the fusion of visual and textual features, an LS-guided attention (LSGA) mechanism is proposed, which also filters out irrelevant information in the fused visual-textual feature with the help of an attention gate. Experimental results show that the proposed image captioning model achieves better results on three RSIC datasets: UCM-Captions, Sydney-Captions, and RSICD.
UR - https://ieeexplore.ieee.org/document/9632558/
UR - http://www.scopus.com/inward/record.url?scp=85120572976&partnerID=8YFLogxK
U2 - 10.1109/TGRS.2021.3132095
DO - 10.1109/TGRS.2021.3132095
M3 - Article
SN - 1558-0644
VL - 60
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
ER -