TY - JOUR
T1 - VAA: Visual aligning attention model for remote sensing image captioning
AU - Zhang, Zhengyuan
AU - Zhang, Wenkai
AU - Diao, Wenhui
AU - Yan, Menglong
AU - Gao, Xin
AU - Sun, Xian
N1 - Generated from Scopus record by KAUST IRTS on 2023-09-21
PY - 2019/1/1
Y1 - 2019/1/1
N2 - Owing to its effectiveness in selectively focusing on regions of interest in images, the attention mechanism has been widely used in image captioning, where it provides more accurate image information for training deep sequential models. Existing attention-based models typically rely on a top-down attention mechanism. While somewhat effective, the attention masks in these models are queried from image features by the hidden states of an LSTM rather than optimized by the objective function. This indirectly supervised training approach cannot ensure that the attention layers accurately focus on regions of interest. To address this issue, this paper proposes a novel attention model, the Visual Aligning Attention model (VAA), in which the attention layer is optimized by a well-designed visual aligning loss during training. The visual aligning loss is obtained by explicitly computing the feature similarity between the attended image features and the corresponding word embedding vectors. In addition, to eliminate the influence of non-visual words in training the attention layer, a visual vocabulary is proposed to filter out non-visual words in sentences, so that these words are ignored when calculating the visual aligning loss. Experiments on the UCM-Captions and Sydney-Captions datasets show that the proposed method is more effective for remote sensing image captioning.
AB - Owing to its effectiveness in selectively focusing on regions of interest in images, the attention mechanism has been widely used in image captioning, where it provides more accurate image information for training deep sequential models. Existing attention-based models typically rely on a top-down attention mechanism. While somewhat effective, the attention masks in these models are queried from image features by the hidden states of an LSTM rather than optimized by the objective function. This indirectly supervised training approach cannot ensure that the attention layers accurately focus on regions of interest. To address this issue, this paper proposes a novel attention model, the Visual Aligning Attention model (VAA), in which the attention layer is optimized by a well-designed visual aligning loss during training. The visual aligning loss is obtained by explicitly computing the feature similarity between the attended image features and the corresponding word embedding vectors. In addition, to eliminate the influence of non-visual words in training the attention layer, a visual vocabulary is proposed to filter out non-visual words in sentences, so that these words are ignored when calculating the visual aligning loss. Experiments on the UCM-Captions and Sydney-Captions datasets show that the proposed method is more effective for remote sensing image captioning.
UR - https://ieeexplore.ieee.org/document/8843891/
UR - http://www.scopus.com/inward/record.url?scp=85077815756&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2019.2942154
DO - 10.1109/ACCESS.2019.2942154
M3 - Article
SN - 2169-3536
VL - 7
SP - 137355
EP - 137364
JO - IEEE Access
JF - IEEE Access
ER -