TY - JOUR
T1 - Reasoning like Humans: On Dynamic Attention Prior in Image Captioning
AU - Wang, Yong
AU - Sun, Xian
AU - Li, Xuan
AU - Zhang, Wenkai
AU - Gao, Xin
N1 - Generated from Scopus record by KAUST IRTS on 2023-09-21
PY - 2021/9/27
Y1 - 2021/9/27
AB - Attention-based models have been widely used in image captioning. Nevertheless, most conventional deep attention models perform attention operations for each block/step independently, which neglects the prior knowledge obtained in previous steps. In this paper, we propose a novel method, DYnamic Attention PRior (DY-APR), which combines both attention distribution prior and local linguistic context for caption generation. Like human beings, DY-APR can gradually shift its attention from a multitude of objects to the one of keen interest when coping with an image of a complex scene. DY-APR first captures coarse information and then explicitly updates attention weights step by step. In addition, DY-APR fully leverages local linguistic context from the previous tokens, that is, it capitalizes on local information when performing global attention, which we refer to as “local–global attention”. We show that the prior knowledge from previous steps provides meaningful semantic information, serving as guidance to build more accurate attention for later layers. Experiments on the MS-COCO dataset demonstrate the effectiveness of DY-APR, leading to a CIDEr-D improvement of 2.32% with less than 0.2% additional FLOPs and parameters.
UR - https://linkinghub.elsevier.com/retrieve/pii/S095070512100575X
UR - http://www.scopus.com/inward/record.url?scp=85110580439&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2021.107313
DO - 10.1016/j.knosys.2021.107313
M3 - Article
SN - 0950-7051
VL - 228
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
ER -