Reasoning like Humans: On Dynamic Attention Prior in Image Captioning

Yong Wang, Xian Sun, Xuan Li, Wenkai Zhang, Xin Gao

Research output: Contribution to journal › Article › peer-review


Abstract

Attention-based models have been widely used in image captioning. Nevertheless, most conventional deep attention models perform attention operations for each block/step independently, neglecting the prior knowledge obtained in previous steps. In this paper, we propose a novel method, DYnamic Attention PRior (DY-APR), which combines attention distribution priors with local linguistic context for caption generation. Like a human, DY-APR gradually shifts its attention from a multitude of objects to the one of keen interest when coping with an image of a complex scene: it first captures rough information and then explicitly updates the attention weights step by step. Moreover, DY-APR fully leverages local linguistic context from the preceding tokens, capitalizing on local information while performing global attention, which we refer to as "local–global attention". We show that the prior knowledge from previous steps provides meaningful semantic information that serves as guidance for building more accurate attention in later layers. Experiments on the MS-COCO dataset demonstrate the effectiveness of DY-APR, which improves CIDEr-D by 2.32% with less than 0.2% additional FLOPs and parameters.
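The abstract suggests a simple mechanism: each attention layer starts from the distribution produced by the previous one instead of attending from scratch. The PyTorch sketch below illustrates that idea under our own assumptions; the module name, the learned scalar gate, and the convex blending rule are illustrative choices, not the authors' published implementation.

```python
# Minimal sketch of an "attention prior" between stacked attention layers.
# Assumption: each layer mixes its freshly computed attention distribution
# with the distribution handed down from the previous layer, so attention
# sharpens gradually rather than being recomputed independently per block.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicPriorAttention(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Learned scalar gate controlling how much of the previous layer's
        # attention distribution is carried over (a design assumption).
        self.gate = nn.Parameter(torch.tensor(0.5))
        self.scale = dim ** -0.5

    def forward(self, query, memory, prior=None):
        # query:  (B, T, D) partial-caption token states
        # memory: (B, N, D) image region features
        # prior:  (B, T, N) attention distribution from the previous layer
        q, k, v = self.q_proj(query), self.k_proj(memory), self.v_proj(memory)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, T, N)
        if prior is not None:
            # Blend current attention with the inherited prior.
            g = torch.sigmoid(self.gate)
            attn = g * attn + (1.0 - g) * prior
        # Return the distribution so it can serve as the next layer's prior.
        return attn @ v, attn


# Usage: stack layers and thread each attention map into the next as the prior.
layers = nn.ModuleList([DynamicPriorAttention(512) for _ in range(3)])
x = torch.randn(2, 10, 512)        # token states for a partial caption
regions = torch.randn(2, 36, 512)  # detected image region features
prior = None
for layer in layers:
    x, prior = layer(x, regions, prior)
```

Because the prior enters only as a convex mixture of already-computed softmax weights, a sketch like this adds just one scalar parameter per layer, which is consistent with the abstract's claim of under 0.2% extra FLOPs and parameters.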
Original language: English (US)
Journal: Knowledge-Based Systems
Volume: 228
State: Published - Sep 27, 2021
Externally published: Yes

ASJC Scopus subject areas

  • Management Information Systems
  • Artificial Intelligence
  • Software
  • Information Systems and Management

