Semantic-meshed and content-guided transformer for image captioning

Xuan Li, Wenkai Zhang, Xian Sun, Xin Gao

Research output: Contribution to journal › Article › peer-review

5 Scopus citations

Abstract

The transformer architecture has become the dominant framework for image captioning because of its superior performance. However, existing transformer-based methods often fail to make integrated use of multi-level semantic information and are weak at keeping captions relevant to the image. In this paper, a semantic-meshed and content-guided transformer network is introduced for image captioning to address these problems. The semantic-meshed mechanism allows the model to generate words by adaptively selecting semantic information from multiple interaction levels through attention-based reconstruction. The content-guided module guides word generation using attribute features that represent the image content, with the aim of keeping the generated caption consistent with the main content of the image. Experiments on the MSCOCO captioning dataset validate the authors' model, which achieves superior results compared to other state-of-the-art approaches.
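
The idea of adaptively selecting among several levels of semantic information can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the architecture, so the function names (`attend`, `mix_levels`), the toy feature vectors, and the choice to weight levels by a softmax over query-context similarity are all assumptions made for illustration only.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, features):
    # Scaled dot-product attention of one query vector over a list of vectors.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in features])
    return [sum(w * f[i] for w, f in zip(weights, features))
            for i in range(len(query))]

def mix_levels(query, levels):
    # Hypothetical level selection: attend to each level separately, then
    # weight the per-level contexts by their similarity to the query.
    contexts = [attend(query, feats) for feats in levels]
    scale = math.sqrt(len(query))
    level_weights = softmax([dot(query, c) / scale for c in contexts])
    mixed = [sum(w * c[i] for w, c in zip(level_weights, contexts))
             for i in range(len(query))]
    return mixed, level_weights

# Toy inputs (made up): one decoder query and two levels of encoder features.
query = [1.0, 0.0]
levels = [
    [[1.0, 0.0], [0.0, 1.0]],  # hypothetical lower-level features
    [[0.5, 0.5], [0.2, 0.8]],  # hypothetical higher-level features
]
mixed, level_weights = mix_levels(query, levels)
```

In this toy setup the level weights sum to one, and the level whose attended context is more similar to the query receives the larger weight. A trained model would learn projections for the query, keys, and values rather than use raw vectors.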
Original language: English (US)
Pages (from-to): 431-444
Number of pages: 14
Journal: IET Computer Vision
Volume: 16
Issue number: 5
State: Published - Aug 1 2022
Externally published: Yes

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition
