Without detection: Two-step clustering features with local–global attention for image captioning

Xuan Li, Wenkai Zhang, Xian Sun, Xin Gao

Research output: Contribution to journal › Article › peer-review
Abstract

Current image captioning methods usually integrate an object detection network to obtain image features at the level of objects and other salient regions. However, the detection network must be pre-trained independently on additional data, so its use imposes higher training costs on the overall captioning model, mainly due to the demand for extra training data and computing resources. In this work, the authors propose a local–global attention model based on two-step clustering features for image captioning. The two-step clustering features can be obtained at a relatively low cost and can represent objects and other salient image regions. To help the model perceive the image better, the authors introduce a novel local–global attention mechanism that analyses the clustering features from local to global perspectives at each time step, giving the model a better understanding of the image content. The authors evaluate the proposed method on the MSCOCO test server, achieving BLEU-4/METEOR/ROUGE-L scores of 36.8, 27.4, and 57.2, respectively. While reducing training costs, the authors' model achieves results close to those of models using detection features.
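The abstract only outlines the pipeline; the paper's exact clustering and attention formulations are not given here. The sketch below is a minimal, hypothetical illustration of the two ideas it names: clustering grid features twice (fine "local" centroids, then coarse "global" centroids) and attending from the local to the global features at each decoding step. The choice of k-means, the dot-product attention, and all names, dimensions, and cluster counts are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of "two-step clustering features" with
# "local-global attention". Assumptions (not from the paper):
# k-means for both clustering steps, plain dot-product attention,
# and made-up dimensions/cluster counts.
import numpy as np
from sklearn.cluster import KMeans


def two_step_cluster(grid_feats, n_local=16, n_global=4):
    """Step 1: cluster CNN grid features into fine 'local' centroids.
    Step 2: cluster those centroids into coarse 'global' centroids."""
    local = KMeans(n_clusters=n_local, n_init=10).fit(grid_feats).cluster_centers_
    global_ = KMeans(n_clusters=n_global, n_init=10).fit(local).cluster_centers_
    return local, global_


def attend(query, keys):
    """Dot-product attention: softmax weights over `keys`, weighted sum."""
    scores = keys @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ keys


def local_global_step(hidden, local_feats, global_feats):
    """One decoding step: attend over the local clusters first, then use
    the local context to refine the query before the global clusters."""
    local_ctx = attend(hidden, local_feats)
    global_ctx = attend(hidden + local_ctx, global_feats)
    return np.concatenate([local_ctx, global_ctx])


# Toy usage: 49 grid cells (a 7x7 feature map) with 512-d features.
rng = np.random.default_rng(0)
grid = rng.standard_normal((49, 512))
local_c, global_c = two_step_cluster(grid)
context = local_global_step(rng.standard_normal(512), local_c, global_c)
print(context.shape)  # (1024,): concatenated local + global context
```

In this sketch the decoder would consume `context` at every time step; how the paper actually fuses the two context vectors is not specified in the abstract.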
Original language: English (US)
Pages (from-to): 280-294
Number of pages: 15
Journal: IET Computer Vision
Volume: 16
Issue number: 3
DOIs
State: Published - Apr 1 2022
Externally published: Yes

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition