Abstract
Current image captioning methods usually integrate an object detection network to obtain image features at the level of objects and other salient regions. However, the detection network must be pre-trained independently on additional data, and this demand for extra training data and computing resources imposes higher training costs on the overall captioning model. In this work, the authors propose a local–global attention model based on two-step clustering features for image captioning. The two-step clustering features can be obtained at a relatively low cost while still representing objects and other salient image regions. To help the model perceive the image better, the authors introduce a novel local–global attention mechanism: at each time step, the model analyses the clustering features from local to global perspectives, which improves its understanding of the image content. The authors evaluate the proposed method on the MSCOCO test server, achieving BLEU-4/METEOR/ROUGE-L scores of 36.8, 27.4, and 57.2, respectively. While reducing training costs, the authors' model achieves results close to those of models that use detection features.
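The abstract only outlines the mechanism; the paper gives the full formulation. As a rough, hedged illustration of the general idea rather than the authors' exact method, the sketch below shows one way a decoder step could attend over clustered image features (a "local" view) and fuse the result with a global image feature. All names, dimensions, and the fusion strategy are assumptions introduced here for illustration.

```python
# Hypothetical sketch of a local-global attention step over clustered image
# features. The module names, dimensions, and fusion strategy are assumptions,
# not the authors' exact formulation. Cluster features could, for example, be
# obtained by k-means over CNN grid features (the "two-step clustering" in the
# paper is not reproduced here).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalGlobalAttention(nn.Module):
    """Attend over per-cluster (local) features, then fuse with a global feature."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.local_proj = nn.Linear(feat_dim, hidden_dim)
        self.state_proj = nn.Linear(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)
        self.fuse = nn.Linear(feat_dim * 2, hidden_dim)

    def forward(self, cluster_feats, global_feat, decoder_state):
        # cluster_feats: (batch, num_clusters, feat_dim) -- "local" view
        # global_feat:   (batch, feat_dim)               -- "global" view
        # decoder_state: (batch, hidden_dim)             -- caption decoder state
        scores = self.score(torch.tanh(
            self.local_proj(cluster_feats) +
            self.state_proj(decoder_state).unsqueeze(1)))   # (B, K, 1)
        weights = F.softmax(scores, dim=1)                   # attention over clusters
        local_ctx = (weights * cluster_feats).sum(dim=1)     # (B, feat_dim)
        # Fuse the attended local context with the global image feature.
        return torch.tanh(self.fuse(torch.cat([local_ctx, global_feat], dim=-1)))


# Toy usage: 36 cluster features of dimension 512, batch of 2.
attn = LocalGlobalAttention(feat_dim=512, hidden_dim=256)
clusters = torch.randn(2, 36, 512)
global_feat = clusters.mean(dim=1)   # crude global feature: mean over clusters
state = torch.randn(2, 256)
context = attn(clusters, global_feat, state)
print(context.shape)  # torch.Size([2, 256])
```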
| Original language | English (US) |
|---|---|
| Pages (from-to) | 280-294 |
| Number of pages | 15 |
| Journal | IET Computer Vision |
| Volume | 16 |
| Issue number | 3 |
| DOIs | |
| State | Published - Apr 1 2022 |
| Externally published | Yes |
ASJC Scopus subject areas
- Software
- Computer Vision and Pattern Recognition