VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Jun Chen, Han Guo, Kai Yi, Boyang Li, Mohamed Elhoseiny

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

74 Scopus citations

Abstract

The limited availability of annotated data often hinders real-world applications of machine learning. To efficiently learn from small quantities of multimodal data, we leverage the linguistic knowledge from a large pre-trained language model (PLM) and quickly adapt it to new domains of image captioning. To effectively utilize a pretrained model, it is critical to balance the visual input and prior linguistic knowledge from pretraining. We propose VisualGPT, which employs a novel self-resurrecting encoder-decoder attention mechanism to quickly adapt the PLM with a small amount of in-domain image-text data. The proposed self-resurrecting activation unit produces sparse activations that prevent accidental overwriting of linguistic knowledge. When trained on 0.1%, 0.5% and 1% of the respective training sets, VisualGPT surpasses the best baseline by up to 10.0% CIDEr on MS COCO [43] and 17.9% CIDEr on Conceptual Captions [63]. Furthermore, VisualGPT achieves the state-of-the-art result on IU X-ray [15], a medical report generation dataset. Our code is available at https://github.com/Vision-CAIR/VisualGPT.

Original languageEnglish (US)
Title of host publicationProceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
PublisherIEEE Computer Society
Pages18009-18019
Number of pages11
ISBN (Electronic)9781665469463
DOIs
StatePublished - 2022
Event2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 - New Orleans, United States
Duration: Jun 19 2022Jun 24 2022

Publication series

NameProceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume2022-June
ISSN (Print)1063-6919

Conference

Conference2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Country/TerritoryUnited States
CityNew Orleans
Period06/19/2206/24/22

Keywords

  • Efficient learning and inferences
  • Representation learning
  • Transfer/low-shot/long-tail learning
  • Vision + language

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition

Fingerprint

Dive into the research topics of 'VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning'. Together they form a unique fingerprint.

Cite this