Boosting Generic Visual-Linguistic Representation with Dynamic Contexts

Guoqing Ma, Yalong Bai, Wei Zhang, Ting Yao, Basem Shihada, Tao Mei

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Pretraining large models on large-scale multi-modal corpora has accelerated the development of visual-linguistic (VL) representation learning and achieved great success on various vision-and-language downstream tasks. These models are usually trained by predicting randomly masked words in captions or patches in images. Such approaches, nevertheless, seldom exploit the causalities behind caption descriptions or the unfolding of events beyond still images. In this work, we endow pretrained models with high-level cognition by delving into dynamic contexts to model visual and linguistic causalities uniformly. Specifically, we format the dynamic contexts of an image as the sentences describing the events before, on, and after the image. Unlike traditional caption-wise similarity, we propose a novel dynamic contexts-based similarity (DCS) metric, in which the correlation of potential causes and effects, beyond the immediate visual content, is considered to measure the relevance among images. DCS can be further simplified by parameterizing event continuity to relax the requirement for dense contextual event annotations. A new pre-task is designed to minimize the feature distances of dynamically contextual relevant images and to incorporate event causality and commonsense knowledge into VL representation learning. Models based on our dynamic contexts significantly outperform typical VL models on multiple cross-modal downstream tasks, including conventional visual commonsense reasoning (VCR), visual question answering (VQA), zero-shot image-text retrieval, and extended image/event ordering tasks.
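To illustrate the idea of a dynamic contexts-based similarity, the sketch below compares two images by their before/on/after event descriptions rather than by a single caption. This is a hypothetical toy implementation (bag-of-words cosine with illustrative weights), not the paper's actual DCS metric, which operates on learned VL features.

```python
import math
from collections import Counter


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def _bow(text: str) -> Counter:
    """Naive bag-of-words: lowercase whitespace tokens."""
    return Counter(text.lower().split())


def dynamic_context_similarity(ctx_a: dict, ctx_b: dict,
                               weights=(0.25, 0.5, 0.25)) -> float:
    """Toy DCS: weighted similarity over the 'before', 'on', and 'after'
    event descriptions of two images. The weights are illustrative
    assumptions, not values from the paper."""
    keys = ("before", "on", "after")
    return sum(w * _cosine(_bow(ctx_a[k]), _bow(ctx_b[k]))
               for w, k in zip(weights, keys))
```

Two images whose surrounding events align (similar causes and effects) score high even if their "on" captions differ slightly, which is the intuition behind measuring relevance through causal context rather than immediate visual content alone.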
Original language: English (US)
Pages (from-to): 1-13
Number of pages: 13
Journal: IEEE Transactions on Multimedia
DOIs
State: Published - Jan 18 2023

ASJC Scopus subject areas

  • Media Technology
  • Signal Processing
  • Computer Science Applications
  • Electrical and Electronic Engineering
