TY - GEN
T1 - Improving Intra- and Inter-Modality Visual Relation for Image Captioning
AU - Wang, Yong
AU - Zhang, Wen Kai
AU - Liu, Qing
AU - Zhang, Zhengyuan
AU - Gao, Xin
AU - Sun, Xian
N1 - Generated from Scopus record by KAUST IRTS on 2023-09-21
PY - 2020/10/12
Y1 - 2020/10/12
N2 - It is widely accepted that capturing relationships among multi-modality features is helpful for representing and ultimately describing an image. In this paper, we present a novel Intra- and Inter-modality visual Relation Transformer, termed I2RT, to improve connections among visual features. First, we propose a Relation Enhanced Transformer Block (RETB) for image feature learning, which strengthens intra-modality visual relations among objects. Moreover, to bridge the gap between inter-modality feature representations, we align them explicitly via a Visual Guided Alignment (VGA) module. Finally, an end-to-end formulation is adopted to train the whole model jointly. Experiments on the MS-COCO dataset show the effectiveness of our model, leading to improvements on all commonly used metrics on the "Karpathy" test split. Extensive ablation experiments are conducted for a comprehensive analysis of the proposed method.
AB - It is widely accepted that capturing relationships among multi-modality features is helpful for representing and ultimately describing an image. In this paper, we present a novel Intra- and Inter-modality visual Relation Transformer, termed I2RT, to improve connections among visual features. First, we propose a Relation Enhanced Transformer Block (RETB) for image feature learning, which strengthens intra-modality visual relations among objects. Moreover, to bridge the gap between inter-modality feature representations, we align them explicitly via a Visual Guided Alignment (VGA) module. Finally, an end-to-end formulation is adopted to train the whole model jointly. Experiments on the MS-COCO dataset show the effectiveness of our model, leading to improvements on all commonly used metrics on the "Karpathy" test split. Extensive ablation experiments are conducted for a comprehensive analysis of the proposed method.
UR - https://dl.acm.org/doi/10.1145/3394171.3413877
UR - http://www.scopus.com/inward/record.url?scp=85106747344&partnerID=8YFLogxK
U2 - 10.1145/3394171.3413877
DO - 10.1145/3394171.3413877
M3 - Conference contribution
SN - 9781450379885
SP - 4190
EP - 4198
BT - MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
ER -