TY - GEN
T1 - Contextualized keyword representations for multi-modal retinal image captioning
AU - Huang, Jia Hong
AU - Wu, Ting Wei
AU - Worring, Marcel
N1 - KAUST Repository Item: Exported on 2021-10-07
Acknowledgements: This work is supported by competitive research funding from King Abdullah University of Science and Technology (KAUST) and University of Amsterdam.
This publication acknowledges KAUST support, but has no KAUST-affiliated authors.
PY - 2021/8/21
Y1 - 2021/8/21
N2 - Medical image captioning automatically generates a medical description of the content of a given medical image. Traditional medical image captioning models create a description from a single medical image input only, which makes abstract medical descriptions or concepts difficult to generate and limits the effectiveness of medical image captioning. Multi-modal medical image captioning is one approach to addressing this problem: textual input, e.g., expert-defined keywords, is treated as one of the main drivers of medical description generation. Effectively encoding both the textual input and the medical image is therefore important for multi-modal medical image captioning. In this work, a new end-to-end deep multi-modal medical image captioning model is proposed, built on contextualized keyword representations, textual feature reinforcement, and masked self-attention. Evaluated on an existing multi-modal medical image captioning dataset, experimental results show that the proposed model is effective, with an increase of +53.2% in BLEU-avg and +18.6% in CIDEr compared with the state-of-the-art method. https://github.com/Jhhuangkay/Contextualized-Keyword-Representations-for-Multi-modal-Retinal-Image-Captioning
AB - Medical image captioning automatically generates a medical description of the content of a given medical image. Traditional medical image captioning models create a description from a single medical image input only, which makes abstract medical descriptions or concepts difficult to generate and limits the effectiveness of medical image captioning. Multi-modal medical image captioning is one approach to addressing this problem: textual input, e.g., expert-defined keywords, is treated as one of the main drivers of medical description generation. Effectively encoding both the textual input and the medical image is therefore important for multi-modal medical image captioning. In this work, a new end-to-end deep multi-modal medical image captioning model is proposed, built on contextualized keyword representations, textual feature reinforcement, and masked self-attention. Evaluated on an existing multi-modal medical image captioning dataset, experimental results show that the proposed model is effective, with an increase of +53.2% in BLEU-avg and +18.6% in CIDEr compared with the state-of-the-art method. https://github.com/Jhhuangkay/Contextualized-Keyword-Representations-for-Multi-modal-Retinal-Image-Captioning
UR - http://hdl.handle.net/10754/672161
UR - https://dl.acm.org/doi/10.1145/3460426.3463667
UR - http://www.scopus.com/inward/record.url?scp=85114874649&partnerID=8YFLogxK
U2 - 10.1145/3460426.3463667
DO - 10.1145/3460426.3463667
M3 - Conference contribution
SN - 9781450384636
SP - 645
EP - 652
BT - Proceedings of the 2021 International Conference on Multimedia Retrieval
PB - ACM
ER -