TY - GEN
T1 - Fusion architectures for word-based audiovisual speech recognition
AU - Wand, Michael
AU - Schmidhuber, Jürgen
N1 - Generated from Scopus record by KAUST IRTS on 2022-09-14
PY - 2020/1/1
Y1 - 2020/1/1
N2 - In this study we investigate architectures for modality fusion in audiovisual speech recognition, where one aims to alleviate the adverse effect of acoustic noise on the speech recognition accuracy by using video images of the speaker's face as an additional modality. Starting from an established neural network fusion system, we substantially improve the recognition accuracy by taking single-modality losses into account: late fusion (at the output logits level) is substantially more robust than the baseline, in particular for unseen acoustic noise, at the expense of having to determine the optimal weighting of the input streams. The latter requirement can be removed by making the fusion itself a trainable part of the network.
UR - https://www.isca-speech.org/archive/interspeech_2020/wand20_interspeech.html
UR - http://www.scopus.com/inward/record.url?scp=85098231449&partnerID=8YFLogxK
DO - 10.21437/Interspeech.2020-2117
M3 - Conference contribution
SP - 3491
EP - 3495
BT - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
PB - International Speech Communication Association
ER -