Abstract
We present a novel lipreading system that improves speaker-independent word recognition by decoupling motion and content dynamics. Our deep learning architecture processes motion and content in two distinct pipelines and subsequently merges them, yielding an end-to-end trainable system that fuses the independently learned representations. We obtain an average relative word accuracy improvement of ≈6.8% on unseen speakers and of ≈3.3% on known speakers, with respect to a baseline that uses a standard architecture.
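To make the two-pipeline design concrete, here is a minimal sketch of such an architecture in PyTorch. The encoder layout, layer sizes, the frame-difference motion proxy, and the concatenation-based fusion are illustrative assumptions, not the paper's exact model.

```python
# A minimal, hypothetical sketch of a two-stream lipreading model in PyTorch.
# Branch layouts, sizes, and the fusion scheme are illustrative assumptions,
# not the authors' exact architecture.
import torch
import torch.nn as nn


class StreamEncoder(nn.Module):
    """Encodes a sequence of mouth-region frames into a fixed-size vector."""

    def __init__(self, in_channels: int, hidden_size: int = 256):
        super().__init__()
        # Per-frame convolutional feature extractor.
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (B*T, 64, 1, 1)
        )
        # Temporal model over the per-frame features.
        self.rnn = nn.GRU(64, hidden_size, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape
        feats = self.conv(x.reshape(b * t, c, h, w)).reshape(b, t, 64)
        _, last_hidden = self.rnn(feats)
        return last_hidden[-1]  # (batch, hidden_size)


class TwoStreamLipreader(nn.Module):
    """Separate content (raw frames) and motion (frame differences)
    pipelines whose representations are fused for word classification."""

    def __init__(self, num_words: int, hidden_size: int = 256):
        super().__init__()
        self.content = StreamEncoder(in_channels=1, hidden_size=hidden_size)
        self.motion = StreamEncoder(in_channels=1, hidden_size=hidden_size)
        # Late fusion by concatenation; the whole model trains end to end.
        self.classifier = nn.Linear(2 * hidden_size, num_words)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 1, H, W) grayscale mouth crops.
        diffs = frames[:, 1:] - frames[:, :-1]  # crude motion proxy
        fused = torch.cat([self.content(frames), self.motion(diffs)], dim=-1)
        return self.classifier(fused)  # word logits


# Usage with dummy data: 2 clips, 16 frames, 64x64 crops, 500-word vocabulary.
model = TwoStreamLipreader(num_words=500)
logits = model(torch.randn(2, 16, 1, 64, 64))
print(logits.shape)  # torch.Size([2, 500])
```

Because the two encoders share no weights, each stream learns its representation independently, while the concatenation-plus-classifier head keeps the whole model trainable end to end, mirroring the fusion idea described in the abstract.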
Original language | English (US)
--- | ---
Title of host publication | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Publisher | Institute of Electrical and Electronics Engineers Inc.
Pages | 4407-4411
Number of pages | 5
ISBN (Print) | 9781509066315
DOIs |
State | Published - May 1, 2020
Externally published | Yes