We present a novel lipreading system that improves speaker-independent word recognition by decoupling motion and content dynamics. The deep learning architecture processes motion and content in two distinct pipelines and then merges them, yielding an end-to-end trainable system that fuses independently learned representations. Relative to a baseline that uses a standard architecture, we obtain an average relative word accuracy improvement of ≈6.8% on unseen speakers and of ≈3.3% on known speakers.
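The reported gains are relative improvements, that is, the accuracy gain expressed as a fraction of the baseline accuracy rather than as a difference in percentage points. The short sketch below illustrates that arithmetic. The function name `relative_improvement` and the two accuracy values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def relative_improvement(baseline_acc: float, new_acc: float) -> float:
    """Return the gain of new_acc over baseline_acc as a fraction of the baseline."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (new_acc - baseline_acc) / baseline_acc


# Hypothetical word accuracies, for illustration only (not results from the paper).
baseline = 0.50   # assumed baseline word accuracy
proposed = 0.534  # assumed word accuracy of an improved system

gain = relative_improvement(baseline, proposed)
print(f"relative improvement: {gain:.1%}")
```

With these assumed values, an absolute gain of 3.4 percentage points corresponds to a relative gain of about 6.8%, which shows why the two kinds of figure should not be compared directly.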
|Original language||English (US)|
|Title of host publication||ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings|
|Publisher||Institute of Electrical and Electronics Engineers Inc.|
|Number of pages||5|
|State||Published - May 1 2020|