TY - GEN
T1 - Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training
AU - Hao, Weituo
AU - Li, Chunyuan
AU - Li, Xiujun
AU - Carin, Lawrence
AU - Gao, Jianfeng
N1 - Generated from Scopus record by KAUST IRTS on 2021-02-09
PY - 2020/1/1
Y1 - 2020/1/1
AB - Learning to navigate in a visual environment following natural-language instructions is a challenging task, because the multimodal inputs to the agent are highly variable, and the training data on a new task is often limited. We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks. By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions. It can be easily used as a drop-in for existing VLN frameworks, leading to the proposed agent PREVALENT. It learns more effectively in new tasks and generalizes better in a previously unseen environment. The performance is validated on three VLN tasks. On the Room-to-Room [3] benchmark, our model improves the state-of-the-art from 47% to 51% on success rate weighted by path length. Further, the learned representation is transferable to other VLN tasks. On two recent tasks, vision-and-dialog navigation [30] and “Help, Anna!” [22], the proposed PREVALENT leads to significant improvement over existing methods, achieving a new state of the art.
UR - https://ieeexplore.ieee.org/document/9156554/
UR - http://www.scopus.com/inward/record.url?scp=85092192414&partnerID=8YFLogxK
U2 - 10.1109/CVPR42600.2020.01315
DO - 10.1109/CVPR42600.2020.01315
M3 - Conference contribution
SP - 13134
EP - 13143
BT - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
PB - IEEE Computer Society
ER -