We examine whether providing visual imagery corresponding to described landmarks improves the performance of agents following natural language navigation instructions. Consider the navigation instruction presented in the figure above, which asks an agent to “Go straight, take a left at the pool table to enter the kitchen. Walk to the bedroom and stop.” This instruction provides unconditional action directives such as “go straight,” but it also frequently conditions the given directions on visual landmarks in the scene, such as the pool table, kitchen, and bedroom.
We instantiate this idea by using text-to-image generation models to produce, prior to navigation, imagery matching the semantics of these visual references. Drawing an analogy to the substantial work in cognitive science on the impact of mental imagery on task performance, we refer to these generated images as visual imaginations. We study whether providing them alongside the corresponding language-based instructions can improve the performance of vision-and-language navigation agents.