Do Visual Imaginations Improve Vision-and-Language Navigation Agents?

Oregon State University
CVPR 2025
Teaser Image

Illustration of visual imaginations. (Top) A natural language instruction specifying the sub-goals pool table, kitchen, and bedroom. (Bottom) Visual imaginations of the landmarks pool table, kitchen, and bedroom referenced by the sub-goals in the instruction. In our work, we study whether these visual imaginations, generated using text-to-image models, can convey navigational cues that improve the performance of VLN agents.

Abstract

Vision-and-Language Navigation (VLN) agents are tasked with navigating an unseen environment using natural language instructions. In this work, we study whether visual representations of sub-goals implied by the instructions can serve as navigational cues and lead to increased navigation performance. To synthesize these visual representations, or “imaginations”, we apply a text-to-image diffusion model to landmark references contained in segmented instructions. These imaginations are provided to VLN agents as an additional modality that acts as landmark cues, and an auxiliary loss is added to explicitly encourage relating them to their corresponding referring expressions. Our findings reveal an increase in success rate (SR) of ∼1 point and up to ∼0.5 points in success weighted by inverse path length (SPL) across agents. These results suggest that the proposed approach reinforces visual understanding compared to relying on language instructions alone.

Navigating with Imaginations

We examine whether providing visual imagery corresponding to described landmarks improves the performance of agents following natural language navigation instructions. Consider the natural language navigation instruction presented in the figure above, which asks an agent to “Go straight, take a left at the pool table to enter the kitchen. Walk to the bedroom and stop”. This instruction provides unconditional action directives like “go straight”, but it also frequently conditions directions on visual landmarks in the scene, such as the pool table, kitchen, and bedroom.

We instantiate our idea using text-to-image generation models to produce imagery matching the semantics of these visual references prior to navigation. Drawing an analogy to the substantial work in cognitive science on the impact of mental imagery on task performance, we refer to these generated images as visual imaginations and study whether providing them in addition to corresponding language-based instructions can improve the performance of vision-and-language navigation agents.
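To make the generation step concrete, the sketch below extracts candidate landmark phrases from a sub-instruction and renders each with an off-the-shelf latent diffusion model via the diffusers library. The spaCy-based noun-phrase filter, the checkpoint name, and the prompt template are illustrative assumptions, not necessarily the exact extraction scheme or text-to-image model used in the paper.

# Minimal sketch: extract landmark phrases from a sub-instruction and
# synthesize "imaginations" with an off-the-shelf text-to-image model.
# The checkpoint, filter, and prompt template are assumptions for illustration.
import spacy
import torch
from diffusers import StableDiffusionPipeline

nlp = spacy.load("en_core_web_sm")  # lightweight noun-phrase extractor

def extract_landmarks(sub_instruction: str) -> list[str]:
    """Return candidate landmark phrases (noun chunks) from a sub-instruction."""
    doc = nlp(sub_instruction)
    # Keep noun-rooted chunks; the paper's filtering scheme may differ.
    return [chunk.text for chunk in doc.noun_chunks if chunk.root.pos_ == "NOUN"]

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

sub_instruction = "take a left at the pool table to enter the kitchen"
for landmark in extract_landmarks(sub_instruction):
    # Simple prompt template; the prompts used in the paper may differ.
    image = pipe(f"a photo of a {landmark} in a house, interior view").images[0]
    image.save(f"imagination_{landmark.replace(' ', '_')}.png")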

Method Overview

An overview of our approach. (Left) Imaginations generated from valid sub-instructions of an instruction, as determined by our filtering scheme, are first passed to a pre-trained ViT to obtain feature vectors. A type embedding t_Im for the imagination modality is then added to the features, which are encoded with a 3-layer MLP to obtain imagination embeddings h_I. (Right) To integrate the imagination modality into a VLN agent, the imagination embeddings h_I are concatenated with instruction embeddings t_I produced by a text encoder f_T(W). The concatenated imagination-text embeddings are passed to the VLN agent’s cross-modal encoder f_X along with visual embeddings to predict a distribution over the agent’s action space.
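The snippet below gives a minimal PyTorch sketch of the imagination branch: ViT features with a learned type embedding t_Im, a 3-layer MLP producing h_I, and concatenation with the instruction embeddings before the cross-modal encoder. The dimensions, the frozen ViT, and the cross-modal interface are assumptions for illustration, and the cosine-alignment term is only one plausible stand-in for the auxiliary loss mentioned in the abstract, not the paper's exact formulation.

# Minimal sketch of the imagination branch (dimensions and module choices
# are assumptions for illustration, not the exact configuration in the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImaginationEncoder(nn.Module):
    def __init__(self, vit_dim: int = 768, hidden_dim: int = 768):
        super().__init__()
        # Learned type embedding t_Im marking the imagination modality.
        self.type_embedding = nn.Parameter(torch.zeros(1, 1, vit_dim))
        # 3-layer MLP producing imagination embeddings h_I.
        self.mlp = nn.Sequential(
            nn.Linear(vit_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, vit_feats: torch.Tensor) -> torch.Tensor:
        # vit_feats: (batch, num_imaginations, vit_dim) from a frozen pre-trained ViT.
        return self.mlp(vit_feats + self.type_embedding)  # h_I

def alignment_loss(h_I: torch.Tensor, h_ref: torch.Tensor) -> torch.Tensor:
    """One plausible auxiliary loss (an assumption): pull each imagination
    embedding toward the embedding of its referring expression."""
    return (1.0 - F.cosine_similarity(h_I, h_ref, dim=-1)).mean()

# Integration sketch: concatenate h_I with instruction token embeddings t_I
# before the agent's cross-modal encoder f_X (interface is hypothetical).
# fused = torch.cat([t_I, h_I], dim=1)
# action_logits = agent.cross_modal_encoder(fused, visual_embeddings)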

In summary, we:

  • Develop a pipeline for generating visual imaginations from navigation instructions and synthesize the R2R-Imagine dataset to enable studying the impact of text-to-image models on VLN agents.
  • Propose an agent-agnostic method to incorporate visual imaginations into existing VLN agents and show improved performance for HAMT and DUET.
  • Show generalized performance improvements across fine-grained instructions (Room-to-Room) and coarse-grained instructions (REVERIE).

Results

Model                  R2R Val Unseen        R2R Test
                       SR       SPL          SR     SPL
HAMT                   66.24    61.51        65     60
HAMT-Imagine (Ours)    67.26    62.02        65     60
DUET                   71.52    60.41        69     59
DUET-Imagine (Ours)    72.12    60.48        71     60

Comparison of our approach with selected prior work on the Room-to-Room (R2R) dataset. Adding our visual imagination approach to HAMT and DUET base models leads to improved success rate (SR) and success weighted by inverse path length (SPL).

Model                  REVERIE (Val Unseen)
                       SR       SPL      RGS      RGSPL
DUET                   46.98    33.73    32.15    23.03
DUET-Imagine (Ours)    48.28    33.76    32.97    23.25

Performance comparison of our approach with the baseline on REVERIE. Our method improves both navigation and grounding metrics for DUET when provided coarse-grained instructions.

Qualitative Visualizations

Attention visualization

Qualitative examples illustrating the role of imaginations as pivots between language and observation images. Each row contains one example. The first column contains a sub-instruction from a randomly sampled R2R instruction, and the second column contains the imagination generated from that sub-instruction. The third and fourth columns show the highest-attended language tokens and observation images from an attention head in HAMT’s cross-modal transformer at the time step when the associated observations first become visible. In the second example (row 2), the sub-instruction references a “unicycle”, which is captured in the imagination along with the neighboring nouns “easel” and “door”. In a head where the top-attending language tokens for the imagination query are nouns from the sub-instruction, the top-attended observations for the same query are images of the same concept (“unicycle”). In this example, the imagination of a unicycle associates the language tokens for “unicycle” with observations of a unicycle, hinting at the utility of imaginations in navigation.
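For readers who want to reproduce this kind of inspection, the sketch below shows one generic way to pull per-head attention weights for the imagination query out of a cross-modal transformer layer and rank the language tokens or observations it attends to. The forward hook, module names, and tensor layout are assumptions; HAMT's actual implementation may differ.

# Generic sketch of ranking what an imagination query attends to; module names,
# hooks, and tensor layout are assumptions, not HAMT's actual implementation.
import torch

def top_attended(attn: torch.Tensor, query_idx: int, head: int, k: int = 5):
    """attn: (num_heads, num_queries, num_keys) attention weights from one
    cross-modal layer. Returns indices of the k keys (language tokens or
    observation images) most attended by the imagination query."""
    weights = attn[head, query_idx]          # (num_keys,)
    return torch.topk(weights, k).indices.tolist()

# Example usage with a forward hook on a (hypothetical) attention module:
# captured = {}
# def hook(module, inputs, output):
#     captured["attn"] = output[1].detach()  # assumes (context, attn_weights) output
# handle = agent.cross_modal_encoder.layers[-1].attention.register_forward_hook(hook)
# ... run one navigation step ...
# print(top_attended(captured["attn"][0], query_idx=imagination_token_idx, head=3))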

BibTeX


        @InProceedings{aperinch_2025_VLN_Imagine,
          title={Do Visual Imaginations Improve Vision-and-Language Navigation Agents?},
          author={Akhil Perincherry and Jacob Krantz and Stefan Lee},
          booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
          month={June},
          year={2025},
        }