Existing Multimodal Knowledge-Based Visual Question Answering (MKB-VQA) benchmarks suffer from “visual shortcuts”: the query image typically matches the primary subject entity of the target document. We demonstrate that models can exploit these shortcuts, achieving comparable results using visual cues alone. To address this, we introduce the Relational Entity Text-Image kNowledge Augmented (RETINA) benchmark, automatically constructed with an LLM-driven pipeline and consisting of a training set and a human-curated test set. RETINA contains queries referencing secondary subjects (i.e., related entities) and pairs them with images of these related entities, removing the visual shortcut. When evaluated on RETINA, existing models show significantly degraded performance, confirming their reliance on the shortcut. Furthermore, we propose the Multi-Image MultImodal Retriever (MIMIR), which enriches document embeddings by augmenting them with images of multiple related entities, effectively handling RETINA, unlike prior work that uses only a single image per document. Our experiments validate the limitations of existing benchmarks and demonstrate the effectiveness of RETINA and MIMIR.
RETINA Benchmark Generation Pipeline. (a) constructs one-hop neighborhood graphs by extracting named entities (gray box) related to the answer entity (white box) and their relations using an LLM; (b) samples a query entity (green) and a qualifying entity (teal) to form a target subgraph with the answer entity (red); and (c) feeds the target subgraph into an LLM to generate a textual query and collects a corresponding image from M2KR Images. The query is then paraphrased to minimize lexical overlap with the document.
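To make the three stages concrete, below is a minimal Python sketch of the pipeline. The prompt wording, the generic `llm` callable, and the subgraph-sampling rule are illustrative assumptions rather than the exact implementation, and image collection from M2KR is reduced to a comment.

```python
# Illustrative sketch of the RETINA generation pipeline. The prompts, the
# `llm` callable, and the sampling rule are assumptions made for clarity;
# the actual pipeline and prompt templates may differ.
import json
import random
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

def build_neighborhood_graph(document: str, answer_entity: str,
                             llm: Callable[[str], str]) -> List[Triple]:
    """(a) Extract named entities related to the answer entity and their
    relations, yielding a one-hop graph centered on the answer entity."""
    prompt = (f"Document:\n{document}\n\n"
              f"Return JSON triples [[\"{answer_entity}\", relation, entity], ...] "
              f"for every named entity related to '{answer_entity}'.")
    return [tuple(t) for t in json.loads(llm(prompt))]

def sample_target_subgraph(triples: List[Triple], rng=random) -> Dict[str, Triple]:
    """(b) Sample a query entity and a qualifying entity to form the target
    subgraph together with the answer entity."""
    query_triple, qualifying_triple = rng.sample(triples, 2)
    return {"query": query_triple, "qualifying": qualifying_triple}

def generate_query(subgraph: Dict[str, Triple], llm: Callable[[str], str]) -> str:
    """(c) Generate a textual query from the subgraph, then paraphrase it to
    minimize lexical overlap with the document. The image of the sampled query
    entity is collected separately from the M2KR image collection."""
    question = llm(f"Write a question about the answer entity using only: {subgraph}")
    return llm(f"Paraphrase without reusing wording from the source document: {question}")
```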
Overview of the MIMIR Document Encoder Architecture. (a) given a document, related named entities are identified with an LLM, and corresponding images are collected from the KB; (b) textual, global image, and patch-level features are extracted, with patch features attending to textual features through cross-attention to yield multimodal features; (c) entity token embeddings are incorporated into the textual features prior to cross-attention for richer contextualization; and (d) the final document embedding jointly integrates the textual, global, and multimodal features, projected into the same embedding space.
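A minimal PyTorch sketch of this encoder follows. The hidden size, mean pooling, and additive fusion of the three projected streams are assumptions made to keep the example short; only the overall structure (entity-augmented text context, patch-to-text cross-attention, shared projection space) follows the description above.

```python
# Minimal sketch of a MIMIR-style document encoder; dimensions, pooling, and
# the fusion of the three streams are assumptions, not the exact architecture.
import torch
import torch.nn as nn

class MIMIRDocumentEncoder(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        # (b)/(c): patch features attend to [text tokens ; entity tokens]
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # (d): project each feature type into a shared embedding space
        self.text_proj = nn.Linear(d_model, d_model)
        self.global_proj = nn.Linear(d_model, d_model)
        self.mm_proj = nn.Linear(d_model, d_model)

    def forward(self, text_feats, entity_feats, global_img_feats, patch_feats):
        """
        text_feats:       (B, T, D) token features of the document text
        entity_feats:     (B, E, D) embeddings of related-entity tokens
        global_img_feats: (B, N, D) one global feature per related-entity image
        patch_feats:      (B, N*P, D) patch features of the related-entity images
        """
        # (c) enrich the textual context with related-entity token embeddings
        context = torch.cat([text_feats, entity_feats], dim=1)
        # (b) patch features query the entity-augmented textual features
        mm_feats, _ = self.cross_attn(patch_feats, context, context)
        # (d) pool and project each stream, then fuse into one document embedding
        doc_emb = (self.text_proj(text_feats.mean(dim=1))
                   + self.global_proj(global_img_feats.mean(dim=1))
                   + self.mm_proj(mm_feats.mean(dim=1)))
        return nn.functional.normalize(doc_emb, dim=-1)
```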
Qualitative Examples. Existing MKB-VQA benchmarks are biased toward scenarios where the main entity of the target document is the same entity depicted in the query image. To address this bias, we focus on collecting samples that do not permit such visual shortcuts, deliberately selecting query images whose depicted entity differs from the main entity of the target Wikipedia document.
Qualitative Comparison of Retrieval on the RETINA Benchmark. Results for (a) the single-image baseline (MuKA) and (b) MIMIR. GT documents are shown in green, and the query entity within them is also highlighted in green. The single-image baseline tends to rely on visual shortcuts, often retrieving documents with images that merely resemble the query. In contrast, MIMIR benefits from additional visual-similarity scoring based on document embeddings augmented with images of multiple related entities, enabling it to correctly retrieve the GT document. These results illustrate that textual and visual information must be considered jointly to solve the task.
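As a hypothetical illustration of such multi-image scoring (not the paper's exact formulation), each document can keep one embedding per related-entity image, with the retrieval score combining textual similarity and the best-matching image-augmented embedding; the weight `alpha` and the max-pooling choice below are assumptions.

```python
# Illustrative multi-image retrieval scoring; alpha and max-pooling are assumptions.
import torch

def score(query_emb: torch.Tensor, doc_text_emb: torch.Tensor,
          doc_image_embs: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """
    query_emb:      (D,)   embedding of the multimodal query
    doc_text_emb:   (D,)   text-only document embedding
    doc_image_embs: (N, D) one embedding per related-entity image of the document
    """
    text_sim = torch.dot(query_emb, doc_text_emb)
    visual_sim = (doc_image_embs @ query_emb).max()  # best-matching related image
    return alpha * text_sim + (1 - alpha) * visual_sim
```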
@misc{lee2025breakingvisualshortcutsmultimodal,
title={Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering},
author={Dosung Lee and Sangwon Jung and Boyoung Kim and Minyoung Kim and Sungyeon Kim and Junyoung Sung and Paul Hongsuck Seo},
year={2025},
eprint={2511.22843},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.22843},
}