Multi-hop question answering (MHQA) involves reasoning across multiple documents to answer complex questions. Dense retrievers typically outperform sparse methods like BM25 by leveraging semantic embeddings; however, they require labeled query-document pairs for fine-tuning. This poses a significant challenge in MHQA due to the high variability of queries (i.e., reformulated questions) across reasoning steps. To overcome this limitation, we introduce Retriever Supervision with Consistency and Relevance (ReSCORE), a novel method for training dense retrievers for MHQA without labeled documents. ReSCORE leverages large language models to capture each document's relevance to the question and consistency with the correct answer, and uses these signals to train a retriever within an iterative question-answering framework. Experiments on three MHQA benchmarks demonstrate the effectiveness of ReSCORE, with significant improvements in retrieval and, in turn, state-of-the-art MHQA performance.
Overview of ReSCORE. At each iteration of the iterative RAG process, the retriever receives gradients from the KL-divergence loss between the retrieval distribution and the pseudo-GT distribution, which is derived from the normalized LLM probabilities of the question and answer given each document. The number of iterations is determined dynamically: the process ends once the LLM predicts an answer other than "unknown". The red dashed lines represent gradient flows to the retriever.
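To make this training signal concrete, below is a minimal sketch of a single retriever update within one iteration. The helpers `retriever` (returning query-document similarity scores) and `llm_logprob` (returning the LLM log-probability of a sequence given a context), as well as the temperature `tau`, are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of one ReSCORE retriever update (assumed helper functions).
import torch
import torch.nn.functional as F

def rescore_step(retriever, llm_logprob, query, answer, docs, optimizer, tau=1.0):
    # Retrieval distribution: softmax over the retriever's similarity scores
    # for the k documents retrieved at this iteration (gradients flow here).
    sims = retriever(query, docs)                       # shape: (k,), requires grad
    retrieval_logp = F.log_softmax(sims / tau, dim=0)

    # Pseudo-GT distribution: normalize LLM scores P(q | d) * P(a | q, d)
    # over the same k documents; no gradient flows into the LLM.
    with torch.no_grad():
        llm_scores = torch.tensor([
            llm_logprob(query, context=d)
            + llm_logprob(answer, context=(d, query))
            for d in docs
        ])
        pseudo_gt = F.softmax(llm_scores, dim=0)

    # KL(pseudo-GT || retrieval distribution); only the retriever is updated.
    loss = F.kl_div(retrieval_logp, pseudo_gt, reduction="sum")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the pseudo-GT distribution is treated as a fixed target (computed under `torch.no_grad()`), so the KL loss backpropagates only through the retriever's similarity scores, matching the gradient flow shown by the red dashed lines in the figure.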
To identify which documents are relevant to the input question when labels are unavailable, we measure the distribution \(Q_\text{LM}^{(i)}(d_j^{(i)} \mid q^{(i)})\), the likelihood of retrieving document \(d_j^{(i)}\) given the query \(q^{(i)}\) at iteration \(i\). Formally, \[ Q_{\text{LM}}^{(i)}(d_{j}^{(i)} \mid q^{(i)}) \propto P_{\text{LM}}^{(i)}(a, q^{(i)} \mid d_{j}^{(i)}) = P_{\text{LM}}^{(i)}(q^{(i)} \mid d_{j}^{(i)}) \cdot P_{\text{LM}}^{(i)}(a \mid q^{(i)}, d_{j}^{(i)}), \] where \( P_\text{LM} \) denotes the probability of a token sequence as computed by the LLM.
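As a hedged illustration of how the two LLM terms, \( P_{\text{LM}}(q \mid d) \) and \( P_{\text{LM}}(a \mid q, d) \), could be scored in practice with a Hugging Face causal LM, the sketch below sums token log-likelihoods over the question span and the answer span, then softmax-normalizes over the retrieved candidates. The prompt templates and the `gpt2` checkpoint are placeholders, not the exact ones used in the paper.

```python
# Hedged sketch: pseudo-GT weights from a causal LM (placeholder prompts/model).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                 # substitute your LLM
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def seq_logprob(prefix: str, target: str) -> float:
    """Sum of log P(target tokens | prefix) under the causal LM.
    Simplification: assumes the prefix tokenization is a prefix of the
    full tokenization (ignores tokenizer boundary effects)."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    full_ids = tok(prefix + target, return_tensors="pt").input_ids
    logits = lm(full_ids).logits[0, :-1]                    # next-token logits
    logps = F.log_softmax(logits, dim=-1)
    start = prefix_ids.shape[1] - 1                         # predicts first target token
    target_ids = full_ids[0, prefix_ids.shape[1]:]
    return logps[start:start + len(target_ids)].gather(
        1, target_ids.unsqueeze(1)).sum().item()

def pseudo_gt(question: str, answer: str, docs: list[str]) -> torch.Tensor:
    scores = torch.tensor([
        seq_logprob(f"Document: {d}\nQuestion: ", question)                        # P(q | d)
        + seq_logprob(f"Document: {d}\nQuestion: {question}\nAnswer: ", answer)    # P(a | q, d)
        for d in docs
    ])
    return F.softmax(scores, dim=0)    # normalize over the retrieved candidates
```

Adding the two log-probabilities corresponds to the product \( P_{\text{LM}}(q \mid d) \cdot P_{\text{LM}}(a \mid q, d) \) in the equation above; the final softmax implements the proportionality by normalizing over the candidate documents.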
While \( P_{\text{LM}}^{(i)}(a \mid q, d_j^{(i)}) \) aligns more directly with the QA training objective, it often fails to fully capture a document's relevance to a query. This is because the language model may assign high scores based on superficial word alignments, even when the document is irrelevant. For example, for a question about the history of the Acura Legend, a document titled "Paris" might score higher than more relevant documents simply because it contains the correct year, "1981". In contrast, \( P_{\text{LM}}^{(i)}(a, q \mid d_j^{(i)}) \) incorporates the term \( P_{\text{LM}}^{(i)}(q \mid d_j^{(i)}) \), which measures a document's relevance to the query. This mitigates issues like the one described above, ensuring that documents with better contextual relevance are preferred.
Comparisons to State-of-the-Art Iterative RAG Frameworks on Three MHQA Benchmarks. EM and F1 scores are measured on each dataset. † Scores are sourced from Wang et al. (2024). ‡ Scores are reproduced using the official code. ‡‡ Scores are sourced from the original paper (Jeong et al., 2024).
Comparison of GT and Pseudo-GT Labels on All-Relevant-Document Retrieval. The y-axis shows the proportion of questions for which all relevant documents were retrieved; all of them are needed to correctly answer a given complex question. Pseudo-GT labels lead to improved performance as the number of iterations increases.
@misc{lee2025rescorelabelfreeiterativeretriever,
title={ReSCORE: Label-free Iterative Retriever Training for Multi-hop Question Answering with Relevance-Consistency Supervision},
author={Dosung Lee and Wonjun Oh and Boyoung Kim and Minyoung Kim and Joonsuk Park and Paul Hongsuck Seo},
year={2025},
eprint={2505.21250},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.21250},
}