Multi-hop question answering (MHQA) involves reasoning across multiple documents to answer complex questions. Dense retrievers typically outperform sparse methods like BM25 by leveraging semantic embeddings; however, they require labeled query-document pairs for fine-tuning. This poses a significant challenge in MHQA due to the high variability of queries, i.e., reformulated questions, throughout the reasoning steps. To overcome this limitation, we introduce Retriever Supervision with Consistency and Relevance (ReSCORE), a novel method for training dense retrievers for MHQA without labeled documents. ReSCORE leverages large language models to capture each document's relevance to the question and consistency with the correct answer, and uses these signals to train a retriever within an iterative question-answering framework. Experiments on three MHQA benchmarks demonstrate the effectiveness of ReSCORE, showing significant improvements in retrieval that, in turn, lead to state-of-the-art MHQA performance.
Overview of ReSCORE. At each iteration of the iterative RAG process, the retriever receives gradients from the KL-divergence loss between the retrieval distribution and the pseudo-GT distribution, which is derived from the LLM probabilities of the question and answer given each document, normalized over the retrieved documents. The number of iterations is dynamically determined by the LLM: the process ends when the LLM predicts an answer other than "unknown". The red dashed lines represent gradient flow to the retriever.
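To make the training signal concrete, below is a minimal PyTorch-style sketch of the loss at one iteration, assuming the retriever's similarity scores for the top-k documents and the LLM log-probabilities \(\log P_\text{LM}(a, q \mid d_j)\) for the same documents are already available; the function and variable names are illustrative, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def rescore_kl_loss(retriever_scores: torch.Tensor,
                    llm_logprobs: torch.Tensor) -> torch.Tensor:
    """KL divergence between the pseudo-GT distribution (from the frozen LLM)
    and the retrieval distribution (from the trainable retriever), computed
    over the k documents retrieved at the current iteration."""
    # Retrieval distribution: softmax over the retriever's similarity scores.
    log_p_retriever = F.log_softmax(retriever_scores, dim=-1)
    # Pseudo-GT distribution: normalize P_LM(a, q | d_j) over the k documents.
    pseudo_gt = F.softmax(llm_logprobs, dim=-1).detach()  # no gradient into the LLM
    return F.kl_div(log_p_retriever, pseudo_gt, reduction="batchmean")

# Toy usage with k = 4 candidate documents for one query.
scores = torch.randn(1, 4, requires_grad=True)       # retriever similarity scores
llm_lp = torch.tensor([[-3.2, -1.1, -4.5, -2.0]])    # log P_LM(a, q | d_j)
loss = rescore_kl_loss(scores, llm_lp)
loss.backward()  # gradients flow only to the retriever, as in the red dashed lines
```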
In order to identify which documents are relevant to the input question when labels are unavailable, we measure the distribution \(Q_\text{LM}^{(i)}(d_j^{(i)} \mid q)\), which represents the likelihood of retrieving document \(d_j^{(i)}\) at iteration \(i\) given the question \(q\). Formally, it is expressed as:
\[
Q_{\text{LM}}^{(i)}(d_{j}^{(i)} \mid q) \propto P_{\text{LM}}^{(i)}(a, q \mid d_{j}^{(i)}) = P_{\text{LM}}^{(i)}(q \mid d_{j}^{(i)}) \cdot P_{\text{LM}}^{(i)}(a \mid q, d_{j}^{(i)}),
\]
where \(P_\text{LM}\) denotes the probability of a token sequence as computed by the LLM and \(a\) is the ground-truth answer.
While \( P_{\text{LM}}^{(i)}(a \mid q, d_j^{(i)}) \) aligns more directly with the QA training objective, it often fails to fully capture a document's relevance to a query. This is because the language model may assign high scores based on superficial word alignments, even when the document is irrelevant. For example, a document titled "Paris" might score higher than more relevant documents for a question about the Acura Legend's history, simply because it contains the correct answer year, "1981", despite being irrelevant to the question. On the other hand, \( P_{\text{LM}}^{(i)}(a, q \mid d_j^{(i)}) \) incorporates the term \( P_{\text{LM}}^{(i)}(q \mid d_j^{(i)}) \), which measures a document's relevance to the query. This mitigates issues like the one described above, ensuring that documents with better contextual relevance are preferred.
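As an illustration of how the two factors can be estimated, the sketch below scores a single document with a Hugging Face causal LM by summing token log-likelihoods. The prompt templates, the helper `sequence_logprob`, and the use of `gpt2` as the scorer are assumptions made for this example, not the exact prompts or model used in the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def sequence_logprob(prefix: str, target: str) -> float:
    """Sum of log P_LM(target tokens | prefix + preceding target tokens)."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, target_ids], dim=-1)
    logits = model(input_ids).logits
    # Log-probabilities over the vocabulary for each next-token prediction.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    # Keep only the positions that predict the target tokens.
    target_positions = log_probs[:, prefix_ids.size(1) - 1 :, :]
    token_lp = target_positions.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

def joint_score(document: str, question: str, answer: str) -> float:
    # log P_LM(q | d) + log P_LM(a | q, d), i.e. log P_LM(a, q | d) under these prompts.
    relevance = sequence_logprob(f"Document: {document}\nQuestion:", f" {question}")
    consistency = sequence_logprob(
        f"Document: {document}\nQuestion: {question}\nAnswer:", f" {answer}"
    )
    return relevance + consistency
```

Normalizing `joint_score` across the k retrieved documents (e.g., with a softmax over the scores) yields the pseudo-GT distribution used in the KL-divergence loss above.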
Comparisons to State-of-the-Art Iterative RAG Frameworks on Three MHQA Benchmarks. EM and F1 scores are measured on each dataset. † Scores are sourced from Wang et al. (2024). ‡ Scores are reproduced using the official code. ‡‡ Scores are sourced from the original paper (Jeong et al., 2024).
Comparison of GT and Pseudo-GT Labels on All Relevant Document Retrieval. The y-axis shows the proportion of questions for which all relevant documents were retrieved, all of which are needed to correctly answer the given complex question. Pseudo-GT labels lead to improved performance as the number of iterations increases.
@inproceedings{lee-etal-2025-rescore,
title = "{R}e{SCORE}: Label-free Iterative Retriever Training for Multi-hop Question Answering with Relevance-Consistency Supervision",
author = "Lee, Dosung and
Oh, Wonjun and
Kim, Boyoung and
Kim, Minyoung and
Park, Joonsuk and
Seo, Paul Hongsuck",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.16/",
doi = "10.18653/v1/2025.acl-long.16",
pages = "341--359",
ISBN = "979-8-89176-251-0",
abstract = "Multi-hop question answering (MHQA) involves reasoning across multiple documents to answer complex questions. Dense retrievers typically outperform sparse methods like BM25 by leveraging semantic embeddings in many tasks; however, they require labeled query-document pairs for fine-tuning, which poses a significant challenge in MHQA due to the complexity of the reasoning steps. To overcome this limitation, we introduce Retriever Supervision with Consistency and Relevance (ReSCORE), a novel method for training dense retrievers for MHQA without the need for labeled documents. ReSCORE leverages large language models to measure document-question relevance with answer consistency and utilizes this information to train a retriever within an iterative question-answering framework. Evaluated on three MHQA benchmarks, our extensive experiments demonstrate the effectiveness of ReSCORE, with significant improvements in retrieval performance that consequently lead to state-of-the-art Exact Match and F1 scores for MHQA."
}