BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression

1Fudan University   2University of California, Los Angeles

*Equal contribution

Abstract

Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge. However, as the number of retrieved documents increases, the input length to LLMs grows linearly, causing a dramatic increase in latency and a degradation in long-context understanding. This is particularly serious for multi-hop questions that require a chain of reasoning across documents. To accelerate inference, reduce costs, and minimize distractions, this paper presents BRIEF (Bridging Retrieval and Inference through Evidence Fusion), a lightweight approach that performs query-aware multi-hop reasoning by compressing retrieved documents into highly dense textual summaries to integrate into in-context learning. To enable learning compression for multi-hop reasoning, we curate synthetic data by extracting atomic proposition expressions that encapsulate distinct factoids from the source documents to compose synthetic summaries. Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries and enables a range of LLMs to achieve exceptional open-domain question answering (QA) performance. For example, on HotpotQA, BRIEF improves the compression rate by 2 times compared to the state-of-the-art baseline, while outperforming it by 3.00% EM and 4.16% F1 with Flan-UL2 as the reader LM. It also generates more concise summaries than proprietary GPT-3.5, while demonstrating nearly identical QA performance.


The BRIEF Compressor

BRIEF is a lightweight, T5-based approach that performs query-aware multi-hop reasoning by compressing retrieved documents into highly dense textual summaries to integrate into in-context learning.

Unlike conventional methods that focus on compression for single-hop questions (Xu et al., 2024a; Cao et al., 2024), BRIEF is specifically trained to summarize the most pertinent knowledge from multiple documents that is essential for answering multi-hop questions.

Compared to token-, phrase-, or sentence-level compression (Jiang et al., 2023; Li et al., 2023), the summaries produced by BRIEF organize and synthesize query-relevant evidence in a more concise, natural-language format, making them more effective for the downstream reader LM.
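To make the interface concrete, below is a minimal sketch of query-aware compression with an off-the-shelf T5-style seq2seq model from Hugging Face transformers. The checkpoint and prompt template are illustrative stand-ins, not the released BRIEF artifacts.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Stand-in checkpoint; BRIEF's released weights and prompt format may differ.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

def compress(query: str, documents: list[str], max_summary_tokens: int = 128) -> str:
    """Compress retrieved documents into a query-aware textual summary."""
    context = "\n\n".join(documents)
    prompt = f"Question: {query}\nDocuments: {context}\nSummary:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
    output_ids = model.generate(**inputs, max_new_tokens=max_summary_tokens, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# The summary then replaces the full top-k documents in the reader LM's prompt.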

Synthetic Data

Unlike state-of-the-art fine-tuned compressors distilled from extreme-scale proprietary LLMs (Xu et al., 2024a), BRIEF is trained on synthetic data produced by a pipeline built entirely with open-source models, without relying on any proprietary LLMs or human annotations.


  • The synthetic data pipeline extracts atomic proposition expressions that encapsulate distinct factoids from the source documents to compose synthetic summaries.
  • The pipeline includes an automatic validation mechanism that filters out spurious multi-hop questions and their summaries, retaining only those requiring genuine multi-hop reasoning and thereby improving the quality and reliability of the synthetic data.
  • Moreover, the synthetic data exhibits a strong awareness of multi-hop reasoning and the potential to scale up, offering a data-centric approach to constructing high-quality, cost-effective synthetic data for context compression (a minimal sketch of the pipeline follows this list).
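A hedged sketch of the pipeline's control flow is given below. The prompts, the open-source extractor/validator model, and the single-document answerability check are illustrative assumptions, not the exact released pipeline.

def extract_propositions(document: str, llm) -> list[str]:
    """Ask an open-source LLM (any text-in/text-out callable) for atomic propositions."""
    response = llm(f"List the atomic factual propositions stated in:\n{document}")
    return [line.lstrip("- ").strip() for line in response.splitlines() if line.strip()]

def build_example(question: str, documents: list[str], llm) -> dict | None:
    """Compose a synthetic summary and validate that the question is genuinely multi-hop."""
    propositions = [p for doc in documents for p in extract_propositions(doc, llm)]
    relevant = [p for p in propositions
                if llm(f"Is this fact needed to answer '{question}'? {p} (yes/no)")
                .lower().startswith("yes")]
    # Validation: discard spurious "multi-hop" questions that any single
    # document can already answer on its own (assumed filtering criterion).
    if any(llm(f"Can '{question}' be answered using ONLY this document?\n{doc}\n(yes/no)")
           .lower().startswith("yes") for doc in documents):
        return None
    return {"question": question, "documents": documents, "summary": " ".join(relevant)}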

Experimental Results

We evaluate on the following datasets: HotpotQA (Yang et al., 2018), MuSiQue (Trivedi et al., 2022), Natural Questions (NQ) (Kwiatkowski et al., 2019), and TriviaQA (Joshi et al., 2017). The first two primarily consist of multi-hop questions, whereas the latter two are mainly single-hop. For TriviaQA and NQ, we additionally curated high-quality multi-hop versions using our proposed synthetic data pipeline, named MultiHop-TriviaQA and MultiHop-NQ; these test sets expose the limitations of previous compressors, which perform well in single-hop settings but fall behind in multi-hop ones. The metric computations are sketched below.
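For reference, the reported metrics can be computed as in the following sketch: compression rate is the ratio of input tokens to summary tokens, and EM/F1 follow standard open-domain QA scoring (whitespace tokenization here is a simplification of the usual SQuAD-style normalization).

from collections import Counter

def compression_rate(documents: list[str], summary: str) -> float:
    """Input tokens over summary tokens (token = whitespace-separated word here)."""
    n_in = sum(len(d.split()) for d in documents)
    n_out = max(len(summary.split()), 1)
    return n_in / n_out

def exact_match(prediction: str, answer: str) -> bool:
    return prediction.strip().lower() == answer.strip().lower()

def f1_score(prediction: str, answer: str) -> float:
    pred, gold = prediction.lower().split(), answer.lower().split()
    common = sum((Counter(pred) & Counter(gold)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)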

  • BRIEF achieves a compression rate of 19.19x, with only a 1.60-point decrease in EM and a 1.83-point decrease in F1 compared to prepending the full documents on HotpotQA.
  • Compared to RECOMP, BRIEF achieves a higher compression rate (19.19x vs. 10.02x) while still outperforming it by 3.00 points EM and 4.16 points F1 on HotpotQA. On MultiHop-NQ, we observe a similar trend: BRIEF's 16.85x exceeds RECOMP's 10.84x, while outperforming RECOMP by 3.78 points EM and 4.39 points F1.
  • Compared to the proprietary LLM GPT-3.5, BRIEF achieves higher compression rates while delivering competitive QA performance. On HotpotQA, for example, GPT-3.5 achieves a compression rate of 14.77x with 31.60% EM and 42.65% F1, while BRIEF achieves a higher 19.19x and still delivers nearly identical results of 31.20% EM and 42.07% F1.

  • BRIEF achieves a compression rate of 29.76x, with only a 2.55-point decrease in EM and a 3.49-point decrease in F1 compared to prepending full documents on TriviaQA. On NQ, we observed a similar trend, with a compression rate of 17.67x, resulting in only a 2.99-point decrease in EM and a 3.28-point decrease in F1.
  • Compared to RECOMP, BRIEF achieves a higher compression rate (29.76x vs. 16.23x) while still outperforming RECOMP on TriviaQA.
  • Compared to GPT-3.5, BRIEF achieves competitive QA performance, while its compression rate of 17.67x significantly exceeds GPT-3.5's 11.33x.

Analysis

The transfer ability of compressed summaries across LMs

  • Transferability here means how well a compressed summary preserves the core semantics relevant to the query while using an expression format compatible with a wide range of LMs. We selected models from the same family to avoid model-selection bias.
  • Since our compression takes the form of propositions, it is more interpretable and transfers better across LMs compared to RECOMP and GPT-3.5.
  • In comparison to RECOMP and GPT-3.5, BRIEF's performance drops less when transferring from Phi-3-mini to Phi-3-small and improves more from Phi-3-small to Phi-3-medium. These results imply that the compressed summaries generated by BRIEF are robust and consistent across readers; see the sketch after this list.
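A sketch of the transfer protocol: the same compressed summaries are scored with readers of increasing size from one model family. The helper functions load_reader and answer_with are hypothetical placeholders for loading a reader LM and producing an answer from (summary, question).

READERS = [
    "microsoft/Phi-3-mini-4k-instruct",
    "microsoft/Phi-3-small-8k-instruct",
    "microsoft/Phi-3-medium-4k-instruct",
]

def evaluate_transfer(examples, load_reader, answer_with) -> dict[str, float]:
    """Score the SAME compressed summaries with each reader.

    examples: dicts with 'question', 'summary', and gold 'answer'.
    """
    scores = {}
    for name in READERS:
        reader = load_reader(name)  # e.g. a transformers text-generation pipeline
        correct = sum(
            answer_with(reader, ex["summary"], ex["question"]).strip().lower()
            == ex["answer"].strip().lower()
            for ex in examples
        )
        scores[name] = correct / len(examples)
    return scores

# A stable curve across model sizes indicates summaries that transfer well.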

The sensitivity of summary length to multi-hop nature of questions

  • The variation in summary length regarding question complexity can, to some extent, reflect the compressor’s sensitivity to that complexity.
  • As there is no established ground truth for the length of compressed summaries for each question, the results from GPT-3.5 were used as the reference oracle.
  • The results indicate that BRIEF consistently aligns with GPT-3.5 in its sensitivity to the multi-hop nature of questions while generating more concise summaries. This alignment suggests that BRIEF understands question complexity and adaptively collects the necessary evidence to formulate a complete, accurate summary for answering the question (see the sketch below).
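As a sketch, the sensitivity analysis can be framed as bucketing questions by hop count and comparing mean summary lengths against the GPT-3.5 reference. The field names 'hops' and 'summary' are assumed for illustration.

from collections import defaultdict
from statistics import mean

def mean_length_by_hops(examples) -> dict[int, float]:
    """Mean summary length (in words) per hop count."""
    buckets = defaultdict(list)
    for ex in examples:
        buckets[ex["hops"]].append(len(ex["summary"].split()))
    return {hops: mean(lengths) for hops, lengths in sorted(buckets.items())}

# Run once on BRIEF's summaries and once on the GPT-3.5 reference: curves that
# rise together with hop count indicate matching sensitivity, and a uniformly
# lower BRIEF curve indicates more concise summaries.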

The improvement of latency in terms of the overall computational overhead

  • When employing BRIEF for compression, the number of GFLOPs required to process the compressed documents is significantly reduced compared to the amount required when using Flan-UL2 alone on the original, uncompressed set of top-5 documents. The total amount of computation is reduced to less than 30% of what it was before compression.
  • This reduction in GFLOPs highlights BRIEF's potential to optimize inference, especially for large-scale document retrieval and processing, by enabling the LM to focus on compressed, more relevant information while maintaining comparable accuracy (see the sketch below).
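A back-of-the-envelope sketch of the accounting, using the common approximation of roughly 2 x parameters x tokens FLOPs per transformer forward pass. The compressor size and token counts are assumptions; the paper's exact measurement may differ.

def forward_gflops(n_params: float, n_tokens: int) -> float:
    """~2 * parameters * tokens FLOPs for one transformer forward pass."""
    return 2 * n_params * n_tokens / 1e9

FLAN_UL2_PARAMS = 20e9     # reader LM (20B)
COMPRESSOR_PARAMS = 0.8e9  # T5-based compressor; size assumed for illustration

full_ctx_tokens = 1000     # top-5 documents, assumed length
summary_tokens = 52        # roughly 19.19x compression

uncompressed = forward_gflops(FLAN_UL2_PARAMS, full_ctx_tokens)
compressed = (forward_gflops(COMPRESSOR_PARAMS, full_ctx_tokens)
              + forward_gflops(FLAN_UL2_PARAMS, summary_tokens))
print(f"compute ratio: {compressed / uncompressed:.1%}")  # well under the <30% reported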

The scalability to compress longer documents

  • We further explored whether the proposed compressors could be effectively applied to more complex scenarios, particularly those involving documents whose lengths are an order of magnitude longer. A preliminary study was conducted by expanding the scope of retrieved documents from the top-5 to the top-25.
  • To avoid document position bias, these documents were shuffled and uniformly divided into chunks of five documents each. Each chunk was then compressed by the trained compressor following the standard procedure, and the per-chunk results were concatenated to produce the overall compressed summary (sketched after this list).
  • BRIEF demonstrates better scalability in scenarios where the document length is significantly longer: it remains relatively stable, while RECOMP shows significant performance degradation. This suggests that RECOMP has limited ability to identify relevant evidence within a longer context containing more distracting information. Overall, our findings suggest that BRIEF has the potential to be extended to longer contexts, although compressing them effectively still requires further investigation, which we leave to future work.
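A minimal sketch of this chunked procedure, reusing the compress() function sketched earlier; the chunk size of five follows the description above.

import random

def compress_long(query: str, documents: list[str], chunk_size: int = 5) -> str:
    """Chunked compression for top-25 retrieval: compress each chunk, then concatenate."""
    docs = documents[:]
    random.shuffle(docs)  # avoid document position bias
    chunks = [docs[i:i + chunk_size] for i in range(0, len(docs), chunk_size)]
    return " ".join(compress(query, chunk) for chunk in chunks)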

Takeaways

  • This study pioneers the exploration of long-context reasoning and context compression in RAG for multi-hop questions.
  • A synthetic data pipeline, built entirely by open-source models, is designed to enhance the awareness of multi-hop reasoning and shows promise to scale up due to its low cost.
  • BRIEF, trained on a curated dataset, achieves exceptional QA performance with more concise summaries compared to proprietary LLM-based compressors.
  • We contribute high-quality multi-hop test sets that reveal the limitations of previous compressors, which perform well in single-hop but fall behind our method in multi-hop settings.

BibTeX

@article{li2024brief,
  title   = "BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression",
  author  = "Li, Yuankai and
             Gu, Jia-Chen and
             Wu, Di and
             Chang, Kai-Wei and
             Peng, Nanyun",
  journal = "arXiv preprint arXiv:2410.15277",
  year    = "2024"
}