BRIEF-Pro: Universal Context Compression with Short-to-Long Synthesis for Fast and Accurate Multi-Hop Reasoning

University of California, Los Angeles 
* Equal Contribution

Abstract

As retrieval-augmented generation (RAG) tackles complex tasks, increasingly expanded contexts offer richer information, but at the cost of higher latency and increased cognitive load on the model. To mitigate this bottleneck, especially for intricate multi-hop questions, we introduce BRIEF-PRO, a universal, lightweight compressor that distills relevant evidence for a given query from retrieved documents into a concise summary for seamless integration into in-context RAG. Using seed data consisting of relatively short contexts (fewer than 1k words), BRIEF-PRO is trained to perform abstractive compression of extended contexts exceeding 10k words across a wide range of scenarios. Furthermore, BRIEF-PRO offers flexible user control over summary length by allowing users to specify the desired number of sentences. Experiments on four open-domain multi-hop question-answering datasets show that BRIEF-PRO generates more concise and relevant summaries, enhancing performance across small, large, and proprietary language models. With the 70B reader model, 32x compression by BRIEF-PRO improves QA performance by 4.67% on average over LongLLMLingua's 9x compression, while requiring only 23% of its computational overhead.


BRIEF-Pro

  • BRIEF-Pro pioneers the exploration of multi-hop reasoning and compression of RAG for long contexts of 10k+ words across diverse scenarios.
  • A synthetic data pipeline, built on short-context seed data, is designed to synthesize long-context training data for compression learning.
  • BRIEF-PRO, trained on the curated dataset, generates concise summaries that accelerate inference and enhance the accuracy of a wide range of small, large, and proprietary language models.

Long-Context Synthetic Data Pipeline

 

  • Short-to-Long Context Expansion: We introduce a "short-to-long" context expansion pipeline to turn short documents into long, coherent training examples. We locate each document's source Wikipedia page and exact position, then expand it by adding a controlled number of preceding and following sentences. We sample an expansion ratio from a normal distribution to create substantially longer yet diverse contexts that improve model generalization.
  • Compact Summary Curation: We curate compact summaries from retrieved documents by removing redundant text in oracle annotations. We define a sentence's helpfulness by the LM's end-task performance change when that sentence is removed, and iteratively prune head and tail sentences until only helpful segments remain (see the sketch after this list). We then concatenate the pruned segments across documents to produce the final target summary.
  • User-controllable Compression: We enable user-controllable compression so users can directly set the desired summary length. We create instruction-tuning data by counting the sentences in each prebuilt summary and pairing it with the corresponding instruction, teaching the model the mapping between the numeric request and the actual length. We generate diverse pairs across contexts to help the model generalize and produce summaries of user-specified lengths.
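A minimal sketch of the helpfulness-based pruning step is shown below. It is illustrative rather than the released implementation: answer_score is a hypothetical helper that runs the reader LM on a (context, question) pair and returns an end-task metric such as answer F1, and the tolerance eps is an assumed hyperparameter.

# Minimal sketch of head/tail pruning for compact summary curation. Assumptions:
# `answer_score(context, question, answer)` is a hypothetical helper that queries the
# reader LM and scores its prediction (e.g., answer F1); `eps` is an assumed tolerance
# for how much the score may drop when a sentence is removed.

def prune_document(sentences, question, answer, answer_score, eps=0.0):
    kept = list(sentences)
    base = answer_score(" ".join(kept), question, answer)
    changed = True
    while changed and len(kept) > 1:
        changed = False
        for drop_head in (True, False):  # try removing the first, then the last sentence
            candidate = kept[1:] if drop_head else kept[:-1]
            score = answer_score(" ".join(candidate), question, answer)
            if base - score <= eps:  # removing the sentence does not hurt, so it is unhelpful
                kept, base = candidate, score
                changed = True
                break
    return kept

def curate_summary(docs, question, answer, answer_score):
    # Concatenate the pruned segments across documents to form the target summary.
    return " ".join(" ".join(prune_document(d, question, answer, answer_score)) for d in docs)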

Experimental Results

  • We evaluated the BRIEF-PRO series on the following four open-domain multi-hop QA datasets: MuSiQue (Trivedi et al., 2022), HotpotQA (Yang et al., 2018), and 2WikiMultiHopQA (Ho et al., 2020), all in the extended versions from LongBench (Bai et al., 2024b), as well as LongSeal (Pham et al., 2025).
  • To demonstrate that BRIEF-PRO can benefit a wide range of models, small (Llama-3.1-8B-Instruct), large (Llama-3.1-70B-Instruct), and proprietary (GPT-4.1-nano) language models were used as the reader M.

  • BRIEF-PRO demonstrates promising multi-hop performance in both QA and document compression. Compared to the no-compression setting, BRIEF-PRO-AUTO achieves an average compression rate of 32x while still outperforming it by 6.70%, 0.60%, and 7.27% across the three reader models, respectively.
  • Although the curated training data for FILM-7B and ProLong-8B improves their long-context capabilities, they still significantly underperform BRIEF-PRO while also requiring more computational resources. Compared to LongLLMLingua, BRIEF-PRO-AUTO achieves a higher average compression rate of 32x versus its 9x, while still outperforming it by 6.77%, 4.67%, and 7.77%, respectively.
  • For the BRIEF-PRO series, the three compression levels offer varying granularities in preserving key semantic information, as guided by the user-specified instruction. This balance of efficiency and accuracy demonstrates its robustness and versatility across diverse scenarios.

Analysis

The improvement of latency in terms of overall computational overhead    

  • When BRIEF-PRO is adopted for compression, the overall required TFLOPs are significantly lower than those needed for the original, uncompressed long contexts. The total amount of computation is reduced to 45% and 8% of what it was before compression when using Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct as the reader LM M, respectively (a back-of-the-envelope sketch of this arithmetic follows the list).
  • Compared to LongLLMLingua, which uses Llama-2-7B-Chat to compute sentence and token perplexity, BRIEF-PRO consumes less than 20% and 24% of LongLLMLingua's resources, respectively, while delivering better performance.
  • This substantial reduction in TFLOPs highlights BRIEF-PRO's potential to optimize inference, especially for large-scale long contexts and larger reader models, by enabling reader models to focus on compressed, more relevant information without sacrificing accuracy.
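As a rough sanity check on where these savings come from, one can use the common approximation that a decoder-only forward pass costs about 2 × (number of parameters) × (number of tokens) FLOPs. The sketch below is illustrative only: the context length and the compressor size are assumptions, not the measured settings behind the numbers above.

# Back-of-the-envelope estimate of total compute with and without compression,
# using the approximation FLOPs_forward ≈ 2 * n_params * n_tokens.
# The context length and compressor size below are assumptions for illustration;
# the measured ratios reported above come from the paper's own profiling.

def forward_flops(n_params, n_tokens):
    return 2 * n_params * n_tokens

long_ctx_tokens = 10_000                # assumed uncompressed retrieved context
summary_tokens = long_ctx_tokens // 32  # roughly 32x compression

compressor_cost = forward_flops(8e9, long_ctx_tokens)       # assumed 8B-scale compressor reads the full context
reader_uncompressed = forward_flops(70e9, long_ctx_tokens)  # 70B reader on the raw context
reader_compressed = forward_flops(70e9, summary_tokens)     # 70B reader on the summary only

ratio = (compressor_cost + reader_compressed) / reader_uncompressed
print(f"compressed pipeline uses roughly {ratio:.0%} of the uncompressed reader's compute")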

The comparison with exhaustively expanding only oracle documents

  • To demonstrate the effectiveness of expanding both oracle and distractor documents to form the input long context, our approach is compared against a strategy that performs exhaustive expansion solely on oracle documents, utilizing complete Wikipedia pages.
  • The experimental results demonstrate a significant performance degradation when only oracle documents are expanded. This suggests that expanding only oracle documents might lead to a somewhat artificially "clean" context, potentially overestimating the model's ability to handle complex, noisy inputs.
  • In contrast, incorporating the expansion of distractor documents provides essential contextual diversity and better reflects realistic long-context scenarios, where relevant and irrelevant information is often interspersed.
  • Moreover, our method synthesizes significantly longer contexts on average (6.0k vs. 3.6k words). The ability to process and learn from these substantially longer, noisy contexts directly contributes to the observed performance gains.

The accuracy of user-controllable instructions in terms of target sentence count

  • The accuracy of the instruction hinges on how precisely the system adheres to the user-specified sentence count. While the intent is to provide a flexible and intuitive mechanism for controlling summary granularity, the actual accuracy of this control can vary. Table 3 presents the average sentence counts of the generated summaries under various compression modes across all four test sets.
  • Although fitting the summary perfectly within the specified sentence limit is challenging, the results show that BRIEF-PRO follows the HIGH and MEDIUM compression instructions well. This is because the training data contains sufficient target summaries within this length range.

BibTeX

@misc{gu2025briefprouniversalcontextcompression,
            title={BRIEF-Pro: Universal Context Compression with Short-to-Long Synthesis for Fast and Accurate Multi-Hop Reasoning}, 
            author={Jia-Chen Gu and Junyi Zhang and Di Wu and Yuankai Li and Kai-Wei Chang and Nanyun Peng},
            year={2025},
            eprint={2510.13799},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2510.13799}, 
        }

BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression

Accepted by NAACL 2025 (Findings)🎉

1Fudan University   2University of California, Los Angeles

*Equal contribution

Abstract

Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge. However, as the number of retrieved documents increases, the input length to LLMs grows linearly, causing a dramatic increase in latency and a degradation in long-context understanding. This is particularly serious for multi-hop questions that require a chain of reasoning across documents. To accelerate inference, reduce costs, and minimize distractions, this paper presents BRIEF (Bridging Retrieval and Inference through Evidence Fusion), a lightweight approach that performs query-aware multi-hop reasoning by compressing retrieved documents into highly dense textual summaries to integrate into in-context learning. To enable learning compression for multi-hop reasoning, we curate synthetic data by extracting atomic proposition expressions that encapsulate distinct factoids from the source documents to compose synthetic summaries. Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries and enables a range of LLMs to achieve exceptional open-domain question answering (QA) performance. For example, on HotpotQA, BRIEF improves the compression rate by 2 times compared to the state-of-the-art baseline, while outperforming it by 3.00% EM and 4.16% F1 with Flan-UL2 as the reader LM. It also generates more concise summaries than proprietary GPT-3.5, while demonstrating nearly identical QA performance.


The BRIEF Compressor

BRIEF is a lightweight, T5-based approach that performs query-aware multi-hop reasoning by compressing retrieved documents into highly dense textual summaries to integrate into in-context learning.

Unlike conventional methods that focus on compression for single-hop questions (Xu et al., 2024a; Cao et al., 2024), BRIEF is specifically trained to summarize the most pertinent knowledge from multiple documents that is essential for answering multi-hop questions.

Compared to token-, phrase-, or sentence-level compression (Jiang et al., 2023; Li et al., 2023), the summaries produced by BRIEF organize and synthesize evidence relevant to the query in a more concise and natural language format, making them more effective for use by the downstream reader LM.
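The way a compressed summary slots into in-context RAG can be sketched briefly. This is a minimal illustration under assumed prompt templates; compressor and reader are stand-ins for any query-aware compressor (e.g., a T5-style model) and any reader LM, not the exact interfaces released with BRIEF.

# Minimal sketch of compress-then-read for in-context RAG. The prompt templates and
# the `compressor`/`reader` callables are assumptions for illustration only.

def compress_then_read(compressor, reader, question, documents):
    context = "\n\n".join(documents)
    # Query-aware compression: the question is provided alongside the retrieved documents.
    summary = compressor(f"Question: {question}\n\nDocuments:\n{context}")
    # The concise summary replaces the raw documents in the reader's prompt.
    prompt = f"{summary}\n\nQuestion: {question}\nAnswer:"
    return reader(prompt)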

Synthetic Data

Unlike state-of-the-art fine-tuned compressors distilled from extreme-scale proprietary LLMs (Xu et al., 2024a), BRIEF is trained on synthetic data through a pipeline built entirely with open-source models, without relying on any proprietary LLMs or human annotations.

 

  • The synthetic data pipeline extracts atomic proposition expressions that encapsulate distinct factoids from the source documents to compose synthetic summaries.
  • The pipeline designs an automatic validation mechanism to filter out spurious multi-hop questions and corresponding summaries, ensuring that only those requiring genuine multi-hop reasoning are retained, ultimately improving the quality and reliability of the synthetic data (see the sketch after this list).
  • In addition, the synthetic data exhibits a strong awareness of multi-hop reasoning and the potential to scale up, offering a data-centric approach to constructing high-quality and cost-effective synthetic data for context compression.
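The validation idea in the second bullet can be sketched as a simple filter: a synthetic question is kept only if the composed propositions suffice to answer it and no single document's propositions do so alone. answers_correctly is a hypothetical LM-based checker; the actual criteria used in BRIEF may differ.

# Minimal sketch of filtering out spurious multi-hop questions. Assumption:
# `answers_correctly(propositions, question, answer)` is a hypothetical helper that asks
# an open-source LM to answer from the given propositions and checks the prediction.

def is_genuinely_multi_hop(props_by_doc, question, answer, answers_correctly):
    all_props = [p for props in props_by_doc for p in props]
    if not answers_correctly(all_props, question, answer):
        return False  # the composed summary is insufficient; discard the example
    for props in props_by_doc:
        if answers_correctly(props, question, answer):
            return False  # answerable from a single document, so not genuinely multi-hop
    return True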

Experimental Results

We evaluated BRIEF on the following datasets: HotpotQA (Yang et al., 2018), MuSiQue (Trivedi et al., 2022), Natural Questions (NQ) (Kwiatkowski et al., 2019), and TriviaQA (Joshi et al., 2017). Notably, the first two datasets primarily consist of multi-hop questions, whereas the latter two are mainly composed of single-hop questions. For TriviaQA and NQ in particular, we curated high-quality multi-hop versions, named MultiHop-TriviaQA and MultiHop-NQ, using our proposed synthetic data pipeline. We contribute these high-quality multi-hop test sets to reveal the limitations of previous compressors, which perform well in single-hop settings but fall behind our method in multi-hop settings.

  • BRIEF achieves a compression rate of 19.19x, with only a 1.60-point decrease in EM and a 1.83-point decrease in F1 compared to prepending full documents on HotpotQA.
  • Compared to RECOMP, BRIEF achieves a higher compression rate of 19.19x versus its 10.02x, while still outperforming it by 3.00 points EM and 4.16 points F1 on HotpotQA. On MultiHop-NQ, we observed a similar trend, with BRIEF's 16.85x exceeding RECOMP's 10.84x while outperforming RECOMP by 3.78 points EM and 4.39 points F1.
  • Compared to the proprietary LLM GPT-3.5, BRIEF achieves higher compression rates while delivering competitive QA performance. Taking HotpotQA as an example, GPT-3.5 achieves a compression rate of 14.77x with 31.60% EM and 42.65% F1, while BRIEF reaches a higher 19.19x and still delivers nearly identical results of 31.20% EM and 42.07% F1.

  • BRIEF achieves a compression rate of 29.76x, with only a 2.55-point decrease in EM and a 3.49-point decrease in F1 compared to prepending full documents on TriviaQA. On NQ, we observed a similar trend, with a compression rate of 17.67x, resulting in only a 2.99-point decrease in EM and a 3.28-point decrease in F1.
  • Compared to RECOMP, BRIEF achieves a higher compression rate of 29.76x versus its 16.23x, while still outperforming RECOMP on TriviaQA.
  • Compared to GPT-3.5, BRIEF achieves competitive QA performance, while its compression rate of 17.67x significantly exceeds GPT-3.5's 11.33x.

Analysis

The transfer ability of compressed summaries across LMs

  • This ability involves evaluating how well a compressed summary can maintain the core semantics relevant to the query, while also using an expression format that is compatible with a wider range of LMs. We selected models from the same family to avoid model selection bias.
  • Since our compression takes the form of propositions, it is more interpretable and transfers better across LMs compared to RECOMP and GPT-3.5.
  • In comparison to RECOMP and GPT-3.5, the performance of BRIEF drops by a smaller margin when transferring from Phi-3-mini to Phi-3-small, and improves by a larger margin from Phi-3-small to Phi-3-medium. These results imply the robustness and consistency of the compressed summaries generated by BRIEF.

The sensitivity of summary length to multi-hop nature of questions

  • The variation in summary length regarding question complexity can, to some extent, reflect the compressor’s sensitivity to that complexity.
  • As there is no established ground truth for the length of compressed summaries for each question, the results from GPT-3.5 were used as the reference oracle.
  • The results indicate that BRIEF consistently aligns with GPT-3.5 in its sensitivity to the multi-hop nature of questions while generating more concise summaries. This alignment suggests that BRIEF effectively understands question complexity and adaptively collects the necessary evidence to formulate a complete and accurate summary for answering the question.

The improvement of latency in terms of the overall computational overhead

  • When employing BRIEF for compression, the number of GFLOPs required to process the compressed documents is significantly reduced compared to the amount required when using Flan-UL2 alone on the original, uncompressed set of top-5 documents. The total amount of computation is reduced to less than 30% of what it was before compression.
  • This reduction in GFLOPs highlights BRIEF's potential to optimize inference, especially for large-scale document retrieval and processing, by enabling the LM to focus on compressed, more relevant information while maintaining comparable accuracy.

The scalability to compress longer documents

  • We further explored whether the proposed compressors could be effectively applied to more complex scenarios, particularly those involving documents whose lengths are an order of magnitude longer. A preliminary study was conducted by expanding the scope of retrieved documents from the top-5 to the top-25.
  • To avoid document position bias, these documents were shuffled and uniformly divided into document chunks, each containing five documents. Each chunk was then compressed using the trained compressor according to standard procedures. Finally, the compressed results of each chunk were concatenated to produce the overall compressed summary (see the sketch after this list).
  • BRIEF demonstrates better scalability in scenarios where the document length is significantly longer. BRIEF remains relatively stable, while RECOMP shows significant performance degradation. This result suggests that RECOMP has a limited ability to identify relevant evidence within a longer context containing more distracting information. Overall, our findings suggest that BRIEF has the potential to be extended, but compressing even longer contexts still requires further investigation, which we leave to future work.
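The chunked protocol above can be sketched as follows; compress(question, docs) is a stand-in for the trained compressor, and the chunk size of five mirrors the setup described in the list.

# Minimal sketch of the chunked compression protocol for top-25 retrieval: shuffle to
# avoid position bias, split into chunks of five documents, compress each chunk, and
# concatenate the per-chunk summaries. `compress(question, docs)` is a stand-in for
# the trained compressor.

import random

def compress_long_retrieval(question, documents, compress, chunk_size=5, seed=0):
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    summaries = [compress(question, docs[i:i + chunk_size])
                 for i in range(0, len(docs), chunk_size)]
    return " ".join(summaries)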

Takeaways

  • This study pioneers the exploration of long-context reasoning and compression of RAG for multi-hop questions.
  • A synthetic data pipeline, built entirely by open-source models, is designed to enhance the awareness of multi-hop reasoning and shows promise to scale up due to its low cost.
  • BRIEF, trained on a curated dataset, achieves exceptional QA performance with more concise summaries compared to proprietary LLM-based compressors.
  • We contribute high-quality multi-hop test sets that reveal the limitations of previous compressors, which perform well in single-hop but fall behind our method in multi-hop settings.

BibTeX

@article{li2024brief,
 title = "BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression",
 author = "Li, Yuankai  and
           Gu, Jia-Chen  and
           Wu, Di  and
           Chang, Kai-Wei  and
           Peng, Nanyun",
 journal={arXiv preprint arXiv:2410.15277},
 year = "2024"
}