Looks like fine-tuning outperforms RAG and large-context LLMs for data accuracy with longer contexts.

Paper: https://lnkd.in/gj9-3jsK

Summary of "BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack"

- Introduction to the BABILong Benchmark:
  - Designed to test LLMs' reasoning abilities over extremely long documents.
  - Comprises 20 diverse reasoning tasks, such as fact chaining, induction, deduction, counting, and handling lists/sets.
  - Uses natural text from the PG19 corpus, extendable to document lengths of up to 1 million tokens.
- Current LLM Performance:
  - Popular LLMs effectively use only 10-20% of the context.
  - Performance declines as reasoning complexity increases.
  - Retrieval-Augmented Generation (RAG) methods achieve 60% accuracy on single-fact QA, independent of context length.
- Top-Performing Models:
  - Recurrent Memory Transformers (RMT) show the best performance, processing up to 11 million tokens.
  - Small models like Mamba (130M) and fine-tuned RMT (137M) achieve high accuracy on long-context tasks.
- Evaluation Findings:
  - Current benchmarks (e.g., LongBench, L-Eval) are insufficient for LLMs with capabilities beyond 40,000 tokens.
  - BABILong tests models' efficiency in distinguishing relevant facts from irrelevant text over long contexts.
  - Popular LLMs fail to maintain high performance as context length and complexity increase.
- Fine-Tuning Insights:
  - Fine-tuning improves model performance significantly.
  - Mamba and RMT demonstrate successful QA with long contexts, outperforming RAG models.
  - RMT's memory mechanism allows efficient processing of long sequences, with only marginal quality degradation up to 11 million tokens.
- Benchmark Significance:
  - BABILong provides a rigorous evaluation framework for LLMs' long-context reasoning abilities.
  - Highlights the limitations of current LLMs and the need for improved context-processing mechanisms.
- Key Contributions:
  - Introduction of BABILong, a scalable, generative multi-task benchmark.
  - Evaluation of over 20 recent LLMs across various context lengths and tasks.
  - Demonstration of current models' limitations in long-context utilization.
  - A new record for sequence size processed by a single model, set with RMT.
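To make the benchmark's construction concrete: the core idea is to scatter the facts needed for a bAbI-style question through long stretches of irrelevant background text, so the model must find and chain them. Here is a minimal Python sketch of that idea; the function name, parameters, and word-level "tokenization" are illustrative assumptions, not the paper's actual code or API.

```python
import random

def make_haystack_sample(facts, question, distractor_text, target_words=1000):
    """Build a BABILong-style sample (illustrative sketch, not the paper's code):
    scatter the task facts, in order, through irrelevant background text.
    The paper draws its background text from the PG19 book corpus; here we
    approximate context length crudely by word count."""
    background = distractor_text.split()[:target_words]
    # Choose random insertion points, sorted so facts keep their original order.
    positions = sorted(random.sample(range(len(background)), len(facts)))
    # Insert from the back so earlier positions remain valid.
    for pos, fact in zip(reversed(positions), reversed(facts)):
        background.insert(pos, fact)
    context = " ".join(background)
    return f"{context}\n\nQuestion: {question}"

sample = make_haystack_sample(
    facts=["Mary went to the kitchen.", "Mary picked up the apple."],
    question="Where is the apple?",
    distractor_text="It was a dark and stormy night. " * 500,
)
```

Because the distractor text can be extended arbitrarily, the same task scales from a few thousand words to book-length contexts, which is what lets BABILong probe models at 1M+ tokens.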
Nice find, Fahad Jalal. The BABILong eval shows the limitations of popular LLMs and how RMT stands out, and fine-tuning significantly improves accuracy. It's crucial to advance models so they can harness long-form data. Excited to see the field respond to this benchmark. Nicely done!