I also have my gripes about the way 2-hop reasoning is presented here, with figure 3 being the canonical example of what I would consider too trivial/misleading (the exact text "Eric Watts" appears verbatim in both the question and the context). It raises the natural question of how it does compared to an LLM with a grep tool.
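To make the grep comparison concrete, here's a toy sketch (the context lines are entirely invented) of what that baseline amounts to: when the question's entity string appears verbatim in the context, "retrieval" collapses to substring search, no attention required.

```python
# Toy stand-in for a huge document dump (contents made up for illustration).
context = (
    "... Eric Watts joined the lab in 2014. ...\n"
    "... unrelated filler about other people ...\n"
    "... Eric Watts later moved to the compilers team. ...\n"
)

def grep_baseline(entity: str, context: str) -> list[str]:
    """Return every line mentioning the entity verbatim --
    this is all a figure-3-style exact-match question needs."""
    return [line for line in context.splitlines() if entity in line]

hits = grep_baseline("Eric Watts", context)
```

If the answer can be assembled from `hits` alone, the benchmark is testing string lookup, not synthesis over the context.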
What I would consider more interesting is practical synthesis over such a large context, where you can't just string-lookup the answers. For example, dumping all of Intel's x86 manuals into context and then asking the LLM to write assembly or something.
The more we can drive towards selective attention over larger and larger sets of "working memory", the better, I think.
I suspect cleverer mechanisms for context injection/pruning/updating will yield effective memory more than endlessly increasing the context window will, regardless of what tricks we apply to distil attention over it.
There is probably a lot of low hanging fruit in this area.
I also think some of the benchmarks are misleading. Having a RAG system do an attention benchmark and then comparing it against a model without RAG just isn't fair. It is obviously better, but it's not apples to apples. Some of the benchmarks do compare against model+RAG, and there the delta in performance is much smaller.