Abstract: The continued improvements in language model capability have unlocked their widespread use as drivers of autonomous agents, for example in coding or computer use applications. However, the core of these systems has not changed much since early instruction-tuned models like ChatGPT. Even advanced AI agents function on message exchange formats, successively exchanging messages with users, systems, with itself (i.e. chain-of-thought) and tools in a single stream of computation. This bottleneck to a single stream in chat models leads to a number of limitations: the agent cannot act (generate output) while reading, and in reverse, cannot react to new information while writing. Similarly, the agent cannot act while thinking and cannot think while reading or acting on information. In this work, we show that models can be unblocked by switching from instruction-tuning for sequential message formats to instruction-tuning for multiple, parallel streams of computation, splitting each role into a separate stream. Every forward pass of the language model then simultaneously reads from multiple input streams and generates tokens in multiple output streams, all of which causally depend on earlier timesteps. We argue that this data-driven change remedies a number of usability limitations as outlined above, improves model efficiency through parallelization, improves model security through better separation of concerns and can further improve model monitorability.
"However, our models are nevertheless relatively small and trained on tiny amounts of instruction examples, compared to the scale of modern instruction data and multiple post-training stages used to reinforce the default message-based format. We do think that parallel streams are a conceptually enticing format, and that future work on a larger scale will go further to show these benefits."
It's the same asynchronous stream pattern we're used to dealing with in regular software engineering. We have a fixed thread pool and lots of work that can be scheduled concurrently. Since these are streams, we can do the compute incrementally to reduce the time-to-first-byte/token/response.
Since so many tool calls are inherently asynchronous, and subagent task decomposition can be modelled as such, the IO streams can be oversubscribed, and incoming responses can be priority queued.
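To make the software-engineering analogy concrete, here is a minimal sketch of the pattern in plain Python `asyncio`. It is not from the paper: `tool_call`, `run_agent`, the call names, the delays, and the priority values are all hypothetical. Several tool calls are launched at once, and their responses land in a priority queue so the most urgent one is handled first regardless of arrival order.

```python
import asyncio


async def tool_call(name, delay, priority, inbox):
    """Hypothetical tool call: waits `delay` seconds, then posts its result."""
    await asyncio.sleep(delay)
    # Lower number = more urgent. The tuple sorts by priority first.
    await inbox.put((priority, name))


async def run_agent():
    inbox = asyncio.PriorityQueue()
    # (name, simulated latency in seconds, priority): all values are made up.
    calls = [
        ("web_search", 0.03, 2),
        ("user_interrupt", 0.02, 0),
        ("subagent_report", 0.01, 1),
    ]
    # Launch every call at once instead of waiting on each in turn.
    tasks = [asyncio.create_task(tool_call(n, d, p, inbox)) for n, d, p in calls]
    await asyncio.gather(*tasks)

    # Drain the queue: responses come out by priority, not by arrival time.
    handled = []
    while not inbox.empty():
        _, name = inbox.get_nowait()
        handled.append(name)
    return handled


if __name__ == "__main__":
    print(asyncio.run(run_agent()))
```

For simplicity the sketch waits for all calls before draining the queue; a real agent loop would presumably consume responses as they arrive and keep generating in the meantime.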
On the intelligence front, it's incredible how much better frontier models perform when you just interrupt them every so often and go 'is that the best you can do?', or reiterate instructions, or repeat the overall goal. I find instruction following _so poor_, especially for 'presentation layer' aspects. Yet if I ask the model to rewrite its last response, it does so perfectly. Why can't the model do this 'internally' and save me having to say 'try again'?
Just because the 'model' is autoregressive doesn't mean the system as a whole needs to present a single stream of immutable text.
It also gives a lot of new levers to play with. I'd assume you could tweak (sweep?) the amount of attention given to the same stream vs. cross stream, have different streams prompted / seeded with an objective, score each independently vs. together, etc. A bit reminiscent of the direction OpenAI took with their Harmony template, where they define channels and the model learns to output to each channel (but that's sequential).
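As a rough illustration of what a same-stream vs. cross-stream attention lever could look like, here is a pure-Python sketch of an additive attention bias. This is an assumption for discussion, not the paper's mechanism: the token layout (one token per stream per timestep), the function name `stream_attention_bias`, and the `cross_stream_penalty` knob are all hypothetical. The only constraint taken from the abstract is that every stream depends causally on earlier timesteps.

```python
import math


def stream_attention_bias(num_streams, num_steps, cross_stream_penalty=0.0):
    """Build an additive attention bias for tokens laid out as (timestep, stream).

    Token index = t * num_streams + s. A query may attend to any token from a
    strictly earlier timestep, plus itself; everything else is masked with -inf.
    Keys from a different stream have a hypothetical penalty subtracted from
    their logit, which is the knob one could sweep.
    """
    n = num_streams * num_steps
    bias = [[-math.inf] * n for _ in range(n)]
    for q in range(n):
        q_t, q_s = divmod(q, num_streams)
        for k in range(n):
            k_t, k_s = divmod(k, num_streams)
            if k_t < q_t or k == q:
                bias[q][k] = 0.0 if k_s == q_s else -cross_stream_penalty
    return bias


# Two streams, two timesteps, cross-stream logits penalized by 1.0.
bias = stream_attention_bias(num_streams=2, num_steps=2, cross_stream_penalty=1.0)
```

Setting `cross_stream_penalty` to 0 treats all earlier tokens alike; raising it pushes each stream toward attending mostly to its own history.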
Would have loved to see even a small attempt at RL on top of this. Could probably get gnarly with so many avenues to explore, but even a few hundred steps could have informed if there's something to it.
One concern I have is with how the data was prepared. They used an 80B model to transform from sequential instruct format to this multi-stream format. There are a lot of ways where stuff can "leak" from the process, and contaminate the results. That's why I'd have loved to see some further RL on this, but anyway. Cool paper, worth a revisit sometime.
I am perfectly content with a medium-speed golden goose. It seems to be a lot more predictable and happy this way. The business and other developers are already saturated by the serialized technique. Going faster would only serve to distract others at this point.
I think Navy SEALs have an apt slogan here.