The StreamingLLM framework launches, claiming to "let large models process text of unlimited length"
Researchers from MIT and Meta AI have recently released a framework called StreamingLLM, which proposes solutions to the memory and generalization problems that large language models encounter on long inputs. It claims to allow language models to handle text of "infinite length".
▲ Image source: GitHub
The research focus of StreamingLLM is removing the obstacles to building Efficient Streaming Language Models (ESLM), especially the problems that arise in multi-turn dialogue scenarios with long interactions.
The researchers point out two main challenges for such streaming language models:
The first challenge: during the decoding phase, caching the Key and Value (KV) states of previous tokens consumes a large amount of memory. The second challenge: currently popular large language models generalize poorly to texts longer than their training sequence length.
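To see why KV caching strains memory, a back-of-the-envelope calculation helps. The sketch below assumes Llama-2-7B-like dimensions (32 layers, 32 attention heads, head dimension 128, fp16 storage); these numbers are illustrative assumptions, not figures from the announcement.

```python
def kv_bytes_per_token(n_layers=32, n_heads=32, head_dim=128, bytes_per_elem=2):
    """Memory needed to cache the Key and Value states of ONE token.

    Assumes Llama-2-7B-like dimensions in fp16 (2 bytes per element);
    the leading factor of 2 accounts for storing both K and V.
    """
    return 2 * n_layers * n_heads * head_dim * bytes_per_elem

per_token = kv_bytes_per_token()       # 524_288 bytes = 0.5 MiB per token
cache_4k = per_token * 4096 / 2**30    # 2.0 GiB for a 4K-token context
print(f"{per_token} bytes/token, {cache_4k:.1f} GiB at 4096 tokens")
```

Under these assumptions the cache grows linearly with sequence length: half a mebibyte per token, or about 2 GiB for a single 4K-token context, before counting the model weights themselves.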
The authors note that many past studies have tried to address these challenges. One line of work expands the attention window so the language model can handle texts longer than the pre-training sequence length. Another establishes a fixed-size sliding window that caches only the KV states of the most recent tokens, which keeps memory usage and decoding speed stable; however, this strategy breaks down once the sequence length exceeds the cache size.
The biggest challenge for current streaming language models is therefore how to process long text input without consuming too much memory and without degrading model performance.
The strategy adopted by StreamingLLM is to exploit the "attention sink" phenomenon. The researchers observed that in autoregressive language models, a disproportionately large share of attention is allocated to the initial tokens of the sequence, regardless of how relevant those tokens are to the text being generated. These high-attention tokens act as attention sinks: even when they are not semantically important, they still receive strong attention from the model. By retaining the KV states of these sink tokens alongside a sliding window of recent tokens, the model's attention computation remains stable no matter how long the input sequence grows.
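This retention rule can be sketched as a function over token positions: keep the first few sink tokens plus the most recent window. The default of 4 sink tokens follows the paper; the window size here is an arbitrary small value for illustration.

```python
def streaming_kept_positions(seq_len, n_sink=4, window=8):
    """Token positions whose KV states a StreamingLLM-style cache retains:
    the first n_sink "attention sink" tokens plus the most recent `window`
    tokens. Everything in between is evicted, so memory stays bounded at
    n_sink + window entries regardless of sequence length.
    """
    if seq_len <= n_sink + window:
        return list(range(seq_len))          # cache not yet full: keep all
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

print(streaming_kept_positions(20))
# [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

The key difference from a plain sliding window is those first four positions: they never leave the cache, so the attention distribution the model learned during pre-training stays anchored even after millions of tokens have streamed past.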
The important contribution of StreamingLLM is a simple and efficient scheme that lets a language model handle text of unlimited length without any fine-tuning, resolving the current dilemma of language models in streaming applications. Although streaming language models will be essential in the future, their development has so far been held back by limits on memory efficiency and by performance problems on long sequences.
The research team has confirmed that StreamingLLM enables Llama 2, MPT, Falcon, and Pythia to reliably process text of up to 4 million tokens, opening up more deployment possibilities for streaming language models.