[2408.00539] Intermittent Semi-Working Mask: A New Masking Paradigm for LLMs
Summary
The paper introduces the Intermittent Semi-Working Mask (ISM), a novel masking paradigm for Large Language Models (LLMs) that enhances multi-turn dialogue and context-intensive tasks while maintaining efficiency.
Why It Matters
LLMs struggle to integrate long dialogue histories without sacrificing generation quality or inference speed. ISM balances contextual understanding with inference efficiency, which could make LLMs markedly more effective in real-world multi-turn and context-intensive applications.
Key Takeaways
- ISM integrates sparse bidirectional attention into causal LLMs.
- It eliminates the need for triplet expansion during training.
- The approach maintains KV-cache reuse, reducing latency.
- ISM outperforms traditional causal baselines in multi-turn dialogues.
- The method is architecture-agnostic and adds minimal latency.
Computer Science > Computation and Language
arXiv:2408.00539 (cs)
[Submitted on 1 Aug 2024 (v1), last revised 17 Feb 2026 (this version, v2)]
Authors: HaoYuan Hu, Mingcong Lu, Di Luo, XinYa Wu, Jiangcai Zhu, Taoye Yin, Zheng Li, Hao Wang, Shusheng Zhang, KeZun Zhang, KaiLai Shao, Chao Chen, Feng Wang
Abstract
Multi-turn dialogues and context-intensive tasks challenge Large Language Models (LLMs) to integrate long histories without sacrificing generation quality. Although prefix LLMs can better exploit historical context via bidirectional attention on prefix tokens, they are rarely used in practice: multi-turn training requires many duplicated triplets, and the bidirectional prefix prevents KV-cache reuse at inference time, driving up cost and latency. To retain the contextual understanding of the prefix mask while preserving the inference-time efficiency of the causal mask, we introduce the Intermittent Semi-Working Mask (ISM), a masking scheme that injects sparse bidirectional attention into the causal backbone. ISM alternates bidirectional attention over query segments with unidirectional attention over answer segments, enabling in-context synthesis while preserving global causality. This design eliminates triplet expansion during training...
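The alternating scheme the abstract describes can be sketched as a mask-construction routine. The following is a minimal illustration, not the paper's implementation: the function name `ism_mask` and the `(length, kind)` segment encoding are assumptions for the sketch. Query segments attend bidirectionally within themselves, answer segments attend causally, and every token attends to all preceding segments, preserving global causality.

```python
import numpy as np

def ism_mask(segments):
    """Build an ISM-style attention mask (1 = attend, 0 = blocked).

    `segments` is a list of (length, kind) pairs in temporal order,
    where kind is "query" or "answer". This is an illustrative sketch
    of the masking scheme described in the abstract, not the authors'
    actual implementation.
    """
    n = sum(length for length, _ in segments)
    mask = np.zeros((n, n), dtype=int)
    start = 0
    for length, kind in segments:
        end = start + length
        # global causality: attend to every token in earlier segments
        mask[start:end, :start] = 1
        if kind == "query":
            # bidirectional attention inside a query segment
            mask[start:end, start:end] = 1
        else:
            # causal (lower-triangular) attention inside an answer segment
            mask[start:end, start:end] = np.tril(
                np.ones((length, length), dtype=int)
            )
        start = end
    return mask

# A two-token query followed by a two-token answer: the query tokens
# see each other in both directions; the answer tokens remain causal.
print(ism_mask([(2, "query"), (2, "answer")]))
```

Because no answer token ever attends to a future position, the mask stays compatible with incremental decoding and KV-cache reuse, which is the efficiency property the paper emphasizes.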