[2602.13427] Backdooring Bias in Large Language Models
Summary
The paper explores backdoor attacks in large language models (LLMs), focusing on how biases can be induced through syntactically and semantically triggered methods, and evaluates the effectiveness of defense mechanisms against these attacks.
Why It Matters
As LLMs become more prevalent, understanding how biases can be manipulated through backdoor attacks is crucial for ensuring ethical AI deployment. This research highlights vulnerabilities in model training and the challenges of bias mitigation, which are essential for developers and researchers in AI safety.
Key Takeaways
- Backdoor attacks can effectively induce biases in LLMs using both syntactic and semantic triggers.
- Semantically-triggered attacks are generally more effective at inducing negative biases compared to syntactically-triggered ones.
- Existing defense mechanisms can mitigate backdoor attacks but often at the cost of model utility or require significant computational resources.
- The research emphasizes the need for a white-box threat model to better understand the adversarial capabilities of model builders.
- High poisoning ratios and data augmentation can enhance the understanding of backdoor attack potential.
Computer Science > Cryptography and Security arXiv:2602.13427 (cs) [Submitted on 13 Feb 2026] Title:Backdooring Bias in Large Language Models Authors:Anudeep Das, Prach Chantasantitam, Gurjot Singh, Lipeng He, Mariia Ponomarenko, Florian Kerschbaum View a PDF of the paper titled Backdooring Bias in Large Language Models, by Anudeep Das and 5 other authors View PDF Abstract:Large language models (LLMs) are increasingly deployed in settings where inducing a bias toward a certain topic can have significant consequences, and backdoor attacks can be used to produce such models. Prior work on backdoor attacks has largely focused on a black-box threat model, with an adversary targeting the model builder's LLM. However, in the bias manipulation setting, the model builder themselves could be the adversary, warranting a white-box threat model where the attacker's ability to poison, and manipulate the poisoned data is substantially increased. Furthermore, despite growing research in semantically-triggered backdoors, most studies have limited themselves to syntactically-triggered attacks. Motivated by these limitations, we conduct an analysis consisting of over 1000 evaluations using higher poisoning ratios and greater data augmentation to gain a better understanding of the potential of syntactically- and semantically-triggered backdoor attacks in a white-box setting. In addition, we study whether two representative defense paradigms, model-intrinsic and model-extrinsic backdoor remov...