Llms Machine Learning Robotics Ai Safety

[2602.13427] Backdooring Bias in Large Language Models

arXiv - AI February 17, 2026 4 min read Article

Summary

The paper explores backdoor attacks in large language models (LLMs), focusing on how biases can be induced through syntactically and semantically triggered methods, and evaluates the effectiveness of defense mechanisms against these attacks.

Why It Matters

As LLMs become more prevalent, understanding how biases can be manipulated through backdoor attacks is crucial for ensuring ethical AI deployment. This research highlights vulnerabilities in model training and the challenges of bias mitigation, which are essential for developers and researchers in AI safety.

Key Takeaways

Backdoor attacks can effectively induce biases in LLMs using both syntactic and semantic triggers.
Semantically-triggered attacks are generally more effective at inducing negative biases compared to syntactically-triggered ones.
Existing defense mechanisms can mitigate backdoor attacks but often at the cost of model utility or require significant computational resources.
The research emphasizes the need for a white-box threat model to better understand the adversarial capabilities of model builders.
High poisoning ratios and data augmentation can enhance the understanding of backdoor attack potential.

Computer Science > Cryptography and Security arXiv:2602.13427 (cs) [Submitted on 13 Feb 2026] Title:Backdooring Bias in Large Language Models Authors:Anudeep Das, Prach Chantasantitam, Gurjot Singh, Lipeng He, Mariia Ponomarenko, Florian Kerschbaum View a PDF of the paper titled Backdooring Bias in Large Language Models, by Anudeep Das and 5 other authors View PDF Abstract:Large language models (LLMs) are increasingly deployed in settings where inducing a bias toward a certain topic can have significant consequences, and backdoor attacks can be used to produce such models. Prior work on backdoor attacks has largely focused on a black-box threat model, with an adversary targeting the model builder's LLM. However, in the bias manipulation setting, the model builder themselves could be the adversary, warranting a white-box threat model where the attacker's ability to poison, and manipulate the poisoned data is substantially increased. Furthermore, despite growing research in semantically-triggered backdoors, most studies have limited themselves to syntactically-triggered attacks. Motivated by these limitations, we conduct an analysis consisting of over 1000 evaluations using higher poisoning ratios and greater data augmentation to gain a better understanding of the potential of syntactically- and semantically-triggered backdoor attacks in a white-box setting. In addition, we study whether two representative defense paradigms, model-intrinsic and model-extrinsic backdoor remov...

Read Original Article

[2602.13427] Backdooring Bias in Large Language Models

Summary

Why It Matters

Key Takeaways

Related Articles

Is the Mirage Effect a bug, or is it Geometric Reconstruction in action? A framework for why VLMs perform better "hallucinating" than guessing, and what that may tell us about what's really inside these models

Paper Finds That Leading AI Chatbots Like ChatGPT and Claude Remain Incredibly Sycophantic, Resulting in Twisted Effects on Users

Claude Code leak exposes a Tamagotchi-style ‘pet’ and an always-on agent | The Verge

You can now use ChatGPT with Apple’s CarPlay | The Verge

No comments

Stay updated with AI News