[2602.13427] Backdooring Bias in Large Language Models

[2602.13427] Backdooring Bias in Large Language Models

arXiv - AI 4 min read Article

Summary

The paper explores backdoor attacks in large language models (LLMs), focusing on how biases can be induced through syntactically and semantically triggered methods, and evaluates the effectiveness of defense mechanisms against these attacks.

Why It Matters

As LLMs become more prevalent, understanding how biases can be manipulated through backdoor attacks is crucial for ensuring ethical AI deployment. This research highlights vulnerabilities in model training and the challenges of bias mitigation, which are essential for developers and researchers in AI safety.

Key Takeaways

  • Backdoor attacks can effectively induce biases in LLMs using both syntactic and semantic triggers.
  • Semantically-triggered attacks are generally more effective at inducing negative biases compared to syntactically-triggered ones.
  • Existing defense mechanisms can mitigate backdoor attacks but often at the cost of model utility or require significant computational resources.
  • The research emphasizes the need for a white-box threat model to better understand the adversarial capabilities of model builders.
  • High poisoning ratios and data augmentation can enhance the understanding of backdoor attack potential.

Computer Science > Cryptography and Security arXiv:2602.13427 (cs) [Submitted on 13 Feb 2026] Title:Backdooring Bias in Large Language Models Authors:Anudeep Das, Prach Chantasantitam, Gurjot Singh, Lipeng He, Mariia Ponomarenko, Florian Kerschbaum View a PDF of the paper titled Backdooring Bias in Large Language Models, by Anudeep Das and 5 other authors View PDF Abstract:Large language models (LLMs) are increasingly deployed in settings where inducing a bias toward a certain topic can have significant consequences, and backdoor attacks can be used to produce such models. Prior work on backdoor attacks has largely focused on a black-box threat model, with an adversary targeting the model builder's LLM. However, in the bias manipulation setting, the model builder themselves could be the adversary, warranting a white-box threat model where the attacker's ability to poison, and manipulate the poisoned data is substantially increased. Furthermore, despite growing research in semantically-triggered backdoors, most studies have limited themselves to syntactically-triggered attacks. Motivated by these limitations, we conduct an analysis consisting of over 1000 evaluations using higher poisoning ratios and greater data augmentation to gain a better understanding of the potential of syntactically- and semantically-triggered backdoor attacks in a white-box setting. In addition, we study whether two representative defense paradigms, model-intrinsic and model-extrinsic backdoor remov...

Related Articles

Llms

Is the Mirage Effect a bug, or is it Geometric Reconstruction in action? A framework for why VLMs perform better "hallucinating" than guessing, and what that may tell us about what's really inside these models

Last week, a team from Stanford and UCSF (Asadi, O'Sullivan, Fei-Fei Li, Euan Ashley et al.) dropped two companion papers. The first, MAR...

Reddit - Artificial Intelligence · 1 min ·
Llms

Paper Finds That Leading AI Chatbots Like ChatGPT and Claude Remain Incredibly Sycophantic, Resulting in Twisted Effects on Users

https://futurism.com/artificial-intelligence/paper-ai-chatbots-chatgpt-claude-sycophantic Your AI chatbot isn’t neutral. Trust its advice...

Reddit - Artificial Intelligence · 1 min ·
Claude Code leak exposes a Tamagotchi-style ‘pet’ and an always-on agent | The Verge
Llms

Claude Code leak exposes a Tamagotchi-style ‘pet’ and an always-on agent | The Verge

Anthropic says “human error” resulted in a leak that exposed Claude Code’s source code. The leaked code, which has since been copied to G...

The Verge - AI · 4 min ·
You can now use ChatGPT with Apple’s CarPlay | The Verge
Llms

You can now use ChatGPT with Apple’s CarPlay | The Verge

ChatGPT is now accessible from your CarPlay dashboard if you have iOS 26.4 or newer and the latest version of the ChatGPT app.

The Verge - AI · 3 min ·
More in Llms: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime