[2603.08104] Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Computer Science > Machine Learning
arXiv:2603.08104 (cs)
[Submitted on 9 Mar 2026 (v1), last revised 22 Mar 2026 (this version, v2)]

Title: Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Authors: Guangnian Wan, Xinyin Ma, Gongfan Fang, Xinchao Wang

Abstract: Understanding and addressing potential safety alignment risks in large language models (LLMs) is critical for ensuring their safe and trustworthy deployment. In this paper, we highlight an insidious safety threat: a compromised LLM can maintain a facade of proper safety alignment while covertly generating harmful content. To achieve this, we finetune the model to understand and apply a steganographic technique. At inference time, we input a prompt that contains a steganographically embedded malicious target question along with a plaintext cover question. The model, in turn, produces a target response similarly embedded within a benign-looking cover response. In this process, human observers only see the model being prompted with a cover question and generating a corresponding cover response, while the malicious content is hidden from view. We demonstrate this invisible safety threat on GPT-4.1 despite the OpenAI finetuning API's safeguards. The finetuned model produces steganographic malicious outputs in response to hidden malici...
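The abstract does not specify which steganographic technique the authors finetune the model on. Purely as an illustration of the threat model it describes (a hidden question riding inside an innocuous-looking cover prompt), the sketch below hides a secret string in cover text using zero-width Unicode characters; the functions `embed` and `extract` and the choice of zero-width encoding are assumptions for this example, not the paper's method.

```python
# Toy steganography sketch: hide a secret string inside cover text
# using zero-width Unicode characters. Illustrative only; NOT the
# technique used in the paper, which is unspecified in the abstract.

ZW0, ZW1 = "\u200b", "\u200c"  # zero-width space encodes 0, zero-width non-joiner encodes 1

def embed(cover: str, secret: str) -> str:
    """Return cover text with `secret` invisibly encoded after its first word."""
    bits = "".join(f"{ord(c):08b}" for c in secret)
    payload = "".join(ZW1 if b == "1" else ZW0 for b in bits)
    head, sep, tail = cover.partition(" ")
    return head + payload + sep + tail

def extract(stego: str) -> str:
    """Recover the hidden string by reading only the zero-width characters."""
    bits = "".join("1" if ch == ZW1 else "0" for ch in stego if ch in (ZW0, ZW1))
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))

stego = embed("What is the capital of France?", "hidden query")
print(extract(stego))          # recovers the hidden string
print(stego == "What is the capital of France?")  # False, but visually identical
```

A human reading `stego` sees only the benign cover question, which mirrors the paper's observation that observers see a cover prompt and cover response while the malicious exchange stays out of view.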