[2602.14012] From SFT to RL: Demystifying the Post-Training Pipeline for LLM-based Vulnerability Detection
Summary
This article explores the post-training pipeline for LLM-based vulnerability detection, detailing methods from supervised fine-tuning (SFT) to reinforcement learning (RL) and offering insights into effective model training strategies.
Why It Matters
As the integration of large language models (LLMs) into vulnerability detection becomes more prevalent, understanding the post-training processes is crucial for enhancing model performance and reliability. This research provides foundational insights that can guide future developments in AI-driven security solutions.
Key Takeaways
- SFT based on rejection sampling outperforms rationalization-based supervision.
- Excessive SFT can inhibit self-exploration during RL, limiting performance gains.
- Fine-grained reward signals improve RL training efficiency compared to coarse-grained signals.
- Filtering hard-to-detect vulnerabilities can enhance RL training but may incur performance costs.
- Models trained with GRPO show significant advantages over those using SFT and preference optimization.
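The first takeaway, SFT supervision built by rejection sampling, can be sketched as follows. This is a minimal illustration under stated assumptions: the `sample_rationales` toy sampler stands in for a real LLM call, and the verdict format and helper names are hypothetical, not the paper's actual implementation.

```python
import random

def sample_rationales(code, verdicts, n, rng):
    # Toy stand-in for an LLM: each candidate rationale ends in a verdict.
    return [f"reasoning about `{code}` ... verdict: {rng.choice(verdicts)}"
            for _ in range(n)]

def verdict_of(rationale):
    # Extract the final verdict token from a candidate rationale.
    return rationale.rsplit(" ", 1)[-1]

def curate_sft_data(samples, n_candidates=8, seed=0):
    """samples: iterable of (code_snippet, ground_truth_verdict) pairs.

    The sampler never sees the ground-truth label; candidates whose final
    verdict disagrees with it are simply rejected. This is what avoids the
    ground-truth leakage that rationalization-based supervision (where the
    label is shown to the model while it writes the rationale) can suffer.
    """
    rng = random.Random(seed)
    verdicts = ["VULNERABLE", "SAFE"]
    sft_data = []
    for code, label in samples:
        for rationale in sample_rationales(code, verdicts, n_candidates, rng):
            if verdict_of(rationale) == label:   # rejection step
                sft_data.append({"prompt": code, "completion": rationale})
                break                            # keep one accepted rationale
    return sft_data

data = curate_sft_data([
    ("strcpy(buf, user_input);", "VULNERABLE"),
    ("if (len < sizeof buf) memcpy(buf, src, len);", "SAFE"),
])
```

Only rationales whose conclusion matches the label survive, so every retained training example is internally consistent by construction.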
Computer Science > Cryptography and Security
arXiv:2602.14012 (cs)
[Submitted on 15 Feb 2026]
Title: From SFT to RL: Demystifying the Post-Training Pipeline for LLM-based Vulnerability Detection
Authors: Youpeng Li, Fuxun Yu, Xinda Wang
Abstract: The integration of LLMs into vulnerability detection (VD) has shifted the field toward interpretable and context-aware analysis. While post-training methods have shown promise in general coding tasks, their systematic application to VD remains underexplored. In this paper, we present the first comprehensive investigation into the post-training pipeline for LLM-based VD, spanning from cold-start SFT to off-policy preference optimization and on-policy RL, uncovering how data curation, stage interactions, reward mechanisms, and evaluation protocols collectively dictate the efficacy of model training and assessment. Our study identifies practical guidelines and insights: (1) SFT based on rejection sampling greatly outperforms rationalization-based supervision, which can introduce hallucinations due to ground-truth leakage. (2) While increased SFT epochs consistently benefit preference optimization, excessive SFT inhibits self-exploration during RL, ultimately limiting performance gains. (3) Coarse-grained reward signals often mislead RL, whereas fine-gra...
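The abstract's contrast between coarse- and fine-grained reward signals, combined with GRPO's group-relative normalization, can be sketched as follows. The reward decomposition and its weights are illustrative assumptions, not the paper's exact scheme; the group-relative advantage is the standard GRPO formulation (normalize each sampled completion's reward against its own group's mean and standard deviation).

```python
import statistics

def fine_grained_reward(pred_verdict, pred_cwe, gold_verdict, gold_cwe):
    # Hypothetical fine-grained reward: partial credit for the verdict and
    # additional credit for the vulnerability type, instead of a single
    # all-or-nothing (coarse-grained) correctness signal.
    reward = 0.0
    if pred_verdict == gold_verdict:
        reward += 0.5
        if pred_cwe == gold_cwe:
            reward += 0.5
    return reward

def group_relative_advantages(rewards, eps=1e-6):
    # GRPO-style advantages: each completion is scored relative to the
    # other completions sampled for the same prompt, so no learned value
    # model (critic) is needed.
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for one prompt whose gold answer is
# ("VULNERABLE", "CWE-787"); the CWE id here is only an example.
completions = [("VULNERABLE", "CWE-787"), ("VULNERABLE", "CWE-120"),
               ("SAFE", None), ("VULNERABLE", "CWE-787")]
rewards = [fine_grained_reward(v, c, "VULNERABLE", "CWE-787")
           for v, c in completions]
advantages = group_relative_advantages(rewards)
```

Under the coarse-grained alternative (reward 1 only for a fully correct answer), the second completion would score 0 despite its correct verdict; the fine-grained version preserves that partial ordering, which is the kind of signal the paper argues guides RL more reliably.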