[2604.00938] WARP: Guaranteed Inner-Layer Repair of NLP Transformers
Computer Science > Machine Learning

arXiv:2604.00938 (cs) [Submitted on 1 Apr 2026]

Title: WARP: Guaranteed Inner-Layer Repair of NLP Transformers

Authors: Hsin-Ling Hsu, Min-Yu Chen, Nai-Chia Chen, Yan-Ru Chen, Yi-Ling Chang, Fang Yu

Abstract: Transformer-based NLP models remain vulnerable to adversarial perturbations, yet existing repair methods face a fundamental trade-off: gradient-based approaches offer flexibility but lack verifiability and often overfit; methods that do provide repair guarantees are restricted to the final layer or small networks, significantly limiting the parameter search space available for repair. We present WARP (Weight-Adjusted Repair with Provability), a constraint-based repair framework that extends repair beyond the last layer of Transformer models. WARP formulates repair as a convex quadratic program derived from a first-order linearization of the logit gap, enabling tractable optimization over a high-dimensional parameter space. Under the condition that the first-order approximation holds, this formulation induces three per-sample guarantees: (i) a positive margin constraint ensuring correct classification on repaired inputs, (ii) preservation constraints over a designated remain set, and (iii) a certified robustness radius derived from Lipschitz continuity. To ensure feasibility across var...
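The abstract alone does not give the paper's exact formulation, but a minimal sketch of the kind of quadratic program it describes, written with cvxpy, might look as follows. All names here (J_rep, g_rep, J_rem, margin, eps, L) are illustrative assumptions, not the paper's notation: the logit gap g of each sample is linearized around the current weights as g(w + delta) ≈ g(w) + J @ delta, and we solve for a small weight perturbation delta satisfying the margin and preservation constraints.

```python
# Hypothetical sketch of a margin-repair QP under a first-order
# linearization of the logit gap; symbols and sizes are toy assumptions.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
d = 50               # number of repairable parameters (toy size)
n_rep, n_rem = 5, 8  # repair-set and remain-set sample counts

J_rep = rng.normal(size=(n_rep, d))   # Jacobians of logit gaps, repair set
g_rep = rng.normal(size=n_rep) - 1.0  # current (negative) logit gaps
J_rem = rng.normal(size=(n_rem, d))   # Jacobians of logit gaps, remain set

margin = 0.1  # (i) required positive margin on repaired samples
eps = 0.5     # (ii) allowed drift of remain-set logit gaps

delta = cp.Variable(d)
prob = cp.Problem(
    cp.Minimize(cp.sum_squares(delta)),        # prefer the smallest repair
    [
        g_rep + J_rep @ delta >= margin,       # (i) positive margin constraint
        cp.abs(J_rem @ delta) <= eps,          # (ii) preservation constraints
    ],
)
prob.solve()
print("status:", prob.status, "||delta||:", np.linalg.norm(delta.value))

# (iii) With an assumed Lipschitz constant L of the logit gap w.r.t. the
# input, a repaired gap of at least `margin` would certify robustness
# within radius margin / L around each repaired sample.
L = 2.0
print("certified radius (illustrative):", margin / L)
```

Minimizing the squared norm of delta keeps the repair small, which is what makes the preservation constraints on the remain set plausible to satisfy alongside the margin constraints; this objective choice is our assumption, not confirmed by the abstract.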