[2602.23121] Automated Vulnerability Detection in Source Code Using Deep Representation Learning
Summary
This article presents a convolutional neural network model designed to automate the detection of vulnerabilities in C source code, achieving higher recall rates than previous methods.
Why It Matters
Because software vulnerabilities pose significant security risks, this research advances automated detection methods, potentially improving software safety and reducing exploitation risk. Its specialized focus on C code makes it particularly relevant to developers working in systems programming and security.
Key Takeaways
- The model utilizes a convolutional neural network to identify bugs in C code.
- It is trained on two complementary datasets: a machine-labeled corpus from Draper Labs and the human-labeled NIST SATE Juliet suite.
- The approach achieves higher recall rates compared to previous studies.
- The model effectively identifies real vulnerabilities with a low false-positive rate.
- This research contributes to improving automated security measures in software development.
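The recall and false-positive-rate claims in the takeaways are standard confusion-matrix metrics. A minimal sketch of how they are computed, using hypothetical counts (not figures from the paper):

```python
def recall(tp: int, fn: int) -> float:
    # Fraction of real vulnerabilities the model flags (true-positive rate).
    return tp / (tp + fn)

def false_positive_rate(fp: int, tn: int) -> float:
    # Fraction of clean functions the model incorrectly flags.
    return fp / (fp + tn)

# Hypothetical confusion-matrix counts, for illustration only.
tp, fn, fp, tn = 80, 20, 5, 895
print(f"recall = {recall(tp, fn):.2f}")                          # prints 0.80
print(f"false-positive rate = {false_positive_rate(fp, tn):.4f}")  # prints 0.0056
```

High recall with a low false-positive rate is the combination the paper targets: missing few real vulnerabilities while rarely flagging clean code.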
Computer Science > Cryptography and Security
arXiv:2602.23121 (cs) [Submitted on 26 Feb 2026]
Title: Automated Vulnerability Detection in Source Code Using Deep Representation Learning
Authors: C. Seas, G. Fitzpatrick, J. A. Hamilton, M. C. Carlisle
Abstract: Each year, software vulnerabilities are discovered, which pose significant risks of exploitation and system compromise. We present a convolutional neural network model that can successfully identify bugs in C code. We trained our model using two complementary datasets: a machine-labeled dataset created by Draper Labs using three static analyzers and the NIST SATE Juliet human-labeled dataset designed for testing static analyzers. In contrast with the work of Russell et al. on these datasets, we focus on C programs, enabling us to specialize and optimize our detection techniques for this language. After removing duplicates from the dataset, we tokenize the input into 91 token categories. The category values are converted to a binary vector to save memory. Our first convolution layer is chosen so that the entire encoding of the token is presented to the filter. We use two convolution and pooling layers followed by two fully connected layers to classify programs into either a common weakness enumeration category or as "clean." We obtain higher recall...
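The abstract's memory-saving encoding can be illustrated concretely: 91 token categories fit in ceil(log2(91)) = 7 bits, so each token becomes a 7-element binary vector instead of a 91-dimensional one-hot vector, and the first convolution filter spans the full 7-bit width so each filter sees whole tokens. A minimal sketch, with hypothetical category ids (the paper's actual token-to-category mapping is not given here):

```python
import math

NUM_CATEGORIES = 91                           # token categories reported in the abstract
BITS = math.ceil(math.log2(NUM_CATEGORIES))   # 7 bits suffice for 91 values

def encode_category(cat_id: int) -> list[int]:
    """Pack a token-category id into a fixed-width binary vector (MSB first)."""
    if not 0 <= cat_id < NUM_CATEGORIES:
        raise ValueError(f"category id out of range: {cat_id}")
    return [(cat_id >> b) & 1 for b in reversed(range(BITS))]

def decode_category(bits: list[int]) -> int:
    """Recover the category id from its binary vector."""
    out = 0
    for b in bits:
        out = (out << 1) | b
    return out

# A tokenized C snippet becomes a (sequence_length x BITS) binary matrix.
# The paper's first convolution filter covers the full BITS width, so each
# filter position sees one complete token encoding, never a partial one.
sequence = [3, 17, 90]                        # hypothetical category ids
matrix = [encode_category(c) for c in sequence]
```

The binary packing trades the sparsity of one-hot vectors for roughly a 13x reduction in memory per token, which matters when training on large function-level corpora.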