[2604.00339] When Career Data Runs Out: Structured Feature Engineering and Signal Limits for Founder Success Prediction
Computer Science > Machine Learning
arXiv:2604.00339 (cs) [Submitted on 1 Apr 2026]

Title: When Career Data Runs Out: Structured Feature Engineering and Signal Limits for Founder Success Prediction
Authors: Yagiz Ihlamur

Abstract: Predicting startup success from founder career data is hard. The signal is weak, the labels are rare (9%), and most founders who succeed look almost identical to those who fail. We engineer 28 structured features directly from raw JSON fields -- jobs, education, exits -- and combine them with a deterministic rule layer and XGBoost boosted stumps. Our model achieves Val F0.5 = 0.3030, Precision = 0.3333, Recall = 0.2222 -- a +17.7pp improvement over the zero-shot LLM baseline. We then run a controlled experiment: extract 9 features from the prose field using Claude Haiku, at 67% and 100% dataset coverage. LLM features capture 26.4% of model importance but add zero CV signal (delta = -0.05pp). The reason is structural: anonymised_prose is generated from the same JSON fields we parse directly -- it is a lossy re-encoding, not a richer source. The ceiling (CV ~= 0.25, Val ~= 0.30) reflects the information content of this dataset, not a modeling limitation. In characterizing where the signal runs out and why, this work functions as a benchmark diagnostic -- one that points...
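As a sanity check on the reported headline metric, the F0.5 score can be recomputed from the stated precision and recall. This is a minimal sketch of the standard F-beta formula, not code from the paper; the function name and rounding are illustrative.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R).

    With beta = 0.5, precision is weighted more heavily than recall,
    which suits a setting where positive labels are rare (9%).
    """
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Reported values from the abstract: Precision = 0.3333, Recall = 0.2222
score = f_beta(0.3333, 0.2222)
print(round(score, 4))  # consistent with the reported Val F0.5 = 0.3030
```

The numbers are internally consistent: plugging the reported precision and recall into the F0.5 formula reproduces the stated 0.3030.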