[2604.01554] EXHIB: A Benchmark for Realistic and Diverse Evaluation of Function Similarity in the Wild


Computer Science > Cryptography and Security
arXiv:2604.01554 (cs) [Submitted on 2 Apr 2026]

Title: EXHIB: A Benchmark for Realistic and Diverse Evaluation of Function Similarity in the Wild
Authors: Yiming Fan, Jun Yeon Won, Ding Zhu, Melih Sirlanci, Mahdi Khalili, Carter Yagemann (The Ohio State University)

Abstract: Binary Function Similarity Detection (BFSD) is a core problem in software security, supporting tasks such as vulnerability analysis, malware classification, and patch provenance. Over the past few decades, numerous models and tools have been developed for this application; however, because the field lacks a comprehensive, universal benchmark, researchers have struggled to compare different models effectively. Existing datasets are limited in scope, often focusing on a narrow set of transformations or types of binaries, and fail to reflect the full diversity of real-world applications. We introduce EXHIB, a benchmark comprising five realistic datasets collected from the wild, each highlighting a distinct aspect of the BFSD problem space. We evaluate 9 representative models spanning multiple BFSD paradigms on EXHIB and observe performance degradations of up to 30% on firmware and semantic datasets compared to standard settings, revealing subst…
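The BFSD task the abstract describes is commonly framed as a retrieval problem: embed each binary function into a vector, then rank candidate functions by similarity to a query. The sketch below illustrates that framing only; it is not code from the paper, and the function names and toy embeddings are assumptions for demonstration.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Standard cosine similarity between two function embeddings.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_candidates(query: np.ndarray, pool: list) -> list:
    # Return indices into `pool`, ordered from most to least similar
    # to the query embedding. Real BFSD models differ mainly in how
    # the embeddings are produced, not in this ranking step.
    scores = [cosine_similarity(query, c) for c in pool]
    return sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)

# Toy example: the second pool entry points nearly the same direction
# as the query, so it should rank first.
query = np.array([1.0, 0.0])
pool = [np.array([0.0, 1.0]), np.array([0.9, 0.1])]
ranking = rank_candidates(query, pool)
```

Benchmarks like EXHIB then score such rankings (e.g., whether the true match of the query function appears at rank 1) across datasets that vary compilers, optimization levels, and target platforms.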

Originally published on April 03, 2026. Curated by AI News.
