[2603.03321] DIALEVAL: Automated Type-Theoretic Evaluation of LLM Instruction Following
Computer Science > Computation and Language
arXiv:2603.03321 (cs) [Submitted on 10 Feb 2026]
Title: DIALEVAL: Automated Type-Theoretic Evaluation of LLM Instruction Following
Authors: Nardine Basta, Dali Kaafar
Abstract: Evaluating instruction following in Large Language Models requires decomposing instructions into verifiable requirements and assessing their satisfaction, tasks that currently depend on manual annotation and on uniform criteria that do not align with human judgment patterns. We present DIALEVAL, a type-theoretic framework that uses dual LLM agents to automate instruction decomposition into typed predicates and to implement type-specific satisfaction semantics. The framework enforces formal atomicity and independence constraints during automated extraction, then applies differentiated evaluation criteria (semantic equivalence for content predicates, exact precision for numerical predicates), mirroring empirically observed human assessment patterns. Extended to multi-turn dialogues through history-aware satisfaction functions, DIALEVAL enables evaluation in conversational contexts where single-turn methods fail. Validation demonstrates 90.38% accuracy (a 26.45% error reduction over baselines) and substantially stronger correlation with human judgment for complex instructions.
Subjects: Computation and Language (cs.CL)
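The core idea of the abstract, decomposing an instruction into typed predicates and applying type-specific satisfaction criteria, can be illustrated with a minimal sketch. This is a hypothetical illustration, not the authors' implementation: the `TypedPredicate` class, the `"content"`/`"numerical"` type names, and the keyword check standing in for LLM-judged semantic equivalence are all assumptions made here for clarity.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TypedPredicate:
    """One atomic, independent requirement extracted from an instruction."""
    kind: str                      # illustrative types: "content" or "numerical"
    description: str
    check: Callable[[str], bool]   # type-specific satisfaction function

def evaluate(predicates: list[TypedPredicate], response: str) -> float:
    """Fraction of predicates satisfied under their type-specific semantics."""
    return sum(p.check(response) for p in predicates) / len(predicates)

# Hypothetical instruction: "List exactly three animals, and mention a cat."
preds = [
    # Numerical predicate: exact precision, no tolerance for near-misses.
    TypedPredicate("numerical", "exactly three list items",
                   lambda r: sum(1 for ln in r.splitlines()
                                 if ln.startswith("- ")) == 3),
    # Content predicate: in the paper's framework this would be semantic
    # equivalence judged by an LLM agent; a keyword check stands in here.
    TypedPredicate("content", "mentions a cat",
                   lambda r: "cat" in r.lower()),
]

print(evaluate(preds, "- cat\n- dog\n- bird"))  # → 1.0
print(evaluate(preds, "- cat"))                 # → 0.5
```

The differentiated criteria show up in the two lambdas: the numerical check is exact, while the content check would be replaced by a semantic judgment in a real evaluator.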