Top AI Startups This Week
The most engaging AI startups content from this week, curated by AI News.
-
1
LLM rankings are not a ladder: experimental results from a transitive benchmark graph [D]
I built a small website called LLM Win: https://llm-win.com. It turns LLM benchmark results into a directed graph: if model A beats model B on benchmark X, add an edge A -> B. Then it search...
Reddit - Machine Learning · 2 days ago -
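The edge rule quoted in this post can be sketched in a few lines of Python. This is only an illustration, not the site's code: the excerpt is cut off before it says what the site searches for, so the breadth-first chain search below is an assumption based on the post title, and the function names, model names, and scores are all hypothetical.

```python
from collections import defaultdict, deque


def build_win_graph(scores):
    """Build a directed 'beats' graph from benchmark results.

    scores maps benchmark -> {model: score}. Assumes higher is better.
    graph[a][b] stores one benchmark on which model a beats model b.
    """
    graph = defaultdict(dict)
    for benchmark, results in scores.items():
        for a, score_a in results.items():
            for b, score_b in results.items():
                if a != b and score_a > score_b:
                    # Rule from the post: A beats B on X, so add edge A -> B.
                    graph[a].setdefault(b, benchmark)
    return graph


def find_win_chain(graph, start, target):
    """Breadth-first search for a shortest chain of 'beats' edges.

    Assumed search step: the excerpt does not say what the site looks for.
    Returns the list of models on the chain, or None if no chain exists.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in graph.get(node, {}):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical scores: each model wins on one benchmark.
scores = {
    "bench_x": {"small": 0.9, "large": 0.8},
    "bench_y": {"large": 0.7, "small": 0.5},
}
graph = build_win_graph(scores)
chain_up = find_win_chain(graph, "small", "large")
chain_down = find_win_chain(graph, "large", "small")
print(chain_up, chain_down)
```

With these made-up scores a chain exists in both directions, which is the kind of cycle that would keep a ranking from being a simple ladder.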
2
[2605.07572] Open-Ended Task Discovery via Bayesian Optimization
Abstract page for arXiv paper 2605.07572: Open-Ended Task Discovery via Bayesian Optimization
arXiv - AI · about 8 hours ago -
3
[2605.07584] Parallel Lifted Planning via Semi-Naive Datalog Evaluation
Abstract page for arXiv paper 2605.07584: Parallel Lifted Planning via Semi-Naive Datalog Evaluation
arXiv - AI · about 8 hours ago -
4
[2605.07186] The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval
Abstract page for arXiv paper 2605.07186: The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval
arXiv - AI · about 8 hours ago -
5
[2605.07699] DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain
Abstract page for arXiv paper 2605.07699: DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain
arXiv - AI · about 8 hours ago -
6
[2605.07751] Vibe coding before the trend
Abstract page for arXiv paper 2605.07751: Vibe coding before the trend
arXiv - AI · about 8 hours ago -
7
[2605.07872] Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models
Abstract page for arXiv paper 2605.07872: Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models
arXiv - AI · about 8 hours ago -
8
[2605.07905] CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers
Abstract page for arXiv paper 2605.07905: CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers
arXiv - AI · about 8 hours ago -
9
[2605.07985] Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
Abstract page for arXiv paper 2605.07985: Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
arXiv - AI · about 8 hours ago -
10
[2605.07986] Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios
Abstract page for arXiv paper 2605.07986: Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios
arXiv - AI · about 8 hours ago -
11
[2510.00436] Automated Evaluation can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization
Abstract page for arXiv paper 2510.00436: Automated Evaluation can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization
arXiv - AI · about 8 hours ago -
12
[2511.15204] Physics-Based Benchmarking Metrics for Multimodal Synthetic Images
Abstract page for arXiv paper 2511.15204: Physics-Based Benchmarking Metrics for Multimodal Synthetic Images
arXiv - AI · about 8 hours ago -
13
[2605.02278] HELIX: Hybrid Encoding with Learnable Identity and Cross-dimensional Synthesis for Time Series Imputation
Abstract page for arXiv paper 2605.02278: HELIX: Hybrid Encoding with Learnable Identity and Cross-dimensional Synthesis for Time Series Imputation
arXiv - AI · 6 days ago -
14
[2605.04785] AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use
Abstract page for arXiv paper 2605.04785: AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use
arXiv - AI · 4 days ago -
15
[2605.01452] Stable Localized Conformal Prediction via Transduction
Abstract page for arXiv paper 2605.01452: Stable Localized Conformal Prediction via Transduction
arXiv - Machine Learning · 6 days ago -
16
[2605.04410] Evaluation Cards for XAI Metrics
Abstract page for arXiv paper 2605.04410: Evaluation Cards for XAI Metrics
arXiv - AI · 4 days ago -
17
[2605.04098] Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology
Abstract page for arXiv paper 2605.04098: Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology
arXiv - AI · 4 days ago -
18
[2605.05214] MedMamba: Recasting Mamba for Medical Time Series Classification
Abstract page for arXiv paper 2605.05214: MedMamba: Recasting Mamba for Medical Time Series Classification
arXiv - AI · 3 days ago -
19
[2605.04505] JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
Abstract page for arXiv paper 2605.04505: JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
arXiv - AI · 4 days ago -
20
[2605.07394] BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning
Abstract page for arXiv paper 2605.07394: BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning
arXiv - AI · about 8 hours ago