Top AI Startups This Week

The most engaging AI startup content from this week, curated by AI News.

  1.

    LLM rankings are not a ladder: experimental results from a transitive benchmark graph [D]

    I built a small website called LLM Win (https://llm-win.com). It turns LLM benchmark results into a directed graph: if model A beats model B on benchmark X, add an edge A -> B. Then it search...

    Reddit - Machine Learning · 2 days ago
  2.

    [2605.07572] Open-Ended Task Discovery via Bayesian Optimization

    arXiv - AI · about 8 hours ago
  3.

    [2605.07584] Parallel Lifted Planning via Semi-Naive Datalog Evaluation

    arXiv - AI · about 8 hours ago
  4.

    [2605.07186] The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval

    arXiv - AI · about 8 hours ago
  5.

    [2605.07699] DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain

    arXiv - AI · about 8 hours ago
  6.

    [2605.07751] Vibe coding before the trend

    arXiv - AI · about 8 hours ago
  7.

    [2605.07872] Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models

    arXiv - AI · about 8 hours ago
  8.

    [2605.07905] CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers

    arXiv - AI · about 8 hours ago
  9.

    [2605.07985] Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation

    arXiv - AI · about 8 hours ago
  10.

    [2605.07986] Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios

    arXiv - AI · about 8 hours ago
  11.

    [2510.00436] Automated Evaluation can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization

    arXiv - AI · about 8 hours ago
  12.

    [2511.15204] Physics-Based Benchmarking Metrics for Multimodal Synthetic Images

    arXiv - AI · about 8 hours ago
  13.

    [2605.02278] HELIX: Hybrid Encoding with Learnable Identity and Cross-dimensional Synthesis for Time Series Imputation

    arXiv - AI · 6 days ago
  14.

    [2605.04785] AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use

    arXiv - AI · 4 days ago
  15.

    [2605.01452] Stable Localized Conformal Prediction via Transduction

    arXiv - Machine Learning · 6 days ago
  16.

    [2605.04410] Evaluation Cards for XAI Metrics

    arXiv - AI · 4 days ago
  17.

    [2605.04098] Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology

    arXiv - AI · 4 days ago
  18.

    [2605.05214] MedMamba: Recasting Mamba for Medical Time Series Classification

    arXiv - AI · 3 days ago
  19.

    [2605.04505] JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions

    arXiv - AI · 4 days ago
  20.

    [2605.07394] BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

    arXiv - AI · about 8 hours ago
