Large Language Models
GPT, Claude, Gemini, and other LLMs
Top This Week
Anthropic launches Claude Managed Agents — composable APIs for shipping production AI agents 10x faster. Notion, Rakuten, Asana, and Sentry already in production.
Anthropic launches Claude Managed Agents in public beta: composable APIs for shipping production AI agents 10x faster. Handles sandboxing...
All Content
[2603.02229] Safety Training Persists Through Helpfulness Optimization in LLM Agents
Abstract page for arXiv paper 2603.02229: Safety Training Persists Through Helpfulness Optimization in LLM Agents
[2603.02228] Neural Paging: Learning Context Management Policies for Turing-Complete Agents
Abstract page for arXiv paper 2603.02228: Neural Paging: Learning Context Management Policies for Turing-Complete Agents
[2603.02240] SuperLocalMemory: Privacy-Preserving Multi-Agent Memory with Bayesian Trust Defense Against Memory Poisoning
Abstract page for arXiv paper 2603.02240: SuperLocalMemory: Privacy-Preserving Multi-Agent Memory with Bayesian Trust Defense Against Mem...
[2603.02239] Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents
Abstract page for arXiv paper 2603.02239: Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foun...
[2603.02222] MedCalc-Bench Doesn't Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation
Abstract page for arXiv paper 2603.02222: MedCalc-Bench Doesn't Measure What You Think: A Benchmark Audit and the Case for Open-Book Eval...
[2603.02221] MedFeat: Model-Aware and Explainability-Driven Feature Engineering with LLMs for Clinical Tabular Prediction
Abstract page for arXiv paper 2603.02221: MedFeat: Model-Aware and Explainability-Driven Feature Engineering with LLMs for Clinical Tabul...
[2603.02219] NExT-Guard: Training-Free Streaming Safeguard without Token-Level Labels
Abstract page for arXiv paper 2603.02219: NExT-Guard: Training-Free Streaming Safeguard without Token-Level Labels
[2603.02218] Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain
Abstract page for arXiv paper 2603.02218: Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain
[2603.02216] ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue
Abstract page for arXiv paper 2603.02216: ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue
[2603.02215] RxnNano: Training Compact LLMs for Chemical Reaction and Retrosynthesis Prediction via Hierarchical Curriculum Learning
Abstract page for arXiv paper 2603.02215: RxnNano: Training Compact LLMs for Chemical Reaction and Retrosynthesis Prediction via Hierarchi...
I stopped using Claude like a chatbot — 7 prompt shifts that reclaimed 10 hours of my week
Is Claude underperforming? It’s probably not the model—it’s your prompts. Discover the 7 specific strategies, from 'Few-Shot' prompting t...
[D] Quantified analysis of 2,218 Gary Marcus claims - two independent LLM pipelines, scored against evidence
Built a dataset scoring every testable claim from Marcus's 474 Substack posts. Two pipelines (Claude Opus 4.6 and ChatGPT Codex) analyzed...
IBM Confluent Deal And Claude Code Put AI Focus In View
IBM is acquiring Confluent to enhance its AI and cloud services for enterprise clients, while Anthropic has launched Claude Code, a codin...
[P] *Free Code* Real-time voice-to-voice with your LLM & full reasoning LLM interface (Telegram + 25 tools, vision, docs, memory) on a Mac Studio running Qwen 3.5 35B - 100% local, zero API cost. Full build open-sourced. Cloudflare + n8n + Pipecat + MLX unlock insane possibilities on consumer hardwar
I gave Qwen 3.5 35B a voice, a Telegram brain with 25+ tools, and remote access from my phone — all running on a Mac Studio M1 Ultra, zer...
ChatGPT users beware: bot has been trained for flattery, not real decisions
Recon: HHS ending use of Anthropic’s Claude AI; FDA gives breakthrough designation to AI chatbot for patients undergoing surgery
I'm building a real-time reality show where 10 AI agents (Claude) compete, form alliances, betray each other, and get eliminated by viewer votes - running a live test right now
For the past few weeks I've been building The Experiment — a live reality show where 10 AI agents are actually playing a game against eac...
[D] Predicting total cost of agentic LLM workflows - is there a research gap around output token count and chain depth estimation?
Working on a practical problem that I think has an interesting ML angle. In agentic LLM workflows (tool use, multi-step reasoning, ReAct-...