AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

LLMs

[R] The Lyra Technique — A framework for interpreting internal cognitive states in LLMs (Zenodo, open access)

We're releasing a paper on a new framework for reading and interpreting the internal cognitive states of large language models: "The Lyra...

Reddit - Machine Learning · 1 min ·
Machine Learning

[P] If you're building AI agents, logs aren't enough. You need evidence.

I have built a programmable governance layer for AI agents and am considering open-sourcing it completely. Looking for feedback. Agent demos...

Reddit - Machine Learning · 1 min ·
AI Safety

[2510.14628] RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

Abstract page for arXiv paper 2510.14628: RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

arXiv - AI · 4 min ·

All Content

AI Agents

[2602.13855] From Fluent to Verifiable: Claim-Level Auditability for Deep Research Agents

The paper discusses the need for claim-level auditability in deep research agents, highlighting the shift from factual errors to weak cla...

arXiv - AI · 3 min ·
LLMs

[2602.13626] Benchmark Leakage Trap: Can We Trust LLM-based Recommendation?

This paper examines benchmark data leakage in LLM-based recommendation systems, revealing how it can distort performance metrics and misl...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.13550] Out-of-Support Generalisation via Weight Space Sequence Modelling

This paper presents a novel approach to out-of-support generalization in machine learning, introducing the WeightCaster framework for imp...

arXiv - Machine Learning · 3 min ·
LLMs

[2602.13524] Singular Vectors of Attention Heads Align with Features

This paper explores the alignment of singular vectors of attention heads with feature representations in language models, providing theor...

arXiv - AI · 3 min ·
LLMs

[2602.13486] Preventing Rank Collapse in Federated Low-Rank Adaptation with Client Heterogeneity

This paper introduces raFLoRA, a method to prevent rank collapse in federated low-rank adaptation (FedLoRA) due to client heterogeneity, ...

arXiv - AI · 3 min ·
AI Startups

[2602.13485] Federated Learning of Nonlinear Temporal Dynamics with Graph Attention-based Cross-Client Interpretability

This paper presents a federated learning framework for understanding nonlinear temporal dynamics across decentralized systems, enhancing ...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.13595] The Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning

This paper explores the limitations of neural scaling laws in AI, revealing a 'quantization trap' where reducing numerical precision can ...

arXiv - AI · 3 min ·
AI Agents

[2602.13587] A First Proof Sprint

This paper presents a multi-agent proof sprint addressing ten research-level problems, utilizing rapid draft generation and adversarial v...

arXiv - AI · 3 min ·
Machine Learning

[2602.13264] Directional Concentration Uncertainty: A representational approach to uncertainty quantification for generative models

The paper introduces Directional Concentration Uncertainty (DCU), a flexible framework for uncertainty quantification in generative model...

arXiv - AI · 4 min ·
LLMs

[2602.13568] Who Do LLMs Trust? Human Experts Matter More Than Other LLMs

This paper explores how large language models (LLMs) prioritize feedback from human experts over other LLMs in decision-making tasks, rev...

arXiv - AI · 3 min ·
LLMs

[2602.13516] SPILLage: Agentic Oversharing on the Web

The paper introduces SPILLage, a framework addressing unintentional oversharing by web agents powered by LLMs, highlighting behavioral ov...

arXiv - AI · 4 min ·
LLMs

[2602.13477] OMNI-LEAK: Orchestrator Multi-Agent Network Induced Data Leakage

The paper 'OMNI-LEAK' explores security vulnerabilities in multi-agent systems, revealing how a coordinated attack can lead to data leaka...

arXiv - AI · 4 min ·
AI Safety

[2602.13372] MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents

The paper introduces MoralityGym, a benchmark for assessing hierarchical moral alignment in AI decision-making, utilizing 98 ethical dile...

arXiv - Machine Learning · 3 min ·
LLMs

[2602.13367] Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts

Nanbeige4.1-3B is a novel small generalist language model that excels in reasoning, alignment, and code generation, demonstrating signifi...

arXiv - AI · 4 min ·
Robotics

[2602.13323] Contrastive explanations of BDI agents

This article discusses the extension of Belief-Desire-Intention (BDI) agents to provide contrastive explanations, enhancing transparency ...

arXiv - AI · 3 min ·
LLMs

[2602.13320] Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol

This article presents a theoretical framework for analyzing error propagation in tool-using LLM agents, proving linear growth of cumulati...

arXiv - AI · 3 min ·
LLMs

[2602.13321] Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction

This study explores automated detection of jailbreak attempts in clinical training large language models (LLMs) using linguistic feature ...

arXiv - Machine Learning · 4 min ·
AI Safety

[2602.13292] Mirror: A Multi-Agent System for AI-Assisted Ethics Review

The paper presents Mirror, a multi-agent system designed to enhance AI-assisted ethics reviews, addressing the limitations of current eth...

arXiv - AI · 4 min ·
Machine Learning

[2602.13283] Accuracy Standards for AI at Work vs. Personal Life: Evidence from an Online Survey

This article examines how individuals prioritize accuracy in AI tools differently in professional versus personal contexts, based on an o...

arXiv - AI · 4 min ·
LLMs

[2602.13280] BEAGLE: Behavior-Enforced Agent for Grounded Learner Emulation

The paper presents BEAGLE, a neuro-symbolic framework that simulates student learning behaviors in open-ended problem-solving environment...

arXiv - AI · 4 min ·
