AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

LLMs

[R] The Lyra Technique — A framework for interpreting internal cognitive states in LLMs (Zenodo, open access)

We're releasing a paper on a new framework for reading and interpreting the internal cognitive states of large language models: "The Lyra...

Reddit - Machine Learning · 1 min ·
Machine Learning

[P] If you're building AI agents, logs aren't enough. You need evidence.

I have built a programmable governance layer for AI agents and am considering open-sourcing it completely. Looking for feedback. Agent demos...

Reddit - Machine Learning · 1 min ·
AI Safety

[2510.14628] RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

Abstract page for arXiv paper 2510.14628: RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

arXiv - AI · 4 min ·

All Content

AI Agents

[2602.13855] From Fluent to Verifiable: Claim-Level Auditability for Deep Research Agents

The paper discusses the need for claim-level auditability in deep research agents, highlighting the shift from factual errors to weak cla...

arXiv - AI · 3 min ·
LLMs

[2602.13626] Benchmark Leakage Trap: Can We Trust LLM-based Recommendation?

This paper examines benchmark data leakage in LLM-based recommendation systems, revealing how it can distort performance metrics and misl...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.13550] Out-of-Support Generalisation via Weight Space Sequence Modelling

This paper presents a novel approach to out-of-support generalization in machine learning, introducing the WeightCaster framework for imp...

arXiv - Machine Learning · 3 min ·
LLMs

[2602.13524] Singular Vectors of Attention Heads Align with Features

This paper explores the alignment of singular vectors of attention heads with feature representations in language models, providing theor...

arXiv - AI · 3 min ·
LLMs

[2602.13486] Preventing Rank Collapse in Federated Low-Rank Adaptation with Client Heterogeneity

This paper introduces raFLoRA, a method to prevent rank collapse in federated low-rank adaptation (FedLoRA) due to client heterogeneity, ...

arXiv - AI · 3 min ·
AI Startups

[2602.13485] Federated Learning of Nonlinear Temporal Dynamics with Graph Attention-based Cross-Client Interpretability

This paper presents a federated learning framework for understanding nonlinear temporal dynamics across decentralized systems, enhancing ...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.13595] The Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning

This paper explores the limitations of neural scaling laws in AI, revealing a 'quantization trap' where reducing numerical precision can ...

arXiv - AI · 3 min ·
AI Agents

[2602.13587] A First Proof Sprint

This paper presents a multi-agent proof sprint addressing ten research-level problems, utilizing rapid draft generation and adversarial v...

arXiv - AI · 3 min ·
Machine Learning

[2602.13264] Directional Concentration Uncertainty: A representational approach to uncertainty quantification for generative models

The paper introduces Directional Concentration Uncertainty (DCU), a flexible framework for uncertainty quantification in generative model...

arXiv - AI · 4 min ·
LLMs

[2602.13568] Who Do LLMs Trust? Human Experts Matter More Than Other LLMs

This paper explores how large language models (LLMs) prioritize feedback from human experts over other LLMs in decision-making tasks, rev...

arXiv - AI · 3 min ·
LLMs

[2602.13516] SPILLage: Agentic Oversharing on the Web

The paper introduces SPILLage, a framework addressing unintentional oversharing by web agents powered by LLMs, highlighting behavioral ov...

arXiv - AI · 4 min ·
LLMs

[2602.13477] OMNI-LEAK: Orchestrator Multi-Agent Network Induced Data Leakage

The paper 'OMNI-LEAK' explores security vulnerabilities in multi-agent systems, revealing how a coordinated attack can lead to data leaka...

arXiv - AI · 4 min ·
AI Safety

[2602.13372] MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents

The paper introduces MoralityGym, a benchmark for assessing hierarchical moral alignment in AI decision-making, utilizing 98 ethical dile...

arXiv - Machine Learning · 3 min ·
LLMs

[2602.13367] Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts

Nanbeige4.1-3B is a novel small generalist language model that excels in reasoning, alignment, and code generation, demonstrating signifi...

arXiv - AI · 4 min ·
Robotics

[2602.13323] Contrastive explanations of BDI agents

This article discusses the extension of Belief-Desire-Intention (BDI) agents to provide contrastive explanations, enhancing transparency ...

arXiv - AI · 3 min ·
LLMs

[2602.13320] Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol

This article presents a theoretical framework for analyzing error propagation in tool-using LLM agents, proving linear growth of cumulati...

arXiv - AI · 3 min ·
LLMs

[2602.13321] Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction

This study explores automated detection of jailbreak attempts in clinical training large language models (LLMs) using linguistic feature ...

arXiv - Machine Learning · 4 min ·
AI Safety

[2602.13292] Mirror: A Multi-Agent System for AI-Assisted Ethics Review

The paper presents Mirror, a multi-agent system designed to enhance AI-assisted ethics reviews, addressing the limitations of current eth...

arXiv - AI · 4 min ·
Machine Learning

[2602.13283] Accuracy Standards for AI at Work vs. Personal Life: Evidence from an Online Survey

This article examines how individuals prioritize accuracy in AI tools differently in professional versus personal contexts, based on an o...

arXiv - AI · 4 min ·
LLMs

[2602.13280] BEAGLE: Behavior-Enforced Agent for Grounded Learner Emulation

The paper presents BEAGLE, a neuro-symbolic framework that simulates student learning behaviors in open-ended problem-solving environment...

arXiv - AI · 4 min ·
