AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

LLMs

[R] The Lyra Technique — A framework for interpreting internal cognitive states in LLMs (Zenodo, open access)

We're releasing a paper on a new framework for reading and interpreting the internal cognitive states of large language models: "The Lyra...

Reddit - Machine Learning · 1 min ·
Machine Learning

[P] If you're building AI agents, logs aren't enough. You need evidence.

I have built a programmable governance layer for AI agents and am considering open-sourcing it completely. Looking for feedback. Agent demos...

Reddit - Machine Learning · 1 min ·
AI Safety

[2510.14628] RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

Abstract page for arXiv paper 2510.14628: RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

arXiv - AI · 4 min ·

All Content

LLMs

[2602.12089] Choose Your Agent: Tradeoffs in Adopting AI Advisors, Coaches, and Delegates in Multi-Party Negotiation

This study explores the tradeoffs in using AI agents in multi-party negotiations, revealing a preference-performance misalignment among u...

arXiv - AI · 4 min ·
Machine Learning

[2602.10947] Computational Phenomenology of Temporal Experience in Autism: Quantifying the Emotional and Narrative Characteristics of Lived Unpredictability

This article explores the emotional and narrative characteristics of temporal experience in autistic individuals, highlighting the unpred...

arXiv - AI · 4 min ·
LLMs

[2602.10915] Blind Gods and Broken Screens: Architecting a Secure, Intent-Centric Mobile Agent Operating System

The paper presents Aura, a secure mobile agent operating system designed to address vulnerabilities in current app-centric models by impl...

arXiv - AI · 4 min ·
Machine Learning

[2602.09394] The Critical Horizon: Inspection Design Principles for Multi-Stage Operations and Deep Reasoning

This article presents an information-theoretic analysis of credit assignment in multi-stage operations, highlighting the challenges of at...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.07954] Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation

The Bielik Guard presents efficient Polish language classifiers for moderating content in large language models, achieving high precision...

arXiv - AI · 4 min ·
Machine Learning

[2602.07738] Learnable Chernoff Baselines for Inference-Time Alignment

The paper introduces Learnable Chernoff Baselines (LCBs) for efficient inference-time reward-guided alignment in generative models, impro...

arXiv - Machine Learning · 3 min ·
LLMs

[2602.07298] Principled Synthetic Data Enables the First Scaling Laws for LLMs in Recommendation

This paper presents a novel framework for generating high-quality synthetic data to establish scaling laws for large language models (LLM...

arXiv - AI · 4 min ·
Machine Learning

[2602.06771] AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models

The paper presents AEGIS, a novel framework for robust concept erasure in diffusion models, addressing the trade-off between robustness a...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.01308] Dispelling the Curse of Singularities in Neural Network Optimizations

This article explores the optimization instability in deep neural networks caused by singularities in the parametric space, proposing a m...

arXiv - Machine Learning · 4 min ·
LLMs

[2601.16824] Privacy in Human-AI Romantic Relationships: Concerns, Boundaries, and Agency

This article explores privacy concerns in human-AI romantic relationships, analyzing user experiences and perceptions across different re...

arXiv - AI · 4 min ·
Machine Learning

[2601.07969] Tuberculosis Screening from Cough Audio: Baseline Models, Clinical Variables, and Uncertainty Quantification

This paper presents a standardized framework for tuberculosis detection from cough audio, addressing inconsistencies in previous studies ...

arXiv - Machine Learning · 4 min ·
NLP

[2512.15891] Dynamical Mechanisms for Coordinating Long-term Working Memory Based on the Precision of Spike-timing in Cortical Neurons

The article explores the mechanisms of long-term working memory in cortical neurons, emphasizing the role of spike-timing precision in co...

arXiv - AI · 4 min ·
LLMs

[2512.15052] SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification

The paper presents SGM, a novel approach for detoxifying multimodal large language models (MLLMs) by recalibrating toxic neurons, signifi...

arXiv - AI · 4 min ·
LLMs

[2511.02083] Watermarking Discrete Diffusion Language Models

This article presents a novel watermarking technique for discrete diffusion language models (DDLMs), addressing the need for reliable det...

arXiv - AI · 3 min ·
Machine Learning

[2510.24803] MASPRM: Multi-Agent System Process Reward Model

The MASPRM paper introduces a novel Multi-Agent System Process Reward Model that enhances performance during inference by guiding search ...

arXiv - AI · 3 min ·
Machine Learning

[2510.26722] Non-Convex Over-the-Air Heterogeneous Federated Learning: A Bias-Variance Trade-off

This paper explores the challenges of heterogeneous federated learning in wireless networks, focusing on the bias-variance trade-off in n...

arXiv - Machine Learning · 4 min ·
LLMs

[2510.09717] Provable Training Data Identification for Large Language Models

This paper presents a novel approach for identifying training data in large language models, addressing issues of copyright and privacy t...

arXiv - Machine Learning · 4 min ·
LLMs

[2509.19852] Eliminating Stability Hallucinations in LLM-based TTS Models via Attention Guidance

This paper addresses stability hallucinations in LLM-based TTS models by enhancing attention mechanisms, proposing a new alignment metric...

arXiv - AI · 3 min ·
Machine Learning

[2509.10766] MetaSeal: Defending Against Image Attribution Forgery Through Content-Dependent Cryptographic Watermarks

The paper presents MetaSeal, a novel framework for content-dependent cryptographic watermarks designed to combat image attribution forger...

arXiv - AI · 4 min ·
Robotics

[2507.12108] Multimodal Coordinated Online Behavior: Trade-offs and Strategies

This paper explores multimodal coordinated online behavior, analyzing trade-offs between different integration strategies and their effec...

arXiv - Machine Learning · 4 min ·