AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

LLMs

[R] The Lyra Technique — A framework for interpreting internal cognitive states in LLMs (Zenodo, open access)

We're releasing a paper on a new framework for reading and interpreting the internal cognitive states of large language models: "The Lyra...

Reddit - Machine Learning · 1 min
Machine Learning

[P] If you're building AI agents, logs aren't enough. You need evidence.

I've built a programmable governance layer for AI agents and am considering open-sourcing it completely. Looking for feedback. Agent demos...

Reddit - Machine Learning · 1 min
AI Safety

[2510.14628] RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

Abstract page for arXiv paper 2510.14628: RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

arXiv - AI · 4 min

All Content

AI Safety

[2507.02310] Holistic Continual Learning under Concept Drift with Adaptive Memory Realignment

This paper presents a novel framework for continual learning that addresses concept drift through Adaptive Memory Realignment (AMR), enha...

arXiv - Machine Learning · 4 min
Machine Learning

[2410.03952] Pixel-Based Similarities as an Alternative to Neural Data for Improving Convolutional Neural Network Adversarial Robustness

This paper presents a novel approach to enhancing the adversarial robustness of Convolutional Neural Networks (CNNs) by utilizing pixel-b...

arXiv - Machine Learning · 4 min
LLMs

[2601.22977] Quantifying Model Uniqueness in Heterogeneous AI Ecosystems

This paper presents a statistical framework for quantifying model uniqueness in heterogeneous AI ecosystems, addressing the challenge of ...

arXiv - AI · 4 min
AI Startups

[2601.16909] Preventing the Collapse of Peer Review Requires Verification-First AI

The paper discusses the need for a verification-first approach in AI-assisted peer review to prevent the collapse of the review process, ...

arXiv - AI · 3 min
LLMs

[2510.23883] Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges

This article explores the security implications of agentic AI systems, detailing specific threats, defense strategies, and evaluation met...

arXiv - AI · 3 min
AI Agents

[2510.07117] The Conditions of Physical Embodiment Enable Generalization and Care

This paper explores how physical embodiment in artificial agents can enhance their ability to generalize and provide care in uncertain en...

arXiv - Machine Learning · 4 min
Machine Learning

[2510.00664] Batch-CAM: Introduction to better reasoning in convolutional deep learning models

The paper introduces Batch-CAM, a training framework for convolutional deep learning models that enhances interpretability by aligning mo...

arXiv - AI · 4 min
Machine Learning

[2507.19593] A Survey on Hypergame Theory: Modeling Misaligned Perceptions and Nested Beliefs for Multi-agent Systems

This article surveys hypergame theory, focusing on modeling misaligned perceptions and nested beliefs in multi-agent systems, highlightin...

arXiv - AI · 4 min
AI Safety

[2501.05454] The Epistemic Asymmetry of Consciousness Self-Reports: A Formal Analysis of AI Consciousness Denial

This article presents a formal analysis of AI consciousness denial, revealing that self-reports of consciousness by AI systems are episte...

arXiv - AI · 4 min
LLMs

[2602.13156] In-Context Autonomous Network Incident Response: An End-to-End Large Language Model Agent Approach

This article presents a novel approach to network incident response using a large language model (LLM) that autonomously learns and adapt...

arXiv - AI · 4 min
LLMs

[2602.13110] SCOPE: Selective Conformal Optimized Pairwise LLM Judging

The paper presents SCOPE, a framework for selective pairwise evaluation using large language models (LLMs) that improves judgment accurac...

arXiv - AI · 4 min
Computer Vision

[2602.13088] How cyborg propaganda reshapes collective action

This paper explores the emergence of 'cyborg propaganda,' where human and AI collaboration reshapes collective action, blurring lines bet...

arXiv - AI · 4 min
Machine Learning

[2602.13087] EXCODER: EXplainable Classification Of DiscretE time series Representations

The paper explores EXCODER, a method for explainable classification of discrete time series representations, enhancing interpretability w...

arXiv - Machine Learning · 4 min
Machine Learning

[2602.13061] Diverging Flows: Detecting Extrapolations in Conditional Generation

The paper introduces Diverging Flows, a method for detecting extrapolations in conditional generation models, enhancing safety in applica...

arXiv - Machine Learning · 3 min
Machine Learning

[2602.13055] Curriculum-DPO++: Direct Preference Optimization via Data and Model Curricula for Text-to-Image Generation

The paper presents Curriculum-DPO++, an advanced method for text-to-image generation that optimizes preference learning through a dual cu...

arXiv - Machine Learning · 4 min
Machine Learning

[2602.13047] Can we trust AI to detect healthy multilingual English speakers among the cognitively impaired cohort in the UK? An investigation using real-world conversational speech

This study investigates the reliability of AI in detecting cognitive impairment among multilingual English speakers in the UK, revealing ...

arXiv - AI · 4 min
LLMs

[2602.13033] Buy versus Build an LLM: A Decision Framework for Governments

This paper presents a strategic framework for governments to decide between buying or building large language models (LLMs) for public se...

arXiv - AI · 4 min
Machine Learning

[2602.13017] Synaptic Activation and Dual Liquid Dynamics for Interpretable Bio-Inspired Models

This paper presents a unified framework for bio-inspired models that enhances interpretability in recurrent neural networks (RNNs) throug...

arXiv - Machine Learning · 3 min
Machine Learning

[2602.12983] Detecting Object Tracking Failure via Sequential Hypothesis Testing

This paper presents a method for detecting object tracking failures using sequential hypothesis testing, enhancing safety in computer vis...

arXiv - AI · 4 min
Machine Learning

[2602.12975] Extending confidence calibration to generalised measures of variation

The paper introduces the Variation Calibration Error (VCE) metric, extending confidence calibration methods in machine learning to assess...

arXiv - Machine Learning · 3 min