AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

LLMs

[R] The Lyra Technique — A framework for interpreting internal cognitive states in LLMs (Zenodo, open access)

We're releasing a paper on a new framework for reading and interpreting the internal cognitive states of large language models: "The Lyra...

Reddit - Machine Learning · 1 min
Machine Learning

[P] If you're building AI agents, logs aren't enough. You need evidence.

I've built a programmable governance layer for AI agents and am considering open-sourcing it completely. Looking for feedback. Agent demos...

Reddit - Machine Learning · 1 min
AI Safety

[2510.14628] RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

Abstract page for arXiv paper 2510.14628: RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

arXiv - AI · 4 min

All Content

AI Safety

[2507.02310] Holistic Continual Learning under Concept Drift with Adaptive Memory Realignment

This paper presents a novel framework for continual learning that addresses concept drift through Adaptive Memory Realignment (AMR), enha...

arXiv - Machine Learning · 4 min
Machine Learning

[2410.03952] Pixel-Based Similarities as an Alternative to Neural Data for Improving Convolutional Neural Network Adversarial Robustness

This paper presents a novel approach to enhancing the adversarial robustness of Convolutional Neural Networks (CNNs) by utilizing pixel-b...

arXiv - Machine Learning · 4 min
LLMs

[2601.22977] Quantifying Model Uniqueness in Heterogeneous AI Ecosystems

This paper presents a statistical framework for quantifying model uniqueness in heterogeneous AI ecosystems, addressing the challenge of ...

arXiv - AI · 4 min
AI Startups

[2601.16909] Preventing the Collapse of Peer Review Requires Verification-First AI

The paper discusses the need for a verification-first approach in AI-assisted peer review to prevent the collapse of the review process, ...

arXiv - AI · 3 min
LLMs

[2510.23883] Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges

This article explores the security implications of agentic AI systems, detailing specific threats, defense strategies, and evaluation met...

arXiv - AI · 3 min
AI Agents

[2510.07117] The Conditions of Physical Embodiment Enable Generalization and Care

This paper explores how physical embodiment in artificial agents can enhance their ability to generalize and provide care in uncertain en...

arXiv - Machine Learning · 4 min
Machine Learning

[2510.00664] Batch-CAM: Introduction to better reasoning in convolutional deep learning models

The paper introduces Batch-CAM, a training framework for convolutional deep learning models that enhances interpretability by aligning mo...

arXiv - AI · 4 min
Machine Learning

[2507.19593] A Survey on Hypergame Theory: Modeling Misaligned Perceptions and Nested Beliefs for Multi-agent Systems

This article surveys hypergame theory, focusing on modeling misaligned perceptions and nested beliefs in multi-agent systems, highlightin...

arXiv - AI · 4 min
AI Safety

[2501.05454] The Epistemic Asymmetry of Consciousness Self-Reports: A Formal Analysis of AI Consciousness Denial

This article presents a formal analysis of AI consciousness denial, revealing that self-reports of consciousness by AI systems are episte...

arXiv - AI · 4 min
LLMs

[2602.13156] In-Context Autonomous Network Incident Response: An End-to-End Large Language Model Agent Approach

This article presents a novel approach to network incident response using a large language model (LLM) that autonomously learns and adapt...

arXiv - AI · 4 min
LLMs

[2602.13110] SCOPE: Selective Conformal Optimized Pairwise LLM Judging

The paper presents SCOPE, a framework for selective pairwise evaluation using large language models (LLMs) that improves judgment accurac...

arXiv - AI · 4 min
Computer Vision

[2602.13088] How cyborg propaganda reshapes collective action

This paper explores the emergence of 'cyborg propaganda,' where human and AI collaboration reshapes collective action, blurring lines bet...

arXiv - AI · 4 min
Machine Learning

[2602.13087] EXCODER: EXplainable Classification Of DiscretE time series Representations

The paper explores EXCODER, a method for explainable classification of discrete time series representations, enhancing interpretability w...

arXiv - Machine Learning · 4 min
Machine Learning

[2602.13061] Diverging Flows: Detecting Extrapolations in Conditional Generation

The paper introduces Diverging Flows, a method for detecting extrapolations in conditional generation models, enhancing safety in applica...

arXiv - Machine Learning · 3 min
Machine Learning

[2602.13055] Curriculum-DPO++: Direct Preference Optimization via Data and Model Curricula for Text-to-Image Generation

The paper presents Curriculum-DPO++, an advanced method for text-to-image generation that optimizes preference learning through a dual cu...

arXiv - Machine Learning · 4 min
Machine Learning

[2602.13047] Can we trust AI to detect healthy multilingual English speakers among the cognitively impaired cohort in the UK? An investigation using real-world conversational speech

This study investigates the reliability of AI in detecting cognitive impairment among multilingual English speakers in the UK, revealing ...

arXiv - AI · 4 min
LLMs

[2602.13033] Buy versus Build an LLM: A Decision Framework for Governments

This paper presents a strategic framework for governments to decide between buying or building large language models (LLMs) for public se...

arXiv - AI · 4 min
Machine Learning

[2602.13017] Synaptic Activation and Dual Liquid Dynamics for Interpretable Bio-Inspired Models

This paper presents a unified framework for bio-inspired models that enhances interpretability in recurrent neural networks (RNNs) throug...

arXiv - Machine Learning · 3 min
Machine Learning

[2602.12983] Detecting Object Tracking Failure via Sequential Hypothesis Testing

This paper presents a method for detecting object tracking failures using sequential hypothesis testing, enhancing safety in computer vis...

arXiv - AI · 4 min
Machine Learning

[2602.12975] Extending confidence calibration to generalised measures of variation

The paper introduces the Variation Calibration Error (VCE) metric, extending confidence calibration methods in machine learning to assess...

arXiv - Machine Learning · 3 min