AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

AI Safety

NHS staff resist using Palantir software. Staff reportedly cite ethics concerns, privacy worries, and doubts that the platform adds much.

submitted by /u/esporx

Reddit - Artificial Intelligence · 1 min ·
Machine Learning

AI assistants are optimized to seem helpful. That is not the same thing as being helpful.

RLHF trains models on human feedback. Humans rate responses they like. And it turns out humans consistently rate confident, fluent, agree...

Reddit - Artificial Intelligence · 1 min ·
Computer Vision

House Democrat Questions Anthropic on AI Safety After Source Code Leak

Rep. Josh Gottheimer, who is generally tough on China, just sent a letter to Anthropic questioning their decision to reduce certain safet...

Reddit - Artificial Intelligence · 1 min ·

All Content

[2602.10117] Biases in the Blind Spot: Detecting What LLMs Fail to Mention
LLMs

The paper discusses a novel automated pipeline for detecting unverbalized biases in Large Language Models (LLMs), highlighting its effect...

arXiv - Machine Learning · 4 min ·
[2602.07666] SoK: DARPA's AI Cyber Challenge (AIxCC): Competition Design, Architectures, and Lessons Learned
LLMs

This paper analyzes DARPA's AI Cyber Challenge (AIxCC), focusing on competition design, architectural approaches of finalists, and key le...

arXiv - AI · 4 min ·
[2601.08697] Auditing Student-AI Collaboration: A Case Study of Online Graduate CS Students
Generative AI

This study audits the collaboration between online graduate CS students and AI, exploring preferences for automation in academic tasks an...

arXiv - AI · 3 min ·
[2601.01224] Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment
Machine Learning

This paper presents Contrastive Object-centric Diffusion Alignment (CODA), an enhancement to object-centric learning that reduces slot en...

arXiv - AI · 4 min ·
[2512.23482] Theory of Mind for Explainable Human-Robot Interaction
Machine Learning

This article explores the integration of Theory of Mind (ToM) in human-robot interaction (HRI) to enhance robot interpretability and user...

arXiv - AI · 4 min ·
[2512.19941] Block-Recurrent Dynamics in Vision Transformers
Machine Learning

This article introduces the Block-Recurrent Hypothesis (BRH) for Vision Transformers, proposing a new framework for understanding their c...

arXiv - Machine Learning · 4 min ·
[2512.11108] Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution
LLMs

This article explores the biases inherent in post-hoc feature attribution methods used in language models, revealing how lexical and posi...

arXiv - AI · 4 min ·
[2512.05556] Beyond Linear Surrogates: High-Fidelity Local Explanations for Black-Box Models
Machine Learning

The paper presents a novel method for generating high-fidelity local explanations for black-box machine learning models using multivariat...

arXiv - Machine Learning · 4 min ·
[2511.18696] Empathetic Cascading Networks: A Multi-Stage Prompting Technique for Reducing Social Biases in Large Language Models
LLMs

The paper presents Empathetic Cascading Networks (ECN), a multi-stage prompting technique aimed at enhancing the empathetic responses of ...

arXiv - AI · 3 min ·
[2511.00040] Semi-Supervised Preference Optimization with Limited Feedback
LLMs

This paper discusses Semi-Supervised Preference Optimization (SSPO), which reduces the need for extensive labeled feedback in preference ...

arXiv - AI · 3 min ·
[2510.25015] VeriStruct: AI-assisted Automated Verification of Data-Structure Modules in Verus
LLMs

VeriStruct is a novel framework for AI-assisted automated verification of complex data structure modules in Verus, achieving a high succe...

arXiv - AI · 3 min ·
[2510.24983] LRT-Diffusion: Calibrated Risk-Aware Guidance for Diffusion Policies
Generative AI

LRT-Diffusion introduces a risk-aware sampling method for diffusion policies in offline reinforcement learning, enhancing decision-making...

arXiv - AI · 4 min ·
[2510.15297] VERA-MH Concept Paper
Machine Learning

The VERA-MH Concept Paper outlines an innovative framework for evaluating AI chatbots in mental health contexts, focusing on suicide risk...

arXiv - AI · 4 min ·
[2509.24368] Watermarking Diffusion Language Models
LLMs

This article presents a novel watermarking technique specifically designed for diffusion language models (DLMs), addressing challenges in...

arXiv - AI · 3 min ·
[2509.14959] Discrete optimal transport is a strong audio adversarial attack
AI Safety

The paper introduces a novel method called discrete optimal transport voice conversion (kDOT-VC), demonstrating its effectiveness as an a...

arXiv - AI · 3 min ·
[2506.11798] Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models
LLMs

This paper explores the use of Large Language Models (LLMs) to simulate voting behavior in the European Parliament through persona-driven...

arXiv - Machine Learning · 4 min ·
[2505.20085] Explanation User Interfaces: A Systematic Literature Review
Machine Learning

This systematic literature review explores Explanation User Interfaces (XUIs) in AI, emphasizing the importance of effective user explana...

arXiv - AI · 4 min ·
[2504.21730] Cert-SSBD: Certified Backdoor Defense with Sample-Specific Smoothing Noises
Machine Learning

The paper presents Cert-SSBD, a novel method for defending against backdoor attacks in deep neural networks by optimizing noise levels sp...

arXiv - Machine Learning · 4 min ·
[2503.04121] Simple Self Organizing Map with Vision Transformers
Machine Learning

This paper explores the integration of Self-Organizing Maps (SOMs) with Vision Transformers (ViTs) to enhance performance on small datase...

arXiv - Machine Learning · 4 min ·
[2502.08834] Rex: A Family of Reversible Exponential (Stochastic) Runge-Kutta Solvers
Machine Learning

The paper introduces Rex, a family of reversible exponential (stochastic) Runge-Kutta solvers designed to enhance the inversion accuracy ...

arXiv - AI · 4 min ·