AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

AI Safety

NHS staff resist using Palantir software. Staff reportedly cite ethics concerns, privacy worries, and doubt the platform adds much

submitted by /u/esporx

Reddit - Artificial Intelligence · 1 min · 2 days ago

Machine Learning

AI assistants are optimized to seem helpful. That is not the same thing as being helpful.

RLHF trains models on human feedback. Humans rate responses they like. And it turns out humans consistently rate confident, fluent, agree...

Reddit - Artificial Intelligence · 1 min · 2 days ago

Computer Vision

House Democrat Questions Anthropic on AI Safety After Source Code Leak

Rep. Josh Gottheimer, who is generally tough on China, just sent a letter to Anthropic questioning their decision to reduce certain safet...

Reddit - Artificial Intelligence · 1 min · 2 days ago

All Content

LLMs

[2602.10117] Biases in the Blind Spot: Detecting What LLMs Fail to Mention

The paper discusses a novel automated pipeline for detecting unverbalized biases in Large Language Models (LLMs), highlighting its effect...

arXiv - Machine Learning · 4 min · about 1 month ago

LLMs

[2602.07666] SoK: DARPA's AI Cyber Challenge (AIxCC): Competition Design, Architectures, and Lessons Learned

This paper analyzes DARPA's AI Cyber Challenge (AIxCC), focusing on competition design, architectural approaches of finalists, and key le...

arXiv - AI · 4 min · about 1 month ago

Generative AI

[2601.08697] Auditing Student-AI Collaboration: A Case Study of Online Graduate CS Students

This study audits the collaboration between online graduate CS students and AI, exploring preferences for automation in academic tasks an...

arXiv - AI · 3 min · about 1 month ago

Machine Learning

[2601.01224] Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment

This paper presents Contrastive Object-centric Diffusion Alignment (CODA), an enhancement to object-centric learning that reduces slot en...

arXiv - AI · 4 min · about 1 month ago

Machine Learning

[2512.23482] Theory of Mind for Explainable Human-Robot Interaction

This article explores the integration of Theory of Mind (ToM) in human-robot interaction (HRI) to enhance robot interpretability and user...

arXiv - AI · 4 min · about 1 month ago

Machine Learning

[2512.19941] Block-Recurrent Dynamics in Vision Transformers

This article introduces the Block-Recurrent Hypothesis (BRH) for Vision Transformers, proposing a new framework for understanding their c...

arXiv - Machine Learning · 4 min · about 1 month ago

LLMs

[2512.11108] Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution

This article explores the biases inherent in post-hoc feature attribution methods used in language models, revealing how lexical and posi...

arXiv - AI · 4 min · about 1 month ago

Machine Learning

[2512.05556] Beyond Linear Surrogates: High-Fidelity Local Explanations for Black-Box Models

The paper presents a novel method for generating high-fidelity local explanations for black-box machine learning models using multivariat...

arXiv - Machine Learning · 4 min · about 1 month ago

LLMs

[2511.18696] Empathetic Cascading Networks: A Multi-Stage Prompting Technique for Reducing Social Biases in Large Language Models

The paper presents Empathetic Cascading Networks (ECN), a multi-stage prompting technique aimed at enhancing the empathetic responses of ...

arXiv - AI · 3 min · about 1 month ago

LLMs

[2511.00040] Semi-Supervised Preference Optimization with Limited Feedback

This paper discusses Semi-Supervised Preference Optimization (SSPO), which reduces the need for extensive labeled feedback in preference ...

arXiv - AI · 3 min · about 1 month ago

LLMs

[2510.25015] VeriStruct: AI-assisted Automated Verification of Data-Structure Modules in Verus

VeriStruct is a novel framework for AI-assisted automated verification of complex data structure modules in Verus, achieving a high succe...

arXiv - AI · 3 min · about 1 month ago

Generative AI

[2510.24983] LRT-Diffusion: Calibrated Risk-Aware Guidance for Diffusion Policies

LRT-Diffusion introduces a risk-aware sampling method for diffusion policies in offline reinforcement learning, enhancing decision-making...

arXiv - AI · 4 min · about 1 month ago

Machine Learning

[2510.15297] VERA-MH Concept Paper

The VERA-MH Concept Paper outlines an innovative framework for evaluating AI chatbots in mental health contexts, focusing on suicide risk...

arXiv - AI · 4 min · about 1 month ago

LLMs

[2509.24368] Watermarking Diffusion Language Models

This article presents a novel watermarking technique specifically designed for diffusion language models (DLMs), addressing challenges in...

arXiv - AI · 3 min · about 1 month ago

AI Safety

[2509.14959] Discrete optimal transport is a strong audio adversarial attack

The paper introduces a novel method called discrete optimal transport voice conversion (kDOT-VC), demonstrating its effectiveness as an a...

arXiv - AI · 3 min · about 1 month ago

LLMs

[2506.11798] Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models

This paper explores the use of Large Language Models (LLMs) to simulate voting behavior in the European Parliament through persona-driven...

arXiv - Machine Learning · 4 min · about 1 month ago

Machine Learning

[2505.20085] Explanation User Interfaces: A Systematic Literature Review

This systematic literature review explores Explanation User Interfaces (XUIs) in AI, emphasizing the importance of effective user explana...

arXiv - AI · 4 min · about 1 month ago

Machine Learning

[2504.21730] Cert-SSBD: Certified Backdoor Defense with Sample-Specific Smoothing Noises

The paper presents Cert-SSBD, a novel method for defending against backdoor attacks in deep neural networks by optimizing noise levels sp...

arXiv - Machine Learning · 4 min · about 1 month ago

Machine Learning

[2503.04121] Simple Self Organizing Map with Vision Transformers

This paper explores the integration of Self-Organizing Maps (SOMs) with Vision Transformers (ViTs) to enhance performance on small datase...

arXiv - Machine Learning · 4 min · about 1 month ago

Machine Learning

[2502.08834] Rex: A Family of Reversible Exponential (Stochastic) Runge-Kutta Solvers

The paper introduces Rex, a family of reversible exponential (stochastic) Runge-Kutta solvers designed to enhance the inversion accuracy ...

arXiv - AI · 4 min · about 1 month ago
