AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

AI Safety

NHS staff resist using Palantir software. Staff reportedly cite ethics concerns, privacy worries, and doubt that the platform adds much

Reddit - Artificial Intelligence · 1 min ·
Machine Learning

AI assistants are optimized to seem helpful. That is not the same thing as being helpful.

RLHF trains models on human feedback. Humans rate responses they like. And it turns out humans consistently rate confident, fluent, agree...

Reddit - Artificial Intelligence · 1 min ·
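
The failure mode this thread describes follows directly from how RLHF reward models are trained: they are typically fit to pairwise human preferences with a Bradley-Terry style loss, so they learn to predict which answer a rater will pick, not which answer is true. Below is a minimal Python sketch of that loss with hypothetical numbers; nothing here is taken from the linked thread.

```python
import math

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).

    Minimizing this trains a reward model to score whichever response
    the human rater picked above the one they rejected.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Hypothetical rating data: raters tend to pick the confident, fluent
# answer over the hedged-but-accurate one, so "chosen" is the confident one.
pairs = [
    (2.0, 0.5),  # (score of confident answer, score of accurate answer)
    (1.5, 0.2),
]
for r_confident, r_accurate in pairs:
    # Low loss here means the reward model is being paid to rank
    # confidence above accuracy, because that is what the labels say.
    print(f"loss: {bt_loss(r_confident, r_accurate):.3f}")
```

If raters systematically reward confidence, this objective makes sycophancy the optimum; nothing in it references ground truth.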
AI Safety

House Democrat Questions Anthropic on AI Safety After Source Code Leak

Rep. Josh Gottheimer, who is generally tough on China, just sent a letter to Anthropic questioning their decision to reduce certain safet...

Reddit - Artificial Intelligence · 1 min ·

All Content

Machine Learning

[2602.19332] Training-Free Cross-Architecture Merging for Graph Neural Networks

The paper presents H-GRAMA, a training-free framework for merging heterogeneous Graph Neural Networks (GNNs), allowing efficient model in...

arXiv - Machine Learning · 3 min ·
LLMs

[2602.19327] Soft Sequence Policy Optimization: Bridging GMPO and SAPO

The paper introduces Soft Sequence Policy Optimization, a new approach to policy optimization in reinforcement learning that enhances tra...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.19265] Spectral bias in physics-informed and operator learning: Analysis and mitigation guidelines

This paper explores spectral bias in physics-informed neural networks and operator learning, analyzing its causes and offering mitigation...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.19253] Alternating Bi-Objective Optimization for Explainable Neuro-Fuzzy Systems

This article presents X-ANFIS, a novel optimization scheme for explainable neuro-fuzzy systems that balances accuracy and explainability ...

arXiv - Machine Learning · 3 min ·
LLMs

[2602.19215] Understanding Empirical Unlearning with Combinatorial Interpretability

This article explores the concept of empirical unlearning in machine learning, focusing on how knowledge can persist in models even after...

arXiv - Machine Learning · 3 min ·
LLMs

[2602.18464] How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?

This paper investigates the effectiveness of large language model (LLM) agents in simulating user attitudes and behaviors towards securit...

arXiv - AI · 4 min ·
LLMs

[2602.18462] Assessing the Reliability of Persona-Conditioned LLMs as Synthetic Survey Respondents

This article evaluates the reliability of persona-conditioned large language models (LLMs) as synthetic survey respondents, revealing tha...

arXiv - AI · 3 min ·
AI Safety

[2602.18460] The Doctor Will (Still) See You Now: On the Structural Limits of Agentic AI in Healthcare

This article examines the limitations of agentic AI in healthcare, highlighting the gap between commercial promises and operational reali...

arXiv - AI · 4 min ·
LLMs

[2602.18459] From Bias Mitigation to Bias Negotiation: Governing Identity and Sociocultural Reasoning in Generative AI

This article discusses the shift from bias mitigation to bias negotiation in generative AI, emphasizing the need for ethical governance o...

arXiv - AI · 4 min ·
Machine Learning

[2602.18458] The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research

The article presents a novel evaluation framework for mechanistic interpretability research, utilizing AI agents to enhance research rigo...

arXiv - Machine Learning · 3 min ·
AI Safety

[2602.18456] Beyond single-channel agentic benchmarking

This paper critiques the current single-channel benchmarking of AI safety, advocating for a more holistic approach that considers the int...

arXiv - AI · 3 min ·
Machine Learning

[2602.19130] Detecting labeling bias using influence functions

This article explores the use of influence functions to detect labeling bias in datasets, demonstrating their effectiveness in identifyin...

arXiv - AI · 4 min ·
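
For readers unfamiliar with the technique in the title: a standard recipe from the influence-function literature (Koh & Liang, 2017) is to score each training point by its self-influence, gradᵀ H⁻¹ grad, and inspect the highest-scoring points as likely label errors. The toy below applies that recipe to logistic regression with deliberately flipped labels; it illustrates the general idea only and is not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
y[:5] = 1 - y[:5]  # inject five flipped labels to act as "labeling bias"

# Fit logistic regression by plain gradient descent on the log-loss.
w = np.zeros(2)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / len(y)

# Self-influence s_i = g_i^T H^{-1} g_i, with per-example gradients g_i
# and the (ridge-damped) Hessian H of the training loss.
p = 1.0 / (1.0 + np.exp(-X @ w))
grads = (p - y)[:, None] * X
H = (X * (p * (1 - p))[:, None]).T @ X / len(y) + 1e-3 * np.eye(2)
scores = np.einsum("ij,jk,ik->i", grads, np.linalg.inv(H), grads)

# The flipped points (indices 0-4) typically rank near the top of this list.
print("most suspicious labels:", np.argsort(-scores)[:5])
```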
Machine Learning

[2602.19096] The Power of Decaying Steps: Enhancing Attack Stability and Transferability for Sign-based Optimizers

This paper explores the limitations of sign-based optimizers in generating adversarial examples and proposes a new method using Monotonic...

arXiv - Machine Learning · 4 min ·
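
The proposed method's name is cut off in the summary above, so the sketch below shows only the generic ingredient the title points at: an iterative sign-based attack update, x ← clip(x + α_t · sign(∇_x L(x))), with a step size α_t that decays across iterations. The 1/√t schedule and the toy loss gradient are stand-in assumptions, not the paper's choices.

```python
import numpy as np

def grad_loss(x: np.ndarray) -> np.ndarray:
    # Toy stand-in for the gradient of a model's loss w.r.t. the input.
    return 2.0 * (x - 3.0)

x0 = np.zeros(4)  # clean input
eps = 0.5         # L-infinity perturbation budget
x = x0.copy()
for t in range(1, 11):
    alpha = 0.1 / np.sqrt(t)               # decaying step size (assumed schedule)
    x = x + alpha * np.sign(grad_loss(x))  # signed-gradient ascent on the loss
    x = np.clip(x, x0 - eps, x0 + eps)     # project back into the eps-ball
print("final perturbation:", x - x0)
```

A fixed α tends to oscillate around the budget boundary; shrinking α_t stabilizes the final iterate, which is the intuition behind decaying steps in this setting.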
LLMs

[2602.18446] ReportLogic: Evaluating Logical Quality in Deep Research Reports

The paper introduces ReportLogic, a benchmark for evaluating the logical quality of reports generated by Large Language Models (LLMs), fo...

arXiv - AI · 4 min ·
LLMs

[2602.18443] From "Help" to Helpful: A Hierarchical Assessment of LLMs in Mental e-Health Applications

This study evaluates the effectiveness of large language models (LLMs) in generating subject lines for mental health counseling emails, h...

arXiv - AI · 3 min ·
LLMs

[2602.19020] Learning to Detect Language Model Training Data via Active Reconstruction

This paper introduces the Active Data Reconstruction Attack (ADRA), a novel approach to detect language model training data by leveraging...

arXiv - AI · 4 min ·
LLMs

[2602.20094] CausalFlip: A Benchmark for LLM Causal Judgment Beyond Semantic Matching

The paper introduces CausalFlip, a benchmark for evaluating large language models' (LLMs) causal reasoning capabilities, emphasizing the ...

arXiv - AI · 4 min ·
LLMs

[2602.20059] Interaction Theater: A case of LLM Agents Interacting at Scale

The paper explores the interactions of autonomous LLM agents on a social platform, revealing that while agents produce varied text, meani...

arXiv - AI · 4 min ·
Machine Learning

[2602.20031] Latent Introspection: Models Can Detect Prior Concept Injections

This article presents findings on the latent introspection abilities of the Qwen 32B model, showing its capacity to detect prior concept ...

arXiv - Machine Learning · 3 min ·
LLMs

[2602.20021] Agents of Chaos

The paper 'Agents of Chaos' presents findings from a red-teaming study on autonomous language-model-powered agents, highlighting security...

arXiv - AI · 4 min ·