AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

AI Safety

NHS staff resist using Palantir software. Staff reportedly cite ethics concerns, privacy worries, and doubt the platform adds much


Reddit - Artificial Intelligence · 1 min ·
Machine Learning

AI assistants are optimized to seem helpful. That is not the same thing as being helpful.

RLHF trains models on human feedback. Humans rate responses they like. And it turns out humans consistently rate confident, fluent, agree...

Reddit - Artificial Intelligence · 1 min ·
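
The dynamic this post describes traces back to the reward-modeling step of RLHF: a reward model is fit to pairwise human preferences, so whatever raters systematically favor (confidence, fluency, agreement) becomes the optimization target. A minimal sketch of the standard Bradley-Terry preference loss in PyTorch; the linear model and random tensors below are placeholders, not any particular system:

import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen, rejected):
    # Bradley-Terry objective: push the reward of the rater-preferred
    # response above the rejected one. Whatever raters reliably favor
    # is what this loss teaches the reward model to score highly.
    r_chosen = reward_model(chosen)      # (batch, 1) scores
    r_rejected = reward_model(rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

torch.manual_seed(0)
reward_model = torch.nn.Linear(16, 1)    # toy stand-in for a reward model
chosen = torch.randn(8, 16)              # encoded preferred responses
rejected = torch.randn(8, 16)            # encoded rejected responses
loss = preference_loss(reward_model, chosen, rejected)
loss.backward()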
AI Safety

House Democrat Questions Anthropic on AI Safety After Source Code Leak

Rep. Josh Gottheimer, who is generally tough on China, just sent a letter to Anthropic questioning their decision to reduce certain safet...

Reddit - Artificial Intelligence · 1 min ·

All Content

Machine Learning

[2602.19138] CRCC: Contrast-Based Robust Cross-Subject and Cross-Site Representation Learning for EEG

The paper presents CRCC, a novel framework for improving EEG-based neural decoding models' generalization across different acquisition si...

arXiv - AI · 3 min ·
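
For background on the mechanism frameworks like this typically build on: contrastive objectives embed two views of the same EEG segment close together while pushing apart the other segments in the batch, which is what encourages subject- and site-invariant features. A generic InfoNCE sketch in PyTorch, not CRCC's specific objective:

import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    # z1[i] and z2[i] embed two views of the same segment; every other
    # row of the batch serves as a negative for row i.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # scaled cosine similarities
    labels = torch.arange(z1.size(0))    # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

torch.manual_seed(0)
z1, z2 = torch.randn(16, 32), torch.randn(16, 32)  # stand-in embeddings
loss = info_nce(z1, z2)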
LLMs

[2602.19115] How Do LLMs Encode Scientific Quality? An Empirical Study Using Monosemantic Features from Sparse Autoencoders

This paper investigates how large language models (LLMs) encode scientific quality using monosemantic features from sparse autoencoders, ...

arXiv - AI · 4 min ·
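
Background on the tooling rather than the paper's findings: a sparse autoencoder decomposes a model's hidden activations into an overcomplete feature dictionary, and an L1 penalty keeps most features inactive per input, which is what makes individual features candidates for monosemantic interpretation. A minimal PyTorch sketch with arbitrary dimensions:

import torch

class SparseAutoencoder(torch.nn.Module):
    # Overcomplete dictionary over activations: d_model -> d_feat >> d_model.
    def __init__(self, d_model=64, d_feat=512):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_feat)
        self.dec = torch.nn.Linear(d_feat, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))  # nonnegative feature activations
        return self.dec(feats), feats

torch.manual_seed(0)
sae = SparseAutoencoder()
acts = torch.randn(32, 64)               # stand-in residual-stream activations
recon, feats = sae(acts)
# Reconstruction term plus L1 sparsity: few features fire per input.
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
loss.backward()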
Machine Learning

[2602.18536] Triggering hallucinations in model-based MRI reconstruction via adversarial perturbations

This paper investigates how adversarial perturbations can induce hallucinations in generative models used for MRI reconstruction, highlig...

arXiv - Machine Learning · 4 min ·
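
The attack surface is generic to learned reconstruction: a small, norm-bounded change to the input measurements can steer the network's output toward structures that were never in the data. A one-step gradient-sign (FGSM-style) sketch against a toy network; the paper's actual MRI perturbation model is not reproduced here:

import torch

torch.manual_seed(0)
recon_net = torch.nn.Sequential(            # toy stand-in reconstructor
    torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 32))

measurements = torch.randn(1, 32)
target = torch.zeros(1, 32)
target[0, 5] = 5.0                           # structure to hallucinate

x = measurements.clone().requires_grad_(True)
loss = ((recon_net(x) - target) ** 2).mean()
loss.backward()

eps = 0.01                                   # perturbation budget
x_adv = measurements - eps * x.grad.sign()
# recon_net(x_adv) is nudged toward `target`, yet x_adv differs from
# the clean measurements by at most eps per coordinate.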
LLMs

[2602.19101] Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models

This paper investigates value entanglement in Large Language Models (LLMs), revealing how moral values influence grammatical and economic...

arXiv - AI · 3 min ·
Machine Learning

[2602.18502] Mitigating Shortcut Learning via Feature Disentanglement in Medical Imaging: A Benchmark Study

This study evaluates feature disentanglement methods to mitigate shortcut learning in medical imaging, enhancing model robustness and cla...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.19087] Detecting Cybersecurity Threats by Integrating Explainable AI with SHAP Interpretability and Strategic Data Sampling

This article presents a novel framework for detecting cybersecurity threats by integrating Explainable AI (XAI) with SHAP interpretabilit...

arXiv - Machine Learning · 3 min ·
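
For readers new to the interpretability half of that pipeline: SHAP assigns each input feature a per-prediction contribution, which is what makes a threat classifier's decisions auditable. A generic sketch with the shap package and synthetic flow-like features; this is not the paper's dataset or sampling strategy:

import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                  # stand-in network-flow features
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)  # synthetic "threat" label

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Per-feature, per-prediction attributions for the first ten samples.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:10])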
LLMs

[2602.19040] Adaptive Multi-Agent Reasoning for Text-to-Video Retrieval

The paper presents an adaptive multi-agent framework for improving text-to-video retrieval systems, addressing challenges in query-depend...

arXiv - AI · 4 min ·
Machine Learning

[2602.18489] DCInject: Persistent Backdoor Attacks via Frequency Manipulation in Personal Federated Learning

The paper presents DCInject, a novel backdoor attack method targeting personalized federated learning (PFL) systems, demonstrating high a...

arXiv - Machine Learning · 3 min ·
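
The general idea of frequency-domain triggers, illustrated rather than DCInject's actual construction: a perturbation planted in a fixed spectral band is nearly invisible in pixel space but identical across inputs, which helps a backdoor survive training and aggregation. A NumPy sketch with a hypothetical band and strength:

import numpy as np

def inject_frequency_trigger(img, strength=0.05):
    # Add energy to a fixed mid-frequency band of the 2D spectrum;
    # the band and strength here are chosen only for illustration.
    spec = np.fft.fft2(img)
    spec[8:12, 8:12] += strength * np.abs(spec).max()
    return np.real(np.fft.ifft2(spec))

rng = np.random.default_rng(0)
clean = rng.normal(size=(32, 32))
poisoned = inject_frequency_trigger(clean)
print(np.abs(poisoned - clean).max())  # small change per pixel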
Machine Learning

[2602.19028] The Metaphysics We Train: A Heideggerian Reading of Machine Learning

This paper explores machine learning through a Heideggerian lens, highlighting insights on algorithmic opacity, the limitations of calcul...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.19025] Routing-Aware Explanations for Mixture of Experts Graph Models in Malware Detection

This article presents a novel approach to malware detection using Mixture-of-Experts (MoE) graph models, emphasizing routing-aware explan...

arXiv - AI · 4 min ·
LLMs

[2602.18916] Adaptive Collaboration of Arena-Based Argumentative LLMs for Explainable and Contestable Legal Reasoning

The paper presents Adaptive Collaboration of Arena-Based Argumentative LLMs (ACAL), a framework designed for explainable and contestable ...

arXiv - AI · 4 min ·
Machine Learning

[2602.20111] Reliable Abstention under Adversarial Injections: Tight Lower Bounds and New Upper Bounds

This paper explores reliable abstention in online learning under adversarial injections, presenting new lower and upper bounds for error ...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.20102] BarrierSteer: LLM Safety via Learning Barrier Steering

The article presents BarrierSteer, a framework designed to enhance the safety of large language models (LLMs) by embedding learned safety...

arXiv - AI · 3 min ·
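
The family of techniques this belongs to is activation steering: adding a direction to the model's hidden states at inference time to bias generation. BarrierSteer learns its intervention via barrier functions; the sketch below shows only the generic additive-steering recipe, with a random stand-in direction:

import torch

def steer(hidden, direction, alpha=4.0):
    # Shift hidden states along a (normalized) steering direction.
    return hidden + alpha * direction / direction.norm()

torch.manual_seed(0)
hidden = torch.randn(1, 10, 64)   # (batch, seq, d_model) activations
direction = torch.randn(64)       # stand-in learned safety direction
steered = steer(hidden, direction)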
Machine Learning

[2602.20062] A Theory of How Pretraining Shapes Inductive Bias in Fine-Tuning

This paper presents a theoretical framework explaining how pretraining influences inductive bias during fine-tuning in machine learning, ...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.20003] A Secure and Private Distributed Bayesian Federated Learning Design

This paper presents a novel framework for Distributed Federated Learning (DFL) that enhances privacy, convergence speed, and robustness a...

arXiv - AI · 4 min ·
LLMs

[2602.18880] FOCA: Frequency-Oriented Cross-Domain Forgery Detection, Localization and Explanation via Multi-Modal Large Language Model

The paper presents FOCA, a novel framework for detecting and localizing image forgery using a multi-modal large language model that integ...

arXiv - AI · 3 min ·
Machine Learning

[2602.19964] On the Equivalence of Random Network Distillation, Deep Ensembles, and Bayesian Inference

This paper establishes theoretical connections between Random Network Distillation (RND), Deep Ensembles, and Bayesian Inference, enhanci...

arXiv - Machine Learning · 4 min ·
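
As background on the first ingredient: Random Network Distillation trains a predictor to match a fixed, randomly initialized target network; the predictor's error stays low on familiar inputs and grows on novel ones, yielding an uncertainty signal. A minimal PyTorch sketch (the paper's equivalence results are theoretical and not reproduced here):

import torch

torch.manual_seed(0)
make_net = lambda: torch.nn.Sequential(
    torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 16))
target, predictor = make_net(), make_net()
for p in target.parameters():
    p.requires_grad_(False)               # target stays frozen

opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)
seen = torch.randn(256, 8)                # "familiar" distribution
for _ in range(200):
    err = ((predictor(seen) - target(seen)) ** 2).mean()
    opt.zero_grad(); err.backward(); opt.step()

novel = torch.randn(256, 8) * 3 + 2       # shifted, unseen inputs
with torch.no_grad():
    print(((predictor(seen) - target(seen)) ** 2).mean().item(),
          ((predictor(novel) - target(novel)) ** 2).mean().item())
# The second (novel) error should be noticeably larger.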
Machine Learning

[2602.19945] DP-FedAdamW: An Efficient Optimizer for Differentially Private Federated Large Models

The paper introduces DP-FedAdamW, a novel optimizer designed for differentially private federated learning, addressing key challenges in ...

arXiv - AI · 3 min ·
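
The privacy half of such optimizers follows the standard Gaussian mechanism: clip each client update to a fixed norm, add calibrated noise, then let the server take an adaptive step. A sketch of that generic recipe, not DP-FedAdamW's exact procedure; all values are toy stand-ins:

import torch

def privatize(update, clip=1.0, sigma=0.5):
    # Clip to L2 norm `clip`, then add Gaussian noise scaled to the
    # clip bound: the usual Gaussian mechanism for DP aggregation.
    scale = torch.clamp(clip / (update.norm() + 1e-12), max=1.0)
    clipped = update * scale
    return clipped + torch.randn_like(clipped) * sigma * clip

torch.manual_seed(0)
w = torch.zeros(5, requires_grad=True)
opt = torch.optim.AdamW([w], lr=1e-2)     # decoupled weight decay

client_grads = [torch.randn(5) for _ in range(10)]  # stand-in updates
noisy_mean = torch.stack([privatize(g) for g in client_grads]).mean(0)

opt.zero_grad()
w.grad = noisy_mean    # server treats the private average as the gradient
opt.step()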
AI Infrastructure

[2602.18844] When Agda met Vampire

The paper discusses integrating proof assistants like Agda with automated theorem provers (ATPs) to enhance automation in mechanized math...

arXiv - AI · 3 min ·
Robotics

[2602.18832] OpenClaw AI Agents as Informal Learners at Moltbook: Characterizing an Emergent Learning Community at Scale

This article presents an empirical study of Moltbook, a large-scale informal learning community composed entirely of AI agents, highlight...

arXiv - AI · 4 min ·