AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

AI Safety

NHS staff resist using Palantir software. Staff reportedly cite ethics concerns, privacy worries, and doubt the platform adds much


Reddit - Artificial Intelligence · 1 min ·
Machine Learning

AI assistants are optimized to seem helpful. That is not the same thing as being helpful.

RLHF trains models on human feedback. Humans rate responses they like. And it turns out humans consistently rate confident, fluent, agree...

Reddit - Artificial Intelligence · 1 min ·
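
The dynamic this post describes traces back to the reward-modeling step of RLHF: a reward model is fit to pairwise human preferences, so whatever raters systematically favor (confidence, fluency, agreement) becomes the optimization target. A minimal sketch of the standard Bradley-Terry preference loss in PyTorch; the linear model and random tensors below are placeholders, not any particular system:

import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen, rejected):
    # Bradley-Terry objective: push the reward of the rater-preferred
    # response above the rejected one. Whatever raters reliably favor
    # is what this loss teaches the reward model to score highly.
    r_chosen = reward_model(chosen)      # (batch, 1) scores
    r_rejected = reward_model(rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

torch.manual_seed(0)
reward_model = torch.nn.Linear(16, 1)    # toy stand-in for a reward model
chosen = torch.randn(8, 16)              # encoded preferred responses
rejected = torch.randn(8, 16)            # encoded rejected responses
loss = preference_loss(reward_model, chosen, rejected)
loss.backward()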
AI Safety

House Democrat Questions Anthropic on AI Safety After Source Code Leak

Rep. Josh Gottheimer, who is generally tough on China, just sent a letter to Anthropic questioning their decision to reduce certain safet...

Reddit - Artificial Intelligence · 1 min ·

All Content

Machine Learning

[2602.19138] CRCC: Contrast-Based Robust Cross-Subject and Cross-Site Representation Learning for EEG

The paper presents CRCC, a novel framework for improving EEG-based neural decoding models' generalization across different acquisition si...

arXiv - AI · 3 min ·
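
For background on the mechanism frameworks like this typically build on: contrastive objectives embed two views of the same EEG segment close together while pushing apart the other segments in the batch, which is what encourages subject- and site-invariant features. A generic InfoNCE sketch in PyTorch, not CRCC's specific objective:

import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    # z1[i] and z2[i] embed two views of the same segment; every other
    # row of the batch serves as a negative for row i.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # scaled cosine similarities
    labels = torch.arange(z1.size(0))    # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

torch.manual_seed(0)
z1, z2 = torch.randn(16, 32), torch.randn(16, 32)  # stand-in embeddings
loss = info_nce(z1, z2)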
LLMs

[2602.19115] How Do LLMs Encode Scientific Quality? An Empirical Study Using Monosemantic Features from Sparse Autoencoders

This paper investigates how large language models (LLMs) encode scientific quality using monosemantic features from sparse autoencoders, ...

arXiv - AI · 4 min ·
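
Background on the tooling rather than the paper's findings: a sparse autoencoder decomposes a model's hidden activations into an overcomplete feature dictionary, and an L1 penalty keeps most features inactive per input, which is what makes individual features candidates for monosemantic interpretation. A minimal PyTorch sketch with arbitrary dimensions:

import torch

class SparseAutoencoder(torch.nn.Module):
    # Overcomplete dictionary over activations: d_model -> d_feat >> d_model.
    def __init__(self, d_model=64, d_feat=512):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_feat)
        self.dec = torch.nn.Linear(d_feat, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))  # nonnegative feature activations
        return self.dec(feats), feats

torch.manual_seed(0)
sae = SparseAutoencoder()
acts = torch.randn(32, 64)               # stand-in residual-stream activations
recon, feats = sae(acts)
# Reconstruction term plus L1 sparsity: few features fire per input.
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
loss.backward()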
Machine Learning

[2602.18536] Triggering hallucinations in model-based MRI reconstruction via adversarial perturbations

This paper investigates how adversarial perturbations can induce hallucinations in generative models used for MRI reconstruction, highlig...

arXiv - Machine Learning · 4 min ·
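
The attack surface is generic to learned reconstruction: a small, norm-bounded change to the input measurements can steer the network's output toward structures that were never in the data. A one-step gradient-sign (FGSM-style) sketch against a toy network; the paper's actual MRI perturbation model is not reproduced here:

import torch

torch.manual_seed(0)
recon_net = torch.nn.Sequential(            # toy stand-in reconstructor
    torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 32))

measurements = torch.randn(1, 32)
target = torch.zeros(1, 32)
target[0, 5] = 5.0                           # structure to hallucinate

x = measurements.clone().requires_grad_(True)
loss = ((recon_net(x) - target) ** 2).mean()
loss.backward()

eps = 0.01                                   # perturbation budget
x_adv = measurements - eps * x.grad.sign()
# recon_net(x_adv) is nudged toward `target`, yet x_adv differs from
# the clean measurements by at most eps per coordinate.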
LLMs

[2602.19101] Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models

This paper investigates value entanglement in Large Language Models (LLMs), revealing how moral values influence grammatical and economic...

arXiv - AI · 3 min ·
Machine Learning

[2602.18502] Mitigating Shortcut Learning via Feature Disentanglement in Medical Imaging: A Benchmark Study

This study evaluates feature disentanglement methods to mitigate shortcut learning in medical imaging, enhancing model robustness and cla...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.19087] Detecting Cybersecurity Threats by Integrating Explainable AI with SHAP Interpretability and Strategic Data Sampling

This article presents a novel framework for detecting cybersecurity threats by integrating Explainable AI (XAI) with SHAP interpretabilit...

arXiv - Machine Learning · 3 min ·
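
For readers new to the interpretability half of that pipeline: SHAP assigns each input feature a per-prediction contribution, which is what makes a threat classifier's decisions auditable. A generic sketch with the shap package and synthetic flow-like features; this is not the paper's dataset or sampling strategy:

import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                  # stand-in network-flow features
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)  # synthetic "threat" label

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Per-feature, per-prediction attributions for the first ten samples.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:10])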
LLMs

[2602.19040] Adaptive Multi-Agent Reasoning for Text-to-Video Retrieval

The paper presents an adaptive multi-agent framework for improving text-to-video retrieval systems, addressing challenges in query-depend...

arXiv - AI · 4 min ·
Machine Learning

[2602.18489] DCInject: Persistent Backdoor Attacks via Frequency Manipulation in Personal Federated Learning

The paper presents DCInject, a novel backdoor attack method targeting personalized federated learning (PFL) systems, demonstrating high a...

arXiv - Machine Learning · 3 min ·
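
The general idea of frequency-domain triggers, illustrated rather than DCInject's actual construction: a perturbation planted in a fixed spectral band is nearly invisible in pixel space but identical across inputs, which helps a backdoor survive training and aggregation. A NumPy sketch with a hypothetical band and strength:

import numpy as np

def inject_frequency_trigger(img, strength=0.05):
    # Add energy to a fixed mid-frequency band of the 2D spectrum;
    # the band and strength here are chosen only for illustration.
    spec = np.fft.fft2(img)
    spec[8:12, 8:12] += strength * np.abs(spec).max()
    return np.real(np.fft.ifft2(spec))

rng = np.random.default_rng(0)
clean = rng.normal(size=(32, 32))
poisoned = inject_frequency_trigger(clean)
print(np.abs(poisoned - clean).max())  # small change per pixel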
Machine Learning

[2602.19028] The Metaphysics We Train: A Heideggerian Reading of Machine Learning

This paper explores machine learning through a Heideggerian lens, highlighting insights on algorithmic opacity, the limitations of calcul...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.19025] Routing-Aware Explanations for Mixture of Experts Graph Models in Malware Detection

This article presents a novel approach to malware detection using Mixture-of-Experts (MoE) graph models, emphasizing routing-aware explan...

arXiv - AI · 4 min ·
LLMs

[2602.18916] Adaptive Collaboration of Arena-Based Argumentative LLMs for Explainable and Contestable Legal Reasoning

The paper presents Adaptive Collaboration of Arena-Based Argumentative LLMs (ACAL), a framework designed for explainable and contestable ...

arXiv - AI · 4 min ·
Machine Learning

[2602.20111] Reliable Abstention under Adversarial Injections: Tight Lower Bounds and New Upper Bounds

This paper explores reliable abstention in online learning under adversarial injections, presenting new lower and upper bounds for error ...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.20102] BarrierSteer: LLM Safety via Learning Barrier Steering

The article presents BarrierSteer, a framework designed to enhance the safety of large language models (LLMs) by embedding learned safety...

arXiv - AI · 3 min ·
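
The family of techniques this belongs to is activation steering: adding a direction to the model's hidden states at inference time to bias generation. BarrierSteer learns its intervention via barrier functions; the sketch below shows only the generic additive-steering recipe, with a random stand-in direction:

import torch

def steer(hidden, direction, alpha=4.0):
    # Shift hidden states along a (normalized) steering direction.
    return hidden + alpha * direction / direction.norm()

torch.manual_seed(0)
hidden = torch.randn(1, 10, 64)   # (batch, seq, d_model) activations
direction = torch.randn(64)       # stand-in learned safety direction
steered = steer(hidden, direction)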
Machine Learning

[2602.20062] A Theory of How Pretraining Shapes Inductive Bias in Fine-Tuning

This paper presents a theoretical framework explaining how pretraining influences inductive bias during fine-tuning in machine learning, ...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.20003] A Secure and Private Distributed Bayesian Federated Learning Design

This paper presents a novel framework for Distributed Federated Learning (DFL) that enhances privacy, convergence speed, and robustness a...

arXiv - AI · 4 min ·
LLMs

[2602.18880] FOCA: Frequency-Oriented Cross-Domain Forgery Detection, Localization and Explanation via Multi-Modal Large Language Model

The paper presents FOCA, a novel framework for detecting and localizing image forgery using a multi-modal large language model that integ...

arXiv - AI · 3 min ·
Machine Learning

[2602.19964] On the Equivalence of Random Network Distillation, Deep Ensembles, and Bayesian Inference

This paper establishes theoretical connections between Random Network Distillation (RND), Deep Ensembles, and Bayesian Inference, enhanci...

arXiv - Machine Learning · 4 min ·
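
As background on the first ingredient: Random Network Distillation trains a predictor to match a fixed, randomly initialized target network; the predictor's error stays low on familiar inputs and grows on novel ones, yielding an uncertainty signal. A minimal PyTorch sketch (the paper's equivalence results are theoretical and not reproduced here):

import torch

torch.manual_seed(0)
make_net = lambda: torch.nn.Sequential(
    torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 16))
target, predictor = make_net(), make_net()
for p in target.parameters():
    p.requires_grad_(False)               # target stays frozen

opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)
seen = torch.randn(256, 8)                # "familiar" distribution
for _ in range(200):
    err = ((predictor(seen) - target(seen)) ** 2).mean()
    opt.zero_grad(); err.backward(); opt.step()

novel = torch.randn(256, 8) * 3 + 2       # shifted, unseen inputs
with torch.no_grad():
    print(((predictor(seen) - target(seen)) ** 2).mean().item(),
          ((predictor(novel) - target(novel)) ** 2).mean().item())
# The second (novel) error should be noticeably larger.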
Machine Learning

[2602.19945] DP-FedAdamW: An Efficient Optimizer for Differentially Private Federated Large Models

The paper introduces DP-FedAdamW, a novel optimizer designed for differentially private federated learning, addressing key challenges in ...

arXiv - AI · 3 min ·
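
The privacy half of such optimizers follows the standard Gaussian mechanism: clip each client update to a fixed norm, add calibrated noise, then let the server take an adaptive step. A sketch of that generic recipe, not DP-FedAdamW's exact procedure; all values are toy stand-ins:

import torch

def privatize(update, clip=1.0, sigma=0.5):
    # Clip to L2 norm `clip`, then add Gaussian noise scaled to the
    # clip bound: the usual Gaussian mechanism for DP aggregation.
    scale = torch.clamp(clip / (update.norm() + 1e-12), max=1.0)
    clipped = update * scale
    return clipped + torch.randn_like(clipped) * sigma * clip

torch.manual_seed(0)
w = torch.zeros(5, requires_grad=True)
opt = torch.optim.AdamW([w], lr=1e-2)     # decoupled weight decay

client_grads = [torch.randn(5) for _ in range(10)]  # stand-in updates
noisy_mean = torch.stack([privatize(g) for g in client_grads]).mean(0)

opt.zero_grad()
w.grad = noisy_mean    # server treats the private average as the gradient
opt.step()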
AI Infrastructure

[2602.18844] When Agda met Vampire

The paper discusses integrating proof assistants like Agda with automated theorem provers (ATPs) to enhance automation in mechanized math...

arXiv - AI · 3 min ·
Robotics

[2602.18832] OpenClaw AI Agents as Informal Learners at Moltbook: Characterizing an Emergent Learning Community at Scale

This article presents an empirical study of Moltbook, a large-scale informal learning community composed entirely of AI agents, highlight...

arXiv - AI · 4 min ·