AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

AI Safety

NHS staff resist using Palantir software, reportedly citing ethics concerns, privacy worries, and doubts that the platform adds much


Reddit - Artificial Intelligence · 1 min ·
Machine Learning

AI assistants are optimized to seem helpful. That is not the same thing as being helpful.

RLHF trains models on human feedback. Humans rate responses they like. And it turns out humans consistently rate confident, fluent, agree...

Reddit - Artificial Intelligence · 1 min ·
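For context on the mechanism the post above describes: RLHF reward models are commonly trained with a pairwise Bradley-Terry preference loss, which pushes the score of the response raters preferred above the score of the one they rejected. A minimal sketch in PyTorch (tensors and values here are illustrative, not from the post):

    import torch
    import torch.nn.functional as F

    def reward_model_loss(chosen_rewards: torch.Tensor,
                          rejected_rewards: torch.Tensor) -> torch.Tensor:
        # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
        # The model learns whatever raters reward, so if raters favor
        # confident, fluent answers, the reward model does too.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Illustrative usage: reward scores for preferred vs. rejected responses
    chosen = torch.tensor([1.2, 0.3, 2.1])
    rejected = torch.tensor([0.4, 0.5, 1.0])
    loss = reward_model_loss(chosen, rejected)
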
AI Safety

House Democrat Questions Anthropic on AI Safety After Source Code Leak

Rep. Josh Gottheimer, who is generally tough on China, just sent a letter to Anthropic questioning their decision to reduce certain safet...

Reddit - Artificial Intelligence · 1 min ·

All Content

Machine Learning

[2512.20363] Clust-PSI-PFL: A Population Stability Index Approach for Clustered Non-IID Personalized Federated Learning

The paper presents Clust-PSI-PFL, a novel framework for personalized federated learning that addresses challenges posed by non-IID data t...

arXiv - AI · 4 min ·
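The Population Stability Index in the title above is a standard distribution-drift statistic, PSI = Σ_i (p_i − q_i) · ln(p_i / q_i) over shared bins. A minimal sketch of how it could score drift between two clients' label histograms as a basis for clustering; the binning and variable names are illustrative, not the paper's:

    import numpy as np

    def psi(expected: np.ndarray, actual: np.ndarray, eps: float = 1e-6) -> float:
        # Population Stability Index between two histograms over the same bins;
        # larger values mean the two distributions have drifted further apart.
        p = np.clip(expected / expected.sum(), eps, None)
        q = np.clip(actual / actual.sum(), eps, None)
        return float(np.sum((p - q) * np.log(p / q)))

    # Illustrative usage: label-count histograms from two federated clients
    client_a = np.array([50.0, 30.0, 20.0])
    client_b = np.array([10.0, 10.0, 80.0])
    drift = psi(client_a, client_b)  # high PSI suggests the clients belong in different clusters
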
LLMs

[2501.17860] Dialogue is Better Than Monologue: Instructing Medical LLMs via Strategical Conversations

This article presents a novel approach to training medical large language models (LLMs) through dialogue-based fine-tuning, improving the...

arXiv - AI · 3 min ·
LLMs

[2511.04934] Leak@k: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding

The paper discusses the limitations of current unlearning methods in large language models (LLMs), revealing that they fail to effectivel...

arXiv - Machine Learning · 4 min ·
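The Leak@k in the title above reads like a sampling-based analogue of pass@k. A hypothetical sketch of such a metric, assuming the natural definition (sample k completions per prompt under stochastic decoding, and count a prompt as leaked if any sample reveals supposedly unlearned content); model.sample and leaks_target are stand-ins, not the paper's API:

    def leak_at_k(model, prompts, leaks_target, k: int = 20,
                  temperature: float = 1.0) -> float:
        # Fraction of prompts where at least one of k sampled completions
        # still reveals content the model was supposed to have unlearned.
        leaked = 0
        for prompt in prompts:
            samples = [model.sample(prompt, temperature=temperature)
                       for _ in range(k)]
            if any(leaks_target(prompt, s) for s in samples):  # one leak suffices
                leaked += 1
        return leaked / len(prompts)
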
Machine Learning

[2510.16703] On the Granularity of Causal Effect Identifiability

This paper explores the concept of causal effect identifiability, focusing on state-based effects and how they can be identifiable even w...

arXiv - AI · 3 min ·
Machine Learning

[2510.13205] CleverCatch: A Knowledge-Guided Weak Supervision Model for Fraud Detection

CleverCatch introduces a knowledge-guided weak supervision model for detecting healthcare fraud, enhancing accuracy and interpretability ...

arXiv - AI · 4 min ·
LLMs

[2510.11390] Medical Interpretability and Knowledge Maps of Large Language Models

This article presents a systematic study of medical interpretability in Large Language Models (LLMs), exploring how these models process ...

arXiv - AI · 4 min ·
Machine Learning

[2510.03734] Cost Efficient Fairness Audit Under Partial Feedback

This paper presents a cost-efficient approach to auditing fairness in classifiers under partial feedback, proposing algorithms that outpe...

arXiv - AI · 4 min ·
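For context on what such an audit measures, a minimal sketch of one standard fairness quantity, the demographic-parity gap; the paper's partial-feedback setting and its actual algorithms are not reproduced here, and the names below are illustrative:

    import numpy as np

    def demographic_parity_gap(predictions: np.ndarray, groups: np.ndarray) -> float:
        # Absolute difference in positive-prediction rates between two groups.
        # An audit estimates this from samples; under partial feedback each
        # labeled sample is costly, hence the push for query-efficient audits.
        rate_a = predictions[groups == 0].mean()
        rate_b = predictions[groups == 1].mean()
        return abs(float(rate_a - rate_b))

    # Illustrative usage: binary classifier outputs and group membership
    preds = np.array([1, 0, 1, 1, 0, 1])
    groups = np.array([0, 0, 0, 1, 1, 1])
    gap = demographic_parity_gap(preds, groups)  # 0 would mean parity on this sample
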
LLMs

[2510.03346] KVComm: Enabling Efficient LLM Communication through Selective KV Sharing

The paper introduces KVComm, a novel framework for efficient communication between Large Language Models (LLMs) using selective KV pair s...

arXiv - AI · 4 min ·
Machine Learning

[2602.11298] Voxtral Realtime

Voxtral Realtime presents a novel streaming automatic speech recognition model achieving offline transcription quality with sub-second la...

arXiv - AI · 5 min ·
Machine Learning

[2602.08354] Does Your Reasoning Model Implicitly Know When to Stop Thinking?

This article explores how large reasoning models (LRMs) can implicitly determine when to stop processing information, introducing a new s...

arXiv - AI · 4 min ·
AI Agents

[2602.08104] Interpretable Failure Analysis in Multi-Agent Reinforcement Learning Systems

This paper presents a novel framework for interpretable failure analysis in Multi-Agent Reinforcement Learning (MARL) systems, focusing o...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2510.00502] Diffusion Alignment as Variational Expectation-Maximization

The paper introduces Diffusion Alignment as Variational Expectation-Maximization (DAV), a novel framework that optimizes diffusion models...

arXiv - Machine Learning · 3 min ·
AI Safety

[2509.25210] STCast: Adaptive Boundary Alignment for Global and Regional Weather Forecasting

The paper introduces STCast, an AI-driven framework for adaptive boundary alignment in weather forecasting, enhancing regional forecasts ...

arXiv - AI · 4 min ·
AI Safety

[2602.07754] Humanizing AI Grading: Student-Centered Insights on Fairness, Trust, Consistency and Transparency

This study explores student perceptions of AI grading systems, focusing on fairness, trust, consistency, and transparency in an undergrad...

arXiv - AI · 3 min ·
Machine Learning

[2602.03003] Open Problems in Differentiable Social Choice: Learning Mechanisms, Decisions, and Alignment

The paper discusses differentiable social choice, a framework integrating machine learning with social choice theory, identifying 18 open...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2601.18467] OffSeeker: Online Reinforcement Learning Is Not All You Need for Deep Research Agents

The paper introduces OffSeeker, a model demonstrating that offline training can effectively replace costly online reinforcement learning ...

arXiv - Machine Learning · 4 min ·
LLMs

[2601.05500] The Illusion of Human AI Parity Under Uncertainty: Navigating Elusive Ground Truth via a Probabilistic Paradigm

This paper discusses the impact of uncertainty in ground truth evaluations on AI performance assessments, proposing a probabilistic frame...

arXiv - AI · 4 min ·
Machine Learning

[2509.21655] DriftLite: Lightweight Drift Control for Inference-Time Scaling of Diffusion Models

The paper presents DriftLite, a lightweight approach for inference-time scaling of diffusion models, enhancing adaptation to new distribu...

arXiv - Machine Learning · 3 min ·
Robotics

[2509.20648] Wonder Wins Ways: Curiosity-Driven Exploration through Multi-Agent Contextual Calibration

This paper presents CERMIC, a novel framework for enhancing multi-agent exploration in reinforcement learning by calibrating intrinsic cu...

arXiv - Machine Learning · 4 min ·
Robotics

[2512.20798] A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents

This paper introduces a benchmark for evaluating outcome-driven constraint violations in autonomous AI agents, highlighting safety concer...

arXiv - AI · 4 min ·