AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

AI Safety

NHS staff resist using Palantir software. Staff reportedly cite ethics concerns, privacy worries, and doubts that the platform adds much value.


Reddit - Artificial Intelligence · 1 min ·
Machine Learning

AI assistants are optimized to seem helpful. That is not the same thing as being helpful.

RLHF trains models on human feedback. Humans rate responses they like. And it turns out humans consistently rate confident, fluent, agree...

Reddit - Artificial Intelligence · 1 min ·
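The RLHF teaser above describes the preference-fitting step at the heart of the problem: a reward model is trained only on which of two responses a human rater preferred. A minimal sketch of the standard Bradley-Terry pairwise loss illustrates why the fitted reward tracks whatever raters reward, including confident-sounding answers; the numbers and names here are illustrative, not from the linked post:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss used to fit RLHF reward models:
    -log(sigmoid(r_chosen - r_rejected)).

    The loss shrinks as the model scores the human-preferred
    response higher than the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The reward model never sees whether an answer was actually correct,
# only which one the rater picked. If raters systematically prefer
# confident, fluent responses, training pushes reward scores toward
# "seems helpful" rather than "is helpful".
loss_agrees_with_rater = preference_loss(2.0, 0.5)    # model already ranks the preferred answer higher: small loss
loss_disagrees_with_rater = preference_loss(0.5, 2.0) # model ranks it lower: large loss, strong gradient to flip
```

With no score difference, `preference_loss(0.0, 0.0)` equals `log 2`, the loss of a coin flip; the optimizer is only ever rewarded for matching rater preferences, whatever biases those preferences encode.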
Computer Vision

House Democrat Questions Anthropic on AI Safety After Source Code Leak

Rep. Josh Gottheimer, who is generally tough on China, just sent a letter to Anthropic questioning their decision to reduce certain safet...

Reddit - Artificial Intelligence · 1 min ·

All Content

[2602.18934] LoMime: Query-Efficient Membership Inference using Model Extraction in Label-Only Settings
Machine Learning

The paper presents LoMime, a novel framework for membership inference attacks that operates efficiently under label-only conditions, sign...

arXiv - Machine Learning · 4 min ·
[2602.18911] From Human-Level AI Tales to AI Leveling Human Scales
Machine Learning

This paper proposes a framework to recalibrate AI performance metrics against a global human population scale, addressing misleading comp...

arXiv - Machine Learning · 4 min ·
[2602.19837] Meta-Learning and Meta-Reinforcement Learning - Tracing the Path towards DeepMind's Adaptive Agent
Machine Learning

This article surveys meta-learning and meta-reinforcement learning, highlighting their significance in developing DeepMind's Adaptive Age...

arXiv - AI · 3 min ·
[2602.19672] SkillOrchestra: Learning to Route Agents via Skill Transfer
Machine Learning

The paper presents SkillOrchestra, a framework for skill-aware orchestration in AI systems, improving agent routing through skill transfe...

arXiv - Machine Learning · 3 min ·
[2602.18905] TRUE: A Trustworthy Unified Explanation Framework for Large Language Model Reasoning
LLMs

The paper presents the Trustworthy Unified Explanation Framework (TRUE) for enhancing the interpretability of large language models (LLMs...

arXiv - AI · 4 min ·
[2602.19810] OpenClaw, Moltbook, and ClawdLab: From Agent-Only Social Networks to Autonomous Scientific Research
Robotics

The paper discusses OpenClaw, Moltbook, and ClawdLab, highlighting their role in creating a dataset for AI interactions and proposing Cla...

arXiv - AI · 4 min ·
[2602.19620] Rules or Weights? Comparing User Understanding of Explainable AI Techniques with the Cognitive XAI-Adaptive Model
Machine Learning

This article explores user understanding of explainable AI (XAI) techniques, comparing rules and weights through the Cognitive XAI-Adapti...

arXiv - AI · 4 min ·
[2602.19562] A Multimodal Framework for Aligning Human Linguistic Descriptions with Visual Perceptual Data
Machine Learning

This paper presents a computational framework that aligns human linguistic descriptions with visual perceptual data, enhancing understand...

arXiv - AI · 4 min ·
[2602.18849] Exact Attention Sensitivity and the Geometry of Transformer Stability
Machine Learning

This article presents a stability theory for transformers, explaining key training dynamics and architectural considerations that affect ...

arXiv - AI · 3 min ·
[2602.19416] IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking
LLMs

The paper presents IR$^3$, a novel framework for detecting and mitigating reward hacking in Reinforcement Learning from Human Feedback (R...

arXiv - Machine Learning · 4 min ·
[2602.19396] Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement
LLMs

This paper presents a novel framework for detecting concealed jailbreaks in large language models (LLMs) by disentangling semantic factor...

arXiv - AI · 4 min ·
[2602.18786] CaliCausalRank: Calibrated Multi-Objective Ad Ranking with Robust Counterfactual Utility Optimization
AI Safety

CaliCausalRank presents a novel framework for optimizing multi-objective ad ranking systems, addressing challenges like score scale incon...

arXiv - Machine Learning · 3 min ·
[2602.19367] Time Series, Vision, and Language: Exploring the Limits of Alignment in Contrastive Representation Spaces
Machine Learning

This paper investigates the alignment of representations from time series, vision, and language modalities, revealing insights into their...

arXiv - AI · 4 min ·
[2602.19281] Limited Reasoning Space: The cage of long-horizon reasoning in LLMs
LLMs

This article discusses the 'Limited Reasoning Space' hypothesis in large language models (LLMs), proposing that over-planning can impair ...

arXiv - AI · 4 min ·
[2602.18739] When World Models Dream Wrong: Physical-Conditioned Adversarial Attacks against World Models
Machine Learning

This paper introduces the Physical-Conditioned World Model Attack (PhysCond-WMA), a novel method to exploit vulnerabilities in generative...

arXiv - Machine Learning · 4 min ·
[2602.18733] Prior Aware Memorization: An Efficient Metric for Distinguishing Memorization from Generalization in Large Language Models
LLMs

The paper introduces Prior Aware Memorization, a new metric for distinguishing genuine memorization from generalization in large language...

arXiv - Machine Learning · 4 min ·
[2602.18728] Phase-Consistent Magnetic Spectral Learning for Multi-View Clustering
NLP

This article presents a novel approach to unsupervised multi-view clustering through Phase-Consistent Magnetic Spectral Learning, address...

arXiv - Machine Learning · 4 min ·
[2602.19160] Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing
LLMs

This paper evaluates the reasoning capabilities of Large Language Models (LLMs) through General Game Playing tasks, revealing performance...

arXiv - AI · 4 min ·
[2602.19159] Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM
LLMs

This article explores how large language models (LLMs) make decisions based on pain and pleasure, linking behavioral evidence with mechan...

arXiv - Machine Learning · 4 min ·
[2602.18674] Robustness of Deep ReLU Networks to Misclassification of High-Dimensional Data
Machine Learning

This paper examines the robustness of deep ReLU networks against misclassification when subjected to random input perturbations, providin...

arXiv - Machine Learning · 3 min ·

