AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

AI Safety

NHS staff resist using Palantir software. Staff reportedly cite ethics concerns, privacy worries, and doubts that the platform adds much value.


Reddit - Artificial Intelligence · 1 min ·
Machine Learning

AI assistants are optimized to seem helpful. That is not the same thing as being helpful.

RLHF trains models on human feedback. Humans rate responses they like. And it turns out humans consistently rate confident, fluent, agree...

Reddit - Artificial Intelligence · 1 min ·
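The RLHF teaser above describes the preference-fitting step at the heart of the problem: a reward model is trained only on which of two responses a human rater preferred. A minimal sketch of the standard Bradley-Terry pairwise loss illustrates why the fitted reward tracks whatever raters reward, including confident-sounding answers; the numbers and names here are illustrative, not from the linked post:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss used to fit RLHF reward models:
    -log(sigmoid(r_chosen - r_rejected)).

    The loss shrinks as the model scores the human-preferred
    response higher than the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The reward model never sees whether an answer was actually correct,
# only which one the rater picked. If raters systematically prefer
# confident, fluent responses, training pushes reward scores toward
# "seems helpful" rather than "is helpful".
loss_agrees_with_rater = preference_loss(2.0, 0.5)    # model already ranks the preferred answer higher: small loss
loss_disagrees_with_rater = preference_loss(0.5, 2.0) # model ranks it lower: large loss, strong gradient to flip
```

With no score difference, `preference_loss(0.0, 0.0)` equals `log 2`, the loss of a coin flip; the optimizer is only ever rewarded for matching rater preferences, whatever biases those preferences encode.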
Computer Vision

House Democrat Questions Anthropic on AI Safety After Source Code Leak

Rep. Josh Gottheimer, who is generally tough on China, just sent a letter to Anthropic questioning their decision to reduce certain safet...

Reddit - Artificial Intelligence · 1 min ·

All Content

[2602.18934] LoMime: Query-Efficient Membership Inference using Model Extraction in Label-Only Settings
Machine Learning

The paper presents LoMime, a novel framework for membership inference attacks that operates efficiently under label-only conditions, sign...

arXiv - Machine Learning · 4 min ·
[2602.18911] From Human-Level AI Tales to AI Leveling Human Scales
Machine Learning

This paper proposes a framework to recalibrate AI performance metrics against a global human population scale, addressing misleading comp...

arXiv - Machine Learning · 4 min ·
[2602.19837] Meta-Learning and Meta-Reinforcement Learning - Tracing the Path towards DeepMind's Adaptive Agent
Machine Learning

This article surveys meta-learning and meta-reinforcement learning, highlighting their significance in developing DeepMind's Adaptive Age...

arXiv - AI · 3 min ·
[2602.19672] SkillOrchestra: Learning to Route Agents via Skill Transfer
Machine Learning

The paper presents SkillOrchestra, a framework for skill-aware orchestration in AI systems, improving agent routing through skill transfe...

arXiv - Machine Learning · 3 min ·
[2602.18905] TRUE: A Trustworthy Unified Explanation Framework for Large Language Model Reasoning
LLMs

The paper presents the Trustworthy Unified Explanation Framework (TRUE) for enhancing the interpretability of large language models (LLMs...

arXiv - AI · 4 min ·
[2602.19810] OpenClaw, Moltbook, and ClawdLab: From Agent-Only Social Networks to Autonomous Scientific Research
Robotics

The paper discusses OpenClaw, Moltbook, and ClawdLab, highlighting their role in creating a dataset for AI interactions and proposing Cla...

arXiv - AI · 4 min ·
[2602.19620] Rules or Weights? Comparing User Understanding of Explainable AI Techniques with the Cognitive XAI-Adaptive Model
Machine Learning

This article explores user understanding of explainable AI (XAI) techniques, comparing rules and weights through the Cognitive XAI-Adapti...

arXiv - AI · 4 min ·
[2602.19562] A Multimodal Framework for Aligning Human Linguistic Descriptions with Visual Perceptual Data
Machine Learning

This paper presents a computational framework that aligns human linguistic descriptions with visual perceptual data, enhancing understand...

arXiv - AI · 4 min ·
[2602.18849] Exact Attention Sensitivity and the Geometry of Transformer Stability
Machine Learning

This article presents a stability theory for transformers, explaining key training dynamics and architectural considerations that affect ...

arXiv - AI · 3 min ·
[2602.19416] IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking
LLMs

The paper presents IR$^3$, a novel framework for detecting and mitigating reward hacking in Reinforcement Learning from Human Feedback (R...

arXiv - Machine Learning · 4 min ·
[2602.19396] Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement
LLMs

This paper presents a novel framework for detecting concealed jailbreaks in large language models (LLMs) by disentangling semantic factor...

arXiv - AI · 4 min ·
[2602.18786] CaliCausalRank: Calibrated Multi-Objective Ad Ranking with Robust Counterfactual Utility Optimization
AI Safety

CaliCausalRank presents a novel framework for optimizing multi-objective ad ranking systems, addressing challenges like score scale incon...

arXiv - Machine Learning · 3 min ·
[2602.19367] Time Series, Vision, and Language: Exploring the Limits of Alignment in Contrastive Representation Spaces
Machine Learning

This paper investigates the alignment of representations from time series, vision, and language modalities, revealing insights into their...

arXiv - AI · 4 min ·
[2602.19281] Limited Reasoning Space: The cage of long-horizon reasoning in LLMs
LLMs

This article discusses the 'Limited Reasoning Space' hypothesis in large language models (LLMs), proposing that over-planning can impair ...

arXiv - AI · 4 min ·
[2602.18739] When World Models Dream Wrong: Physical-Conditioned Adversarial Attacks against World Models
Machine Learning

This paper introduces the Physical-Conditioned World Model Attack (PhysCond-WMA), a novel method to exploit vulnerabilities in generative...

arXiv - Machine Learning · 4 min ·
[2602.18733] Prior Aware Memorization: An Efficient Metric for Distinguishing Memorization from Generalization in Large Language Models
LLMs

The paper introduces Prior Aware Memorization, a new metric for distinguishing genuine memorization from generalization in large language...

arXiv - Machine Learning · 4 min ·
[2602.18728] Phase-Consistent Magnetic Spectral Learning for Multi-View Clustering
NLP

This article presents a novel approach to unsupervised multi-view clustering through Phase-Consistent Magnetic Spectral Learning, address...

arXiv - Machine Learning · 4 min ·
[2602.19160] Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing
LLMs

This paper evaluates the reasoning capabilities of Large Language Models (LLMs) through General Game Playing tasks, revealing performance...

arXiv - AI · 4 min ·
[2602.19159] Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM
LLMs

This article explores how large language models (LLMs) make decisions based on pain and pleasure, linking behavioral evidence with mechan...

arXiv - Machine Learning · 4 min ·
[2602.18674] Robustness of Deep ReLU Networks to Misclassification of High-Dimensional Data
Machine Learning

This paper examines the robustness of deep ReLU networks against misclassification when subjected to random input perturbations, providin...

arXiv - Machine Learning · 3 min ·

