AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

[2512.21106] Semantic Refinement with LLMs for Graph Representations
LLMs

Abstract page for arXiv paper 2512.21106: Semantic Refinement with LLMs for Graph Representations

arXiv - Machine Learning · 4 min
[2511.22294] Structure is Supervision: Multiview Masked Autoencoders for Radiology
Machine Learning

Abstract page for arXiv paper 2511.22294: Structure is Supervision: Multiview Masked Autoencoders for Radiology

arXiv - Machine Learning · 4 min
[2511.18123] Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models
LLMs

Abstract page for arXiv paper 2511.18123: Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models

arXiv - Machine Learning · 4 min

All Content

[2602.20193] When Backdoors Go Beyond Triggers: Semantic Drift in Diffusion Models Under Encoder Attacks
Machine Learning

This paper investigates the impact of encoder-side poisoning on text-to-image models, revealing that traditional evaluations of backdoor ...

arXiv - AI · 3 min
[2602.20170] CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation
LLMs

The paper introduces CAGE, a framework for culturally adaptive red-teaming benchmark generation, addressing the limitations of existing b...

arXiv - AI · 3 min
[2602.20169] Autonomous AI and Ownership Rules
Robotics

This article explores the ownership rules surrounding AI-generated outputs, examining how they are linked to their creators and the impli...

arXiv - AI · 3 min
[2602.20168] Benchmarking Early Deterioration Prediction Across Hospital-Rich and MCI-Like Emergency Triage Under Constrained Sensing
Machine Learning

This article presents a benchmarking framework for early deterioration prediction in emergency triage, comparing hospital-rich settings w...

arXiv - Machine Learning · 3 min
[2602.20166] ConceptRM: The Quest to Mitigate Alert Fatigue through Consensus-Based Purity-Driven Data Cleaning for Reflection Modelling
Machine Learning

The paper presents ConceptRM, a novel method aimed at reducing alert fatigue in intelligent agents by improving data cleaning processes f...

arXiv - AI · 4 min
[2602.21064] Motivation is Something You Need
Machine Learning

The paper presents a novel training paradigm for AI that integrates concepts from affective neuroscience, focusing on a dual-model framew...

arXiv - Machine Learning · 3 min
[2602.21061] Tool Building as a Path to "Superintelligence"
LLMs

The paper explores how Large Language Models (LLMs) can achieve superintelligence through the Diligent Learner framework, emphasizing the...

arXiv - AI · 3 min
[2602.20878] Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs
LLMs

This article introduces Vision-Language Causal Graphs (VLCGs) to enhance causal reasoning in Large Vision-Language Models (LVLMs), addressing t...

arXiv - AI · 3 min
[2602.20813] Pressure Reveals Character: Behavioural Alignment Evaluation at Depth
LLMs

This paper presents a novel evaluation framework for assessing the alignment of language models under realistic pressure, revealing behav...

arXiv - AI · 3 min
[2602.20770] Pipeline for Verifying LLM-Generated Mathematical Solutions
LLMs

This paper presents a pipeline for verifying mathematical solutions generated by Large Language Models (LLMs), emphasizing both automatic...

arXiv - AI · 3 min
[2602.20710] Counterfactual Simulation Training for Chain-of-Thought Faithfulness
LLMs

The paper introduces Counterfactual Simulation Training (CST), a method designed to enhance Chain-of-Thought (CoT) faithfulness in large ...

arXiv - AI · 4 min
[2602.20708] ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction
LLMs

The paper introduces ICON, a novel framework designed to defend Large Language Model (LLM) agents against Indirect Prompt Injection (IPI)...

arXiv - AI · 3 min
[2602.20696] PromptCD: Test-Time Behavior Enhancement via Polarity-Prompt Contrastive Decoding
LLMs

The paper presents PromptCD, a method for enhancing AI behavior at test time using polarity-prompt contrastive decoding, improving alignm...

arXiv - AI · 4 min
[2602.20628] When can we trust untrusted monitoring? A safety case sketch across collusion strategies
Machine Learning

This paper explores the challenges of ensuring safety in AI systems using untrusted monitoring. It develops a taxonomy of collusion strat...

arXiv - AI · 4 min
[2602.20624] Physics-based phenomenological characterization of cross-modal bias in multimodal models
LLMs

This paper explores the cross-modal bias in multimodal large language models (MLLMs) through a physics-based phenomenological approach, a...

arXiv - AI · 4 min
[2602.20424] Implicit Intelligence -- Evaluating Agents on What Users Don't Say
AI Agents

The paper presents an evaluation framework called Implicit Intelligence, which assesses AI agents' ability to understand unstated user re...

arXiv - AI · 3 min
AI energy use: New tools show which model consumes the most power, and why
Machine Learning

The article discusses new tools that analyze the energy consumption of various AI models, highlighting the importance of understanding po...

AI Events · 1 min
Anthropic Drops Flagship Safety Pledge
AI Safety

Anthropic has announced the discontinuation of its flagship safety pledge, raising concerns about AI safety commitments in the industry.

Reddit - Artificial Intelligence · 1 min
Pete Hegseth’s Pentagon AI bro squad includes a former Uber executive and a private equity billionaire | The Verge
Machine Learning

The article discusses a Pentagon meeting involving Defense Secretary Pete Hegseth, former Uber executive Emil Michael, and private equity...

The Verge - AI · 11 min
AI-linked fears roil some corners of Wall Street after years of hype and gains
AI Safety

Concerns over AI spending are causing volatility on Wall Street, as investors question profitability. Major companies like IBM and Master...

AI Tools & Products · 5 min