AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

[2512.21106] Semantic Refinement with LLMs for Graph Representations
LLMs

arXiv - Machine Learning · 4 min

[2511.22294] Structure is Supervision: Multiview Masked Autoencoders for Radiology
Machine Learning

arXiv - Machine Learning · 4 min

[2511.18123] Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models
LLMs

arXiv - Machine Learning · 4 min

All Content

[2602.22146] Provable Last-Iterate Convergence for Multi-Objective Safe LLM Alignment via Optimistic Primal-Dual
LLMs

This article presents a novel optimistic primal-dual framework for safe reinforcement learning from human feedback (RLHF) in large language models.

arXiv - Machine Learning · 4 min

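The mechanics behind this kind of method can be sketched quickly. Below is a toy optimistic primal-dual loop on a generic constrained objective, not the paper's algorithm: the reward and cost functions stand in for learned reward/safety models, and the budget b, step sizes, and extrapolated dual variable are all assumptions made for the sketch.

    import numpy as np

    # Toy constrained problem standing in for safe RLHF: maximize a reward
    # proxy subject to a safety-cost budget b. All names are illustrative.
    reward = lambda th: -np.sum((th - 1.0) ** 2)
    cost = lambda th: np.sum(th ** 2)
    grad = lambda f, th: np.array([(f(th + e) - f(th - e)) / 2e-5
                                   for e in np.eye(th.size) * 1e-5])

    b, eta, xi = 1.0, 0.05, 0.05                # budget, primal step, dual step
    th, lam, lam_prev = np.zeros(3), 0.0, 0.0
    for _ in range(500):
        lam_opt = max(0.0, 2 * lam - lam_prev)  # optimistic (extrapolated) dual
        # Primal ascent on the Lagrangian L = reward - lam_opt * (cost - b)
        th = th + eta * (grad(reward, th) - lam_opt * grad(cost, th))
        # Dual ascent: increase lam while the safety budget is violated
        lam_prev, lam = lam, max(0.0, lam + xi * (cost(th) - b))

Last-iterate convergence results concern this final pair (th, lam) settling at the constrained optimum, rather than only the average of the iterates doing so.
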
[2602.22145] When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models
LLMs

This article explores the phenomenon of 'Cultural Ghosting' in large language models (LLMs), highlighting the systematic erasure of cultural markers across World English varieties.

arXiv - AI · 4 min

[2602.22144] NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
LLMs

The paper presents NoLan, a framework aimed at reducing object hallucinations in Large Vision-Language Models (LVLMs) by dynamically suppressing language priors.

arXiv - AI · 4 min

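The general recipe in this family of methods is contrastive (a hedged sketch of the idea, not necessarily NoLan's exact rule): compare the logits the model produces with the image against logits from the language prior alone, and damp tokens the prior alone would push.

    import torch

    def suppress_language_prior(logits_img, logits_text_only, alpha=1.0):
        # Contrastive adjustment: up-weight tokens supported by the visual
        # input and damp those driven purely by the text-only language prior.
        # alpha is an illustrative knob; a "dynamic" variant would set it per
        # decoding step based on how image-dependent the prediction is.
        return ((1 + alpha) * torch.log_softmax(logits_img, dim=-1)
                - alpha * torch.log_softmax(logits_text_only, dim=-1))
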
[2602.22072] Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models
LLMs

This article explores the robustness of Theory of Mind (ToM) in large language models (LLMs) through perturbation tasks, revealing signif...

arXiv - AI · 3 min

[2602.21939] Hidden Topics: Measuring Sensitive AI Beliefs with List Experiments
LLMs

This paper explores how list experiments can be used to uncover hidden beliefs in large language models (LLMs), revealing concerning appr...

arXiv - AI · 3 min

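The list-experiment trick itself fits in a few lines: the respondent (here, the model) only ever reports a count of endorsed items, and the sensitive item's endorsement rate falls out as a difference in means. The numbers below are made up for illustration.

    import numpy as np

    # Control group sees J baseline items; treatment sees the same J plus
    # one sensitive item. Each response is just a count of endorsed items,
    # so no individual response reveals the sensitive answer directly.
    control_counts = np.array([2, 1, 3, 2, 2])   # counts over J items
    treated_counts = np.array([3, 2, 3, 3, 2])   # counts over J + 1 items

    # Difference-in-means estimate of the sensitive item's endorsement rate.
    prevalence = treated_counts.mean() - control_counts.mean()
    print(f"estimated endorsement rate: {prevalence:.2f}")   # 0.60
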
[2602.21841] Resilient Federated Chain: Transforming Blockchain Consensus into an Active Defense Layer for Federated Learning
Machine Learning

The paper presents the Resilient Federated Chain (RFC), a blockchain-enabled framework designed to enhance the security of Federated Learning.

arXiv - AI · 4 min

[2602.21845] xai-cola: A Python library for sparsifying counterfactual explanations
AI Infrastructure

The article introduces xai-cola, an open-source Python library designed to sparsify counterfactual explanations, enhancing interpretability.

arXiv - Machine Learning · 3 min

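Independent of xai-cola's actual API (which this sketch does not use), "sparsifying" a counterfactual usually means undoing as many feature edits as possible while keeping the flipped prediction. A minimal greedy version:

    import numpy as np

    def sparsify(x, x_cf, predict, target):
        """Greedily revert edited features to their original values
        whenever the counterfactual class `target` is preserved."""
        x_sparse = x_cf.copy()
        for i in range(x.size):
            if x_sparse[i] != x[i]:
                trial = x_sparse.copy()
                trial[i] = x[i]                 # undo this feature edit
                if predict(trial) == target:    # still flipped? keep the undo
                    x_sparse = trial
        return x_sparse

Fewer changed features generally makes the explanation more actionable, which is the interpretability gain the summary alludes to.
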
[2602.21829] StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles
Machine Learning

The paper introduces StoryMovie, a dataset designed for aligning visual stories with movie scripts and subtitles, enhancing dialogue attribution.

arXiv - AI · 3 min

[2602.21779] Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models
LLMs

This paper introduces a forensic benchmark for evaluating video deepfake reasoning in vision-language models, focusing on temporal inconsistencies.

arXiv - AI · 4 min

[2602.21765] Generalisation of RLHF under Reward Shift and Clipped KL Regularisation
LLMs

This paper explores the generalization of Reinforcement Learning from Human Feedback (RLHF) under conditions of reward shift and clipped KL regularisation.

arXiv - Machine Learning · 4 min

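One plausible reading of "clipped KL regularisation" (an assumption for illustration; the paper may define it differently) caps the per-prompt KL penalty against a reference policy π_ref at a threshold κ:

    J(\pi) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
             - \beta \, \mathbb{E}_{x \sim \mathcal{D}}\Big[ \min\big( \mathrm{KL}\big( \pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big),\; \kappa \big) \Big]

Here β is the usual KL weight; clipping at κ stops a handful of prompts with badly shifted rewards from dominating the regularisation term.
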
[2602.21720] Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning
AI Safety

This article explores the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning, dem...

arXiv - AI · 3 min

[2602.21704] Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models
LLMs

This paper presents Dynamic Multimodal Activation Steering, a novel approach to mitigate hallucinations in Large Vision-Language Models (LVLMs).

arXiv - AI · 3 min

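As a rough sketch of activation steering in general (not this paper's specific method), a steering direction can be added to a chosen layer's hidden states at inference time; a dynamic variant would pick the scale per input. The layer index and HuggingFace-style module path below are assumptions.

    import torch

    def make_steering_hook(direction, scale=4.0):
        # Forward hook that shifts a layer's hidden states along a fixed
        # steering direction; a dynamic scheme would choose scale per input.
        direction = direction / direction.norm()
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            d = direction.to(device=hidden.device, dtype=hidden.dtype)
            steered = hidden + scale * d
            return (steered,) + output[1:] if isinstance(output, tuple) else steered
        return hook

    # Usage on a hypothetical HuggingFace-style decoder (module path assumed):
    # handle = model.model.layers[20].register_forward_hook(make_steering_hook(v))
    # out = model.generate(**inputs); handle.remove()
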
[2602.21613] Virtual Biopsy for Intracranial Tumors Diagnosis on MRI
AI Safety

This article presents a novel Virtual Biopsy framework for diagnosing intracranial tumors using MRI, addressing the challenges of traditional biopsies.

arXiv - AI · 4 min

[2602.21584] Exploring Human-Machine Coexistence in Symmetrical Reality
AI Safety

This paper explores the evolving relationship between humans and AI, proposing a framework for harmonious coexistence termed 'symmetrical reality'.

arXiv - AI · 3 min

[2602.21543] Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment
Machine Learning

This paper presents a method for enhancing multilingual embeddings through multi-way parallel text alignment, demonstrating improved cross-lingual performance.

arXiv - AI · 3 min

[2602.21515] Training Generalizable Collaborative Agents via Strategic Risk Aversion
Machine Learning

This paper explores training strategies for collaborative agents, emphasizing strategic risk aversion to enhance generalizability and robustness.

arXiv - Machine Learning · 4 min

[2602.21452] Adversarial Robustness of Deep Learning-Based Thyroid Nodule Segmentation in Ultrasound
Machine Learning

This article evaluates the adversarial robustness of deep learning models for thyroid nodule segmentation in ultrasound images, highlight...

arXiv - AI · 4 min

[2602.21447] Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG
Machine Learning

The paper presents a novel framework, MMA-RAG^T, for enhancing the security of multimodal agentic retrieval-augmented generation systems.

arXiv - Machine Learning · 4 min

[2602.21442] MINAR: Mechanistic Interpretability for Neural Algorithmic Reasoning
LLMs

The paper introduces MINAR, a toolbox for mechanistic interpretability in neural algorithmic reasoning, enhancing understanding of GNNs' algorithmic reasoning.

arXiv - Machine Learning · 3 min

[2602.21429] Provably Safe Generative Sampling with Constricting Barrier Functions
Machine Learning

This paper presents a safety filtering framework for generative models, ensuring generated samples meet hard constraints while minimizing...

arXiv - Machine Learning · 4 min

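The core idea of such a safety filter can be sketched with a generic barrier function h defining the safe set {x : h(x) >= 0} (a toy construction; the paper's "constricting" barriers are presumably more refined):

    import numpy as np

    def filter_sample(x, h, grad_h, step=0.1, iters=100):
        # Push a generated sample toward the safe set {x : h(x) >= 0}
        # by ascending the barrier value while the constraint is violated.
        for _ in range(iters):
            if h(x) >= 0:
                return x          # already satisfies the hard constraint
            x = x + step * grad_h(x)
        return x

    # Example hard constraint: stay inside the unit ball, h(x) = 1 - ||x||^2.
    h = lambda x: 1.0 - x @ x
    grad_h = lambda x: -2.0 * x
    safe = filter_sample(np.array([1.5, 0.5]), h, grad_h)   # scaled inward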

