AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

[2511.21331] The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
Machine Learning

arXiv - AI · 4 min
[2509.22367] What Is The Political Content in LLMs' Pre- and Post-Training Data?
LLMs

arXiv - AI · 4 min
[2507.22264] SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
Machine Learning

arXiv - AI · 4 min

All Content

[2602.16697] Protecting the Undeleted in Machine Unlearning
Machine Learning

The paper discusses machine unlearning, focusing on the privacy risks associated with undeleted data when specific data points are remove...

arXiv - Machine Learning · 3 min
[2602.15913] Foundation Models for Medical Imaging: Status, Challenges, and Directions
LLMs

This article reviews the current landscape of foundation models (FMs) in medical imaging, discussing their design principles, application...

arXiv - AI · 3 min
[2602.15892] Egocentric Bias in Vision-Language Models
LLMs

The paper introduces FlipSet, a benchmark for assessing visual perspective taking in vision-language models, revealing significant egocen...

arXiv - AI · 3 min
[2602.16596] Sequential Membership Inference Attacks
Machine Learning

The paper presents a novel approach to Membership Inference Attacks (MIAs) by developing an optimal attack strategy, SeMI*, leveraging mo...

arXiv - Machine Learning · 4 min
[2602.15889] Evidence for Daily and Weekly Periodic Variability in GPT-4o Performance
LLMs

This article investigates the temporal variability in the performance of the GPT-4o model, revealing significant daily and weekly pattern...

arXiv - AI · 4 min
[2602.16564] A Scalable Approach to Solving Simulation-Based Network Security Games
NLP

The paper presents MetaDOAR, a scalable meta-controller for solving simulation-based network security games, enhancing multi-agent reinfo...

arXiv - Machine Learning · 3 min
[2602.16543] Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning
AI Safety

This paper presents a framework for analyzing the vulnerabilities of Safe Reinforcement Learning (Safe RL) policies against adversarial a...

arXiv - Machine Learning · 3 min
[2602.16531] Transfer Learning of Linear Regression with Multiple Pretrained Models: Benefiting from More Pretrained Models via Overparameterization Debiasing
Machine Learning

This paper explores transfer learning in linear regression using multiple pretrained models, highlighting the benefits of overparameteriz...

arXiv - Machine Learning · 3 min
[2602.15866] NLP Privacy Risk Identification in Social Media (NLP-PRISM): A Survey
NLP

This survey presents the NLP-PRISM framework for identifying privacy risks in social media NLP applications, analyzing 203 peer-reviewed ...

arXiv - AI · 4 min
[2602.15865] AI as Teammate or Tool? A Review of Human-AI Interaction in Decision Support
AI Agents

This article reviews the role of AI in decision support, analyzing whether AI systems act as tools or collaborative teammates. It highlig...

arXiv - AI · 3 min
[2602.15853] A Lightweight Explainable Guardrail for Prompt Safety
LLMs

The paper presents a Lightweight Explainable Guardrail (LEG) method for classifying unsafe prompts in AI systems, utilizing a multi-task ...

arXiv - AI · 3 min
[2602.16449] GICDM: Mitigating Hubness for Reliable Distance-Based Generative Model Evaluation
Machine Learning

The paper presents GICDM, a method to mitigate hubness in distance-based evaluations of generative models, enhancing reliability and alig...

arXiv - AI · 3 min
[2602.15852] Building Safe and Deployable Clinical Natural Language Processing under Temporal Leakage Constraints
Machine Learning

This article discusses the development of clinical NLP models that mitigate risks associated with temporal leakage, emphasizing the impor...

arXiv - AI · 4 min
[2602.16438] Intra-Fairness Dynamics: The Bias Spillover Effect in Targeted LLM Alignment
LLMs

The paper explores the bias spillover effect in large language models (LLMs), revealing how targeted fairness alignment can inadvertently...

arXiv - AI · 3 min
[2602.16436] Learning with Locally Private Examples by Inverse Weierstrass Private Stochastic Gradient Descent
NLP

This paper presents a novel method for correcting bias in binary classification tasks using locally private examples, leveraging the Inve...

arXiv - Machine Learning · 3 min
[2602.15847] Do Personality Traits Interfere? Geometric Limitations of Steering in Large Language Models
LLMs

This article explores the geometric limitations of steering personality traits in large language models (LLMs), revealing that traits are...

arXiv - Machine Learning · 3 min
[2602.16400] Easy Data Unlearning Bench
Machine Learning

The paper introduces the Easy Data Unlearning Bench, a unified benchmarking suite aimed at simplifying the evaluation of machine unlearni...

arXiv - Machine Learning · 3 min
[2602.16341] Explainability for Fault Detection System in Chemical Processes
Machine Learning

This article evaluates two explainability methods, Integrated Gradients and SHAP, for fault detection in chemical processes using an LSTM...

arXiv - Machine Learning · 3 min
[2602.16666] Towards a Science of AI Agent Reliability
AI Agents

This paper explores the reliability of AI agents, proposing twelve metrics to evaluate their performance across dimensions like consisten...

arXiv - Machine Learning · 3 min
[2602.16340] The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks
Machine Learning

This paper investigates the implicit bias of momentum-based optimizers like Adam and Muon in smooth homogeneous neural networks, extendin...

arXiv - Machine Learning · 3 min
