AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

[2511.21331] The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
Machine Learning

arXiv - AI · 4 min
[2509.22367] What Is The Political Content in LLMs' Pre- and Post-Training Data?
LLMs

arXiv - AI · 4 min
[2507.22264] SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
Machine Learning

arXiv - AI · 4 min

All Content

[2602.16697] Protecting the Undeleted in Machine Unlearning
Machine Learning

The paper discusses machine unlearning, focusing on the privacy risks associated with undeleted data when specific data points are remove...

arXiv - Machine Learning · 3 min
[2602.15913] Foundation Models for Medical Imaging: Status, Challenges, and Directions
LLMs

This article reviews the current landscape of foundation models (FMs) in medical imaging, discussing their design principles, application...

arXiv - AI · 3 min
[2602.15892] Egocentric Bias in Vision-Language Models
LLMs

The paper introduces FlipSet, a benchmark for assessing visual perspective taking in vision-language models, revealing significant egocen...

arXiv - AI · 3 min
[2602.16596] Sequential Membership Inference Attacks
Machine Learning

The paper presents a novel approach to Membership Inference Attacks (MIAs) by developing an optimal attack strategy, SeMI*, leveraging mo...

arXiv - Machine Learning · 4 min
[2602.15889] Evidence for Daily and Weekly Periodic Variability in GPT-4o Performance
LLMs

This article investigates the temporal variability in the performance of the GPT-4o model, revealing significant daily and weekly pattern...

arXiv - AI · 4 min
[2602.16564] A Scalable Approach to Solving Simulation-Based Network Security Games
NLP

The paper presents MetaDOAR, a scalable meta-controller for solving simulation-based network security games, enhancing multi-agent reinfo...

arXiv - Machine Learning · 3 min
[2602.16543] Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning
AI Safety

This paper presents a framework for analyzing the vulnerabilities of Safe Reinforcement Learning (Safe RL) policies against adversarial a...

arXiv - Machine Learning · 3 min
[2602.16531] Transfer Learning of Linear Regression with Multiple Pretrained Models: Benefiting from More Pretrained Models via Overparameterization Debiasing
Machine Learning

This paper explores transfer learning in linear regression using multiple pretrained models, highlighting the benefits of overparameteriz...

arXiv - Machine Learning · 3 min
[2602.15866] NLP Privacy Risk Identification in Social Media (NLP-PRISM): A Survey
NLP

This survey presents the NLP-PRISM framework for identifying privacy risks in social media NLP applications, analyzing 203 peer-reviewed ...

arXiv - AI · 4 min
[2602.15865] AI as Teammate or Tool? A Review of Human-AI Interaction in Decision Support
AI Agents

This article reviews the role of AI in decision support, analyzing whether AI systems act as tools or collaborative teammates. It highlig...

arXiv - AI · 3 min
[2602.15853] A Lightweight Explainable Guardrail for Prompt Safety
LLMs

The paper presents a Lightweight Explainable Guardrail (LEG) method for classifying unsafe prompts in AI systems, utilizing a multi-task ...

arXiv - AI · 3 min
[2602.16449] GICDM: Mitigating Hubness for Reliable Distance-Based Generative Model Evaluation
Machine Learning

The paper presents GICDM, a method to mitigate hubness in distance-based evaluations of generative models, enhancing reliability and alig...

arXiv - AI · 3 min
[2602.15852] Building Safe and Deployable Clinical Natural Language Processing under Temporal Leakage Constraints
Machine Learning

This article discusses the development of clinical NLP models that mitigate risks associated with temporal leakage, emphasizing the impor...

arXiv - AI · 4 min
[2602.16438] Intra-Fairness Dynamics: The Bias Spillover Effect in Targeted LLM Alignment
LLMs

The paper explores the bias spillover effect in large language models (LLMs), revealing how targeted fairness alignment can inadvertently...

arXiv - AI · 3 min
[2602.16436] Learning with Locally Private Examples by Inverse Weierstrass Private Stochastic Gradient Descent
NLP

This paper presents a novel method for correcting bias in binary classification tasks using locally private examples, leveraging the Inve...

arXiv - Machine Learning · 3 min
[2602.15847] Do Personality Traits Interfere? Geometric Limitations of Steering in Large Language Models
LLMs

This article explores the geometric limitations of steering personality traits in large language models (LLMs), revealing that traits are...

arXiv - Machine Learning · 3 min
[2602.16400] Easy Data Unlearning Bench
Machine Learning

The paper introduces the Easy Data Unlearning Bench, a unified benchmarking suite aimed at simplifying the evaluation of machine unlearni...

arXiv - Machine Learning · 3 min
[2602.16341] Explainability for Fault Detection System in Chemical Processes
Machine Learning

This article evaluates two explainability methods, Integrated Gradients and SHAP, for fault detection in chemical processes using an LSTM...

arXiv - Machine Learning · 3 min
[2602.16666] Towards a Science of AI Agent Reliability
AI Agents

This paper explores the reliability of AI agents, proposing twelve metrics to evaluate their performance across dimensions like consisten...

arXiv - Machine Learning · 3 min
[2602.16340] The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks
Machine Learning

This paper investigates the implicit bias of momentum-based optimizers like Adam and Muon in smooth homogeneous neural networks, extendin...

arXiv - Machine Learning · 3 min
