AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

[2511.21331] The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
Machine Learning

arXiv - AI · 4 min
[2509.22367] What Is The Political Content in LLMs' Pre- and Post-Training Data?
LLMs

arXiv - AI · 4 min
[2507.22264] SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
Machine Learning

arXiv - AI · 4 min

All Content

[2602.00191] GEPC: Group-Equivariant Posterior Consistency for Out-of-Distribution Detection in Diffusion Models
Machine Learning

The paper introduces Group-Equivariant Posterior Consistency (GEPC), a method for detecting out-of-distribution data in diffusion models ...

arXiv - Machine Learning · 4 min
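
The one-line summary names the idea but not the mechanics. A minimal sketch of one plausible reading, scoring an input by how inconsistent a denoiser's reconstructions are under a group of transformations (rotations here); the function names and the variance-based score are illustrative assumptions, not the paper's code:

import torch

def gepc_style_score(denoise, x, t, n_rots=4):
    """Hypothetical OOD score: variance of reconstruction error under a
    rotation group. In-distribution inputs should yield consistent errors."""
    errs = []
    for k in range(n_rots):
        xk = torch.rot90(x, k, dims=(-2, -1))   # act with one group element
        x0_hat = denoise(xk, t)                 # posterior-mean estimate
        errs.append(torch.mean((x0_hat - xk) ** 2, dim=(1, 2, 3)))
    errs = torch.stack(errs)                    # (n_rots, batch)
    return errs.var(dim=0)                      # high variance -> flag as OOD
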
[2601.20568] Reinforcement Unlearning via Group Relative Policy Optimization
LLMs

This article presents a novel method called PURGE for reinforcement unlearning in large language models, addressing the challenge of safe...

arXiv - Machine Learning · 4 min
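
Group Relative Policy Optimization is the known quantity here: advantages are computed relative to a group of sampled completions instead of a learned value baseline. A minimal sketch of that step (PURGE's unlearning-specific reward is not described in the summary, so the rewards below are placeholders):

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """GRPO-style advantages: normalize each completion's reward against
    the mean/std of its own group (rewards has shape [groups, group_size])."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# e.g. 2 prompts x 4 sampled completions each
adv = group_relative_advantages(torch.tensor([[1., 0., 0., 1.],
                                              [0., 0., 1., 0.]]))
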
[2512.18454] Out-of-Distribution Detection in Molecular Complexes via Diffusion Models for Irregular Graphs
Machine Learning

This paper presents a novel framework for out-of-distribution (OOD) detection in molecular complexes using diffusion models tailored for ...

arXiv - Machine Learning · 4 min
[2512.22623] Communication Compression for Distributed Learning with Aggregate and Server-Guided Feedback
AI Safety

This paper presents novel frameworks for communication compression in distributed learning, addressing bandwidth constraints in federated...

arXiv - Machine Learning · 4 min
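
The summary does not detail the feedback scheme, but the standard baseline in this literature is sparsification with error feedback; a sketch of top-k compression with a residual buffer, as a generic reference point rather than the paper's aggregate/server-guided variant:

import torch

def topk_with_error_feedback(grad, residual, k):
    """Send only the k largest-magnitude entries; keep the rest as a
    residual that is added back into the next round's gradient."""
    g = grad + residual                       # error feedback: reuse leftovers
    flat = g.flatten()
    idx = flat.abs().topk(k).indices
    sent = torch.zeros_like(flat)
    sent[idx] = flat[idx]                     # what the client transmits
    new_residual = (flat - sent).view_as(grad)
    return sent.view_as(grad), new_residual
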
[2511.14406] Watch Out for the Lifespan: Evaluating Backdoor Attacks Against Federated Model Adaptation
Machine Learning

This paper evaluates backdoor attacks against federated learning model adaptation, focusing on the impact of Low-Rank Adaptation (LoRA) on...

arXiv - Machine Learning · 4 min
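
For context on the attack surface: under LoRA-based adaptation, clients train and transmit only a low-rank update to frozen weights, so any backdoor must live in that small subspace. A minimal LoRA layer illustrating the mechanism (generic, not the paper's setup):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen weight W plus a trainable low-rank update B @ A (rank r).
    In federated adaptation only A and B leave the client."""
    def __init__(self, in_f, out_f, r=8, alpha=16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f), requires_grad=False)
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, r))
        self.scale = alpha / r

    def forward(self, x):
        return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)
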
[2505.15801] VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
LLMs

The paper introduces VerifyBench, a new benchmarking framework for evaluating reference-based reward systems in large language models, highlighting...

arXiv - AI · 4 min
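
A reference-based reward system scores a model response against a gold reference rather than with a learned reward model. A toy verifier of that kind (VerifyBench's actual verifiers are not described in the summary):

import re

def reference_reward(response: str, reference: str) -> float:
    """Toy reference-based reward: 1.0 if the last number in the response
    matches the reference answer, else 0.0."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", response)
    return float(bool(nums) and nums[-1] == reference.strip())

print(reference_reward("... so the answer is 42", "42"))  # 1.0
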
[2504.00869] m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models
LLMs

This article explores the effectiveness of test-time scaling for enhancing medical reasoning in large language models, presenting the m1 ...

arXiv - AI · 4 min
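
Test-time scaling here means spending more inference compute per question, for example sampling several reasoning chains and majority-voting the final answers. A generic sketch, where sample_answer stands in for any LLM call:

from collections import Counter

def majority_vote(sample_answer, question, n=8):
    """Simple test-time scaling: draw n sampled answers, return the mode."""
    answers = [sample_answer(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
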
[2510.18478] Safe But Not Sorry: Reducing Over-Conservatism in Safety Critics via Uncertainty-Aware Modulation
AI Infrastructure

This article presents the Uncertain Safety Critic (USC), a novel approach to enhance safety in reinforcement learning (RL) by balancing safety...

arXiv - Machine Learning · 3 min
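
One plausible reading of "uncertainty-aware modulation": an ensemble safety critic whose penalty is attenuated where the ensemble disagrees, so the policy is not over-constrained by uncertain cost estimates. Everything below is an illustrative assumption, not the paper's method:

import torch

def modulated_cost(critics, state, action, lam=1.0):
    """Hypothetical uncertainty-modulated safety cost: mean ensemble cost,
    discounted where epistemic uncertainty (ensemble std) is high."""
    costs = torch.stack([c(state, action) for c in critics])  # (n_critics,)
    mean, std = costs.mean(), costs.std()
    confidence = 1.0 / (1.0 + lam * std)   # low confidence -> weaker penalty
    return mean * confidence
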
[2508.12907] SNAP-UQ: Self-supervised Next-Activation Prediction for Single-Pass Uncertainty in TinyML
Machine Learning

The paper presents SNAP-UQ, a novel method for single-pass uncertainty estimation in TinyML, enhancing reliability in on-device monitoring...

arXiv - Machine Learning · 4 min
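
The title spells out the mechanism: a small head predicts the next layer's activations, and the prediction error serves as a single-pass uncertainty score. A minimal sketch under that reading (dimensions and the MSE score are assumptions):

import torch
import torch.nn as nn

class NextActivationUQ(nn.Module):
    """Tiny head that predicts layer l+1 activations from layer l; large
    prediction error is used as an uncertainty score at inference time."""
    def __init__(self, d_in, d_next):
        super().__init__()
        self.head = nn.Linear(d_in, d_next)

    def score(self, h_l, h_next):
        return ((self.head(h_l) - h_next) ** 2).mean(dim=-1)  # per-sample UQ
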
[2501.16534] Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
LLMs

This article presents a novel technique for extracting safety classifiers from aligned large language models (LLMs) to address vulnerabilities...

arXiv - AI · 4 min
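
A common way to make an aligned model's implicit safety classifier explicit is to fit a linear probe on hidden states over refused vs. answered prompts; a generic sketch of that idea, which may differ from the paper's extraction technique:

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_safety_probe(hidden: np.ndarray, refused: np.ndarray):
    """Linear probe: predict from last-token activations whether the model
    refuses. hidden[i] is the hidden state for prompt i (assumed given)."""
    probe = LogisticRegression(max_iter=1000).fit(hidden, refused)
    return probe  # probe.decision_function approximates the safety boundary
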
[2508.10836] SoK: Data Minimization in Machine Learning
Machine Learning

The paper presents a systematization of knowledge on data minimization in machine learning, addressing its importance in regulatory compliance...

arXiv - Machine Learning · 4 min
[2501.03544] PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models
Machine Learning

PromptGuard introduces a novel method for moderating unsafe content in text-to-image models, enhancing safety without sacrificing image quality...

arXiv - AI · 4 min
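
Soft prompts are trainable embedding vectors prepended to a frozen encoder's input sequence; only those vectors are optimized. A sketch of the mechanism PromptGuard builds on (the moderation objective itself is not shown in the summary):

import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """n_tokens trainable embeddings prepended to the token embeddings of a
    frozen text encoder; only these vectors receive gradients."""
    def __init__(self, n_tokens, d_model):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, token_embeds):              # (batch, seq, d_model)
        batch = token_embeds.shape[0]
        p = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([p, token_embeds], dim=1)
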
[2507.04033] Benchmarking Stochastic Approximation Algorithms for Fairness-Constrained Training of Deep Neural Networks
Machine Learning

This paper benchmarks stochastic approximation algorithms for fairness-constrained training of deep neural networks, addressing theoretical...

arXiv - Machine Learning · 3 min
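
Fairness-constrained training is typically posed as stochastic optimization of a Lagrangian whose constraint is a per-group statistic. A minimal sketch of one such penalty, using the demographic-parity gap purely for concreteness:

import torch

def dp_gap_penalty(scores, group):
    """Demographic-parity surrogate: |mean score in group 0 - group 1|.
    Assumes both groups appear in the batch; added to the task loss with
    a multiplier in a Lagrangian scheme."""
    g0 = scores[group == 0].mean()
    g1 = scores[group == 1].mean()
    return (g0 - g1).abs()

# loss = task_loss + lam * dp_gap_penalty(logits.sigmoid(), group)
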
[2506.17047] Navigating the Deep: End-to-End Extraction on Deep Neural Networks
Machine Learning

This article presents a novel end-to-end model extraction method for deep neural networks, addressing limitations in existing techniques ...

arXiv - Machine Learning · 4 min
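
End-to-end extraction generally means training a surrogate from the victim's input-output behavior alone; the generic query-and-distill loop looks like this (victim and surrogate are stand-ins for the models involved):

import torch
import torch.nn.functional as F

def extraction_step(victim, surrogate, opt, x):
    """One distillation step of black-box model extraction: query the
    victim, fit the surrogate to its output distribution."""
    with torch.no_grad():
        teacher = victim(x).softmax(dim=-1)       # queried victim outputs
    loss = F.kl_div(surrogate(x).log_softmax(dim=-1), teacher,
                    reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
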
[2602.11348] AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition
LLMs

The paper introduces AgentNoiseBench, a framework for evaluating the robustness of tool-using LLM agents under noisy conditions, highlighting...

arXiv - AI · 4 min
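
Benchmarking tool-using agents "under noisy conditions" presumably involves perturbing tool outputs before the agent sees them; a toy wrapper of the kind such a harness might use (purely illustrative, not the benchmark's code):

import random

def noisy_tool(tool, p_drop=0.1, p_garble=0.1):
    """Wrap a tool so its string output is sometimes dropped or truncated,
    simulating flaky APIs for robustness evaluation."""
    def wrapped(*args, **kwargs):
        out = tool(*args, **kwargs)
        r = random.random()
        if r < p_drop:
            return "ERROR: tool unavailable"
        if r < p_drop + p_garble:
            return out[: max(1, len(out) // 2)]   # truncated response
        return out
    return wrapped
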
[2602.05088] VERA-MH: Reliability and Validity of an Open-Source AI Safety Evaluation in Mental Health
Generative AI

The article presents VERA-MH, an open-source evaluation tool designed to assess the safety of AI in mental health contexts, focusing on s...

arXiv - AI · 4 min
[2601.07611] DIAGPaper: Diagnosing Valid and Specific Weaknesses in Scientific Papers via Multi-Agent Reasoning
LLMs

DIAGPaper introduces a multi-agent framework for identifying and prioritizing weaknesses in scientific papers, addressing limitations of ...

arXiv - AI · 4 min
[2504.05615] FedEFC: Federated Learning Using Enhanced Forward Correction Against Noisy Labels
Machine Learning

The paper presents FedEFC, a novel approach to federated learning that addresses the challenges posed by noisy labels through techniques ...

arXiv - Machine Learning · 4 min
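
Forward correction, the loss-correction idea the summary alludes to, passes model predictions through an estimated label-noise transition matrix before the cross-entropy. A minimal sketch of the standard (non-federated) version:

import torch

def forward_corrected_ce(logits, noisy_labels, T):
    """Forward correction: train against noisy labels via p_noisy = p @ T,
    where T[i, j] = P(observed label j | true label i), estimated upfront."""
    p = logits.softmax(dim=-1)
    p_noisy = p @ T                              # (batch, classes)
    return -torch.log(p_noisy.gather(1, noisy_labels[:, None]) + 1e-12).mean()
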
[2502.09683] Channel Dependence, Limited Lookback Windows, and the Simplicity of Datasets: How Biased is Time Series Forecasting?
Machine Learning

This article examines the biases in time series forecasting (TSF) due to arbitrary lookback windows and channel dependence, advocating for...

arXiv - Machine Learning · 4 min
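
The bias in question is easy to probe: evaluate the same forecaster at several lookback lengths instead of one arbitrary choice. A hypothetical harness sketch, with fit_predict standing in for any train-then-forecast routine:

import numpy as np

def mae_by_lookback(fit_predict, series, horizon, lookbacks=(48, 96, 336)):
    """Measure how test MAE depends on the lookback window L, so the window
    becomes an evaluated variable rather than a hidden design choice."""
    results = {}
    for L in lookbacks:
        X = np.array([series[i - L:i] for i in range(L, len(series) - horizon)])
        y = np.array([series[i:i + horizon] for i in range(L, len(series) - horizon)])
        split = int(0.8 * len(X))                 # chronological split
        preds = fit_predict(X[:split], y[:split], X[split:])
        results[L] = float(np.mean(np.abs(preds - y[split:])))
    return results
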
[2510.12121] Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing
LLMs

This paper introduces a method for precise control of attribute intensities in Large Language Models (LLMs) through targeted representation...

arXiv - Machine Learning · 4 min
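
Targeted representation editing generally amounts to adding a direction vector to a hidden state, scaled by the desired attribute intensity. The core operation in sketch form (how the paper finds the direction and calibrates the scale is not shown in the summary):

import torch

def edit_representation(h, direction, alpha):
    """Steer a hidden state along an attribute direction; alpha sets the
    intensity (0 leaves h unchanged, larger alpha strengthens the attribute)."""
    d = direction / direction.norm()
    return h + alpha * d

# e.g. h_edited = edit_representation(hidden_states[:, -1], v_formal, alpha=2.0)
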