AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

[2603.12681] Colluding LoRA: A Compositional Vulnerability in LLM Safety Alignment
Llms

[2603.12681] Colluding LoRA: A Compositional Vulnerability in LLM Safety Alignment

Abstract page for arXiv paper 2603.12681: Colluding LoRA: A Compositional Vulnerability in LLM Safety Alignment

arXiv - Machine Learning · 3 min ·
[2512.02711] CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer
Llms

[2512.02711] CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer

Abstract page for arXiv paper 2512.02711: CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer

arXiv - Machine Learning · 3 min ·
[2510.03721] Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models
Llms

[2510.03721] Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models

Abstract page for arXiv paper 2510.03721: Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models

arXiv - Machine Learning · 4 min ·

All Content

[2404.01877] Procedural Fairness in Machine Learning
Machine Learning

[2404.01877] Procedural Fairness in Machine Learning

This paper explores procedural fairness in machine learning, proposing a new metric for evaluation and methods to enhance fairness withou...

arXiv - Machine Learning · 4 min ·
[2602.23192] FairQuant: Fairness-Aware Mixed-Precision Quantization for Medical Image Classification
Machine Learning

[2602.23192] FairQuant: Fairness-Aware Mixed-Precision Quantization for Medical Image Classification

The paper presents FairQuant, a framework for fairness-aware mixed-precision quantization in medical image classification, optimizing bot...

arXiv - Machine Learning · 3 min ·
[2602.22790] Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift
Llms

[2602.22790] Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift

The paper introduces Natural Language Declarative Prompting (NLD-P), a governance method for prompt design that addresses challenges pose...

arXiv - AI · 4 min ·
[2602.22775] TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation
Ai Startups

[2602.22775] TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation

The paper introduces TherapyProbe, a methodology for enhancing relational safety in mental health chatbots through adversarial simulation...

arXiv - AI · 3 min ·
[2602.23085] Q-Tag: Watermarking Quantum Circuit Generative Models
Machine Learning

[2602.23085] Q-Tag: Watermarking Quantum Circuit Generative Models

The paper presents Q-Tag, a novel watermarking framework for quantum circuit generative models (QCGMs), addressing the need for secure co...

arXiv - Machine Learning · 4 min ·
[2602.23079] Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent
Llms

[2602.23079] Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent

This article introduces a novel LLM agent designed to assess and mitigate deanonymization risks in textual data using a method called SAL...

arXiv - Machine Learning · 3 min ·
[2602.22740] AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation
Machine Learning

[2602.22740] AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation

The paper presents AMLRIS, a novel training strategy for Referring Image Segmentation (RIS) that enhances object segmentation through ali...

arXiv - AI · 3 min ·
[2602.22724] AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification
Llms

[2602.22724] AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

AgentSentry introduces a novel framework to mitigate indirect prompt injection (IPI) in LLM agents, enhancing their security while mainta...

arXiv - AI · 4 min ·
[2602.22710] Same Words, Different Judgments: Modality Effects on Preference Alignment
Ai Safety

[2602.22710] Same Words, Different Judgments: Modality Effects on Preference Alignment

This study explores how modality affects preference alignment in AI systems, comparing human and synthetic evaluations of audio and text ...

arXiv - AI · 3 min ·
[2602.22700] IMMACULATE: A Practical LLM Auditing Framework via Verifiable Computation
Llms

[2602.22700] IMMACULATE: A Practical LLM Auditing Framework via Verifiable Computation

The paper presents IMMACULATE, a framework for auditing large language models (LLMs) using verifiable computation to detect economic devi...

arXiv - AI · 3 min ·
[2602.22903] PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised MMEA
Llms

[2602.22903] PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised MMEA

The paper presents PSQE, a method for enhancing pseudo seed quality in unsupervised multimodal entity alignment, addressing challenges in...

arXiv - Machine Learning · 4 min ·
[2602.22621] CGSA: Class-Guided Slot-Aware Adaptation for Source-Free Object Detection
Computer Vision

[2602.22621] CGSA: Class-Guided Slot-Aware Adaptation for Source-Free Object Detection

The paper presents CGSA, a novel framework for Source-Free Domain Adaptive Object Detection that integrates object-centric learning to en...

arXiv - AI · 3 min ·
[2602.22699] DPSQL+: A Differentially Private SQL Library with a Minimum Frequency Rule
Machine Learning

[2602.22699] DPSQL+: A Differentially Private SQL Library with a Minimum Frequency Rule

DPSQL+ is a new SQL library designed to enhance data privacy by enforcing differential privacy and a minimum frequency rule, ensuring sen...

arXiv - Machine Learning · 4 min ·
[2602.22570] Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation
Machine Learning

[2602.22570] Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation

The paper discusses the evaluation challenges in text-to-image generation, focusing on classifier-free guidance (CFG) and proposing a new...

arXiv - AI · 4 min ·
[2602.22631] TorchLean: Formalizing Neural Networks in Lean
Machine Learning

[2602.22631] TorchLean: Formalizing Neural Networks in Lean

TorchLean is a framework that formalizes neural networks within the Lean 4 theorem prover, enabling precise semantics for execution and v...

arXiv - Machine Learning · 4 min ·
[2602.22564] Addressing Climate Action Misperceptions with Generative AI
Llms

[2602.22564] Addressing Climate Action Misperceptions with Generative AI

This study explores how a personalized large language model (LLM) can correct climate action misperceptions among climate-concerned indiv...

arXiv - AI · 3 min ·
[2602.22609] EvolveGen: Algorithmic Level Hardware Model Checking Benchmark Generation through Reinforcement Learning
Machine Learning

[2602.22609] EvolveGen: Algorithmic Level Hardware Model Checking Benchmark Generation through Reinforcement Learning

EvolveGen introduces a novel framework for generating hardware model checking benchmarks using reinforcement learning, addressing the ben...

arXiv - Machine Learning · 4 min ·
[2602.22488] Explainability-Aware Evaluation of Transfer Learning Models for IoT DDoS Detection Under Resource Constraints
Machine Learning

[2602.22488] Explainability-Aware Evaluation of Transfer Learning Models for IoT DDoS Detection Under Resource Constraints

This article evaluates transfer learning models for IoT DDoS detection, focusing on explainability and resource constraints. It analyzes ...

arXiv - AI · 3 min ·
[2602.22481] Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs
Llms

[2602.22481] Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs

This article explores the relationship between AI and humans through the lens of large language models (LLMs), focusing on the Sydney per...

arXiv - AI · 4 min ·
[2602.22450] Silent Egress: When Implicit Prompt Injection Makes LLM Agents Leak Without a Trace
Llms

[2602.22450] Silent Egress: When Implicit Prompt Injection Makes LLM Agents Leak Without a Trace

The paper discusses the security risks posed by implicit prompt injection in large language model (LLM) agents, demonstrating how adversa...

arXiv - AI · 4 min ·
Previous Page 35 Next

Related Topics

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime