AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

This Week's Best | Monthly Best | Guide | Trending

RSS

Top This Week

Llms

[2603.12681] Colluding LoRA: A Compositional Vulnerability in LLM Safety Alignment

Abstract page for arXiv paper 2603.12681: Colluding LoRA: A Compositional Vulnerability in LLM Safety Alignment

arXiv - Machine Learning · 3 min · 31 minutes ago

Llms

[2512.02711] CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer

Abstract page for arXiv paper 2512.02711: CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer

arXiv - Machine Learning · 3 min · 31 minutes ago

Llms

[2510.03721] Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models

Abstract page for arXiv paper 2510.03721: Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models

arXiv - Machine Learning · 4 min · 31 minutes ago

All Content

Machine Learning

[2404.01877] Procedural Fairness in Machine Learning

This paper explores procedural fairness in machine learning, proposing a new metric for evaluation and methods to enhance fairness withou...

arXiv - Machine Learning · 4 min · about 1 month ago

Machine Learning

[2602.23192] FairQuant: Fairness-Aware Mixed-Precision Quantization for Medical Image Classification

The paper presents FairQuant, a framework for fairness-aware mixed-precision quantization in medical image classification, optimizing bot...

arXiv - Machine Learning · 3 min · about 1 month ago

Llms

[2602.22790] Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift

The paper introduces Natural Language Declarative Prompting (NLD-P), a governance method for prompt design that addresses challenges pose...

arXiv - AI · 4 min · about 1 month ago

Ai Startups

[2602.22775] TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation

The paper introduces TherapyProbe, a methodology for enhancing relational safety in mental health chatbots through adversarial simulation...

arXiv - AI · 3 min · about 1 month ago

Machine Learning

[2602.23085] Q-Tag: Watermarking Quantum Circuit Generative Models

The paper presents Q-Tag, a novel watermarking framework for quantum circuit generative models (QCGMs), addressing the need for secure co...

arXiv - Machine Learning · 4 min · about 1 month ago

Llms

[2602.23079] Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent

This article introduces a novel LLM agent designed to assess and mitigate deanonymization risks in textual data using a method called SAL...

arXiv - Machine Learning · 3 min · about 1 month ago

Machine Learning

[2602.22740] AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation

The paper presents AMLRIS, a novel training strategy for Referring Image Segmentation (RIS) that enhances object segmentation through ali...

arXiv - AI · 3 min · about 1 month ago

Llms

[2602.22724] AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

AgentSentry introduces a novel framework to mitigate indirect prompt injection (IPI) in LLM agents, enhancing their security while mainta...

arXiv - AI · 4 min · about 1 month ago

Ai Safety

[2602.22710] Same Words, Different Judgments: Modality Effects on Preference Alignment

This study explores how modality affects preference alignment in AI systems, comparing human and synthetic evaluations of audio and text ...

arXiv - AI · 3 min · about 1 month ago

Llms

[2602.22700] IMMACULATE: A Practical LLM Auditing Framework via Verifiable Computation

The paper presents IMMACULATE, a framework for auditing large language models (LLMs) using verifiable computation to detect economic devi...

arXiv - AI · 3 min · about 1 month ago

Llms

[2602.22903] PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised MMEA

The paper presents PSQE, a method for enhancing pseudo seed quality in unsupervised multimodal entity alignment, addressing challenges in...

arXiv - Machine Learning · 4 min · about 1 month ago

Computer Vision

[2602.22621] CGSA: Class-Guided Slot-Aware Adaptation for Source-Free Object Detection

The paper presents CGSA, a novel framework for Source-Free Domain Adaptive Object Detection that integrates object-centric learning to en...

arXiv - AI · 3 min · about 1 month ago

Machine Learning

[2602.22699] DPSQL+: A Differentially Private SQL Library with a Minimum Frequency Rule

DPSQL+ is a new SQL library designed to enhance data privacy by enforcing differential privacy and a minimum frequency rule, ensuring sen...

arXiv - Machine Learning · 4 min · about 1 month ago

Machine Learning

[2602.22570] Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation

The paper discusses the evaluation challenges in text-to-image generation, focusing on classifier-free guidance (CFG) and proposing a new...

arXiv - AI · 4 min · about 1 month ago

Machine Learning

[2602.22631] TorchLean: Formalizing Neural Networks in Lean

TorchLean is a framework that formalizes neural networks within the Lean 4 theorem prover, enabling precise semantics for execution and v...

arXiv - Machine Learning · 4 min · about 1 month ago

Llms

[2602.22564] Addressing Climate Action Misperceptions with Generative AI

This study explores how a personalized large language model (LLM) can correct climate action misperceptions among climate-concerned indiv...

arXiv - AI · 3 min · about 1 month ago

Machine Learning

[2602.22609] EvolveGen: Algorithmic Level Hardware Model Checking Benchmark Generation through Reinforcement Learning

EvolveGen introduces a novel framework for generating hardware model checking benchmarks using reinforcement learning, addressing the ben...

arXiv - Machine Learning · 4 min · about 1 month ago

Machine Learning

[2602.22488] Explainability-Aware Evaluation of Transfer Learning Models for IoT DDoS Detection Under Resource Constraints

This article evaluates transfer learning models for IoT DDoS detection, focusing on explainability and resource constraints. It analyzes ...

arXiv - AI · 3 min · about 1 month ago

Llms

[2602.22481] Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs

This article explores the relationship between AI and humans through the lens of large language models (LLMs), focusing on the Sydney per...

arXiv - AI · 4 min · about 1 month ago

Llms

[2602.22450] Silent Egress: When Implicit Prompt Injection Makes LLM Agents Leak Without a Trace

The paper discusses the security risks posed by implicit prompt injection in large language model (LLM) agents, demonstrating how adversa...

arXiv - AI · 4 min · about 1 month ago

Previous Page 35 Next

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Subscribe to Newsletter

Daily or weekly digest • Unsubscribe anytime

AI Safety & Ethics

Top This Week

[2603.12681] Colluding LoRA: A Compositional Vulnerability in LLM Safety Alignment

[2512.02711] CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer

[2510.03721] Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models

All Content

[2404.01877] Procedural Fairness in Machine Learning

[2602.23192] FairQuant: Fairness-Aware Mixed-Precision Quantization for Medical Image Classification

[2602.22790] Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift

[2602.22775] TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation

[2602.23085] Q-Tag: Watermarking Quantum Circuit Generative Models

[2602.23079] Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent

[2602.22740] AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation

[2602.22724] AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

[2602.22710] Same Words, Different Judgments: Modality Effects on Preference Alignment

[2602.22700] IMMACULATE: A Practical LLM Auditing Framework via Verifiable Computation

[2602.22903] PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised MMEA

[2602.22621] CGSA: Class-Guided Slot-Aware Adaptation for Source-Free Object Detection

[2602.22699] DPSQL+: A Differentially Private SQL Library with a Minimum Frequency Rule

[2602.22570] Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation

[2602.22631] TorchLean: Formalizing Neural Networks in Lean

[2602.22564] Addressing Climate Action Misperceptions with Generative AI

[2602.22609] EvolveGen: Algorithmic Level Hardware Model Checking Benchmark Generation through Reinforcement Learning

[2602.22488] Explainability-Aware Evaluation of Transfer Learning Models for IoT DDoS Detection Under Resource Constraints

[2602.22481] Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs

[2602.22450] Silent Egress: When Implicit Prompt Injection Makes LLM Agents Leak Without a Trace

Related Topics

Stay updated with AI News