AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

LLMs

[R] The Lyra Technique — A framework for interpreting internal cognitive states in LLMs (Zenodo, open access)

We're releasing a paper on a new framework for reading and interpreting the internal cognitive states of large language models: "The Lyra...

Reddit - Machine Learning · 1 min ·
Machine Learning

[P] If you're building AI agents, logs aren't enough. You need evidence.

I have built a programmable governance layer for AI agents and am considering open-sourcing it completely. Looking for feedback. Agent demos...

Reddit - Machine Learning · 1 min ·
AI Safety

[2510.14628] RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

Abstract page for arXiv paper 2510.14628: RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

arXiv - AI · 4 min ·

All Content

Machine Learning

[2602.12972] Jointly Optimizing Debiased CTR and Uplift for Coupons Marketing: A Unified Causal Framework

This paper presents a novel framework, UniMVT, for optimizing debiased Click-Through Rate (CTR) and uplift in coupon marketing, addressin...

arXiv - Machine Learning · 4 min ·
NLP

[2602.12825] Reliable Hierarchical Operating System Fingerprinting via Conformal Prediction

This paper presents a novel approach to Operating System fingerprinting using Conformal Prediction, addressing limitations in existing me...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.12680] A Regularization-Sharpness Tradeoff for Linear Interpolators

This paper introduces a regularization-sharpness tradeoff for linear interpolators in overparameterized settings, challenging traditional...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.12681] Fool Me If You Can: On the Robustness of Binary Code Similarity Detection Models against Semantics-preserving Transformations

This paper evaluates the robustness of binary code similarity detection models against semantics-preserving transformations, introducing ...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.12445] RBCorr: Response Bias Correction in Language Models

The paper presents RBCorr, a method for correcting response biases in language models, demonstrating its effectiveness across various mod...

arXiv - Machine Learning · 3 min ·
NLP

[2602.12426] Interference-Robust Non-Coherent Over-the-Air Computation for Decentralized Optimization

This paper presents an interference-robust non-coherent over-the-air computation (IR-NCOTA) method for decentralized optimization, enhanc...

arXiv - Machine Learning · 3 min ·
LLMs

[2602.12418] Sparse Autoencoders are Capable LLM Jailbreak Mitigators

The paper presents Context-Conditioned Delta Steering (CC-Delta), a defense mechanism using Sparse Autoencoders (SAEs) to mitigate jailbr...

arXiv - Machine Learning · 3 min ·
LLMs

[2602.13151] Quantization-Robust LLM Unlearning via Low-Rank Adaptation

The paper presents a method for unlearning knowledge in large language models (LLMs) while maintaining performance after quantization, us...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.13128] Eventizing Traditionally Opaque Binary Neural Networks as 1-safe Petri net Models

This article presents a framework for enhancing the transparency of Binary Neural Networks (BNNs) by modeling their operations as event-d...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.13062] Backdoor Attacks on Contrastive Continual Learning for IoT Systems

This paper analyzes backdoor attacks on contrastive continual learning (CCL) in IoT systems, highlighting vulnerabilities and proposing d...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.13042] GPTZero: Robust Detection of LLM-Generated Texts

GPTZero introduces a robust solution for detecting AI-generated texts, addressing concerns over text authenticity and misinformation in t...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.13040] TCRL: Temporal-Coupled Adversarial Training for Robust Constrained Reinforcement Learning in Worst-Case Scenarios

The paper presents TCRL, a novel framework for robust constrained reinforcement learning that addresses challenges posed by temporally co...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.12980] MAUNet-Light: A Concise MAUNet Architecture for Bias Correction and Downscaling of Precipitation Estimates

The paper presents MAUNet-Light, a lightweight neural network architecture designed for bias correction and downscaling of precipitation ...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.12714] ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools -- From Consensus Learning to Ambiguity-Driven Emotion Reasoning

The paper introduces ADEPT, a novel framework for emotion recognition that enhances accuracy by integrating acoustic evidence and multi-t...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.12708] Mixture of Predefined Experts: Maximizing Data Usage on Vertical Federated Learning

The paper introduces Split-MoPE, a novel framework for Vertical Federated Learning that maximizes data usage by integrating predefined ex...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.12605] Block-Sample MAC-Bayes Generalization Bounds

The paper introduces Block-Sample MAC-Bayes bounds, a new approach to generalization error estimation in machine learning, enhancing trad...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.12587] Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers

The paper discusses how multi-head attention in Mixture-of-Experts (MoE) Transformers contributes to catastrophic forgetting, proposing a...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.12506] On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs

This article examines the robustness and chain-of-thought consistency of reinforcement learning (RL) fine-tuned vision language models (V...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.12379] Deep Doubly Debiased Longitudinal Effect Estimation with ICE G-Computation

This paper presents D3-Net, a novel framework for estimating longitudinal treatment effects using ICE G-computation, addressing error pro...

arXiv - Machine Learning · 3 min ·
LLMs

[2602.12318] Abstractive Red-Teaming of Language Model Character

This article presents a novel approach to auditing language model behavior through 'abstractive red-teaming,' identifying query types tha...

arXiv - Machine Learning · 4 min ·