AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

LLMs

[R] The Lyra Technique — A framework for interpreting internal cognitive states in LLMs (Zenodo, open access)

We're releasing a paper on a new framework for reading and interpreting the internal cognitive states of large language models: "The Lyra...

Reddit - Machine Learning · 1 min ·
Machine Learning

[P] If you're building AI agents, logs aren't enough. You need evidence.

I have built a programmable governance layer for AI agents and am considering open-sourcing it completely. Looking for feedback. Agent demos...

Reddit - Machine Learning · 1 min ·
AI Safety

[2510.14628] RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

Abstract page for arXiv paper 2510.14628: RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

arXiv - AI · 4 min ·

All Content

Machine Learning

[2602.12972] Jointly Optimizing Debiased CTR and Uplift for Coupons Marketing: A Unified Causal Framework

This paper presents a novel framework, UniMVT, for optimizing debiased Click-Through Rate (CTR) and uplift in coupon marketing, addressin...

arXiv - Machine Learning · 4 min ·
NLP

[2602.12825] Reliable Hierarchical Operating System Fingerprinting via Conformal Prediction

This paper presents a novel approach to Operating System fingerprinting using Conformal Prediction, addressing limitations in existing me...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.12680] A Regularization-Sharpness Tradeoff for Linear Interpolators

This paper introduces a regularization-sharpness tradeoff for linear interpolators in overparameterized settings, challenging traditional...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.12681] Fool Me If You Can: On the Robustness of Binary Code Similarity Detection Models against Semantics-preserving Transformations

This paper evaluates the robustness of binary code similarity detection models against semantics-preserving transformations, introducing ...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.12445] RBCorr: Response Bias Correction in Language Models

The paper presents RBCorr, a method for correcting response biases in language models, demonstrating its effectiveness across various mod...

arXiv - Machine Learning · 3 min ·
NLP

[2602.12426] Interference-Robust Non-Coherent Over-the-Air Computation for Decentralized Optimization

This paper presents an interference-robust non-coherent over-the-air computation (IR-NCOTA) method for decentralized optimization, enhanc...

arXiv - Machine Learning · 3 min ·
LLMs

[2602.12418] Sparse Autoencoders are Capable LLM Jailbreak Mitigators

The paper presents Context-Conditioned Delta Steering (CC-Delta), a defense mechanism using Sparse Autoencoders (SAEs) to mitigate jailbr...

arXiv - Machine Learning · 3 min ·
LLMs

[2602.13151] Quantization-Robust LLM Unlearning via Low-Rank Adaptation

The paper presents a method for unlearning knowledge in large language models (LLMs) while maintaining performance after quantization, us...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.13128] Eventizing Traditionally Opaque Binary Neural Networks as 1-safe Petri net Models

This article presents a framework for enhancing the transparency of Binary Neural Networks (BNNs) by modeling their operations as event-d...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.13062] Backdoor Attacks on Contrastive Continual Learning for IoT Systems

This paper analyzes backdoor attacks on contrastive continual learning (CCL) in IoT systems, highlighting vulnerabilities and proposing d...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.13042] GPTZero: Robust Detection of LLM-Generated Texts

GPTZero introduces a robust solution for detecting AI-generated texts, addressing concerns over text authenticity and misinformation in t...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.13040] TCRL: Temporal-Coupled Adversarial Training for Robust Constrained Reinforcement Learning in Worst-Case Scenarios

The paper presents TCRL, a novel framework for robust constrained reinforcement learning that addresses challenges posed by temporally co...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.12980] MAUNet-Light: A Concise MAUNet Architecture for Bias Correction and Downscaling of Precipitation Estimates

The paper presents MAUNet-Light, a lightweight neural network architecture designed for bias correction and downscaling of precipitation ...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.12714] ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools -- From Consensus Learning to Ambiguity-Driven Emotion Reasoning

The paper introduces ADEPT, a novel framework for emotion recognition that enhances accuracy by integrating acoustic evidence and multi-t...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.12708] Mixture of Predefined Experts: Maximizing Data Usage on Vertical Federated Learning

The paper introduces Split-MoPE, a novel framework for Vertical Federated Learning that maximizes data usage by integrating predefined ex...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.12605] Block-Sample MAC-Bayes Generalization Bounds

The paper introduces Block-Sample MAC-Bayes bounds, a new approach to generalization error estimation in machine learning, enhancing trad...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.12587] Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers

The paper discusses how multi-head attention in Mixture-of-Experts (MoE) Transformers contributes to catastrophic forgetting, proposing a...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.12506] On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs

This article examines the robustness and chain-of-thought consistency of reinforcement learning (RL) fine-tuned vision language models (V...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.12379] Deep Doubly Debiased Longitudinal Effect Estimation with ICE G-Computation

This paper presents D3-Net, a novel framework for estimating longitudinal treatment effects using ICE G-computation, addressing error pro...

arXiv - Machine Learning · 3 min ·
LLMs

[2602.12318] Abstractive Red-Teaming of Language Model Character

This article presents a novel approach to auditing language model behavior through 'abstractive red-teaming,' identifying query types tha...

arXiv - Machine Learning · 4 min ·