AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

AI Safety

NHS staff resist using Palantir software. Staff reportedly cite ethics concerns, privacy worries, and doubt the platform adds much

Submitted by /u/esporx

Reddit - Artificial Intelligence · 1 min · 2 days ago

Machine Learning

AI assistants are optimized to seem helpful. That is not the same thing as being helpful.

RLHF trains models on human feedback. Humans rate responses they like. And it turns out humans consistently rate confident, fluent, agree...

Reddit - Artificial Intelligence · 1 min · 2 days ago

Computer Vision

House Democrat Questions Anthropic on AI Safety After Source Code Leak

Rep. Josh Gottheimer, who is generally tough on China, just sent a letter to Anthropic questioning their decision to reduce certain safet...

Reddit - Artificial Intelligence · 1 min · 2 days ago

All Content

LLMs

[2602.17223] Privacy-Preserving Mechanisms Enable Cheap Verifiable Inference of LLMs

The paper presents new privacy-preserving protocols for verifiable inference of large language models (LLMs), addressing the challenges o...

arXiv - Machine Learning · 4 min · about 1 month ago

LLMs

[2602.16835] NeST: Neuron Selective Tuning for LLM Safety

The paper introduces NeST, a novel framework for enhancing safety in large language models (LLMs) by selectively tuning a small subset of...

arXiv - Machine Learning · 4 min · about 1 month ago

Machine Learning

[2602.16794] Beyond Procedure: Substantive Fairness in Conformal Prediction

This paper explores substantive fairness in conformal prediction, analyzing its impact on downstream decision-making and proposing method...

arXiv - Machine Learning · 3 min · about 1 month ago

Machine Learning

[2602.17642] A.R.I.S.: Automated Recycling Identification System for E-Waste Classification Using Deep Learning

The A.R.I.S. system utilizes deep learning to enhance e-waste recycling by accurately classifying materials in real-time, improving recov...

arXiv - Machine Learning · 3 min · about 1 month ago

Machine Learning

[2602.17625] Catastrophic Forgetting Resilient One-Shot Incremental Federated Learning

This paper introduces One-Shot Incremental Federated Learning (OSI-FL), a novel framework that mitigates catastrophic forgetting and comm...

arXiv - Machine Learning · 4 min · about 1 month ago

Machine Learning

[2602.17614] Guarding the Middle: Protecting Intermediate Representations in Federated Split Learning

This paper presents KD-UFSL, a method to enhance privacy in federated split learning by minimizing data leakage through intermediate repr...

arXiv - Machine Learning · 4 min · about 1 month ago

Machine Learning

[2602.17559] Revisiting Weight Regularization for Low-Rank Continual Learning

This paper explores weight regularization techniques in low-rank continual learning, proposing EWC-LoRA to mitigate task interference whi...

arXiv - Machine Learning · 4 min · about 1 month ago

LLMs

[2602.17554] A Theoretical Framework for Modular Learning of Robust Generative Models

This article presents a theoretical framework for modular learning in robust generative models, exploring the combination of domain-speci...

arXiv - Machine Learning · 4 min · about 1 month ago

Machine Learning

[2602.17312] LexiSafe: Offline Safe Reinforcement Learning with Lexicographic Safety-Reward Hierarchy

The paper presents LexiSafe, a novel offline safe reinforcement learning framework that employs a lexicographic safety-reward hierarchy t...

arXiv - Machine Learning · 3 min · about 1 month ago

Machine Learning

[2602.17284] Efficient privacy loss accounting for subsampling and random allocation

This paper presents an efficient method for privacy loss accounting in subsampling and random allocation, demonstrating advantages over t...

arXiv - Machine Learning · 4 min · about 1 month ago

Machine Learning

[2602.17244] CounterFlowNet: From Minimal Changes to Meaningful Counterfactual Explanations

CounterFlowNet introduces a novel generative approach for creating counterfactual explanations in machine learning, enhancing interpretab...

arXiv - Machine Learning · 3 min · about 1 month ago

Machine Learning

[2602.17092] A Locality Radius Framework for Understanding Relational Inductive Bias in Database Learning

This paper introduces a locality radius framework to understand relational inductive bias in database learning, focusing on the necessary...

arXiv - Machine Learning · 3 min · about 1 month ago

Machine Learning

[2602.17088] MeGU: Machine-Guided Unlearning with Target Feature Disentanglement

The paper presents MeGU, a novel framework for machine unlearning that addresses the challenge of effectively erasing target data while p...

arXiv - Machine Learning · 4 min · about 1 month ago

LLMs

[2602.16980] Discovering Universal Activation Directions for PII Leakage in Language Models

The paper introduces UniLeak, a framework that identifies universal activation directions in language models, enhancing the understanding...

arXiv - Machine Learning · 3 min · about 1 month ago

LLMs

[2602.16977] Fail-Closed Alignment for Large Language Models

This paper proposes a fail-closed alignment mechanism for large language models (LLMs) to enhance their safety and robustness against pro...

arXiv - Machine Learning · 3 min · about 1 month ago

Machine Learning

[2602.16944] Exact Certification of Data-Poisoning Attacks Using Mixed-Integer Programming

This paper presents a framework for certifying data-poisoning attacks in neural networks using mixed-integer programming, ensuring robust...

arXiv - Machine Learning · 3 min · about 1 month ago

Machine Learning

[2602.16849] On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking

This paper analyzes how two-layer neural networks learn to solve the modular addition task, providing insights into feature learning, tra...

arXiv - Machine Learning · 4 min · about 1 month ago

Machine Learning

[2602.16837] A Residual-Aware Theory of Position Bias in Transformers

This paper presents a residual-aware theory explaining the position bias in Transformers, revealing how residual connections prevent atte...

arXiv - Machine Learning · 3 min · about 1 month ago

Machine Learning

[2602.16823] Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable Guarantees

This article presents a novel approach to automated circuit discovery in neural networks, emphasizing provable guarantees for robustness ...

arXiv - Machine Learning · 4 min · about 1 month ago

LLMs

[2602.16784] Omitted Variable Bias in Language Models Under Distribution Shift

This paper explores omitted variable bias in language models under distribution shifts, proposing a framework to evaluate and optimize pe...

arXiv - Machine Learning · 3 min · about 1 month ago
