AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

AI Safety

NHS staff resist using Palantir software. Staff reportedly cite ethics concerns, privacy worries, and doubt that the platform adds much.

Submitted by /u/esporx

Reddit - Artificial Intelligence · 1 min ·
Machine Learning

AI assistants are optimized to seem helpful. That is not the same thing as being helpful.

RLHF trains models on human feedback. Humans rate responses they like. And it turns out humans consistently rate confident, fluent, agree...

Reddit - Artificial Intelligence · 1 min ·
Computer Vision

House Democrat Questions Anthropic on AI Safety After Source Code Leak

Rep. Josh Gottheimer, who is generally tough on China, just sent a letter to Anthropic questioning their decision to reduce certain safet...

Reddit - Artificial Intelligence · 1 min ·

All Content

LLMs

[2602.17223] Privacy-Preserving Mechanisms Enable Cheap Verifiable Inference of LLMs

The paper presents new privacy-preserving protocols for verifiable inference of large language models (LLMs), addressing the challenges o...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.16835] NeST: Neuron Selective Tuning for LLM Safety

The paper introduces NeST, a novel framework for enhancing safety in large language models (LLMs) by selectively tuning a small subset of...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.16794] Beyond Procedure: Substantive Fairness in Conformal Prediction

This paper explores substantive fairness in conformal prediction, analyzing its impact on downstream decision-making and proposing method...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.17642] A.R.I.S.: Automated Recycling Identification System for E-Waste Classification Using Deep Learning

The A.R.I.S. system utilizes deep learning to enhance e-waste recycling by accurately classifying materials in real-time, improving recov...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.17625] Catastrophic Forgetting Resilient One-Shot Incremental Federated Learning

This paper introduces One-Shot Incremental Federated Learning (OSI-FL), a novel framework that mitigates catastrophic forgetting and comm...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.17614] Guarding the Middle: Protecting Intermediate Representations in Federated Split Learning

This paper presents KD-UFSL, a method to enhance privacy in federated split learning by minimizing data leakage through intermediate repr...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.17559] Revisiting Weight Regularization for Low-Rank Continual Learning

This paper explores weight regularization techniques in low-rank continual learning, proposing EWC-LoRA to mitigate task interference whi...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.17554] A Theoretical Framework for Modular Learning of Robust Generative Models

This article presents a theoretical framework for modular learning in robust generative models, exploring the combination of domain-speci...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.17312] LexiSafe: Offline Safe Reinforcement Learning with Lexicographic Safety-Reward Hierarchy

The paper presents LexiSafe, a novel offline safe reinforcement learning framework that employs a lexicographic safety-reward hierarchy t...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.17284] Efficient privacy loss accounting for subsampling and random allocation

This paper presents an efficient method for privacy loss accounting in subsampling and random allocation, demonstrating advantages over t...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.17244] CounterFlowNet: From Minimal Changes to Meaningful Counterfactual Explanations

CounterFlowNet introduces a novel generative approach for creating counterfactual explanations in machine learning, enhancing interpretab...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.17092] A Locality Radius Framework for Understanding Relational Inductive Bias in Database Learning

This paper introduces a locality radius framework to understand relational inductive bias in database learning, focusing on the necessary...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.17088] MeGU: Machine-Guided Unlearning with Target Feature Disentanglement

The paper presents MeGU, a novel framework for machine unlearning that addresses the challenge of effectively erasing target data while p...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.16980] Discovering Universal Activation Directions for PII Leakage in Language Models

The paper introduces UniLeak, a framework that identifies universal activation directions in language models, enhancing the understanding...

arXiv - Machine Learning · 3 min ·
LLMs

[2602.16977] Fail-Closed Alignment for Large Language Models

This paper proposes a fail-closed alignment mechanism for large language models (LLMs) to enhance their safety and robustness against pro...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.16944] Exact Certification of Data-Poisoning Attacks Using Mixed-Integer Programming

This paper presents a framework for certifying data-poisoning attacks in neural networks using mixed-integer programming, ensuring robust...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.16849] On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking

This paper analyzes how two-layer neural networks learn to solve the modular addition task, providing insights into feature learning, tra...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.16837] A Residual-Aware Theory of Position Bias in Transformers

This paper presents a residual-aware theory explaining the position bias in Transformers, revealing how residual connections prevent atte...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.16823] Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable Guarantees

This article presents a novel approach to automated circuit discovery in neural networks, emphasizing provable guarantees for robustness ...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.16784] Omitted Variable Bias in Language Models Under Distribution Shift

This paper explores omitted variable bias in language models under distribution shifts, proposing a framework to evaluate and optimize pe...

arXiv - Machine Learning · 3 min ·