AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

AI Safety

NHS staff resist using Palantir software. Staff reportedly cite ethics concerns, privacy worries, and doubts that the platform adds much.

Reddit - Artificial Intelligence · 1 min ·
Machine Learning

AI assistants are optimized to seem helpful. That is not the same thing as being helpful.

RLHF trains models on human feedback. Humans rate responses they like. And it turns out humans consistently rate confident, fluent, agreeable...

Reddit - Artificial Intelligence · 1 min ·
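
The post's argument hinges on how RLHF reward models are fit: a pairwise preference loss rewards whatever raters pick, not what is true. Below is a minimal sketch of that Bradley-Terry-style loss; the scores and function names are illustrative, not from the post or any particular implementation.

```python
import numpy as np

def preference_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Bradley-Terry pairwise loss used to fit RLHF reward models.

    r_chosen / r_rejected are scalar reward-model scores for the response
    the rater preferred vs. the one they rejected. Minimizing this loss
    pushes the model to score whatever raters preferred more highly,
    whether or not the preferred answer was actually correct.
    """
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)) written stably as log(1 + exp(-margin))
    return float(np.mean(np.logaddexp(0.0, -margin)))

# Toy example: the reward model already ranks the preferred answers higher.
print(preference_loss(np.array([2.1, 1.5]), np.array([0.3, 1.0])))  # small loss
print(preference_loss(np.array([0.0]), np.array([3.0])))            # large loss
```

The loss only ever sees which response the rater preferred, so any systematic rater bias toward confident, fluent answers gets optimized straight into the model.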
Computer Vision

House Democrat Questions Anthropic on AI Safety After Source Code Leak

Rep. Josh Gottheimer, who is generally tough on China, just sent a letter to Anthropic questioning their decision to reduce certain safety...

Reddit - Artificial Intelligence · 1 min ·

All Content

[2510.04891] SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests
LLMs

The paper introduces SocialHarmBench, a dataset designed to evaluate the vulnerabilities of large language models (LLMs) to socially harmful requests.

arXiv - Machine Learning · 4 min ·
[2506.16224] Malware Classification Leveraging NLP & Machine Learning for Enhanced Accuracy
Machine Learning

This article explores the use of NLP and machine learning techniques for enhancing malware classification accuracy, achieving a notable 9...

arXiv - Machine Learning · 3 min ·
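
The teaser is cut off before the reported accuracy, and the paper's exact pipeline is not reproduced here. As a hedged illustration of the general idea, this is a common NLP-style baseline: treat disassembled opcode traces as text and classify TF-IDF n-grams with a linear model. The traces and labels below are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: whitespace-joined opcode traces and binary labels.
traces = ["push mov call ret", "xor xor jmp call", "mov add ret", "xor jmp jmp call"]
labels = [0, 1, 0, 1]  # 0 = benign, 1 = malicious

clf = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),  # opcode unigrams + bigrams
    LogisticRegression(max_iter=1000),
)
clf.fit(traces, labels)
print(clf.predict(["xor jmp call"]))  # likely [1] on this toy data
```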
[2506.10572] Probability Bounding: Post-Hoc Calibration via Box-Constrained Softmax
Machine Learning

The paper introduces Probability Bounding (PB), a novel post-hoc calibration method that uses a box-constrained softmax to improve the calibration of a model's predicted probabilities.

arXiv - Machine Learning · 3 min ·
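
Probability Bounding's box-constrained softmax is not reproduced below. For orientation only, here is temperature scaling, the standard post-hoc calibration baseline such methods are usually compared against: fit one scalar T on held-out logits by minimizing negative log-likelihood. All data below is synthetic.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll(T, logits, labels):
    """Negative log-likelihood of labels under temperature-scaled softmax."""
    z = logits / T
    z -= z.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels):
    """Fit a single scalar T > 0 on held-out data."""
    res = minimize_scalar(nll, bounds=(0.05, 10.0), args=(logits, labels), method="bounded")
    return res.x

# Hypothetical validation logits/labels; the scale makes them overconfident.
rng = np.random.default_rng(0)
logits = rng.normal(size=(200, 5)) * 3.0
labels = rng.integers(0, 5, size=200)
print(fit_temperature(logits, labels))  # T well above 1 softens the softmax
```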
[2509.24243] SafeFlowMatcher: Safe and Fast Planning using Flow Matching with Control Barrier Functions
Robotics

The paper presents SafeFlowMatcher, a new planning framework that integrates flow matching with control barrier functions to ensure safe and fast planning.

arXiv - AI · 4 min ·
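
SafeFlowMatcher itself is not sketched here, but the control-barrier-function condition it builds on is standard: keep h(x) ≥ 0 by enforcing ḣ(x, u) ≥ -α·h(x) at every step. Below is a toy 1-D safety filter under that condition; the dynamics, bound, and gain are all illustrative.

```python
# For 1-D dynamics x' = u and safe set h(x) = x_max - x >= 0, the CBF
# condition h'(x, u) >= -alpha * h(x) reduces to u <= alpha * (x_max - x).
def cbf_filter(x: float, u_nominal: float, x_max: float = 1.0, alpha: float = 2.0) -> float:
    """Minimally modify a nominal control so the CBF condition holds."""
    u_bound = alpha * (x_max - x)
    return min(u_nominal, u_bound)

# The planner asks to drive hard toward the boundary; the filter caps it.
x = 0.9
print(cbf_filter(x, u_nominal=5.0))  # -> 0.2, keeps x inside the safe set
```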
[2509.06326] AttestLLM: Efficient Attestation Framework for Billion-scale On-device LLMs
LLMs

The paper presents AttestLLM, a novel framework for efficiently attesting billion-scale on-device LLMs, ensuring model legitimacy and pro...

arXiv - AI · 3 min ·
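
As rough intuition for what attestation means here (AttestLLM's actual protocol, and whatever trusted hardware it relies on, are not reproduced), verifying model legitimacy reduces at its simplest to checking the deployed weights against a trusted digest:

```python
import hashlib

def attest(model_bytes: bytes, expected_digest: str) -> bool:
    """Toy integrity check: hash the deployed weights and compare to a
    trusted reference digest. This only shows the idea, not the paper's
    framework."""
    return hashlib.sha256(model_bytes).hexdigest() == expected_digest

weights = b"\x00\x01\x02..."  # stand-in for an on-device weight blob
ref = hashlib.sha256(weights).hexdigest()
print(attest(weights, ref))         # True: model is the one we expect
print(attest(weights + b"!", ref))  # False: weights were tampered with
```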
[2410.02099] A Watermark for Black-Box Language Models
LLMs

The paper presents a novel watermarking scheme for black-box language models, enabling detection of model outputs without requiring white-box access.

arXiv - Machine Learning · 3 min ·
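
This paper's black-box scheme is not reproduced here. For context, many LLM watermark detectors follow the Kirchenbauer-style red/green-list recipe: a secret rule marks a fraction of the vocabulary "green", generation favors green tokens, and detection is a one-proportion z-test on the green-token count. A self-contained toy version, with an even/odd rule standing in for the secret hash:

```python
import math

def watermark_zscore(token_ids, green_fraction=0.5, is_green=lambda t: t % 2 == 0):
    """One-proportion z-test: is the share of 'green' tokens suspiciously high?

    `is_green` stands in for the secret hash-based green-list; here a toy
    rule (even token ids) keeps the sketch self-contained.
    """
    n = len(token_ids)
    g = sum(1 for t in token_ids if is_green(t))
    return (g - green_fraction * n) / math.sqrt(n * green_fraction * (1 - green_fraction))

# Unwatermarked text hovers near z = 0; watermarked text scores far above it.
print(watermark_zscore([1, 2, 3, 4, 5, 6, 7, 8]))      # ~0
print(watermark_zscore([2, 4, 6, 8, 10, 12, 14, 16]))  # ~2.8 -> likely watermarked
```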
[2507.23465] Role-Aware Language Models for Secure and Contextualized Access Control in Organizations
LLMs

This article explores the development of role-aware language models designed to enhance access control in organizational settings, focusing...

arXiv - AI · 3 min ·
[2507.10846] Winsor-CAM: Human-Tunable Visual Explanations from Deep Networks via Layer-Wise Winsorization
Machine Learning

Winsor-CAM introduces a novel method for visual explanations in deep networks, enhancing interpretability through human-tunable parameters.

arXiv - Machine Learning · 4 min ·
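
Per the title, the tunable ingredient is layer-wise winsorization, i.e. clipping values at chosen percentiles. The core operation is easy to sketch with NumPy; the percentile knob and the toy heatmap below are illustrative, not Winsor-CAM's defaults.

```python
import numpy as np

def winsorize(a: np.ndarray, lower_pct: float = 5.0, upper_pct: float = 95.0) -> np.ndarray:
    """Clip values to the given percentiles, suppressing outliers while
    preserving the rest of the map."""
    lo, hi = np.percentile(a, [lower_pct, upper_pct])
    return np.clip(a, lo, hi)

heat = np.array([0.0, 0.1, 0.1, 0.2, 0.2, 0.3, 9.0])  # one outlier activation
print(winsorize(heat))  # the 9.0 spike is pulled down; the rest survive
```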
[2506.07751] AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking
LLMs

The paper presents AbstRaL, a method to enhance large language models' reasoning capabilities by reinforcing abstract thinking, particularly...

arXiv - AI · 4 min ·
[2602.03596] SAGE-5GC: Security-Aware Guidelines for Evaluating Anomaly Detection in the 5G Core Network
Machine Learning

The paper presents SAGE-5GC, a set of security-aware guidelines for evaluating anomaly detection in the 5G Core Network, addressing challenges...

arXiv - Machine Learning · 4 min ·
[2505.20181] The Problem of Algorithmic Collisions: Mitigating Unforeseen Risks in a Connected World
Robotics

The paper discusses the systemic risks posed by algorithmic collisions in interconnected AI systems, highlighting the need for improved g...

arXiv - AI · 4 min ·
[2505.16789] Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
LLMs

The paper explores how fine-tuning large language models can unintentionally create vulnerabilities, analyzing factors like dataset characteristics...

arXiv - Machine Learning · 3 min ·
[2505.16670] BitHydra: Towards Bit-flip Inference Cost Attack against Large Language Models
LLMs

The paper presents BitHydra, a framework for executing bit-flip inference cost attacks on large language models (LLMs), demonstrating how...

arXiv - AI · 4 min ·
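
BitHydra's attack strategy is not reproduced below, but the primitive it relies on is easy to show: flipping a single bit of a float32 weight can change its magnitude by dozens of orders of magnitude, which is why such faults can badly degrade or hijack model behavior.

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of a float32 and reinterpret the result, modeling a
    single hardware fault (the attack's targeting logic is omitted)."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", x))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped

w = 0.75
print(flip_bit(w, 30))  # flips the top exponent bit: 0.75 -> ~2.5e+38
```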
[2602.01428] Improving the Trade-off Between Watermark Strength and Speculative Sampling Efficiency for Language Models
LLMs

This paper explores the balance between watermark strength and speculative sampling efficiency in language models, proposing a new approach...

arXiv - Machine Learning · 4 min ·
[2601.18650] FaLW: A Forgetting-aware Loss Reweighting for Long-tailed Unlearning
Machine Learning

The paper introduces FaLW, a novel method for machine unlearning that addresses challenges in long-tailed data scenarios, enhancing data ...

arXiv - AI · 4 min ·
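
FaLW's forgetting-aware weighting is not reproduced here. As background, the generic loss-reweighting move for long-tailed data is to upweight rare classes, for example by inverse class frequency; everything below is a toy illustration.

```python
import numpy as np

def inverse_frequency_weights(labels: np.ndarray) -> np.ndarray:
    """Per-sample weights proportional to 1 / class frequency, the generic
    reweighting idea (not FaLW's forgetting-aware variant)."""
    classes, counts = np.unique(labels, return_counts=True)
    freq = dict(zip(classes, counts / len(labels)))
    w = np.array([1.0 / freq[y] for y in labels])
    return w / w.mean()  # normalize so the average weight is 1

labels = np.array([0] * 90 + [1] * 10)  # long-tailed: class 1 is rare
print(inverse_frequency_weights(labels)[[0, 95]])  # head ~0.56, tail ~5.0
```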
[2601.18696] Explainability Methods for Hardware Trojan Detection: A Systematic Comparison
Machine Learning

This article systematically compares various explainability methods for detecting hardware trojans, focusing on their effectiveness in pr...

arXiv - Machine Learning · 4 min ·
[2504.04717] Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
LLMs

This article surveys advancements in multi-turn interactions with large language models (LLMs), focusing on evaluation methods, challenges...

arXiv - AI · 4 min ·
[2503.23377] JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
Machine Learning

The paper presents JavisDiT, a novel Joint Audio-Video Diffusion Transformer that enhances synchronized audio-video generation through hierarchical spatio-temporal prior synchronization.

arXiv - AI · 4 min ·
[2601.03612] Mathematical Foundations of Polyphonic Music Generation via Structural Inductive Bias
NLP

This article presents a novel approach to polyphonic music generation using structural inductive bias, focusing on Beethoven's piano sonatas.

arXiv - Machine Learning · 3 min ·
[2512.20821] Divided We Fall: Defending Against Adversarial Attacks via Soft-Gated Fractional Mixture-of-Experts with Randomized Adversarial Training
Machine Learning

The paper presents a novel defense mechanism against adversarial attacks in machine learning using a soft-gated fractional mixture-of-experts with randomized adversarial training.

arXiv - Machine Learning · 4 min ·