AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

AI Safety

NHS staff resist using Palantir software. Staff reportedly cite ethics concerns, privacy worries, and doubts that the platform adds much.

Reddit - Artificial Intelligence · 1 min ·
Machine Learning

AI assistants are optimized to seem helpful. That is not the same thing as being helpful.

RLHF trains models on human feedback. Humans rate responses they like. And it turns out humans consistently rate confident, fluent, agreeable...

Reddit - Artificial Intelligence · 1 min ·
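
The post's argument hinges on how RLHF reward models are fit: a pairwise preference loss rewards whatever raters pick, not what is true. Below is a minimal sketch of that Bradley-Terry-style loss; the scores and function names are illustrative, not from the post or any particular implementation.

```python
import numpy as np

def preference_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Bradley-Terry pairwise loss used to fit RLHF reward models.

    r_chosen / r_rejected are scalar reward-model scores for the response
    the rater preferred vs. the one they rejected. Minimizing this loss
    pushes the model to score whatever raters preferred more highly,
    whether or not the preferred answer was actually correct.
    """
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)) written stably as log(1 + exp(-margin))
    return float(np.mean(np.logaddexp(0.0, -margin)))

# Toy example: the reward model already ranks the preferred answers higher.
print(preference_loss(np.array([2.1, 1.5]), np.array([0.3, 1.0])))  # small loss
print(preference_loss(np.array([0.0]), np.array([3.0])))            # large loss
```

The loss only ever sees which response the rater preferred, so any systematic rater bias toward confident, fluent answers gets optimized straight into the model.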
Computer Vision

House Democrat Questions Anthropic on AI Safety After Source Code Leak

Rep. Josh Gottheimer, who is generally tough on China, just sent a letter to Anthropic questioning their decision to reduce certain safety...

Reddit - Artificial Intelligence · 1 min ·

All Content

[2510.04891] SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests
LLMs

The paper introduces SocialHarmBench, a dataset designed to evaluate the vulnerabilities of large language models (LLMs) to socially harmful requests.

arXiv - Machine Learning · 4 min ·
[2506.16224] Malware Classification Leveraging NLP & Machine Learning for Enhanced Accuracy
Machine Learning

This article explores the use of NLP and machine learning techniques for enhancing malware classification accuracy, achieving a notable 9...

arXiv - Machine Learning · 3 min ·
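
The teaser is cut off before the reported accuracy, and the paper's exact pipeline is not reproduced here. As a hedged illustration of the general idea, this is a common NLP-style baseline: treat disassembled opcode traces as text and classify TF-IDF n-grams with a linear model. The traces and labels below are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: whitespace-joined opcode traces and binary labels.
traces = ["push mov call ret", "xor xor jmp call", "mov add ret", "xor jmp jmp call"]
labels = [0, 1, 0, 1]  # 0 = benign, 1 = malicious

clf = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),  # opcode unigrams + bigrams
    LogisticRegression(max_iter=1000),
)
clf.fit(traces, labels)
print(clf.predict(["xor jmp call"]))  # likely [1] on this toy data
```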
[2506.10572] Probability Bounding: Post-Hoc Calibration via Box-Constrained Softmax
Machine Learning

The paper introduces Probability Bounding (PB), a novel post-hoc calibration method that uses a box-constrained softmax to improve the calibration of a model's predicted probabilities.

arXiv - Machine Learning · 3 min ·
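
Probability Bounding's box-constrained softmax is not reproduced below. For orientation only, here is temperature scaling, the standard post-hoc calibration baseline such methods are usually compared against: fit one scalar T on held-out logits by minimizing negative log-likelihood. All data below is synthetic.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll(T, logits, labels):
    """Negative log-likelihood of labels under temperature-scaled softmax."""
    z = logits / T
    z -= z.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels):
    """Fit a single scalar T > 0 on held-out data."""
    res = minimize_scalar(nll, bounds=(0.05, 10.0), args=(logits, labels), method="bounded")
    return res.x

# Hypothetical validation logits/labels; the scale makes them overconfident.
rng = np.random.default_rng(0)
logits = rng.normal(size=(200, 5)) * 3.0
labels = rng.integers(0, 5, size=200)
print(fit_temperature(logits, labels))  # T well above 1 softens the softmax
```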
[2509.24243] SafeFlowMatcher: Safe and Fast Planning using Flow Matching with Control Barrier Functions
Robotics

The paper presents SafeFlowMatcher, a new planning framework that integrates flow matching with control barrier functions to ensure safe and fast planning.

arXiv - AI · 4 min ·
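
SafeFlowMatcher itself is not sketched here, but the control-barrier-function condition it builds on is standard: keep h(x) ≥ 0 by enforcing ḣ(x, u) ≥ -α·h(x) at every step. Below is a toy 1-D safety filter under that condition; the dynamics, bound, and gain are all illustrative.

```python
# For 1-D dynamics x' = u and safe set h(x) = x_max - x >= 0, the CBF
# condition h'(x, u) >= -alpha * h(x) reduces to u <= alpha * (x_max - x).
def cbf_filter(x: float, u_nominal: float, x_max: float = 1.0, alpha: float = 2.0) -> float:
    """Minimally modify a nominal control so the CBF condition holds."""
    u_bound = alpha * (x_max - x)
    return min(u_nominal, u_bound)

# The planner asks to drive hard toward the boundary; the filter caps it.
x = 0.9
print(cbf_filter(x, u_nominal=5.0))  # -> 0.2, keeps x inside the safe set
```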
[2509.06326] AttestLLM: Efficient Attestation Framework for Billion-scale On-device LLMs
LLMs

The paper presents AttestLLM, a novel framework for efficiently attesting billion-scale on-device LLMs, ensuring model legitimacy and pro...

arXiv - AI · 3 min ·
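
As rough intuition for what attestation means here (AttestLLM's actual protocol, and whatever trusted hardware it relies on, are not reproduced), verifying model legitimacy reduces at its simplest to checking the deployed weights against a trusted digest:

```python
import hashlib

def attest(model_bytes: bytes, expected_digest: str) -> bool:
    """Toy integrity check: hash the deployed weights and compare to a
    trusted reference digest. This only shows the idea, not the paper's
    framework."""
    return hashlib.sha256(model_bytes).hexdigest() == expected_digest

weights = b"\x00\x01\x02..."  # stand-in for an on-device weight blob
ref = hashlib.sha256(weights).hexdigest()
print(attest(weights, ref))         # True: model is the one we expect
print(attest(weights + b"!", ref))  # False: weights were tampered with
```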
[2410.02099] A Watermark for Black-Box Language Models
LLMs

The paper presents a novel watermarking scheme for black-box language models, enabling detection of model outputs without requiring white-box access.

arXiv - Machine Learning · 3 min ·
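
This paper's black-box scheme is not reproduced here. For context, many LLM watermark detectors follow the Kirchenbauer-style red/green-list recipe: a secret rule marks a fraction of the vocabulary "green", generation favors green tokens, and detection is a one-proportion z-test on the green-token count. A self-contained toy version, with an even/odd rule standing in for the secret hash:

```python
import math

def watermark_zscore(token_ids, green_fraction=0.5, is_green=lambda t: t % 2 == 0):
    """One-proportion z-test: is the share of 'green' tokens suspiciously high?

    `is_green` stands in for the secret hash-based green-list; here a toy
    rule (even token ids) keeps the sketch self-contained.
    """
    n = len(token_ids)
    g = sum(1 for t in token_ids if is_green(t))
    return (g - green_fraction * n) / math.sqrt(n * green_fraction * (1 - green_fraction))

# Unwatermarked text hovers near z = 0; watermarked text scores far above it.
print(watermark_zscore([1, 2, 3, 4, 5, 6, 7, 8]))      # ~0
print(watermark_zscore([2, 4, 6, 8, 10, 12, 14, 16]))  # ~2.8 -> likely watermarked
```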
[2507.23465] Role-Aware Language Models for Secure and Contextualized Access Control in Organizations
LLMs

This article explores the development of role-aware language models designed to enhance access control in organizational settings, focusing...

arXiv - AI · 3 min ·
[2507.10846] Winsor-CAM: Human-Tunable Visual Explanations from Deep Networks via Layer-Wise Winsorization
Machine Learning

Winsor-CAM introduces a novel method for visual explanations in deep networks, enhancing interpretability through human-tunable parameters.

arXiv - Machine Learning · 4 min ·
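
Per the title, the tunable ingredient is layer-wise winsorization, i.e. clipping values at chosen percentiles. The core operation is easy to sketch with NumPy; the percentile knob and the toy heatmap below are illustrative, not Winsor-CAM's defaults.

```python
import numpy as np

def winsorize(a: np.ndarray, lower_pct: float = 5.0, upper_pct: float = 95.0) -> np.ndarray:
    """Clip values to the given percentiles, suppressing outliers while
    preserving the rest of the map."""
    lo, hi = np.percentile(a, [lower_pct, upper_pct])
    return np.clip(a, lo, hi)

heat = np.array([0.0, 0.1, 0.1, 0.2, 0.2, 0.3, 9.0])  # one outlier activation
print(winsorize(heat))  # the 9.0 spike is pulled down; the rest survive
```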
[2506.07751] AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking
LLMs

The paper presents AbstRaL, a method to enhance large language models' reasoning capabilities by reinforcing abstract thinking, particularly...

arXiv - AI · 4 min ·
[2602.03596] SAGE-5GC: Security-Aware Guidelines for Evaluating Anomaly Detection in the 5G Core Network
Machine Learning

The paper presents SAGE-5GC, a set of security-aware guidelines for evaluating anomaly detection in the 5G Core Network, addressing challenges...

arXiv - Machine Learning · 4 min ·
[2505.20181] The Problem of Algorithmic Collisions: Mitigating Unforeseen Risks in a Connected World
Robotics

The paper discusses the systemic risks posed by algorithmic collisions in interconnected AI systems, highlighting the need for improved g...

arXiv - AI · 4 min ·
[2505.16789] Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
LLMs

The paper explores how fine-tuning large language models can unintentionally create vulnerabilities, analyzing factors like dataset characteristics...

arXiv - Machine Learning · 3 min ·
[2505.16670] BitHydra: Towards Bit-flip Inference Cost Attack against Large Language Models
LLMs

The paper presents BitHydra, a framework for executing bit-flip inference cost attacks on large language models (LLMs), demonstrating how...

arXiv - AI · 4 min ·
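
BitHydra's attack strategy is not reproduced below, but the primitive it relies on is easy to show: flipping a single bit of a float32 weight can change its magnitude by dozens of orders of magnitude, which is why such faults can badly degrade or hijack model behavior.

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of a float32 and reinterpret the result, modeling a
    single hardware fault (the attack's targeting logic is omitted)."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", x))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped

w = 0.75
print(flip_bit(w, 30))  # flips the top exponent bit: 0.75 -> ~2.5e+38
```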
[2602.01428] Improving the Trade-off Between Watermark Strength and Speculative Sampling Efficiency for Language Models
LLMs

This paper explores the balance between watermark strength and speculative sampling efficiency in language models, proposing a new approach...

arXiv - Machine Learning · 4 min ·
[2601.18650] FaLW: A Forgetting-aware Loss Reweighting for Long-tailed Unlearning
Machine Learning

The paper introduces FaLW, a novel method for machine unlearning that addresses challenges in long-tailed data scenarios, enhancing data ...

arXiv - AI · 4 min ·
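
FaLW's forgetting-aware weighting is not reproduced here. As background, the generic loss-reweighting move for long-tailed data is to upweight rare classes, for example by inverse class frequency; everything below is a toy illustration.

```python
import numpy as np

def inverse_frequency_weights(labels: np.ndarray) -> np.ndarray:
    """Per-sample weights proportional to 1 / class frequency, the generic
    reweighting idea (not FaLW's forgetting-aware variant)."""
    classes, counts = np.unique(labels, return_counts=True)
    freq = dict(zip(classes, counts / len(labels)))
    w = np.array([1.0 / freq[y] for y in labels])
    return w / w.mean()  # normalize so the average weight is 1

labels = np.array([0] * 90 + [1] * 10)  # long-tailed: class 1 is rare
print(inverse_frequency_weights(labels)[[0, 95]])  # head ~0.56, tail ~5.0
```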
[2601.18696] Explainability Methods for Hardware Trojan Detection: A Systematic Comparison
Machine Learning

This article systematically compares various explainability methods for detecting hardware trojans, focusing on their effectiveness in pr...

arXiv - Machine Learning · 4 min ·
[2504.04717] Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
LLMs

This article surveys advancements in multi-turn interactions with large language models (LLMs), focusing on evaluation methods, challenges...

arXiv - AI · 4 min ·
[2503.23377] JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
Machine Learning

The paper presents JavisDiT, a novel Joint Audio-Video Diffusion Transformer that enhances synchronized audio-video generation through hierarchical spatio-temporal prior synchronization.

arXiv - AI · 4 min ·
[2601.03612] Mathematical Foundations of Polyphonic Music Generation via Structural Inductive Bias
NLP

This article presents a novel approach to polyphonic music generation using structural inductive bias, focusing on Beethoven's piano sonatas.

arXiv - Machine Learning · 3 min ·
[2512.20821] Divided We Fall: Defending Against Adversarial Attacks via Soft-Gated Fractional Mixture-of-Experts with Randomized Adversarial Training
Machine Learning

The paper presents a novel defense mechanism against adversarial attacks in machine learning using a soft-gated fractional mixture-of-experts with randomized adversarial training.

arXiv - Machine Learning · 4 min ·