AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

AI Safety

NHS staff resist using Palantir software. Staff reportedly cite ethics concerns, privacy worries, and doubts that the platform adds much value.

Reddit - Artificial Intelligence · 1 min ·
Machine Learning

AI assistants are optimized to seem helpful. That is not the same thing as being helpful.

RLHF trains models on human feedback. Humans rate responses they like. And it turns out humans consistently rate confident, fluent, agree...

Reddit - Artificial Intelligence · 1 min ·
AI Safety

House Democrat Questions Anthropic on AI Safety After Source Code Leak

Rep. Josh Gottheimer, who is generally tough on China, just sent a letter to Anthropic questioning their decision to reduce certain safet...

Reddit - Artificial Intelligence · 1 min ·

All Content

[2602.18171] Click it or Leave it: Detecting and Spoiling Clickbait with Informativeness Measures and Large Language Models
LLMs

This paper presents a hybrid approach to detecting clickbait using large language models and informativeness measures, achieving a high F...

arXiv - AI · 3 min ·
[2602.18154] FENCE: A Financial and Multimodal Jailbreak Detection Dataset
LLMs

The paper presents FENCE, a bilingual multimodal dataset designed for detecting jailbreaks in financial applications, highlighting vulner...

arXiv - AI · 3 min ·
[2602.17894] Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget
Machine Learning

This paper explores optimal data collection strategies from biased and costly sources, focusing on maximizing effective sample size under...

arXiv - Machine Learning · 4 min ·
[2602.17837] TFL: Targeted Bit-Flip Attack on Large Language Model
LLMs

The paper presents TFL, a targeted bit-flip attack framework for large language models (LLMs) that allows precise manipulation of outputs...

arXiv - Machine Learning · 4 min ·
[2602.18094] OODBench: Out-of-Distribution Benchmark for Large Vision-Language Models
LLMs

The paper introduces OODBench, a benchmark for evaluating large vision-language models' performance on out-of-distribution (OOD) data, hi...

arXiv - AI · 4 min ·
[2602.18092] Perceived Political Bias in LLMs Reduces Persuasive Abilities
LLMs

This article explores how perceived political bias in large language models (LLMs) can diminish their effectiveness in persuasion, reveal...

arXiv - AI · 3 min ·
[2602.18045] Conformal Tradeoffs: Guarantees Beyond Coverage
NLP

This article presents a framework for operational certification in conformal predictors, focusing on trade-offs beyond mere coverage, and...

arXiv - AI · 4 min ·
[2602.18029] Towards More Standardized AI Evaluation: From Models to Agents
Machine Learning

This paper discusses the evolution of AI evaluation from static models to dynamic agents, emphasizing the need for standardized evaluatio...

arXiv - AI · 3 min ·
[2602.18019] DeepSVU: Towards In-depth Security-oriented Video Understanding via Unified Physical-world Regularized MoE
Computer Vision

The paper introduces DeepSVU, a novel approach for Security-oriented Video Understanding that identifies threats and evaluates their caus...

arXiv - AI · 4 min ·
[2602.17770] CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild
LLMs

The paper introduces CLUTCH, a novel model for generating hand motions from text, leveraging a new dataset and advanced techniques to imp...

arXiv - Machine Learning · 4 min ·
[2602.17730] Clever Materials: When Models Identify Good Materials for the Wrong Reasons
Machine Learning

This article examines the limitations of machine learning in materials discovery, highlighting that high performance on benchmarks may st...

arXiv - Machine Learning · 3 min ·
[2602.17973] PenTiDef: Enhancing Privacy and Robustness in Decentralized Federated Intrusion Detection Systems against Poisoning Attacks
AI Infrastructure

The paper presents PenTiDef, a novel framework designed to enhance privacy and robustness in decentralized federated intrusion detection ...

arXiv - AI · 4 min ·
[2602.17951] ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models
LLMs

The paper presents ROCKET, a novel framework for enhancing Vision-Language-Action models by employing residual-oriented multi-layer align...

arXiv - AI · 4 min ·
[2602.18403] Scientific Knowledge-Guided Machine Learning for Vessel Power Prediction: A Comparative Study
Machine Learning

This study presents a hybrid modeling framework that combines scientific knowledge with machine learning to improve vessel power predicti...

arXiv - Machine Learning · 4 min ·
[2602.18396] PRISM-FCP: Byzantine-Resilient Federated Conformal Prediction via Partial Sharing
Machine Learning

The paper presents PRISM-FCP, a Byzantine-resilient framework for federated conformal prediction that enhances robustness against attacks...

arXiv - Machine Learning · 4 min ·
[2602.17881] Understanding Unreliability of Steering Vectors in Language Models: Geometric Predictors and the Limits of Linear Approximations
LLMs

This paper explores the unreliability of steering vectors in language models, examining how geometric predictors and linear approximation...

arXiv - Machine Learning · 3 min ·
[2602.17875] MultiVer: Zero-Shot Multi-Agent Vulnerability Detection
LLMs

The paper presents MultiVer, a zero-shot multi-agent system for vulnerability detection that outperforms fine-tuned models in recall, ach...

arXiv - AI · 3 min ·
[2602.18333] On the "Induction Bias" in Sequence Models
LLMs

This paper examines the "induction bias" in sequence models, focusing on the limitations of transformer-based models in state tracking co...

arXiv - Machine Learning · 4 min ·
[2602.17871] Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models
LLMs

This paper explores the fine-grained knowledge capabilities of vision-language models (VLMs), highlighting their performance on visual qu...

arXiv - Machine Learning · 3 min ·
[2602.18297] Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory
LLMs

This paper explores the monitorability of chain-of-thought (CoT) systems in LLMs using information theory, identifying errors that affect...

arXiv - Machine Learning · 4 min ·