AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

Machine Learning

[D] I had an idea, would love your thoughts

What happens if, while training an AI during pre-training, we make it such that if it shows "misaligned behaviour" then we just reduce like ...

Reddit - Machine Learning · 1 min ·
AI Safety

Newsom signs executive order requiring AI companies to have safety, privacy guardrails

submitted by /u/Fcking_Chuck

Reddit - Artificial Intelligence · 1 min ·

All Content

[2506.09886] Probabilistic distances-based hallucination detection in LLMs with RAG
LLMs

This paper presents a novel method for detecting hallucinations in large language models (LLMs) using probabilistic distances in retrieva...

arXiv - AI · 3 min ·
[2506.07452] When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
LLMs

This paper explores the vulnerabilities of large language models (LLMs) to superficial style alignment, proposing a defense mechanism cal...

arXiv - Machine Learning · 4 min ·
[2506.06060] Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models
LLMs

This article discusses the privacy risks associated with federated fine-tuning of large language models, highlighting methods for extract...

arXiv - AI · 4 min ·
[2504.06533] Rethinking Flexible Graph Similarity Computation: One-step Alignment with Global Guidance
Machine Learning

The paper presents a novel approach to graph similarity computation through the Graph Edit Network (GEN), which integrates cost-aware est...

arXiv - Machine Learning · 4 min ·
[2406.17115] Measuring the Measurers: Quality Evaluation of Hallucination Benchmarks for Large Vision-Language Models
LLMs

This article evaluates the quality of hallucination benchmarks for Large Vision-Language Models (LVLMs) and introduces a new framework fo...

arXiv - AI · 4 min ·
[2601.10402] Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
LLMs

The paper discusses advancements in AI towards ultra-long-horizon autonomy, introducing ML-Master 2.0, which utilizes Hierarchical Cognit...

arXiv - AI · 4 min ·
[2510.19139] A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist
LLMs

This paper evaluates the cognitive abilities of large language models (LLMs) in assessing clinical trial reporting according to CONSORT s...

arXiv - AI · 4 min ·
[2508.07667] 1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning
LLMs

The paper presents a multi-agent framework to enhance contextual privacy in large language models (LLMs), demonstrating a significant red...

arXiv - AI · 3 min ·
[2506.10947] Spurious Rewards: Rethinking Training Signals in RLVR
LLMs

The paper explores the impact of spurious rewards in reinforcement learning with verifiable rewards (RLVR), demonstrating how they can en...

arXiv - Machine Learning · 4 min ·
[2505.13529] BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs
Machine Learning

The paper presents BARREL, a framework designed to enhance the factual reliability of Large Reasoning Models (LRMs) by addressing overcon...

arXiv - Machine Learning · 3 min ·
[2602.22197] Off-The-Shelf Image-to-Image Models Are All You Need To Defeat Image Protection Schemes
Machine Learning

This paper demonstrates that off-the-shelf image-to-image models can effectively defeat various image protection schemes, highlighting a ...

arXiv - AI · 4 min ·
[2602.22149] Enhancing Framingham Cardiovascular Risk Score Transparency through Logic-Based XAI
AI Safety

This article presents a logic-based explainable AI model designed to enhance the transparency of the Framingham Cardiovascular Risk Score...

arXiv - AI · 4 min ·
[2602.22146] Provable Last-Iterate Convergence for Multi-Objective Safe LLM Alignment via Optimistic Primal-Dual
LLMs

This article presents a novel optimistic primal-dual framework for safe reinforcement learning from human feedback (RLHF) in large langua...

arXiv - Machine Learning · 4 min ·
[2602.22145] When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models
LLMs

This article explores the phenomenon of 'Cultural Ghosting' in large language models (LLMs), highlighting the systematic erasure of cultu...

arXiv - AI · 4 min ·
[2602.22144] NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
LLMs

The paper presents NoLan, a framework aimed at reducing object hallucinations in Large Vision-Language Models (LVLMs) by dynamically supp...

arXiv - AI · 4 min ·
[2602.22072] Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models
LLMs

This article explores the robustness of Theory of Mind (ToM) in large language models (LLMs) through perturbation tasks, revealing signif...

arXiv - AI · 3 min ·
[2602.21939] Hidden Topics: Measuring Sensitive AI Beliefs with List Experiments
LLMs

This paper explores how list experiments can be used to uncover hidden beliefs in large language models (LLMs), revealing concerning appr...

arXiv - AI · 3 min ·
[2602.21841] Resilient Federated Chain: Transforming Blockchain Consensus into an Active Defense Layer for Federated Learning
Machine Learning

The paper presents the Resilient Federated Chain (RFC), a blockchain-enabled framework designed to enhance the security of Federated Lear...

arXiv - AI · 4 min ·
[2602.21845] xai-cola: A Python library for sparsifying counterfactual explanations
AI Infrastructure

The article introduces xai-cola, an open-source Python library designed to sparsify counterfactual explanations, enhancing interpretabili...

arXiv - Machine Learning · 3 min ·
[2602.21829] StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles
Machine Learning

The paper introduces StoryMovie, a dataset designed for aligning visual stories with movie scripts and subtitles, enhancing dialogue attr...

arXiv - AI · 3 min ·