AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

Machine Learning

[D] I had an idea, would love your thoughts

What happens if, while training an AI during pre-training, we make it such that if it shows "misaligned behaviour" then we just reduce like ...

Reddit - Machine Learning · 1 min ·
AI Safety

Newsom signs executive order requiring AI companies to have safety, privacy guardrails

submitted by /u/Fcking_Chuck

Reddit - Artificial Intelligence · 1 min ·

All Content

[2506.09886] Probabilistic distances-based hallucination detection in LLMs with RAG
LLMs

This paper presents a novel method for detecting hallucinations in large language models (LLMs) using probabilistic distances in retrieva...

arXiv - AI · 3 min ·
[2506.07452] When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
LLMs

This paper explores the vulnerabilities of large language models (LLMs) to superficial style alignment, proposing a defense mechanism cal...

arXiv - Machine Learning · 4 min ·
[2506.06060] Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models
LLMs

This article discusses the privacy risks associated with federated fine-tuning of large language models, highlighting methods for extract...

arXiv - AI · 4 min ·
[2504.06533] Rethinking Flexible Graph Similarity Computation: One-step Alignment with Global Guidance
Machine Learning

The paper presents a novel approach to graph similarity computation through the Graph Edit Network (GEN), which integrates cost-aware est...

arXiv - Machine Learning · 4 min ·
[2406.17115] Measuring the Measurers: Quality Evaluation of Hallucination Benchmarks for Large Vision-Language Models
LLMs

This article evaluates the quality of hallucination benchmarks for Large Vision-Language Models (LVLMs) and introduces a new framework fo...

arXiv - AI · 4 min ·
[2601.10402] Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
LLMs

The paper discusses advancements in AI towards ultra-long-horizon autonomy, introducing ML-Master 2.0, which utilizes Hierarchical Cognit...

arXiv - AI · 4 min ·
[2510.19139] A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist
LLMs

This paper evaluates the cognitive abilities of large language models (LLMs) in assessing clinical trial reporting according to CONSORT s...

arXiv - AI · 4 min ·
[2508.07667] 1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning
LLMs

The paper presents a multi-agent framework to enhance contextual privacy in large language models (LLMs), demonstrating a significant red...

arXiv - AI · 3 min ·
[2506.10947] Spurious Rewards: Rethinking Training Signals in RLVR
LLMs

The paper explores the impact of spurious rewards in reinforcement learning with verifiable rewards (RLVR), demonstrating how they can en...

arXiv - Machine Learning · 4 min ·
[2505.13529] BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs
Machine Learning

The paper presents BARREL, a framework designed to enhance the factual reliability of Large Reasoning Models (LRMs) by addressing overcon...

arXiv - Machine Learning · 3 min ·
[2602.22197] Off-The-Shelf Image-to-Image Models Are All You Need To Defeat Image Protection Schemes
Machine Learning

This paper demonstrates that off-the-shelf image-to-image models can effectively defeat various image protection schemes, highlighting a ...

arXiv - AI · 4 min ·
[2602.22149] Enhancing Framingham Cardiovascular Risk Score Transparency through Logic-Based XAI
AI Safety

This article presents a logic-based explainable AI model designed to enhance the transparency of the Framingham Cardiovascular Risk Score...

arXiv - AI · 4 min ·
[2602.22146] Provable Last-Iterate Convergence for Multi-Objective Safe LLM Alignment via Optimistic Primal-Dual
LLMs

This article presents a novel optimistic primal-dual framework for safe reinforcement learning from human feedback (RLHF) in large langua...

arXiv - Machine Learning · 4 min ·
[2602.22145] When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models
LLMs

This article explores the phenomenon of 'Cultural Ghosting' in large language models (LLMs), highlighting the systematic erasure of cultu...

arXiv - AI · 4 min ·
[2602.22144] NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
LLMs

The paper presents NoLan, a framework aimed at reducing object hallucinations in Large Vision-Language Models (LVLMs) by dynamically supp...

arXiv - AI · 4 min ·
[2602.22072] Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models
LLMs

This article explores the robustness of Theory of Mind (ToM) in large language models (LLMs) through perturbation tasks, revealing signif...

arXiv - AI · 3 min ·
[2602.21939] Hidden Topics: Measuring Sensitive AI Beliefs with List Experiments
LLMs

This paper explores how list experiments can be used to uncover hidden beliefs in large language models (LLMs), revealing concerning appr...

arXiv - AI · 3 min ·
[2602.21841] Resilient Federated Chain: Transforming Blockchain Consensus into an Active Defense Layer for Federated Learning
Machine Learning

The paper presents the Resilient Federated Chain (RFC), a blockchain-enabled framework designed to enhance the security of Federated Lear...

arXiv - AI · 4 min ·
[2602.21845] xai-cola: A Python library for sparsifying counterfactual explanations
AI Infrastructure

The article introduces xai-cola, an open-source Python library designed to sparsify counterfactual explanations, enhancing interpretabili...

arXiv - Machine Learning · 3 min ·
[2602.21829] StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles
Machine Learning

The paper introduces StoryMovie, a dataset designed for aligning visual stories with movie scripts and subtitles, enhancing dialogue attr...

arXiv - AI · 3 min ·