AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

Machine Learning

[D] I had an idea, would love your thoughts

What if, while training an AI during pre-training, we make it such that if it exhibits "misaligned behaviour" we just reduce, like ...

Reddit - Machine Learning · 1 min ·
AI Safety

Newsom signs executive order requiring AI companies to have safety, privacy guardrails

Reddit - Artificial Intelligence · 1 min ·

All Content

[2602.21779] Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models
LLMs

This paper introduces a forensic benchmark for evaluating video deepfake reasoning in vision-language models, focusing on temporal incons...

arXiv - AI · 4 min ·
[2602.21765] Generalisation of RLHF under Reward Shift and Clipped KL Regularisation
LLMs

This paper explores the generalization of Reinforcement Learning from Human Feedback (RLHF) under conditions of reward shift and clipped ...

arXiv - Machine Learning · 4 min ·
[2602.21720] Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning
AI Safety

This article explores the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning, dem...

arXiv - AI · 3 min ·
[2602.21704] Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models
LLMs

This paper presents Dynamic Multimodal Activation Steering, a novel approach to mitigate hallucinations in Large Vision-Language Models (...

arXiv - AI · 3 min ·
[2602.21613] Virtual Biopsy for Intracranial Tumors Diagnosis on MRI
AI Safety

This article presents a novel Virtual Biopsy framework for diagnosing intracranial tumors using MRI, addressing the challenges of traditi...

arXiv - AI · 4 min ·
[2602.21584] Exploring Human-Machine Coexistence in Symmetrical Reality
AI Safety

This paper explores the evolving relationship between humans and AI, proposing a framework for harmonious coexistence termed 'symmetrical...

arXiv - AI · 3 min ·
[2602.21543] Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment
Machine Learning

This paper presents a method for enhancing multilingual embeddings through multi-way parallel text alignment, demonstrating improved cros...

arXiv - AI · 3 min ·
[2602.21515] Training Generalizable Collaborative Agents via Strategic Risk Aversion
Machine Learning

This paper explores training strategies for collaborative agents, emphasizing strategic risk aversion to enhance generalizability and rob...

arXiv - Machine Learning · 4 min ·
[2602.21452] Adversarial Robustness of Deep Learning-Based Thyroid Nodule Segmentation in Ultrasound
Machine Learning

This article evaluates the adversarial robustness of deep learning models for thyroid nodule segmentation in ultrasound images, highlight...

arXiv - AI · 4 min ·
[2602.21447] Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG
Machine Learning

The paper presents a novel framework, MMA-RAG^T, for enhancing the security of multimodal agentic retrieval-augmented generation systems ...

arXiv - Machine Learning · 4 min ·
[2602.21442] MINAR: Mechanistic Interpretability for Neural Algorithmic Reasoning
LLMs

The paper introduces MINAR, a toolbox for mechanistic interpretability in neural algorithmic reasoning, enhancing understanding of GNNs' ...

arXiv - Machine Learning · 3 min ·
[2602.21429] Provably Safe Generative Sampling with Constricting Barrier Functions
Machine Learning

This paper presents a safety filtering framework for generative models, ensuring generated samples meet hard constraints while minimizing...

arXiv - Machine Learning · 4 min ·
[2602.21420] Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
LLMs

This paper introduces the Asymmetric Confidence-aware Error Penalty (ACE) to enhance reinforcement learning by addressing overconfident e...

arXiv - Machine Learning · 4 min ·
[2602.21374] Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages
LLMs

This study explores the use of small language models for extracting clinical information from low-resource languages, focusing on a priva...

arXiv - Machine Learning · 4 min ·
[2602.21372] The Mean is the Mirage: Entropy-Adaptive Model Merging under Heterogeneous Domain Shifts in Medical Imaging
Machine Learning

This article presents an entropy-adaptive model merging technique for medical imaging that addresses challenges posed by heterogeneous do...

arXiv - Machine Learning · 4 min ·
[2602.21368] Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
AI Infrastructure

This paper presents a method for certifying the reliability of black-box AI systems using self-consistency sampling and conformal calibra...

arXiv - Machine Learning · 3 min ·
[2602.21346] Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment
LLMs

This article presents a novel approach to enhance safety alignment in large language models (LLMs) through Alignment-Weighted Direct Pref...

arXiv - AI · 4 min ·
[2602.21327] Equitable Evaluation via Elicitation
AI Startups

The paper discusses an AI-driven approach for equitable skill evaluation, addressing biases in self-presentation among job seekers. It pr...

arXiv - Machine Learning · 3 min ·
[2602.21269] Group Orthogonalized Policy Optimization: Group Policy Optimization as Orthogonal Projection in Hilbert Space
LLMs

The paper introduces Group Orthogonalized Policy Optimization (GOPO), a novel algorithm for aligning large language models using Hilbert ...

arXiv - Machine Learning · 4 min ·
[2602.21267] A Systematic Review of Algorithmic Red Teaming Methodologies for Assurance and Security of AI Applications
AI Safety

This systematic review explores automated red teaming methodologies for enhancing the security of AI applications, addressing the limitat...

arXiv - AI · 3 min ·