AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

Machine Learning

[D] I had an idea, would love your thoughts

What if, while training an AI during pre-training, we make it such that if it exhibits "misaligned behaviour" we just reduce, like ...

Reddit - Machine Learning · 1 min ·
AI Safety

Newsom signs executive order requiring AI companies to have safety, privacy guardrails

Reddit - Artificial Intelligence · 1 min ·

All Content

[2602.21779] Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models
LLMs

This paper introduces a forensic benchmark for evaluating video deepfake reasoning in vision-language models, focusing on temporal incons...

arXiv - AI · 4 min ·
[2602.21765] Generalisation of RLHF under Reward Shift and Clipped KL Regularisation
LLMs

This paper explores the generalization of Reinforcement Learning from Human Feedback (RLHF) under conditions of reward shift and clipped ...

arXiv - Machine Learning · 4 min ·
[2602.21720] Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning
AI Safety

This article explores the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning, dem...

arXiv - AI · 3 min ·
[2602.21704] Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models
LLMs

This paper presents Dynamic Multimodal Activation Steering, a novel approach to mitigate hallucinations in Large Vision-Language Models (...

arXiv - AI · 3 min ·
[2602.21613] Virtual Biopsy for Intracranial Tumors Diagnosis on MRI
AI Safety

This article presents a novel Virtual Biopsy framework for diagnosing intracranial tumors using MRI, addressing the challenges of traditi...

arXiv - AI · 4 min ·
[2602.21584] Exploring Human-Machine Coexistence in Symmetrical Reality
AI Safety

This paper explores the evolving relationship between humans and AI, proposing a framework for harmonious coexistence termed 'symmetrical...

arXiv - AI · 3 min ·
[2602.21543] Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment
Machine Learning

This paper presents a method for enhancing multilingual embeddings through multi-way parallel text alignment, demonstrating improved cros...

arXiv - AI · 3 min ·
[2602.21515] Training Generalizable Collaborative Agents via Strategic Risk Aversion
Machine Learning

This paper explores training strategies for collaborative agents, emphasizing strategic risk aversion to enhance generalizability and rob...

arXiv - Machine Learning · 4 min ·
[2602.21452] Adversarial Robustness of Deep Learning-Based Thyroid Nodule Segmentation in Ultrasound
Machine Learning

This article evaluates the adversarial robustness of deep learning models for thyroid nodule segmentation in ultrasound images, highlight...

arXiv - AI · 4 min ·
[2602.21447] Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG
Machine Learning

The paper presents a novel framework, MMA-RAG^T, for enhancing the security of multimodal agentic retrieval-augmented generation systems ...

arXiv - Machine Learning · 4 min ·
[2602.21442] MINAR: Mechanistic Interpretability for Neural Algorithmic Reasoning
LLMs

The paper introduces MINAR, a toolbox for mechanistic interpretability in neural algorithmic reasoning, enhancing understanding of GNNs' ...

arXiv - Machine Learning · 3 min ·
[2602.21429] Provably Safe Generative Sampling with Constricting Barrier Functions
Machine Learning

This paper presents a safety filtering framework for generative models, ensuring generated samples meet hard constraints while minimizing...

arXiv - Machine Learning · 4 min ·
[2602.21420] Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
LLMs

This paper introduces the Asymmetric Confidence-aware Error Penalty (ACE) to enhance reinforcement learning by addressing overconfident e...

arXiv - Machine Learning · 4 min ·
[2602.21374] Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages
LLMs

This study explores the use of small language models for extracting clinical information from low-resource languages, focusing on a priva...

arXiv - Machine Learning · 4 min ·
[2602.21372] The Mean is the Mirage: Entropy-Adaptive Model Merging under Heterogeneous Domain Shifts in Medical Imaging
Machine Learning

This article presents an entropy-adaptive model merging technique for medical imaging that addresses challenges posed by heterogeneous do...

arXiv - Machine Learning · 4 min ·
[2602.21368] Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
AI Infrastructure

This paper presents a method for certifying the reliability of black-box AI systems using self-consistency sampling and conformal calibra...

arXiv - Machine Learning · 3 min ·
[2602.21346] Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment
LLMs

This article presents a novel approach to enhance safety alignment in large language models (LLMs) through Alignment-Weighted Direct Pref...

arXiv - AI · 4 min ·
[2602.21327] Equitable Evaluation via Elicitation
AI Startups

The paper discusses an AI-driven approach for equitable skill evaluation, addressing biases in self-presentation among job seekers. It pr...

arXiv - Machine Learning · 3 min ·
[2602.21269] Group Orthogonalized Policy Optimization: Group Policy Optimization as Orthogonal Projection in Hilbert Space
LLMs

The paper introduces Group Orthogonalized Policy Optimization (GOPO), a novel algorithm for aligning large language models using Hilbert ...

arXiv - Machine Learning · 4 min ·
[2602.21267] A Systematic Review of Algorithmic Red Teaming Methodologies for Assurance and Security of AI Applications
AI Safety

This systematic review explores automated red teaming methodologies for enhancing the security of AI applications, addressing the limitat...

arXiv - AI · 3 min ·