AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

Machine Learning

[D] I had an idea, would love your thoughts

What happens if, while pre-training an AI, we make it so that whenever it produces "misaligned behaviour" we just reduce ...

Reddit - Machine Learning · 1 min ·
AI Safety

Newsom signs executive order requiring AI companies to have safety, privacy guardrails

Submitted by /u/Fcking_Chuck

Reddit - Artificial Intelligence · 1 min ·

All Content

Machine Learning

[2602.22973] Modeling Expert AI Diagnostic Alignment via Immutable Inference Snapshots

The paper presents a framework for improving AI diagnostic alignment in clinical settings by preserving AI-generated reports as immutable...

arXiv - AI · 4 min ·
Machine Learning

[2602.22968] Certified Circuits: Stability Guarantees for Mechanistic Circuits

The paper introduces Certified Circuits, a framework that enhances the stability and accuracy of circuit discovery in neural networks, ad...

arXiv - AI · 3 min ·
LLMs

[2602.22963] FactGuard: Agentic Video Misinformation Detection via Reinforcement Learning

FactGuard introduces an innovative framework for detecting video misinformation using reinforcement learning, enhancing the capabilities ...

arXiv - AI · 3 min ·
LLMs

[2602.22953] General Agent Evaluation

This paper introduces a framework for evaluating general-purpose agents, proposing a Unified Protocol and Exgentic framework, and benchma...

arXiv - AI · 3 min ·
LLMs

[2602.22660] LEDA: Latent Semantic Distribution Alignment for Multi-domain Graph Pre-training

The paper presents LEDA, a novel model for universal graph pre-training that addresses challenges in aligning diverse graph data and enha...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.22879] Towards LLM-Empowered Knowledge Tracing via LLM-Student Hierarchical Behavior Alignment in Hyperbolic Space

This article presents a novel approach to knowledge tracing using a Large Language Model (LLM) to enhance the understanding of student le...

arXiv - AI · 4 min ·
Machine Learning

[2602.22633] Tackling Privacy Heterogeneity in Differentially Private Federated Learning

This article presents a novel approach to address privacy heterogeneity in differentially private federated learning (DP-FL), proposing a...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.22814] When Should an AI Act? A Human-Centered Model of Scene, Context, and Behavior for Agentic AI Design

This article presents a human-centered model for agentic AI design, focusing on when AI should act based on contextual understanding and ...

arXiv - AI · 3 min ·
Machine Learning

[2602.22611] Mitigating Membership Inference in Intermediate Representations via Layer-wise MIA-risk-aware DP-SGD

This paper presents Layer-wise MIA-risk-aware DP-SGD, a method to reduce Membership Inference Attack risks in machine learning models by ...

arXiv - Machine Learning · 4 min ·
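For readers unfamiliar with the baseline this paper builds on, standard DP-SGD (Abadi et al., 2016) clips each per-example gradient to a fixed norm and adds Gaussian noise before the optimizer step. The layer-wise, MIA-risk-aware weighting the paper proposes is not reproduced here; the following is a minimal PyTorch-style sketch of vanilla DP-SGD, assuming a simple model and a small batch of tensors:

    import torch

    def dp_sgd_step(model, loss_fn, xs, ys, optimizer, clip_norm=1.0, noise_mult=1.0):
        # Vanilla DP-SGD: clip each per-example gradient, sum, add Gaussian
        # noise scaled to the clipping norm, then average and take a step.
        params = [p for p in model.parameters() if p.requires_grad]
        summed = [torch.zeros_like(p) for p in params]
        for x, y in zip(xs, ys):  # per-example gradients via microbatching
            optimizer.zero_grad()
            loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
            grads = [p.grad.detach().clone() for p in params]
            total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
            scale = min(1.0, clip_norm / (total_norm.item() + 1e-12))
            for s, g in zip(summed, grads):
                s.add_(g * scale)
        optimizer.zero_grad()
        for p, s in zip(params, summed):
            noise = torch.randn_like(s) * noise_mult * clip_norm
            p.grad = (s + noise) / len(xs)  # noisy average becomes the update direction
        optimizer.step()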
Machine Learning

[2602.22610] DP-aware AdaLN-Zero: Taming Conditioning-Induced Heavy-Tailed Gradients in Differentially Private Diffusion

The paper introduces DP-aware AdaLN-Zero, a novel mechanism to mitigate heavy-tailed gradients in differentially private diffusion models...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.22771] ClinDet-Bench: Beyond Abstention, Evaluating Judgment Determinability of LLMs in Clinical Decision-Making

The paper presents ClinDet-Bench, a benchmark for evaluating the judgment determinability of large language models (LLMs) in clinical dec...

arXiv - AI · 3 min ·
Machine Learning

[2602.22601] $ϕ$-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models

The paper presents the $ϕ$-DPO framework, addressing fairness in continual learning for large multimodal models by optimizing preference ...

arXiv - Machine Learning · 4 min ·
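As background: standard Direct Preference Optimization pushes a policy to prefer the chosen response over the rejected one relative to a frozen reference model; how $ϕ$-DPO adds its fairness objective is not detailed in this summary. A minimal sketch of the plain DPO loss, assuming per-sequence log-probabilities are already computed:

    import torch.nn.functional as F

    def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
        # Standard DPO (Rafailov et al., 2023): -log sigmoid(beta * margin), where the
        # margin compares policy-vs-reference log-prob gaps on chosen vs. rejected responses.
        margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
        return -F.logsigmoid(beta * margin).mean()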
LLMs

[2602.22769] AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

The paper introduces AMA-Bench, a new benchmark for evaluating long-horizon memory in Large Language Models (LLMs) for agentic applicatio...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.22600] Transformers converge to invariant algorithmic cores

The paper explores how transformers, despite varying weights, converge to invariant algorithmic cores essential for task performance, rev...

arXiv - AI · 3 min ·
Data Science

[2602.22758] Decomposing Physician Disagreement in HealthBench

This paper analyzes physician disagreement in the HealthBench dataset, identifying key factors contributing to variance in evaluations an...

arXiv - AI · 3 min ·
Machine Learning

[2602.22751] Know What You Know: Metacognitive Entropy Calibration for Verifiable RL Reasoning

The paper proposes EGPO, a metacognitive entropy calibration framework that integrates intrinsic uncertainty into reinforcement learning ...

arXiv - AI · 4 min ·
LLMs

[2602.22718] RLHFless: Serverless Computing for Efficient RLHF

The paper introduces RLHFless, a serverless computing framework designed to enhance the efficiency of Reinforcement Learning from Human F...

arXiv - AI · 4 min ·
Machine Learning

[2602.22702] Knob: A Physics-Inspired Gating Interface for Interpretable and Controllable Neural Dynamics

The paper introduces 'Knob', a physics-inspired framework that enhances neural network calibration by allowing dynamic adjustments to mod...

arXiv - AI · 4 min ·
Machine Learning

[2602.22560] Operationalizing Fairness: Post-Hoc Threshold Optimization Under Hard Resource Limits

This paper presents a framework for optimizing decision thresholds in machine learning to balance fairness and resource constraints, ensu...

arXiv - AI · 4 min ·
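The general idea of post-hoc threshold optimization is easy to illustrate: given model scores, group labels, and a hard cap on how many positives may be issued, pick per-group score thresholds that respect the cap while keeping selection rates comparable across groups. The sketch below is a generic allocation by group size, not the optimization procedure the paper proposes; names and parameters are illustrative.

    import numpy as np

    def budgeted_group_thresholds(scores, groups, budget):
        # Allocate the hard budget to each group in proportion to its size
        # (equal selection rates), then threshold at that group's k-th highest score.
        scores, groups = np.asarray(scores, float), np.asarray(groups)
        thresholds = {}
        for g in np.unique(groups):
            g_scores = np.sort(scores[groups == g])
            k = min(len(g_scores), int(round(budget * len(g_scores) / len(scores))))
            thresholds[g] = g_scores[-k] if k > 0 else np.inf  # inf means no selections
        return thresholds

    # Usage: candidate i is selected iff scores[i] >= thresholds[groups[i]]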
Machine Learning

[2602.22556] Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation

The paper presents a two-stage framework for enhancing large reasoning models (LRMs) by addressing overthinking in low-complexity queries...

arXiv - AI · 3 min ·