AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

Machine Learning

[D] I had an idea, would love your thoughts

What happens if, while pre-training an AI, we make it so that whenever it produces "misaligned behaviour" we just reduce ...

Reddit - Machine Learning · 1 min ·
AI Safety

Newsom signs executive order requiring AI companies to have safety, privacy guardrails

Submitted by /u/Fcking_Chuck

Reddit - Artificial Intelligence · 1 min ·

All Content

Machine Learning

[2602.22973] Modeling Expert AI Diagnostic Alignment via Immutable Inference Snapshots

The paper presents a framework for improving AI diagnostic alignment in clinical settings by preserving AI-generated reports as immutable...

arXiv - AI · 4 min ·
Machine Learning

[2602.22968] Certified Circuits: Stability Guarantees for Mechanistic Circuits

The paper introduces Certified Circuits, a framework that enhances the stability and accuracy of circuit discovery in neural networks, ad...

arXiv - AI · 3 min ·
LLMs

[2602.22963] FactGuard: Agentic Video Misinformation Detection via Reinforcement Learning

FactGuard introduces an innovative framework for detecting video misinformation using reinforcement learning, enhancing the capabilities ...

arXiv - AI · 3 min ·
LLMs

[2602.22953] General Agent Evaluation

This paper introduces a framework for evaluating general-purpose agents, proposing a Unified Protocol and Exgentic framework, and benchma...

arXiv - AI · 3 min ·
LLMs

[2602.22660] LEDA: Latent Semantic Distribution Alignment for Multi-domain Graph Pre-training

The paper presents LEDA, a novel model for universal graph pre-training that addresses challenges in aligning diverse graph data and enha...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.22879] Towards LLM-Empowered Knowledge Tracing via LLM-Student Hierarchical Behavior Alignment in Hyperbolic Space

This article presents a novel approach to knowledge tracing using a Large Language Model (LLM) to enhance the understanding of student le...

arXiv - AI · 4 min ·
Machine Learning

[2602.22633] Tackling Privacy Heterogeneity in Differentially Private Federated Learning

This article presents a novel approach to address privacy heterogeneity in differentially private federated learning (DP-FL), proposing a...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.22814] When Should an AI Act? A Human-Centered Model of Scene, Context, and Behavior for Agentic AI Design

This article presents a human-centered model for agentic AI design, focusing on when AI should act based on contextual understanding and ...

arXiv - AI · 3 min ·
Machine Learning

[2602.22611] Mitigating Membership Inference in Intermediate Representations via Layer-wise MIA-risk-aware DP-SGD

This paper presents Layer-wise MIA-risk-aware DP-SGD, a method to reduce Membership Inference Attack risks in machine learning models by ...

arXiv - Machine Learning · 4 min ·
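For readers unfamiliar with the baseline this paper builds on, standard DP-SGD (Abadi et al., 2016) clips each per-example gradient to a fixed norm and adds Gaussian noise before the optimizer step. The layer-wise, MIA-risk-aware weighting the paper proposes is not reproduced here; the following is a minimal PyTorch-style sketch of vanilla DP-SGD, assuming a simple model and a small batch of tensors:

    import torch

    def dp_sgd_step(model, loss_fn, xs, ys, optimizer, clip_norm=1.0, noise_mult=1.0):
        # Vanilla DP-SGD: clip each per-example gradient, sum, add Gaussian
        # noise scaled to the clipping norm, then average and take a step.
        params = [p for p in model.parameters() if p.requires_grad]
        summed = [torch.zeros_like(p) for p in params]
        for x, y in zip(xs, ys):  # per-example gradients via microbatching
            optimizer.zero_grad()
            loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
            grads = [p.grad.detach().clone() for p in params]
            total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
            scale = min(1.0, clip_norm / (total_norm.item() + 1e-12))
            for s, g in zip(summed, grads):
                s.add_(g * scale)
        optimizer.zero_grad()
        for p, s in zip(params, summed):
            noise = torch.randn_like(s) * noise_mult * clip_norm
            p.grad = (s + noise) / len(xs)  # noisy average becomes the update direction
        optimizer.step()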
Machine Learning

[2602.22610] DP-aware AdaLN-Zero: Taming Conditioning-Induced Heavy-Tailed Gradients in Differentially Private Diffusion

The paper introduces DP-aware AdaLN-Zero, a novel mechanism to mitigate heavy-tailed gradients in differentially private diffusion models...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.22771] ClinDet-Bench: Beyond Abstention, Evaluating Judgment Determinability of LLMs in Clinical Decision-Making

The paper presents ClinDet-Bench, a benchmark for evaluating the judgment determinability of large language models (LLMs) in clinical dec...

arXiv - AI · 3 min ·
Machine Learning

[2602.22601] $ϕ$-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models

The paper presents the $ϕ$-DPO framework, addressing fairness in continual learning for large multimodal models by optimizing preference ...

arXiv - Machine Learning · 4 min ·
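As background: standard Direct Preference Optimization pushes a policy to prefer the chosen response over the rejected one relative to a frozen reference model; how $ϕ$-DPO adds its fairness objective is not detailed in this summary. A minimal sketch of the plain DPO loss, assuming per-sequence log-probabilities are already computed:

    import torch.nn.functional as F

    def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
        # Standard DPO (Rafailov et al., 2023): -log sigmoid(beta * margin), where the
        # margin compares policy-vs-reference log-prob gaps on chosen vs. rejected responses.
        margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
        return -F.logsigmoid(beta * margin).mean()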
LLMs

[2602.22769] AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

The paper introduces AMA-Bench, a new benchmark for evaluating long-horizon memory in Large Language Models (LLMs) for agentic applicatio...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.22600] Transformers converge to invariant algorithmic cores

The paper explores how transformers, despite varying weights, converge to invariant algorithmic cores essential for task performance, rev...

arXiv - AI · 3 min ·
Data Science

[2602.22758] Decomposing Physician Disagreement in HealthBench

This paper analyzes physician disagreement in the HealthBench dataset, identifying key factors contributing to variance in evaluations an...

arXiv - AI · 3 min ·
Machine Learning

[2602.22751] Know What You Know: Metacognitive Entropy Calibration for Verifiable RL Reasoning

The paper proposes EGPO, a metacognitive entropy calibration framework that integrates intrinsic uncertainty into reinforcement learning ...

arXiv - AI · 4 min ·
LLMs

[2602.22718] RLHFless: Serverless Computing for Efficient RLHF

The paper introduces RLHFless, a serverless computing framework designed to enhance the efficiency of Reinforcement Learning from Human F...

arXiv - AI · 4 min ·
Machine Learning

[2602.22702] Knob: A Physics-Inspired Gating Interface for Interpretable and Controllable Neural Dynamics

The paper introduces 'Knob', a physics-inspired framework that enhances neural network calibration by allowing dynamic adjustments to mod...

arXiv - AI · 4 min ·
Machine Learning

[2602.22560] Operationalizing Fairness: Post-Hoc Threshold Optimization Under Hard Resource Limits

This paper presents a framework for optimizing decision thresholds in machine learning to balance fairness and resource constraints, ensu...

arXiv - AI · 4 min ·
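The general idea of post-hoc threshold optimization is easy to illustrate: given model scores, group labels, and a hard cap on how many positives may be issued, pick per-group score thresholds that respect the cap while keeping selection rates comparable across groups. The sketch below is a generic allocation by group size, not the optimization procedure the paper proposes; names and parameters are illustrative.

    import numpy as np

    def budgeted_group_thresholds(scores, groups, budget):
        # Allocate the hard budget to each group in proportion to its size
        # (equal selection rates), then threshold at that group's k-th highest score.
        scores, groups = np.asarray(scores, float), np.asarray(groups)
        thresholds = {}
        for g in np.unique(groups):
            g_scores = np.sort(scores[groups == g])
            k = min(len(g_scores), int(round(budget * len(g_scores) / len(scores))))
            thresholds[g] = g_scores[-k] if k > 0 else np.inf  # inf means no selections
        return thresholds

    # Usage: candidate i is selected iff scores[i] >= thresholds[groups[i]]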
Machine Learning

[2602.22556] Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation

The paper presents a two-stage framework for enhancing large reasoning models (LRMs) by addressing overthinking in low-complexity queries...

arXiv - AI · 3 min ·