AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

Machine Learning

[D] I had an idea, would love your thoughts

What happens if, while pre-training an AI, we make it such that whenever it exhibits "misaligned behaviour" we just reduce ...

Reddit - Machine Learning · 1 min ·
AI Safety

Newsom signs executive order requiring AI companies to have safety, privacy guardrails

Reddit - Artificial Intelligence · 1 min ·

All Content

[2602.22638] MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
LLMs

MobilityBench introduces a benchmark for evaluating LLM-based route-planning agents, addressing challenges in real-world mobility scenari...

arXiv - AI · 4 min ·
[2602.22554] Multilingual Safety Alignment Via Sparse Weight Editing
LLMs

This paper presents a novel framework for aligning safety measures in multilingual large language models (LLMs) through Sparse Weight Edi...

arXiv - Machine Learning · 3 min ·
[2602.22585] Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach
Machine Learning

This paper explores the integration of psychometric rater models into AI evaluation, aiming to correct human label biases and improve the...

arXiv - Machine Learning · 3 min ·
[2602.22557] CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety
LLMs

CourtGuard introduces a model-agnostic framework for zero-shot policy adaptation in LLM safety, enhancing adaptability and performance wi...

arXiv - Machine Learning · 3 min ·
[2602.22546] Requesting Expert Reasoning: Augmenting LLM Agents with Learned Collaborative Intervention
LLMs

This article presents a framework called AHCE for enhancing Large Language Model (LLM) agents through effective human collaboration, sign...

arXiv - AI · 3 min ·
[2602.22520] TEFL: Prediction-Residual-Guided Rolling Forecasting for Multi-Horizon Time Series
Machine Learning

The paper presents TEFL, a novel framework for multi-horizon time series forecasting that utilizes prediction residuals to enhance accura...

arXiv - Machine Learning · 4 min ·
[2602.22519] A Mathematical Theory of Agency and Intelligence
AI Agents

This paper presents a mathematical framework for understanding agency and intelligence in AI systems, introducing the concept of bipredic...

arXiv - AI · 4 min ·
[2602.22500] Mapping the Landscape of Artificial Intelligence in Life Cycle Assessment Using Large Language Models
LLMs

This article reviews the integration of AI into life cycle assessment (LCA), highlighting trends, themes, and future directions using lar...

arXiv - AI · 4 min ·
[2602.22480] VeRO: An Evaluation Harness for Agents to Optimize Agents
LLMs

The paper introduces VeRO, an evaluation harness designed for optimizing coding agents through structured evaluation and benchmarking, ad...

arXiv - Machine Learning · 3 min ·
[2602.22470] Beyond performance-wise Contribution Evaluation in Federated Learning
Machine Learning

This paper explores the limitations of current evaluation methods in federated learning, emphasizing the need for a multidimensional appr...

arXiv - Machine Learning · 3 min ·
[2602.22442] A Framework for Assessing AI Agent Decisions and Outcomes in AutoML Pipelines
LLMs

This article presents a framework for evaluating AI agent decisions in AutoML pipelines, emphasizing decision-centric metrics over tradit...

arXiv - AI · 4 min ·
[2602.22438] From Bias to Balance: Fairness-Aware Paper Recommendation for Equitable Peer Review
Machine Learning

This paper introduces Fair-PaperRec, a fairness-aware paper recommendation system designed to mitigate biases in peer review, enhancing e...

arXiv - AI · 4 min ·
[2602.22413] Epistemic Filtering and Collective Hallucination: A Jury Theorem for Confidence-Calibrated Agents
AI Safety

This paper explores a probabilistic framework for collective decision-making among agents that can assess their own reliability and selec...

arXiv - AI · 3 min ·
[2602.22302] Agent Behavioral Contracts: Formal Specification and Runtime Enforcement for Reliable Autonomous AI Agents
NLP

The paper presents Agent Behavioral Contracts (ABC), a framework for specifying and enforcing the behavior of autonomous AI agents, addre...

arXiv - AI · 4 min ·
[2602.22345] Structure and Redundancy in Large Language Models: A Spectral Study via Random Matrix Theory
LLMs

This paper explores the reliability and efficiency of large language models (LLMs) using Random Matrix Theory. It introduces EigenTrack f...

arXiv - AI · 4 min ·
[2602.22303] Training Agents to Self-Report Misbehavior
LLMs

The paper discusses a novel approach to training AI agents to self-report misbehavior, enhancing alignment and safety in AI systems by re...

arXiv - AI · 3 min ·
[2602.22298] AviaSafe: A Physics-Informed Data-Driven Model for Aviation Safety-Critical Cloud Forecasts
Machine Learning

AviaSafe introduces a physics-informed, data-driven model for aviation cloud forecasts, enhancing prediction accuracy for critical hydrom...

arXiv - AI · 3 min ·
[2602.22291] Manifold of Failure: Behavioral Attraction Basins in Language Models
LLMs

This paper introduces a framework for mapping the 'Manifold of Failure' in language models, identifying vulnerability regions and their t...

arXiv - Machine Learning · 4 min ·
[2602.22288] Reliable XAI Explanations in Sudden Cardiac Death Prediction for Chagas Cardiomyopathy
Machine Learning

This article discusses a novel explainable AI (XAI) method for predicting sudden cardiac death in Chagas cardiomyopathy, emphasizing its ...

arXiv - Machine Learning · 4 min ·
[2602.22271] Support Tokens, Stability Margins, and a New Foundation for Robust LLMs
LLMs

This article presents a novel probabilistic framework for understanding causal self-attention in LLMs, introducing concepts like support ...

arXiv - Machine Learning · 4 min ·