AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

[2511.21331] The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
Machine Learning

arXiv - AI · 4 min
[2509.22367] What Is The Political Content in LLMs' Pre- and Post-Training Data?
LLMs

arXiv - AI · 4 min
[2507.22264] SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
Machine Learning

arXiv - AI · 4 min

All Content

[2602.16844] Overseeing Agents Without Constant Oversight: Challenges and Opportunities
AI Agents

This article explores the challenges and opportunities in overseeing AI agents without constant human oversight, focusing on user studies...

arXiv - AI · 3 min
[2602.16826] HiVAE: Hierarchical Latent Variables for Scalable Theory of Mind
Machine Learning

The paper presents HiVAE, a hierarchical variational architecture designed to enhance AI's theory of mind capabilities, enabling better i...

arXiv - AI · 3 min
[2602.16829] Learning under noisy supervision is governed by a feedback-truth gap
Machine Learning

This paper explores how learning under noisy supervision is influenced by a feedback-truth gap, demonstrating its effects across various ...

arXiv - AI · 3 min
[2602.16802] References Improve LLM Alignment in Non-Verifiable Domains
LLMs

This paper explores how reference-guided evaluators can enhance LLM alignment in non-verifiable domains, demonstrating significant improv...

arXiv - Machine Learning · 4 min
[2602.16800] Large-scale online deanonymization with LLMs
LLMs

This article discusses the use of large language models (LLMs) for deanonymizing online users, demonstrating high precision in identifyin...

arXiv - Machine Learning · 4 min
[2602.16747] LiveClin: A Live Clinical Benchmark without Leakage
LLMs

LiveClin introduces a novel clinical benchmark for evaluating medical LLMs, addressing issues of data contamination and knowledge obsoles...

arXiv - AI · 4 min
[2602.16741] Can Adversarial Code Comments Fool AI Security Reviewers -- Large-Scale Empirical Study of Comment-Based Attacks and Defenses Against LLM Code Analysis
LLMs

This study investigates whether adversarial code comments can mislead AI security reviewers during vulnerability detection in code, revea...

arXiv - Machine Learning · 4 min
[2602.16740] Quantifying LLM Attention-Head Stability: Implications for Circuit Universality
LLMs

This article examines the stability of attention heads in transformer models, revealing insights into their representational robustness a...

arXiv - AI · 4 min
[2602.16729] Intent Laundering: AI Safety Datasets Are Not What They Seem
AI Safety

The paper evaluates AI safety datasets, revealing they often misrepresent real-world attacks due to an overreliance on triggering cues, l...

arXiv - Machine Learning · 4 min
[2602.16723] Is Mamba Reliable for Medical Imaging?
Machine Learning

This paper evaluates the reliability of Mamba, a state-space model, for medical imaging under various attack scenarios, highlighting vuln...

arXiv - AI · 3 min
[2602.17594] AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games
AI Startups

The paper introduces the AI Gamestore, a platform for evaluating machine general intelligence through human games, highlighting its poten...

arXiv - AI · 4 min
[2602.17566] A Hybrid Federated Learning Based Ensemble Approach for Lung Disease Diagnosis Leveraging Fusion of SWIN Transformer and CNN
Machine Learning

This article presents a hybrid federated learning model that combines SWIN Transformer and CNN for diagnosing lung diseases, particularly...

arXiv - AI · 4 min
[2602.17560] ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment
LLMs

The paper presents ODESteer, a novel ODE-based framework for aligning large language models (LLMs) by addressing limitations in existing ...

arXiv - AI · 4 min
[2602.17508] Pareto Optimal Benchmarking of AI Models on ARM Cortex Processors for Sustainable Embedded Systems
Machine Learning

This article presents a benchmarking framework for optimizing AI models on ARM Cortex processors, focusing on energy efficiency and perfo...

arXiv - AI · 4 min
[2602.17418] A Privacy by Design Framework for Large Language Model-Based Applications for Children
LLMs

This article proposes a Privacy by Design framework for AI applications targeting children, addressing privacy risks and compliance with ...

arXiv - AI · 4 min
[2602.17234] All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection in LLM Backtesting
LLMs

The paper introduces a framework for detecting temporal knowledge leakage in LLM backtesting, proposing a new metric, Shapley-DCLR, and a...

arXiv - Machine Learning · 4 min
[2602.17229] Mechanistic Interpretability of Cognitive Complexity in LLMs via Linear Probing using Bloom's Taxonomy
LLMs

This paper explores the mechanistic interpretability of cognitive complexity in Large Language Models (LLMs) using Bloom's Taxonomy, demo...

arXiv - AI · 3 min
[2602.17116] Epistemology of Generative AI: The Geometry of Knowing
Generative AI

This article explores the epistemological implications of generative AI, proposing a new framework for understanding knowledge production...

arXiv - AI · 4 min
[2602.17107] Owen-based Semantics and Hierarchy-Aware Explanation (O-Shap)
AI Safety

The paper presents O-Shap, a novel method for feature attribution in explainable AI, addressing limitations of traditional Shapley value ...

arXiv - AI · 3 min
[2602.17106] Toward Trustworthy Evaluation of Sustainability Rating Methodologies: A Human-AI Collaborative Framework for Benchmark Dataset Construction
AI Safety

The paper proposes a human-AI collaborative framework for creating benchmark datasets to evaluate sustainability rating methodologies, ad...

arXiv - AI · 3 min