AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

[2511.21331] The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
Machine Learning

arXiv - AI · 4 min
[2509.22367] What Is The Political Content in LLMs' Pre- and Post-Training Data?
LLMs

arXiv - AI · 4 min
[2507.22264] SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
Machine Learning

arXiv - AI · 4 min

All Content

[2602.16844] Overseeing Agents Without Constant Oversight: Challenges and Opportunities
AI Agents

This article explores the challenges and opportunities in overseeing AI agents without constant human oversight, focusing on user studies...

arXiv - AI · 3 min
[2602.16826] HiVAE: Hierarchical Latent Variables for Scalable Theory of Mind
Machine Learning

The paper presents HiVAE, a hierarchical variational architecture designed to enhance AI's theory of mind capabilities, enabling better i...

arXiv - AI · 3 min
[2602.16829] Learning under noisy supervision is governed by a feedback-truth gap
Machine Learning

This paper explores how learning under noisy supervision is influenced by a feedback-truth gap, demonstrating its effects across various ...

arXiv - AI · 3 min
[2602.16802] References Improve LLM Alignment in Non-Verifiable Domains
LLMs

This paper explores how reference-guided evaluators can enhance LLM alignment in non-verifiable domains, demonstrating significant improv...

arXiv - Machine Learning · 4 min
[2602.16800] Large-scale online deanonymization with LLMs
LLMs

This article discusses the use of large language models (LLMs) for deanonymizing online users, demonstrating high precision in identifyin...

arXiv - Machine Learning · 4 min
[2602.16747] LiveClin: A Live Clinical Benchmark without Leakage
LLMs

LiveClin introduces a novel clinical benchmark for evaluating medical LLMs, addressing issues of data contamination and knowledge obsoles...

arXiv - AI · 4 min
[2602.16741] Can Adversarial Code Comments Fool AI Security Reviewers -- Large-Scale Empirical Study of Comment-Based Attacks and Defenses Against LLM Code Analysis
LLMs

This study investigates whether adversarial code comments can mislead AI security reviewers during vulnerability detection in code, revea...

arXiv - Machine Learning · 4 min
[2602.16740] Quantifying LLM Attention-Head Stability: Implications for Circuit Universality
LLMs

This article examines the stability of attention heads in transformer models, revealing insights into their representational robustness a...

arXiv - AI · 4 min
[2602.16729] Intent Laundering: AI Safety Datasets Are Not What They Seem
AI Safety

The paper evaluates AI safety datasets, revealing they often misrepresent real-world attacks due to an overreliance on triggering cues, l...

arXiv - Machine Learning · 4 min
[2602.16723] Is Mamba Reliable for Medical Imaging?
Machine Learning

This paper evaluates the reliability of Mamba, a state-space model, for medical imaging under various attack scenarios, highlighting vuln...

arXiv - AI · 3 min
[2602.17594] AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games
AI Startups

The paper introduces the AI Gamestore, a platform for evaluating machine general intelligence through human games, highlighting its poten...

arXiv - AI · 4 min
[2602.17566] A Hybrid Federated Learning Based Ensemble Approach for Lung Disease Diagnosis Leveraging Fusion of SWIN Transformer and CNN
Machine Learning

This article presents a hybrid federated learning model that combines SWIN Transformer and CNN for diagnosing lung diseases, particularly...

arXiv - AI · 4 min
[2602.17560] ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment
LLMs

The paper presents ODESteer, a novel ODE-based framework for aligning large language models (LLMs) by addressing limitations in existing ...

arXiv - AI · 4 min
[2602.17508] Pareto Optimal Benchmarking of AI Models on ARM Cortex Processors for Sustainable Embedded Systems
Machine Learning

This article presents a benchmarking framework for optimizing AI models on ARM Cortex processors, focusing on energy efficiency and perfo...

arXiv - AI · 4 min
[2602.17418] A Privacy by Design Framework for Large Language Model-Based Applications for Children
LLMs

This article proposes a Privacy by Design framework for AI applications targeting children, addressing privacy risks and compliance with ...

arXiv - AI · 4 min
[2602.17234] All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection in LLM Backtesting
LLMs

The paper introduces a framework for detecting temporal knowledge leakage in LLM backtesting, proposing a new metric, Shapley-DCLR, and a...

arXiv - Machine Learning · 4 min
[2602.17229] Mechanistic Interpretability of Cognitive Complexity in LLMs via Linear Probing using Bloom's Taxonomy
LLMs

This paper explores the mechanistic interpretability of cognitive complexity in Large Language Models (LLMs) using Bloom's Taxonomy, demo...

arXiv - AI · 3 min
[2602.17116] Epistemology of Generative AI: The Geometry of Knowing
Generative AI

This article explores the epistemological implications of generative AI, proposing a new framework for understanding knowledge production...

arXiv - AI · 4 min
[2602.17107] Owen-based Semantics and Hierarchy-Aware Explanation (O-Shap)
AI Safety

The paper presents O-Shap, a novel method for feature attribution in explainable AI, addressing limitations of traditional Shapley value ...

arXiv - AI · 3 min
[2602.17106] Toward Trustworthy Evaluation of Sustainability Rating Methodologies: A Human-AI Collaborative Framework for Benchmark Dataset Construction
AI Safety

The paper proposes a human-AI collaborative framework for creating benchmark datasets to evaluate sustainability rating methodologies, ad...

arXiv - AI · 3 min