AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

[2511.21331] The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
Machine Learning

arXiv - AI · 4 min ·
[2509.22367] What Is The Political Content in LLMs' Pre- and Post-Training Data?
LLMs

arXiv - AI · 4 min ·
[2507.22264] SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
Machine Learning

arXiv - AI · 4 min ·

All Content

[2602.17053] RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
Machine Learning

The paper introduces RFEval, a benchmark for assessing reasoning faithfulness in large reasoning models, highlighting issues of unfaithfu...

arXiv - AI · 4 min ·
[2602.17038] Phase-Aware Mixture of Experts for Agentic Reinforcement Learning
LLMs

The paper presents a novel Phase-Aware Mixture of Experts (PA-MoE) architecture for reinforcement learning, addressing the limitations of...

arXiv - AI · 4 min ·
[2602.16958] Automating Agent Hijacking via Structural Template Injection
LLMs

This paper presents Phantom, an automated framework for agent hijacking via Structural Template Injection, enhancing attack success rates...

arXiv - Machine Learning · 4 min ·
[2602.16984] Fundamental Limits of Black-Box Safety Evaluation: Information-Theoretic and Computational Barriers from Latent Context Conditioning
Machine Learning

This paper explores the limitations of black-box safety evaluations in AI systems, highlighting the challenges posed by latent context co...

arXiv - AI · 4 min ·
[2602.16943] Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents
LLMs

This paper explores the discrepancies between text safety and tool-call safety in large language model (LLM) agents, introducing the GAP ...

arXiv - AI · 4 min ·
[2602.16942] SourceBench: Can AI Answers Reference Quality Web Sources?
LLMs

The paper introduces SourceBench, a benchmark designed to evaluate the quality of web sources cited by AI models across various query typ...

arXiv - AI · 3 min ·
[2602.16935] DeepContext: Stateful Real-Time Detection of Multi-Turn Adversarial Intent Drift in LLMs
LLMs

The paper introduces DeepContext, a stateful framework for detecting adversarial intent drift in multi-turn dialogues within large langua...

arXiv - Machine Learning · 4 min ·
[2602.16931] Narrow fine-tuning erodes safety alignment in vision-language agents
LLMs

The paper explores how narrow fine-tuning of vision-language agents can lead to significant safety alignment issues, highlighting the ris...

arXiv - AI · 3 min ·
[2602.16901] AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks
LLMs

The paper introduces AgentLAB, a benchmark for evaluating the vulnerability of LLM agents to long-horizon attacks, highlighting their sus...

arXiv - AI · 3 min ·
[2602.16832] IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages
LLMs

The paper introduces IndicJR, a benchmark for evaluating jailbreak robustness in large language models across 12 South Asian languages, r...

arXiv - AI · 3 min ·
[2602.16716] Contextuality from Single-State Representations: An Information-Theoretic Principle for Adaptive Intelligence
AI Agents

This paper explores the concept of contextuality in adaptive intelligence, demonstrating that single-state representations incur an infor...

arXiv - AI · 3 min ·
Google says its AI systems helped deter Play Store malware in 2025
AI Safety

Google's 2025 report reveals a significant reduction in malware on the Play Store, attributing the success to enhanced AI-driven security...

AI Tools & Products · 4 min ·
Most AI bots lack basic safety disclosures, study finds
AI Safety

A recent study reveals that most AI bots fail to provide essential safety disclosures, raising concerns about user safety and transparenc...

AI Tools & Products · 1 min ·
4 highlights from Google CEO Sundar Pichai's talk at the AI Impact Summit 2026 in India
AI Infrastructure

Sundar Pichai's address at the AI Impact Summit 2026 highlights Google's advancements in AI, infrastructure investments in India, and the...

AI Tools & Products · 5 min ·
The People vs. AI
AI Safety

A grassroots movement is emerging across the U.S. as citizens unite against the rapid expansion of the AI industry, raising concerns abou...

AI Tools & Products · 21 min ·
Anthropic: AI triggers the ‘SaaSpocalypse’
AI Startups

The article discusses the potential disruption in the software industry due to AI advancements, particularly following Anthropic's new to...

AI Tools & Products · 6 min ·
Anthropic, Infosys to build AI agents for regulated industries
AI Agents

Infosys partners with Anthropic to develop AI agents tailored for regulated industries like financial services, focusing on compliance an...

AI Tools & Products · 5 min ·
The Pitt has a sharp take on AI
AI Startups

HBO's 'The Pitt' explores the complexities of generative AI in healthcare, highlighting its potential benefits and risks through a grippi...

The Verge - AI · 7 min ·
Prompt Injection and Info Leak Immune AI Agent, working Demo for Testing
LLMs

The article discusses a new AI agent prototype designed to combat prompt injection and information leaks, addressing a critical security ...

Reddit - Artificial Intelligence · 1 min ·
The AI security nightmare is here and it looks suspiciously like lobster
Robotics

A hacker exploited a vulnerability in Cline's AI workflow, leading to the installation of OpenClaw, highlighting significant security ris...

The Verge - AI · 4 min ·