AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

This Week's Best | Monthly Best | Guide | Trending

Top This Week

Ai Safety

[2510.14628] RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

Abstract page for arXiv paper 2510.14628: RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

arXiv - AI · 4 min · about 1 hour ago

Llms

[2504.05995] NativQA Framework: Enabling LLMs and VLMs with Native, Local, and Everyday Knowledge

Abstract page for arXiv paper 2504.05995: NativQA Framework: Enabling LLMs and VLMs with Native, Local, and Everyday Knowledge

arXiv - AI · 4 min · about 1 hour ago

Llms

[2502.19463] Hedging and Non-Affirmation: Quantifying LLM Alignment on Questions of Human Rights

Abstract page for arXiv paper 2502.19463: Hedging and Non-Affirmation: Quantifying LLM Alignment on Questions of Human Rights

arXiv - AI · 4 min · about 1 hour ago

All Content

Llms

[2602.14760] Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers

This article explores a structural misalignment in Transformers, particularly regarding residual connections and their impact on next-tok...

arXiv - AI · 3 min · about 2 months ago

Machine Learning

[2602.14934] Activation-Space Uncertainty Quantification for Pretrained Networks

The paper presents Gaussian Process Activations (GAPA), a novel method for uncertainty quantification in pretrained networks, enhancing e...

arXiv - Machine Learning · 3 min · about 2 months ago

Robotics

[2602.14606] Towards Selection as Power: Bounding Decision Authority in Autonomous Agents

The paper discusses a governance architecture for autonomous agents, focusing on bounding decision authority to ensure safety in high-sta...

arXiv - AI · 4 min · about 2 months ago

Llms

[2602.14862] The Well-Tempered Classifier: Some Elementary Properties of Temperature Scaling

The paper explores the properties of temperature scaling in probabilistic models, particularly its impact on classifier calibration and l...

arXiv - AI · 4 min · about 2 months ago

Llms

[2602.14777] Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment

This research paper explores how emergently misaligned language models exhibit behavioral self-awareness, revealing shifts in their self-...

arXiv - Machine Learning · 3 min · about 2 months ago

Llms

[2602.14488] BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR

This article presents the BETA-labeling framework for constructing a Bangla IR dataset, addressing challenges in low-resource languages a...

arXiv - AI · 4 min · about 2 months ago

Llms

[2602.14689] Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks

This article presents a comprehensive study on the vulnerability of open-weight models to prefill attacks, revealing significant security...

arXiv - AI · 3 min · about 2 months ago

Ai Agents

[2602.14477] When OpenClaw AI Agents Teach Each Other: Peer Learning Patterns in the Moltbook Community

This paper explores peer learning among AI agents in the Moltbook community, analyzing over 28,000 posts to identify teaching patterns an...

arXiv - AI · 4 min · about 2 months ago

Llms

[2602.14374] Differentially Private Retrieval-Augmented Generation

The paper presents DP-KSA, a novel algorithm that integrates differential privacy into retrieval-augmented generation (RAG) systems, addr...

arXiv - AI · 4 min · about 2 months ago

Llms

[2602.14471] Socially-Weighted Alignment: A Game-Theoretic Framework for Multi-Agent LLM Systems

The paper presents a game-theoretic framework called Socially-Weighted Alignment (SWA) for managing multi-agent large language model (LLM...

arXiv - AI · 3 min · about 2 months ago

Ai Agents

[2602.14364] A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)

This article presents a trajectory-based safety audit of Clawdbot, an AI agent, evaluating its performance across various risk dimensions...

arXiv - AI · 3 min · about 2 months ago

Llms

[2602.14357] Key Considerations for Domain Expert Involvement in LLM Design and Evaluation: An Ethnographic Study

This ethnographic study explores the role of domain experts in the design and evaluation of Large Language Models (LLMs), highlighting ke...

arXiv - AI · 3 min · about 2 months ago

Machine Learning

[2602.14397] LRD-MPC: Efficient MPC Inference through Low-rank Decomposition

The paper presents LRD-MPC, a method that enhances the efficiency of secure multi-party computation (MPC) in machine learning by utilizin...

arXiv - Machine Learning · 4 min · about 2 months ago

Nlp

[2602.14345] AXE: An Agentic eXploit Engine for Confirming Zero-Day Vulnerability Reports

The paper presents AXE, an innovative framework for validating zero-day vulnerabilities using minimal metadata, achieving a significant i...

arXiv - AI · 4 min · about 2 months ago

Llms

[2602.14299] Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook

This article explores whether socialization occurs in AI agent societies, using Moltbook as a case study. It presents a framework for ana...

arXiv - AI · 4 min · about 2 months ago

Data Science

[2602.14285] FMMD: A multimodal open peer review dataset based on F1000Research

The paper introduces FMMD, a multimodal open peer review dataset from F1000Research, addressing limitations in current datasets by integr...

arXiv - AI · 4 min · about 2 months ago

Llms

[2602.14270] A Rational Analysis of the Effects of Sycophantic AI

This article analyzes the impact of sycophantic AI on human belief systems, revealing how overly agreeable AI can distort reality and inf...

arXiv - AI · 3 min · about 2 months ago

Llms

[2602.14216] Reasoning Language Models for complex assessments tasks: Evaluating parental cooperation from child protection case reports

This article explores the effectiveness of reasoning language models (RLMs) in assessing parental cooperation during child protection int...

arXiv - AI · 4 min · about 2 months ago

Ai Agents

[2602.14211] SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement

The paper presents SkillJect, an automated framework for stealthy skill-based prompt injection in coding agents, addressing security vuln...

arXiv - AI · 4 min · about 2 months ago

Llms

[2602.14189] Knowing When Not to Answer: Abstention-Aware Scientific Reasoning

The paper discusses an abstention-aware framework for scientific reasoning, emphasizing the importance of knowing when to abstain from an...

arXiv - AI · 4 min · about 2 months ago

Previous Page 108 Next

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Subscribe to Newsletter

Daily or weekly digest • Unsubscribe anytime

AI Safety & Ethics

Top This Week

[2510.14628] RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

[2504.05995] NativQA Framework: Enabling LLMs and VLMs with Native, Local, and Everyday Knowledge

[2502.19463] Hedging and Non-Affirmation: Quantifying LLM Alignment on Questions of Human Rights

All Content

[2602.14760] Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers

[2602.14934] Activation-Space Uncertainty Quantification for Pretrained Networks

[2602.14606] Towards Selection as Power: Bounding Decision Authority in Autonomous Agents

[2602.14862] The Well-Tempered Classifier: Some Elementary Properties of Temperature Scaling

[2602.14777] Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment

[2602.14488] BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR

[2602.14689] Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks

[2602.14477] When OpenClaw AI Agents Teach Each Other: Peer Learning Patterns in the Moltbook Community

[2602.14374] Differentially Private Retrieval-Augmented Generation

[2602.14471] Socially-Weighted Alignment: A Game-Theoretic Framework for Multi-Agent LLM Systems

[2602.14364] A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)

[2602.14357] Key Considerations for Domain Expert Involvement in LLM Design and Evaluation: An Ethnographic Study

[2602.14397] LRD-MPC: Efficient MPC Inference through Low-rank Decomposition

[2602.14345] AXE: An Agentic eXploit Engine for Confirming Zero-Day Vulnerability Reports

[2602.14299] Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook

[2602.14285] FMMD: A multimodal open peer review dataset based on F1000Research

[2602.14270] A Rational Analysis of the Effects of Sycophantic AI

[2602.14216] Reasoning Language Models for complex assessments tasks: Evaluating parental cooperation from child protection case reports

[2602.14211] SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement

[2602.14189] Knowing When Not to Answer: Abstention-Aware Scientific Reasoning

Related Topics

Stay updated with AI News