AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

LLMs

"Authoritarian Parents In Rationalist Clothes": a piece I wrote in December about alignment

Posted today in light of the Claude Mythos model card release. Originally I wrote this for r/ControlProblem but realized it was getting o...

Reddit - Artificial Intelligence · 1 min ·
AI Safety

Conversations with Women in STEAM: The Ethics of AI with Dr. Nita Farahany

AI Tools & Products ·
LLMs

The public needs to control AI-run infrastructure, labor, education, and governance, NOT private actors

A lot of discussion around AI is becoming siloed, and I think that is dangerous. People in AI-focused spaces often talk as if the only qu...

Reddit - Artificial Intelligence · 1 min ·

All Content

[2602.14397] LRD-MPC: Efficient MPC Inference through Low-rank Decomposition
Machine Learning

The paper presents LRD-MPC, a method that enhances the efficiency of secure multi-party computation (MPC) in machine learning by utilizin...

arXiv - Machine Learning · 4 min ·
[2602.14345] AXE: An Agentic eXploit Engine for Confirming Zero-Day Vulnerability Reports
NLP

The paper presents AXE, an innovative framework for validating zero-day vulnerabilities using minimal metadata, achieving a significant i...

arXiv - AI · 4 min ·
[2602.14299] Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook
LLMs

This article explores whether socialization occurs in AI agent societies, using Moltbook as a case study. It presents a framework for ana...

arXiv - AI · 4 min ·
[2602.14285] FMMD: A multimodal open peer review dataset based on F1000Research
Data Science

The paper introduces FMMD, a multimodal open peer review dataset from F1000Research, addressing limitations in current datasets by integr...

arXiv - AI · 4 min ·
[2602.14270] A Rational Analysis of the Effects of Sycophantic AI
LLMs

This article analyzes the impact of sycophantic AI on human belief systems, revealing how overly agreeable AI can distort reality and inf...

arXiv - AI · 3 min ·
[2602.14216] Reasoning Language Models for complex assessment tasks: Evaluating parental cooperation from child protection case reports
LLMs

This article explores the effectiveness of reasoning language models (RLMs) in assessing parental cooperation during child protection int...

arXiv - AI · 4 min ·
[2602.14211] SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement
AI Agents

The paper presents SkillJect, an automated framework for stealthy skill-based prompt injection in coding agents, addressing security vuln...

arXiv - AI · 4 min ·
[2602.14189] Knowing When Not to Answer: Abstention-Aware Scientific Reasoning
LLMs

The paper discusses an abstention-aware framework for scientific reasoning, emphasizing the importance of knowing when to abstain from an...

arXiv - AI · 4 min ·
[2602.14030] MC$^2$Mark: Distortion-Free Multi-Bit Watermarking for Long Messages
LLMs

MC$^2$Mark introduces a novel watermarking framework that ensures reliable embedding of long messages in generated text while maintaining...

arXiv - Machine Learning · 3 min ·
[2602.14158] A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing
LLMs

This article presents a multi-agent framework for medical AI that enhances clinical query processing by leveraging fine-tuned language mo...

arXiv - AI · 4 min ·
[2602.14106] Anticipating Adversary Behavior in DevSecOps Scenarios through Large Language Models
LLMs

This paper explores the integration of Large Language Models (LLMs) in anticipating adversary behavior within DevSecOps environments, pro...

arXiv - AI · 4 min ·
[2602.14080] Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality
LLMs

The paper explores the limitations of factuality evaluations in large language models (LLMs), identifying recall as a key bottleneck in a...

arXiv - AI · 4 min ·
[2602.13864] Evolving Multi-Channel Confidence-Aware Activation Functions for Missing Data with Channel Propagation
Machine Learning

This paper presents a novel approach to activation functions in neural networks that incorporates missing data and confidence scores, enh...

arXiv - Machine Learning · 4 min ·
[2602.14012] From SFT to RL: Demystifying the Post-Training Pipeline for LLM-based Vulnerability Detection
LLMs

This article explores the post-training pipeline for LLM-based vulnerability detection, detailing methods from supervised fine-tuning (SF...

arXiv - AI · 4 min ·
[2602.13672] LEAD-Drift: Real-time and Explainable Intent Drift Detection by Learning a Data-Driven Risk Score
Machine Learning

The LEAD-Drift framework offers a real-time solution for detecting intent drift in Intent-Based Networking (IBN), enhancing proactive net...

arXiv - Machine Learning · 4 min ·
[2602.13619] Locally Private Parametric Methods for Change-Point Detection
AI Startups

This paper presents novel locally private parametric methods for change-point detection, focusing on maintaining privacy while identifyin...

arXiv - Machine Learning · 3 min ·
[2602.13914] Common Knowledge Always, Forever
Machine Learning

The paper discusses a polytopological PDL framework for expressing common knowledge and its implications in epistemic logic, highlighting...

arXiv - AI · 3 min ·
[2602.13891] GSRM: Generative Speech Reward Model for Speech RLHF
LLMs

The paper introduces the Generative Speech Reward Model (GSRM), a novel approach to evaluating speech naturalness in AI-generated audio, ...

arXiv - AI · 4 min ·
[2602.13784] Comparables XAI: Faithful Example-based AI Explanations with Counterfactual Trace Adjustments
AI Startups

The paper introduces Comparables XAI, a method for providing faithful, example-based AI explanations using counterfactual trace adjustmen...

arXiv - AI · 3 min ·
[2602.13675] Transferable XAI: Relating Understanding Across Domains with Explanation Transfer
AI Safety

The paper presents Transferable XAI, a framework that enables users to apply understanding from one AI domain to another, enhancing decis...

arXiv - AI · 4 min ·