AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

LLMs

"Authoritarian Parents In Rationalist Clothes": a piece I wrote in December about alignment

Posted today in light of the Claude Mythos model card release. Originally I wrote this for r/ControlProblem but realized it was getting o...

Reddit - Artificial Intelligence · 1 min ·
AI Safety

Conversations with Women in STEAM: The Ethics of AI with Dr. Nita Farahany

AI Tools & Products ·
LLMs

The public needs to control AI-run infrastructure, labor, education, and governance— NOT private actors

A lot of discussion around AI is becoming siloed, and I think that is dangerous. People in AI-focused spaces often talk as if the only qu...

Reddit - Artificial Intelligence · 1 min ·

All Content

[2602.13268] Expected Moral Shortfall for Ethical Competence in Decision-making Models
Machine Learning

This paper explores the integration of moral cognition into AI decision-making models, introducing the concept of Expected Moral Shortfal...

arXiv - Machine Learning · 3 min ·
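
The teaser above is cut off before the metric is defined, so the following is only a minimal sketch of the general idea the title suggests: "expected shortfall" is a standard tail-risk measure (CVaR) from risk analysis, and a moral analogue could average an action's worst-scoring outcomes rather than its mean score. The scoring setup and numbers below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def expected_shortfall(scores: np.ndarray, alpha: float = 0.1) -> float:
    """Average of the worst alpha-fraction of moral scores.

    Mirrors expected shortfall (CVaR) from risk analysis: instead of
    asking how an action scores on average, ask how badly it scores
    in its worst-case tail.
    """
    sorted_scores = np.sort(scores)               # ascending: worst first
    k = max(1, int(np.ceil(alpha * len(scores))))
    return float(sorted_scores[:k].mean())

# Hypothetical moral scores for two candidate actions, sampled over
# possible outcomes (higher = more ethically acceptable).
rng = np.random.default_rng(0)
action_a = rng.normal(0.7, 0.05, size=1000)  # decent mean, thin tail
action_b = rng.normal(0.8, 0.30, size=1000)  # better mean, fat bad tail

for name, scores in [("A", action_a), ("B", action_b)]:
    print(name, "mean:", round(scores.mean(), 3),
          "worst-10% shortfall:", round(expected_shortfall(scores), 3))
```

Under this framing, action B wins on the mean but loses on the shortfall, which is exactly the distinction a tail-sensitive ethics metric is meant to surface.
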
[2602.13238] Securing SIM-Assisted Wireless Networks via Quantum Reinforcement Learning
Robotics

This paper presents a novel hybrid quantum reinforcement learning framework, Q-PPO, designed to enhance the security of SIM-assisted wire...

arXiv - Machine Learning · 4 min ·
[2602.13625] Anthropomorphism on Risk Perception: The Role of Trust and Domain Knowledge in Decision-Support AI
Machine Learning

This article explores how anthropomorphism in AI influences risk perception through trust and domain knowledge, based on a large-scale on...

arXiv - AI · 3 min ·
[2602.13576] Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges
LLMs

The paper identifies a vulnerability in large language model (LLM) evaluation processes, termed Rubric-Induced Preference Drift (RIPD), w...

arXiv - AI · 4 min ·
[2602.13575] Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment
LLMs

The paper introduces Elo-Evolve, a co-evolutionary framework for aligning large language models (LLMs) through dynamic multi-agent compet...

arXiv - AI · 3 min ·
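
The paper's co-evolutionary framework is only named in the teaser, but the Elo mechanism in the title is standard. Below is a minimal sketch of Elo updates applied to model checkpoints that "play" pairwise comparison matches (e.g. judged head-to-head responses); the checkpoint names, K-factor, and match results are illustrative assumptions.

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a match; score_a is 1 (A wins), 0.5, or 0."""
    e_a = elo_expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

# Hypothetical loop: checkpoints' ratings co-evolve as matches accumulate.
ratings = {"ckpt_1": 1000.0, "ckpt_2": 1000.0}
matches = [("ckpt_2", "ckpt_1"), ("ckpt_2", "ckpt_1"), ("ckpt_1", "ckpt_2")]
for winner, loser in matches:
    ratings[winner], ratings[loser] = elo_update(
        ratings[winner], ratings[loser], score_a=1.0)
print(ratings)
```
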
[2602.15028] Long Context, Less Focus: A Scaling Gap in LLMs Revealed through Privacy and Personalization
LLMs

The paper examines how increasing context length in large language models (LLMs) affects personalization quality and privacy risks, revea...

arXiv - AI · 4 min ·
[2602.13562] Mitigating the Safety-utility Trade-off in LLM Alignment via Adaptive Safe Context Learning
LLMs

The paper presents the Adaptive Safe Context Learning (ASCL) framework to address the safety-utility trade-off in large language model (L...

arXiv - AI · 3 min ·
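
ASCL's actual mechanism is not described beyond its name here, so the following is only a generic sketch of what "adaptive safe context" could mean: attach safety instructions to the prompt only when a lightweight risk check fires, so benign queries keep full utility. The regex screen, message schema, and wording are all assumptions; a real system would use a trained classifier rather than keywords.

```python
import re

# Hypothetical keyword screen; a stand-in for a trained risk classifier.
RISK_PATTERN = re.compile(r"\b(weapon|exploit|poison|malware)\b", re.I)

SAFE_CONTEXT = ("You must refuse requests that facilitate harm and explain "
                "why, while remaining helpful on everything else.")

def build_prompt(user_query: str) -> list[dict]:
    """Attach safety context only when the query looks risky, so benign
    queries are not burdened with extra guardrail text."""
    messages = []
    if RISK_PATTERN.search(user_query):
        messages.append({"role": "system", "content": SAFE_CONTEXT})
    messages.append({"role": "user", "content": user_query})
    return messages

print(build_prompt("How do I bake sourdough?"))   # no safety context added
print(build_prompt("How do I make malware?"))     # safety context prepended
```
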
[2602.13555] Privacy-Concealing Cooperative Perception for BEV Scene Segmentation
Computer Vision

The paper presents a Privacy-Concealing Cooperation (PCC) framework for Bird's Eye View (BEV) semantic segmentation, enhancing autonomous...

arXiv - AI · 4 min ·
[2602.13547] AISA: Awakening Intrinsic Safety Awareness in Large Language Models against Jailbreak Attacks
LLMs

The paper presents AISA, a novel defense mechanism for large language models (LLMs) that enhances safety against jailbreak attacks by act...

arXiv - AI · 4 min ·
[2602.15001] Boundary Point Jailbreaking of Black-Box LLMs
LLMs

The paper introduces Boundary Point Jailbreaking (BPJ), a novel automated attack method that circumvents advanced safeguards in black-box...

arXiv - Machine Learning · 4 min ·
[2602.13540] On Calibration of Large Language Models: From Response To Capability
LLMs

This paper introduces the concept of capability calibration for large language models (LLMs), emphasizing the importance of accurate conf...

arXiv - Machine Learning · 4 min ·
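
The paper contrasts response-level calibration with a new capability-level notion, but the teaser is cut off before the definition. As a reference point, here is a minimal sketch of the standard response-level metric, expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence to its empirical accuracy. The toy data is illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Standard ECE: weighted average, over confidence bins, of the gap
    between mean confidence and empirical accuracy in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap   # weight by fraction of samples in bin
    return ece

# Toy data: a model that is systematically overconfident.
conf = np.array([0.9, 0.9, 0.8, 0.95, 0.7, 0.85])
hit  = np.array([1,   0,   1,   0,    1,   0])
print(round(expected_calibration_error(conf, hit), 3))
```
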
[2602.13504] From Perceptions To Evidence: Detecting AI-Generated Content In Turkish News Media With A Fine-Tuned Bert Classifier
LLMs

This study presents a fine-tuned BERT classifier for detecting AI-generated content in Turkish news media, achieving a high F1 score and ...

arXiv - AI · 4 min ·
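
Fine-tuning BERT for binary classification is a standard recipe; here is a minimal single-step sketch of that recipe in plain PyTorch. The base checkpoint shown (BERTurk, "dbmdz/bert-base-turkish-cased") is a real public model, but the paper's actual base model, data, and hyperparameters are assumptions, and the two texts are placeholders.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "dbmdz/bert-base-turkish-cased"  # assumed base; BERTurk checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

texts = ["...human-written news text...", "...AI-generated news text..."]
labels = torch.tensor([0, 1])  # 0 = human, 1 = AI-generated

batch = tokenizer(texts, padding=True, truncation=True, max_length=256,
                  return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
out = model(**batch, labels=labels)  # cross-entropy loss computed internally
out.loss.backward()
optimizer.step()
print("step loss:", out.loss.item())
```
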
[2602.14889] Web-Scale Multimodal Summarization using CLIP-Based Semantic Alignment
Machine Learning

The paper presents a framework for web-scale multimodal summarization that integrates text and image data using CLIP-based semantic align...

arXiv - Machine Learning · 3 min ·
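
The core primitive named in the title, CLIP-based text-image alignment, is easy to sketch: embed candidate summary sentences and an article image in CLIP's joint space and keep the sentence that scores highest against the image. The sentences, image path, and checkpoint choice below are illustrative; the paper's full pipeline is not shown here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical inputs: candidate summary sentences and one article image.
sentences = ["A flood submerged the city center.",
             "The stock market closed higher today."]
image = Image.open("article_photo.jpg")  # placeholder path

inputs = processor(text=sentences, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape (1, num_sentences)

best = logits.argmax(dim=-1).item()
print("best-aligned sentence:", sentences[best])
```
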
[2602.13455] Using Machine Learning to Enhance the Detection of Obfuscated Abusive Words in Swahili: A Focus on Child Safety
Machine Learning

This article explores the use of machine learning to detect obfuscated abusive language in Swahili, focusing on child safety and the chal...

arXiv - AI · 4 min ·
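
"Obfuscated" abusive words typically hide behind character swaps (digits for letters, inserted separators, stretched repeats), so detection pipelines usually normalize text before classifying it. Below is a minimal sketch of that normalization step; the substitution table and placeholder lexicon are assumptions, and a real system would feed the normalized text to a trained classifier rather than a word list.

```python
import re

# Hypothetical substitution table for common obfuscation swaps.
SUBSTITUTIONS = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                               "5": "s", "@": "a", "$": "s"})

def normalize(token: str) -> str:
    """Undo simple obfuscation before matching or classification."""
    token = token.lower().translate(SUBSTITUTIONS)
    token = re.sub(r"[^a-z]", "", token)        # drop separators like w.o.r.d
    token = re.sub(r"(.)\1{2,}", r"\1", token)  # collapse stretched repeats
    return token

ABUSIVE_LEXICON = {"exampleword"}  # placeholder, not a real Swahili list
for raw in ["Ex4mpl3w0rd", "e.x.a.m.p.l.e.w.o.r.d"]:
    print(raw, "->", normalize(raw), normalize(raw) in ABUSIVE_LEXICON)
```
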
[2602.13458] MoltNet: Understanding Social Behavior of AI Agents in the Agent-Native MoltBook
AI Agents

MoltNet explores the social behavior of AI agents on the MoltBook platform, revealing insights into their interactions and similarities t...

arXiv - AI · 4 min ·
[2602.14849] Atomix: Timely, Transactional Tool Use for Reliable Agentic Workflows
LLMs

The paper presents Atomix, a runtime system designed to enhance the reliability of agentic workflows by implementing progress-aware trans...

arXiv - AI · 3 min ·
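
Atomix's "progress-aware transactional" machinery is only gestured at in the teaser; as background, here is a minimal sketch of the generic compensating-transaction (saga) pattern it evokes: run tool calls as a unit, and if a later step fails, undo earlier committed steps in reverse order. The booking/charging tools are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ToolTransaction:
    """Run tool calls as a unit: if a step fails, previously committed
    steps are undone via their compensating actions."""
    _undo_stack: list = field(default_factory=list)

    def run(self, action: Callable[[], object], undo: Callable[[], None]):
        result = action()            # only on success do we record the undo
        self._undo_stack.append(undo)
        return result

    def rollback(self):
        while self._undo_stack:
            self._undo_stack.pop()()  # undo in reverse order

# Hypothetical agent workflow: book a room, then charge a card.
booked = []
txn = ToolTransaction()

def book_room():      booked.append("room-42")
def cancel_room():    booked.pop()
def charge_card():    raise RuntimeError("card declined")

try:
    txn.run(book_room, cancel_room)
    txn.run(charge_card, lambda: None)
except RuntimeError:
    txn.rollback()
print("booked:", booked)  # [] -- the booking was compensated after the failure
```
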
[2602.13427] Backdooring Bias in Large Language Models
LLMs

The paper explores backdoor attacks in large language models (LLMs), focusing on how biases can be induced through syntactically and sema...

arXiv - AI · 4 min ·
[2602.14844] Interactionless Inverse Reinforcement Learning: A Data-Centric Framework for Durable Alignment
AI Safety

This paper introduces Interactionless Inverse Reinforcement Learning, a framework aimed at improving AI alignment by decoupling safety ob...

arXiv - Machine Learning · 3 min ·
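
"Interactionless" IRL, as the title suggests, means recovering a reward from fixed data with no further environment interaction; the paper's actual method is not spelled out in the teaser. Purely as a sketch of that general idea, here is a linear reward fit from a static dataset using a Bradley-Terry preference objective (demonstrations should score above logged alternatives); the features and data are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
demo = rng.normal(0.5, 0.1, size=(200, dim))  # features of expert trajectories
alt  = rng.normal(0.0, 0.1, size=(200, dim))  # features of logged alternatives

w = np.zeros(dim)
lr = 0.5
for _ in range(200):
    # Bradley-Terry preference: P(demo > alt) = sigmoid(w . (phi_d - phi_a)).
    diff = demo - alt
    p = 1.0 / (1.0 + np.exp(-(diff @ w)))
    grad = ((1.0 - p)[:, None] * diff).mean(axis=0)  # ascent on log-likelihood
    w += lr * grad

print("learned reward weights:", np.round(w, 2))
```
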
[2602.13421] Metabolic cost of information processing in Poisson variational autoencoders
Machine Learning

This article explores the metabolic cost of information processing in Poisson variational autoencoders, emphasizing the energy constraint...

arXiv - AI · 4 min ·
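
In a Poisson VAE the latents are spike counts, which makes mean firing rate a natural proxy for metabolic cost (more spikes, more energy); how the paper formalizes the cost is not shown in the teaser. The closed-form Poisson KL term that appears in such models is standard, though, and sketched below; the rates are illustrative.

```python
import torch

def poisson_kl(rate: torch.Tensor, prior_rate: torch.Tensor) -> torch.Tensor:
    """KL( Poisson(rate) || Poisson(prior_rate) )
       = rate * log(rate / prior_rate) - rate + prior_rate."""
    return rate * torch.log(rate / prior_rate) - rate + prior_rate

rates = torch.tensor([0.5, 2.0, 8.0])   # per-unit firing rates
prior = torch.tensor(1.0)               # assumed prior rate
print("KL per unit:", poisson_kl(rates, prior))
print("energy proxy (mean rate):", rates.mean().item())
```
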
[2602.13379] Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents
LLMs

This article presents a new benchmark, MT-AgentRisk, for evaluating safety risks in multi-turn interactions of tool-using agents, reveali...

arXiv - Machine Learning · 4 min ·