AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

Machine Learning

[R] I trained a 3k parameter model on XOR sequences of length 20. It extrapolates perfectly to length 1,000,000. Here's why I think that's architecturally significant.

I've been working on an alternative to attention-based sequence modeling that I'm calling Geometric Flow Networks (GFN). The core idea: i...

Reddit - Machine Learning · 1 min ·
Machine Learning

[D] Data curation and targeted replacement as a pre-training alignment and controllability method

Hi, r/MachineLearning: has much research been done in large-scale training scenarios where undesirable data has been replaced before trai...

Reddit - Machine Learning · 1 min ·
AI Safety

I’ve come up with a new thought experiment to approach ASI, and it challenges the very notions of alignment and containment

I’ve written an essay exploring what I’m calling the Super-Intelligent Octopus Problem—a thought experiment designed to surface a paradox...

Reddit - Artificial Intelligence · 1 min ·

All Content

[2603.03507] Solving adversarial examples requires solving exponential misalignment
Machine Learning

arXiv - Machine Learning · 4 min ·
[2603.03326] Controllable and explainable personality sliders for LLMs at inference time
LLMs

arXiv - AI · 3 min ·
[2603.03469] Biased Generalization in Diffusion Models
Machine Learning

arXiv - Machine Learning · 4 min ·
[2603.03324] Controlling Chat Style in Language Models via Single-Direction Editing
LLMs

arXiv - AI · 3 min ·
[2603.03319] Automated Concept Discovery for LLM-as-a-Judge Preference Analysis
LLMs

arXiv - AI · 4 min ·
[2603.03312] Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text Decoding
Machine Learning

arXiv - AI · 4 min ·
[2603.03308] Old Habits Die Hard: How Conversational History Geometrically Traps LLMs
LLMs

arXiv - AI · 3 min ·
[2603.03303] HumanLM: Simulating Users with State Alignment Beats Response Imitation
LLMs

arXiv - AI · 4 min ·
[2603.03298] TATRA: Training-Free Instance-Adaptive Prompting Through Rephrasing and Aggregation
LLMs

arXiv - AI · 4 min ·
[2603.03291] One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models
LLMs

arXiv - AI · 3 min ·
[2603.04390] A Dual-Helix Governance Approach Towards Reliable Agentic AI for WebGIS Development
LLMs

arXiv - AI · 3 min ·
[2603.03686] AI4S-SDS: A Neuro-Symbolic Solvent Design System via Sparse MCTS and Differentiable Physics Alignment
LLMs

arXiv - AI · 4 min ·
[2603.03655] Mozi: Governed Autonomy for Drug Discovery LLM Agents
LLMs

arXiv - AI · 4 min ·
Anthropic CEO Dario Amodei calls OpenAI's messaging around military deal 'straight up lies,' report says | TechCrunch
AI Safety

Anthropic gave up its contract with the Pentagon over AI safety disagreements; then OpenAI swooped in.

TechCrunch - AI · 5 min ·
NLP

Using AI With Deep Knowledge From 37 Academic Books Using Graph RAG to Make 9 Well-Informed Predictions About Our Future. The Analysis is...Bleak.

I'm using this specialized canvas app that lets me build the neurological brain of a chatbot based on connected notes. I added and connec...

Reddit - Artificial Intelligence · 1 min ·
Anthropic’s Break With the Pentagon Ignites AI Ethics Debate
AI Safety

AI Tools & Products · 12 min ·
[2602.04288] Contextual Drag: How Errors in the Context Affect LLM Reasoning
LLMs

arXiv - Machine Learning · 3 min ·
[2511.12832] From Passive to Persuasive: Steering Emotional Nuance in Human-AI Negotiation
LLMs

arXiv - AI · 3 min ·
[2510.13900] Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
LLMs

arXiv - AI · 4 min ·
[2512.05116] Value Gradient Guidance for Flow Matching Alignment
Machine Learning

arXiv - Machine Learning · 3 min ·