AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

Machine Learning

[R] I trained a 3k parameter model on XOR sequences of length 20. It extrapolates perfectly to length 1,000,000. Here's why I think that's architecturally significant.

I've been working on an alternative to attention-based sequence modeling that I'm calling Geometric Flow Networks (GFN). The core idea: i...

Reddit - Machine Learning · 1 min ·
Machine Learning

[D] Data curation and targeted replacement as a pre-training alignment and controllability method

Hi, r/MachineLearning: has much research been done in large-scale training scenarios where undesirable data has been replaced before trai...

Reddit - Machine Learning · 1 min ·
AI Safety

I’ve come up with a new thought experiment to approach ASI, and it challenges the very notions of alignment and containment

I’ve written an essay exploring what I’m calling the Super-Intelligent Octopus Problem—a thought experiment designed to surface a paradox...

Reddit - Artificial Intelligence · 1 min ·

All Content

[2603.03507] Solving adversarial examples requires solving exponential misalignment
Machine Learning

arXiv - Machine Learning · 4 min ·
[2603.03326] Controllable and explainable personality sliders for LLMs at inference time
LLMs

arXiv - AI · 3 min ·
[2603.03469] Biased Generalization in Diffusion Models
Machine Learning

arXiv - Machine Learning · 4 min ·
[2603.03324] Controlling Chat Style in Language Models via Single-Direction Editing
LLMs

arXiv - AI · 3 min ·
[2603.03319] Automated Concept Discovery for LLM-as-a-Judge Preference Analysis
LLMs

arXiv - AI · 4 min ·
[2603.03312] Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text Decoding
Machine Learning

arXiv - AI · 4 min ·
[2603.03308] Old Habits Die Hard: How Conversational History Geometrically Traps LLMs
LLMs

arXiv - AI · 3 min ·
[2603.03303] HumanLM: Simulating Users with State Alignment Beats Response Imitation
LLMs

arXiv - AI · 4 min ·
[2603.03298] TATRA: Training-Free Instance-Adaptive Prompting Through Rephrasing and Aggregation
LLMs

arXiv - AI · 4 min ·
[2603.03291] One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models
LLMs

arXiv - AI · 3 min ·
[2603.04390] A Dual-Helix Governance Approach Towards Reliable Agentic AI for WebGIS Development
LLMs

arXiv - AI · 3 min ·
[2603.03686] AI4S-SDS: A Neuro-Symbolic Solvent Design System via Sparse MCTS and Differentiable Physics Alignment
LLMs

arXiv - AI · 4 min ·
[2603.03655] Mozi: Governed Autonomy for Drug Discovery LLM Agents
LLMs

arXiv - AI · 4 min ·
Anthropic CEO Dario Amodei calls OpenAI's messaging around military deal 'straight up lies,' report says | TechCrunch
AI Safety

Anthropic gave up its contract with the Pentagon over AI safety disagreements; then OpenAI swooped in.

TechCrunch - AI · 5 min ·
NLP

Using AI With Deep Knowledge From 37 Academic Books Using Graph RAG to Make 9 Well-Informed Predictions About Our Future. The Analysis is...Bleak.

I'm using this specialized canvas app that lets me build the neurological brain of a chatbot based on connected notes. I added and connec...

Reddit - Artificial Intelligence · 1 min ·
Anthropic’s Break With the Pentagon Ignites AI Ethics Debate
AI Safety

AI Tools & Products · 12 min ·
[2602.04288] Contextual Drag: How Errors in the Context Affect LLM Reasoning
LLMs

arXiv - Machine Learning · 3 min ·
[2511.12832] From Passive to Persuasive: Steering Emotional Nuance in Human-AI Negotiation
LLMs

arXiv - AI · 3 min ·
[2510.13900] Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
LLMs

arXiv - AI · 4 min ·
[2512.05116] Value Gradient Guidance for Flow Matching Alignment
Machine Learning

arXiv - Machine Learning · 3 min ·