AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

Machine Learning

[R] I trained a 3k parameter model on XOR sequences of length 20. It extrapolates perfectly to length 1,000,000. Here's why I think that's architecturally significant.

I've been working on an alternative to attention-based sequence modeling that I'm calling Geometric Flow Networks (GFN). The core idea: i...

Reddit - Machine Learning · 1 min ·
Machine Learning

[D] Data curation and targeted replacement as a pre-training alignment and controllability method

Hi, r/MachineLearning: has much research been done in large-scale training scenarios where undesirable data has been replaced before trai...

Reddit - Machine Learning · 1 min ·
AI Safety

I’ve come up with a new thought experiment to approach ASI, and it challenges the very notions of alignment and containment

I’ve written an essay exploring what I’m calling the Super-Intelligent Octopus Problem—a thought experiment designed to surface a paradox...

Reddit - Artificial Intelligence · 1 min ·

All Content

Machine Learning

[2603.04340] Balancing Fidelity, Utility, and Privacy in Synthetic Cardiac MRI Generation: A Comparative Study

Abstract page for arXiv paper 2603.04340: Balancing Fidelity, Utility, and Privacy in Synthetic Cardiac MRI Generation: A Comparative Study

arXiv - Machine Learning · 3 min ·
AI Safety

[2603.04113] Understanding Sources of Demographic Predictability in Brain MRI via Disentangling Anatomy and Contrast

Abstract page for arXiv paper 2603.04113: Understanding Sources of Demographic Predictability in Brain MRI via Disentangling Anatomy and ...

arXiv - AI · 4 min ·
LLMs

[2603.04069] Monitoring Emergent Reward Hacking During Generation via Internal Activations

Abstract page for arXiv paper 2603.04069: Monitoring Emergent Reward Hacking During Generation via Internal Activations

arXiv - AI · 4 min ·
Machine Learning

[2603.03785] Observationally Informed Adaptive Causal Experimental Design

Abstract page for arXiv paper 2603.03785: Observationally Informed Adaptive Causal Experimental Design

arXiv - Machine Learning · 4 min ·
Machine Learning

[2603.03989] When Visual Evidence is Ambiguous: Pareidolia as a Diagnostic Probe for Vision Models

Abstract page for arXiv paper 2603.03989: When Visual Evidence is Ambiguous: Pareidolia as a Diagnostic Probe for Vision Models

arXiv - AI · 4 min ·
Generative AI

[2603.03971] Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI

Abstract page for arXiv paper 2603.03971: Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI

arXiv - AI · 4 min ·
Machine Learning

[2603.03405] Surprisal-Rényi Free Energy

Abstract page for arXiv paper 2603.03405: Surprisal-Rényi Free Energy

arXiv - Machine Learning · 3 min ·
AI Safety

[2603.03401] Beyond Cross-Validation: Adaptive Parameter Selection for Kernel-Based Gradient Descents

Abstract page for arXiv paper 2603.03401: Beyond Cross-Validation: Adaptive Parameter Selection for Kernel-Based Gradient Descents

arXiv - Machine Learning · 3 min ·
LLMs

[2603.03915] Rethinking Role-Playing Evaluation: Anonymous Benchmarking and a Systematic Study of Personality Effects

Abstract page for arXiv paper 2603.03915: Rethinking Role-Playing Evaluation: Anonymous Benchmarking and a Systematic Study of Personalit...

arXiv - AI · 3 min ·
AI Safety

[2603.03769] DMD-augmented Unpaired Neural Schrödinger Bridge for Ultra-Low Field MRI Enhancement

Abstract page for arXiv paper 2603.03769: DMD-augmented Unpaired Neural Schrödinger Bridge for Ultra-Low Field MRI Enhancement

arXiv - Machine Learning · 3 min ·
Machine Learning

[2603.04359] Dissecting Quantization Error: A Concentration-Alignment Perspective

Abstract page for arXiv paper 2603.04359: Dissecting Quantization Error: A Concentration-Alignment Perspective

arXiv - AI · 3 min ·
Machine Learning

[2603.03714] Order Is Not Layout: Order-to-Space Bias in Image Generation

Abstract page for arXiv paper 2603.03714: Order Is Not Layout: Order-to-Space Bias in Image Generation

arXiv - AI · 3 min ·
LLMs

[2603.04135] Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization

Abstract page for arXiv paper 2603.04135: Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization

arXiv - AI · 4 min ·
LLMs

[2603.03536] SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems

Abstract page for arXiv paper 2603.03536: SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems

arXiv - AI · 3 min ·
Machine Learning

[2603.03920] BD-Merging: Bias-Aware Dynamic Model Merging with Evidence-Guided Contrastive Learning

Abstract page for arXiv paper 2603.03920: BD-Merging: Bias-Aware Dynamic Model Merging with Evidence-Guided Contrastive Learning

arXiv - AI · 4 min ·
Machine Learning

[2603.03515] The Controllability Trap: A Governance Framework for Military AI Agents

Abstract page for arXiv paper 2603.03515: The Controllability Trap: A Governance Framework for Military AI Agents

arXiv - AI · 3 min ·
AI Safety

[2603.03867] k-hop Fairness: Addressing Disparities in Graph Link Prediction Beyond First-Order Neighborhoods

Abstract page for arXiv paper 2603.03867: k-hop Fairness: Addressing Disparities in Graph Link Prediction Beyond First-Order Neighborhoods

arXiv - Machine Learning · 4 min ·
AI Safety

[2603.03820] Fairness Begins with State: Purifying Latent Preferences for Hierarchical Reinforcement Learning in Interactive Recommendation

Abstract page for arXiv paper 2603.03820: Fairness Begins with State: Purifying Latent Preferences for Hierarchical Reinforcement Learnin...

arXiv - AI · 4 min ·
Machine Learning

[2603.03662] Graph Negative Feedback Bias Correction Framework for Adaptive Heterophily Modeling

Abstract page for arXiv paper 2603.03662: Graph Negative Feedback Bias Correction Framework for Adaptive Heterophily Modeling

arXiv - AI · 4 min ·
Machine Learning

[2603.03341] Ethical and Explainable AI in Reusable MLOps Pipelines

Abstract page for arXiv paper 2603.03341: Ethical and Explainable AI in Reusable MLOps Pipelines

arXiv - AI · 4 min ·
