AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

Washington needs AI guardrails — now | Opinion
AI Safety

We need legislation that draws clear lines on what AI systems may and may not do on behalf of the United States government.

AI Tools & Products · 3 min
[2601.12910] SciCoQA: Quality Assurance for Scientific Paper–Code Alignment
AI Safety

Abstract page for arXiv paper 2601.12910: SciCoQA: Quality Assurance for Scientific Paper–Code Alignment

arXiv - AI · 3 min
[2509.21385] Debugging Concept Bottleneck Models through Removal and Retraining
Machine Learning

Abstract page for arXiv paper 2509.21385: Debugging Concept Bottleneck Models through Removal and Retraining

arXiv - Machine Learning · 4 min

All Content

Washington needs AI guardrails — now | Opinion
AI Safety

We need legislation that draws clear lines on what AI systems may and may not do on behalf of the United States government.

AI Tools & Products · 3 min
[2601.12910] SciCoQA: Quality Assurance for Scientific Paper–Code Alignment
AI Safety

Abstract page for arXiv paper 2601.12910: SciCoQA: Quality Assurance for Scientific Paper–Code Alignment

arXiv - AI · 3 min
[2509.21385] Debugging Concept Bottleneck Models through Removal and Retraining
Machine Learning

Abstract page for arXiv paper 2509.21385: Debugging Concept Bottleneck Models through Removal and Retraining

arXiv - Machine Learning · 4 min
[2512.00804] Epistemic Bias Injection: Biasing LLMs via Selective Context Retrieval
LLMs

Abstract page for arXiv paper 2512.00804: Epistemic Bias Injection: Biasing LLMs via Selective Context Retrieval

arXiv - AI · 4 min
[2509.24296] DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models
LLMs

Abstract page for arXiv paper 2509.24296: DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models

arXiv - AI · 4 min
[2410.13874] Chain-Oriented Objective Logic with Neural Network Feedback Control and Cascade Filtering for Dynamic Multi-DSL Regulation
Machine Learning

Abstract page for arXiv paper 2410.13874: Chain-Oriented Objective Logic with Neural Network Feedback Control and Cascade Filtering for Dynamic Multi-DSL Regulation

arXiv - Machine Learning · 4 min
[2404.05290] MindSet: Vision. A toolbox for testing DNNs on key psychological experiments
Machine Learning

Abstract page for arXiv paper 2404.05290: MindSet: Vision. A toolbox for testing DNNs on key psychological experiments

arXiv - AI · 4 min
[2512.07885] ByteStorm: a multi-step data-driven approach for Tropical Cyclones detection and tracking
AI Safety

Abstract page for arXiv paper 2512.07885: ByteStorm: a multi-step data-driven approach for Tropical Cyclones detection and tracking

arXiv - Machine Learning · 4 min
[2511.16992] FIRM: Federated In-client Regularized Multi-objective Alignment for Large Language Models
LLMs

Abstract page for arXiv paper 2511.16992: FIRM: Federated In-client Regularized Multi-objective Alignment for Large Language Models

arXiv - Machine Learning · 4 min
[2509.15199] CausalPre: Scalable and Effective Data Pre-Processing for Causal Fairness
Machine Learning

Abstract page for arXiv paper 2509.15199: CausalPre: Scalable and Effective Data Pre-Processing for Causal Fairness

arXiv - Machine Learning · 4 min
[2603.25613] Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification
LLMs

Abstract page for arXiv paper 2603.25613: Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification

arXiv - AI · 4 min
[2603.25740] Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving
Machine Learning

Abstract page for arXiv paper 2603.25740: Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

arXiv - Machine Learning · 4 min
[2603.25423] From Manipulation to Mistrust: Explaining Diverse Micro-Video Misinformation for Robust Debunking in the Wild
Machine Learning

Abstract page for arXiv paper 2603.25423: From Manipulation to Mistrust: Explaining Diverse Micro-Video Misinformation for Robust Debunking in the Wild

arXiv - AI · 4 min
[2603.25466] Residual-as-Teacher: Mitigating Bias Propagation in Student–Teacher Estimation
Machine Learning

Abstract page for arXiv paper 2603.25466: Residual-as-Teacher: Mitigating Bias Propagation in Student–Teacher Estimation

arXiv - Machine Learning · 3 min
[2603.25145] Learning to Rank Caption Chains for Video-Text Alignment
LLMs

Abstract page for arXiv paper 2603.25145: Learning to Rank Caption Chains for Video-Text Alignment

arXiv - Machine Learning · 3 min
[2603.25150] Goodness-of-pronunciation without phoneme time alignment
Machine Learning

Abstract page for arXiv paper 2603.25150: Goodness-of-pronunciation without phoneme time alignment

arXiv - Machine Learning · 3 min
[2603.25140] SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment
Robotics

Abstract page for arXiv paper 2603.25140: SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment

arXiv - Machine Learning · 3 min
[2603.24986] Rethinking Health Agents: From Siloed AI to Collaborative Decision Mediators
LLMs

Abstract page for arXiv paper 2603.24986: Rethinking Health Agents: From Siloed AI to Collaborative Decision Mediators

arXiv - AI · 3 min
[2603.24965] Self-Corrected Image Generation with Explainable Latent Rewards
Generative AI

Abstract page for arXiv paper 2603.24965: Self-Corrected Image Generation with Explainable Latent Rewards

arXiv - AI · 3 min
[2603.24914] Shaping the Future of Mathematics in the Age of AI
AI Safety

Abstract page for arXiv paper 2603.24914: Shaping the Future of Mathematics in the Age of AI

arXiv - AI · 3 min