AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

India advancing AI-ready public data infrastructure for smarter governance
AI Safety

India has reported significant progress in developing artificial intelligence (AI)-ready public data infrastructure, with a range of digi...

AI News - General · 2 min ·
Machine Learning

[R] I trained a 3k parameter model on XOR sequences of length 20. It extrapolates perfectly to length 1,000,000. Here's why I think that's architecturally significant.

I've been working on an alternative to attention-based sequence modeling that I'm calling Geometric Flow Networks (GFN). The core idea: i...

Reddit - Machine Learning · 1 min ·
Machine Learning

[D] Data curation and targeted replacement as a pre-training alignment and controllability method

Hi, r/MachineLearning: has much research been done in large-scale training scenarios where undesirable data has been replaced before trai...

Reddit - Machine Learning · 1 min ·

All Content

[2603.04986] Debiasing Sequential Recommendation with Time-aware Inverse Propensity Scoring
Machine Learning

Abstract page for arXiv paper 2603.04986: Debiasing Sequential Recommendation with Time-aware Inverse Propensity Scoring

arXiv - AI · 4 min ·
[2603.04976] 3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding
LLMs

Abstract page for arXiv paper 2603.04976: 3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding

arXiv - AI · 4 min ·
[2603.04968] When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger
LLMs

Abstract page for arXiv paper 2603.04968: When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger

arXiv - AI · 3 min ·
[2603.04905] Deterministic Preprocessing and Interpretable Fuzzy Banding for Cost-per-Student Reporting from Extracted Records
AI Safety

Abstract page for arXiv paper 2603.04905: Deterministic Preprocessing and Interpretable Fuzzy Banding for Cost-per-Student Reporting from...

arXiv - AI · 4 min ·
[2603.04676] Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks
LLMs

Abstract page for arXiv paper 2603.04676: Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks

arXiv - AI · 3 min ·
[2603.04421] Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis?
LLMs

Abstract page for arXiv paper 2603.04421: Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis?

arXiv - AI · 3 min ·
[2603.04410] SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models
LLMs

Abstract page for arXiv paper 2603.04410: SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models

arXiv - AI · 4 min ·
[2603.04407] Semantic Containment as a Fundamental Property of Emergent Misalignment
LLMs

Abstract page for arXiv paper 2603.04407: Semantic Containment as a Fundamental Property of Emergent Misalignment

arXiv - AI · 3 min ·
[2603.05485] Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation
LLMs

Abstract page for arXiv paper 2603.05485: Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation

arXiv - AI · 3 min ·
[2603.05295] WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces
NLP

Abstract page for arXiv paper 2603.05295: WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces

arXiv - AI · 3 min ·
[2603.05040] Enhancing Zero-shot Commonsense Reasoning by Integrating Visual Knowledge via Machine Imagination
LLMs

Abstract page for arXiv paper 2603.05040: Enhancing Zero-shot Commonsense Reasoning by Integrating Visual Knowledge via Machine Imagination

arXiv - AI · 3 min ·
[2603.05027] S5-SHB Agent: Society 5.0 enabled Multi-model Agentic Blockchain Framework for Smart Home
Machine Learning

Abstract page for arXiv paper 2603.05027: S5-SHB Agent: Society 5.0 enabled Multi-model Agentic Blockchain Framework for Smart Home

arXiv - AI · 4 min ·
[2603.04904] Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
LLMs

Abstract page for arXiv paper 2603.04904: Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in ...

arXiv - AI · 4 min ·
[2603.04837] Design Behaviour Codes (DBCs): A Taxonomy-Driven Layered Governance Benchmark for Large Language Models
LLMs

Abstract page for arXiv paper 2603.04837: Design Behaviour Codes (DBCs): A Taxonomy-Driven Layered Governance Benchmark for Large Languag...

arXiv - AI · 4 min ·
[2603.04822] VISA: Value Injection via Shielded Adaptation for Personalized LLM Alignment
LLMs

Abstract page for arXiv paper 2603.04822: VISA: Value Injection via Shielded Adaptation for Personalized LLM Alignment

arXiv - AI · 4 min ·
[2603.04746] Visioning Human-Agentic AI Teaming: Continuity, Tension, and Future Research
AI Safety

Abstract page for arXiv paper 2603.04746: Visioning Human-Agentic AI Teaming: Continuity, Tension, and Future Research

arXiv - AI · 4 min ·
[2603.04631] Towards automated data analysis: A guided framework for LLM-based risk estimation
LLMs

Abstract page for arXiv paper 2603.04631: Towards automated data analysis: A guided framework for LLM-based risk estimation

arXiv - AI · 3 min ·
[2603.04582] Self-Attribution Bias: When AI Monitors Go Easy on Themselves
LLMs

Abstract page for arXiv paper 2603.04582: Self-Attribution Bias: When AI Monitors Go Easy on Themselves

arXiv - Machine Learning · 4 min ·
[2603.04514] Progressive Refinement Regulation for Accelerating Diffusion Language Model Decoding
LLMs

Abstract page for arXiv paper 2603.04514: Progressive Refinement Regulation for Accelerating Diffusion Language Model Decoding

arXiv - AI · 3 min ·
Pentagon formally designates Anthropic a supply chain risk amid feud over AI guardrails
AI Safety

AI Tools & Products · 5 min ·

