AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

AI Safety

China drafts law regulating 'digital humans' and banning addictive virtual services for children

A Reuters report outlines China's proposed regulations on the rapidly expanding sector of digital humans and AI avatars. Under the new dr...

Reddit - Artificial Intelligence · 1 min ·
Generative AI

[2512.00408] Low-Bitrate Video Compression through Semantic-Conditioned Diffusion

Abstract page for arXiv paper 2512.00408: Low-Bitrate Video Compression through Semantic-Conditioned Diffusion

arXiv - AI · 3 min ·
LLMs

[2510.15148] XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

Abstract page for arXiv paper 2510.15148: XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

arXiv - AI · 4 min ·

All Content

LLMs

[2509.23519] ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search

The paper introduces ReliabilityRAG, a framework designed to enhance the robustness of Retrieval-Augmented Generation (RAG) systems again...

arXiv - AI · 4 min ·
Machine Learning

[2509.22794] Differentially Private Two-Stage Gradient Descent for Instrumental Variable Regression

This paper presents a novel algorithm for instrumental variable regression that ensures differential privacy while maintaining statistica...

arXiv - Machine Learning · 4 min ·
LLMs

[2509.18776] AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field

The paper introduces AECBench, a benchmark for evaluating large language models (LLMs) in the Architecture, Engineering, and Construction...

arXiv - Machine Learning · 4 min ·
AI Startups

[2504.09733] Epsilon-Neighborhood Decision-Boundary Governed Estimation (EDGE) of 2D Black Box Classifier Functions

The paper presents the Epsilon-Neighborhood Decision-Boundary Governed Estimation (EDGE) algorithm for efficiently estimating decision bo...

arXiv - Machine Learning · 4 min ·
LLMs

[2503.00187] Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks

This paper presents a safety steering framework to enhance the robustness of large language models (LLMs) against multi-turn jailbreaking...

arXiv - Machine Learning · 4 min ·
AI Safety

[2502.01713] Auditing a Dutch Public Sector Risk Profiling Algorithm Using an Unsupervised Bias Detection Tool

This article presents an audit of a Dutch public sector risk profiling algorithm, utilizing an unsupervised bias detection tool to identi...

arXiv - Machine Learning · 4 min ·
LLMs

[2509.05311] Large Language Model Integration with Reinforcement Learning to Augment Decision-Making in Autonomous Cyber Operations

This article explores the integration of Large Language Models (LLMs) with Reinforcement Learning (RL) to enhance decision-making in auto...

arXiv - Machine Learning · 3 min ·
Robotics

[2411.12159] Sensor-fusion based Prognostics for Deep-space Habitats Exhibiting Multiple Unlabeled Failure Modes

This paper presents a novel unsupervised prognostics framework for deep-space habitats, addressing multiple unlabeled failure modes throu...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2508.19278] Towards Production-Worthy Simulation for Autonomous Cyber Operations

This article presents a framework for enhancing simulation environments in Autonomous Cyber Operations (ACO) by implementing new actions ...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2409.11972] Generation of Uncertainty-Aware High-Level Spatial Concepts in Factorized 3D Scene Graphs via Graph Neural Networks

This paper introduces a learning-based approach for generating uncertainty-aware high-level spatial concepts in 3D Scene Graphs, enhancin...

arXiv - Machine Learning · 4 min ·
NLP

[2508.03882] Simulating Cyberattacks through a Breach Attack Simulation (BAS) Platform empowered by Security Chaos Engineering (SCE)

This article presents a novel approach to simulating cyberattacks by integrating Security Chaos Engineering (SCE) into Breach Attack Simu...

arXiv - AI · 3 min ·
Machine Learning

[2602.11712] Potential-energy gating for robust state estimation in bistable stochastic systems

This article presents a novel method called potential-energy gating for robust state estimation in bistable stochastic systems, enhancing...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.11079] In-the-Wild Model Organisms: Mitigating Undesirable Emergent Behaviors in Production LLM Post-Training via Data Attribution

The paper introduces activation-based data attribution to identify and mitigate undesirable behaviors in production language models post-...

arXiv - AI · 3 min ·
Machine Learning

[2602.09238] Feature salience -- not task-informativeness -- drives machine learning model explanations

This paper investigates the factors influencing feature importance in machine learning model explanations, emphasizing that feature salie...

arXiv - Machine Learning · 4 min ·
LLMs

[2506.11526] Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis

This survey explores the role of foundation models in enhancing scenario generation and analysis for autonomous driving, addressing limit...

arXiv - AI · 4 min ·
Robotics

[2602.08655] From Robotics to Sepsis Treatment: Offline RL via Geometric Pessimism

This article presents a novel framework called Geometric Pessimism for Offline Reinforcement Learning (RL), enhancing performance in robo...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.06801] On the Non-Identifiability of Steering Vectors in Large Language Models

This paper explores the non-identifiability of steering vectors in large language models (LLMs), revealing that these vectors cannot be u...

arXiv - AI · 3 min ·
LLMs

[2602.06130] Self-Improving World Modelling with Latent Actions

The paper presents SWIRL, a framework for self-improving world modeling in machine learning, focusing on latent actions to enhance predic...

arXiv - AI · 4 min ·
LLMs

[2506.04051] High Accuracy, Less Talk (HALT): Reliable LLMs through Capability-Aligned Finetuning

The paper presents HALT, a method for finetuning large language models (LLMs) to enhance reliability by generating responses only when co...

arXiv - AI · 4 min ·
AI Safety

[2602.05119] Unbiased Single-Queried Gradient for Combinatorial Objective

This paper presents a novel stochastic gradient method for combinatorial optimization that requires only a single query, enhancing effici...

arXiv - Machine Learning · 3 min ·