AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

[2512.21106] Semantic Refinement with LLMs for Graph Representations
LLMs

Abstract page for arXiv paper 2512.21106: Semantic Refinement with LLMs for Graph Representations

arXiv - Machine Learning · 4 min
[2511.22294] Structure is Supervision: Multiview Masked Autoencoders for Radiology
Machine Learning

Abstract page for arXiv paper 2511.22294: Structure is Supervision: Multiview Masked Autoencoders for Radiology

arXiv - Machine Learning · 4 min
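
The multiview, structure-as-supervision design is the paper's own contribution; as background on the masked-autoencoder pretext task it builds on, here is a minimal NumPy sketch of random patch masking, where an encoder sees only the visible tokens and a decoder reconstructs the rest. All names and shapes are illustrative assumptions.

```python
import numpy as np

def random_patch_mask(patches: np.ndarray, mask_ratio: float = 0.75, seed: int = 0):
    """Split patch tokens into visible and masked sets, MAE-style.

    patches: (num_patches, dim) array of patch embeddings. The encoder sees
    only the visible tokens; the decoder is trained to reconstruct the rest.
    """
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    num_visible = max(1, int(round(n * (1.0 - mask_ratio))))
    perm = rng.permutation(n)
    visible_idx = np.sort(perm[:num_visible])
    masked_idx = np.sort(perm[num_visible:])
    return patches[visible_idx], visible_idx, masked_idx

# Toy usage: 196 patches (a 14x14 grid) of dimension 64, mask 75%.
patches = np.random.default_rng(1).normal(size=(196, 64))
visible, vis_idx, mask_idx = random_patch_mask(patches)
assert len(vis_idx) + len(mask_idx) == 196
```
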
[2511.18123] Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models
LLMs

Abstract page for arXiv paper 2511.18123: Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-La...

arXiv - Machine Learning · 4 min
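
The title's claim, that bias occupies a subspace rather than a single coordinate, has a direct geometric reading. As a hedged sketch of generic subspace-level post-hoc debiasing (not the paper's algorithm), the code below estimates a k-dimensional bias subspace from several bias direction vectors via SVD and projects embeddings onto its orthogonal complement. All names and shapes are assumptions.

```python
import numpy as np

def bias_subspace(direction_vectors: np.ndarray, k: int) -> np.ndarray:
    """Estimate a k-dim bias subspace from stacked bias direction vectors.

    direction_vectors: (n, d) array, e.g. differences of paired embeddings.
    Returns a (k, d) orthonormal basis: the top-k right singular vectors
    span the dominant bias subspace rather than a single coordinate.
    """
    _, _, vt = np.linalg.svd(direction_vectors, full_matrices=False)
    return vt[:k]

def debias(embeddings: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project embeddings onto the orthogonal complement of the bias subspace."""
    # Subtract each embedding's component lying in span(basis).
    return embeddings - (embeddings @ basis.T) @ basis

# Toy usage: 5 bias directions in a 16-dim embedding space, keep top 2.
rng = np.random.default_rng(0)
dirs = rng.normal(size=(5, 16))
basis = bias_subspace(dirs, k=2)
x = rng.normal(size=(3, 16))
x_debiased = debias(x, basis)
# Components along the bias subspace are now (numerically) zero.
assert np.allclose(x_debiased @ basis.T, 0.0, atol=1e-10)
```
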

All Content

[2602.20468] CGSTA: Cross-Scale Graph Contrast with Stability-Aware Alignment for Multivariate Time-Series Anomaly Detection
AI Safety

The CGSTA framework enhances multivariate time-series anomaly detection by utilizing dynamic layered graphs and stability-aware alignment...

arXiv - Machine Learning · 4 min
[2602.20457] Oracle-Robust Online Alignment for Large Language Models
LLMs

This paper explores the online alignment of large language models (LLMs) under misspecified preference feedback, proposing a robust optim...

arXiv - Machine Learning · 3 min
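
The paper's robust optimizer is its own contribution; as background, online alignment methods of this kind typically start from the Bradley-Terry preference model, sketched below with a plain SGD update over a stream of (chosen, rejected) pairs. The linear reward model and all names are illustrative assumptions.

```python
import numpy as np

def bt_loss_and_grad(w, x_chosen, x_rejected):
    """Bradley-Terry preference loss for a linear reward r(x) = w @ x.

    P(chosen beats rejected) = sigmoid(r_chosen - r_rejected);
    we minimize the negative log-likelihood over a batch of pairs.
    """
    margin = x_chosen @ w - x_rejected @ w            # (batch,)
    p = 1.0 / (1.0 + np.exp(-margin))                 # win probabilities
    loss = -np.log(p + 1e-12).mean()
    # d/dw of -log sigmoid(margin) = -(1 - p) * (x_chosen - x_rejected)
    grad = -((1.0 - p)[:, None] * (x_chosen - x_rejected)).mean(axis=0)
    return loss, grad

# Online loop over a stream of preference pairs (toy data).
rng = np.random.default_rng(0)
w_true = rng.normal(size=8)                           # latent "true" reward
w = np.zeros(8)
for step in range(500):
    xc, xr = rng.normal(size=(2, 4, 8))               # batch of 4 pairs
    # Simulate preference feedback: relabel so xc is preferred under w_true.
    flip = xc @ w_true < xr @ w_true
    xc[flip], xr[flip] = xr[flip].copy(), xc[flip].copy()
    _, g = bt_loss_and_grad(w, xc, xr)
    w -= 0.1 * g                                      # SGD step
```
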
[2602.20419] CREDIT: Certified Ownership Verification of Deep Neural Networks Against Model Extraction Attacks
Machine Learning

The paper introduces CREDIT, a method for certified ownership verification of deep neural networks to combat model extraction attacks, en...

arXiv - Machine Learning · 3 min
[2602.20418] CITED: A Decision Boundary-Aware Signature for GNNs Towards Model Extraction Defense
Machine Learning

The paper presents CITED, a novel framework for defending Graph Neural Networks (GNNs) against Model Extraction Attacks (MEAs) by providi...

arXiv - Machine Learning · 4 min
[2602.20273] The Truthfulness Spectrum Hypothesis
LLMs

The Truthfulness Spectrum Hypothesis explores how large language models (LLMs) represent truthfulness across various domains, revealing a...

arXiv - Machine Learning · 4 min
[2602.20194] FedAvg-Based CTMC Hazard Model for Federated Bridge Deterioration Assessment
Machine Learning

This article presents a federated framework using a CTMC hazard model for assessing bridge deterioration, allowing municipalities to coll...

arXiv - Machine Learning · 4 min
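
For context on the FedAvg half of the title (the CTMC hazard model is the paper's own), here is a minimal sketch of federated averaging: each client, here a municipality, fits parameters on local data, and a server averages them weighted by local sample counts. The flat parameter-vector representation and all names are assumptions.

```python
import numpy as np

def fedavg(client_params: list[np.ndarray], client_sizes: list[int]) -> np.ndarray:
    """One FedAvg round: sample-size-weighted average of client parameters."""
    weights = np.asarray(client_sizes, dtype=float)
    weights = weights / weights.sum()
    return np.sum([w * p for w, p in zip(weights, client_params)], axis=0)

# Toy round: three municipalities with locally fitted parameter vectors.
params = [np.array([0.10, 0.05]), np.array([0.12, 0.04]), np.array([0.08, 0.06])]
sizes = [1200, 300, 500]  # e.g. counts of local bridge inspection records
global_params = fedavg(params, sizes)  # broadcast back to clients next round
```
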
[2602.11184] KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models
LLMs

The paper presents KBVQ-MoE, a novel framework for improving vector quantization in Mixture of Experts (MoE) large language models, addre...

arXiv - Machine Learning · 4 min
[2602.09050] SAS-Net: Scene-Appearance Separation Network for Robust Spatiotemporal Registration in Bidirectional Photoacoustic Microscopy
AI Safety

The paper introduces SAS-Net, a novel framework for robust spatiotemporal registration in bidirectional photoacoustic microscopy, address...

arXiv - AI · 4 min
[2602.00044] When LLMs Imagine People: A Human-Centered Persona Brainstorm Audit for Bias and Fairness in Creative Applications
LLMs

This paper introduces the Persona Brainstorm Audit (PBA), a method for assessing bias in Large Language Models (LLMs) used in creative ap...

arXiv - AI · 4 min
[2601.03868] What Matters For Safety Alignment?
LLMs

This paper investigates safety alignment in large language models (LLMs) and large reasoning models (LRMs), identifying key factors that ...

arXiv - AI · 4 min
[2512.24787] HiGR: Efficient Generative Slate Recommendation via Hierarchical Planning and Multi-Objective Preference Alignment
Machine Learning

The paper presents HiGR, a novel framework for generative slate recommendation that enhances efficiency and user preference alignment thr...

arXiv - AI · 4 min
[2512.16602] Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
LLMs

The paper introduces Refusal Steering, a method for controlling Large Language Models' refusal behavior on sensitive topics without retra...

arXiv - AI · 4 min
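
Refusal Steering's exact procedure is in the paper; the sketch below shows only the generic activation-steering pattern such methods build on: add a scaled steering vector (e.g. a difference of mean activations between refusing and complying prompts) to one layer's hidden states at inference time. The PyTorch hook mechanism is standard; the model and layer names in the usage comment are assumptions.

```python
import torch

def make_steering_hook(vector: torch.Tensor, alpha: float):
    """Forward hook that adds alpha * vector to a transformer layer's output."""
    def hook(module, inputs, output):
        # Many transformer blocks return a tuple whose first element is the
        # hidden states; handle both tuple and tensor outputs.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * vector.to(hidden.dtype).to(hidden.device)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Illustrative usage (names are assumptions, not a specific library's API):
#   vector = mean_refusal_acts - mean_comply_acts  # difference-of-means direction
#   handle = model.layers[12].register_forward_hook(make_steering_hook(vector, -1.0))
#   ...generate; a negative alpha pushes activations away from refusal...
#   handle.remove()
```
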
[2510.22620] Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents
LLMs

This article evaluates the security of large language models (LLMs) used in AI agents, introducing a framework for identifying vulnerabil...

arXiv - Machine Learning · 4 min
[2510.22500] Towards Scalable Oversight via Partitioned Human Supervision
Machine Learning

The paper proposes a scalable oversight framework for AI systems using partitioned human supervision, addressing challenges in obtaining ...

arXiv - Machine Learning · 4 min
[2510.08091] Everything is Plausible: Investigating the Impact of LLM Rationales on Human Notions of Plausibility
LLMs

This article explores how rationales generated by large language models (LLMs) influence human judgments of plausibility in commonsense r...

arXiv - AI · 3 min
[2510.06868] Multi-hop Deep Joint Source-Channel Coding with Deep Hash Distillation for Semantically Aligned Image Recovery
Machine Learning

This paper presents a novel approach to image transmission using multi-hop deep joint source-channel coding (DeepJSCC) combined with deep...

arXiv - Machine Learning · 3 min
[2510.00037] On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations
Machine Learning

This paper evaluates the robustness of Vision-Language-Action (VLA) models against various multi-modal perturbations, proposing a new met...

arXiv - AI · 4 min
[2509.25774] PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models
Machine Learning

The paper introduces Proportionate Credit Policy Optimization (PCPO), a novel framework aimed at improving the stability and quality of t...

arXiv - Machine Learning · 3 min
[2506.03922] HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
LLMs

HSSBench introduces a benchmark for evaluating Multimodal Large Language Models (MLLMs) in Humanities and Social Sciences, addressing gap...

arXiv - AI · 4 min
[2504.18310] How much does context affect the accuracy of AI health advice?
LLMs

This article examines how linguistic and contextual factors influence the accuracy of AI-generated health advice, revealing significant d...

arXiv - Machine Learning · 4 min