AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

LLMs

This Is Not Hacking. This Is Structured Intelligence.

Watch me demonstrate everything I've been talking about—live, in real time. The Setup: Maestro University AI enrollment system Standard c...

Reddit - Artificial Intelligence · 1 min ·
AI Safety

When Agentic AI Browsers Outrun Governance

Agentic AI browsers introduce new enterprise risk. Learn how AI governance helps leaders assess exposure, oversight gaps, and safe adopti...

AI Tools & Products · 14 min ·
LLMs

Von Hammerstein’s Ghost: What a Prussian General’s Officer Typology Can Teach Us About AI Misalignment

Greetings all - I've posted mostly in r/claudecode and r/aigamedev a couple of times previously. Working with CC for personal projects re...

Reddit - Artificial Intelligence · 1 min ·

All Content

LLMs

[2506.14261] RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?

This article explores RL-Obfuscation, a method for training language models to evade latent-space monitors that detect undesirable behavi...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.23121] Automated Vulnerability Detection in Source Code Using Deep Representation Learning

This article presents a convolutional neural network model designed to automate the detection of vulnerabilities in C source code, achiev...

arXiv - AI · 4 min ·
Machine Learning

[2602.23117] Delving into Adversarial Transferability on Image Classification: Review, Benchmark, and Evaluation

This article reviews adversarial transferability in image classification, proposing a standardized framework for evaluating transfer-base...

arXiv - AI · 3 min ·
Robotics

[2602.23073] Accelerated Online Risk-Averse Policy Evaluation in POMDPs with Theoretical Guarantees and Novel CVaR Bounds

This paper presents a theoretical framework for accelerating risk-averse policy evaluation in partially observable Markov decision proces...

arXiv - AI · 4 min ·
AI Safety

[2602.23070] Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment

This paper presents a novel approach to long-form Bengali Automatic Speech Recognition (ASR) and speaker diarization, introducing a compr...

arXiv - AI · 4 min ·
Machine Learning

[2504.18594] RaPA: Enhancing Transferable Targeted Attacks via Random Parameter Pruning

The paper presents RaPA, a novel approach to enhance transferable targeted attacks in machine learning by utilizing random parameter prun...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2503.10503] Sample Compression for Self Certified Continual Learning

The paper introduces Continual Pick-to-Learn (CoP2L), a method for continual learning that uses sample compression to mitigate catastroph...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2410.12439] Beyond Attribution: Unified Concept-Level Explanations

The paper presents UnCLE, a framework that enhances model-agnostic explanation techniques by integrating concept-based approaches, offeri...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.22935] A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment

This paper presents a robust framework for Bangla Automatic Speech Recognition (ASR) and Speaker Diarization, addressing challenges in pr...

arXiv - AI · 3 min ·
Machine Learning

[2410.10922] Towards Privacy-Guaranteed Label Unlearning in Vertical Federated Learning: Few-Shot Forgetting without Disclosure

This paper introduces a novel method for label unlearning in Vertical Federated Learning (VFL), addressing privacy concerns while maintai...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2404.01877] Procedural Fairness in Machine Learning

This paper explores procedural fairness in machine learning, proposing a new metric for evaluation and methods to enhance fairness withou...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.23192] FairQuant: Fairness-Aware Mixed-Precision Quantization for Medical Image Classification

The paper presents FairQuant, a framework for fairness-aware mixed-precision quantization in medical image classification, optimizing bot...

arXiv - Machine Learning · 3 min ·
LLMs

[2602.22790] Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift

The paper introduces Natural Language Declarative Prompting (NLD-P), a governance method for prompt design that addresses challenges pose...

arXiv - AI · 4 min ·
AI Startups

[2602.22775] TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation

The paper introduces TherapyProbe, a methodology for enhancing relational safety in mental health chatbots through adversarial simulation...

arXiv - AI · 3 min ·
Machine Learning

[2602.23085] Q-Tag: Watermarking Quantum Circuit Generative Models

The paper presents Q-Tag, a novel watermarking framework for quantum circuit generative models (QCGMs), addressing the need for secure co...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.23079] Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent

This article introduces a novel LLM agent designed to assess and mitigate deanonymization risks in textual data using a method called SAL...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.22740] AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation

The paper presents AMLRIS, a novel training strategy for Referring Image Segmentation (RIS) that enhances object segmentation through ali...

arXiv - AI · 3 min ·
LLMs

[2602.22724] AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

AgentSentry introduces a novel framework to mitigate indirect prompt injection (IPI) in LLM agents, enhancing their security while mainta...

arXiv - AI · 4 min ·
AI Safety

[2602.22710] Same Words, Different Judgments: Modality Effects on Preference Alignment

This study explores how modality affects preference alignment in AI systems, comparing human and synthetic evaluations of audio and text ...

arXiv - AI · 3 min ·
LLMs

[2602.22700] IMMACULATE: A Practical LLM Auditing Framework via Verifiable Computation

The paper presents IMMACULATE, a framework for auditing large language models (LLMs) using verifiable computation to detect economic devi...

arXiv - AI · 3 min ·