AI Safety & Ethics

Alignment, bias, regulation, and responsible AI

Top This Week

LLMs

This Is Not Hacking. This Is Structured Intelligence.

Watch me demonstrate everything I've been talking about—live, in real time. The Setup: Maestro University AI enrollment system Standard c...

Reddit - Artificial Intelligence · 1 min ·
AI Safety

When Agentic AI Browsers Outrun Governance

Agentic AI browsers introduce new enterprise risk. Learn how AI governance helps leaders assess exposure, oversight gaps, and safe adopti...

AI Tools & Products · 14 min ·
LLMs

Von Hammerstein’s Ghost: What a Prussian General’s Officer Typology Can Teach Us About AI Misalignment

Greetings all - I've posted mostly in r/claudecode and r/aigamedev a couple of times previously. Working with CC for personal projects re...

Reddit - Artificial Intelligence · 1 min ·

All Content

LLMs

[2506.14261] RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?

This article explores RL-Obfuscation, a method for training language models to evade latent-space monitors that detect undesirable behavi...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.23121] Automated Vulnerability Detection in Source Code Using Deep Representation Learning

This article presents a convolutional neural network model designed to automate the detection of vulnerabilities in C source code, achiev...

arXiv - AI · 4 min ·
Machine Learning

[2602.23117] Delving into Adversarial Transferability on Image Classification: Review, Benchmark, and Evaluation

This article reviews adversarial transferability in image classification, proposing a standardized framework for evaluating transfer-base...

arXiv - AI · 3 min ·
Robotics

[2602.23073] Accelerated Online Risk-Averse Policy Evaluation in POMDPs with Theoretical Guarantees and Novel CVaR Bounds

This paper presents a theoretical framework for accelerating risk-averse policy evaluation in partially observable Markov decision proces...

arXiv - AI · 4 min ·
AI Safety

[2602.23070] Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment

This paper presents a novel approach to long-form Bengali Automatic Speech Recognition (ASR) and speaker diarization, introducing a compr...

arXiv - AI · 4 min ·
Machine Learning

[2504.18594] RaPA: Enhancing Transferable Targeted Attacks via Random Parameter Pruning

The paper presents RaPA, a novel approach to enhance transferable targeted attacks in machine learning by utilizing random parameter prun...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2503.10503] Sample Compression for Self Certified Continual Learning

The paper introduces Continual Pick-to-Learn (CoP2L), a method for continual learning that uses sample compression to mitigate catastroph...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2410.12439] Beyond Attribution: Unified Concept-Level Explanations

The paper presents UnCLE, a framework that enhances model-agnostic explanation techniques by integrating concept-based approaches, offeri...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.22935] A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment

This paper presents a robust framework for Bangla Automatic Speech Recognition (ASR) and Speaker Diarization, addressing challenges in pr...

arXiv - AI · 3 min ·
Machine Learning

[2410.10922] Towards Privacy-Guaranteed Label Unlearning in Vertical Federated Learning: Few-Shot Forgetting without Disclosure

This paper introduces a novel method for label unlearning in Vertical Federated Learning (VFL), addressing privacy concerns while maintai...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2404.01877] Procedural Fairness in Machine Learning

This paper explores procedural fairness in machine learning, proposing a new metric for evaluation and methods to enhance fairness withou...

arXiv - Machine Learning · 4 min ·
Machine Learning

[2602.23192] FairQuant: Fairness-Aware Mixed-Precision Quantization for Medical Image Classification

The paper presents FairQuant, a framework for fairness-aware mixed-precision quantization in medical image classification, optimizing bot...

arXiv - Machine Learning · 3 min ·
LLMs

[2602.22790] Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift

The paper introduces Natural Language Declarative Prompting (NLD-P), a governance method for prompt design that addresses challenges pose...

arXiv - AI · 4 min ·
AI Startups

[2602.22775] TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation

The paper introduces TherapyProbe, a methodology for enhancing relational safety in mental health chatbots through adversarial simulation...

arXiv - AI · 3 min ·
Machine Learning

[2602.23085] Q-Tag: Watermarking Quantum Circuit Generative Models

The paper presents Q-Tag, a novel watermarking framework for quantum circuit generative models (QCGMs), addressing the need for secure co...

arXiv - Machine Learning · 4 min ·
LLMs

[2602.23079] Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent

This article introduces a novel LLM agent designed to assess and mitigate deanonymization risks in textual data using a method called SAL...

arXiv - Machine Learning · 3 min ·
Machine Learning

[2602.22740] AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation

The paper presents AMLRIS, a novel training strategy for Referring Image Segmentation (RIS) that enhances object segmentation through ali...

arXiv - AI · 3 min ·
LLMs

[2602.22724] AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

AgentSentry introduces a novel framework to mitigate indirect prompt injection (IPI) in LLM agents, enhancing their security while mainta...

arXiv - AI · 4 min ·
AI Safety

[2602.22710] Same Words, Different Judgments: Modality Effects on Preference Alignment

This study explores how modality affects preference alignment in AI systems, comparing human and synthetic evaluations of audio and text ...

arXiv - AI · 3 min ·
LLMs

[2602.22700] IMMACULATE: A Practical LLM Auditing Framework via Verifiable Computation

The paper presents IMMACULATE, a framework for auditing large language models (LLMs) using verifiable computation to detect economic devi...

arXiv - AI · 3 min ·