AI Safety & Ethics

Alignment, bias, regulation, and responsible AI


Top This Week

LLMs

[R] The Lyra Technique — A framework for interpreting internal cognitive states in LLMs (Zenodo, open access)

We're releasing a paper on a new framework for reading and interpreting the internal cognitive states of large language models: "The Lyra...

Reddit - Machine Learning · 1 min · about 6 hours ago

Machine Learning

[P] If you're building AI agents, logs aren't enough. You need evidence.

I have built a programmable governance layer for AI agents. I am considering open-sourcing it completely. Looking for feedback. Agent demos...

Reddit - Machine Learning · 1 min · about 15 hours ago

AI Safety

[2510.14628] RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

Abstract page for arXiv paper 2510.14628: RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

arXiv - AI · 4 min · about 18 hours ago

All Content

AI Safety

The Download: Unraveling a death threat mystery, and AI voice recreation for musicians | MIT Technology Review

This article discusses two significant stories: a cybersecurity researcher facing death threats from hackers and a musician using AI to r...

MIT Technology Review · 7 min · about 2 months ago

Machine Learning

After spooking Hollywood, ByteDance will tweak safeguards on new AI model | The Verge

ByteDance plans to enhance safeguards for its AI video generator, Seedance 2.0, following copyright infringement allegations from Hollywo...

The Verge - AI · 4 min · about 2 months ago

AI Infrastructure

All the important news from the ongoing India AI Impact Summit | TechCrunch

India's AI Impact Summit gathers leaders from major tech firms and governments to discuss AI investments, innovations, and the future of ...

TechCrunch - AI · 4 min · about 2 months ago

AI Safety

[2602.12039] The Implicit Bias of Logit Regularization

The paper explores the implicit bias introduced by logit regularization in classifiers, demonstrating its effects on weight alignment and...

arXiv - Machine Learning · 3 min · about 2 months ago

Machine Learning

[2602.10538] Why Agentic Theorem Prover Works: A Statistical Provability Theory of Mathematical Reasoning Models

This article explores the effectiveness of agentic theorem provers through a statistical provability theory, analyzing their performance ...

arXiv - Machine Learning · 4 min · about 2 months ago

Machine Learning

[2602.10478] GPU-Fuzz: Finding Memory Errors in Deep Learning Frameworks

GPU-Fuzz introduces a novel approach to identifying memory errors in deep learning frameworks, demonstrating its effectiveness by uncover...

arXiv - Machine Learning · 3 min · about 2 months ago

Machine Learning

[2602.05096] Visual concept ranking uncovers medical shortcuts used by large multimodal models

This article presents a method called Visual Concept Ranking (VCR) to identify visual concepts in large multimodal models, focusing on th...

arXiv - Machine Learning · 3 min · about 2 months ago

Machine Learning

[2601.22983] PIDSMaker: Building and Evaluating Provenance-based Intrusion Detection Systems

PIDSMaker is an open-source framework designed for building and evaluating provenance-based intrusion detection systems (PIDSs), addressi...

arXiv - Machine Learning · 3 min · about 2 months ago

LLMs

[2508.02872] Highlight & Summarize: RAG without the jailbreaks

The paper presents Highlight & Summarize (H&S), a novel design pattern for retrieval-augmented generation (RAG) systems that prevents jai...

arXiv - Machine Learning · 4 min · about 2 months ago

Generative AI

[2506.06027] Sample-Specific Noise Injection For Diffusion-Based Adversarial Purification

This paper introduces Sample-specific Score-aware Noise Injection (SSNI), a novel framework for diffusion-based adversarial purification ...

arXiv - Machine Learning · 4 min · about 2 months ago

LLMs

[2505.19558] PoliCon: Evaluating LLMs on Achieving Diverse Political Consensus Objectives

The paper introduces PoliCon, a benchmark for evaluating large language models (LLMs) in achieving political consensus from diverse party...

arXiv - Machine Learning · 4 min · about 2 months ago

Machine Learning

[2602.05358] Bayesian Neighborhood Adaptation for Graph Neural Networks

This paper presents a Bayesian framework for adapting neighborhood scopes in Graph Neural Networks (GNNs), enhancing their performance in...

arXiv - Machine Learning · 4 min · about 2 months ago

Machine Learning

[2512.09654] Membership and Dataset Inference Attacks on Large Audio Generative Models

This paper explores membership and dataset inference attacks on large audio generative models, assessing their implications for copyright...

arXiv - Machine Learning · 4 min · about 2 months ago

Machine Learning

[2510.11834] Don't Walk the Line: Boundary Guidance for Filtered Generation

The paper presents Boundary Guidance, a reinforcement learning method designed to improve the safety and utility of generative models by ...

arXiv - Machine Learning · 3 min · about 2 months ago

AI Safety

[2507.03168] Adopting a human developmental visual diet yields robust, shape-based AI vision

This article presents a novel approach to AI vision by adopting a human developmental visual diet, enhancing shape recognition and resili...

arXiv - Machine Learning · 4 min · about 2 months ago

AI Safety

[2506.05325] Quasiparticle Interference Kernel Extraction with Variational Autoencoders via Latent Alignment

This article presents an AI-based framework for extracting quasiparticle interference (QPI) kernels from complex scattering images, impro...

arXiv - Machine Learning · 4 min · about 2 months ago

LLMs

[2505.22650] On Learning Verifiers and Implications to Chain-of-Thought Reasoning

This paper explores learning verifiers for Chain-of-Thought reasoning in natural language, addressing the challenges of incorrect inferen...

arXiv - Machine Learning · 4 min · about 2 months ago

Machine Learning

[2505.11846] Learning on a Razor's Edge: Identifiability and Singularity of Polynomial Neural Networks

This paper investigates the identifiability and singularity of polynomial neural networks, focusing on MLPs and CNNs, and explores their ...

arXiv - Machine Learning · 4 min · about 2 months ago

LLMs

[2503.03704] Memory Injection Attacks on LLM Agents via Query-Only Interaction

The paper discusses Memory Injection Attacks (MINJA) on LLM agents, demonstrating how attackers can manipulate agent memory through query...

arXiv - Machine Learning · 4 min · about 2 months ago

Machine Learning

[2602.13168] Realistic Face Reconstruction from Facial Embeddings via Diffusion Models

This paper presents a novel framework for reconstructing realistic high-resolution face images from facial embeddings using diffusion mod...

arXiv - Machine Learning · 3 min · about 2 months ago

