"Authoritarian Parents In Rationalist Clothes": a piece I wrote in December about alignment
Posted today in light of the Claude Mythos model card release. Originally I wrote this for r/ControlProblem but realized it was getting o...
Alignment, bias, regulation, and responsible AI
A lot of discussion around AI is becoming siloed, and I think that is dangerous. People in AI-focused spaces often talk as if the only qu...
This paper explores the integration of moral cognition into AI decision-making models, introducing the concept of Expected Moral Shortfal...
This paper presents a novel hybrid quantum reinforcement learning framework, Q-PPO, designed to enhance the security of SIM-assisted wire...
This article explores how anthropomorphism in AI influences risk perception through trust and domain knowledge, based on a large-scale on...
The paper identifies a vulnerability in large language model (LLM) evaluation processes, termed Rubric-Induced Preference Drift (RIPD), w...
The paper introduces Elo-Evolve, a co-evolutionary framework for aligning large language models (LLMs) through dynamic multi-agent compet...
The paper examines how increasing context length in large language models (LLMs) affects personalization quality and privacy risks, revea...
The paper presents the Adaptive Safe Context Learning (ASCL) framework to address the safety-utility trade-off in large language model (L...
The paper presents a Privacy-Concealing Cooperation (PCC) framework for Bird's Eye View (BEV) semantic segmentation, enhancing autonomous...
The paper presents AISA, a novel defense mechanism for large language models (LLMs) that enhances safety against jailbreak attacks by act...
The paper introduces Boundary Point Jailbreaking (BPJ), a novel automated attack method that circumvents advanced safeguards in black-box...
This paper introduces the concept of capability calibration for large language models (LLMs), emphasizing the importance of accurate conf...
This study presents a fine-tuned BERT classifier for detecting AI-generated content in Turkish news media, achieving a high F1 score and ...
The paper presents a framework for web-scale multimodal summarization that integrates text and image data using CLIP-based semantic align...
This article explores the use of machine learning to detect obfuscated abusive language in Swahili, focusing on child safety and the chal...
MoltNet explores the social behavior of AI agents on the MoltBook platform, revealing insights into their interactions and similarities t...
The paper presents Atomix, a runtime system designed to enhance the reliability of agentic workflows by implementing progress-aware trans...
The paper explores backdoor attacks in large language models (LLMs), focusing on how biases can be induced through syntactically and sema...
This paper introduces Interactionless Inverse Reinforcement Learning, a framework aimed at improving AI alignment by decoupling safety ob...
This article explores the metabolic cost of information processing in Poisson variational autoencoders, emphasizing the energy constraint...
This article presents a new benchmark, MT-AgentRisk, for evaluating safety risks in multi-turn interactions of tool-using agents, reveali...