Things I got wrong building a confidence evaluator for local LLMs [D]
I've been building **Autodidact**, a local-first AI agent framework. The central piece is a **confidence evaluator** - something that dec...
Seriously, I just audited my stack and realized I’m spending more on rotating residential proxies than I am on the actual Claude and Open...
I’ve been in QA for almost a decade. My mental model for quality was always: given input X, assert output Y. Now I’m on a team that’s shi...
Abstract page for arXiv paper 2603.04545: An LLM-Guided Query-Aware Inference System for GNN Models on Large Knowledge Graphs
Abstract page for arXiv paper 2603.04478: Standing on the Shoulders of Giants: Rethinking EEG Foundation Model Pretraining via Multi-Teac...
Abstract page for arXiv paper 2602.07075: LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning
Abstract page for arXiv paper 2601.23236: YuriiFormer: A Suite of Nesterov-Accelerated Transformers
Abstract page for arXiv paper 2601.21149: Mobility-Embedded POIs: Learning What A Place Is and How It Is Used from Human Movement
Abstract page for arXiv paper 2601.16333: Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextuall...
Abstract page for arXiv paper 2601.14327: Yuan3.0 Ultra: A Trillion-Parameter Enterprise-Oriented MoE LLM
Abstract page for arXiv paper 2601.11527: "What if she doesn't feel the same?" What Happens When We Ask AI for Relationship Advice
Abstract page for arXiv paper 2601.11063: EmboTeam: Grounding LLM Reasoning into Reactive Behavior Trees via PDDL for Embodied Multi-Robo...
Abstract page for arXiv paper 2601.08393: Controlled LLM Training on Spectral Sphere
Abstract page for arXiv paper 2601.04548: Identifying Good and Bad Neurons for Task-Level Controllable LLMs
Abstract page for arXiv paper 2601.02663: When Do Tools and Planning Help Large Language Models Think? A Cost- and Latency-Aware Benchmark
Abstract page for arXiv paper 2512.15163: MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP...
Abstract page for arXiv paper 2512.14391: RePo: Language Models with Context Re-Positioning
Abstract page for arXiv paper 2512.13586: ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
Abstract page for arXiv paper 2511.21399: Steering Awareness: Models Can Be Trained to Detect Activation Steering
Abstract page for arXiv paper 2511.16786: Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach
Abstract page for arXiv paper 2511.03153: RefAgent: A Multi-agent LLM-based Framework for Automatic Software Refactoring
Abstract page for arXiv paper 2511.01870: CytoNet: A Foundation Model for the Human Cerebral Cortex at Cellular Resolution
Abstract page for arXiv paper 2510.27173: FMint-SDE: A Multimodal Foundation Model for Accelerating Numerical Simulation of SDEs via Erro...