New in llama.cpp: Model Management
A blog post by ggml.ai on Hugging Face
*Published December 11, 2025 · Xuan-Son Nguyen (ngxson) and Victor Mustar (victor), ggml-org*

llama.cpp server now ships with **router mode**, which lets you dynamically load, unload, and switch between multiple models without restarting.

Reminder: llama.cpp server is a lightweight, OpenAI-compatible HTTP server for running LLMs locally.

This feature was a popular request to bring Ollama-style model management to llama.cpp. It uses a multi-process architecture where each model runs in its own process, so if one model crashes, the others remain unaffected.

## Quick Start

Start the server in router mode by not specifying a model:

```shell
llama-server
```

This auto-discovers models from your llama.cpp cache (`LLAMA_CACHE` or `~/.cache/llama.cpp`). If you've previously downloaded models via `llama-server -hf user/model`, they'll be available automatically.

You can also point to a local directory of GGUF files:

```shell
llama-server --models-dir ./my-models
```

## Features

- **Auto-discovery:** Scans your llama.cpp cache (default) or a custom `--models-dir` folder for GGUF files
- **On-demand loading:** Models load automatically when first requested
- **LRU eviction:** When you hit `--models-max` (default: 4), the least-recently-used model unloads
- **Request routing:** The `model` field in your request determines which model handles it

## Examples

### Chat with a specific model

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '...
```
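The LRU eviction described in the Features section can be sketched in a few lines of Python. This is a simplified illustration of the bookkeeping only (the class name, stubbed "process" handles, and model names are all hypothetical), not llama.cpp's actual multi-process implementation:

```python
from collections import OrderedDict

class ModelRouter:
    """Sketch of on-demand loading with LRU eviction (illustrative only)."""

    def __init__(self, models_max=4):  # --models-max defaults to 4
        self.models_max = models_max
        self.loaded = OrderedDict()    # model name -> loaded handle (stubbed)

    def get(self, name):
        if name in self.loaded:
            # Cache hit: mark this model as most recently used.
            self.loaded.move_to_end(name)
            return self.loaded[name]
        # On-demand load: evict the least-recently-used model at capacity.
        if len(self.loaded) >= self.models_max:
            evicted, _ = self.loaded.popitem(last=False)
            print(f"unloading {evicted}")
        self.loaded[name] = f"<process for {name}>"
        return self.loaded[name]


router = ModelRouter(models_max=2)
router.get("llama-3")
router.get("qwen-2.5")
router.get("llama-3")   # refreshes llama-3's recency
router.get("mistral")   # evicts qwen-2.5, the least recently used
print(list(router.loaded))  # ['llama-3', 'mistral']
```

The upshot is that frequently requested models stay resident while idle ones are the first to be unloaded, keeping memory bounded by `--models-max`.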