New in llama.cpp: Model Management
A blog post by ggml.ai on Hugging Face
*Published December 11, 2025 · Xuan-Son Nguyen (ngxson) and Victor Mustar (victor), ggml-org*

llama.cpp server now ships with **router mode**, which lets you dynamically load, unload, and switch between multiple models without restarting.

Reminder: llama.cpp server is a lightweight, OpenAI-compatible HTTP server for running LLMs locally.

This feature was a popular request to bring Ollama-style model management to llama.cpp. It uses a multi-process architecture where each model runs in its own process, so if one model crashes, the others remain unaffected.

## Quick Start

Start the server in router mode by not specifying a model:

```shell
llama-server
```

This auto-discovers models from your llama.cpp cache (`LLAMA_CACHE` or `~/.cache/llama.cpp`). If you've previously downloaded models via `llama-server -hf user/model`, they'll be available automatically.

You can also point to a local directory of GGUF files:

```shell
llama-server --models-dir ./my-models
```

## Features

- **Auto-discovery:** Scans your llama.cpp cache (default) or a custom `--models-dir` folder for GGUF files
- **On-demand loading:** Models load automatically when first requested
- **LRU eviction:** When you hit `--models-max` (default: 4), the least-recently-used model unloads
- **Request routing:** The `model` field in your request determines which model handles it

## Examples

### Chat with a specific model

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '...
```
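The LRU eviction described in the Features section can be sketched in a few lines of Python. This is a simplified illustration of the bookkeeping only (the class name, stubbed "process" handles, and model names are all hypothetical), not llama.cpp's actual multi-process implementation:

```python
from collections import OrderedDict

class ModelRouter:
    """Sketch of on-demand loading with LRU eviction (illustrative only)."""

    def __init__(self, models_max=4):  # --models-max defaults to 4
        self.models_max = models_max
        self.loaded = OrderedDict()    # model name -> loaded handle (stubbed)

    def get(self, name):
        if name in self.loaded:
            # Cache hit: mark this model as most recently used.
            self.loaded.move_to_end(name)
            return self.loaded[name]
        # On-demand load: evict the least-recently-used model at capacity.
        if len(self.loaded) >= self.models_max:
            evicted, _ = self.loaded.popitem(last=False)
            print(f"unloading {evicted}")
        self.loaded[name] = f"<process for {name}>"
        return self.loaded[name]


router = ModelRouter(models_max=2)
router.get("llama-3")
router.get("qwen-2.5")
router.get("llama-3")   # refreshes llama-3's recency
router.get("mistral")   # evicts qwen-2.5, the least recently used
print(list(router.loaded))  # ['llama-3', 'mistral']
```

The upshot is that frequently requested models stay resident while idle ones are the first to be unloaded, keeping memory bounded by `--models-max`.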