[2602.14302] Floe: Federated Specialization for Real-Time LLM-SLM Inference
Summary
The paper presents Floe, a hybrid federated learning framework that enables low-latency, privacy-preserving inference by pairing a cloud-based large language model (LLM) with lightweight small language models (SLMs) on edge devices.
Why It Matters
Floe is significant because it addresses the growing need for efficient, privacy-preserving AI in real-time applications. By leveraging federated learning, it enables personalized model fine-tuning without user data ever leaving the device, making it relevant for industries that must balance user privacy with performance.
Key Takeaways
- Floe combines a cloud-based LLM with lightweight on-device SLMs for efficient inference.
- The framework enhances user privacy by keeping personal data on-device.
- It employs a heterogeneity-aware adaptation strategy for diverse hardware.
- Real-time coordination between edge and cloud models is achieved through logit-level fusion.
- Floe significantly reduces inference latency compared to existing methods.
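The logit-level fusion mentioned above can be illustrated with a minimal sketch: at each decoding step, the cloud LLM's logits and the edge SLM's logits are blended with a mixing weight before sampling. The weighting scheme and `alpha` parameter here are illustrative assumptions, not the paper's exact mechanism, and a shared vocabulary between the two models is assumed.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the vocabulary axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_logits(cloud_logits, edge_logits, alpha=0.5):
    """Blend cloud-LLM and edge-SLM logits for one decoding step.

    alpha weights the personalized edge model; (1 - alpha) weights the
    general-knowledge cloud model. Both arrays must share a vocabulary.
    (Hypothetical helper; the paper's fusion rule may differ.)
    """
    fused = alpha * edge_logits + (1.0 - alpha) * cloud_logits
    return softmax(fused)

# Toy vocabulary of 5 tokens.
cloud = np.array([2.0, 0.5, 0.1, -1.0, 0.0])  # general knowledge
edge = np.array([0.1, 3.0, 0.2, -0.5, 0.0])   # personalized on-device
probs = fuse_logits(cloud, edge, alpha=0.6)
next_token = int(np.argmax(probs))  # edge preference wins here
```

Because only logits cross the fusion boundary, neither side needs to expose its weights, which is consistent with the black-box cloud LLM the paper describes.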
Computer Science > Distributed, Parallel, and Cluster Computing
arXiv:2602.14302 (cs) [Submitted on 15 Feb 2026]
Title: Floe: Federated Specialization for Real-Time LLM-SLM Inference
Authors: Chunlin Tian, Kahou Tam, Yebo Wu, Shuaihang Zhong, Li Li, Nicholas D. Lane, Chengzhong Xu
Abstract: Deploying large language models (LLMs) in real-time systems remains challenging due to their substantial computational demands and privacy concerns. We propose Floe, a hybrid federated learning framework designed for latency-sensitive, resource-constrained environments. Floe combines a cloud-based black-box LLM with lightweight small language models (SLMs) on edge devices to enable low-latency, privacy-preserving inference. Personal data and fine-tuning remain on-device, while the cloud LLM contributes general knowledge without exposing proprietary weights. A heterogeneity-aware LoRA adaptation strategy enables efficient edge deployment across diverse hardware, and a logit-level fusion mechanism enables real-time coordination between edge and cloud models. Extensive experiments demonstrate that Floe enhances user privacy and personalization. Moreover, it significantly improves model performance and reduces inference latency on edge devices under real-time constraints compared with baseline approaches.
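The heterogeneity-aware LoRA adaptation in the abstract can be sketched as a per-device capacity check: each device picks the largest adapter rank whose parameters fit its memory budget. The function name, model dimensions, and sizing heuristic below are illustrative assumptions, not Floe's actual strategy.

```python
def pick_lora_rank(mem_budget_mb, hidden_dim=4096, n_layers=32,
                   n_matrices=4, ranks=(4, 8, 16, 32, 64),
                   bytes_per_param=2):
    """Pick the largest LoRA rank whose adapters fit a device's memory
    budget. (Hypothetical helper; Floe's selection rule may differ.)

    Each adapted weight matrix W (d x d) gains two low-rank factors
    A (r x d) and B (d x r), i.e. 2*r*d extra parameters per matrix.
    """
    best = None
    for r in ranks:
        params = 2 * r * hidden_dim * n_matrices * n_layers
        size_mb = params * bytes_per_param / (1024 ** 2)
        if size_mb <= mem_budget_mb:
            best = r  # ranks are ascending, so keep the largest that fits
    return best

# A phone with ~100 MB spare picks a smaller rank than an edge server.
phone_rank = pick_lora_rank(100)    # -> 32
server_rank = pick_lora_rank(2000)  # -> 64
```

Under this sketch, heterogeneous devices train adapters of different ranks against the same base SLM, which is one plausible way to realize "efficient edge deployment across diverse hardware."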