[P] TurboQuant for weights: near‑optimal 4‑bit LLM quantization with lossless 8‑bit residual – 3.2× memory savings
An adaptation of the recent TurboQuant algorithm (Zandieh et al., 2025) from KV‑cache quantization to model weight compression. It gives you a drop‑in replacement for `nn.Linear` with near‑optimal distortion.

Benchmarks (Qwen3.5‑0.8B, WikiText‑103)

| Config | Bits | PPL | Δ PPL | Compressed size |
|---|---|---|---|---|
| Baseline bf16 | 16 | 14.29 | – | 1,504 MB |
| 4+4 residual | 8 | 14.29 | 0.00 | 762 MB |
| 4‑bit (group=full) | 4 | 16.23 | +1.94 | 361 MB |
| 4‑bit (group=128) | 4 | 16.57 | +2.28 | 381 MB |

Check the GitHub repo for full docs, benchmarks, and Triton kernel...
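For readers unfamiliar with the "4+4 residual" row, here is a rough sketch of the general idea: quantize the weights to 4 bits, then quantize the leftover error with a second 4‑bit pass and add the two reconstructions. This is not the TurboQuant quantizer and not the repo's code. It swaps in plain symmetric absmax quantization per group, and every name here (`quantize_group`, `quantize`, `dequantize`, `mse`, the group size of 128, the Gaussian toy weights) is an assumption made for illustration only.

```python
import random

def quantize_group(values, bits=4):
    """Symmetric absmax quantization of one group to signed `bits`-bit codes.
    Illustrative stand-in only; TurboQuant uses a different quantizer."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit codes
    scale = max(abs(v) for v in values) / qmax or 1.0  # fall back to 1.0 for an all-zero group
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def quantize(weights, group_size, bits=4):
    """Split a flat weight list into groups and quantize each one separately."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]

def dequantize(groups):
    """Reconstruct a flat list of floats from (codes, scale) pairs."""
    return [c * scale for codes, scale in groups for c in codes]

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy "weight row": 1024 Gaussian values (hypothetical data, not model weights).
random.seed(0)
w = [random.gauss(0.0, 0.02) for _ in range(1024)]

# Pass 1: plain 4-bit quantization with group size 128.
base = quantize(w, group_size=128)
w4 = dequantize(base)

# Pass 2: quantize the residual (what pass 1 got wrong) with another 4 bits.
residual = [x - y for x, y in zip(w, w4)]
res = quantize(residual, group_size=128)
w44 = [x + y for x, y in zip(w4, dequantize(res))]

err4 = mse(w, w4)
err44 = mse(w, w44)
print(f"4-bit MSE:   {err4:.3e}")
print(f"4+4-bit MSE: {err44:.3e}")
```

The second pass works on a much smaller value range than the first, so its quantization step is much finer, and that is why the combined reconstruction error drops sharply. The cost is storing two sets of 4‑bit codes and scales, roughly 8 bits per weight, in line with the "Bits" column above.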