Quanto: a PyTorch quantization backend for Optimum

Quanto: a PyTorch quantization backend for Optimum

Hugging Face Blog 6 min read

About this article

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

Back to Articles Quanto: a PyTorch quantization backend for Optimum Published March 18, 2024 Update on GitHub Upvote 45 +39 David Corvoysier dacorvo Follow Younes B ybelkada Follow Marc Sun marcsun13 Follow Quantization is a technique to reduce the computational and memory costs of evaluating Deep Learning Models by representing their weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32). Reducing the number of bits means the resulting model requires less memory storage, which is crucial for deploying Large Language Models on consumer devices. It also enables specific optimizations for lower bitwidth datatypes, such as int8 or float8 matrix multiplications on CUDA devices. Many open-source libraries are available to quantize pytorch Deep Learning Models, each providing very powerful features, yet often restricted to specific model configurations and devices. Also, although they are based on the same design principles, they are unfortunately often incompatible with one another. Today, we are excited to introduce quanto, a PyTorch quantization backend for Optimum. It has been designed with versatility and simplicity in mind: all features are available in eager mode (works with non-traceable models), quantized models can be placed on any device (including CUDA and MPS), automatically inserts quantization and dequantization stubs, automatically inserts quantized functional operations, automatically ...

Originally published on February 15, 2026. Curated by AI News.

Related Articles

Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents
Open Source Ai

Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents

A Blog post by IBM Granite on Hugging Face

Hugging Face Blog · 7 min ·
Llms

My AI spent last night modifying its own codebase

I've been working on a local AI system called Apis that runs completely offline through Ollama. During a background run, Apis identified ...

Reddit - Artificial Intelligence · 1 min ·
Llms

Depth-first pruning seems to transfer from GPT-2 to Llama (unexpectedly well)

TL;DR: Removing the right transformer layers (instead of shrinking all layers) gives smaller, faster models with minimal quality loss — a...

Reddit - Artificial Intelligence · 1 min ·
[2603.16430] EngGPT2: Sovereign, Efficient and Open Intelligence
Llms

[2603.16430] EngGPT2: Sovereign, Efficient and Open Intelligence

Abstract page for arXiv paper 2603.16430: EngGPT2: Sovereign, Efficient and Open Intelligence

arXiv - AI · 4 min ·
More in Open Source Ai: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime