Block Sparse Matrices for Smaller and Faster Language Models
Published September 10, 2020
François Lagunas (madlag)

Saving space and time, one zero at a time

In previous blog posts we introduced sparse matrices and what they can do to improve neural networks. The basic assumption is that full dense layers are often overkill and can be pruned without a significant loss in precision. In some cases sparse linear layers can even improve precision and/or generalization.

The main issue is that currently available code supporting sparse algebra computation is severely lacking in efficiency, and we are still waiting for official PyTorch support. That's why we ran out of patience and took some time this summer to address this "lacuna". Today, we are excited to release the extension pytorch_block_sparse.

By itself, or even better combined with other methods like distillation and quantization, this library enables networks which are both smaller and faster, something Hugging Face considers crucial to let anybody use neural networks in production at low cost, and to improve the experience for the end user.

Usage

The provided BlockSparseLinear module is a drop-in replacement for torch.nn.Linear, and it is trivial to use it in your models:

```python
# from torch.nn import Linear
from pytorch_block_sparse import BlockSparseLinear

...

# self.fc = nn.Linear(1024, 256)
self.fc = BlockSparseLinear(1024, 256, density=0.1)
```

The extension also provides a ...
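To make the drop-in pattern above concrete, here is a minimal, self-contained sketch of a small classifier head that swaps one dense layer for a block-sparse one. The layer sizes, the density value, and the SparseClassifier module name are illustrative assumptions, not part of the library; only the BlockSparseLinear(in_features, out_features, density=...) call shown above comes from it.

```python
# Minimal sketch (illustrative sizes and names): one dense layer is
# replaced by a block-sparse layer that keeps only ~10% of the weight blocks.
import torch
import torch.nn as nn
from pytorch_block_sparse import BlockSparseLinear

class SparseClassifier(nn.Module):
    def __init__(self, hidden_size=1024, num_labels=2, density=0.1):
        super().__init__()
        # Drop-in replacement for nn.Linear(hidden_size, 256)
        self.fc = BlockSparseLinear(hidden_size, 256, density=density)
        self.out = nn.Linear(256, num_labels)

    def forward(self, x):
        return self.out(torch.relu(self.fc(x)))

# The block-sparse kernels run on the GPU, so the model and inputs go to CUDA.
model = SparseClassifier().cuda()
logits = model(torch.randn(8, 1024).cuda())
print(logits.shape)  # torch.Size([8, 2])
```

Training such a model is unchanged: the sparse layer exposes its parameters like any other nn.Module, so the usual optimizer and loss setup applies.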