Scaling-up BERT Inference on CPU (Part 1)
Published April 20, 2021 · Morgan Funtowicz (mfuntowicz)

1. Context and Motivations

Back in October 2019, my colleague Lysandre Debut published a comprehensive (at the time) inference performance benchmarking blog (1). Since then, 🤗 transformers (2) has welcomed a tremendous number of new architectures, and thousands of new models have been added to the 🤗 hub (3), which now counts more than 9,000 of them as of the first quarter of 2021.

As the NLP landscape keeps trending towards more and more BERT-like models being used in production, it remains challenging to efficiently deploy and run these architectures at scale. This is why we recently introduced our 🤗 Inference API: to let you focus on building value for your users and customers, rather than digging into all the highly technical aspects of running such models.

This blog post is the first part of a series which will cover most of the hardware and software optimizations to better leverage CPUs for BERT model inference. For this initial blog post, we will cover the hardware part:

- Setting up a baseline - Out of the box results
- Practical & technical considerations when leveraging modern CPUs for CPU-bound tasks
- Core count scaling - Does increasing the number of cores actually give better performance?
- Batch size scaling - Increasing throughput with multiple parallel & independent model instances

We decided to focus on the most famous ...
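To give a flavor of the core count scaling question above, here is a minimal sketch of the kind of measurement involved. It assumes PyTorch as the backend and uses `torch.set_num_threads` to control intra-op parallelism; the matmul stand-in workload and iteration counts are illustrative assumptions, not the article's actual benchmark setup.

```python
import time
import torch

def mean_latency_ms(workload, num_threads, iterations=10):
    """Run `workload` with the given intra-op thread count and
    return the mean latency in milliseconds."""
    torch.set_num_threads(num_threads)  # cores used per single inference
    workload()  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(iterations):
        workload()
    return (time.perf_counter() - start) / iterations * 1000.0

# Stand-in workload: a matmul sized like a BERT-base projection
# (hidden size 768). In a real benchmark this would be the model's
# forward pass on a tokenized batch.
x = torch.randn(128, 768)
w = torch.randn(768, 768)
workload = lambda: x @ w

for n in (1, 2, 4):
    print(f"{n} intra-op thread(s): {mean_latency_ms(workload, n):.2f} ms")
```

In practice, latency does not shrink linearly with core count: memory bandwidth and synchronization overheads eventually dominate, which is exactly what the core count scaling experiments in this series explore.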