Accelerate ND-Parallel: A Guide to Efficient Multi-GPU Training
Published August 8, 2025

*By Salman Mohammadi, Matej Sirovatka, Wing Lian, Marc Sun, and Dan Saunders*

Training large models across multiple GPUs can be challenging due to the complexities of different parallelism strategies. In Accelerate, together with Axolotl, we have integrated a quick and easy way to use any combination of parallelism strategies in your training script! Here is how to add it:

```python
from transformers import AutoModelForCausalLM

from accelerate import Accelerator
from accelerate.parallelism_config import ParallelismConfig
from accelerate.utils import FullyShardedDataParallelPlugin

# Configure your desired parallelisms here. This particular configuration
# requires at least 2 nodes with 8 GPUs each.
# Setting any parallelism degree to 1 disables it, i.e. dp_replicate_size=1 disables DP.
pc = ParallelismConfig(
    dp_shard_size=2,      # Fully Sharded Data Parallel degree
    dp_replicate_size=2,  # Data Parallel degree
    cp_size=2,            # Context Parallel degree
    tp_size=2,            # Tensor Parallel degree
)

fsdp_plugin = FullyShardedDataParallelPlugin(
    fsdp_version=2,
    auto_wrap_policy="transformer_based_wrap",
    transformer_cls_names_to_wrap=["LlamaDecoderLayer"],
    state_dict_type="SHARDED_STATE_DICT",
)

accelerator = Accelerator(
    parallelism_config=pc,
    fsdp_plugin=fsdp_plugin,
)
```
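The total number of GPUs you need is the product of the parallelism degrees, which is why the configuration above calls for at least 16 GPUs (2 nodes with 8 GPUs each). A quick sanity check, using the degrees from the snippet above:

```python
# Sanity check: the required world size is the product of all parallelism degrees.
dp_shard_size = 2      # FSDP (sharded data parallel) degree
dp_replicate_size = 2  # DP (replicated data parallel) degree
cp_size = 2            # CP (context parallel) degree
tp_size = 2            # TP (tensor parallel) degree

world_size = dp_shard_size * dp_replicate_size * cp_size * tp_size
print(world_size)  # 16 GPUs, e.g. 2 nodes x 8 GPUs each
```

If you change any degree, rescale the others (or your GPU count) so the product still matches the number of processes you launch.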