Accelerate ND-Parallel: A Guide to Efficient Multi-GPU Training
Published August 8, 2025

*By Salman Mohammadi, Matej Sirovatka, Wing Lian, Marc Sun, and Dan Saunders*

Training large models across multiple GPUs can be challenging due to the complexities of different parallelism strategies. In Accelerate, together with Axolotl, we have integrated a quick and easy way to use any combination of parallelism strategies in your training script! Here is how to add it:

```python
from transformers import AutoModelForCausalLM

from accelerate import Accelerator
from accelerate.parallelism_config import ParallelismConfig
from accelerate.utils import FullyShardedDataParallelPlugin

# Configure your desired parallelisms here. This particular configuration
# requires at least 2 nodes with 8 GPUs each.
# Setting any parallelism degree to 1 disables it, i.e. dp_replicate_size=1 disables DP.
pc = ParallelismConfig(
    dp_shard_size=2,      # Fully Sharded Data Parallel degree
    dp_replicate_size=2,  # Data Parallel degree
    cp_size=2,            # Context Parallel degree
    tp_size=2,            # Tensor Parallel degree
)

fsdp_plugin = FullyShardedDataParallelPlugin(
    fsdp_version=2,
    auto_wrap_policy="transformer_based_wrap",
    transformer_cls_names_to_wrap=["LlamaDecoderLayer"],
    state_dict_type="SHARDED_STATE_DICT",
)

accelerator = Accelerator(
    parallelism_config=pc,
    fsdp_plugin=fsdp_plugin,
)
```
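The total number of GPUs you need is the product of the parallelism degrees, which is why the configuration above calls for at least 16 GPUs (2 nodes with 8 GPUs each). A quick sanity check, using the degrees from the snippet above:

```python
# Sanity check: the required world size is the product of all parallelism degrees.
dp_shard_size = 2      # FSDP (sharded data parallel) degree
dp_replicate_size = 2  # DP (replicated data parallel) degree
cp_size = 2            # CP (context parallel) degree
tp_size = 2            # TP (tensor parallel) degree

world_size = dp_shard_size * dp_replicate_size * cp_size * tp_size
print(world_size)  # 16 GPUs, e.g. 2 nodes x 8 GPUs each
```

If you change any degree, rescale the others (or your GPU count) so the product still matches the number of processes you launch.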