Reproducing NVIDIA MLPerf v5.0 Training Scores for LLM Benchmarks

The previous post, NVIDIA Blackwell Delivers up to 2.6x Higher Performance in MLPerf Training v5.0, explains how the NVIDIA platform delivered the fastest time to train across all seven benchmarks in this latest MLPerf round. This post provides a guide to reproduce the performance of the NVIDIA MLPerf v5.0 submissions for Llama 2 70B LoRA fine-tuning and Llama 3.1 405B pretraining. The submission repositories also include README files for reproducing the scores; see, for example, those for the Llama 2 70B LoRA fine-tuning benchmark and the Llama 3.1 405B benchmark.

Prerequisites

Running NVIDIA benchmarks requires your system to have the following:

  • Container preparation, dataset/checkpoint download and preprocessing
    • Docker
    • A Hugging Face access token (for dataset/checkpoint download)
    • At least 2.5 TB of disk space for Llama 3.1 405B pretraining, or 300 GB for Llama 2 70B LoRA fine-tuning
  • Hardware requirements
    • Llama 2 70B LoRA: An NVIDIA DGX B200 or NVIDIA GB200 NVL72 system, or multiple GB200 NVL72 systems connected with InfiniBand for scales larger than 72 GPUs. The smallest NVIDIA submission for this benchmark is eight GPUs.
    • Llama 3.1 405B: At least four GB200 NVL72 systems connected with InfiniBand. The smallest NVIDIA submission for this benchmark is 256 GPUs. 

Cluster setup

Running NVIDIA MLPerf Training benchmarks requires:

  • Environment based on Slurm, Pyxis, and Enroot
  • Networking with NVIDIA NVLink and InfiniBand
  • Fast local storage set up in RAID0 configuration to minimize data loading bottlenecks

NVIDIA submission clusters do not support running workloads with Docker. The clusters are governed by the NVIDIA Base Command Manager (BCM). Follow the official instructions to properly set up a BCM SLURM cluster.

After a proper setup, you should be able to log in to the head node and access SLURM commands (sinfo, squeue, srun, sbatch) to launch jobs on the compute nodes.
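Before launching the benchmarks themselves, it is worth confirming that Slurm, Pyxis/Enroot, and the GPUs are reachable from the head node. A minimal sanity check, assuming your cluster exposes a default partition and that the registry/image/tag placeholders are replaced with a container you can pull, might look like:

# list partitions and node states
sinfo

# check that Slurm can schedule a job and the compute node sees its GPUs
srun -N1 --ntasks-per-node=1 nvidia-smi

# check that Pyxis/Enroot can import and run a container (registry/image/tag are placeholders)
srun -N1 --container-image=<registry>#<image>:<tag> nvidia-smi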

Running benchmarks

The steps necessary to start benchmarking any model include the following:

  1. Build a Docker container.
  2. Run the container to download and process the dataset and the checkpoint. This step can be done on any machine with Docker, whether it is part of the cluster or not, and generally does not require a GPU. Make sure the resulting data is accessible to the compute nodes: preferably stored locally on each node, or otherwise on a fast (parallel) file system.
  3. Launch the training and parse the logs.

Llama 2 70B LoRA

To run benchmarks for Llama 2 70B LoRA, follow the instructions in this section.

Build the container

  • Clone the mlcommons/training_results_v5.0 GitHub repo.
  • cd NVIDIA/benchmarks/llama2_70b_lora/implementations/tyche_ngpu72_ngc25.04_nemo.
  • Run docker build -t mlperf-nvidia:llama2_70b_lora-pyt . (the trailing dot is the build context). If you have a registry you would like to push the image to, add the registry name to the image name; see the consolidated sketch below.
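Put together, the build steps might look like the following sketch; the registry name is a placeholder you would substitute for your environment:

git clone https://github.com/mlcommons/training_results_v5.0.git
cd training_results_v5.0/NVIDIA/benchmarks/llama2_70b_lora/implementations/tyche_ngpu72_ngc25.04_nemo
docker build -t mlperf-nvidia:llama2_70b_lora-pyt .

# optional: tag and push so the compute nodes can pull the image
# docker tag mlperf-nvidia:llama2_70b_lora-pyt <registry>/mlperf-nvidia:llama2_70b_lora-pyt
# docker push <registry>/mlperf-nvidia:llama2_70b_lora-pyt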

Download the dataset and model

This benchmark uses the GovReport dataset and a Hugging Face checkpoint. Both the dataset and the checkpoint require preprocessing to be used by NVIDIA NeMo. You need a Hugging Face token to download the checkpoint.

To download and preprocess, do the following:

# create a directory where the data will be stored
mkdir </path/to/dataset>
# start the docker container 
docker run -it --rm --gpus all --network=host --ipc=host --volume </path/to/dataset>:/data mlperf-nvidia:llama2_70b_lora-pyt
# now you should be inside the container in the /workspace/ft-llm directory. run the download scripts
python scripts/download_dataset.py --data_dir /data/gov_report  # download dataset
python scripts/download_model.py --model_dir /data/model  # download preprocessed model checkpoint in NeMo format used for initialization; could take up to 30 minutes

If the model download fails, you might need to export your HF_TOKEN before calling the download_model.py script:

export HF_TOKEN=<your/huggingface/token>

After conversion you should see the following files in the /data directory:

/data
├── gov_report
│   ├── train.npy
│   └── validation.npy
└── model
    ├── context
    │   ├── io.json
    │   ├── model.yaml
    │   └── nemo_tokenizer
    └── weights
        ├── common.pt
        ├── metadata.json
        ├── module.decoder.final_layernorm._extra_state
        ├── module.decoder.final_layernorm.weight
        ├── module.decoder.layers.mlp.linear_fc1._extra_state
        ├── module.decoder.layers.mlp.linear_fc1.layer_norm_weight
        ├── module.decoder.layers.mlp.linear_fc1.weight
        ├── module.decoder.layers.mlp.linear_fc2._extra_state
        ├── module.decoder.layers.mlp.linear_fc2.weight
        ├── module.decoder.layers.self_attention.core_attention._extra_state
        ├── module.decoder.layers.self_attention.linear_proj._extra_state
        ├── module.decoder.layers.self_attention.linear_proj.weight
        ├── module.decoder.layers.self_attention.linear_qkv._extra_state
        ├── module.decoder.layers.self_attention.linear_qkv.layer_norm_weight
        ├── module.decoder.layers.self_attention.linear_qkv.weight
        ├── module.embedding.word_embeddings.weight
        └── module.output_layer.weight

You may exit the container at this point.
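A quick host-side check that the layout matches the tree above, assuming </path/to/dataset> is the directory you mounted into the container:

ls </path/to/dataset>/gov_report       # expect train.npy and validation.npy
ls </path/to/dataset>/model/context    # expect io.json, model.yaml, nemo_tokenizer
ls </path/to/dataset>/model/weights    # expect common.pt, metadata.json, and the module.* files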

Launch the benchmarking

NVIDIA uses SLURM to launch benchmarks on the compute nodes. Two files are used to facilitate the job launch process:

  1. A configuration file (config_*.sh) that describes the hyperparameters of the model, including the number of nodes, walltime, and so on. Organizing these in a single file per submission enables easy configuration of the workload to run at desired scale with optimal hyperparameters.
  2. A fixed run.sub file that contains srun commands to launch the training, passing all the hyperparameters from the config to the Python script.

Take a look at a typical config file, in this case config_GB200_18x4x1xtp1pp1cp8.sh. The name describes the size and type of the system:

  • GB200: Designed to run on a GB200 machine
  • 18×4: A system setting that can be decoded as NNODES x NGPUS
    • NNODES: Number of GB200 nodes
    • NGPUS: Number of GPUs per node
  • x1xtp1pp1cp8 is a parallelization schema
    • x1 is GradientAccumulation, here equal to 1, meaning no GA
    • TP1: No TensorParallel
    • PP1: No PipelineParallel
    • CP8: 8-way ContextParallel

This means a 72-GPU configuration will be used, running on a single GB200 NVL72 rack. The benchmark will run GA1TP1PP1CP8: Global batch size (GBS) = 9.
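As a quick sanity check on that number: the data-parallel size is the total GPU count divided by the product of the model-parallel degrees, and the global batch size is the data-parallel size times the gradient accumulation factor (with a micro batch size of 1, as in these configs):

TOTAL_GPUS=$((18 * 4))                # DGXNNODES x DGXNGPU = 72
MODEL_PARALLEL=$((1 * 1 * 8))         # TP x PP x CP = 8
DP=$((TOTAL_GPUS / MODEL_PARALLEL))   # 9 data-parallel replicas
GBS=$((DP * 1))                       # x MINIBS (gradient accumulation) = 9
echo "DP=${DP} GBS=${GBS}"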

Next, take a closer look at the content of the config file. The first part sources config_common.sh, which contains hyperparameters and optimization flags used by all configs; individual configs may override a flag from config_common.sh where needed. The config then sets the max steps, learning rate, gradient accumulation (MINIBS), and the parallelization schema described above.

#!/bin/bash
source $(dirname ${BASH_SOURCE[0]})/config_common.sh

# hyperparameters
export MAX_STEPS=800
export LR=0.0005
export MINIBS=1
export TP=1
export SP=0
export CP=8

Next is a section to add system-specific optimizations and override the common flags if needed.

export LAYER_CUDA_GRAPH=0
export MCORE_CUDA_GRAPH=1

Next are the system-level settings to pass to SLURM.

# system parameters
export VBOOST_VALUE=0
export DGXNNODES=18
export DGXNGPU=4
export WALLTIME_RUNANDTIME=10
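# total walltime request: 5 + NEXP * (WALLTIME_RUNANDTIME + 5) minutes, where NEXP is the number of experiments (default 1)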
export WALLTIME=$((5 + ${NEXP:-1} * ($WALLTIME_RUNANDTIME + 5)))

Each config is tuned for a particular system size, both in terms of optimization flags (which affect performance) and hyperparameters (which affect convergence). It is possible to modify a given config to run on a system of a different size, but that requires careful consideration and is not guaranteed to be as performant as the original config.

To start the actual training, tell the script where the dataset and model are, where the logfiles should be stored, and which container to use; then source the config file and run the sbatch command:

export DATADIR="</path/to/dataset>/gov_report"  # set your </path/to/dataset>
export MODEL="</path/to/dataset>/model"  # set your </path/to/dataset>
export LOGDIR="</path/to/output_logdir>"  # set the place where the output logs will be saved
export CONT=mlperf-nvidia:llama2_70b_lora-pyt
source config_<system>.sh  # select config and source it
sbatch -N $DGXNNODES -t $WALLTIME run.sub  # you may be required to set --account and --partition here
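Once the job is submitted, you can monitor it and follow the newest logfile. The exact logfile name is determined by run.sub, so this sketch simply picks the most recently modified file in $LOGDIR:

squeue -u $USER                           # confirm the job is queued or running
LATEST_LOG=$(ls -t "$LOGDIR" | head -1)   # most recently modified file in the log directory
tail -f "$LOGDIR/$LATEST_LOG"             # follow the training output and :::MLLOG lines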

Parse the logs

The logfile will contain a lot of output from the initialization and other info lines. The MLPerf-relevant lines start with the MLPerf logger prefix: :::MLLOG. There are a few interesting markers, as shown below.

Initialization starts:

:::MLLOG {"namespace": "", "time_ms": 1745066769306, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/workspace/ft-llm/train.py", "lineno": 327}}

Here, you can see that the Python script has started. Below, you can see the hyperparameters that you have selected using the config file, along with the default (immutable) ones.

After the initialization is completed, and the model is warmed up, the init_stop and run_start markers are printed:

:::MLLOG {"namespace": "", "time_ms": 1745066917960, "event_type": "INTERVAL_END", "key": "init_stop", "value": null, "metadata": {"file": "/usr/local/lib/python3.12/dist-packages/mlperf_common/callbacks/logging.py", "lineno": 83}}
:::MLLOG {"namespace": "", "time_ms": 1745066917961, "event_type": "INTERVAL_START", "key": "run_start", "value": null, "metadata": {"file": "/usr/local/lib/python3.12/dist-packages/mlperf_common/callbacks/logging.py", "lineno": 83}}

The run_start line marks the start of the timing clock. The following lines show the progress of the training, including evaluation. You can see that the evaluation loss is decreasing, marked by the eval_accuracy marker.

When the evaluation accuracy (evaluation loss in reality) drops below the threshold of 0.925, the training stops and the run_stop marker is printed:

:::MLLOG {"namespace": "", "time_ms": 1745067009412, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 0.92474365234375, "metadata": {"file": "/usr/local/lib/python3.12/dist-packages/mlperf_common/callbacks/logging.py", "lineno": 303, "samples_count": 3024}}
…
:::MLLOG {"namespace": "", "time_ms": 1745067009420, "event_type": "INTERVAL_END", "key": "run_stop", "value": null, "metadata": {"file": "/usr/local/lib/python3.12/dist-packages/mlperf_common/callbacks/logging.py", "lineno": 106, "samples_count": 3024, "status": "success"}}

If the benchmark fails to converge, the run_stop status is reported as 'aborted'. The MLPerf score is the difference between the run_stop and run_start timestamps. In this case:

Score [milliseconds] = (1745067009420 – 1745066917961) = 91459
Score [minutes] = 91459/60000 = 1.524

Keep in mind that because convergence is nondeterministic, the final score has to be derived from multiple runs: the number of samples needed to converge varies from run to run. Here, the benchmark converged at 3,024 samples, while on average it should converge at around 3,100-3,200 samples.
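As a small helper sketch, the score for a single logfile can be extracted with grep, sed, and awk; the field layout matches the :::MLLOG lines shown above, and the logfile path is a placeholder:

LOG=</path/to/logfile>  # set to the result logfile you want to score
START=$(grep '"key": "run_start"' "$LOG" | head -1 | sed 's/.*"time_ms": \([0-9]*\).*/\1/')
STOP=$(grep '"key": "run_stop"' "$LOG" | head -1 | sed 's/.*"time_ms": \([0-9]*\).*/\1/')
echo "score_ms=$((STOP - START))"
awk -v s="$START" -v e="$STOP" 'BEGIN { printf "score_min=%.3f\n", (e - s) / 60000 }'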

Llama 3.1 405B

To run benchmarks for Llama 3.1 405B, follow the instructions in this section.

Build the container

  • Clone the mlcommons/training_results_v5.0 GitHub repo.
  • cd NVIDIA/benchmarks/llama31_405b/implementations/tyche_ngpu512_ngc25.04_nemo.
  • Run docker build -t mlperf-nvidia:large_language_model-pyt . (the trailing dot is the build context). If you have a registry you would like to push the image to, feel free to add the registry name to the image name.

Download the dataset and model

For instructions on how to download the dataset and the tokenizer, see the Llama 3.1 405B reference README.

Environment variable PREPROCESSED_PATH points to the preprocessed dataset. Downloaded files should end with .idx and .bin.

c4-train.en_<number>_text_document, where <number> ranges from 0 to 7
c4-validation-91205-samples

Environment variable TOKENIZER_PATH points to the tokenizer used in this benchmark. Downloaded files include:

special_tokens_map.json
tokenizer.json
tokenizer.model
tokenizer.model.v1
tokenizer_config.json

You can clean up unnecessary files by running the cleanup script:

bash scripts/cleanup.sh

The final PREPROCESSED_PATH directory should contain:

c4-train.en_6_text_document.bin
c4-train.en_6_text_document.idx
c4-train.en_7_text_document.bin
c4-train.en_7_text_document.idx
c4-validation-91205-samples.en_text_document.bin
c4-validation-91205-samples.en_text_document.idx

Checkpoint

In the benchmarking region, training resumes from Meta's official Hugging Face checkpoint. Refer to the instructions in the reference README to download the BF16 model checkpoint. Before you proceed, make sure that your current working directory can hold more than 1.5 TB of data.

Run the download command in a directory of your choice, with its location stored in the LOAD_CHECKPOINTS_PATH environment variable. After the checkpoint is downloaded, you should find a 405b folder under that directory, holding context and weights subfolders:

<LOAD_CHECKPOINTS_PATH>
└── 405b
    ├── context
    │   ├── nemo_tokenizer
    │   │   ├── special_tokens_map.json
    │   │   ├── tokenizer_config.json
    │   │   └── tokenizer.json
    │   ├── io.json
    │   └── model.yaml
    └── weights
        ├── __0_0.distcp
        ├── __0_1.distcp
        ├── .metadata
        ├── common.pt
        └── metadata.json

Launch the benchmarking

NVIDIA uses SLURM to launch benchmarks on the compute nodes. As with Llama 2 70B LoRA, two files are used to facilitate the job launch process:

  1. A configuration file (config_*.sh) that describes the hyperparameters of the model, including the number of nodes, walltime, and so on. Selecting a proper file enables you to easily configure the workload to run at desired scale with optimal hyperparameters.
  2. A fixed run.sub file that contains srun commands to launch the training, passing all the hyperparameters from the config to the Python script.

Take a look at a typical config file, in this case config_GB200_128x4x112xtp4pp8cp2_cg_dplast.sh. The name describes the size and type of the system:

  • GB200: Designed to run on a GB200 machine
  • 128×4: A system setting that can be decoded as NNODES x NGPUS
    • NNODES: Number of GB200 nodes
    • NGPUS: Number of GPUs per node
  • x112xtp4pp8cp2 is a parallelization schema:
    • x112 is GradientAccumulation, here equal to 112
    • TP4: 4-way TensorParallel
    • PP8: 8-way PipelineParallel
    • CP2: 2-way ContextParallel

This means that a 512-GPU configuration will be used, running on eight GB200 NVL72 racks with 64 GPUs used from each rack. The benchmark will run GA112TP4PP8CP2: with 512 GPUs split across TP4 x PP8 x CP2 = 64-way model parallelism, there are eight data-parallel replicas, giving a global batch size (GBS) of 8 x 112 = 896.

Taking a closer look at the content of the config file, the first part sources the configs containing hyperparameters and optimization flags shared by all configs, by configs using Blackwell GPUs, and by configs using CUDA Graphs. In some cases, a flag from config_common.sh is overridden. The config then sets the gradient accumulation (MINIBS), parallelization schema, micro batch size, the model size (frozen), and max steps.

source $(dirname ${BASH_SOURCE[0]})/config_common.sh 
source $(dirname ${BASH_SOURCE[0]})/config_common_blackwell.sh 
source $(dirname ${BASH_SOURCE[0]})/config_common_cg.sh

export MINIBS=112
export TENSOR_MODEL_PARALLEL=4
export PIPELINE_MODEL_PARALLEL=8
export INTERLEAVED_PIPELINE=8
export CONTEXT_PARALLEL=2
export MICRO_BATCH_SIZE=1
export MODEL_SIZE="405b"
export MAX_STEPS=450

Next, set performance optimization flags:

export FP8_PARAM_GATHER=True
export TP_COMM_OVERLAP=True
export ASYM_PP_EMBED=True
export ASYM_PP_LOSS=True
export TP_PP_DP_MAPPING=True

# Binding
export BINDCMD="bindpcie --cpu=node"

Next are the system-level settings to pass to SLURM:

export DGXNNODES=128
export DGXNGPU=4
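# derive the DGXSYSTEM name from this config's filename (strip the config_ prefix and .sh suffix)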
export DGXSYSTEM=$(basename $(readlink -f ${BASH_SOURCE[0]}) | sed 's/^config_//' | sed 's/\.sh$//' )

export WALLTIME_RUNANDTIME=180
export WALLTIME=$((5 + ${NEXP:-1} * ($WALLTIME_RUNANDTIME + 5)))

To start the actual training, tell the script where the dataset and checkpoint are, where the logfiles should be stored, and which container to use; then source the config file and run the sbatch command:

export PREPROC_DATA="/path/to/your/preprocessed_c4"
export TOKENIZER="/path/to/your/tokenizer.model"
export LOAD_CHECKPOINTS_PATH="/path/to/your/downloaded/checkpoint"
export LOAD_CHECKPOINT="/load_checkpoints/405b"
export LOGDIR=</path/to/output/dir>  # set the place where the output logs will be saved
export CONT=mlperf-nvidia:large_language_model-pyt
source config_GB200_128x4x112xtp4pp8cp2_cg_dplast.sh  # select config and source it
sbatch -N ${DGXNNODES} --time=${WALLTIME} run.sub  # you may be required to set --account and --partition here

Parse the logs

The logfiles should largely resemble the Llama 2 70B LoRA logs. The target accuracy (evaluation loss) is 5.6. The training will stop once the target has been reached, and print the run_stop marker.
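To follow convergence toward the 5.6 target, one option is to extract the eval_accuracy values from the :::MLLOG lines; the logfile path is a placeholder:

grep '"key": "eval_accuracy"' </path/to/logfile> | sed 's/.*"value": \([0-9.]*\).*/\1/'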
