Quick Start

This section explains how to start using Tarantella to distributedly train an existing TensorFlow model.

Note

Tarantella is composed of two different components that need to be used together for data parallel training across multiple devices.

  1. A Python module that can be imported in your code and provides access to the Tarantella API.

  2. The runtime execution script tarantella, used to deploy the code in parallel.

Now, we will examine what changes have to be made to your code, and how to execute it on the command line with tarantella.

Code example: LeNet-5 on MNIST

After having built and installed Tarantella, we are ready to add distributed training support to an existing TensorFlow model. We will first illustrate all the necessary steps, using the well-known example of LeNet-5 on the MNIST dataset. Although this is not necessarily a good use case to take full advantage of Tarantella’s capabilities, it will allow you to simply copy-paste the code snippets and try them out, even on your laptop.

Let’s get started!

import tensorflow as tf
from tensorflow import keras

# Initialize Tarantella (before doing anything else)
import tarantella as tnt

# Skip function implementations for brevity
[...]

args = parse_args()

# Create Tarantella model from a `keras.Model`
model = tnt.Model(lenet5_model_generator())

# Compile Tarantella model (as with Keras)
model.compile(optimizer = keras.optimizers.SGD(learning_rate=args.learning_rate),
              loss = keras.losses.SparseCategoricalCrossentropy(),
              metrics = [keras.metrics.SparseCategoricalAccuracy()])

# Load MNIST dataset (as with Keras)
shuffle_seed = 42
(x_train, y_train), (x_val, y_val), (x_test, y_test) = \
      mnist_as_np_arrays(args.train_size, args.val_size, args.test_size)

train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(len(x_train), seed = shuffle_seed)
train_dataset = train_dataset.batch(args.batch_size)
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)

test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))
test_dataset = test_dataset.batch(args.batch_size)

# Train Tarantella model (as with Keras)
model.fit(train_dataset,
          epochs = args.number_epochs,
          verbose = 1)

# Evaluate Tarantella model (as with Keras)
model.evaluate(test_dataset, verbose = 1)

As you can see from the code snippet, you only need to add two lines of code to train LeNet-5 distributedly using Tarantella: the tarantella import and the tnt.Model wrapper. Let us go through the code in some more detail, in order to understand what is going on.

First we need to import the Tarantella library:

import tarantella as tnt

Importing the Tarantella package will initialize the library and set up the communication infrastructure. Note that this should be done before executing any other code.

Next, we need to wrap the keras.Model object, generated by lenet5_model_generator(), into a tnt.Model object:

model = tnt.Model(lenet5_model_generator())

That’s it!

All the necessary steps to distribute training and datasets will now be automatically handled by Tarantella. In particular, we still run model.compile on the new model to generate a compute graph, just as we would have done with a typical Keras model.
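For reference, a LeNet-5-style model generator could look as follows. This is only an illustrative sketch (the layer sizes and activations are assumptions, and it relies on the keras import from the snippet above); the actual lenet5_model_generator is omitted above for brevity and is part of the full example referenced below:

def lenet5_model_generator():
    # Illustrative LeNet-5-like architecture for 28x28x1 MNIST images
    # (hypothetical; see the full example for the exact model used)
    inputs = keras.Input(shape=(28, 28, 1))
    x = keras.layers.Conv2D(6, kernel_size=5, padding='same', activation='tanh')(inputs)
    x = keras.layers.AveragePooling2D(pool_size=2)(x)
    x = keras.layers.Conv2D(16, kernel_size=5, activation='tanh')(x)
    x = keras.layers.AveragePooling2D(pool_size=2)(x)
    x = keras.layers.Flatten()(x)
    x = keras.layers.Dense(120, activation='tanh')(x)
    x = keras.layers.Dense(84, activation='tanh')(x)
    outputs = keras.layers.Dense(10, activation='softmax')(x)
    return keras.Model(inputs=inputs, outputs=outputs)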

Next, we load the MNIST data for training and testing, and create tf.data.Datasets from it. Note that we batch the dataset for training. This guarantees that Tarantella is able to distribute the data correctly later on. Also note that the batch_size used here is the same as for the original model, that is, the global batch size. For details concerning local and global batch sizes, have a look here.
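To illustrate the relation between the two (with purely hypothetical numbers): in data parallel training, the global batch is split into one micro-batch per participating device.

# Hypothetical values, for illustration only
global_batch_size = 64                                # value passed to dataset.batch(...)
num_devices = 4                                       # e.g. number of GPUs used by tarantella
micro_batch_size = global_batch_size // num_devices   # 16 samples per device and step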

Now we are able to train our model using model.fit, in the same familiar way used by the standard Keras interface. Note, however, that Tarantella is taking care of the proper distribution of the train_dataset in the background. All the possibilities of how to feed datasets to Tarantella are explained in more detail below. Lastly, we can evaluate the final accuracy of our model on the test_dataset using model.evaluate.

A full version of the above example, which we will use to test and run Tarantella in the next section, can be found here.

Executing your model with tarantella

Next, let’s execute our model distributedly using tarantella on the command line.

Caution

When working on STYX, make sure to export the following environment variables before calling tarantella:

export LD_LIBRARY_PATH=/opt/GPI/lib64:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH=${GASPICXX_INSTALLATION_PATH}:${LD_LIBRARY_PATH}

The simplest way to run the model is by passing its Python script to tarantella:

tarantella -- model.py

This will execute our model distributedly on a single node, using all the available GPUs.

Caution

On STYX, you might run into some error messages when trying to use the GPUs. Follow these steps to run Tarantella correctly:

export CONDA_ENV_PATH=/path/to/your/conda/environment

mkdir -p ${CONDA_ENV_PATH}/lib/nvvm/libdevice
mv ${CONDA_ENV_PATH}/lib/libdevice.10.bc ${CONDA_ENV_PATH}/lib/nvvm/libdevice
export LD_LIBRARY_PATH=${CONDA_ENV_PATH}/lib:${LD_LIBRARY_PATH}

Always add the following -x flag to the tarantella command in the examples below:

tarantella -x XLA_FLAGS="--xla_gpu_cuda_data_dir=${CONDA_ENV_PATH}/lib" ...

We can also set command line parameters for the Python script model.py; these have to follow the name of the script:

tarantella -- model.py --batch_size=64 --learning_rate=0.01

On a single node, we can also explicitly specify the number of TensorFlow instances we want to use. This is done with the -n option:

tarantella -n 2 -- model.py --batch_size=64

Here, tarantella will try to execute distributedly on 2 GPUs. If there are not enough GPUs available, tarantella will print a WARNING and run 2 instances of TensorFlow on the CPU instead.

Next, let’s run tarantella on multiple nodes. In order to do this, we need to provide tarantella with a hostfile that contains the hostnames of the nodes we want to use:

$ cat hostfile
name_of_node_1
name_of_node_2

Note

On the STYX cluster, the list of hostnames that belong to a job can be generated by running the following command:

echo ${CARME_NODES} | uniq > ./hostfile

Caution

Create a job comprising multiple nodes to run Tarantella distributedly! Only nodes that belong to the same job can be accessed by the tarantella command.

With this hostfile we can run tarantella on multiple nodes:

tarantella --hostfile hostfile -- model.py

In this case, tarantella uses all the GPUs it can find. If no GPUs are available, tarantella will start one TensorFlow instance per node on the CPUs and will issue a WARNING message. GPU usage can also be switched off explicitly with the --no-gpu option.

As before, you can specify the number of GPUs/CPUs used per node explicitly with the option --n-per-node <number>:

tarantella --hostfile hostfile --n-per-node 2 --no-gpu -- model.py --batch_size=64

In this example, tarantella would execute 2 instances of TensorFlow on the CPUs of each node specified in hostfile.

In addition, tarantella can be run with different levels of logging output. The available log levels are INFO, WARNING, DEBUG, and ERROR; they can be set with --log-level:

tarantella --hostfile hostfile --log-level INFO -- model.py

To add your own environment variables, add -x ENV_VAR_NAME=VALUE to your tarantella command. This option will ensure the environment variable ENV_VAR_NAME is exported on all ranks before executing the code. An example is shown below:

tarantella --hostfile hostfile -x DATASET=/scratch/data TF_CPP_MIN_LOG_LEVEL=1 -- model.py

Both DATASET and TF_CPP_MIN_LOG_LEVEL will be exported as environment variables before executing model.py, in the same order as they were specified on the command line.

To terminate a running tarantella instance, execute another tarantella command that specifies the --cleanup option in addition to the name of the program you want to interrupt.

tarantella --hostfile hostfile --cleanup -- model.py

The above command will stop the model.py execution on all the nodes provided in hostfile. You can also enable the --force flag to immediately terminate unresponsive processes.
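For example:

tarantella --hostfile hostfile --cleanup --force -- model.py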

Note

Any running tarantella execution can be terminated by using Ctrl+c, regardless of whether it was started on a single node or on multiple hosts.

Using distributed datasets

This section explains how to use Tarantella’s distributed datasets.

The recommended way to provide your dataset to Tarantella is by passing a batched tf.data.Dataset to tnt.Model.fit. To do so, create a Dataset and apply the batch transformation to it using the (global) batch size. Do not, however, provide a value for batch_size in tnt.Model.fit, as this would lead to double batching and thus modified shapes for the input data.
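For instance (a sketch with hypothetical sizes, assuming model is a compiled tnt.Model and x_train, y_train are NumPy arrays as in the quick start example above):

# Batch once, with the global batch size; do not pass `batch_size` to fit
global_batch_size = 64
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.batch(global_batch_size)
model.fit(train_dataset, epochs = 1)      # note: no `batch_size` argument here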

Tarantella can distribute any tf.data.Dataset, regardless of the number and type of transformations that have been applied to it.

Note

When using the dataset.shuffle transformation without a seed, Tarantella will use a fixed default seed.

This guarantees that, when no seed is given, the input data is shuffled in the same way on all devices, which is necessary for consistency. When the user provides a seed, Tarantella will use that one instead.
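For example, to provide your own seed (hypothetical values, assuming train_dataset was created as above):

# With an explicit seed, all devices shuffle the input data identically using this seed
train_dataset = train_dataset.shuffle(buffer_size = 1000, seed = 1234)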

Tarantella does not support any other way to feed data to fit at the moment. In particular, NumPy arrays, TensorFlow tensors, and generators are not supported.

Tarantella’s automatic data distribution can be switched off by passing tnt_distribute_dataset = False in tnt.Model.fit, in which case Tarantella will issue an INFO message. If a validation dataset is passed to tnt.Model.fit, it should also be batched with the global batch size. You can similarly switch off its automatic micro-batching mechanism by setting tnt_distribute_validation_dataset = False.
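A sketch of how these flags are used (flag names as described above; it is assumed that model, train_dataset, and val_dataset exist and that both datasets are batched with the global batch size):

# Switch off Tarantella's automatic micro-batching for both datasets;
# the user is then responsible for distributing the data across devices
model.fit(train_dataset,
          validation_data = val_dataset,
          tnt_distribute_dataset = False,
          tnt_distribute_validation_dataset = False)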

Callbacks

Tarantella callbacks are discussed in detail in the Tarantella docs.