Hello Cyclic Weight Transfer

Cyclic Weight Transfer (CWT) is an alternative to FedAvg. CWT uses the Cyclic Controller to pass the model weights from one site to the next for repeated fine-tuning.
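To make the hand-off concrete, here is a minimal sketch of the idea in plain Python. It is not NVFlare code: the scalar "weight", the per-site targets, and the functions `local_update` and `cyclic_weight_transfer` are hypothetical stand-ins chosen only for illustration.

```python
# Toy illustration of cyclic weight transfer (not NVFlare code).
# Each hypothetical site holds a private target value; "training" moves the
# shared weight part of the way toward that target.


def local_update(weight, site_target, lr=0.5):
    """One simulated local fine-tuning step on a single site."""
    return weight + lr * (site_target - weight)


def cyclic_weight_transfer(initial_weight, site_targets, num_rounds):
    """Pass one weight from site to site, in order, for num_rounds cycles."""
    weight = initial_weight
    history = []
    for round_idx in range(num_rounds):
        for site_name, target in site_targets.items():
            # Each site starts from the weight the previous site produced.
            weight = local_update(weight, target)
            history.append((round_idx, site_name, weight))
    return weight, history


if __name__ == "__main__":
    final_weight, history = cyclic_weight_transfer(
        initial_weight=0.0,
        site_targets={"site-1": 1.0, "site-2": 3.0},
        num_rounds=3,
    )
    for round_idx, site_name, weight in history:
        print(f"round={round_idx} {site_name} weight={weight}")
```

In contrast to FedAvg, nothing is averaged in this sketch: there is a single copy of the weights, and each site continues from where the previous one stopped.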

Note

This example uses the MNIST handwritten digits dataset and loads the data within the trainer code.

Running TensorFlow with GPU

We recommend using the NVIDIA TensorFlow Docker container if you want to use a GPU. If you do not need a GPU, a Python virtual environment is sufficient.

Run NVIDIA TensorFlow container

Install the NVIDIA Container Toolkit first, then run the following command:

docker run --gpus=all -it --rm -v [path_to_NVFlare]:/NVFlare nvcr.io/nvidia/tensorflow:xx.xx-tf2-py3

Notes on running with GPUs

If you run the example on GPUs, note that TensorFlow by default tries to allocate all available GPU memory at startup. When multiple clients share the same GPU, prevent this by setting the following environment variables:

TF_FORCE_GPU_ALLOW_GROWTH=true TF_GPU_ALLOCATOR=cuda_malloc_async
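If you would rather set these variables from Python than from the shell, a minimal sketch could look like the following. The helper name `configure_tf_gpu_env` is hypothetical, and the sketch assumes it runs before TensorFlow initializes the GPU; calling it before `import tensorflow` is the safest option.

```python
import os


def configure_tf_gpu_env(env=os.environ):
    """Set the GPU memory variables unless the user already set them."""
    # setdefault keeps any value that was exported in the shell.
    env.setdefault("TF_FORCE_GPU_ALLOW_GROWTH", "true")
    env.setdefault("TF_GPU_ALLOCATOR", "cuda_malloc_async")
    return env


if __name__ == "__main__":
    configure_tf_gpu_env()
    print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"], os.environ["TF_GPU_ALLOCATOR"])
```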

Install NVFlare

For complete installation instructions, see Installation.

pip install nvflare

Get the example code from GitHub:

git clone https://github.com/NVIDIA/NVFlare.git
cd NVFlare
git switch <release branch>
cd examples/hello-world/hello-cyclic

Install the dependencies

pip install -r requirements.txt

Code Structure

hello-cyclic
|
|-- client.py           # client local training script
|-- model.py            # model definition
|-- job.py              # job recipe that defines client and server configurations
|-- prepare_data.sh     # script to download the data
|-- requirements.txt    # dependencies

Data

In this example, we use the MNIST dataset, which is provided by the TensorFlow Keras API.

Model

The model.py file defines a simple neural network using TensorFlow’s Keras API. The Net model is a sequential architecture designed for image classification, featuring:

Flatten Layer: Prepares input data for dense layers.
Dense Layer: 128 units with ReLU activation for non-linearity.
Dropout Layer: 20% dropout rate to mitigate overfitting.
Output Layer: 10 units for classifying MNIST digits.

Model (model.py)

from tensorflow.keras import layers, models


class Net(models.Sequential):
    def __init__(self, input_shape=(None, 28, 28)):
        super().__init__()
        self._input_shape = input_shape
        self.add(layers.Flatten())
        self.add(layers.Dense(128, activation="relu"))
        self.add(layers.Dropout(0.2))
        self.add(layers.Dense(10))
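As a quick sanity check on the size of this network, the trainable parameter count can be worked out by hand. The sketch below assumes the default 28 x 28 input and uses a hypothetical helper, `dense_param_count`; the Flatten and Dropout layers contribute no parameters.

```python
def dense_param_count(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


flatten_size = 28 * 28                                 # 784 values per image
hidden_params = dense_param_count(flatten_size, 128)   # Dense(128)
output_params = dense_param_count(128, 10)             # Dense(10)
total_params = hidden_params + output_params

if __name__ == "__main__":
    print(f"hidden={hidden_params} output={output_params} total={total_params}")
```

These are the numbers that `model.summary()` in the client script would be expected to report for the two dense layers.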

Client Code

The client code in client.py is responsible for local training. Notice that the training code is almost identical to standard TensorFlow training code. The only difference is a few added lines that receive the model from the server and send the updated model back.

Client Code (client.py)

import tensorflow as tf
from model import Net

import nvflare.client as flare

WEIGHTS_PATH = "./tf_model.weights.h5"


def main():
    flare.init()

    sys_info = flare.system_info()
    print(f"system info is: {sys_info}", flush=True)

    model = Net()
    model.build(input_shape=(None, 28, 28))
    model.compile(
        optimizer="adam", loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=["accuracy"]
    )
    model.summary()

    (train_images, train_labels), (
        test_images,
        test_labels,
    ) = tf.keras.datasets.mnist.load_data()
    train_images, test_images = (
        train_images / 255.0,
        test_images / 255.0,
    )

    # simulate separate datasets for each client by dividing MNIST dataset in half
    client_name = sys_info["site_name"]
    if client_name == "site-1":
        train_images = train_images[: len(train_images) // 2]
        train_labels = train_labels[: len(train_labels) // 2]
        test_images = test_images[: len(test_images) // 2]
        test_labels = test_labels[: len(test_labels) // 2]
    elif client_name == "site-2":
        train_images = train_images[len(train_images) // 2 :]
        train_labels = train_labels[len(train_labels) // 2 :]
        test_images = test_images[len(test_images) // 2 :]
        test_labels = test_labels[len(test_labels) // 2 :]

    while flare.is_running():
        input_model = flare.receive()
        print(f"current_round={input_model.current_round}")

        sys_info = flare.system_info()
        print(f"system info is: {sys_info}")

        for k, v in input_model.params.items():
            model.get_layer(k).set_weights(v)

        _, test_global_acc = model.evaluate(test_images, test_labels, verbose=2)
        print(
            f"Accuracy of the received model on round {input_model.current_round} on the test images: {test_global_acc * 100} %"
        )

        # training
        model.fit(train_images, train_labels, epochs=1, validation_data=(test_images, test_labels))

        print("Finished Training")

        model.save_weights(WEIGHTS_PATH)

        sys_info = flare.system_info()
        print(f"system info is: {sys_info}", flush=True)
        print(f"finished round: {input_model.current_round}", flush=True)

        output_model = flare.FLModel(
            params={layer.name: layer.get_weights() for layer in model.layers},
            params_type="FULL",
            metrics={"accuracy": test_global_acc},
            current_round=input_model.current_round,
        )

        flare.send(output_model)


if __name__ == "__main__":
    main()
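The client script above hard-codes a two-way split for site-1 and site-2. If you want to experiment with more clients, one possible generalization is sketched below. The helpers `site_slice` and `site_index_from_name` are hypothetical and assume the "site-<N>" naming shown above, with N starting at 1.

```python
def site_index_from_name(site_name):
    """Map 'site-1' -> 0, 'site-2' -> 1, assuming 'site-<N>' names."""
    return int(site_name.rsplit("-", 1)[1]) - 1


def site_slice(n_samples, site_index, n_sites):
    """Return the [start, stop) sample range owned by one site."""
    if not 0 <= site_index < n_sites:
        raise ValueError(f"site_index {site_index} out of range for {n_sites} sites")
    start = site_index * n_samples // n_sites
    stop = (site_index + 1) * n_samples // n_sites
    return start, stop


if __name__ == "__main__":
    # 60000 is the size of the MNIST training set.
    for name in ("site-1", "site-2"):
        start, stop = site_slice(60000, site_index_from_name(name), 2)
        print(f"{name}: samples [{start}, {stop})")
```

With two sites this reproduces the halves used in client.py; the resulting range can be applied to the image and label arrays as `train_images[start:stop]`.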

Server Code

In cyclic transfer, the server code is responsible for relaying model updates from one client to the next. We will directly use the default federated cyclic algorithm provided by NVFlare.
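To visualize what the server coordinates, the sketch below lists which site's weights each site would start from. It is an illustration only: `relay_schedule` is a hypothetical function, and it assumes a fixed site order in every round, which may not match how the NVFlare controller orders clients.

```python
def relay_schedule(site_names, num_rounds, start="server"):
    """List (round, weights_from, weights_to) hand-offs for a fixed order."""
    schedule = []
    previous = start
    for round_idx in range(num_rounds):
        for site in site_names:
            schedule.append((round_idx, previous, site))
            previous = site
    return schedule


if __name__ == "__main__":
    for round_idx, source, target in relay_schedule(["site-1", "site-2"], 3):
        print(f"round {round_idx}: {source} -> {target}")
```

With 2 clients and 3 rounds, the values used in this example's job, the sketch produces 6 hand-offs, the first of which carries the initial model from the server.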

Job Recipe

Job Recipe (job.py)

from model import Net

from nvflare.app_opt.tf.recipes.cyclic import CyclicRecipe
from nvflare.recipe import SimEnv

if __name__ == "__main__":
    n_clients = 2
    num_rounds = 3
    train_script = "client.py"

    recipe = CyclicRecipe(
        num_rounds=num_rounds,
        min_clients=n_clients,
        # Model can be specified as class instance or dict config:
        model=Net(),
        # Alternative: model={"class_path": "model.Net", "args": {}},
        # For pre-trained weights: initial_ckpt="/server/path/to/model.h5",
        train_script=train_script,
    )

    env = SimEnv(num_clients=n_clients)
    run = recipe.execute(env=env)
    print()
    print("Result can be found in :", run.get_result())
    print("Job Status is:", run.get_status())
    print()

Run the Experiment

Prepare the data first, then run the job:

bash ./prepare_data.sh
python job.py

Access the Logs and Results

You can find the running logs and results inside the simulator’s workspace:

$ ls "/tmp/nvflare/simulation/cyclic"
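If you prefer to inspect the workspace from Python, a small sketch using only the standard library is shown below. The helper `list_workspace` is hypothetical, and the sketch makes no assumption about which files the simulator writes; it simply lists whatever is present up to a given depth.

```python
from pathlib import Path


def list_workspace(root, max_depth=2):
    """Return relative paths under root, at most max_depth levels deep."""
    root = Path(root)
    if not root.is_dir():
        return []  # workspace not created yet
    entries = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if len(rel.parts) <= max_depth:
            entries.append(rel.as_posix())
    return entries


if __name__ == "__main__":
    for entry in list_workspace("/tmp/nvflare/simulation/cyclic"):
        print(entry)
```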