Quickstart (TensorFlow 2)¶
Before You Start¶
We recommend you first finish either the Quickstart (PyTorch) or the Quickstart (Numpy) exercise. Those guides go into more depth on the federated learning aspects of NVIDIA FLARE.
Here we assume you have already installed NVIDIA FLARE inside a Python virtual environment and have already cloned the repo.
Introduction¶
Through this exercise, you will integrate NVIDIA FLARE with the popular deep learning framework TensorFlow 2 and learn how to use NVIDIA FLARE to train a simple network on the MNIST dataset using the Scatter and Gather workflow. You will also be introduced to some new components and concepts, including filters, aggregators, and event handlers.
The design of this exercise consists of one server and two clients all having the same TensorFlow 2 model. The following steps compose one cycle of weight updates, called a round:
Clients are responsible for generating individual weight-updates for the model using their own MNIST dataset.
These updates are then sent to the server which will aggregate them to produce a model with new weights.
Finally, the server sends this updated version of the model back to each client.
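To make the aggregation step concrete, here is a minimal sketch of a plain (unweighted) average over two clients' weight dictionaries. It is plain NumPy, independent of NVIDIA FLARE, and the names average_weights, site_1, and site_2 are illustrative only:

import numpy as np

def average_weights(client_updates):
    # element-wise mean over a list of {layer_name: ndarray} dictionaries
    layer_names = client_updates[0].keys()
    return {name: np.mean([u[name] for u in client_updates], axis=0) for name in layer_names}

# two toy client updates for a single 2x2 layer
site_1 = {"dense": np.ones((2, 2))}
site_2 = {"dense": np.zeros((2, 2))}
print(average_weights([site_1, site_2])["dense"])  # every entry is 0.5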
For this exercise, we will be working with the hello-tf2 application in the examples folder.
Custom FL applications can contain the following folders:

custom: contains the custom components (tf2_net.py, trainer.py, filter.py, tf2_model_persistor.py)

config: contains the client and server configurations (config_fed_client.json, config_fed_server.json)

resources: contains the logger config (log.config)
Let’s get started. Since this exercise uses TensorFlow, first install the library inside your virtual environment:
(nvflare-env) $ python3 -m pip install tensorflow
NVIDIA FLARE Client¶
Neural Network¶
With all the required dependencies installed, you are ready to run a Federated Learning system with two clients and one server. Before you start, let’s see what a simplified MNIST network looks like.
import tensorflow as tf


class Net(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.flatten = tf.keras.layers.Flatten(input_shape=(28, 28))
        self.dense1 = tf.keras.layers.Dense(128, activation="relu")
        self.dropout = tf.keras.layers.Dropout(0.2)
        self.dense2 = tf.keras.layers.Dense(10)

    def call(self, x):
        x = self.flatten(x)
        x = self.dense1(x)
        x = self.dropout(x)
        x = self.dense2(x)
        return x
This Net class is the neural network that will be trained on the MNIST dataset. It is plain TensorFlow code with no NVIDIA FLARE dependency, so implement it in a file called tf2_net.py.
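As a quick sanity check, you can build the network in a standalone session before wiring it into NVIDIA FLARE. This snippet is not part of the exercise files; run it from the app's custom directory so tf2_net.py is importable:

import tensorflow as tf
from tf2_net import Net

model = Net()
# feeding a symbolic input builds the layers so the weights exist
_ = model(tf.keras.Input(shape=(28, 28)))
model.summary()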
Dataset & Setup¶
Now you have to implement the class Trainer, a subclass of NVIDIA FLARE's Executor, in a file called trainer.py.
Before you can start training, you need to set up your dataset. In this exercise, you can download it from the Internet via tf.keras's datasets module and split it in half to create a separate dataset for each client. Additionally, you must set up the optimizer, loss function, and any transforms to process the data. Since every step will be encapsulated in the SimpleTrainer class, let's put this preparation stage into one method, setup:
def setup(self, fl_ctx: FLContext):
    (self.train_images, self.train_labels), (
        self.test_images,
        self.test_labels,
    ) = tf.keras.datasets.mnist.load_data()
    self.train_images, self.test_images = (
        self.train_images / 255.0,
        self.test_images / 255.0,
    )

    # simulate separate datasets for each client by dividing MNIST dataset in half
    client_name = fl_ctx.get_identity_name()
    if client_name == "site-1":
        self.train_images = self.train_images[: len(self.train_images) // 2]
        self.train_labels = self.train_labels[: len(self.train_labels) // 2]
        self.test_images = self.test_images[: len(self.test_images) // 2]
        self.test_labels = self.test_labels[: len(self.test_labels) // 2]
    elif client_name == "site-2":
        self.train_images = self.train_images[len(self.train_images) // 2 :]
        self.train_labels = self.train_labels[len(self.train_labels) // 2 :]
        self.test_images = self.test_images[len(self.test_images) // 2 :]
        self.test_labels = self.test_labels[len(self.test_labels) // 2 :]

    model = Net()

    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
    _ = model(tf.keras.Input(shape=(28, 28)))
    self.var_list = [model.get_layer(index=index).name for index in range(len(model.get_weights()))]
    self.model = model
How can you ensure this setup method is called before the client receives the model from the server? The Trainer class is also an FLComponent, which receives an Event whenever NVIDIA FLARE enters or leaves a certain stage. In this case, the event EventType.START_RUN fires at exactly the right time. Because our trainer is a subclass of FLComponent, you can implement a handler that reacts to this event and calls the setup method:
def handle_event(self, event_type: str, fl_ctx: FLContext):
    if event_type == EventType.START_RUN:
        self.setup(fl_ctx)
Note
This is a new concept you haven't seen in the previous two exercises. The concepts of event and handler are very powerful because you are free to add your own logic that runs at different times and processes various events. The entire list of events fired by NVIDIA FLARE is shown at Event types.
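For illustration only, here is a hypothetical component that reacts to two of those events; the class name and log messages are made up, but the handle_event signature is the same one used by the trainer above:

from nvflare.apis.event_type import EventType
from nvflare.apis.fl_component import FLComponent
from nvflare.apis.fl_context import FLContext


class RunLogger(FLComponent):
    # logs when a run starts and ends
    def handle_event(self, event_type: str, fl_ctx: FLContext):
        if event_type == EventType.START_RUN:
            self.log_info(fl_ctx, "run started")
        elif event_type == EventType.END_RUN:
            self.log_info(fl_ctx, "run ended")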
You now have everything you need, so let's implement the last method, execute, which is called every time the client receives an updated model from the server with the Task we will configure.
Link NVIDIA FLARE with Local Train¶
Take a look at the following code:
def execute(
    self,
    task_name: str,
    shareable: Shareable,
    fl_ctx: FLContext,
    abort_signal: Signal,
) -> Shareable:
    """
    This function is an extended function from the superclass.
    As a supervised-learning-based trainer, the train function will run
    evaluate and train engines based on model weights from `shareable`.
    After finishing training, a new `Shareable` object will be submitted
    to the server for aggregation.

    Args:
        task_name: dispatched task
        shareable: the `Shareable` object received from the server.
        fl_ctx: the `FLContext` object received from the server.
        abort_signal: if triggered, the training will be aborted.

    Returns:
        a new `Shareable` object to be submitted to the server for aggregation.
    """
    # retrieve model weights downloaded from the server's shareable
    if abort_signal.triggered:
        return make_reply(ReturnCode.TASK_ABORTED)

    if task_name != "train":
        return make_reply(ReturnCode.TASK_UNKNOWN)

    dxo = from_shareable(shareable)
    model_weights = dxo.data

    # use previous round's client weights to replace excluded layers from server
    prev_weights = {
        self.model.get_layer(index=key).name: value for key, value in enumerate(self.model.get_weights())
    }

    ordered_model_weights = {key: model_weights.get(key) for key in prev_weights}
    for key in self.var_list:
        value = ordered_model_weights.get(key)
        if np.all(value == 0):
            ordered_model_weights[key] = prev_weights[key]

    # update local model weights with received weights
    self.model.set_weights(list(ordered_model_weights.values()))

    # adjust LR or other training-time info as needed,
    # e.g. via callbacks in the fit function
    self.model.fit(
        self.train_images,
        self.train_labels,
        epochs=self.epochs_per_round,
        validation_data=(self.test_images, self.test_labels),
    )

    # report updated weights in shareable
    weights = {self.model.get_layer(index=key).name: value for key, value in enumerate(self.model.get_weights())}
    dxo = DXO(data_kind=DataKind.WEIGHTS, data=weights)

    self.log_info(fl_ctx, "Local epochs finished. Returning shareable")
    new_shareable = dxo.to_shareable()
    return new_shareable
Every NVIDIA FLARE client receives the model weights from the server in the shareable. This exercise uses a simple ExcludeVars filter, so make sure to replace the excluded (zeroed-out) layer with weights from the client's previous training round:
ordered_model_weights = {key: model_weights.get(key) for key in prev_weights}
for key in self.var_list:
    value = ordered_model_weights.get(key)
    if np.all(value == 0):
        ordered_model_weights[key] = prev_weights[key]
Now update the local model with those received weights:
self.model.set_weights(list(ordered_model_weights.values()))
Then perform a simple self.model.fit so the client's model is trained on its own dataset:
self.model.fit(
    self.train_images,
    self.train_labels,
    epochs=self.epochs_per_round,
    validation_data=(self.test_images, self.test_labels),
)
After local training finishes, the execute method uses the newly trained weights to build a new DXO, converts it into a Shareable, and returns it to the NVIDIA FLARE server.
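If you want to see the DXO-to-Shareable round trip in isolation, here is a short standalone sketch; the layer name and shapes are arbitrary:

import numpy as np
from nvflare.apis.dxo import DXO, DataKind, from_shareable

weights = {"dense1": np.random.rand(128, 10)}
dxo = DXO(data_kind=DataKind.WEIGHTS, data=weights)

shareable = dxo.to_shareable()         # what execute() returns to the server
recovered = from_shareable(shareable)  # what the server (or a filter) unpacks
assert recovered.data_kind == DataKind.WEIGHTS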
NVIDIA FLARE Server & Application¶
Filter¶
A filter can be used for additional data processing in the Shareable, for both inbound and outbound data on the client and/or server.
For this exercise, we use a basic ExcludeVars filter to exclude the variable/layer flatten from the task result as it goes outbound from the client to the server. The excluded layer is replaced with all zeros of the same shape, which reduces the size of the compressed message and ensures that the clients' weights for this variable are not shared with the server.
import re

import numpy as np
from nvflare.apis.dxo import DXO, DataKind, from_shareable
from nvflare.apis.filter import Filter
from nvflare.apis.fl_context import FLContext
from nvflare.apis.shareable import Shareable


class ExcludeVars(Filter):
    """
    Exclude/remove variables from a Shareable.

    Args:
        exclude_vars: if not specified (None), nothing is excluded;
            if a list of variable/layer names, only the specified variables are excluded;
            if a string containing a regular expression (e.g. "conv"), only matched variables are excluded.
    """

    def __init__(self, exclude_vars=None):
        super().__init__()
        self.exclude_vars = exclude_vars
        self.skip = False
        if self.exclude_vars is not None:
            if not (
                isinstance(self.exclude_vars, list)
                or isinstance(self.exclude_vars, str)
            ):
                self.skip = True
                self.logger.debug(
                    "Need to provide a list of layer names or a string for regex matching"
                )
                return

            if isinstance(self.exclude_vars, list):
                for var in self.exclude_vars:
                    if not isinstance(var, str):
                        self.skip = True
                        self.logger.debug(
                            "exclude_vars needs to be a list of layer names to exclude."
                        )
                        return
                self.logger.debug(f"Excluding {self.exclude_vars} from shareable")
            elif isinstance(self.exclude_vars, str):
                self.exclude_vars = (
                    re.compile(self.exclude_vars) if self.exclude_vars else None
                )
                if self.exclude_vars is None:
                    self.skip = True
                self.logger.debug(
                    f'Excluding all layers based on regex matches with "{self.exclude_vars}"'
                )
        else:
            self.logger.debug("Not excluding anything")
            self.skip = True

    def process(self, shareable: Shareable, fl_ctx: FLContext) -> Shareable:

        self.log_debug(fl_ctx, "inside filter")
        if self.skip:
            return shareable

        try:
            dxo = from_shareable(shareable)
        except:
            self.log_exception(fl_ctx, "shareable data is not a valid DXO")
            return shareable

        assert isinstance(dxo, DXO)
        if dxo.data_kind not in (DataKind.WEIGHT_DIFF, DataKind.WEIGHTS):
            self.log_debug(fl_ctx, "I cannot handle {}".format(dxo.data_kind))
            return shareable

        if dxo.data is None:
            self.log_debug(fl_ctx, "no data to filter")
            return shareable

        weights = dxo.data

        # parse regex to determine which layers to exclude
        if isinstance(self.exclude_vars, re.Pattern):
            re_pattern = self.exclude_vars
            self.exclude_vars = []
            for var_name in weights.keys():
                if re_pattern.search(var_name):
                    self.exclude_vars.append(var_name)
            self.log_debug(fl_ctx, f"Regex found {self.exclude_vars} matching layers.")
            if len(self.exclude_vars) == 0:
                self.log_warning(
                    fl_ctx, f"No matching layers found with regex {re_pattern}"
                )

        # remove variables
        n_excluded = 0
        var_names = list(weights.keys())  # recast to list to use in the for loop
        n_vars = len(var_names)
        for var_name in var_names:
            if var_name in self.exclude_vars:
                self.log_debug(fl_ctx, f"Excluding {var_name}")
                weights[var_name] = np.zeros(weights[var_name].shape)
                n_excluded += 1
        self.log_debug(
            fl_ctx,
            f"Excluded {n_excluded} of {n_vars} variables. {len(weights.keys())} remaining.",
        )

        dxo.data = weights
        return dxo.update_shareable(shareable)
The filtering procedure occurs in the one required method, process, which receives and returns a Shareable. The parameters for what is excluded and the inbound/outbound option are all set in config_fed_client.json (shown below) and passed in through the constructor.
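The core of the exclusion logic can be tried outside of NVIDIA FLARE. This standalone sketch mimics the regex branch of process above on a toy weights dictionary:

import re

import numpy as np

weights = {"flatten": np.ones(4), "dense1": np.ones(4)}
pattern = re.compile("flatten")

# zero out every layer whose name matches the pattern
for name in list(weights):
    if pattern.search(name):
        weights[name] = np.zeros(weights[name].shape)

print(weights)  # "flatten" is all zeros, "dense1" is untouched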
Model Aggregator¶
The model aggregator is used by the server to aggregate the clients' models into one model within the Scatter and Gather workflow.
In this exercise, we perform a simple average over the two clients' weights with the AccumulateWeightedAggregator, which is configured in config_fed_server.json (shown below).
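Conceptually, a weighted aggregation scales each client's contribution, for example by its number of training steps, before summing; unlike the plain mean sketched in the introduction, clients that trained on more data count for more. This is a simplified sketch of the idea, not the actual AccumulateWeightedAggregator implementation:

import numpy as np

def weighted_average(updates, step_counts):
    # updates: list of {layer_name: ndarray}; step_counts: one weight per client
    total = float(sum(step_counts))
    names = updates[0].keys()
    return {
        name: sum(w * u[name] for w, u in zip(step_counts, updates)) / total
        for name in names
    }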
Model Persistor¶
The model persistor is used to load and save models on the server.
import json
import os
import pickle

import tensorflow as tf
from nvflare.apis.event_type import EventType
from nvflare.apis.fl_constant import FLContextKey
from nvflare.apis.fl_context import FLContext
from nvflare.app_common.abstract.model import ModelLearnable, make_model_learnable
from nvflare.app_common.abstract.model_persistor import ModelPersistor
from nvflare.app_common.app_constant import AppConstants
from tf2_net import Net


class TF2ModelPersistor(ModelPersistor):
    def __init__(self, save_name="tf2_model.pkl"):
        super().__init__()
        self.save_name = save_name

    def _initialize(self, fl_ctx: FLContext):
        # get save path from FLContext
        app_root = fl_ctx.get_prop(FLContextKey.APP_ROOT)
        env = None
        run_args = fl_ctx.get_prop(FLContextKey.ARGS)
        if run_args:
            env_config_file_name = os.path.join(app_root, run_args.env)
            if os.path.exists(env_config_file_name):
                try:
                    with open(env_config_file_name) as file:
                        env = json.load(file)
                except:
                    self.system_panic(
                        reason="error opening env config file {}".format(env_config_file_name), fl_ctx=fl_ctx
                    )
                    return

        if env is not None:
            if env.get("APP_CKPT_DIR", None):
                fl_ctx.set_prop(AppConstants.LOG_DIR, env["APP_CKPT_DIR"], private=True, sticky=True)
            if env.get("APP_CKPT") is not None:
                fl_ctx.set_prop(
                    AppConstants.CKPT_PRELOAD_PATH,
                    env["APP_CKPT"],
                    private=True,
                    sticky=True,
                )

        log_dir = fl_ctx.get_prop(AppConstants.LOG_DIR)
        if log_dir:
            self.log_dir = os.path.join(app_root, log_dir)
        else:
            self.log_dir = app_root
        self._pkl_save_path = os.path.join(self.log_dir, self.save_name)
        if not os.path.exists(self.log_dir):
            os.makedirs(self.log_dir)

        fl_ctx.sync_sticky()

    def load_model(self, fl_ctx: FLContext) -> ModelLearnable:
        """
        Initialize and load the model.

        Args:
            fl_ctx: FLContext

        Returns:
            ModelLearnable object
        """

        if os.path.exists(self._pkl_save_path):
            self.logger.info("Loading server weights")
            with open(self._pkl_save_path, "rb") as f:
                model_learnable = pickle.load(f)
        else:
            self.logger.info("Initializing server model")
            network = Net()
            loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
            network.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
            _ = network(tf.keras.Input(shape=(28, 28)))
            var_dict = {network.get_layer(index=key).name: value for key, value in enumerate(network.get_weights())}
            model_learnable = make_model_learnable(var_dict, dict())
        return model_learnable

    def handle_event(self, event: str, fl_ctx: FLContext):
        if event == EventType.START_RUN:
            self._initialize(fl_ctx)

    def save_model(self, model_learnable: ModelLearnable, fl_ctx: FLContext):
        """
        Persist the ModelLearnable object.

        Args:
            model_learnable: ModelLearnable object
            fl_ctx: FLContext
        """
        model_learnable_info = {k: str(type(v)) for k, v in model_learnable.items()}
        self.logger.info(f"Saving aggregated server weights: \n {model_learnable_info}")
        with open(self._pkl_save_path, "wb") as f:
            pickle.dump(model_learnable, f)
In this exercise, we simply serialize the model weights dictionary with pickle and save it to a log directory computed in _initialize. The file is saved on the FL server, and the weights file name is defined in config_fed_server.json.
Depending on the framework and tools, the method of saving the model may vary.
FLContext is used throughout these functions to provide various useful FL-related information. You can find more details in the documentation.
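After a run completes, you can inspect the pickled weights from a plain Python session. The path below is an assumption based on the run_1 workspace described later and the save_name configured in config_fed_server.json; adjust it to wherever your server workspace lives:

import pickle

# hypothetical path inside the server's run_1 workspace
with open("run_1/app_server/tf2weights.pickle", "rb") as f:
    model_learnable = pickle.load(f)

print(type(model_learnable), list(model_learnable.keys()))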
Application Configuration¶
Finally, inside the config folder there are two files, config_fed_client.json and config_fed_server.json. Here is config_fed_server.json:
{
  "format_version": 2,
  "server": {
    "heart_beat_timeout": 600
  },
  "task_data_filters": [],
  "task_result_filters": [],
  "components": [
    {
      "id": "persistor",
      "path": "tf2_model_persistor.TF2ModelPersistor",
      "args": {
        "save_name": "tf2weights.pickle"
      }
    },
    {
      "id": "shareable_generator",
      "path": "nvflare.app_common.shareablegenerators.full_model_shareable_generator.FullModelShareableGenerator",
      "args": {}
    },
    {
      "id": "aggregator",
      "path": "nvflare.app_common.aggregators.accumulate_model_aggregator.AccumulateWeightedAggregator",
      "args": {
        "expected_data_kind": "WEIGHTS"
      }
    }
  ],
  "workflows": [
    {
      "id": "scatter_gather_ctl",
      "path": "nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather",
      "args": {
        "min_clients": 1,
        "num_rounds": 3,
        "start_round": 0,
        "wait_time_after_min_received": 10,
        "aggregator_id": "aggregator",
        "persistor_id": "persistor",
        "shareable_generator_id": "shareable_generator",
        "train_task_name": "train",
        "train_timeout": 0
      }
    }
  ]
}
Note how the ScatterAndGather workflow is configured to use the included aggregator AccumulateWeightedAggregator and shareable_generator FullModelShareableGenerator, referenced with full Python module paths. The persistor is configured to use TF2ModelPersistor from the custom directory of this hello-tf2 app. Here is config_fed_client.json:
{
  "format_version": 2,
  "executors": [
    {
      "tasks": [
        "train"
      ],
      "executor": {
        "path": "trainer.SimpleTrainer",
        "args": {
          "epochs_per_round": 2
        }
      }
    }
  ],
  "task_result_filters": [
    {
      "tasks": [
        "train"
      ],
      "filters": [
        {
          "path": "filter.ExcludeVars",
          "args": {
            "exclude_vars": [
              "flatten"
            ]
          }
        }
      ]
    }
  ],
  "task_data_filters": []
}
Here, executors is configured with the Trainer implementation SimpleTrainer. Also, we set up filter.ExcludeVars as a task_result_filter and pass in ["flatten"] as the argument. Both of these are configured for the only Task that will be broadcast in the Scatter and Gather workflow, "train".
Train the Model, Federated!¶
Now you can use admin commands to upload, deploy, and start this example app. To do this on a proof of concept local FL system, follow the sections Setting Up the Application Environment in POC Mode and Starting the Application Environment in POC Mode if you have not already.
Running the FL System¶
With the admin client command prompt successfully connected and logged in, enter the commands below in order. Pay close attention to what happens in each of the four terminals. You can see how the admin controls the server and clients with each command.
> upload_app hello-tf2
Uploads the application from the admin client to the server’s staging area.
> set_run_number 1
Creates a run directory in the workspace for the run_number on the server and all clients. The run directory allows for the isolation of different runs so the information in one particular run does not interfere with other runs.
> deploy_app hello-tf2 all
This will make the hello-tf2 application the active one in the run_number workspace. In this exercise, after the above two commands, the server and all the clients know that the hello-tf2 application will reside in the run_1 workspace.
> start_app all
This start_app command instructs the NVIDIA FLARE server and clients to start training with the hello-tf2 application in the run_1 workspace.
From time to time, you can issue check_status server in the admin client to check the overall training progress.
You should now see the training progress in the very first terminal (the one that started the server).
Once the FL run is complete and the server has successfully aggregated the clients' results after all the rounds, run the following commands in the fl_admin client to shut down the system (entering admin when prompted for the username):
> shutdown client
> shutdown server
> bye
To stop all processes, run ./stop_fl.sh.
All artifacts from the FL run can be found in the server run folder you created with set_run_number. In this exercise, the folder is run_1.
Congratulations!
You’ve successfully built and run a federated learning system using TensorFlow 2.
The full source code for this exercise can be found in examples/hello-tf2.