Hello Scatter and Gather

Before You Start

Before jumping into this guide, make sure you have an environment with NVIDIA FLARE installed.

You can follow Getting Started for the general concept of setting up a Python virtual environment (the recommended environment) and how to install NVIDIA FLARE.

Introduction

This tutorial is meant solely to demonstrate how the NVIDIA FLARE system works, without introducing any actual deep learning concepts.

Through this exercise, you will learn how to use NVIDIA FLARE with numpy to perform basic computations across two clients with the included Scatter and Gather workflow, which broadcasts the training tasks and then aggregates the results that come back.

Due to the simplified weights, you will be able to clearly see and understand the results of the FL aggregation and the model persistor process.

The setup for this exercise consists of one server and two clients. The server-side model starts with the weights [[1, 2, 3], [4, 5, 6], [7, 8, 9]].

The following steps compose one cycle of weight updates, called a round:

  1. Clients are responsible for adding a delta to the weights to calculate new weights for the model.

  2. These updates are then sent to the server which will aggregate them to produce a model with new weights.

  3. Finally, the server sends this updated version of the model back to each client, so the clients can continue to calculate the next model weights in future rounds.
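
To make this concrete, here is a minimal standalone numpy sketch of the round described above. It is not FLARE code, and the delta value of 1 and the simple averaging are only illustrative:

import numpy as np

# Server's starting model weights
server_weights = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)

# Step 1: each client adds its delta to the weights it received
delta = 1.0  # illustrative value; the real trainer uses its own configured diff
client_1_result = server_weights + delta
client_2_result = server_weights + delta

# Step 2: the server aggregates the client results (here, a simple average)
aggregated = (client_1_result + client_2_result) / 2

# Step 3: the aggregated model becomes the starting point for the next round
server_weights = aggregated
print(server_weights)  # [[2. 3. 4.] [5. 6. 7.] [8. 9. 10.]]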

For this exercise, we will be working with the hello-numpy-sag application in the examples folder. Custom FL applications can contain the folders:

  1. custom: contains any custom components (custom Python code)

  2. config: contains client and server configurations (config_fed_client.json, config_fed_server.json)

  3. resources: contains the logger config (log.config)

Let’s get started. First clone the repo, if you haven’t already:

$ git clone https://github.com/NVIDIA/NVFlare.git

Remember to activate your NVIDIA FLARE Python virtual environment from the installation guide. Ensure numpy is installed.

(nvflare-env) $ python3 -m pip install numpy

Now that you have all your dependencies installed, let’s implement the federated learning system.

NVIDIA FLARE Client

You will first notice that the hello-numpy-sag application does not contain a custom folder.

The code for the client and server components has been implemented in the nvflare/app_common/np folder of the NVFlare code tree.

These files, for example the trainer in np_trainer.py, can be copied into a custom folder in the hello-numpy-sag application as custom_trainer.py and modified to perform additional tasks.

The config_fed_client.json configuration discussed below would then be modified to point to this custom code by providing the custom path.

For example, replacing nvflare.app_common.np.np_trainer.NPTrainer with custom_trainer.NPTrainer.

In the np_trainer.py trainer, we first import nvflare and numpy. We then implement the execute function so that each client performs a simple addition of a diff, representing one round of training.

The server sends either the initial weights or any stored weights to each of the clients through the Shareable object passed into execute(). Each client retrieves the model data from the DXO (see Data Exchange Object (DXO)) obtained from the Shareable, adds the diff to it, and creates a new Shareable containing the new weights, again packaged in a DXO.
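
The following simplified sketch shows the shape of that client-side logic. It is condensed from np_trainer.py and omits the task-name checks, abort-signal handling, and error handling of the real trainer; the class name, the delta value, and the "numpy_key" string (which mirrors NPConstants.NUMPY_KEY) are illustrative:

import numpy as np

from nvflare.apis.dxo import DXO, DataKind, from_shareable
from nvflare.apis.executor import Executor
from nvflare.apis.fl_context import FLContext
from nvflare.apis.shareable import Shareable
from nvflare.apis.signal import Signal


class SimpleNPTrainer(Executor):
    def __init__(self, delta=1.0):
        super().__init__()
        self.delta = delta

    def execute(self, task_name: str, shareable: Shareable, fl_ctx: FLContext, abort_signal: Signal) -> Shareable:
        # Unpack the DXO from the incoming Shareable to get the current model weights
        dxo = from_shareable(shareable)
        weights = np.asarray(dxo.data["numpy_key"], dtype=np.float32)

        # "Training": add the diff to the weights
        new_weights = weights + self.delta

        # Package the new weights into a DXO and return them as a Shareable
        outgoing_dxo = DXO(data_kind=DataKind.WEIGHTS, data={"numpy_key": new_weights})
        return outgoing_dxo.to_shareable()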

In a real federated learning training scenario, each client does its training independently on its own dataset. As such, the weights returned to the server will likely be different for each of the clients.

The FL server can aggregate (in this case average) the clients’ results to produce the aggregated model.

You can learn more about Shareable and FLContext in the programming guide.

NVIDIA FLARE Server & Application

Model Persistor

np_model_persistor.py
# Copyright (c) 2022, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

import numpy as np

from nvflare.apis.fl_constant import FLContextKey
from nvflare.apis.fl_context import FLContext
from nvflare.app_common.abstract.model import ModelLearnable, ModelLearnableKey, make_model_learnable
from nvflare.app_common.abstract.model_persistor import ModelPersistor
from nvflare.security.logging import secure_format_exception

from .constants import NPConstants


def _get_run_dir(fl_ctx: FLContext):
    engine = fl_ctx.get_engine()
    if engine is None:
        raise RuntimeError("engine is missing in fl_ctx.")
    job_id = fl_ctx.get_prop(FLContextKey.CURRENT_RUN)
    if job_id is None:
        raise RuntimeError("job_id is missing in fl_ctx.")
    run_dir = engine.get_workspace().get_run_dir(job_id)
    return run_dir


class NPModelPersistor(ModelPersistor):
    def __init__(self, model_dir="models", model_name="server.npy"):
        super().__init__()

        self.model_dir = model_dir
        self.model_name = model_name

        # This is the default model that will be used if no local model is provided.
        self.default_data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)

    def load_model(self, fl_ctx: FLContext) -> ModelLearnable:
        run_dir = _get_run_dir(fl_ctx)
        model_path = os.path.join(run_dir, self.model_dir, self.model_name)
        try:
            # try loading previous model
            data = np.load(model_path)
        except Exception as e:
            self.log_info(
                fl_ctx,
                f"Unable to load model from {model_path}: {secure_format_exception(e)}. Using default data instead.",
                fire_event=False,
            )
            data = self.default_data.copy()

        model_learnable = make_model_learnable(weights={NPConstants.NUMPY_KEY: data}, meta_props={})

        self.log_info(fl_ctx, f"Loaded initial model: {model_learnable[ModelLearnableKey.WEIGHTS]}")
        return model_learnable

    def save_model(self, model_learnable: ModelLearnable, fl_ctx: FLContext):
        run_dir = _get_run_dir(fl_ctx)
        model_root_dir = os.path.join(run_dir, self.model_dir)
        if not os.path.exists(model_root_dir):
            os.makedirs(model_root_dir)

        model_path = os.path.join(model_root_dir, self.model_name)
        np.save(model_path, model_learnable[ModelLearnableKey.WEIGHTS][NPConstants.NUMPY_KEY])
        self.log_info(fl_ctx, f"Saved numpy model to: {model_path}")

The model persistor is used to load and save models on the server. Here, the model refers to weights packaged into a ModelLearnable object.

Internally, DXO is used to manage data after FullModelShareableGenerator converts Learnable to Shareable on the FL server.

The DXO helps all of the FL components agree on the format.

In this exercise, we can simply save the model as a binary “.npy” file. Depending on the frameworks and tools, the methods of saving the model may vary.
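
To illustrate how a DXO carries the weights between components, here is a small standalone sketch of the round trip (an illustration only, not the shareable generator's actual code; the "numpy_key" string mirrors NPConstants.NUMPY_KEY):

import numpy as np

from nvflare.apis.dxo import DXO, DataKind, from_shareable

# Wrap the weights in a DXO so all FL components agree on the format
weights = {"numpy_key": np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)}
dxo = DXO(data_kind=DataKind.WEIGHTS, data=weights)

# Convert to a Shareable for transmission between the server and clients
shareable = dxo.to_shareable()

# On the receiving side, recover the DXO and the weights from the Shareable
received_dxo = from_shareable(shareable)
print(received_dxo.data["numpy_key"])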

Application Configuration

Inside the config folder there are two files, config_fed_client.json and config_fed_server.json. For now, the default configurations are sufficient.

config_fed_server.json
{
  "format_version": 2,
  "server": {
    "heart_beat_timeout": 600
  },
  "task_data_filters": [],
  "task_result_filters": [],
  "components": [
    {
      "id": "persistor",
      "path": "nvflare.app_common.np.np_model_persistor.NPModelPersistor",
      "args": {}
    },
    {
      "id": "shareable_generator",
      "path": "nvflare.app_common.shareablegenerators.full_model_shareable_generator.FullModelShareableGenerator",
      "args": {}
    },
    {
      "id": "aggregator",
      "path": "nvflare.app_common.aggregators.intime_accumulate_model_aggregator.InTimeAccumulateWeightedAggregator",
      "args": {
        "expected_data_kind": "WEIGHTS",
        "aggregation_weights": {
          "site-1": 1.0,
          "site-2": 1.0
        }
      }
    }
  ],
  "workflows": [
    {
      "id": "scatter_and_gather",
      "path": "nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather",
      "args": {
        "min_clients": 2,
        "num_rounds": 3,
        "start_round": 0,
        "wait_time_after_min_received": 10,
        "aggregator_id": "aggregator",
        "persistor_id": "persistor",
        "shareable_generator_id": "shareable_generator",
        "train_task_name": "train",
        "train_timeout": 6000
      }
    }
  ]
}
config_fed_client.json
{
  "format_version": 2,
  "executors": [
    {
      "tasks": [
        "train"
      ],
      "executor": {
        "path": "nvflare.app_common.np.np_trainer.NPTrainer",
        "args": {}
      }
    }
  ],
  "task_result_filters": [],
  "task_data_filters": [],
  "components": []
}

Here, in executors, the Trainer implementation NPTrainer is configured for the task “train”.

If you had implemented your own custom NPTrainer training routine, for example in hello-numpy-sag/custom/custom_trainer.py, this config_fed_client.json configuration would be modified to point to this custom code by providing the custom path.

For example, replacing nvflare.app_common.np.np_trainer.NPTrainer with custom_trainer.NPTrainer.
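
As a rough sketch of what such a custom trainer could look like (the extra logging here is purely illustrative, not part of the example), custom_trainer.py could subclass the built-in trainer and keep the NPTrainer class name so the config path custom_trainer.NPTrainer works:

# hello-numpy-sag/custom/custom_trainer.py (illustrative sketch)
from nvflare.apis.fl_context import FLContext
from nvflare.apis.shareable import Shareable
from nvflare.apis.signal import Signal
from nvflare.app_common.np.np_trainer import NPTrainer as BaseNPTrainer


class NPTrainer(BaseNPTrainer):
    def execute(self, task_name: str, shareable: Shareable, fl_ctx: FLContext, abort_signal: Signal) -> Shareable:
        # Perform any additional custom work here before delegating to the base trainer
        self.log_info(fl_ctx, f"Custom NPTrainer handling task: {task_name}")
        return super().execute(task_name, shareable, fl_ctx, abort_signal)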

Federated Numpy with Scatter and Gather Workflow!

Now you can use the admin command prompt to submit and start this example job. To do this on a proof of concept local FL system, follow the sections Setting Up the Application Environment in POC Mode and Starting the Application Environment in POC Mode if you have not already.

Running the FL System

With the admin client command prompt successfully connected and logged in, enter the command below.

> submit_job hello-numpy-sag

Pay close attention to what happens in each of the four terminals (the server, the two clients, and the admin client). You can see how the admin submits the job to the server and how the JobRunner on the server automatically picks up the job to deploy and start the run.

This command uploads the job configuration from the admin client to the server. A job id will be returned, and we can use that id to access job information.

Note

If we use submit_job [app], that app will be treated as a single-app job.

From time to time, you can issue check_status server in the admin client to check the entire training progress.

You should now see the training progress in the very first terminal (the one that started the server).

After starting the server and clients, you should begin to see some outputs in each terminal tracking the progress of the FL run. If everything went as planned, you should see that through the 3 configured rounds, the FL system has aggregated new models on the server from the results produced by the clients.

Accessing the results

The results of each job are usually stored inside the server-side workspace.

Please refer to access server-side workspace for details on accessing the server-side workspace.
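
Because the model is persisted as a plain .npy file (models/server.npy under the job's run directory by default), you can inspect the final aggregated weights directly with numpy. The path below is only a placeholder; substitute the actual job directory from your server workspace:

import numpy as np

# Placeholder path: <server workspace>/<job_id>/models/server.npy
weights = np.load("<path-to-server-workspace>/<job_id>/models/server.npy")
print(weights)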

Shutdown FL system

Once the FL run is complete and the server has successfully aggregated the clients' results after all the rounds, run the following commands in the fl_admin to shut down the system (entering admin when prompted for the password):

> shutdown client
> shutdown server
> bye

Congratulations!

You’ve successfully built and run your first numpy federated learning system.

You now have a decent grasp of the main FL concepts, and are ready to start exploring how NVIDIA FLARE can be applied to many other tasks.

The full application for this exercise can be found in examples/hello-world/hello-numpy-sag, with the client and server components implemented in the nvflare/app_common/np folder of the NVFlare code tree.

Previous Versions of Hello Scatter and Gather