Quickstart (Numpy)

Before You Start

Before jumping into this QuickStart guide, make sure you have an environment with NVIDIA FLARE installed. You can follow the installation guide for the general concept of setting up a Python virtual environment (the recommended environment) and for how to install NVIDIA FLARE.

Introduction

This tutorial is meant solely to demonstrate how the NVIDIA FLARE system works, without introducing any actual deep learning concepts. Through this exercise, you will learn how to use NVIDIA FLARE with numpy to perform basic computations across two clients with the included Scatter and Gather workflow, which broadcasts the training tasks and then aggregates the results that come back. Because the weights are so simple, you will be able to clearly see and understand the results of the FL aggregation and the model persistor process.

The design of this exercise consists of one server and two clients starting with weights [[1, 2, 3], [4, 5, 6], [7, 8, 9]]. The following steps compose one cycle of weight updates, called a round:

  1. Clients are responsible for adding a delta to the weights to calculate new weights for the model.

  2. These updates are then sent to the server which will aggregate them to produce a model with new weights.

  3. Finally, the server sends this updated version of the model back to each client, so the clients can continue to calculate the next model weights in future rounds (a plain numpy sketch of one such round follows this list).
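
The sketch below is purely illustrative: the actual diff each client applies is defined in np_trainer.py, and the averaging is performed by the aggregator configured on the server.

import numpy as np

# Initial model weights held by the server.
weights = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)

# Step 1: each client adds its own delta to the weights (values chosen for illustration).
client_results = [weights + 1.0, weights + 2.0]

# Step 2: the server aggregates (here, averages) the clients' results.
new_global_weights = np.mean(client_results, axis=0)

# Step 3: the aggregated model goes back to the clients for the next round.
print(new_global_weights)  # every entry is 1.5 larger than the starting weights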

For this exercise, we will be working with the hello-numpy-sag application in the examples folder. Custom FL applications can contain the folders:

  1. custom: contains the custom components (np_trainer.py, np_model_persistor.py)

  2. config: contains client and server configurations (config_fed_client.json, config_fed_server.json)

  3. resources: contains the logger config (log.config)

Let’s get started. First clone the repo, if you haven’t already:

$ git clone https://github.com/NVIDIA/NVFlare.git

Remember to activate your NVIDIA FLARE Python virtual environment from the installation guide. Ensure numpy is installed.

(nvflare-env) $ python3 -m pip install numpy

Now that you have all your dependencies installed, let’s implement the federated learning system.

NVIDIA FLARE Client

In a file called np_trainer.py, import nvflare and numpy. Now we will implement the execute function so that the clients perform a simple addition of a diff to the weights, standing in for one round of training.

Find the full code of np_trainer.py at examples/hello-numpy-sag/custom/np_trainer.py to follow along.

The server sends either the initial weights or any stored weights to each of the clients through the Shareable object passed into execute(). Each client adds the diff to the model data after retrieving it from the DXO (see Data Exchange Object (DXO)) obtained from the Shareable, and creates a new Shareable to include the new weights also contained within a DXO.
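
As an orientation, here is a heavily abridged sketch of what this logic can look like. It is not the full np_trainer.py: the real trainer also handles errors, abort signals, and validation, and the diff value of 1.0 below is illustrative only.

import numpy as np

from constants import NPConstants
from nvflare.apis.dxo import DXO, DataKind, from_shareable
from nvflare.apis.executor import Executor
from nvflare.apis.fl_constant import ReturnCode
from nvflare.apis.fl_context import FLContext
from nvflare.apis.shareable import Shareable, make_reply
from nvflare.apis.signal import Signal


class SimpleNPTrainer(Executor):
    """Abridged stand-in for NPTrainer: adds a fixed diff to the received weights."""

    def execute(self, task_name: str, shareable: Shareable, fl_ctx: FLContext, abort_signal: Signal) -> Shareable:
        # Only handle the "train" task configured in config_fed_client.json.
        if task_name != "train":
            return make_reply(ReturnCode.TASK_UNKNOWN)

        # Extract the DXO carrying the current model weights from the incoming Shareable.
        dxo = from_shareable(shareable)
        weights = np.array(dxo.data[NPConstants.NUMPY_KEY], dtype=np.float32)

        # "Train": add a diff to the weights (the value 1.0 is illustrative).
        new_weights = weights + 1.0

        # Package the updated weights in a new DXO and return it as a Shareable.
        out_dxo = DXO(data_kind=DataKind.WEIGHTS, data={NPConstants.NUMPY_KEY: new_weights})
        return out_dxo.to_shareable()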

In a real federated learning training scenario, each client does its training independently on its own dataset. As such, the weights returned to the server will likely be different for each of the clients.

The FL server can aggregate (in this case average) the clients’ results to produce the aggregated model.

You can learn more about Shareable and FLContext in the documentation.

NVIDIA FLARE Server & Application

Model Persistor

np_model_persistor.py
# Copyright (c) 2021, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

import numpy as np

from constants import NPConstants
from nvflare.apis.fl_constant import FLContextKey
from nvflare.apis.fl_context import FLContext
from nvflare.app_common.abstract.model import ModelLearnable, make_model_learnable, ModelLearnableKey
from nvflare.app_common.abstract.model_persistor import ModelPersistor
from nvflare.app_common.app_constant import AppConstants


class NPModelPersistor(ModelPersistor):
    def __init__(self, model_dir="models", model_name="server.npy"):
        super().__init__()

        self.model_dir = model_dir
        self.model_name = model_name

        # This is the default model that will be used if no local model is provided.
        self.default_data = np.array(
            [[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32
        )

    def load_model(self, fl_ctx: FLContext) -> ModelLearnable:
        # Get start round from FLContext. If start_round > 0, we will try loading model from disk.
        start_round = fl_ctx.get_prop(AppConstants.START_ROUND, 0)
        engine = fl_ctx.get_engine()
        run_number = fl_ctx.get_prop(FLContextKey.CURRENT_RUN)
        run_dir = engine.get_workspace().get_run_dir(run_number)
        model_path = os.path.join(run_dir, self.model_dir, self.model_name)

        # Load the model from disk when start_round > 0; otherwise use the default data.
        if start_round > 0:
            try:
                data = np.load(model_path)
            except Exception as e:
                self.log_exception(
                    fl_ctx,
                    f"Unable to load model from {model_path}. Using default data instead.",
                    fire_event=False
                )
                data = self.default_data.copy()
        else:
            data = self.default_data.copy()

        # Generate model dictionary and create model_learnable.
        weights = {NPConstants.NUMPY_KEY: data}
        model_learnable = make_model_learnable(weights, {})

        self.logger.info(f"Loaded initial model: {model_learnable[ModelLearnableKey.WEIGHTS]}")
        return model_learnable

    def save_model(self, model: ModelLearnable, fl_ctx: FLContext):
        engine = fl_ctx.get_engine()
        run_number = fl_ctx.get_prop(FLContextKey.CURRENT_RUN)
        run_dir = engine.get_workspace().get_run_dir(run_number)
        model_path = os.path.join(run_dir, self.model_dir)
        if not os.path.exists(model_path):
            os.makedirs(model_path)

        model_save_path = os.path.join(model_path, self.model_name)
        if model_save_path:
            with open(model_save_path, "wb") as f:
                np.save(f, model[ModelLearnableKey.WEIGHTS][NPConstants.NUMPY_KEY])
            self.log_info(fl_ctx, f"Saved numpy model to: {model_save_path}")

The model persistor is used to load and save models on the server. Here, the model is simply the weights packaged into a ModelLearnable object.

Internally, DXO is used to manage data after FullModelShareableGenerator converts Learnable to Shareable on the FL server. The DXO helps all of the FL components agree on the format.

In this exercise, we can simply save the model as a binary “.npy” file. Depending on the frameworks and tools, the methods of saving the model may vary.
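
If you would like to inspect the aggregated model after a run, you can load the saved file directly with numpy. The path below is only an example; the actual location depends on your server workspace and run number.

import numpy as np

# Example path only: adjust to your server workspace and run number.
weights = np.load("run_1/models/server.npy")
print(weights)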

Application Configuration

Inside the config folder there are two files, config_fed_client.json and config_fed_server.json. For now, the default configurations are sufficient.

config_fed_server.json
{
  "format_version": 2,
  "server": {
    "heart_beat_timeout": 600
  },
  "task_data_filters": [],
  "task_result_filters": [],
  "components": [
    {
      "id": "persistor",
      "path": "np_model_persistor.NPModelPersistor",
      "args": {}
    },
    {
      "id": "shareable_generator",
      "path": "nvflare.app_common.shareablegenerators.full_model_shareable_generator.FullModelShareableGenerator",
      "args": {}
    },
    {
      "id": "aggregator",
      "path": "nvflare.app_common.aggregators.intime_accumulate_model_aggregator.InTimeAccumulateWeightedAggregator",
      "args": {
        "expected_data_kind": "WEIGHTS"
      }
    }
  ],
  "workflows": [
    {
      "id": "scatter_and_gather",
      "path": "nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather",
      "args": {
        "min_clients": 2,
        "num_rounds": 3,
        "start_round": 0,
        "wait_time_after_min_received": 10,
        "aggregator_id": "aggregator",
        "persistor_id": "persistor",
        "shareable_generator_id": "shareable_generator",
        "train_task_name": "train",
        "train_timeout": 6000
      }
    }
  ]
}

Note that the component with id persistor points to the custom NPModelPersistor by its full Python module path.

config_fed_client.json
{
  "format_version": 2,
  "executors": [
    {
      "tasks": [
        "train"
      ],
      "executor": {
        "path": "np_trainer.NPTrainer",
        "args": {}
      }
    }
  ],
  "task_result_filters": [],
  "task_data_filters": [],
  "components": []
}

Here, in executors, the Trainer implementation NPTrainer is configured for the task “train”.

Federated Numpy with Scatter and Gather Workflow!

Now you can use admin commands to upload, deploy, and start this example app. To do this on a proof of concept local FL system, follow the sections Setting Up the Application Environment in POC Mode and Starting the Application Environment in POC Mode if you have not already.

Running the FL System

With the admin client command prompt successfully connected and logged in, enter the commands below in order. Pay close attention to what happens in each of the four terminals. You can see how the admin controls the server and clients with each command.

> upload_app hello-numpy-sag

Uploads the application from the admin client to the server’s staging area.

> set_run_number 1

Creates a run directory in the workspace for the run_number on the server and all clients. The run directory allows for the isolation of different runs so the information in one particular run does not interfere with other runs.

> deploy_app hello-numpy-sag all

This will make the hello-numpy-sag application the active one in the run_number workspace. After the above two commands, the server and all the clients know the hello-numpy-sag application will reside in the run_1 workspace.

> start_app all

This start_app command instructs the NVIDIA FLARE server and clients to start training with the hello-numpy-sag application in the run_1 workspace.

From time to time, you can issue check_status server in the admin client to check the entire training progress.

You should now see how the training proceeds in the very first terminal (the one that started the server).

After starting the server and clients, you should begin to see some outputs in each terminal tracking the progress of the FL run. If everything went as planned, you should see that over the three rounds configured in config_fed_server.json, the FL system has aggregated new models on the server from the results produced by the clients.

Once the FL run is complete and the server has successfully aggregated the clients’ results after all the rounds, run the following commands in fl_admin to shut down the system (entering admin when prompted for the password):

> shutdown client
> shutdown server
> bye

In order to stop all processes, run ./stop_fl.sh.

Congratulations! You’ve successfully built and run your first numpy federated learning system. You now have a decent grasp of the main FL concepts, and are ready to start exploring how NVIDIA FLARE can be applied to many other tasks.

The full source code for this exercise can be found in examples/hello-numpy-sag.