What’s New

What’s New in FLARE v2.4.0

Usability Improvements

Client API

We introduce the new Client API, which streamlines the conversion of centralized deep learning code to federated code. Using the Client API requires only a few lines of code changes, without the need to restructure the code or implement a new class. Users can apply these small changes to their pre-existing centralized deep learning code to easily transform it into federated learning code. For PyTorch Lightning, we provide a tight integration that requires even fewer code changes. Furthermore, the Client API significantly reduces the need for users to delve into FLARE-specific concepts, helping to simplify the overall user experience.

Here is a brief example of a common pattern when using the Client API for a client trainer:

# import nvflare client API
import nvflare.client as flare

# initialize NVFlare client API
flare.init()

# run continuously when launching once
while flare.is_running():

  # receive FLModel from NVFlare
  input_model = flare.receive()

  # load the received model parameters into the local model
  net.load_state_dict(input_model.params)

  # perform local training and evaluation on received model
  {existing centralized deep learning code} ...

  # construct output FLModel
  output_model = flare.FLModel(
      params=net.cpu().state_dict(),
      metrics={"accuracy": accuracy},
      meta={"NUM_STEPS_CURRENT_ROUND": steps},
  )

  # send model back to NVFlare
  flare.send(output_model)
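
For PyTorch Lightning, the loop is even shorter. Below is a minimal sketch, assuming the nvflare.client.lightning integration described in the Client API documentation; model and datamodule stand in for your existing LightningModule and DataModule:

# import the NVFlare Lightning client API
import nvflare.client.lightning as flare
from pytorch_lightning import Trainer

# patch the Lightning trainer so fit()/validate() exchange FLModel with NVFlare
trainer = Trainer(max_epochs=1)
flare.patch(trainer)

while flare.is_running():
    # (optional) evaluate the received global model to obtain global metrics
    trainer.validate(model, datamodule=datamodule)

    # train locally on the received global model; the patched trainer
    # receives the model before fitting and sends the update back afterward
    trainer.fit(model, datamodule=datamodule)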

For more in-depth information on the Client API, refer to the Client API documentation and examples.

The 3rd-Party Integration Pattern

In certain scenarios, users face challenges when attempting to move the training logic to the FLARE client side due to pre-existing ML/DL training system infrastructure. In the 2.4.0 release, we introduce the Third-Party Integration Pattern, which allows the FLARE system and a third-party external training system to seamlessly exchange model parameters without requiring a tightly integrated system.

See the 3rd-Party System Integration documentation for more details.

Job Templates and CLI

The newly added Job Templates serve as pre-defined Job configurations designed to improve the process of creating and adjusting Job configurations. Using the new Job CLI, users can easily leverage existing Job Templates, modify them according to their needs, and generate new ones. Furthermore, the Job CLI offers users a convenient method for submitting jobs directly from the command line, without the need to start the Admin console.

nvflare job list_templates|create|submit|show_variables

Also explore the continuously growing Job Template directory we have created for commonly used configurations. For more in-depth information on Job Templates and the Job CLI, refer to the NVIDIA FLARE Job CLI documentation and tutorials.
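
As a rough sketch of the typical workflow (the exact flag names below are assumptions based on the Job CLI documentation; run nvflare job <subcommand> -h to confirm):

# list the available job templates
nvflare job list_templates

# create a job folder from a template, pointing it at existing training code
# (-j: job folder, -w: template name, -sd: script directory)
nvflare job create -j /tmp/nvflare/my_job -w sag_pt -sd ./code

# show the variables that can be customized in the generated configuration
nvflare job show_variables -j /tmp/nvflare/my_job

# submit the job directly from the command line
nvflare job submit -j /tmp/nvflare/my_job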

ModelLearner

The ModelLearner is introduced to simplify the user experience in cases requiring the Learner pattern. Users interact exclusively with the FLModel object, which includes weights, optimizer, metrics, and metadata, while FLARE-specific concepts remain hidden from users. The ModelLearner defines standard learning functions, such as train(), validate(), and submit_model(), which can be overridden in a subclass for easy adaptation.

See the Model Learner documentation and API definitions of ModelLearner and FLModel for more detail.
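
As an illustration, here is a minimal sketch of a ModelLearner subclass (the import paths and signatures follow the Model Learner documentation as we understand it; real training and evaluation code replaces the placeholders):

from nvflare.app_common.abstract.fl_model import FLModel, ParamsType
from nvflare.app_common.abstract.model_learner import ModelLearner


class MyModelLearner(ModelLearner):

    def train(self, model: FLModel) -> FLModel:
        # load model.params into the local network and run local training here
        updated_params = model.params  # placeholder for the locally trained weights
        return FLModel(params_type=ParamsType.FULL, params=updated_params, metrics={"accuracy": 0.0})

    def validate(self, model: FLModel) -> FLModel:
        # evaluate model.params on local data and return the metrics only
        return FLModel(metrics={"accuracy": 0.0})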

Step-by-Step Example Series

To help users quickly get started with FLARE, we’ve introduced a comprehensive step-by-step example series using Jupyter Notebooks. Unlike traditional examples, the step-by-step series uses only two datasets for consistency: CIFAR10 for image data and the HIGGS dataset for tabular data. Each example builds upon previous ones to showcase different features, workflows, or APIs, allowing users to gain a comprehensive understanding of FLARE functionalities.

CIFAR10 Examples:

  • image_stats: federated statistics (histograms) of CIFAR10.

  • sag: scatter and gather (SAG) workflow with PyTorch and the Client API.

  • sag_deploy_map: scatter and gather workflow with deploy_map configuration, for deployment of apps to different sites using the Client API.

  • sag_model_learner: scatter and gather workflow illustrating how to write client code using the ModelLearner.

  • sag_executor: scatter and gather workflow demonstrating how to write client-side executors.

  • sag_mlflow: MLflow experiment tracking logs with the Client API in scatter & gather workflows.

  • sag_he: homomorphic encryption using Client API and POC -he mode.

  • cse: cross-site evaluation using the Client API.

  • cyclic: cyclic weight transfer workflow with server-side controller.

  • cyclic_ccwf: client-controlled cyclic weight transfer workflow with client-side controller.

  • swarm: swarm learning and client-side cross-site evaluation with Client API.

HIGGS Examples:

  • tabular_stats: federated statistics (tabular histogram) calculation.

  • scikit_learn: federated linear model (logistic regression on binary classification) learning on tabular data.

  • sklearn_svm: federated SVM model learning on tabular data.

  • sklearn_kmeans: federated k-Means clustering on tabular data.

  • xgboost: federated horizontal xgboost learning on tabular data with bagging collaboration.

Streaming APIs

To support large language models (LLMs), the 2.4.0 release introduces the streaming API to facilitate the transfer of objects exceeding the 2 GB size limit imposed by gRPC. A new streaming layer designed to handle large objects divides a large model into 1 MB chunks and streams them to the target. Built-in streamers are provided for Objects, Bytes, Files, and Blobs, offering a versatile solution for efficient object streaming between different endpoints.

Refer to the nvflare.fuel.f3.stream_cell API for more details, and to the Large Models documentation for insights on working with large models in FLARE.

Expanding Federated Learning Workflows

In the 2.4.0 release, we introduce Client Controlled Workflows as an alternative to the existing server-side controlled workflows.

Server-side controlled workflow

  • The server is trusted by all clients to handle the training process, job management, and the final model weights

  • The server controller manages the job lifecycle (e.g., health of client sites, monitoring of job status)

  • The server controller manages the training process (e.g., task assignment, model initialization, aggregation, and obtaining the distributed final model)

Client-side controlled workflow

  • Clients do not trust the server to handle the training process. Instead, task assignment, model initialization, aggregation, and final model distribution are handled by the clients.

  • The server controller still manages the job lifecycle (e.g., health of client sites, monitoring of job status)

  • Secure Messaging: clients exchange peer-to-peer messages over TLS; the sender encrypts each message with an AES-256 key, which is in turn protected with the receiver’s public key obtained from the receiver’s certificate. Only the sender and the receiver can view the message. If there is no direct connection between clients and the message is routed via the server, the server is unable to decrypt it.

Three commonly used types of client-side controlled workflows are provided:

  • Cyclic Learning: the model is passed from client to client.

  • Swarm Learning: clients are randomly selected to act as the client-side controller and aggregators, and Scatter and Gather with FedAvg is then performed among the clients.

  • Cross Site Evaluation: allows clients to evaluate other sites’ models.

See swarm learning and client-controlled cyclic for examples using these client-controlled workflows.

MLflow and Weights & Biases Experiment Tracking Support

We have expanded our experiment tracking support to the MLflow and Weights & Biases systems. Detailed documentation on these features can be found in Experiment Tracking, and examples can be found at FL Experiment Tracking with MLflow and wandb.
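
For example, with the Client API, metrics can be streamed to the configured tracking system through the writers in nvflare.client.tracking (a brief sketch, assuming the MLflowWriter helper described in the experiment tracking documentation):

from nvflare.client.tracking import MLflowWriter

writer = MLflowWriter()

# log a metric from the local training loop; the value is routed through FLARE
# to the configured receiver (MLflow here) rather than to a local tracking server
writer.log_metric(key="train_loss", value=0.42, step=100)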

Configuration Enhancements

Multi Configuration File Formats

In the 2.4.0 release, we have added support for multiple configuration formats. Prior to this release, the sole configuration file format was JSON, which, although flexible, lacked useful features such as comments, variable substitution, and inheritance.

We added two new configuration formats:

  • Pyhocon - a HOCON (Human-Optimized Config Object Notation) parser for Python; HOCON is a JSON variant with many desired features

  • OmegaConf - a YAML-based hierarchical configuration system

Users have the flexibility to use a single format or combine several formats, as exemplified by config_fed_client.conf and config_fed_server.json. If multiple configuration formats coexist, then their usage will be prioritized based on the following search order: .json -> .conf -> .yml -> .yaml
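
For instance, a client configuration written in HOCON can use comments and variable substitution (an illustrative sketch only; some.package.MyExecutor is a placeholder, and the surrounding schema mirrors the existing JSON job configuration):

# config_fed_client.conf - HOCON allows comments and variable substitution
{
  format_version = 2

  # define a value once and reference it below
  app_script = "train.py"

  executors = [
    {
      tasks = ["train"]
      executor {
        path = "some.package.MyExecutor"   # placeholder component path
        args {
          script = ${app_script}           # HOCON variable substitution
        }
      }
    }
  ]
}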

Improved Job Configuration File Processing

  • Variable Resolution - for user-defined variable definitions and variable references in config files

  • Built-in System Variables - for pre-defined system variables available to use in config files

  • OS Environment Variables - OS environment variables can be referenced via the dollar sign

  • Parameterized Variable Definition - for creating configuration templates that can be reused and resolved into different concrete configurations

See more details in the Configuration Files documentation.

POC Command Upgrade

We have expanded the POC command to bring users one step closer to the real deployment process. The changes allow users to experiment with deployment options locally and use the same project.yaml file for both experimentation and production.

The POC command mode has been changed from “local, non-secure” to “local, secure, production” to better reflect the production environment simulation. Lastly, the POC command syntax is now more aligned with common CLI conventions: nvflare poc --<action> => nvflare poc <action>
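
A typical local workflow looks roughly like the following (a sketch; see the POC documentation for the full set of options):

# provision a local, secure POC environment with two clients
nvflare poc prepare -n 2

# start the services (server and clients)
nvflare poc start

# ... submit and run jobs, e.g. with "nvflare job submit" or the FLARE API ...

# stop the services and clean up the POC workspace when done
nvflare poc stop
nvflare poc clean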

See more details in the Proof Of Concept (POC) Command documentation or tutorial.

Security Enhancements

Unsafe component detection

Users can now define an unsafe component checker, which is invoked to validate each component before it is built. If the checker fails to validate a component, it raises an UnsafeJob exception, causing the job to be aborted.

For more details, refer to the Unsafe Component Detection documentation.

Event-based security plug-in

We have introduced additional FL events that can be used to build plug-ins for job-level function authorizations.

Refer to the Site-specific Authentication and Federated Job-level Authorization documentation, as well as the custom authentication example, for more details about these capabilities.

FL HUB: Hierarchical Unification Bridge

The FL HUB is a new experimental feature designed to support multiple FLARE systems working together in a hierarchical manner. In federated computing, the number of edge devices is usually large, often with just a single server, which can cause performance issues. A solution to this problem is a hierarchical FLARE system, where tiered FLARE systems connect to form a tree-like structure. Each group of leaf clients (edge devices) connects only to its own server, and that server in turn serves as a client of the FLARE system in the parent tier.

One potential use case is global studies, where client machines may be located across different regions. Rather than requiring all client machines across regions to connect to a single FL server, the FL HUB could enable a more performant tiered multi-server setup.

Learn more about the FL Hub in the Hierarchy Unification Bridge documentation and the code.

Misc. Features

  • FLARE API Parity

    • FLARE API now has the same set of APIs as the Admin Client.

    • Allows users to use almost all of the admin commands from the Python API or notebooks.

  • Docker Support

    • NVFLARE cloud CSP startup scripts now support deployment with docker containers in addition to VM deployment.

    • The provision command now supports detached docker run in addition to interactive docker run.

  • FLARE Dashboard

    • Prior to 2.4.0, the FLARE Dashboard could only run within a Docker container.

    • In 2.4.0, the FLARE Dashboard can now run locally without Docker for development.

  • Run Model Evaluation Without Training

  • Communication Enhancements

    • We added an application-layer ping between the Client Job process and the Server parent process to replace the gRPC timeout. Previously, we noticed that if the gRPC timeout was set too long, the cloud provider (e.g., Azure Cloud) would kill the connection after 4 minutes, while if it was set too short (such as 2 minutes), the underlying gRPC would report too many pings. The application-level ping avoids both issues and ensures the server/client is aware of the status of the processes.

    • FLARE provides two drivers for gRPC-based communication: the asyncio (AIO) and regular (non-AIO) versions of the gRPC library. One notable benefit of the AIO gRPC is its ability to handle many more concurrent connections on the server side. However, the AIO gRPC may crash under challenging network conditions on the client side, whereas the non-AIO gRPC is more stable. Hence, in FLARE 2.4.0, the default configuration uses the non-AIO gRPC library for better stability.

      • To change the driver selection, users can update comm_config.json in the local directory of the workspace and set the use_aio_grpc config variable, as in the sketch below.
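
For example, to switch to the AIO driver, a minimal comm_config.json might contain (assuming a boolean flag, as described above):

{
  "use_aio_grpc": true
}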

New Examples

Federated Large Language Model (LLM) examples

We’ve added several examples to demonstrate how to work with federated LLMs.

Vertical Federated XGBoost

With the 2.0 release of XGBoost, we are able to demonstrate a vertical XGBoost example. We use Private Set Intersection and XGBoost’s new federated learning support to perform classification on vertically split HIGGS data (where sites share overlapping data samples but contain different features).

Graph Neural Networks (GNNs)

We added two examples using GraphSAGE to demonstrate how to train federated GNNs on graph datasets using inductive learning.

Protein Classification: classifies protein roles based on their cellular functions from gene ontology. We use the PPI (protein-protein interaction) dataset, which is commonly used in graph-based machine learning tasks, especially in bioinformatics. The dataset represents interactions between proteins as graphs, where each graph corresponds to a specific human tissue, nodes represent proteins, and edges represent interactions between them.

Financial Transaction Classification: classifies whether a given transaction is licit or illicit. For this financial application, we use the Elliptic++ dataset, which consists of 203k Bitcoin transactions and 822k wallet addresses, to enable both the detection of fraudulent transactions and the detection of illicit addresses (actors) in the Bitcoin network by leveraging graph data. For more details, please refer to this paper.

Financial Application Examples

To demonstrate how to perform fraud detection in financial applications, we introduced an example illustrating how to use XGBoost in various ways to train a model in a federated manner on a finance dataset. It covers both vertical and horizontal federated learning with XGBoost, along with histogram-based and tree-based approaches.

KeyCloak Site Authentication Integration

FLARE is agnostic to the 3rd party authentication mechanism, and each client can have its own authentication system. We demonstrate FLARE’s support of site-specific authentication using KeyCloak. The KeyCloak Site Authentication Integration example is configured so the admin user will need additional user authentication to submit and run a job.

Migration to 2.4.0: Notes and Tips

FLARE 2.4.0 introduces a few API and behavior changes. This migration guide will help you migrate from a previous NVFLARE version to the current version.

Job Format: meta.json

In FLARE 2.4.0, users must have a meta.json configuration file defined in their jobs. Legacy app definitions should be updated to the job format to include a meta.json file with a deployment map and any number of app folders (containing config/ and custom/). Here is a basic job structure with a single app:

├── my_job
│   ├── app
│      ├── config
│         ├── config_client.json
│         └── config_server.json
│      └── custom
│   └── meta.json

Here is the default meta.json which can be edited accordingly:

{
  "name": "my_job",
  "resource_spec": {},
  "min_clients" : 2,
  "deploy_map": {
    "app": [
      "@ALL"
    ]
  }
}

FLARE API Parity

In FLARE 2.3.0, an initial version of the FLARE API was implemented as a redesigned FLAdminAPI; however, only a subset of the functions was included. In FLARE 2.4.0, the FLARE API has been enhanced to include the remaining functions of the FLAdminAPI so that the FLAdminAPI can be sunset.

See the Migrating to FLARE API from FLAdminAPI documentation for more details on the added functions.
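
As a brief sketch of FLARE API usage (the entry point and calls follow the FLARE API documentation; the paths shown are placeholders):

from nvflare.fuel.flare_api.flare_api import new_secure_session

# connect using the admin user's startup kit directory
sess = new_secure_session("admin@nvidia.com", "/workspace/example_project/prod_00/admin@nvidia.com")

print(sess.get_system_info())

# submit a job folder and monitor it until it finishes
job_id = sess.submit_job("/workspace/jobs/my_job")
sess.monitor_job(job_id)

sess.close()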

Timeout Handling

In the 2.4.0 release, improvements have been made to the timeout handling for commands that involve the Admin Server communicating with FL Clients and awaiting responses. Previously, a fixed global timeout value was used on the Admin Server; however, this value was sometimes not enough when a command took a long time (e.g., the cat server log.txt command may take a while to transfer a large log file). In this case, the user could use the set_timeout command to change the default timeout value of the Admin Server, but this command was global and affected all users: one user setting a very small timeout value could cause all users’ commands to fail.

To address this, the set_timeout command has been changed to be session-specific. Additionally, a new unset_timeout command has been added to revert the session to the Admin Server’s default timeout.
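
For example, within an admin console session:

> set_timeout 300
> cat server log.txt
> unset_timeout

Here set_timeout 300 raises the timeout to 300 seconds for this session only, and unset_timeout reverts the session to the Admin Server’s default.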

Changes to show_stats and show_errors

The old structure put the server’s result dict directly at the top level of the overall result dict, while each client’s result dict was placed as an item keyed on the client name. To make the server and client results consistent, the server’s result is now placed as an item keyed on “server”. If any code relies on the old return structure of the FLAdminAPI, please update it accordingly.

{
  "server": { # new "server" key for server result dict
    "ScatterAndGather": {
      "tasks": {
        "train": [
          "site-1",
          "site-2"
        ]
      },
      "phase": "train",
      "current_round": 2,
      "num_rounds": 50
    },
    "ServerRunner": {
      "job_id": "3ad5bdef-db12-4ffb-9362-0ff163973f7d",
      "status": "started",
      "workflow": "scatter_and_gather"
    }
  },
  "site-1": {
    "ClientRunner": {
      "job_id": "3ad5bdef-db12-4ffb-9362-0ff163973f7d",
      "current_task_name": "None",
      "status": "started"
    }
  },
  "site-2": {
    "ClientRunner": {
      "job_id": "3ad5bdef-db12-4ffb-9362-0ff163973f7d",
      "current_task_name": "train",
      "status": "started"
    }
  }
}

POC Command Upgrade

The POC command has been upgraded in 2.4.0:

  • Removed the -- prefix for action commands; actions are now subcommands

  • POC now uses “production mode”; the admin user name is now “admin@nvidia.com” instead of “admin” from previous releases.

  • New -d (Docker) and -he (homomorphic encryption) options

  • nvflare poc prepare generates .nvflare/config.conf to store the location of the POC workspace, which takes precedence over the NVFLARE_POC_WORKSPACE environment variable

  • In the previous version, the startup kits were located directly under the default POC workspace at /tmp/nvflare/poc. In 2.4.0, the startup kits are now under /tmp/nvflare/poc/example_project/prod_00/ to follow the default structure of production provisioning.

  • Multi-org and multi-role support

nvflare poc -h
usage: nvflare poc [-h] [--prepare] [--start] [--stop] [--clean] {prepare,prepare-jobs-dir,start,stop,clean} ...

optional arguments:
  -h, --help            show this help message and exit
  --prepare             deprecated, suggest use 'nvflare poc prepare'
  --start               deprecated, suggest use 'nvflare poc start'
  --stop                deprecated, suggest use 'nvflare poc stop'
  --clean               deprecated, suggest use 'nvflare poc clean'

poc:
  {prepare,prepare-jobs-dir,start,stop,clean}
                        poc subcommand
    prepare             prepare poc environment by provisioning local project
    prepare-jobs-dir    prepare jobs directory
    start               start services in poc mode
    stop                stop services in poc mode
    clean               clean up poc workspace

Refer to Proof Of Concept (POC) Command for more details.

Secure Messaging

A new secure argument has been added to send_aux_request() in ServerEngineSpec and ClientEngineExecutorSpec.

secure is an optional boolean that determines whether the aux request should be sent in a secure way. One such use case is secure peer-to-peer messaging, such as in the client-controlled workflows.

@abstractmethod
def send_aux_request(
    self,
    targets: Union[None, str, List[str]],
    topic: str,
    request: Shareable,
    timeout: float,
    fl_ctx: FLContext,
    optional=False,
    secure: bool = False,
) -> dict:
    """Send a request to Server via the aux channel.

    Implementation: simply calls the ClientAuxRunner's send_aux_request method.

    Args:
        targets: aux messages targets. None or empty list means the server.
        topic: topic of the request
        request: request to be sent
        timeout: number of secs to wait for replies. 0 means fire-and-forget.
        fl_ctx: FL context
        optional: whether the request is optional
        secure: whether the request should be sent in a secure way

    Returns:
        a dict of reply Shareable in the format of:
            { site_name: reply_shareable }
    """
    pass

Stats Result Format

In StatisticsController, the result dictionary format originally concatenated “site” and “dataset” into a single key to support visualization. In 2.4.0, this has been changed so that “site” and “dataset” have their own keys in the result dictionary.

result = {feature: {statistic: {site-dataset: value}}}

to

result = {feature: {statistic: {site: {dataset: value}}}}
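
For example, code that reads the results would change along these lines (the feature, statistic, site, and dataset names are hypothetical):

# pre-2.4.0: site and dataset were concatenated into a single key
old_value = result["Age"]["count"]["site-1-train"]

# 2.4.0: site and dataset are separate keys
new_value = result["Age"]["count"]["site-1"]["train"]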

To continue to support the visualization needs, the site-dataset concatenation logic has instead been moved to Visualization.

Previous Releases of FLARE

Also refer to the NVFlare GitHub releases to see minor release notes for RC versions.