Frequently Asked Questions
What is NVIDIA FLARE?
NVIDIA FLARE is a general-purpose framework designed for collaborative computing. In this collaborative computing framework, workflows are not limited to aggregation-based federated learning (usually called a Fed-Average workflow), and applications are not limited to deep learning. NVIDIA FLARE is fundamentally a messaging system running in a multithreaded environment.
What does NVIDIA FLARE stand for?
NVIDIA Federated Learning Application Runtime Environment.
Does NVIDIA FLARE depend on TensorFlow or PyTorch?
No. NVIDIA FLARE is a Python library that implements a general collaborative computing framework. The Controllers, Executors, and Tasks that one defines to execute the collaborative computing workflow are entirely independent of any particular deep learning framework.
Is NVIDIA FLARE designed for deep learning model training only?
No. NVIDIA FLARE implements a communication framework that can support any collaborative computing workflow. This could be deep learning, machine learning, or even simple statistical workflows.
Does NVIDIA FLARE require a GPU?
No. Hardware requirements are dependent only on what is implemented in the Controller workflow and client Tasks. Client training tasks will typically benefit from GPU acceleration. Server Controller workflows may or may not require a GPU.
How does NVIDIA FLARE implement its collaborative computing framework?
NVIDIA FLARE collaborative computing is achieved through Controller/Worker interaction.
What is a Controller?
The Controller is a Python object that controls or coordinates Workers to perform tasks. The Controller runs on the server and defines the overall collaborative computing workflow. In its control logic, the Controller assigns tasks to Workers and processes task results from the Workers.
What is a Worker?
A Worker is capable of performing tasks (skills). Workers run on Clients.
What is a Task?
A Task is a piece of work (Python code) that is assigned by the Controller to client workers. Depending on how the Task is assigned (broadcast, send, or relay), the task will be performed by one or more clients. The logic to be performed in a Task is defined in an Executor.
What is Learnable?
Learnable is the result of the Federated Learning application maintained by the server. In DL workflows, the Learnable is the aspect of the DL model to be learned. For example, the model weights are commonly the Learnable feature, not the model geometry. Depending on the purpose of your study, the Learnable may be any component of interest. Learnable is an abstract object that is aggregated from the client’s Shareable object and is not DL-specific. It can be any model, or object. The Learnable is managed in the Controller workflow.
What is Shareable?
Shareable is simply a communication between two peers (server and clients). In the task-based interaction, the Shareable from server to clients carries the data of the task for the client to execute, and the Shareable from the client to server carries the result of the task execution. When this is applied to DL model training, the task data typically contains model weights for the client to train on, and the task result contains updated model weights from the client. The concept of Shareable is very general: it can be whatever makes sense for the task.
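The task-based interaction described above can be pictured with a small, framework-free sketch. Everything in it (ToyController, ToyWorker, make_trainer, the plain dict used as payload) is a hypothetical stand-in chosen for illustration and is not part of the NVIDIA FLARE API. It only shows the shape of the exchange: the server side broadcasts task data, each client runs the executor registered for that task name, and results flow back.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-ins for illustration only; these are not NVIDIA FLARE classes.
Payload = Dict[str, float]  # plays the role of a Shareable


@dataclass
class ToyWorker:
    name: str
    # Maps a task name to the function that performs it (the role of an Executor).
    executors: Dict[str, Callable[[Payload], Payload]] = field(default_factory=dict)

    def run_task(self, task_name: str, task_data: Payload) -> Payload:
        return self.executors[task_name](task_data)


class ToyController:
    def __init__(self, workers: List[ToyWorker]) -> None:
        self.workers = workers

    def broadcast(self, task_name: str, task_data: Payload) -> Dict[str, Payload]:
        # Send a copy of the same task data to every worker and collect each result.
        return {w.name: w.run_task(task_name, dict(task_data)) for w in self.workers}


def make_trainer(step: float) -> Callable[[Payload], Payload]:
    # Pretend "training": shift every weight by a site-specific step.
    def train(task_data: Payload) -> Payload:
        return {k: v + step for k, v in task_data.items()}

    return train


workers = [
    ToyWorker("site-1", {"train": make_trainer(0.1)}),
    ToyWorker("site-2", {"train": make_trainer(0.3)}),
]
controller = ToyController(workers)
results = controller.broadcast("train", {"w": 1.0})
```

Here results holds one entry per site, each containing that site's updated weights. A real workflow would then aggregate these results into the Learnable.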
What is FLContext and what kind of information does it contain?
FLContext is one of the key features of NVIDIA FLARE and is available to every method of all FLComponent types (Controller, Aggregator, Executor, Filter, Widget, …). An FLContext object contains contextual information of the FL environment: overall system settings (peer name, job id / run number, workspace location, etc.). FLContext also contains an important object called Engine, through which you can access important services provided by the system (e.g. fire events, get all available client names, send aux messages, etc.).
What are events and how are they handled?
Events allow for dynamic notifications to be sent to all objects that are a subclass of FLComponent. Every FLComponent is an event handler.
The event mechanism is like a pub-sub mechanism that enables indirect communication between components for data sharing. Typically, the data generator fires an event to publish the data, and other components handle the events they are subscribed to and consume the data of the event. The fed event mechanism even allows the pub-sub to go across network boundaries.
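As a rough illustration of this pub-sub idea, the sketch below delivers every fired event to every registered component and lets each component decide which event types it cares about. ToyComponent, ToyEngine, MetricsLogger, and the event name "round_metrics" are all hypothetical names invented for this example. The real FLComponent and Engine interfaces differ (for instance, they pass an FLContext rather than a raw data object).

```python
from typing import Any, List


class ToyComponent:
    """Hypothetical stand-in for an event-handling component."""

    def handle_event(self, event_type: str, data: Any) -> None:
        # Default behavior: ignore every event.
        pass


class ToyEngine:
    """Hypothetical dispatcher that delivers each fired event to all components."""

    def __init__(self) -> None:
        self._components: List[ToyComponent] = []

    def register(self, component: ToyComponent) -> None:
        self._components.append(component)

    def fire_event(self, event_type: str, data: Any) -> None:
        for component in self._components:
            component.handle_event(event_type, data)


class MetricsLogger(ToyComponent):
    """Consumes only the events it is interested in."""

    def __init__(self) -> None:
        self.seen: List[float] = []

    def handle_event(self, event_type: str, data: Any) -> None:
        if event_type == "round_metrics":
            self.seen.append(data["loss"])


engine = ToyEngine()
logger = MetricsLogger()
engine.register(logger)
engine.register(ToyComponent())  # a component that ignores all events

engine.fire_event("round_metrics", {"loss": 0.42})
engine.fire_event("unrelated_event", {"loss": 9.9})
```

The publisher never references MetricsLogger directly, which is the indirect communication the answer above describes.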
What additional components may be implemented with NVIDIA FLARE to support the Controller Workflow, and where do they run (server or client):
- LearnablePersistor - Server
The LearnablePersistor is a method implemented for the server to save the state of the Learnable object, for example writing a global model to disk for persistence.
- ShareableGenerator - Server
The ShareableGenerator is an object that implements two methods: learnable_to_shareable converts a Learnable object to a form of data to be shared to the client; shareable_to_learnable uses the shareable data (or aggregated shareable data) from the clients to update the learnable object.
- Aggregator - Server
The aggregator defines the algorithm used on the server to aggregate the data passed back to the server in the clients’ Shareable object.
- Executor - Client
The Executor defines the algorithm the clients use to operate on data contained in the Shareable object. For example in DL training, the executor would implement the training loop. There can be multiple executors on the client, designed to execute different tasks (training, validation/evaluation, data preparation, etc.).
- Filter - Clients and Server
Filters are used to define transformations of the data in the Shareable object when transferred between server and client and vice versa. Filters can be applied when the data is sent or received by either the client or server. See the diagram on the Filters page for details on when “task_data_filters” and “task_result_filters” are applied on the client and server.
- Any component that is a subclass of FLComponent
All component types discussed above are subclasses of FLComponent. You can create your own subclass of FLComponent for various purposes. For example, you can create such a component to listen to certain events and handle the data of the events (analysis, dump to disk or DB, etc.).
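To make the Aggregator role from the list above more concrete, here is a minimal sketch of sample-weighted averaging, the idea behind a Fed-Average style workflow. It uses plain Python dicts and a hypothetical weighted_average function. It does not use or reproduce the actual NVIDIA FLARE Aggregator interface, and the pairing of each update with a sample count is an assumption made for the example.

```python
from typing import Dict, List, Tuple

Weights = Dict[str, float]


def weighted_average(updates: List[Tuple[Weights, int]]) -> Weights:
    """Average client weights, weighting each client by its sample count.

    `updates` is a list of (weights, num_samples) pairs, one per client.
    """
    total = sum(num_samples for _, num_samples in updates)
    if total <= 0:
        raise ValueError("at least one client must report a positive sample count")
    keys = updates[0][0].keys()
    return {
        key: sum(weights[key] * num_samples for weights, num_samples in updates) / total
        for key in keys
    }


# Two clients: the second has three times as much data as the first.
client_updates = [({"w": 1.0}, 1), ({"w": 3.0}, 3)]
global_weights = weighted_average(client_updates)
```

With these inputs the aggregated value is (1.0 * 1 + 3.0 * 3) / 4 = 2.5, so the client with more data pulls the result toward its own weights.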
What is Provisioning?
NVIDIA FLARE includes an Open Provision API that allows you to generate mutually trusted, system-wide configurations, or startup kits, that allow all participants to join the NVIDIA FLARE system from different locations. This mutual trust is a mandatory feature of the Open Provision API, as every participant authenticates the others using the information inside the configuration. The configurations usually include, but are not limited to:
- network discovery, such as domain names, port numbers or IP addresses
- credentials for authentication, such as certificates of participants and root authority
- authorization policy, such as roles, rights and rules
- tamper-proof mechanism, such as signatures
- convenient commands, such as shell scripts with default command line options to easily start an individual participant
What types of startup kits are generated by the Provision tool?
The Open Provision API allows flexibility in generating startup kits, but typically the provisioning tool is used to generate secure startup kits for the Overseer, FL servers, FL clients, and Admin clients.
What files does each type of startup kit contain? What are these files used for, and by whom?
Startup kits contain the configuration and certificates necessary to establish secure connections between the Overseer, FL servers, FL clients, and Admin clients. These files are used to establish identity and authorization policies between server and clients. Startup kits are distributed to the Overseer, FL servers, clients, and Admin clients depending on role. For the purpose of development, startup kits may be generated with limited security to allow simplified connection between systems or between processes on a single host. See the “poc” functionality of the Open Provision API for details.
How would you distribute the startup kits to the right people?
Distribution of startup kits is inherently flexible and can be done via email or shared storage. The API allows the addition of builder components to automate distribution.
What happens after provisioning?
After provisioning, the Admin API is used to submit a job to the FL server, and the JobRunner on the server can pick it up to deploy and run.
What is an Application in NVIDIA FLARE?
An Application is a named directory structure that defines the client and server configuration and any custom code required to implement the Controller/Worker workflow.
What is the basic directory structure of an NVIDIA FLARE Application?
Typically the Application configuration is defined in a config/ subdirectory and defines paths to Controller and Worker executors. Custom code can be defined in a custom/ subdirectory and is subject to rules defined in the Authorization Policy.
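The config/ and custom/ layout described above can be sketched with a short script that builds an empty application skeleton in a temporary directory. The helper name create_app_skeleton, the app name hello_app, the two config file names, and the placeholder JSON content are assumptions made for illustration. Check them against the examples shipped with your NVIDIA FLARE version before relying on them.

```python
import json
import tempfile
from pathlib import Path


def create_app_skeleton(root: Path, app_name: str) -> Path:
    """Create <root>/<app_name>/config and <root>/<app_name>/custom."""
    app_dir = root / app_name
    config_dir = app_dir / "config"
    config_dir.mkdir(parents=True)
    (app_dir / "custom").mkdir()
    # Assumed file names and placeholder content, for illustration only.
    for filename in ("config_fed_server.json", "config_fed_client.json"):
        (config_dir / filename).write_text(json.dumps({"format_version": 2}, indent=2))
    return app_dir


with tempfile.TemporaryDirectory() as tmp:
    app_dir = create_app_skeleton(Path(tmp), "hello_app")
    # Relative paths of everything that was created, in sorted order.
    layout = sorted(p.relative_to(app_dir).as_posix() for p in app_dir.rglob("*"))
```

After the script runs, layout lists the config/ directory with its two configuration files and the empty custom/ directory where custom Python modules would go.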
How do you deploy an application?
An Application is deployed using the submit_job admin command. For more configuration, apps can be packaged into jobs with deploy_map definitions that specify which apps should be deployed to which sites. The deployment happens automatically with the JobRunner on the FL server.
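The deploy_map idea can be illustrated with a small sketch that builds a job description as a Python dict and looks up which apps a given site would receive. The job name, app names, site names, and the min_clients field are hypothetical values chosen for this example, and apps_for_site is a helper invented here rather than an NVIDIA FLARE function.

```python
import json
from typing import Any, Dict, List

# Hypothetical job description: each app name maps to the sites it is deployed to.
job_meta: Dict[str, Any] = {
    "name": "hello_job",
    "min_clients": 2,
    "deploy_map": {
        "app_server": ["server"],
        "app_client": ["site-1", "site-2"],
    },
}


def apps_for_site(meta: Dict[str, Any], site: str) -> List[str]:
    """Return the names of all apps whose deploy_map entry includes `site`."""
    return sorted(app for app, sites in meta["deploy_map"].items() if site in sites)


# Serialized form, as it might be written to a job's metadata file.
job_meta_json = json.dumps(job_meta, indent=2)
```

In this sketch the server and the clients receive different apps, which is one way sites can end up running different application configurations within the same job.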
Do all FL clients have to use the same application configuration?
No, they do not have to use the same application configuration, although they can, and this is frequently done. The function of FL clients can be customized by the implementation of Tasks and Executors and the client’s response to Events.
What is the difference between the Admin client and the FL client?
The Admin client is used to control the state of the server’s controller workflow and only interacts with the server. FL clients poll the server and perform tasks based on the state of the server. The Admin client does not interact directly with FL clients.
Where does the Admin client run?
The Admin client runs as a standalone process, typically on a researcher’s workstation or laptop.
What can you do with the Admin client?
The Admin client is used to orchestrate the FL study, including starting and stopping server and clients, deploying applications, and managing FL experiments.
How can I get the global model at the end of training? What can I do to resolve keys not matching with the model defined?
You can use the download_job command with the Admin client to get the job result into the admin transfer folder. The model is saved in a dict depending on the persistor you used, so you might need to access it with model.load_state_dict(torch.load(path_to_model)["model"]) if you used PTFileModelPersistor, because PTModelPersistenceFormatManager saves the model under the key “model”.
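When keys do not match, it can help to compare the keys the model expects with the keys actually present in the loaded file. The sketch below uses plain dicts so that it runs without PyTorch. extract_state_dict and key_mismatch are hypothetical helpers written for this example, and the checkpoint contents are made up.

```python
from typing import Any, Iterable, List, Mapping, Tuple


def extract_state_dict(checkpoint: Mapping[str, Any], key: str = "model") -> Mapping[str, Any]:
    """Return the nested weights if the checkpoint wraps them under `key`."""
    nested = checkpoint.get(key)
    if isinstance(nested, Mapping):
        return nested
    # Otherwise assume the checkpoint already is the state dict.
    return checkpoint


def key_mismatch(expected: Iterable[str], actual: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) key names, each sorted."""
    expected_set, actual_set = set(expected), set(actual)
    return sorted(expected_set - actual_set), sorted(actual_set - expected_set)


# Made-up checkpoint that wraps the weights under "model".
checkpoint = {"model": {"fc.weight": 1.0, "fc.bias": 2.0}, "train_conf": {}}
expected_keys = ["fc.weight", "fc.bias"]

before = key_mismatch(expected_keys, checkpoint)
after = key_mismatch(expected_keys, extract_state_dict(checkpoint))
```

Before unwrapping, every expected key is reported as missing and the wrapper keys show up as unexpected. After unwrapping, both lists are empty. With PyTorch installed, the same helper could be applied to the result of torch.load(path_to_model) before calling model.load_state_dict.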
Why am I getting an error about my custom files not being found?
Make sure that BYOC (bring your own code) is enabled. BYOC is always enabled in POC mode, but disabled by default in secure mode when provisioning. Either through the UI tool or through yml, make sure the enable_byoc flag is set for each participant. If the enable_byoc flag is disabled, your custom code will not be loaded even if it is present in your application folder. There is also an allow_byoc setting in the authorization rule groups, which controls whether apps containing BYOC code are allowed to be uploaded and deployed.
I am getting the following errors, does this mean the server is down?
Trying to obtain server address
Obtained server address: nvflare1234.westus2.cloudapp.azure.com:8003
Trying to login, please wait ...
Trying to login, please wait ...
Trying to login, please wait ...
Trying to login, please wait ...
Trying to login, please wait ...
Communication Error - please try later
2023-03-14 19:36:15,966 - nvflare.fuel.f3.sfm.conn_manager - INFO - Retrying [CH00001 ACTIVE grpc://nvflare1234.westus2.cloudapp.azure.com:8002] in 60 seconds
2023-03-14 19:36:17,091 - Cell - ERROR - [ME=site1 O=? D=server F=? T=? CH=task TP=hear_beat] cannot find path to server
2023-03-14 19:36:17,091 - Cell - ERROR - [ME=site1 O=? D=server F=? T=? CH=task TP=hear_beat] cannot send to 'server': target_unreachable
2023-03-14 19:36:27,101 - Cell - ERROR - [ME=site1 O=? D=server F=? T=? CH=task TP=hear_beat] cannot find path to server
2023-03-14 19:36:27,101 - Cell - ERROR - [ME=site1 O=? D=server F=? T=? CH=task TP=hear_beat] cannot send to 'server': target_unreachable
2023-03-14 19:36:37,110 - Cell - ERROR - [ME=site1 O=? D=server F=? T=? CH=task TP=hear_beat] cannot find path to server
2023-03-14 19:36:37,110 - Cell - ERROR - [ME=site1 O=? D=server F=? T=? CH=task TP=hear_beat] cannot send to 'server': target_unreachable
2023-03-14 19:36:47,121 - Cell - ERROR - [ME=site1 O=? D=server F=? T=? CH=task TP=hear_beat] cannot find path to server
Several conditions can cause the above errors. One is that the server is down. Another is delayed or cached DNS name resolution, which happens when the IP address of the NVFlare server changes but its domain name remains the same. A DNS change can take up to 72 hours to propagate worldwide, although most of the time it takes tens of minutes.
The OS might have tools to flush its DNS cache. For example, in Ubuntu 20.04, run sudo systemd-resolve --flush-caches to flush the DNS cache and force it to get the updated name resolution.
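To check whether a stale DNS entry is the likely cause, you can compare the addresses your machine currently resolves for the server's hostname with the server's actual IP. The sketch below uses only the Python standard library. resolve_addresses is a hypothetical helper name, and the hostname you pass in would be your own server's domain name.

```python
import socket
from typing import List


def resolve_addresses(hostname: str) -> List[str]:
    """Return the sorted, de-duplicated IP addresses the OS resolves for `hostname`."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

If the addresses returned for the server's domain name do not include the server's current IP, the local resolver is probably still serving the old record, and flushing the cache or waiting for propagation is the next step.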
What is the scope of security in NVIDIA FLARE?
Security is multi-faceted and cannot be completely controlled for or provided by the NVIDIA FLARE API. The Open Provision API provides examples of basic communication and identity security using gRPC via shared self-signed certificates and authorization policies. These security measures may be sufficient but can be extended with the provided APIs.
What about data privacy?
NVIDIA FLARE comes with a few techniques to help with data privacy during FL: differential privacy and homomorphic encryption (see Privacy filters).
If the IP of the server changes, the admin client may not be able to connect anymore because the admin server remains bound to the original host and port. A possible workaround is to restart the FL server manually, and then the host will resolve to the updated IP for binding when restarting.
Running out of memory can happen at any time, especially if the server and clients are running on the same machine. This can cause the server to die unexpectedly.
After shutdown client is issued for a client running on multiple GPUs, a process (sub_worker_process) may remain. The workaround is to run abort client before the shutdown.
If a snapshot is in a corrupted state, the server may try to restore the job and get stuck. To resolve this, delete the snapshot from the location configured in project.yml for the snapshot_persistor storage (by default
abort_job should be able to stop the job on the server.