Programming Best Practices

Once you are familiar with the concepts in the Programming Guide, keep the following considerations in mind when designing and building your own NVIDIA FLARE applications.

Define your logic in a subclass of FLComponent

NVIDIA FLARE has a componentized architecture - business logic is implemented by components that interact with each other to get the job done. Most of your code, if not all, should be implemented as subclasses of the FLComponent class.

Being an FLComponent brings many benefits:

  • All such components are automatically event handlers

  • Convenience methods for logging and firing events

  • Error log streaming to server for centralized error management (if enabled)

  • Well-defined component lifecycle

Simple use of FLContext

FLContext supplies relevant data to your component’s processing methods. All methods of FLComponent have an fl_ctx input arg.

The fl_ctx object typically contains the following information:

  • The overall system settings (identity name, workspace, run-time args, run abort signal, etc.): All FLContext objects have these, and there are convenience methods for them.

  • Event data: When an event is fired, the event poster usually places some event specific data into the FLContext before firing the event. Event handlers can access such data in their handle_event() methods.

You could also use the fl_ctx to pass around information between your own functions, but make sure all these function calls happen within the same processing stack of your own component.

Do not assume that all components share the same fl_ctx, and do not use fl_ctx to pass information between different components. If you want to reliably pass a piece of information to others, use an event instead: you fire an event type to announce the information, and others receive the information by handling your event.

Agreement on Data Format and Content

FL is data-based collaboration among different parties (components running on server and clients). Having an accurate understanding of the shared data is a basic requirement. For example, the model weights generated by the trainer on the client must be consumable by the aggregator running on the server.

NVIDIA FLARE encourages data content to be self-sufficient in that it provides not only the content (e.g. values of model weights), but also complete contextual/meta information about the content (e.g. whether value is model weights or weight diffs, whether the value has been processed by any algorithm, and any processed properties).

To enable easy adoption of this strategy, NVIDIA FLARE provides a simple DXO API (Data eXchange Object) that you can use to describe the data you want to communicate between server and clients.
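The idea of self-describing data can be sketched in plain Python without the DXO API itself. In the sketch below, the class SelfDescribingPayload, the function apply_update, the kind strings, and the "processed_by" meta key are all hypothetical stand-ins chosen for illustration; they are not FLARE names. The point is only that the consumer decides how to use the data from the meta information that travels with it.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SelfDescribingPayload:
    """Hypothetical stand-in, not the DXO class: content plus meta information."""

    data_kind: str  # "WEIGHTS" or "WEIGHT_DIFF" in this sketch
    data: Dict[str, List[float]]
    meta: Dict[str, object] = field(default_factory=dict)


def apply_update(current: Dict[str, List[float]], payload: SelfDescribingPayload):
    """Consumer side: decide how to use the data from its declared kind."""
    if payload.data_kind == "WEIGHTS":
        return dict(payload.data)
    if payload.data_kind == "WEIGHT_DIFF":
        return {
            name: [w + d for w, d in zip(current[name], diff)]
            for name, diff in payload.data.items()
        }
    raise ValueError(f"unsupported data kind: {payload.data_kind}")


model = {"fc1": [1.0, 2.0]}
update = SelfDescribingPayload("WEIGHT_DIFF", {"fc1": [0.5, -1.0]}, {"processed_by": []})
new_model = apply_update(model, update)
```

Because the payload states whether it carries full weights or diffs, the consumer does not have to guess, and an unknown kind fails loudly instead of being silently misapplied.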

Use Filters to accommodate data differences

When two parties disagree on the data format, you could modify one of them to accommodate the other. This is doable but may not be the best solution, since it makes the code hard to reuse.

A better solution is to use filters. As long as both parties agree on the data content (not format), you can write a filter to convert the data from one format to another, so both parties can stay intact.

Filters can do a lot more than just convert data formats. You can apply any type of massaging to the data for the purpose of security. In fact, the privacy protection techniques that NVIDIA FLARE provides are all implemented as filters.
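As a rough illustration of format conversion, the sketch below uses a plain Python function rather than the FLARE filter API. The payload layout, the "format" key, and the function name nested_to_flat are hypothetical. The content (the weight values) is the same on both sides; only the format changes, so neither the producer nor the consumer needs to be modified.

```python
from typing import Dict, List

Payload = Dict[str, object]


def nested_to_flat(payload: Payload) -> Payload:
    """Convert {"layer": [[...], [...]]} weights into {"layer": [...]}.

    Hypothetical stand-in for a filter: the weight values are unchanged,
    only their layout differs.
    """
    weights: Dict[str, List[List[float]]] = payload["weights"]
    flat = {name: [v for row in rows for v in row] for name, rows in weights.items()}
    # Return a new payload so the producer's object is left untouched.
    return {**payload, "weights": flat, "format": "flat"}


producer_output = {"format": "nested", "weights": {"fc1": [[0.1, 0.2], [0.3, 0.4]]}}
consumer_input = nested_to_flat(producer_output)
```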

Be aware of the running environment

NVIDIA FLARE is a messaging system running in a multithreaded environment. Care must be taken when programming with NVIDIA FLARE to ensure smooth operation of the system.

  • Methods of a component may be running in different threads. Controller callbacks are running in threads that receive messages on a connection with the client.

  • The control_flow() method of the controller runs in a dedicated thread throughout the RUN.

  • In your own component, you can create more separate threads for your own design logic.

  • Events could be fired from any thread, and the handle_event() method runs within the same thread from which the event is fired.

  • All components are running within a RUN, and components are created and released at the beginning and end of the RUN.

Your component may have many methods. Keep in mind that these methods may be running in different threads at the same time!
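One common way to cope with this is to guard shared state with a lock. The sketch below is plain Python; ResultCollector and its methods are hypothetical names, not part of FLARE, and the threads stand in for the framework's message-receiving threads.

```python
import threading


class ResultCollector:
    """Hypothetical component whose methods may run in different threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._results = {}

    def on_result(self, client_name: str, value: float) -> None:
        # May be called concurrently from several message-receiving threads.
        with self._lock:
            self._results[client_name] = value

    def snapshot(self) -> dict:
        # Copy under the lock so the caller never sees a half-updated dict.
        with self._lock:
            return dict(self._results)


collector = ResultCollector()
threads = [
    threading.Thread(target=collector.on_result, args=(f"site-{i}", float(i)))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```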

Manage your component’s life cycle gracefully

During the life cycle of your component, things may happen that could interrupt the normal processing of the component. For example, the admin user or the workflow controller may ask to terminate the current RUN; or to terminate the execution of the current task on the client.

Since many components and threads could be involved in the RUN, it is important to ensure graceful exits of the components and threads, for the smooth operation of the FL system.

To manage these gracefully, follow these rules of thumb.

Respect the abort signal

An abort signal is a simple Python object that can be triggered to indicate that either the RUN or a task should be ended. There are two kinds of abort signals:

  • RUN abort signal. When triggered, it indicates that the current RUN is to be aborted. If you have long-running logic, you should check this signal frequently and exit when it is triggered.

  • Task Execution abort signal. This is the abort_signal passed to the execute() method of an executor (on the client side). If you write executors, make sure to check this signal frequently in the execute() method and return from the method when the signal is triggered.
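The checking pattern can be sketched as follows. StandInSignal is a hypothetical stand-in for the abort signal object, assumed here to expose a boolean triggered flag; train and its arguments are likewise illustrative. The line that sets the flag plays the role of the framework, since a component should never trigger the signal itself.

```python
import threading
import time


class StandInSignal:
    """Hypothetical stand-in for the abort signal; only the framework triggers it."""

    def __init__(self):
        self.triggered = False


def train(abort_signal, num_steps: int) -> str:
    """Long-running loop that checks the signal once per step."""
    for _ in range(num_steps):
        if abort_signal.triggered:
            return "aborted"  # return promptly instead of finishing the loop
        time.sleep(0.001)  # placeholder for one unit of real work
    return "completed"


signal = StandInSignal()
outcome = []
worker = threading.Thread(target=lambda: outcome.append(train(signal, 100_000)))
worker.start()
signal.triggered = True  # here the sketch plays the role of the framework
worker.join()
```

The finer the granularity of the check (per step rather than per epoch, for example), the sooner the method returns after the signal is triggered.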

Note

You can always get the RUN abort_signal from an FL context: fl_ctx.get_abort_signal(). You can also keep it with your component object when starting your component (i.e. when handling the EventType.START_RUN event).

Note

You cannot get the Task Abort Signal from an FL Context!

The triggering of abort signals is the responsibility of the framework. NEVER try to trigger them in your component!

If your component runs into a condition that definitely requires drastic actions, call self.system_panic() to abort the whole RUN, or self.task_panic() to end the task execution (in the Executor on client).

Handle EventType.ABORT_TASK event

What if you use a 3rd-party solution to run a long-running execution in your executor, and the 3rd-party solution does not take the abort_signal?

If the 3rd-party solution does not provide any way to interrupt/end its execution prematurely, then you are out of luck - the task cannot be aborted, and the system will just have to wait until the end of the execution. Though this will not cause any logical errors, it could slow down the FL system as a whole, because the client won’t be able to fetch the next task until the current task is done.

If the 3rd-party solution provides a method you can call from a different thread to end its execution, then you are lucky. In addition to triggering the task abort signal, NVIDIA FLARE also fires the special event EventType.ABORT_TASK! You just need to handle this event in your executor’s handle_event() method and call the 3rd-party provided method to stop the execution.
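A minimal sketch of this pattern, in plain Python, might look as follows. ThirdPartyJob, its run() and stop() methods, and MyExecutor are hypothetical stand-ins; the ABORT_TASK string is a placeholder for EventType.ABORT_TASK, and the handle_event() signature is simplified (the fl_ctx argument is left out).

```python
import threading


class ThirdPartyJob:
    """Hypothetical third-party runner: run() blocks until stop() is called."""

    def __init__(self):
        self._stop = threading.Event()

    def run(self) -> str:
        self._stop.wait()  # stands in for a long computation
        return "stopped early"

    def stop(self) -> None:
        self._stop.set()  # safe to call from another thread


ABORT_TASK = "ABORT_TASK"  # placeholder for EventType.ABORT_TASK


class MyExecutor:
    """Stand-in executor, not the FLARE Executor base class."""

    def __init__(self):
        self.job = ThirdPartyJob()

    def execute(self) -> str:
        return self.job.run()

    def handle_event(self, event_type: str) -> None:
        if event_type == ABORT_TASK:
            self.job.stop()


executor = MyExecutor()
result = []
task_thread = threading.Thread(target=lambda: result.append(executor.execute()))
task_thread.start()
executor.handle_event(ABORT_TASK)  # fired by the framework in a real system
task_thread.join(timeout=5)
```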

Avoid recursive events if possible

You can fire another event in your event handling logic, but try to avoid this if possible, since it can lead to infinite loops. If you have to do it, be very careful that it won’t cause cyclic event firing.

Note

The system has built-in protection against cyclic event firing - if the firing goes beyond a certain predefined depth, the firing will be stopped.

Tip

Firing an event from an event handler could be useful in some cases. For example, NVIDIA FLARE’s event system does not enforce any order when invoking handle_event() methods of event handlers, hence you cannot assume some handlers are invoked before some other handlers. If you must ensure handler A is invoked before B for an event type E, you can do it yourself: in A’s handle_event() method, you fire a new event E2 when handling E, and B only handles E2.
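The ordering trick can be sketched with a toy event bus in plain Python. Nothing here is FLARE API: fire_event, the handlers list, and the event names "E" and "E2" are illustrative only.

```python
from typing import Callable, List

calls: List[str] = []
handlers: List[Callable[[str], None]] = []


def fire_event(event_type: str) -> None:
    """Toy event bus: invokes every handler, in no guaranteed order."""
    for handler in list(handlers):
        handler(event_type)


def handler_b(event_type: str) -> None:
    if event_type == "E2":  # B ignores E and reacts only to E2
        calls.append("B")


def handler_a(event_type: str) -> None:
    if event_type == "E":
        calls.append("A")
        fire_event("E2")  # A announces that its own handling of E is done


# B is registered first on purpose; A's work still happens before B's.
handlers.extend([handler_b, handler_a])
fire_event("E")
```

Since neither handler reacts to E2 by firing another event, the chain stops after one extra level and cannot become cyclic.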

If you develop workflow controllers

Use a new Task object for each task

Do not reuse Task objects – always create a new Task object when scheduling tasks instead. Try to avoid using the same “task name” for different functions.

For maximum flexibility, make your task names configurable so you only need to define them in the JSON config file at the time of app creation.

Set reasonable timeouts for your tasks

In the control_flow() method of your controller, you will create tasks. Don’t forget to set a reasonable timeout value (the number of seconds allowed for the task to complete). The default value is 0, which means the task never times out. If you don’t specify a non-zero timeout, under certain circumstances, your task may just keep on waiting until clients start the RUN.

Callbacks should be short-lived

Since callbacks are running within a connection thread, they must return very quickly to release the underlying connection; otherwise the whole system could become frozen when all connections are used.
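One way to keep a callback short is to hand the work to a thread owned by your component. The sketch below is plain Python with a simplified, hypothetical callback signature (result_received_cb here takes a client name and a dict, which is not the FLARE callback signature); the queue and the worker are illustrative.

```python
import queue
import threading

work_queue: "queue.Queue" = queue.Queue()
processed = []


def result_received_cb(client_name: str, result: dict) -> None:
    """Hypothetical callback: enqueue the result and return immediately."""
    work_queue.put((client_name, result))


def worker() -> None:
    """Runs in the component's own thread and does the slow work."""
    while True:
        item = work_queue.get()
        if item is None:  # sentinel: stop the worker
            break
        client_name, result = item
        processed.append((client_name, sum(result["values"])))


worker_thread = threading.Thread(target=worker)
worker_thread.start()
result_received_cb("site-1", {"values": [1, 2, 3]})
result_received_cb("site-2", {"values": [4, 5]})
work_queue.put(None)
worker_thread.join()
```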

Task props usage

Once a task is created, do not modify the task.props dict directly (e.g. task.props = {"key": "value"}) since this dict is carefully managed by the framework.

Instead, always use set_prop():

task.set_prop("key", "value")

Task completion_status

If the task.completion_status is not None, this task will be treated as finished or errored-out. It will not be sent to any clients.

Easy Config of your components

To make it easy to configure your components in JSON, use only data types that JSON can express for init args: number, str, list, dict. Never require arbitrary Python objects as init args.

Nested Components?

Sometimes your component is composed of other parts (Python objects). How do you get your component created with all the parts instantiated?

Use component ID!

  • Parts should be configured in the “components” section of the config. You need to specify a unique ID for each part.

  • In the configuration of the parent component, specify the ID of the part as an init arg in the JSON configuration file.

  • When handling the EventType.START_RUN event, get the part object from the engine and keep it in the parent component:

    engine = fl_ctx.get_engine()
    self.my_part = engine.get_component(self.my_part_id)
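
The three steps above can be sketched end to end in plain Python. StandInEngine is a hypothetical stand-in for the engine's component lookup, and Scaler, Parent, on_start_run, and the ID "scaler" are illustrative names. Returning None for an unknown ID is an assumption of this sketch, which is why the parent checks the result.

```python
class StandInEngine:
    """Hypothetical stand-in for the engine's component registry."""

    def __init__(self, components: dict):
        self._components = components

    def get_component(self, component_id: str):
        return self._components.get(component_id)


class Scaler:
    """The part: configured on its own, with primitive init args only."""

    def __init__(self, factor: float):
        self.factor = factor


class Parent:
    """The parent takes the part's ID (a str), not the part object."""

    def __init__(self, my_part_id: str):
        self.my_part_id = my_part_id
        self.my_part = None

    def on_start_run(self, engine) -> None:
        # Mirrors the START_RUN handling: resolve the ID into an object.
        self.my_part = engine.get_component(self.my_part_id)
        if self.my_part is None:
            raise ValueError(f"no component with ID '{self.my_part_id}'")


engine = StandInEngine({"scaler": Scaler(factor=0.5)})
parent = Parent(my_part_id="scaler")
parent.on_start_run(engine)
```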
    

Manage your resources properly

The life span of your components is that of a RUN.

Your components may use resources such as memory, threads, and even processes. Make sure to manage them properly, following this pattern:

  • In __init__(), create your resources but do not start them.

  • In handling EventType.START_RUN, start the resources (e.g. open files, start threads, etc.).

  • In handling EventType.END_RUN, terminate the resources (e.g. close files, join threads and/or processes). Make sure to test the resources before terminating them (e.g. if self.thread.is_alive()).

Do not terminate your resources before EventType.END_RUN.
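The create/start/terminate pattern can be sketched as follows. Heartbeat is a hypothetical stand-in component, the START_RUN and END_RUN strings are placeholders for the EventType values, and handle_event() is simplified to take only the event type.

```python
import threading

START_RUN, END_RUN = "START_RUN", "END_RUN"  # placeholders for EventType values


class Heartbeat:
    """Stand-in component that owns one background thread."""

    def __init__(self):
        # Create the resources, but do not start them yet.
        self._stop = threading.Event()
        self.thread = threading.Thread(target=self._loop, daemon=True)
        self.beats = 0

    def _loop(self) -> None:
        # wait() returns False on timeout and True once stop is requested.
        while not self._stop.wait(timeout=0.01):
            self.beats += 1

    def handle_event(self, event_type: str) -> None:
        if event_type == START_RUN:
            self.thread.start()
        elif event_type == END_RUN:
            self._stop.set()
            if self.thread.is_alive():  # test before terminating
                self.thread.join()


component = Heartbeat()
started_in_init = component.thread.is_alive()
component.handle_event(START_RUN)
running = component.thread.is_alive()
component.handle_event(END_RUN)
```

The is_alive() check also covers the case where END_RUN arrives without a preceding START_RUN, in which joining a thread that was never started would raise an error.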

Promote Decoupled Component Interactions

Throughout the FL execution, many pieces of data are generated. What if you want to do something about that data? You may have the urge to modify the code that generated the data and insert your logic there to manipulate the data. Don’t! Use event handling instead.

NVIDIA FLARE’s event system is like a pub-sub mechanism. The publisher fires an event, and subscribers handle the event. The mechanism even works across network boundaries - a component on the server can post an event and components on the server and components on clients can handle the event; and vice versa.

Fire an event for the following cases:

  • Your component generates a piece of data that could be useful for other components. Name the event type something like “datatype_available”, where datatype specifies the type of the data.

  • Your component’s processing flow has come to a point that could be interesting to others. You should name event types like: “before_somepoint” and “after_somepoint”, where somepoint specifies the point.

To fire an event, call self.fire_event() or self.fire_fed_event().

What events should you fire? Well, it’s up to your imagination whether the event could be useful to others.

Event Handling

If you are interested in the data or timing of an event type that is fired by another component, you can handle that event type in the handle_event() method of your component.

Event data is contained in the FLContext object that is passed to the handle_event() method. Unless absolutely necessary, you should treat data in the FLContext as read-only.

Naming of Event Types, FLContext props, Shareable keys, and Headers

Use only letters, digits and underscores in these names. Do not prefix the names with underscores - such names are reserved for internal use by the NVIDIA FLARE framework.

The name of the fed event type should be prefixed with “fed.”.

Use Shareable Properly

Shareable is just a dictionary. However, observe these rules of thumb:

  • The headers element (ReservedHeaderKey.HEADERS) is reserved. Never overwrite it!

  • You can add additional props into the headers element, but never use reserved header names.

Do not share FLContext objects across different threads

To avoid potential race conditions that could compromise the data integrity of FLContext objects, avoid sharing the same FLContext across multiple threads. The framework guarantees that a new FLContext object is created for each messaging thread. If your component creates its own thread, that thread can create an FLContext object with engine.new_context().

Tip

The Engine object can be shared across threads.

Handle exceptions

Things can go wrong unexpectedly, even if there are no logical errors in your code. You should place your logic in the try block and handle exceptions gracefully. If not, the framework will catch the exception and may take actions that may not be what you wanted (e.g. stop the whole RUN).
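A minimal sketch of this advice in plain Python is shown below. The functions aggregate and process are hypothetical, and the standard logging module stands in for the component's own logging methods. The handler catches only the failures it knows how to deal with and reports a status instead of letting the exception escape.

```python
import logging

logger = logging.getLogger("sketch")


def aggregate(values):
    """Hypothetical processing step that can fail on bad input."""
    return sum(values) / len(values)


def process(values):
    """Return (ok, result); expected failures do not escape to the caller."""
    try:
        return True, aggregate(values)
    except (ZeroDivisionError, TypeError) as e:
        # Log the error type only, without echoing the raw input.
        logger.error("aggregation failed: %s", type(e).__name__)
        return False, None
```

With this shape, the caller can decide what a failure means (skip the contribution, retry, or panic) rather than leaving that decision to the framework.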

Panic or Not

To create solid components that work well in real-world apps, you are encouraged to do “defensive programming” - trust nothing and test everything, and cover all kinds of edge conditions. In doing so, you may come to face some extreme situations: what to do when a combination of conditions makes the whole thing just unworkable? For example, none of the Clients sends you meaningful training results. Should you just extend the number of rounds and see whether the clients will get better?

If you decide that the situation is truly unworkable, you can simply end the whole RUN by calling self.system_panic(). Similarly, if you run into a bad situation while executing a task (on client), you can end the task by calling self.task_panic().

The panic methods are effective regardless of the component type and of the thread they are called from.

Logging

Instead of directly using self.logger to log messages, always use the methods self.log_debug, self.log_info, self.log_warning, and self.log_error. These methods provide many benefits:

  • All log messages are prefixed with some contextual information (e.g. job id / run number, peer name, peer run number, etc.) that can help you understand the log messages more easily.

  • Some log messages may be integrated with other system features. For example, error messages may be reported via admin commands, and/or streamed over to a centralized location.

Be Sensitive to Data Privacy

Data is exchanged between the Server and Clients in the following situations:

  • Task execution

  • Fed events

  • Aux channel communication

  • Centralized error reporting

Be very careful about the data sent to peers! Make sure it never contains privacy-sensitive information. This also applies to error messages, since they may be sent to a centralized location.