NVIDIA FLARE System Architecture

Figures: FLARE Architecture Overview; FLARE Job Processing Architecture

This document describes the overall system architecture of NVIDIA FLARE, including its layered structure, major subsystems, and how they interact. It covers the runtime components on both server and client sides, the communication framework, and the process model.

The FLARE architecture (shown above) comprises three main layers:

  • Foundation Layer - Communication infrastructure, messaging protocols, privacy preservation tools, and secure platform management.

  • Application Layer - Building blocks for federated learning, including federation workflows and learning algorithms.

  • Tooling - FL Simulator and POC CLI for experimentation and simulation, plus deployment and management tools for production workflows.

Core Components and Code Structure

Primary System Modules

Process Responsibilities

Server Parent (SP)

  • Runs FederatedServer

  • Manages client registration and heartbeat monitoring

  • Houses ServerEngine which orchestrates job scheduling via JobRunner

  • Spawns a Server Job (SJ) for each active job, as a process, Docker container, or pod depending on the configured job launcher
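
The process-per-job model can be sketched with the Python standard library alone. This is an illustration only: `launch_job_process`, `run_jobs`, and the inline child script are hypothetical names and do not correspond to FLARE's job launcher API, which may start a container or pod instead of a local subprocess.

```python
import subprocess
import sys

# Hypothetical stand-in for a job's entry point; a real job process would run a runner.
CHILD_CODE = "import os, sys; print(f'job {sys.argv[1]} running in pid {os.getpid()}')"


def launch_job_process(job_id: str) -> subprocess.Popen:
    """Start one child process dedicated to a single job (illustrative only)."""
    return subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE, job_id],
        stdout=subprocess.PIPE,
        text=True,
    )


def run_jobs(job_ids):
    """Launch one process per job, then collect each job's exit code and output."""
    procs = {job_id: launch_job_process(job_id) for job_id in job_ids}
    results = {}
    for job_id, proc in procs.items():
        out, _ = proc.communicate(timeout=30)
        results[job_id] = (proc.returncode, out.strip())
    return results


if __name__ == "__main__":
    for job_id, (code, out) in run_jobs(["job_a", "job_b"]).items():
        print(job_id, code, out)
```

Because every job gets its own process, a crash in one job's process does not take down the parent or the other jobs, which is the isolation property the parent/job split is designed for.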

Server Job (SJ)

  • Runs ServerRunner

  • Executes workflow Controllers (e.g., ScatterAndGather)

  • Broadcasts tasks to client jobs and aggregates results

  • Separate process per job for isolation
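
The broadcast-then-aggregate pattern named above can be illustrated with a toy round written in plain Python. It does not use FLARE's controller classes; `scatter_and_gather` and `make_client` are hypothetical helpers, and the unweighted mean is an assumption chosen for brevity rather than the aggregation any particular workflow uses.

```python
from typing import Callable, Dict, List

Weights = List[float]


def scatter_and_gather(
    global_weights: Weights,
    clients: Dict[str, Callable[[Weights], Weights]],
) -> Weights:
    """One round: send the same task to every client, then average the replies."""
    if not clients:
        raise ValueError("at least one client is required")
    # Scatter: each client receives its own copy of the current global weights.
    replies = [train(list(global_weights)) for train in clients.values()]
    # Gather: aggregate with an unweighted element-wise mean.
    n = len(replies)
    return [sum(values) / n for values in zip(*replies)]


def make_client(shift: float) -> Callable[[Weights], Weights]:
    """Hypothetical client whose 'training' adds a fixed shift to every weight."""
    return lambda weights: [w + shift for w in weights]


if __name__ == "__main__":
    clients = {"site-1": make_client(1.0), "site-2": make_client(3.0)}
    print(scatter_and_gather([0.0, 10.0], clients))
```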

Client Parent (CP)

  • Runs FederatedClient

  • Manages client registration with server

  • Houses ClientEngine which coordinates job execution

  • Spawns a Client Job (CJ) for each assigned job, as a process, Docker container, or pod depending on the configured job launcher

Client Job (CJ)

  • Runs ClientRunner

  • Pulls tasks from server via Cell network

  • Launches training processes using JobExecutor

  • Routes task data to/from training process via Pipe

Training Process

  • User’s ML training script

  • Uses Client API: flare.init(), flare.receive(), flare.send()

  • Communicates with CJ via FilePipe (file-based) or CellPipe (network-based)
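
The shape of a training script built around the three Client API calls can be sketched as follows. To keep the example runnable without FLARE installed, it uses `StubFlare`, a hypothetical in-memory stand-in: only the call names `init`, `receive`, and `send` come from the text above, while the stub's internals, the dictionary model format, and the "no more tasks" signal are assumptions.

```python
from collections import deque


class StubFlare:
    """Hypothetical in-memory stand-in for the Client API (not the real module)."""

    def __init__(self, tasks):
        self._tasks = deque(tasks)
        self.sent = []
        self.initialized = False

    def init(self):
        self.initialized = True

    def receive(self):
        # Return the next task's model, or None when no tasks remain.
        return self._tasks.popleft() if self._tasks else None

    def send(self, result):
        self.sent.append(result)


def local_train(params):
    """Placeholder for the user's training step: nudges every parameter by 0.5."""
    return {name: value + 0.5 for name, value in params.items()}


def training_loop(flare):
    """Receive a model, train locally, send the result back; repeat until done."""
    flare.init()
    while True:
        model = flare.receive()
        if model is None:
            break
        flare.send(local_train(model))


if __name__ == "__main__":
    flare = StubFlare([{"w": 1.0}, {"w": 2.0}])
    training_loop(flare)
    print(flare.sent)
```

In a real deployment the receive and send calls would travel through the Pipe to the CJ process rather than through an in-memory queue.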

Communication Mechanisms

Cell Network: All parent and job processes communicate via F3 Cell objects that provide:

  • FQCN (Fully Qualified Cell Name) addressing (e.g., server.job_123)

  • Channel-based routing (SERVER_MAIN, CLIENT_MAIN, AUX_COMMUNICATION)

  • Secure, encrypted messaging with authentication

  • Streaming support for large data transfers
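
A small sketch may help make FQCN addressing concrete. It assumes, based only on the `server.job_123` example above, that an FQCN is a dot-separated path in which each prefix names a parent cell. The helpers `split_fqcn`, `parent_of`, and `is_ancestor` are hypothetical and are not part of the F3 API.

```python
from typing import List, Optional


def split_fqcn(fqcn: str) -> List[str]:
    """Split a dot-separated cell name into segments, rejecting empty ones."""
    parts = fqcn.split(".")
    if not all(parts):
        raise ValueError(f"invalid FQCN: {fqcn!r}")
    return parts


def parent_of(fqcn: str) -> Optional[str]:
    """Return the name of the enclosing cell, or None for a top-level cell."""
    parts = split_fqcn(fqcn)
    return ".".join(parts[:-1]) if len(parts) > 1 else None


def is_ancestor(ancestor: str, descendant: str) -> bool:
    """True if `ancestor` is a strict prefix of `descendant`, segment by segment."""
    a, d = split_fqcn(ancestor), split_fqcn(descendant)
    return len(a) < len(d) and d[: len(a)] == a


if __name__ == "__main__":
    print(parent_of("server.job_123"))
    print(is_ancestor("server", "server.job_123"))
```

Comparing whole segments rather than raw string prefixes avoids treating a cell named `server` as an ancestor of an unrelated cell such as `server2.job_1`.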

Pipe Abstraction: CJ-to-training-process communication uses the Pipe interface:

  • FilePipe: File system-based IPC for same-machine processes

  • CellPipe: Network-based IPC that allows the training process to run on a different machine
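
To show what file-system-based IPC can look like in principle, here is a minimal one-directional pipe built on a shared directory. `ToyFilePipe` is a hypothetical class written for this illustration; it is not FLARE's FilePipe, and its file naming scheme, JSON encoding, and write-then-rename step are assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional, Tuple


class ToyFilePipe:
    """Hypothetical file-based message pipe; not FLARE's FilePipe implementation."""

    def __init__(self, root: Path):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self._next_id = 0

    def send(self, topic: str, data: Any) -> None:
        """Write one message as a JSON file (topic must not contain a dot)."""
        final = self.root / f"{self._next_id:06d}.{topic}.json"
        tmp = final.with_suffix(".tmp")
        # Write under a temporary name, then rename, so readers never see partial files.
        tmp.write_text(json.dumps(data))
        tmp.rename(final)
        self._next_id += 1

    def receive(self) -> Optional[Tuple[str, Any]]:
        """Consume the oldest message, or return None if the directory is empty."""
        files = sorted(self.root.glob("*.json"))
        if not files:
            return None
        path = files[0]
        topic = path.name.split(".")[1]
        data = json.loads(path.read_text())
        path.unlink()
        return topic, data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared_dir:
        writer = ToyFilePipe(Path(shared_dir))  # e.g. the job-process side
        reader = ToyFilePipe(Path(shared_dir))  # e.g. the training-process side
        writer.send("task", {"round": 1})
        print(reader.receive())
        print(reader.receive())
```

A network-based pipe would replace the shared directory with a connection, which is what lets the two endpoints live on different machines.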

Deployment Modes

NVFLARE provides three deployment modes that share the same core runtime but differ in packaging, security, and deployment complexity. This design ensures consistency from development to production.

Deployment Modes Comparison

Mode        Use Case                                         Security      Processes                                           Setup Time
----------  -----------------------------------------------  ------------  --------------------------------------------------  -------------------------
Simulator   Rapid prototyping, algorithm testing             None          Multiple threads; multiple processes in some cases  Seconds
POC         Local multi-client testing, workflow validation  Optional      Multiple processes on one machine                   Minutes
Production  Real-world deployment                            Full PKI/TLS  Distributed processes across machines               Hours (with provisioning)

Core FL Runtime

The Core FL Runtime is the execution engine that manages federated learning job processes and orchestration. This page documents the runtime components responsible for process lifecycle management, task coordination, and execution modes.

Scope and Components

The Core FL Runtime consists of:

  • ServerEngine : Server-side process orchestration and job lifecycle management

  • ClientEngine : Client-side process management and communication handling

  • JobRunner : Job scheduling, deployment, and monitoring

  • SimulatorRunner : Single-machine simulation for development

Process Types

Process Type  Code Symbol                Description
------------  -------------------------  ------------------------------------------
SP            ProcessType.SERVER_PARENT  Server parent process running ServerEngine
SJ            ProcessType.SERVER_JOB     Server job process running ServerRunner
CP            ProcessType.CLIENT_PARENT  Client parent process running ClientEngine
CJ            ProcessType.CLIENT_JOB     Client job process running ClientRunner
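
The four process types listed above map naturally onto an enumeration. The sketch below is a local mirror for illustration and is not imported from FLARE: the member names and the component each process runs come from this section, while the string values and the helpers `is_job_process` and `parent_type` are assumptions.

```python
from enum import Enum
from typing import Optional


class ProcessType(Enum):
    """Local mirror of the four process types; the string values are assumptions."""

    SERVER_PARENT = "SP"
    SERVER_JOB = "SJ"
    CLIENT_PARENT = "CP"
    CLIENT_JOB = "CJ"


# Main component each process type runs, as listed in this section.
MAIN_COMPONENT = {
    ProcessType.SERVER_PARENT: "ServerEngine",
    ProcessType.SERVER_JOB: "ServerRunner",
    ProcessType.CLIENT_PARENT: "ClientEngine",
    ProcessType.CLIENT_JOB: "ClientRunner",
}

# Which parent process spawns each job process.
_PARENT = {
    ProcessType.SERVER_JOB: ProcessType.SERVER_PARENT,
    ProcessType.CLIENT_JOB: ProcessType.CLIENT_PARENT,
}


def is_job_process(ptype: ProcessType) -> bool:
    """True for the per-job processes (SJ and CJ)."""
    return ptype in _PARENT


def parent_type(ptype: ProcessType) -> Optional[ProcessType]:
    """Return the spawning parent's type, or None for a parent process."""
    return _PARENT.get(ptype)


if __name__ == "__main__":
    for ptype in ProcessType:
        print(ptype.value, MAIN_COMPONENT[ptype], is_job_process(ptype))
```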

Inter-Process Communication

The runtime uses Cell-based communication between parent and job processes.

Cell Communication Channels

Channel                             Purpose                       Used By
----------------------------------  ----------------------------  -------------
CellChannel.SERVER_MAIN             Client-to-server FL messages  CP to SP
CellChannel.CLIENT_MAIN             Server-to-client FL messages  SP to CP
CellChannel.SERVER_COMMAND          Commands to server job        SP to SJ
CellChannel.CLIENT_COMMAND          Commands to client job        CP to CJ
CellChannel.SERVER_PARENT_LISTENER  Parent commands from SJ       SJ to SP
CellChannel.AUX_COMMUNICATION       Auxiliary messages            All processes
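
The channel assignments listed above can be read as a lookup from a (sender, receiver) pair to the channels that pair may use. The sketch below encodes that reading as plain data. `CHANNEL_ROUTES` and `channels_for` are hypothetical names, the channel names are written as strings rather than taken from FLARE's `CellChannel`, and treating "All processes" as a wildcard is an assumption.

```python
from typing import Dict, List, Optional, Tuple

# (sender, receiver) per channel as listed above; None means "any process".
CHANNEL_ROUTES: Dict[str, Tuple[Optional[str], Optional[str]]] = {
    "SERVER_MAIN": ("CP", "SP"),
    "CLIENT_MAIN": ("SP", "CP"),
    "SERVER_COMMAND": ("SP", "SJ"),
    "CLIENT_COMMAND": ("CP", "CJ"),
    "SERVER_PARENT_LISTENER": ("SJ", "SP"),
    "AUX_COMMUNICATION": (None, None),
}


def channels_for(sender: str, receiver: str) -> List[str]:
    """Return every channel whose listed route matches the given process pair."""
    matches = []
    for channel, (src, dst) in CHANNEL_ROUTES.items():
        if src in (None, sender) and dst in (None, receiver):
            matches.append(channel)
    return matches


if __name__ == "__main__":
    print(channels_for("CP", "SP"))
    print(channels_for("CJ", "SJ"))
```

Under this reading, a pair with no dedicated channel, such as CJ to SJ, can still exchange messages over the auxiliary channel.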

JobRunner Architecture

JobRunner Component Structure

Figure: FLARE Job Runner Architecture

Communication Framework

Purpose and Scope

The Communication Framework, also known as F3 (FLARE Foundation Framework) and Cellnet, provides the foundational messaging infrastructure for all communication in NVIDIA FLARE. It implements a secure, scalable, and feature-rich messaging layer that handles all interactions between servers, clients, and administrative components.

This section provides an overview of the communication framework architecture, core components, and basic concepts.

  • CellNet Architecture - Detailed architecture and design patterns

  • Cell Communication Patterns - Message sending patterns and channel routing

  • Streaming and Data Transfer - Large data transfer and streaming protocols

  • Security and Encryption - Certificate management and message encryption

For more details, please refer to CellNet Architecture.

Security Architecture

Please refer to NVIDIA FLARE Security Overview for the security architecture.