What’s New

What’s New in FLARE v2.4.0

Usability Improvements

Client API

We introduce the new Client API, which streamlines the conversion of centralized deep learning code to federated code. Using the Client API requires only a few lines of code changes, without the need to restructure the code or implement a new class. Users can apply these small changes to their pre-existing centralized deep learning code to easily transform it into federated learning code. For PyTorch Lightning, we provide a tight integration that requires even fewer code changes. Furthermore, the Client API significantly reduces the need for users to delve into FLARE-specific concepts, helping to simplify the overall user experience.

Here is a brief example of a common pattern when using the Client API for a client trainer:

# import nvflare client API
import nvflare.client as flare

# initialize NVFlare client API
flare.init()

# run continuously when launching once
while flare.is_running():

  # receive FLModel from NVFlare
  input_model = flare.receive()

  # load the received model parameters into the local model
  net.load_state_dict(input_model.params)

  # perform local training and evaluation on received model
  {existing centralized deep learning code} ...

  # construct output FLModel
  output_model = flare.FLModel(
      params=net.cpu().state_dict(),
      metrics={"accuracy": accuracy},
      meta={"NUM_STEPS_CURRENT_ROUND": steps},
  )

  # send model back to NVFlare
  flare.send(output_model)
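
For PyTorch Lightning, the loop is even shorter. Below is a minimal sketch, assuming the nvflare.client.lightning integration described in the Client API documentation; model and datamodule stand in for your existing LightningModule and DataModule:

# import the NVFlare Lightning client API
import nvflare.client.lightning as flare
from pytorch_lightning import Trainer

# patch the Lightning trainer so fit()/validate() exchange FLModel with NVFlare
trainer = Trainer(max_epochs=1)
flare.patch(trainer)

while flare.is_running():
    # (optional) evaluate the received global model to obtain global metrics
    trainer.validate(model, datamodule=datamodule)

    # train locally on the received global model; the patched trainer
    # receives the model before fitting and sends the update back afterward
    trainer.fit(model, datamodule=datamodule)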

For more in-depth information on the Client API, refer to the Client API documentation and examples.

The 3rd-Party Integration Pattern

In certain scenarios, users face challenges when attempting to move the training logic to the FLARE client side due to pre-existing ML/DL training system infrastructure. In the 2.4.0 release, we introduce the Third-Party Integration Pattern, which allows the FLARE system and a third-party external training system to seamlessly exchange model parameters without requiring a tightly integrated system.

See the 3rd-Party System Integration documentation for more details.

Job Templates and CLI

The newly added Job Templates serve as pre-defined Job configurations designed to improve the process of creating and adjusting Job configurations. Using the new Job CLI, users can easily leverage existing Job Templates, modify them according to their needs, and generate new ones. Furthermore, the Job CLI offers users a convenient method for submitting jobs directly from the command line, without the need to start the Admin console.

nvflare job list_templates|create|submit|show_variables

Also explore the continuously growing Job Template directory we have created for commonly used configurations. For more in-depth information on Job Templates and the Job CLI, refer to the NVIDIA FLARE Job CLI documentation and tutorials.
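
As a rough sketch of the typical workflow (the exact flag names below are assumptions based on the Job CLI documentation; run nvflare job <subcommand> -h to confirm):

# list the available job templates
nvflare job list_templates

# create a job folder from a template, pointing it at existing training code
# (-j: job folder, -w: template name, -sd: script directory)
nvflare job create -j /tmp/nvflare/my_job -w sag_pt -sd ./code

# show the variables that can be customized in the generated configuration
nvflare job show_variables -j /tmp/nvflare/my_job

# submit the job directly from the command line
nvflare job submit -j /tmp/nvflare/my_job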

ModelLearner

The ModelLearner is introduced to simplify the user experience in cases requiring the Learner pattern. Users interact exclusively with the FLModel object, which includes weights, optimizer, metrics, and metadata, while FLARE-specific concepts remain hidden from users. The ModelLearner defines standard learning functions, such as train(), validate(), and submit_model(), which can be overridden in a subclass for easy adaptation.

See the Model Learner documentation and API definitions of ModelLearner and FLModel for more detail.
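
As an illustration, here is a minimal sketch of a ModelLearner subclass (the import paths and signatures follow the Model Learner documentation as we understand it; real training and evaluation code replaces the placeholders):

from nvflare.app_common.abstract.fl_model import FLModel, ParamsType
from nvflare.app_common.abstract.model_learner import ModelLearner


class MyModelLearner(ModelLearner):

    def train(self, model: FLModel) -> FLModel:
        # load model.params into the local network and run local training here
        updated_params = model.params  # placeholder for the locally trained weights
        return FLModel(params_type=ParamsType.FULL, params=updated_params, metrics={"accuracy": 0.0})

    def validate(self, model: FLModel) -> FLModel:
        # evaluate model.params on local data and return the metrics only
        return FLModel(metrics={"accuracy": 0.0})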

Step-by-Step Example Series

To help users quickly get started with FLARE, we’ve introduced a comprehensive step-by-step example series using Jupyter Notebooks. Unlike traditional examples, the step-by-step series uses only two datasets for consistency: CIFAR10 for image data and the HIGGS dataset for tabular data. Each example builds upon previous ones to showcase different features, workflows, or APIs, allowing users to gain a comprehensive understanding of FLARE functionalities.

CIFAR10 Examples:

  • image_stats: federated statistics (histograms) of CIFAR10.

  • sag: scatter and gather (SAG) workflow with PyTorch and the Client API.

  • sag_deploy_map: scatter and gather workflow with deploy_map configuration, for deployment of apps to different sites using the Client API.

  • sag_model_learner: scatter and gather workflow illustrating how to write client code using the ModelLearner.

  • sag_executor: scatter and gather workflow demonstrating how to write client-side executors.

  • sag_mlflow: MLflow experiment tracking logs with the Client API in scatter & gather workflows.

  • sag_he: homomorphic encryption using Client API and POC -he mode.

  • cse: cross-site evaluation using the Client API.

  • cyclic: cyclic weight transfer workflow with server-side controller.

  • cyclic_ccwf: client-controlled cyclic weight transfer workflow with client-side controller.

  • swarm: swarm learning and client-side cross-site evaluation with Client API.

HIGGS Examples:

  • tabular_stats: federated statistics (tabular histogram) calculation.

  • scikit_learn: federated linear model (logistic regression on binary classification) learning on tabular data.

  • sklearn_svm: federated SVM model learning on tabular data.

  • sklearn_kmeans: federated k-Means clustering on tabular data.

  • xgboost: federated horizontal xgboost learning on tabular data with bagging collaboration.

Streaming APIs

To support large language models (LLMs), the 2.4.0 release introduces the streaming API to facilitate the transfer of objects exceeding the 2 GB size limit imposed by gRPC. A new streaming layer designed to handle large objects divides a large model into 1 MB chunks and streams them to the target. Built-in streamers are provided for Objects, Bytes, Files, and Blobs, offering a versatile solution for efficient object streaming between different endpoints.

Refer to the nvflare.fuel.f3.stream_cell API for more details, and to the Large Models documentation for insights on working with large models in FLARE.

Expanding Federated Learning Workflows

In the 2.4.0 release, we introduce Client Controlled Workflows as an alternative to the existing server-side controlled workflows.

Server-side controlled workflow

  • The server is trusted by all clients to handle the training process, job management, and the final model weights

  • The server controller manages the job lifecycle (e.g., health of client sites, monitoring of job status)

  • The server controller manages the training process (e.g., task assignment, model initialization, aggregation, and obtaining the distributed final model)

Client-side controlled workflow

  • Clients do not trust the server to handle the training process. Instead, task assignment, model initialization, aggregation, and final model distribution are handled by the clients.

  • The server controller still manages the job lifecycle (e.g., health of client sites, monitoring of job status)

  • Secure Messaging: clients exchange peer-to-peer messages over TLS; the sender encrypts each message with an AES-256 key, which is in turn protected with the receiver’s public key obtained from the receiver’s certificate. Only the sender and the receiver can view the message. If there is no direct connection between clients and the message is routed via the server, the server is unable to decrypt it.

Three commonly used types of client-side controlled workflows are provided:

  • Cyclic Learning: the model is passed from client to client.

  • Swarm Learning: clients are randomly selected to act as the client-side controller and aggregators, and Scatter and Gather with FedAvg is then performed among the clients.

  • Cross Site Evaluation: allows clients to evaluate other sites’ models.

See swarm learning and client-controlled cyclic for examples using these client-controlled workflows.

MLflow and Weights & Biases Experiment Tracking Support

We have expanded our experiment tracking support to the MLflow and Weights & Biases systems. Detailed documentation on these features can be found in Experiment Tracking, and examples can be found at FL Experiment Tracking with MLflow and wandb.
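
For example, with the Client API, metrics can be streamed to the configured tracking system through the writers in nvflare.client.tracking (a brief sketch, assuming the MLflowWriter helper described in the experiment tracking documentation):

from nvflare.client.tracking import MLflowWriter

writer = MLflowWriter()

# log a metric from the local training loop; the value is routed through FLARE
# to the configured receiver (MLflow here) rather than to a local tracking server
writer.log_metric(key="train_loss", value=0.42, step=100)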

Configuration Enhancements

Multi Configuration File Formats

In the 2.4.0 release, we have added support for multiple configuration formats. Prior to this release, the sole configuration file format was JSON, which, although flexible, lacked useful features such as comments, variable substitution, and inheritance.

We added two new configuration formats:

  • Pyhocon - a HOCON (Human-Optimized Config Object Notation) parser for Python; HOCON is a JSON variant with many desired features

  • OmegaConf - a YAML-based hierarchical configuration system

Users have the flexibility to use a single format or combine several formats, as exemplified by config_fed_client.conf and config_fed_server.json. If multiple configuration formats coexist, then their usage will be prioritized based on the following search order: .json -> .conf -> .yml -> .yaml
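
For instance, a client configuration written in HOCON can use comments and variable substitution (an illustrative sketch only; some.package.MyExecutor is a placeholder, and the surrounding schema mirrors the existing JSON job configuration):

# config_fed_client.conf - HOCON allows comments and variable substitution
{
  format_version = 2

  # define a value once and reference it below
  app_script = "train.py"

  executors = [
    {
      tasks = ["train"]
      executor {
        path = "some.package.MyExecutor"   # placeholder component path
        args {
          script = ${app_script}           # HOCON variable substitution
        }
      }
    }
  ]
}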

Improved Job Configuration File Processing

  • Variable Resolution - for user-defined variable definitions and variable references in config files

  • Built-in System Variables - for pre-defined system variables available to use in config files

  • OS Environment Variables - OS environment variables can be referenced via the dollar sign

  • Parameterized Variable Definition - for creating configuration templates that can be reused and resolved into different concrete configurations

See more details in the Configuration Files documentation.

POC Command Upgrade

We have expanded the POC command to bring users one step closer to the real deployment process. The changes allow users to experiment with deployment options locally and use the same project.yaml file for both experimentation and production.

The POC command mode has been changed from “local, non-secure” to “local, secure, production” to better reflect the production environment simulation. Lastly, the POC command syntax is now more aligned with common CLI conventions: nvflare poc --<action> => nvflare poc <action>
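
A typical local workflow looks roughly like the following (a sketch; see the POC documentation for the full set of options):

# provision a local, secure POC environment with two clients
nvflare poc prepare -n 2

# start the services (server and clients)
nvflare poc start

# ... submit and run jobs, e.g. with "nvflare job submit" or the FLARE API ...

# stop the services and clean up the POC workspace when done
nvflare poc stop
nvflare poc clean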

See more details in the Proof Of Concept (POC) Command documentation or tutorial.

Security Enhancements

Unsafe component detection

Users can now define an unsafe component checker, which is invoked to validate each component before it is built. If the checker fails to validate a component, it raises an UnsafeJob exception, causing the job to be aborted.

For more details, refer to the Unsafe Component Detection documentation.

Event-based security plug-in

We have introduced additional FL events that can be used to build plug-ins for job-level function authorizations.

Refer to the Site-specific Authentication and Federated Job-level Authorization documentation, as well as the custom authentication example, for more details about these capabilities.

FL HUB: Hierarchical Unification Bridge

The FL HUB is a new experimental feature designed to support multiple FLARE systems working together in a hierarchical manner. In federated computing, the number of edge devices is usually large, often with just a single server, which can cause performance issues. A solution to this problem is a hierarchical FLARE system, where tiered FLARE systems connect to form a tree-like structure. Each group of leaf clients (edge devices) connects only to its own server, and that server in turn serves as a client of the FLARE system in the parent tier.

One potential use case is global studies, where client machines may be located across different regions. Rather than requiring all client machines across regions to connect to a single FL server, the FL HUB could enable a more performant tiered multi-server setup.

Learn more about the FL Hub in the Hierarchy Unification Bridge documentation and the code.

Misc. Features

  • FLARE API Parity

    • FLARE API now has the same set of APIs as the Admin Client.

    • Allows users to use almost all of the admin commands from the Python API or notebooks.

  • Docker Support

    • NVFLARE cloud CSP startup scripts now support deployment with docker containers in addition to VM deployment.

    • The provision command now supports detached docker run in addition to interactive docker run.

  • FLARE Dashboard

    • Prior to 2.4.0, the FLARE Dashboard could only run within a Docker container.

    • In 2.4.0, the FLARE Dashboard can now run locally without Docker for development.

  • Run Model Evaluation Without Training

  • Communication Enhancements

    • We added an application-layer ping between the Client Job process and the Server parent process to replace the gRPC timeout. Previously, we noticed that if the gRPC timeout was set too long, the cloud provider (e.g., Azure Cloud) would kill the connection after 4 minutes, while if it was set too short (such as 2 minutes), the underlying gRPC would report too many pings. The application-level ping avoids both issues and ensures the server/client is aware of the status of the processes.

    • FLARE provides two drivers for gRPC-based communication: the asyncio (AIO) and regular (non-AIO) versions of the gRPC library. One notable benefit of the AIO gRPC is its ability to handle many more concurrent connections on the server side. However, the AIO gRPC may crash under challenging network conditions on the client side, whereas the non-AIO gRPC is more stable. Hence, in FLARE 2.4.0, the default configuration uses the non-AIO gRPC library for better stability.

      • To change the driver selection, users can update comm_config.json in the local directory of the workspace and set the use_aio_grpc config variable, as in the sketch below.
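
For example, to switch to the AIO driver, a minimal comm_config.json might contain (assuming a boolean flag, as described above):

{
  "use_aio_grpc": true
}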

New Examples

Federated Large Language Model (LLM) examples

We’ve added several examples to demonstrate how to work with federated LLMs.

Vertical Federated XGBoost

With the 2.0 release of XGBoost, we are able to demonstrate a vertical XGBoost example. We use Private Set Intersection and XGBoost’s new federated learning support to perform classification on vertically split HIGGS data (where sites share overlapping data samples but contain different features).

Graph Neural Networks (GNNs)

We added two examples using GraphSAGE to demonstrate how to train federated GNNs on graph datasets using inductive learning.

Protein Classification: classifies protein roles based on their cellular functions from gene ontology. We use the PPI (protein-protein interaction) dataset, which is commonly used in graph-based machine learning tasks, especially in bioinformatics. The dataset represents interactions between proteins as graphs, where each graph corresponds to a specific human tissue, nodes represent proteins, and edges represent interactions between them.

Financial Transaction Classification: classifies whether a given transaction is licit or illicit. For this financial application, we use the Elliptic++ dataset, which consists of 203k Bitcoin transactions and 822k wallet addresses, to enable both the detection of fraudulent transactions and the detection of illicit addresses (actors) in the Bitcoin network by leveraging graph data. For more details, please refer to this paper.

Financial Application Examples

To demonstrate how to perform fraud detection in financial applications, we introduced an example illustrating how to use XGBoost in various ways to train a model in a federated manner on a finance dataset. It covers both vertical and horizontal federated learning with XGBoost, along with histogram-based and tree-based approaches.

KeyCloak Site Authentication Integration

FLARE is agnostic to the 3rd party authentication mechanism, and each client can have its own authentication system. We demonstrate FLARE’s support of site-specific authentication using KeyCloak. The KeyCloak Site Authentication Integration example is configured so the admin user will need additional user authentication to submit and run a job.

Migration to 2.4.0: Notes and Tips

FLARE 2.4.0 introduces a few API and behavior changes. This migration guide will help you migrate from a previous NVFLARE version to the current version.

Job Format: meta.json

In FLARE 2.4.0, users must have a meta.json configuration file defined in their jobs. Legacy app definitions should be updated to the job format to include a meta.json file with a deployment map and any number of app folders (containing config/ and custom/). Here is a basic job structure with a single app:

├── my_job
│   ├── app
│      ├── config
│         ├── config_client.json
│         └── config_server.json
│      └── custom
│   └── meta.json

Here is the default meta.json which can be edited accordingly:

{
  "name": "my_job",
  "resource_spec": {},
  "min_clients" : 2,
  "deploy_map": {
    "app": [
      "@ALL"
    ]
  }
}

FLARE API Parity

In FLARE 2.3.0, an initial version of the FLARE API was implemented as a redesigned FLAdminAPI; however, only a subset of the functions was included. In FLARE 2.4.0, the FLARE API has been enhanced to include the remaining functions of the FLAdminAPI so that the FLAdminAPI can be sunset.

See the Migrating to FLARE API from FLAdminAPI documentation for more details on the added functions.
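
As a brief sketch of FLARE API usage (the entry point and calls follow the FLARE API documentation; the paths shown are placeholders):

from nvflare.fuel.flare_api.flare_api import new_secure_session

# connect using the admin user's startup kit directory
sess = new_secure_session("admin@nvidia.com", "/workspace/example_project/prod_00/admin@nvidia.com")

print(sess.get_system_info())

# submit a job folder and monitor it until it finishes
job_id = sess.submit_job("/workspace/jobs/my_job")
sess.monitor_job(job_id)

sess.close()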

Timeout Handling

In the 2.4.0 release, improvements have been made to the timeout handling for commands that involve the Admin Server communicating with FL Clients and awaiting responses. Previously, a fixed global timeout value was used on the Admin Server; however, this value was sometimes not enough when a command took a long time (e.g., the cat server log.txt command may take a while to transfer a large log file). In this case, the user could use the set_timeout command to change the default timeout value of the Admin Server, but this command was global and affected all users: one user setting a very small timeout value could cause all users’ commands to fail.

To address this, the set_timeout command has been changed to be session-specific. Additionally, a new unset_timeout command has been added to revert the session to the Admin Server’s default timeout.
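
For example, within an admin console session:

> set_timeout 300
> cat server log.txt
> unset_timeout

Here set_timeout 300 raises the timeout to 300 seconds for this session only, and unset_timeout reverts the session to the Admin Server’s default.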

Changes to show_stats and show_errors

The old structure put the server’s result dict directly at the top level of the overall result dict, while each client’s result dict was placed as an item keyed on the client name. To make the server and client results consistent, the server’s result is now placed as an item keyed on “server”. If any code relies on the old return structure of the FLAdminAPI, please update it accordingly.

{
  "server": { # new "server" key for server result dict
    "ScatterAndGather": {
      "tasks": {
        "train": [
          "site-1",
          "site-2"
        ]
      },
      "phase": "train",
      "current_round": 2,
      "num_rounds": 50
    },
    "ServerRunner": {
      "job_id": "3ad5bdef-db12-4ffb-9362-0ff163973f7d",
      "status": "started",
      "workflow": "scatter_and_gather"
    }
  },
  "site-1": {
    "ClientRunner": {
      "job_id": "3ad5bdef-db12-4ffb-9362-0ff163973f7d",
      "current_task_name": "None",
      "status": "started"
    }
  },
  "site-2": {
    "ClientRunner": {
      "job_id": "3ad5bdef-db12-4ffb-9362-0ff163973f7d",
      "current_task_name": "train",
      "status": "started"
    }
  }
}

POC Command Upgrade

The POC command has been upgraded in 2.4.0:

  • Removed the -- prefix for action commands; actions are now subcommands

  • POC now uses “production mode”; the admin user name is now “admin@nvidia.com” instead of “admin” from previous releases.

  • New -d (Docker) and -he (homomorphic encryption) options

  • nvflare poc prepare generates .nvflare/config.conf to store the location of the POC workspace, which takes precedence over the NVFLARE_POC_WORKSPACE environment variable

  • In the previous version, the startup kits were located directly under the default POC workspace at /tmp/nvflare/poc. In 2.4.0, the startup kits are now under /tmp/nvflare/poc/example_project/prod_00/ to follow the default structure of production provisioning.

  • Multi-org and multi-role support

nvflare poc -h
usage: nvflare poc [-h] [--prepare] [--start] [--stop] [--clean] {prepare,prepare-jobs-dir,start,stop,clean} ...

optional arguments:
  -h, --help            show this help message and exit
  --prepare             deprecated, suggest use 'nvflare poc prepare'
  --start               deprecated, suggest use 'nvflare poc start'
  --stop                deprecated, suggest use 'nvflare poc stop'
  --clean               deprecated, suggest use 'nvflare poc clean'

poc:
  {prepare,prepare-jobs-dir,start,stop,clean}
                        poc subcommand
    prepare             prepare poc environment by provisioning local project
    prepare-jobs-dir    prepare jobs directory
    start               start services in poc mode
    stop                stop services in poc mode
    clean               clean up poc workspace

Refer to Proof Of Concept (POC) Command for more details.

Secure Messaging

A new secure argument has been added to send_aux_request() in ServerEngineSpec and ClientEngineExecutorSpec.

secure is an optional boolean that determines whether the aux request should be sent in a secure way. One such use case is secure peer-to-peer messaging, such as in the client-controlled workflows.

@abstractmethod
def send_aux_request(
    self,
    targets: Union[None, str, List[str]],
    topic: str,
    request: Shareable,
    timeout: float,
    fl_ctx: FLContext,
    optional=False,
    secure: bool = False,
) -> dict:
    """Send a request to Server via the aux channel.

    Implementation: simply calls the ClientAuxRunner's send_aux_request method.

    Args:
        targets: aux messages targets. None or empty list means the server.
        topic: topic of the request
        request: request to be sent
        timeout: number of secs to wait for replies. 0 means fire-and-forget.
        fl_ctx: FL context
        optional: whether the request is optional
        secure: whether the request should be sent in a secure way

    Returns:
        a dict of reply Shareable in the format of:
            { site_name: reply_shareable }
    """
    pass

Stats Result Format

In StatisticsController, the result dictionary format originally concatenated “site” and “dataset” into a single key to support visualization. In 2.4.0, this has been changed so that “site” and “dataset” have their own keys in the result dictionary.

result = {feature: {statistic: {site-dataset: value}}}

to

result = {feature: {statistic: {site: {dataset: value}}}}
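
For example, code that reads the results would change along these lines (the feature, statistic, site, and dataset names are hypothetical):

# pre-2.4.0: site and dataset were concatenated into a single key
old_value = result["Age"]["count"]["site-1-train"]

# 2.4.0: site and dataset are separate keys
new_value = result["Age"]["count"]["site-1"]["train"]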

To continue to support the visualization needs, the site-dataset concatenation logic has instead been moved to Visualization.

Previous Releases of FLARE

Also refer to the NVFlare GitHub releases to see minor release notes for RC versions.