**************************
What's New in FLARE v2.8.0
**************************

NVIDIA FLARE 2.8.0 focuses on making production federated learning easier to operate across organizations, studies, and runtime environments. The release adds Docker and Kubernetes job launchers, a broader automation-friendly CLI, distributed provisioning, multi-study support, stronger observability, and additional production hardening. It also adds new examples and research bundles for multimodal, language-model, Docker, Kubernetes, and privacy-oriented federated learning workflows.

Release Highlights
==================

- **Modern NVFlare CLI**: expanded ``nvflare`` command groups for jobs, system operations, local config, startup kits, recipes, distributed provisioning, and deployment preparation, with JSON output and schema support so operators and automation systems can run FLARE workflows without relying on console-only behavior.
- **Distributed provisioning**: new ``nvflare cert`` and ``nvflare package`` workflows let participants keep private keys local while Project Admins approve certificate requests and generate signed packages, improving security ownership in cross-organization deployments.
- **Deployment prepare and runtime packaging**: new ``nvflare deploy prepare`` flow packages existing startup kits for Docker and Kubernetes runtimes, including Kubernetes environments on AWS, Azure, and GCP, so provisioning and runtime packaging can be handled as separate repeatable steps.
- **Docker and Kubernetes job launchers**: each site can configure a process, Docker, or Kubernetes job launcher. With the matching launcher configured, host-based jobs run as subprocesses, Docker-based jobs run as job containers, and Kubernetes-based jobs run as separate job pods, giving production sites Docker/Kubernetes isolation and resource handling plus study-scoped dataset mounts for stronger data isolation.
- **Multi-study support**: study definitions in ``project.yml``, study-scoped sessions, study-aware admin operations, and study CLI commands let one FLARE deployment host multiple collaborations without mixing participants, authorization, data access, or operational context.
- **Live log streaming**: site and job logs stream to the server while jobs are running, reducing time to diagnose remote training failures and making CLI automation more responsive.
- **Security and production hardening**: origin-bound auth tokens, safer archive handling, stricter private-key file permissions, safer loading paths, stronger job metadata validation, and additional dashboard/API hardening reduce common operational risk in federated deployments.
- **Feature election**: a new federated feature selection workflow lets clients perform local feature selection for tabular datasets and share feature scores, not raw data, so FLARE can aggregate a global feature mask for downstream training.
- **Tensor disk offload for FedAvg**: setting ``enable_tensor_disk_offload=True`` significantly reduces server peak memory during FedAvg aggregation. Instead of holding all client tensor updates in memory simultaneously, each update is written to a temporary safetensors file on disk and consumed lazily during aggregation. The benefit scales with model size and client count.
- **Large-model streaming reliability**: large tensor broadcasts are more robust when many clients retry after delayed EOF responses. Finished download refs are handled idempotently, and subprocess Client API jobs now reject unbounded result resends or missing download-completion waits that can turn one slow transfer into repeated large-model retries.
- **New examples and contributed research**: MedGemma, Qwen3-VL, Codon-FM, FedUMM, financial-services fraud detection, Docker job examples, distributed provisioning examples, Hello JAX, and Hello log streaming help teams start from working patterns instead of assembling production and research workflows from scratch.

NVFlare CLI and Automation
==========================

FLARE 2.8.0 significantly expands the public ``nvflare`` command-line surface. The CLI now has a more consistent command layout, machine-readable output support, and better error contracts for scripts and automation systems.

This matters for production operations because the same interfaces can now be used consistently by humans, shell scripts, service automation, and other tooling. Jobs, system status, startup-kit selection, recipes, provisioning, and deployment preparation can be queried in structured form instead of requiring manual console interaction.

The main additions are command groups for job operations, system operations, local configuration, recipe discovery, distributed provisioning, package assembly, and deployment preparation. Many commands now support structured output and schema discovery, making them easier to use in scripts, notebooks, and operational tooling.

For details, see :ref:`nvflare_cli`, :ref:`job_cli`, :ref:`system_command`, :ref:`config_command`, and :ref:`recipe_command`. For a hands-on CLI workflow, see the :github_nvflare_link:`NVFlare CLI tutorial `.

Deployment and Provisioning
===========================

Distributed Provisioning
------------------------

FLARE 2.8.0 introduces a distributed provisioning workflow for cases where participants generate local private keys and certificate-signing requests instead of receiving all startup-kit materials from a centralized provisioner.

This is important for cross-organization collaborations because private keys no longer need to be generated by the Project Admin or transferred between organizations.
Each participant can create and keep its own key material, reducing key-handling risk, while the Project Admin still controls approvals, signed packages, and root CA trust. Teams that prefer the existing centralized provisioning model can continue to use it.

The workflow adds participant-side certificate requests, Project Admin approval, signed package generation, root CA verification, and startup-kit assembly from approved packages. It is intended for deployments where key ownership and participant-controlled certificate requests are important.

See :ref:`distributed_provisioning`, :ref:`cert_command`, and :ref:`package_command`. For a runnable walkthrough, see the :github_nvflare_link:`distributed provisioning example `.

Deploy Prepare
--------------

The new ``nvflare deploy prepare`` command packages existing provisioned startup kits for runtime targets such as Docker and Kubernetes. This separates startup-kit generation from runtime-specific packaging, making deployments more repeatable across local, Docker, Kubernetes, and cloud-managed Kubernetes environments such as AWS, Azure, and GCP.

This separation is useful operationally because the same provisioned identities can be reused across runtime-specific packaging flows. Teams can prepare a startup kit once, then produce Docker or Kubernetes artifacts without changing the provisioning model.

See the :ref:`deploy_prepare_command` user guide for Docker and Kubernetes runtime preparation.

Docker and Kubernetes Job Execution
-----------------------------------

FLARE 2.8.0 adds Docker and Kubernetes job launchers so sites can align FLARE jobs with the runtime isolation and resource controls they already use. Each site must be configured with the matching job launcher for the intended runtime.
With that launcher configured, the pattern is:

- process job launcher for a host-based parent: jobs run as subprocesses;
- Docker job launcher for a Docker-based parent: jobs run as Docker containers;
- Kubernetes job launcher for a Kubernetes-based parent pod: jobs run as separate Kubernetes job pods.

This matters because Docker and Kubernetes deployments can now use their runtime isolation instead of treating every job as a local subprocess. Study-dataset mapping is also carried into containers and pods, so each job sees only the datasets configured for its study scope, reducing cross-study data exposure.

Highlights:

- Kubernetes deployments can launch jobs in separate pods when configured with the Kubernetes job launcher.
- Docker deployments can launch jobs as separate containers when configured with the Docker job launcher.
- Study-dataset mappings provide study-scoped data isolation for Docker containers and Kubernetes job pods.
- CPU, memory, storage, and GPU requirements can be delegated to Docker or Kubernetes resource handling.
- Kubernetes job workspace transfer no longer depends on a shared job PVC.
- Runtime packaging, Helm chart updates, Docker job examples, multicloud Kubernetes support, and Brev scripted deployment guides make these modes easier to try and operate across AWS, Azure, and GCP environments.

For deployment details, see the :ref:`deploy_prepare_command` user guide and the :ref:`helm_chart` Kubernetes deployment guide. Additional references include :ref:`containerized_deployment`, :ref:`brev_deployment`, and :ref:`brev_scripted_deployment`. For a runnable Docker workflow using ``nvflare deploy prepare``, see the :github_nvflare_link:`Docker job launcher example `.

Multi-Study and Runtime Operations
==================================

Multi-Study Support
-------------------

FLARE 2.8.0 adds study-aware deployment and administration support.
A single deployment can define multiple studies, each with its own participating sites and admin role mappings.

This is important for organizations that run more than one collaboration on the same FLARE infrastructure. Study scope keeps participant membership, authorization, admin sessions, and operational commands tied to the intended collaboration, reducing the risk of cross-study confusion or accidental access.

The feature is intended to help one deployment support multiple projects or consortia while keeping each study's participants, permissions, sessions, commands, and data access scoped to that study. Study support is available in administration workflows, CLI workflows, the FLARE API, production environments, and local PoC development.

See :ref:`multi_study_guide` for design and configuration details, and :ref:`study_command` for runtime management.

Live Log Streaming
------------------

FLARE can now stream job logs from clients to the server while the job is running. Operators can inspect logs through server-side files or CLI commands without waiting for the job to finish.

This shortens the feedback loop for production jobs, especially when training runs remotely or for a long time. Operators can follow failures, progress, and site-specific behavior while the run is active instead of waiting for final job artifacts.

Operators can retrieve or follow job logs through the CLI and control log streaming behavior at the site level. This is intended to make remote job debugging and production monitoring less dependent on manual access to each client machine.

See :ref:`live_log_streaming` and :ref:`site_config`. For a runnable job example, see :github_nvflare_link:`Hello log streaming `.

Recipes, APIs, and ML Features
==============================

2.8.0 continues the Recipe API and Client API direction from 2.7.x, with additional workflow coverage and production fixes.
These changes make recipe and API-based workflows easier to automate, monitor, and operate in study-aware environments.

Highlights include improved study-aware API behavior, better recipe run management, updated Flower integration, stronger FedAvg and PyTorch workflow handling, XGBoost and SVTPrivacy fixes, and Python support aligned to 3.10 through 3.14.

See :ref:`job_recipe`, :ref:`available_recipes`, :ref:`flare_api`, and :ref:`api_evolution`. For tutorial examples, see the :github_nvflare_link:`Hello FLARE API notebook ` and :github_nvflare_link:`Job Recipe notebook `.

Feature Election
----------------

FLARE 2.8.0 adds feature election, a federated feature selection workflow for tabular datasets. Clients perform local feature selection and share selected features and scores rather than raw data; FLARE aggregates the results into a global feature mask that can be used for downstream federated training.

For a runnable workflow, see the :github_nvflare_link:`feature election example `.

Large Models and LLM Workflows
==============================

FLARE 2.8.0 builds on the large-model work from 2.7.2 with additional tensor offload, run-scoped temp cleanup, improved timeout guidance for large transfers, and new example coverage. These improvements help large-model FL jobs operate under tighter memory and runtime constraints, while the new examples give teams concrete starting points for multimodal and language-model workloads.

Large-Model Streaming Reliability
---------------------------------

The streaming layer now treats late retries of normally finished download refs as idempotent terminal responses instead of fatal missing-ref errors. This addresses high-fanout large-model broadcasts where a client has completed a download but retries because the final EOF response was delayed by network or server-side contention.
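
As a rough illustration of idempotent handling for late retries, the toy registry below keeps a small marker for refs that finished normally and answers late requests with a terminal EOF, while refs removed by timeout or explicit deletion are reported as invalid. All class, method, and status names here are hypothetical and chosen only for this sketch; it is not the FLARE streaming implementation.

```python
from enum import Enum


class RefState(Enum):
    ACTIVE = "active"
    FINISHED = "finished"  # marker kept after a normal finish


class ToyDownloadRegistry:
    """Hypothetical registry used only to illustrate idempotent late retries."""

    def __init__(self):
        # ref_id -> (state, chunks); chunks is None once the ref has finished
        self._refs = {}

    def add(self, ref_id, chunks):
        self._refs[ref_id] = (RefState.ACTIVE, list(chunks))

    def finish(self, ref_id):
        # Normal completion: release the payload but remember that the ref existed.
        self._refs[ref_id] = (RefState.FINISHED, None)

    def delete(self, ref_id):
        # Timeout or explicit deletion: forget the ref entirely, no marker is kept.
        self._refs.pop(ref_id, None)

    def request(self, ref_id, index):
        entry = self._refs.get(ref_id)
        if entry is None:
            # Unknown or deleted ref: still an error.
            return ("invalid_ref", None)
        state, chunks = entry
        if state is RefState.FINISHED or index >= len(chunks):
            # A late retry of a finished ref gets the same terminal answer every time.
            return ("eof", None)
        return ("ok", chunks[index])
```

In this sketch a client that retries after a delayed EOF receives ``("eof", None)`` again rather than an error, so the retry does not turn into a failed or repeated transfer.
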
The fix applies at the ``DownloadService`` layer, so it benefits large payload transfers regardless of whether they come from FedAvg, Client API subprocess jobs, tensor disk offload, or another feature built on the same streaming path. Cleanup caused by transaction timeout or explicit deletion still returns an invalid-ref error; only normally finished transactions are tombstoned for late terminal retries.

Subprocess Client API jobs also validate risky retry settings earlier. In particular, ``max_resends=None`` is now rejected for ``ClientAPILauncherExecutor`` jobs because unlimited resends can create an unbounded sequence of large download transactions, and ``download_complete_timeout=None`` is rejected because the subprocess must stay alive while the server finishes pulling tensors from it. Jobs with explicitly configured large streaming request timeouts now receive warnings when related pipe/download-completion timeouts are shorter than the configured streaming timeout. Recipe-generated external-process jobs serialize the bounded ``max_resends=3`` default in executor args, and top-level ``recipe.add_client_config({"max_resends": N})`` overrides are applied before the subprocess Client API config is written.

Server Memory: Tensor Disk Offload
-----------------------------------

FLARE 2.8.0 introduces tensor disk offload for PyTorch FedAvg jobs, which significantly reduces peak server memory during aggregation. Instead of holding all client tensor updates in memory simultaneously, each update is written to a temporary safetensors file on disk and consumed lazily. The benefit scales with model size and client count.

A 5 GB model measurement shows that tensor disk offload keeps server peak memory nearly flat as the number of clients increases, while the in-memory aggregation path grows with the number of client updates.

.. image:: ../resources/server_peak_memory_disk_offload.png
   :alt: Server peak memory with tensor disk offload enabled and disabled
   :width: 80%

To enable, set ``enable_tensor_disk_offload=True`` on ``FedAvgRecipe`` or the ``FedAvg`` controller. In FLARE 2.8.0, this disk-backed tensor path is available for streamed PyTorch tensors in FedAvg workflows.

Deployment note: temporary files use the server process temp directory (``TMPDIR`` or the OS default such as ``/tmp``). In containers or Kubernetes, ``/tmp`` is often RAM-backed (``tmpfs``), which eliminates the memory-saving benefit; point ``TMPDIR`` to a disk-backed mount before starting the server. See :ref:`notes_on_large_models` for deployment guidance.

For configuration details, see :doc:`/programming_guide/tensor_downloader` and :doc:`/programming_guide/memory_management`. Corresponding examples include :github_nvflare_link:`BioNeMo `, :github_nvflare_link:`Qwen3-VL `, :github_nvflare_link:`MedGemma `, and :github_nvflare_link:`Codon-FM `.

Security and Hardening
======================

This release includes a broad set of security, validation, and operational hardening changes. The focus is reducing deployment risk in environments where jobs, startup kits, archives, credentials, and admin/API traffic cross organizational boundaries.

Key areas include stronger runtime authentication binding, safer archive and path validation, stricter private-key file permissions, safer deserialization and subprocess handling, confidential-computing attestation hardening, dashboard/API hardening, and clearer error behavior for admin and job operations.

Built-in statistics, PSI, NumPy model, and XGBoost model artifact writers now require configured output paths to be relative paths that remain inside the job-owned output directory. Jobs that previously used absolute paths or parent directory traversal for these writer outputs should move those artifacts after the job completes instead of writing outside the job workspace directly.
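
To make the output-path rule for artifact writers concrete, here is a minimal sketch of a containment check of this kind, written with the standard library only. The function name ``resolve_output_path`` and its arguments are hypothetical and are not taken from FLARE; the actual writers may validate paths differently.

```python
from pathlib import Path


def resolve_output_path(job_output_dir: str, configured_path: str) -> Path:
    """Return the resolved output path, or raise ValueError if it is not allowed.

    Hypothetical helper: accepts only relative paths that stay inside job_output_dir.
    """
    candidate = Path(configured_path)
    if candidate.is_absolute():
        raise ValueError(f"output path must be relative: {configured_path!r}")

    root = Path(job_output_dir).resolve()
    # Resolving collapses ".." segments (and symlinks), so traversal is detected below.
    resolved = (root / candidate).resolve()
    if not resolved.is_relative_to(root):
        raise ValueError(f"output path escapes the job output directory: {configured_path!r}")
    return resolved
```

With a check like this, a value such as ``stats/result.json`` is accepted, while ``/data/result.json`` and ``../result.json`` are rejected; a job that needs the artifact elsewhere would copy it after the run completes.
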
Custom ``JsonStatsFileWriter`` ``json_encoder_path`` values are now loaded only for BYOC jobs. The built-in ``ObjectEncoder`` path remains compatible for non-BYOC jobs.

Reliability and Bug Fixes
=========================

These changes improve day-to-day operability by making job state, startup, resource visibility, and failure reporting more predictable across local, Docker, Kubernetes, and server-connected workflows.

Notable improvements include more consistent job status publication, clearer errors for missing or running jobs, more reliable startup and log-streaming behavior, Docker and Kubernetes runtime fixes, better GPU visibility handling, cleaner client failure reporting, corrected paired-duration monitoring metrics when an end event is skipped, and refreshed integration-test and CI coverage.

New Examples and Research
=========================

2.8.0 adds or updates a wide range of examples and contributed research implementations. These assets matter because they turn new platform capabilities into runnable starting points for teams evaluating FLARE in concrete domains, including containerized operations, multimodal models, financial services, and privacy-oriented research.

Research updates in 2.8.0 include:

- :github_nvflare_link:`FedUMM `: a new federated learning implementation for unified multimodal models, using parameter-efficient LoRA adapter federation for multimodal foundation-model workflows.
- :github_nvflare_link:`financial-services fraud detection `: a new privacy-preserving federated fraud detection implementation with synthetic payment transaction generation, heterogeneous site configurations, federated analytics, federated training, interpretability, and differential privacy experimentation.
- Existing :github_nvflare_link:`FedBPT ` research was updated with a Job API entry point for running and exporting FLARE jobs.

Examples and research assets include:

- :github_nvflare_link:`Hello JAX `.
- :github_nvflare_link:`Hello log streaming `.
- :github_nvflare_link:`Docker job execution `.
- :github_nvflare_link:`distributed provisioning `.
- :github_nvflare_link:`feature election `.
- :github_nvflare_link:`MedGemma `.
- :github_nvflare_link:`Qwen3-VL `.
- :github_nvflare_link:`Codon-FM `.
- :github_nvflare_link:`FedUMM `.
- :github_nvflare_link:`financial-services fraud detection `.

Compatibility and Migration Notes
=================================

- Python 3.9 is no longer listed as a supported development target. FLARE 2.8.0 targets Python 3.10, 3.11, 3.12, 3.13, and 3.14.
- The deprecated FLAdminAPI surface has been removed. Use the FLARE API, Recipe environments, and ``nvflare`` CLI workflows for new automation.
- HA/Overseer code has been removed from the 2.8 branch.
- ``JsonStatsFileWriter`` now requires custom ``json_encoder_path`` values to run as BYOC jobs. Built-in ``ObjectEncoder`` configurations continue to work for non-BYOC jobs.

See the :ref:`migration_guide` for additional API and configuration migration notes.

Getting Started
===============

To explore the new 2.8.0 workflows:

- start with :ref:`quickstart` for a basic FLARE run.
- use :ref:`nvflare_cli` for the current CLI command surface.
- use :ref:`distributed_provisioning` for participant-managed certificates and signed startup-kit packaging.
- use :ref:`deploy_prepare_command` for Docker and Kubernetes runtime packaging.
- use :ref:`multi_study_guide` for multi-tenant deployment configuration.
- browse :ref:`available_recipes` and the new examples under ``examples/hello-world`` and ``examples/advanced``.