Containerized Deployment

Containerized Deployment with Docker

Docker support has two common uses:

Run FLARE inside a generic development container for simulation, notebooks, POC mode, or manual experiments. In this pattern, Docker provides the Python environment and filesystem, while FLARE processes still run normally inside that one container.
Prepare provisioned startup kits for Docker runtime execution. This is the recommended path when FLARE parent server/client processes should run in Docker and launch server/client jobs as separate Docker containers.

The current Docker runtime workflow is:

Build a parent image with NVFlare installed.
Build a job image with NVFlare and workload dependencies installed.
Provision server, client, and admin startup kits.
Run nvflare deploy prepare (see Deploy Command) on each server or client startup kit that should run in Docker.
Start prepared server/client kits with startup/start_docker.sh.
Submit jobs whose meta.json contains Docker settings under launcher_spec (see Launcher-Specific Execution Settings).

For a runnable end-to-end workflow, see the Docker job launcher example.

Prerequisites

Before starting with containerized deployment, ensure you have:

Docker installed on your system
NVIDIA Container Toolkit installed for GPU support
A system that meets the requirements in the NVIDIA Container Toolkit Install Guide
Provisioned server/client startup kits when preparing a Docker runtime deployment

Running NVIDIA FLARE in a Docker container provides several benefits:

Consistent environment across different systems
Easy dependency management
Repeatable runtime preparation
Isolated execution environment
GPU support through NVIDIA Container Toolkit

Parent and Job Images

The Docker runtime separates parent containers from job containers:

The parent image runs the long-lived FLARE server process or client process from a prepared startup kit.
The job image runs server job and client job processes launched by DockerJobLauncher.

The parent image is configured in the runtime docker.yaml used by nvflare deploy prepare. The job image is configured in the submitted job’s meta.json under launcher_spec. You can use the same image for both roles, but keeping them separate is often cleaner: the parent image needs the FLARE runtime and Docker SDK access, while the job image needs the workload frameworks and training dependencies.

Docker Runtime Workflow

Build or publish the images that your sites will use. The runnable Docker job launcher example builds an example parent image, nvflare-site:latest, and an example job image, nvflare-job:latest:

cd examples/docker
bash build_docker.sh

Provision the project:

nvflare provision -p project.yml

Create a Docker runtime config for nvflare deploy prepare and save it as docker.yaml:

runtime: docker

parent:
  docker_image: nvflare-site:latest
  network: nvflare-network

job_launcher:
  default_python_path: /usr/local/bin/python
  default_job_env:
    NCCL_P2P_DISABLE: "1"
  default_job_container_kwargs:
    shm_size: 8g
    ipc_mode: host

Prepare each server or client startup kit that should run in Docker:

nvflare deploy prepare workspace/<project>/prod_00/server \
  --config docker.yaml \
  --output workspace/<project>/prepared/server

nvflare deploy prepare workspace/<project>/prod_00/site-1 \
  --config docker.yaml \
  --output workspace/<project>/prepared/site-1

The prepared kits contain startup/start_docker.sh, patched launcher configuration, and a local/study_data.yaml template. Admin startup kits are not prepared because they do not run parent server or client processes.
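With more than a couple of sites, the prepare step can be scripted. The following is a minimal sketch, not part of FLARE: it only assembles the same nvflare deploy prepare command lines shown above, using the --config and --output options from this page. The helper names and the example_project directory are hypothetical placeholders.

```python
import shlex
from pathlib import Path


def prepare_command(kit_dir: Path, config: Path, output_dir: Path) -> list[str]:
    """Build the argv for one `nvflare deploy prepare` call (hypothetical helper)."""
    return [
        "nvflare", "deploy", "prepare", str(kit_dir),
        "--config", str(config),
        "--output", str(output_dir),
    ]


def prepare_commands(prod_dir: Path, prepared_dir: Path, config: Path,
                     kits: list[str]) -> list[list[str]]:
    """One command per server/client kit. Admin kits are not prepared,
    so the caller should leave them out of `kits`."""
    return [
        prepare_command(prod_dir / kit, config, prepared_dir / kit)
        for kit in kits
    ]


if __name__ == "__main__":
    # "example_project" is a placeholder for your provisioned project name.
    root = Path("workspace") / "example_project"
    commands = prepare_commands(
        root / "prod_00", root / "prepared", Path("docker.yaml"),
        ["server", "site-1"],
    )
    for cmd in commands:
        # Print only; to execute, pass cmd to subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```

The sketch prints the commands rather than running them, so the output can be reviewed before anything is written to the prepared directory.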

Start prepared parent processes with:

cd workspace/<project>/prepared/server
bash startup/start_docker.sh

Run the same command from each prepared client kit. The generated script creates the configured Docker network if needed and mounts the prepared kit into the parent container.

Jobs submitted to Docker-mode sites must specify their job image in launcher_spec:

{
  "launcher_spec": {
    "default": {
      "docker": {"image": "nvflare-job:latest"}
    },
    "site-1": {
      "docker": {"shm_size": "8g", "ipc_mode": "host"}
    }
  },
  "resource_spec": {
    "site-1": {"num_of_gpus": 1}
  }
}

Use launcher_spec for launcher-specific image and container settings. Keep scheduler-facing resource requests, such as num_of_gpus, in resource_spec. Sites that are not configured with the Docker job launcher continue to use their configured launcher, usually process mode.
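Before submitting, it can help to check that every Docker-mode site ends up with a job image. The sketch below is an illustration only: it assumes the lookup order is the site entry first and then the default entry, and the helper names are hypothetical. It is not the launcher's own resolution code, which may apply further rules.

```python
import json
from typing import Optional

# Same structure as the meta.json fragment above.
META = json.loads("""
{
  "launcher_spec": {
    "default": {"docker": {"image": "nvflare-job:latest"}},
    "site-1": {"docker": {"shm_size": "8g", "ipc_mode": "host"}}
  },
  "resource_spec": {"site-1": {"num_of_gpus": 1}}
}
""")


def job_image(meta: dict, site: str) -> Optional[str]:
    """Hypothetical helper: return the site's image, else the default image.

    Assumption: the site entry is consulted before "default".
    """
    spec = meta.get("launcher_spec", {})
    for key in (site, "default"):
        image = spec.get(key, {}).get("docker", {}).get("image")
        if image:
            return image
    return None


def sites_without_image(meta: dict, sites: list[str]) -> list[str]:
    """List the sites for which no job image can be found."""
    return [site for site in sites if job_image(meta, site) is None]


if __name__ == "__main__":
    print(job_image(META, "site-1"))
    print(sites_without_image(META, ["site-1", "site-2"]))
```

In this example site-1 sets only container options, so its image comes from the default entry.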

Development Container

A single-container image can also be useful for local development or simulator work. In this case, Docker captures a Python environment with NVFlare and your development dependencies installed. This is an alternative to a bare-metal Python virtual environment, not a replacement for nvflare deploy prepare when using the Docker job launcher.

You can use this single-container pattern to run all FLARE processes on one host for development: run the simulator, run POC mode, or start provisioned server/client scripts manually from different shells in the same container. This is different from Docker runtime deployment: the job launcher stays in process mode unless the server or client startup kit is prepared with nvflare deploy prepare.

After you build an image for your environment and tag it nvflare-dev:latest, run it with GPU support and a persistent workspace:

mkdir my-workspace
docker run --rm -it --gpus all \
    --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v $(pwd -P)/my-workspace:/workspace/my-workspace \
    nvflare-dev:latest

Once the container is running, you can open additional shells in it, for example to start more FLARE clients from another terminal. Find the CONTAINER ID with docker ps, then pass that ID to docker exec:

docker ps  # use the CONTAINER ID in the output
docker exec -it <CONTAINER ID> /bin/bash

Best Practices

Always use the latest compatible NVIDIA Container Toolkit version
Use nvflare deploy prepare for Docker runtime startup kits
Keep parent image and job image responsibilities explicit
Put Docker image and container settings in launcher_spec
Put scheduler-facing resource requests in resource_spec
Mount volumes for persistent data storage
Keep your base images updated for security patches

Note

Docker Compose deployment is deprecated. Use nvflare deploy prepare for current Docker runtime preparation.

Common Issues and Solutions

Docker daemon access: ensure the user running start_docker.sh can access the Docker daemon.
Missing job image: ensure every Docker-mode job provides launcher_spec[site]["docker"]["image"] or launcher_spec["default"]["docker"]["image"].
GPU access issues: ensure NVIDIA Container Toolkit is properly installed and the job’s resource_spec requests the needed GPUs.
Memory or shared-memory issues: set container kwargs such as shm_size or ipc_mode in the launcher_spec of the job, or as defaults under job_launcher.default_job_container_kwargs in the runtime docker.yaml used by nvflare deploy prepare.
Network connectivity: make sure the configured Docker network is in place and the server hostname is resolvable from admin, parent, and job containers.
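The Docker daemon access item above can be checked before running start_docker.sh. This is a minimal sketch under stated assumptions: it treats a zero exit status from docker ps as proof that the current user can reach the daemon, and the function name is hypothetical. It does not check GPU access, images, or networking.

```python
import shutil
import subprocess


def docker_daemon_reachable(docker_cmd: str = "docker", timeout: float = 10.0) -> bool:
    """Return True if `docker_cmd` is on PATH and `docker ps` succeeds.

    Assumption: a non-zero exit status (or a timeout) means the current user
    cannot talk to the Docker daemon.
    """
    if shutil.which(docker_cmd) is None:
        return False
    try:
        result = subprocess.run(
            [docker_cmd, "ps", "--quiet"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if docker_daemon_reachable():
        print("Docker daemon is reachable")
    else:
        print("Cannot reach the Docker daemon; check installation and user permissions")
```

Run it as the same user that will start the prepared kit, since daemon access is per user.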