.. _containerized_deployment: ######################## Containerized Deployment ######################## Containerized Deployment with Docker ==================================== Docker support has two common uses: - Run FLARE inside a generic development container for simulation, notebooks, POC mode, or manual experiments. In this pattern, Docker provides the Python environment and filesystem, while FLARE processes still run normally inside that one container. - Prepare provisioned startup kits for Docker runtime execution. This is the recommended path when FLARE parent server/client processes should run in Docker and launch server/client jobs as separate Docker containers. The current Docker runtime workflow is: 1. Build a parent image with NVFlare installed. 2. Build a job image with NVFlare and workload dependencies installed. 3. Provision server, client, and admin startup kits. 4. Run :ref:`deploy_prepare_command` on each server or client startup kit that should run in Docker. 5. Start prepared server/client kits with ``startup/start_docker.sh``. 6. Submit jobs whose ``meta.json`` contains Docker settings in :ref:`launcher_spec`. For a runnable end-to-end workflow, see the :github_nvflare_link:`Docker job launcher example `. Prerequisites ------------- Before starting with containerized deployment, ensure you have: 1. Docker installed on your system 2. NVIDIA Container Toolkit installed for GPU support 3. System requirements met as per the `NVIDIA Container Toolkit Install Guide `_ 4. Provisioned server/client startup kits when preparing a Docker runtime deployment Running NVIDIA FLARE in a Docker container provides several benefits: - Consistent environment across different systems - Easy dependency management - Repeatable runtime preparation - Isolated execution environment - GPU support through NVIDIA Container Toolkit Parent and Job Images --------------------- The Docker runtime separates parent containers from job containers: - The parent image runs the long-lived FLARE server process or client process from a prepared startup kit. - The job image runs server job and client job processes launched by ``DockerJobLauncher``. The parent image is configured in the runtime ``docker.yaml`` used by ``nvflare deploy prepare``. The job image is configured in the submitted job's ``meta.json`` under ``launcher_spec``. You can use the same image for both roles, but keeping them separate is often cleaner: the parent image needs the FLARE runtime and Docker SDK access, while the job image needs the workload frameworks and training dependencies. Docker Runtime Workflow ----------------------- Build or publish the images that your sites will use. The runnable :github_nvflare_link:`Docker job launcher example ` builds an example parent image, ``nvflare-site:latest``, and an example job image, ``nvflare-job:latest``: .. code-block:: shell cd examples/docker bash build_docker.sh After provisioning a project: .. code-block:: shell nvflare provision -p project.yml Create a Docker runtime config for ``nvflare deploy prepare``: .. code-block:: yaml runtime: docker parent: docker_image: nvflare-site:latest network: nvflare-network job_launcher: default_python_path: /usr/local/bin/python default_job_env: NCCL_P2P_DISABLE: "1" default_job_container_kwargs: shm_size: 8g ipc_mode: host Prepare each server or client startup kit that should run in Docker: .. code-block:: shell nvflare deploy prepare workspace//prod_00/server \ --config docker.yaml \ --output workspace//prepared/server nvflare deploy prepare workspace//prod_00/site-1 \ --config docker.yaml \ --output workspace//prepared/site-1 The prepared kits contain ``startup/start_docker.sh``, patched launcher configuration, and a ``local/study_data.yaml`` template. Admin startup kits are not prepared because they do not run parent server or client processes. Start prepared parent processes with: .. code-block:: shell cd workspace//prepared/server bash startup/start_docker.sh Run the same command from each prepared client kit. The generated script creates the configured Docker network if needed and mounts the prepared kit into the parent container. Jobs submitted to Docker-mode sites must specify their job image in ``launcher_spec``: .. code-block:: json { "launcher_spec": { "default": { "docker": {"image": "nvflare-job:latest"} }, "site-1": { "docker": {"shm_size": "8g", "ipc_mode": "host"} } }, "resource_spec": { "site-1": {"num_of_gpus": 1} } } Use ``launcher_spec`` for launcher-specific image and container settings. Keep scheduler-facing resource requests, such as ``num_of_gpus``, in ``resource_spec``. Sites that are not configured with the Docker job launcher continue to use their configured launcher, usually process mode. Development Container --------------------- A single-container image can also be useful for local development or simulator work. In this case, Docker captures a Python environment with NVFlare and your development dependencies installed. This is an alternative to a bare-metal Python virtual environment, not a replacement for ``nvflare deploy prepare`` when using the Docker job launcher. You can use this single-container pattern to run all FLARE processes on one host for development: run the simulator, run POC mode, or start provisioned server/client scripts manually from different shells in the same container. This is different from Docker runtime deployment. The job launcher remains process mode unless the server or client startup kit is prepared with ``nvflare deploy prepare``. After you build an image for your environment and tag it ``nvflare-dev:latest``, run it with GPU support and a persistent workspace: .. code-block:: shell mkdir my-workspace docker run --rm -it --gpus all \ --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \ -v $(pwd -P)/my-workspace:/workspace/my-workspace \ nvflare-dev:latest Once the container is running, you can also exec into the container, for example if you need another terminal to start additional FLARE clients. First find the ``CONTAINER ID`` using ``docker ps``, and then use that ID to exec into the container: .. code-block:: shell docker ps # use the CONTAINER ID in the output docker exec -it /bin/bash Best Practices -------------- 1. Always use the latest compatible NVIDIA Container Toolkit version 2. Use ``nvflare deploy prepare`` for Docker runtime startup kits 3. Keep parent image and job image responsibilities explicit 4. Put Docker image and container settings in ``launcher_spec`` 5. Put scheduler-facing resource requests in ``resource_spec`` 6. Mount volumes for persistent data storage 7. Keep your base images updated for security patches .. note:: Docker Compose deployment is deprecated. Use ``nvflare deploy prepare`` for current Docker runtime preparation. Common Issues and Solutions --------------------------- 1. Docker daemon access: ensure the user running ``start_docker.sh`` can access the Docker daemon. 2. Missing job image: ensure every Docker-mode job provides ``launcher_spec[site]["docker"]["image"]`` or ``launcher_spec["default"]["docker"]["image"]``. 3. GPU access issues: ensure NVIDIA Container Toolkit is properly installed and the job's ``resource_spec`` requests the needed GPUs. 4. Memory or shared-memory issues: set container kwargs such as ``shm_size`` or ``ipc_mode`` in ``launcher_spec`` or in deploy prepare ``default_job_container_kwargs``. 5. Network connectivity: keep the configured Docker network and server hostname resolvable from admin, parent, and job containers.