.. _cc_deployment_guide: ################################################ FLARE Confidential Federated AI Deployment Guide ################################################ Overview ======== This guide provides step-by-step instructions for deploying NVIDIA FLARE with Confidential Computing (CC) capabilities on-premises using AMD SEV-SNP CPUs and NVIDIA GPUs. The deployment involves building a Confidential VM (CVM) image that contains the FLARE application, provisioning the system for participants, and launching the CVMs on each site. Deployment Environment ====================== This guide covers the following deployment configuration: - **Platform**: On-Premise AMD CVM with NVIDIA GPU - **CPU**: AMD SEV-SNP (Secure Encrypted Virtualization - Secure Nested Paging) - **GPU**: NVIDIA GPU with Confidential Computing support (optional) - **Host OS**: Ubuntu 25.04 Prerequisites ============= Hardware Requirements --------------------- **CPU Requirements** - AMD CPU with SEV-SNP enabled - AMD firmware supporting SEV-SNP **GPU Requirements (Optional)** - NVIDIA GPU with Confidential Computing support (H100, Blackwell) **Host System** - Host OS: Ubuntu 25.04 Software Requirements --------------------- **Required Software** - Ubuntu 25.04 - QEMU (for virtualization) - Docker **NVFlare Components** 1. **NVFlare Source Code** Clone from GitHub: .. code-block:: bash git clone https://github.com/NVIDIA/NVFlare.git 2. **Image Builder** - Obtain the image builder code from the NVFlare team - Contact: federatedlearning@nvidia.com - Install location: ``~/cc/image_builder`` 3. **Base Images** - Obtain base images from the NVFlare team - Copy to: ``~/cc/image_builder/base_images`` 4. **KBS Client** - Build or obtain KBS client matching your KBS server - Recommended commit: ``a2570329cc33daf9ca16370a1948b5379bb17fbe`` - Copy the kbs-client and credentials to: ``~/cc/image_builder/binaries`` 5. **SNPGuest Tool** - Build SNPGuest version v0.9.2 - Copy snpguest and credentials to: ``~/cc/image_builder/binaries`` **AMD Firmware Installation** To install the AMD SEV-SNP firmware: .. code-block:: bash echo 'deb http://archive.ubuntu.com/ubuntu plucky-proposed main restricted universe multiverse' | \ sudo tee /etc/apt/sources.list.d/plucky-proposed.list sudo tee /etc/apt/preferences.d/99-plucky-proposed <<'EOF' Package: * Pin: release a=plucky-proposed Pin-Priority: 100 EOF sudo apt update sudo apt install -t plucky-proposed ovmf The firmware will be installed at ``/usr/share/ovmf/OVMF.amdsev.fd``. **Policy Files Setup** .. note:: The current KBS doesn't support updating individual rules. You must update the entire rule file when adding a new CVM. Create the policy directory and obtain the required files: .. code-block:: bash mkdir -p /shared/policy Place the following files in ``/shared/policy`` (obtain from the NVFlare team): - ``policy.rego`` - Master policy file - ``set-policy.sh`` - Policy update script - ``private.key`` - Authentication key Project Admin Requirements --------------------------- As the project admin, you need to: 1. **Understand Trustee Service** - Learn about `Trustee Service `_ - Review the `Trustee documentation `_ 2. **Deploy Trustee KBS Server** Follow the :ref:`hashicorp_vault_trustee_deployment` guide to deploy the Trustee Key Broker Service with HashiCorp Vault. Deployment Workflow =================== The deployment consists of four main steps: 1. **Build Docker Image** - Create the application container 2. **Provision** - Generate CVM images and startup kits 3. **Distribute** - Send startup kits to each site 4. **Launch** - Start CVMs at each site Step 1: Build Docker Image --------------------------- The CC image builder supports any generic workload. For NVFlare, create a Docker image with the application pre-installed. **Example Dockerfile:** .. code-block:: dockerfile ARG BASE_IMAGE=python:3.12 FROM ${BASE_IMAGE} ENV PYTHONDONTWRITEBYTECODE=1 ENV PIP_NO_CACHE_DIR=1 RUN pip install -U pip && \ pip install nvflare~=2.7.0rc COPY code/ /local/custom COPY requirements.txt . RUN pip install -r requirements.txt ENTRYPOINT ["/user_config/nvflare/startup/sub_start.sh", "--verify"] .. note:: For CC jobs, custom code at runtime is not allowed. All application code must be included in the Docker image. **Build and save the image:** .. code-block:: bash docker build -t nvflare-site:latest . docker save nvflare-site:latest | gzip > nvflare-site.tar.gz Step 2: Provision ----------------- Navigate to the example directory: .. code-block:: bash cd NVFlare/examples/advanced/cc_provision **2.1 Configure Project** Edit ``project.yml`` and update the ``build_image_cmd`` path: .. code-block:: yaml packager: path: nvflare.lighter.cc_provision.impl.onprem_packager.OnPremPackager args: # Update this path to your image builder location build_image_cmd: ~/nvflare-github/nvflare/lighter/cc/image_builder/cvm_build.sh **2.2 Configure Server** Edit ``cc_server1.yml`` and set the ``docker_archive`` path: .. code-block:: yaml docker_archive: ~/NVFlare/examples/advanced/cc_provision/docker/nvflare-site.tar.gz **2.3 Configure Client** Edit ``cc_site-1.yml``: 1. Set the ``docker_archive`` path: .. code-block:: yaml docker_archive: ~/NVFlare/examples/advanced/cc_provision/docker/nvflare-site.tar.gz 2. If the server name is not a public domain, add host entries: .. code-block:: yaml host_entries: server1: 10.176.4.244 3. If no GPU is available, remove the GPU authorizer and ``cc_gpu_mechanism`` configuration. **2.4 Run Provision** .. code-block:: bash nvflare provision -p project.yml .. note:: Provisioning takes approximately 1000 seconds to build each CVM image. **2.5 Output** Startup packages are generated in: .. code-block:: text ./workspace/example_project/prod_00/ server1/server1.tgz site-1/site-1.tgz admin@nvidia.com/ Step 3: Distribute ------------------- Distribute the generated startup kits to each participant: - Send ``server1.tgz`` to the server site - Send ``site-1.tgz`` to client site-1 - Admin keeps the admin package locally CVM Startup Kit Contents ^^^^^^^^^^^^^^^^^^^^^^^^^ Each startup kit (e.g., ``server1.tgz``) contains: .. code-block:: bash $ tar -zxvf server1.tgz $ ls server1/cvm_885fe8f608b3/ .. list-table:: :header-rows: 1 :widths: 30 70 * - File - Description * - ``applog.qcow2`` - Application log storage (unencrypted, can be mounted and inspected) * - ``crypt_root.qcow2`` - Encrypted root filesystem (requires decryption key) * - ``initrd.img`` - Initramfs with InitApp for attestation * - ``launch_vm.sh`` - CVM launch script * - ``OVMF.amdsev.fd`` - AMD SEV-SNP firmware with kernel-hashes support * - ``README.txt`` - Documentation * - ``user_config.qcow2`` - User configuration storage containing NVFlare startup kits * - ``user_data.qcow2`` - User data storage (placeholder, can be extended) * - ``vmlinuz`` - Linux kernel Step 4: Launch CVMs -------------------- **4.1 Launch Server** On the server machine: .. code-block:: bash tar -zxvf server1.tgz cd server1/cvm_* ./launch_vm.sh **4.2 Launch Client** On each client machine: .. code-block:: bash tar -zxvf site-1.tgz cd site-1/cvm_* ./launch_vm.sh The server and clients will automatically start the NVFlare system inside their respective CVMs. **4.3 Start Admin Console** On the admin machine: .. code-block:: bash cd NVFlare/examples/advanced/cc_provision # Copy jobs to admin transfer folder cp -r jobs/* ./workspace/example_project/prod_00/admin@nvidia.com/transfer/ .. note:: If the server name is not a public domain, add an entry in ``/etc/hosts`` on the admin machine. Start the admin console: .. code-block:: bash ./workspace/example_project/prod_00/admin@nvidia.com/startup/fl_admin.sh **4.4 Submit Job** In the admin console: .. code-block:: bash submit_job hello-pt_cifar10_fedavg Configuration Reference ======================= CC Configuration Parameters --------------------------- .. list-table:: :header-rows: 1 :widths: 25 25 50 * - Parameter - Example Value - Description * - ``compute_env`` - ``onprem_cvm`` - Computation environment type * - ``cc_cpu_mechanism`` - ``amd_sev_snp`` - CPU confidential computing mechanism * - ``role`` - ``server`` / ``client`` - Role in the NVFlare system * - ``root_drive_size`` - ``30`` (GB) - Size of the root filesystem drive * - ``applog_drive_size`` - ``1`` (GB) - Size of the application log drive * - ``user_config_drive_size`` - ``1`` (GB) - Size of the user configuration drive * - ``user_data_drive_size`` - ``1`` (GB) - Size of the user data drive * - ``docker_archive`` - ``~/path/to/app.tar.gz`` - Path to Docker image archive (created with ``docker save``) * - ``user_config`` - Key-value pairs - Paths mounted in container at ``/user_config/[key]`` * - ``allowed_ports`` - List of ports - Inbound ports to whitelist * - ``allowed_out_ports`` - List of ports - Outbound ports to whitelist * - ``cc_issuers`` - List of authorizers - CC attestation token issuers * - ``token_expiration`` - ``100`` (seconds) - Token validity duration (must be < ``check_frequency``) * - ``check_frequency`` - ``120`` (seconds) - Attestation check interval * - ``failure_action`` - ``stop_job`` - Action on attestation failure Complete Configuration Examples -------------------------------- **Project Configuration (project.yml)** .. code-block:: yaml api_version: 3 name: example_project description: NVIDIA FLARE sample project yaml file participants: - name: server1 type: server org: nvidia fed_learn_port: 8002 cc_config: cc_server1.yml - name: site-1 type: client org: nvidia cc_config: cc_site-1.yml - name: admin@nvidia.com type: admin org: nvidia role: project_admin builders: - path: nvflare.lighter.impl.workspace.WorkspaceBuilder - path: nvflare.lighter.impl.static_file.StaticFileBuilder args: config_folder: config - path: nvflare.lighter.impl.cert.CertBuilder - path: nvflare.lighter.impl.signature.SignatureBuilder - path: nvflare.lighter.cc_provision.impl.cc.CCBuilder packager: path: nvflare.lighter.cc_provision.impl.onprem_packager.OnPremPackager args: build_image_cmd: ~/nvflare-github/nvflare/lighter/cc/image_builder/cvm_build.sh **Server Configuration (cc_server1.yml)** .. code-block:: yaml compute_env: onprem_cvm cc_cpu_mechanism: amd_sev_snp role: server # All drive sizes are in GB root_drive_size: 30 applog_drive_size: 1 user_config_drive_size: 1 user_data_drive_size: 1 # Docker image archive saved using: # docker save | gzip > app.tar.gz docker_archive: ~/NVFlare/examples/advanced/cc_provision/docker/nvflare-site.tar.gz # Will be mounted inside docker at "/user_config/nvflare" user_config: nvflare: /tmp/startup_kits # Inbound ports whitelist allowed_ports: - 8002 # Outbound ports whitelist allowed_out_ports: - 443 # HTTPS - 8002 # NVFlare - 8999 # Trustee KBS cc_issuers: - id: snp_authorizer path: nvflare.app_opt.confidential_computing.snp_authorizer.SNPAuthorizer token_expiration: 100 # seconds, must be < check_frequency cc_attestation: check_frequency: 120 # seconds failure_action: stop_job **Client Configuration (cc_site-1.yml)** .. code-block:: yaml compute_env: onprem_cvm cc_cpu_mechanism: amd_sev_snp role: client # All drive sizes are in GB root_drive_size: 30 applog_drive_size: 1 user_config_drive_size: 1 user_data_drive_size: 1 # Docker image archive docker_archive: ~/NVFlare/examples/advanced/cc_provision/docker/nvflare-site.tar.gz # For non-public domain server names hosts_entries: server1: 10.176.200.152 # Will be mounted inside docker at "/user_config/nvflare" user_config: nvflare: /tmp/startup_kits cc_issuers: - id: snp_authorizer path: nvflare.app_opt.confidential_computing.snp_authorizer.SNPAuthorizer token_expiration: 100 # seconds, must be < check_frequency cc_attestation: check_frequency: 120 # seconds failure_action: stop_job Troubleshooting =============== Inspecting QCOW2 Disk Images ----------------------------- To inspect the contents of a QCOW2 disk image (e.g., ``user_config.qcow2``): **1. Load the NBD kernel module:** .. code-block:: bash sudo modprobe nbd max_part=8 **2. Connect the QCOW2 image:** .. code-block:: bash sudo qemu-nbd --connect=/dev/nbd0 user_config.qcow2 **3. Mount the image:** .. code-block:: bash sudo mount /dev/nbd0 /mnt/user_config **4. Inspect the contents:** .. code-block:: bash ls /mnt/user_config For NVFlare startup kits: .. code-block:: bash ls /mnt/user_config/nvflare/ **5. Unmount:** .. code-block:: bash sudo umount /mnt/user_config **6. Disconnect:** .. code-block:: bash sudo qemu-nbd --disconnect /dev/nbd0 Common Issues ------------- **Issue: CVM fails to boot** - Check the ``applog.qcow2`` for boot logs - Verify firmware is correctly installed - Ensure kernel-hashes is enabled in the firmware **Issue: Attestation failure** - Verify Trustee KBS server is accessible - Check network connectivity to attestation service - Ensure correct ports are whitelisted in ``allowed_out_ports`` **Issue: Server/Client connection fails** - Verify ``/etc/hosts`` entries if not using public domain - Check firewall rules - Ensure correct ports are configured in both server and client Next Steps ========== After successfully deploying the system: - Review the :ref:`NVFlare CC Architecture ` for understanding the security model - Consult :ref:`confidential_computing_attestation` for attestation details - Explore advanced configuration options for your specific use case