FLARE Confidential Federated AI Deployment Guide¶
Overview¶
This guide provides step-by-step instructions for deploying NVIDIA FLARE with Confidential Computing (CC) capabilities on-premises using AMD SEV-SNP CPUs and NVIDIA GPUs.
The deployment involves building a Confidential VM (CVM) image that contains the FLARE application, provisioning the system for participants, and launching the CVMs on each site.
Deployment Environment¶
This guide covers the following deployment configuration:
Platform: On-Premise AMD CVM with NVIDIA GPU
CPU: AMD SEV-SNP (Secure Encrypted Virtualization - Secure Nested Paging)
GPU: NVIDIA GPU with Confidential Computing support (optional)
Host OS: Ubuntu 25.04
Prerequisites¶
For a complete and thorough setup guide covering Hardware IT, Host OS Administration, and VM Administration, please refer to [NVIDIA’s Deployment Guide for SecureAI](https://docs.nvidia.com/cc-deployment-guide-snp.pdf)
Hardware Requirements¶
CPU Requirements
AMD CPU with SEV-SNP enabled
AMD firmware supporting SEV-SNP
GPU Requirements (Optional)
NVIDIA GPU with Confidential Computing support (H100, Blackwell)
Host System
Host OS: Ubuntu 25.04
Software Requirements¶
Required Software
Ubuntu 25.04
QEMU (for virtualization)
Docker
NVFlare Components
NVFlare Source Code
Clone from GitHub:
git clone https://github.com/NVIDIA/NVFlare.git
Image Builder
Obtain the image builder code from the NVFlare team
Contact: federatedlearning@nvidia.com
Install location:
~/cc/image_builder
Base Images
Obtain base images from the NVFlare team
Copy to:
~/cc/image_builder/base_images
KBS Client
Build or obtain KBS client matching your KBS server
Recommended commit:
a2570329cc33daf9ca16370a1948b5379bb17fbeCopy the kbs-client and credentials to:
~/cc/image_builder/binaries
SNPGuest Tool
Build SNPGuest version v0.9.2
Copy snpguest and credentials to:
~/cc/image_builder/binaries
AMD Firmware Installation
To install the AMD SEV-SNP firmware:
echo 'deb http://archive.ubuntu.com/ubuntu plucky-proposed main restricted universe multiverse' | \
sudo tee /etc/apt/sources.list.d/plucky-proposed.list
sudo tee /etc/apt/preferences.d/99-plucky-proposed <<'EOF'
Package: *
Pin: release a=plucky-proposed
Pin-Priority: 100
EOF
sudo apt update
sudo apt install -t plucky-proposed ovmf
The firmware will be installed at /usr/share/ovmf/OVMF.amdsev.fd.
Policy Files Setup
Note
The current KBS doesn’t support updating individual rules. You must update the entire rule file when adding a new CVM.
Create the policy directory and obtain the required files:
mkdir -p /shared/policy
Place the following files in /shared/policy (obtain from the NVFlare team):
policy.rego- Master policy fileset-policy.sh- Policy update scriptprivate.key- Authentication key
Project Admin Requirements¶
As the project admin, you need to:
Understand Trustee Service
Learn about Trustee Service
Review the Trustee documentation
Deploy Trustee KBS Server
Follow the HashiCorp Vault and Trustee KBS Joint Deployment Guide guide to deploy the Trustee Key Broker Service with HashiCorp Vault.
Deployment Workflow¶
The deployment consists of five main steps:
Build Docker Image - Create the application container
Provision - Generate CVM images and startup kits
Distribute - Send startup kits to each site
User Data - Prepare user data, optional
Launch - Start CVMs at each site
Step 1: Build Docker Image¶
The CC image builder supports any generic workload. For NVFlare, create a Docker image with the application pre-installed.
Example Dockerfile:
ARG BASE_IMAGE=python:3.12
FROM ${BASE_IMAGE}
ENV PYTHONDONTWRITEBYTECODE=1
ENV PIP_NO_CACHE_DIR=1
RUN pip install -U pip && \
pip install nvflare~=2.7.0rc
COPY code/ /local/custom
COPY requirements.txt .
RUN pip install -r requirements.txt
ENTRYPOINT ["/user_config/nvflare/startup/sub_start.sh", "--verify"]
Note
For CC jobs, custom code at runtime is not allowed. All application code must be included in the Docker image.
Build and save the image:
docker build -t nvflare-site:latest .
docker save nvflare-site:latest | gzip > nvflare-site.tar.gz
Step 2: Provision¶
Navigate to the example directory:
cd NVFlare/examples/advanced/cc_provision
2.1 Configure Project
Edit project.yml and update the build_image_cmd path:
packager:
path: nvflare.lighter.cc_provision.impl.onprem_packager.OnPremPackager
args:
# Update this path to your image builder location
build_image_cmd: ~/nvflare-github/nvflare/lighter/cc/image_builder/cvm_build.sh
2.2 Configure Server
Edit cc_server1.yml and set the docker_archive path:
docker_archive: ~/NVFlare/examples/advanced/cc_provision/docker/nvflare-site.tar.gz
2.3 Configure Client
Edit cc_site-1.yml:
Set the
docker_archivepath:docker_archive: ~/NVFlare/examples/advanced/cc_provision/docker/nvflare-site.tar.gz
If the server name is not a public domain, add host entries:
host_entries: server1: 10.176.4.244
2.4 Run Provision
nvflare provision -p project.yml
Note
Provisioning takes approximately 1000 seconds to build each CVM image.
2.5 Output
Startup packages are generated in:
./workspace/example_project/prod_00/
server1/server1.tgz
site-1/site-1.tgz
admin@nvidia.com/
Step 3: Distribute¶
Distribute the generated startup kits to each participant:
Send
server1.tgzto the server siteSend
site-1.tgzto client site-1Admin keeps the admin package locally
CVM Startup Kit Contents¶
Each startup kit (e.g., server1.tgz) contains:
$ tar -zxvf server1.tgz
$ ls server1/cvm_885fe8f608b3/
File |
Description |
|---|---|
|
Application log storage (unencrypted, can be mounted and inspected) |
|
Encrypted root filesystem (requires decryption key) |
|
Initramfs with InitApp for attestation |
|
CVM launch script |
|
AMD SEV-SNP firmware with kernel-hashes support |
|
Documentation |
|
User configuration storage containing NVFlare startup kits |
|
User data storage (placeholder, can be extended) |
|
Linux kernel |
Step 4: User Data¶
This step is optional. It describes how to prepare user data and make it available inside both the CVM and the workload container. If your workload does not require user data, you may skip this step.
4.1 User Data Drive
The CVM distribution package includes a user-data drive image named user_data.qcow2. This image contains a single ext4 filesystem that occupies the entire drive (no partitions).
The drive is not encrypted. You can access it by attaching it to a VM or by using the qemu-nbd command. For example:
sudo qemu-nbd --connect=/dev/nbd0 user_data.qcow2
sudo mount /dev/nbd0 /mnt
.. note::
Please unmount the drive after usage by running:
.. code-block:: bash
sudo umount /mnt
sudo qemu-nbd --disconnect /dev/nbd0
4.2 Local Data
If your data resides on the local host, it can be copied directly into the user_data drive. The data is
available in CVM and container as /user_data.
For example:
cp -r /training_data /mnt
The user_data.qcow2 image included with the package is a placeholder and has a very small capacity (1 GB). You can resize the drive and expand the filesystem using the following commands:
# Add 20GB to the drive
qemu-img resize user_data.qcow2 +20G
# Extend filesystem
sudo qemu-nbd --connect=/dev/nbd0 user_data.qcow2
sudo e2fsck -f /dev/nbd0
sudo resize2fs /dev/nbd0
sudo qemu-nbd --disconnect /dev/nbd0
4.3 Remote Data using NFS
If your data is hosted on an NFS server, the CVM can automatically mount it when a file named ext_mount.conf
is present at the root of the user_data.qcow2 drive.
The mounted data is available in CVM and container as /user_data/mnt.
The file must contain a single line specifying the exported server path, for example:
nfs-server.example.com:/training_data
By default, the CVM blocks all outbound network traffic. To allow NFS and port-mapper communication, the following outgoing ports must be enabled in the CVM site configuration:
allowed_out_ports: [111, 2049]
This must be done in Step 2.3 of the provisioning process.
Because the CVM cannot control the source port used for NFS connections, secure NFS exports are not supported. The server export must therefore be configured as insecure:
/training_data *(rw,sync,no_subtree_check,insecure)
For details, please refer to exports man page
Step 5: Launch CVMs¶
5.1 Launch Server
On the server machine:
tar -zxvf server1.tgz
cd server1/cvm_*
./launch_vm.sh
5.2 Launch Client
On each client machine:
tar -zxvf site-1.tgz
cd site-1/cvm_*
./launch_vm.sh
The server and clients will automatically start the NVFlare system inside their respective CVMs.
5.3 Start Admin Console
On the admin machine:
cd NVFlare/examples/advanced/cc_provision
# Copy jobs to admin transfer folder
cp -r jobs/* ./workspace/example_project/prod_00/admin@nvidia.com/transfer/
Note
If the server name is not a public domain, add an entry in /etc/hosts on the admin machine.
Start the admin console:
./workspace/example_project/prod_00/admin@nvidia.com/startup/fl_admin.sh
5.4 Submit Job
In the admin console:
submit_job hello-pt_cifar10_fedavg
Configuration Reference¶
CC Configuration Parameters¶
Parameter |
Example Value |
Description |
|---|---|---|
|
|
Computation environment type |
|
|
CPU confidential computing mechanism |
|
|
Role in the NVFlare system |
|
|
Size of the root filesystem drive |
|
|
Size of the application log drive |
|
|
Size of the user configuration drive |
|
|
Size of the user data drive |
|
|
Path to Docker image archive (created with |
|
Key-value pairs |
Paths mounted in container at |
|
List of ports |
Inbound ports to whitelist |
|
List of ports |
Outbound ports to whitelist |
|
List of authorizers |
CC attestation token issuers |
|
|
Token validity duration (must be < |
|
|
Attestation check interval |
Complete Configuration Examples¶
Project Configuration (project.yml)
api_version: 3
name: example_project
description: NVIDIA FLARE sample project yaml file
participants:
- name: server1
type: server
org: nvidia
fed_learn_port: 8002
cc_config: cc_server1.yml
- name: site-1
type: client
org: nvidia
cc_config: cc_site-1.yml
- name: admin@nvidia.com
type: admin
org: nvidia
role: project_admin
builders:
- path: nvflare.lighter.impl.workspace.WorkspaceBuilder
- path: nvflare.lighter.impl.static_file.StaticFileBuilder
args:
config_folder: config
- path: nvflare.lighter.impl.cert.CertBuilder
- path: nvflare.lighter.cc_provision.impl.cc.CCBuilder
- path: nvflare.lighter.impl.signature.SignatureBuilder
packager:
path: nvflare.lighter.cc_provision.impl.onprem_packager.OnPremPackager
args:
build_image_cmd: ~/nvflare-github/nvflare/lighter/cc/image_builder/cvm_build.sh
Server Configuration (cc_server1.yml)
compute_env: onprem_cvm
cc_cpu_mechanism: amd_sev_snp
role: server
# All drive sizes are in GB
root_drive_size: 45
applog_drive_size: 1
user_config_drive_size: 1
user_data_drive_size: 1
# Docker image archive saved using:
# docker save <image_name> | gzip > app.tar.gz
docker_archive: ~/NVFlare/examples/advanced/cc_provision/docker/nvflare-site.tar.gz
# Will be mounted inside docker at "/user_config/nvflare"
user_config:
nvflare: /tmp/startup_kits
# Inbound ports whitelist
allowed_ports:
- 8002
# Outbound ports whitelist
allowed_out_ports:
- 443 # HTTPS
- 8002 # NVFlare
- 8999 # Trustee KBS
cc_issuers:
- id: snp_authorizer
path: nvflare.app_opt.confidential_computing.snp_authorizer.SNPAuthorizer
token_expiration: 100 # seconds, must be < check_frequency
cc_attestation:
check_frequency: 120 # seconds
Client Configuration (cc_site-1.yml)
compute_env: onprem_cvm
cc_cpu_mechanism: amd_sev_snp
role: client
# All drive sizes are in GB
root_drive_size: 45
applog_drive_size: 1
user_config_drive_size: 1
user_data_drive_size: 1
# Docker image archive
docker_archive: ~/NVFlare/examples/advanced/cc_provision/docker/nvflare-site.tar.gz
# For non-public domain server names
hosts_entries:
server1: 10.176.200.152
# Will be mounted inside docker at "/user_config/nvflare"
user_config:
nvflare: /tmp/startup_kits
cc_issuers:
- id: snp_authorizer
path: nvflare.app_opt.confidential_computing.snp_authorizer.SNPAuthorizer
token_expiration: 100 # seconds, must be < check_frequency
cc_attestation:
check_frequency: 120 # seconds
Troubleshooting¶
Inspecting QCOW2 Disk Images¶
To inspect the contents of a QCOW2 disk image (e.g., user_config.qcow2):
1. Load the NBD kernel module:
sudo modprobe nbd max_part=8
2. Connect the QCOW2 image:
sudo qemu-nbd --connect=/dev/nbd0 user_config.qcow2
3. Mount the image:
sudo mount /dev/nbd0 /mnt/user_config
4. Inspect the contents:
ls /mnt/user_config
For NVFlare startup kits:
ls /mnt/user_config/nvflare/
5. Unmount:
sudo umount /mnt/user_config
6. Disconnect:
sudo qemu-nbd --disconnect /dev/nbd0
Common Issues¶
Issue: CVM fails to boot
Check the
applog.qcow2for boot logsVerify firmware is correctly installed
Ensure kernel-hashes is enabled in the firmware
Issue: Attestation failure
Verify Trustee KBS server is accessible
Check network connectivity to attestation service
Ensure correct ports are whitelisted in
allowed_out_ports
Issue: Server/Client connection fails
Verify
/etc/hostsentries if not using public domainCheck firewall rules
Ensure correct ports are configured in both server and client
Notes on using NVIDIA GPU CC¶
For any site that supports GPU CC, you can add NVFLARE’s GPUAuthorizer to the cc_site.yml configuration file:
cc_issuers:
...
- id: gpu_authorizer
path: nvflare.app_opt.confidential_computing.gpu_authorizer.GPUAuthorizer
token_expiration: 100 # seconds, needs to be less than check_frequency
The NVFlare GPUAuthorizer uses NVIDIA’s nv_attestation_sdk. When building the NVFlare app docker image, make sure to include it in the requirements, for example:
torch
torchvision
tensorboard
tensorflow
safetensors
nv_attestation_sdk
- To get GPU working in CVM, you need to ensure:
No GPU driver installed on host, otherwise the passthrough will fail.
You need to create VFIO by running the following command:
NVIDIA_GPU=$(lspci -d 10de: | awk '/NVIDIA/{print $1}')
NVIDIA_PASSTHROUGH=$(lspci -n -s $NVIDIA_GPU | awk -F: '{print $4}' | awk '{print $1}')
echo 10de $NVIDIA_PASSTHROUGH > /sys/bus/pci/drivers/vfio-pci/new_id
For more details, please refer to NVIDIA’s Deployment Guide for SecureAI
Next Steps¶
After successfully deploying the system:
Review the NVFlare CC Architecture for understanding the security model
Consult Confidential Computing: Attestation Service Integration for attestation details
Explore advanced configuration options for your specific use case