.. _brev_deployment:
###############################
Brev Kubernetes Helm Deployment
###############################
This guide walks through an end-to-end NVIDIA FLARE deployment on two Brev
single-node Kubernetes environments, treated as two Kubernetes clusters:
* one cluster for the FLARE server;
* one cluster for a single FLARE client named ``site-1``.
It covers provisioning, editing ``project.yml``, using ``nvflare deploy prepare``
to generate Helm charts for the server and client, creating the Helm workspace
PersistentVolumeClaim (PVC) and any job data PVCs, staging the prepared folders
into the workspace PVC, and deploying the generated charts.
The Kubernetes environments are created from the Brev web UI. The exact control
labels in Brev can change, but the workflow is the same: create an environment,
select compute, switch the software configuration to ``Single-node Kubernetes``,
open a Brev shell, copy the prepared kits to the environment, then deploy with
``kubectl`` and ``helm`` inside each Brev environment.
Brev System Overview
====================
Brev provides managed compute environments that can be created with CPU or GPU
hardware and an optional single-node Kubernetes software configuration. In this
guide, each Brev environment is used as a small independent Kubernetes cluster:
one environment runs the FLARE server, and each client site runs in its own
environment.
The Brev console is used to create environments, choose hardware, select
``Single-node Kubernetes``, and expose the FLARE server port. The Brev CLI is
used from the local workstation to copy files and open shells:
* ``brev copy`` uploads each prepared participant archive.
* ``brev shell`` opens a shell inside a Brev environment.
* ``brev exec`` can run non-interactive commands after the environment is ready.
Inside each Brev Kubernetes environment, ``kubectl`` and ``helm`` operate on
that environment's local cluster. Because the clusters are separate, using the
same Kubernetes namespace and PVC names in each cluster is safe.
The FLARE server environment needs an inbound TCP port for
``fed_learn_port``. Client environments usually do not need inbound FLARE
ports; they connect outbound to the server endpoint configured in
``project.yml``.
Assumptions
===========
The examples use:
* one server named ``server1``;
* one client named ``site-1``;
* one Brev Kubernetes environment named ``nvflare-server-k8s``;
* one Brev Kubernetes environment named ``nvflare-site-1-k8s``;
* namespace ``nvflare`` in both clusters;
* workspace PVC name ``nvflws`` in both clusters;
* optional job data PVC name ``nvfldata`` in both clusters;
* an externally reachable DNS name for the server, for example
``server1.example.com``;
* a container image in a registry that both clusters can pull, for example
``registry.example.com/nvflare:dev``.
Using the same namespace and PVC names in both clusters is safe because each
cluster has its own Kubernetes API and storage backend.
References:
* `NVIDIA Brev documentation `__
* `NVIDIA Brev console documentation `__
* `Brev connectivity documentation `__
* :ref:`helm_chart`
* :ref:`deploy_prepare_command`
Scripted Three-Environment Variant
==================================
If you already have three Brev single-node Kubernetes environments named
``server``, ``site-1``, and ``site-2``, the helper scripts below automate the
same provisioning, deploy prepare, copy, PVC staging, and Helm install flow for
a server plus two clients:
* :download:`prepare_brev_startup_kits.sh `
* :download:`launch_brev_nvflare.sh `
Run the prepare script from a local NVFlare checkout with an external server
host name and an image that all Brev clusters can pull:
.. code-block:: shell
export SERVER_HOST=server1.example.com
export IMAGE=registry.example.com/nvflare:dev
bash docs/user_guide/admin_guide/deployment/brev_scripts/prepare_brev_startup_kits.sh
If your Brev environment names differ from the participant names, set them with
environment variables or ask the script to prompt for them:
.. code-block:: shell

   # Option 1: pass the Brev environment names as variables.
   SERVER_BREV=nvflare-server-k8s \
   SITE_1_BREV=nvflare-site-1-k8s \
   SITE_2_BREV=nvflare-site-2-k8s \
   bash docs/user_guide/admin_guide/deployment/brev_scripts/prepare_brev_startup_kits.sh

   # Option 2: let the script prompt for the names.
   bash docs/user_guide/admin_guide/deployment/brev_scripts/prepare_brev_startup_kits.sh \
     --prompt-brev-names
Then run the launch script inside each Brev environment:
.. code-block:: shell

   # Variables exported on the workstation are not inherited by the remote
   # shell, so export IMAGE and SERVER_HOST again inside each environment.
   brev shell "${SERVER_BREV:-server}"
   IMAGE="$IMAGE" bash /home/ubuntu/launch_brev_nvflare.sh server

   brev shell "${SITE_1_BREV:-site-1}"
   IMAGE="$IMAGE" SERVER_HOST="$SERVER_HOST" bash /home/ubuntu/launch_brev_nvflare.sh site-1

   brev shell "${SITE_2_BREV:-site-2}"
   IMAGE="$IMAGE" SERVER_HOST="$SERVER_HOST" bash /home/ubuntu/launch_brev_nvflare.sh site-2
The current Brev CLI exposes ``brev port-forward`` for local forwarding, but it
does not provide a public TCP port exposure command. Use the Brev UI Access page
to expose TCP ``8002`` on the ``server`` environment before starting the two
sites.
Create Brev Kubernetes Environments
===================================
Create the server Kubernetes environment first, then repeat the same flow for
the client Kubernetes environment. In the Brev UI, a single-node Kubernetes
environment is created from the same ``GPUs`` page used for GPU and CPU
development environments.
Server Kubernetes Environment
-----------------------------
#. Sign in to the `Brev console `__.
#. Open ``GPUs`` in the top navigation.
#. Click ``Create Environment``.
.. figure:: ../../../resources/brev_creating.png
:alt: Brev GPU Environments page with the Create Environment button.
Start from the Brev ``GPUs`` page and create a new environment.
#. Select the hardware for the server environment. A CPU instance is enough for
the FLARE server unless your server-side workflow requires GPU compute.
.. figure:: ../../../resources/brev_instance.png
:alt: Brev Create Environment page with CPU selected.
Select a CPU or GPU instance type. For a basic server deployment, a CPU
instance type is sufficient.
#. Configure the environment name, owner, provider, region, and storage:
* ``Name``: ``nvflare-server-k8s``.
* ``Organization`` or ``Project``: choose the Brev organization that should
own the environment.
* ``Provider`` or ``Cloud``: choose the cloud provider where the server
should run.
* ``Region``: choose a region reachable by the client cluster and by your
admin operator.
* ``Disk Storage``: choose enough space for the container image cache, the
provisioned workspace PVC, server job storage, snapshots, and logs.
.. figure:: ../../../resources/brev_config_instance.png
:alt: Brev hardware, storage, region, and software configuration page.
Configure disk storage and region before changing the software mode.
#. In ``Software Configuration``, click ``Edit``.
#. Select ``Single-node Kubernetes``.
#. Keep ``Install Kubernetes Dashboard`` enabled if you want browser access to
the cluster dashboard.
#. Leave ``Run a cluster init script`` disabled unless your organization has a
required initialization script.
#. Click ``Apply``.
.. figure:: ../../../resources/brev_select_k8s.png
:alt: Brev software picker with Single-node Kubernetes selected.
Choose ``Single-node Kubernetes`` so the environment is created with
Kubernetes, ``kubectl``, and ``helm`` ready to use.
#. Expand ``Advanced`` only if you need to set custom network or startup
options.
#. Set ``Name Instance`` to ``nvflare-server-k8s``.
#. Click ``Deploy``.
.. figure:: ../../../resources/brev_deploy.png
:alt: Brev deployment page showing Name Instance and Deploy.
Name the server environment and deploy it.
#. Wait until the environment status is ``Running`` or ``Ready``.
Client Kubernetes Environment
-----------------------------
Repeat the same web UI flow and use these values:
* ``Name``: ``nvflare-site-1-k8s``.
* ``Instance Type``: choose CPU or GPU compute based on the jobs that ``site-1``
will run.
* ``Networking``: the client cluster needs outbound access to
``server1.example.com:8002``.
* ``Disk Storage``: choose enough space for the client workspace, logs, and data
PVC.
* ``Software Configuration``: choose ``Single-node Kubernetes``.
* ``Ports``: no inbound FLARE port is required for this basic client
deployment. The client connects outbound to the server on ``8002``.
Enable Server Port Access and SSH
---------------------------------
After both Kubernetes environments are running, open the server environment's
``Access`` page.

This guide does not set ``admin_port`` in ``project.yml``. When ``admin_port``
is omitted, NVFlare uses the same value as ``fed_learn_port``, so the Brev
server environment only needs to expose one port.

In the ``Using Ports`` section, expose the FLARE federated learning port,
``fed_learn_port`` ``8002``:
#. Find ``TCP/UDP Ports``.
#. In ``Expose Port(s)``, enter ``8002``.
#. Select the access scope. ``Allow All IPs`` is convenient for a quick test;
restrict this to known client/admin source IPs for a real deployment.
#. Click ``Expose Port``.
#. Confirm that the table lists port ``8002`` and shows a public endpoint such
   as ``<public-host>:8002``.
.. figure:: ../../../resources/brev_port.png
:alt: Brev Access page showing copy, secure links, and TCP ports.
In ``Using Ports``, expose the server ``fed_learn_port`` ``8002``. The same
page also shows the ``brev copy`` command format for uploading files to an
environment.
Copy the public ``host:port`` value for port ``8002``. Point
``server1.example.com`` to the host/IP portion of that endpoint. Do not include
the port in ``default_host``; the port is already configured as
``fed_learn_port: 8002`` in ``project.yml``.
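The split between host and port can be sketched with shell parameter expansion. This is a minimal illustration, assuming the endpoint copied from the Brev UI has the form ``host:port``; the address below is a documentation-only placeholder, not a real Brev endpoint.

```shell
# Placeholder endpoint; 203.0.113.10 is a documentation-only address.
endpoint="203.0.113.10:8002"

host=${endpoint%:*}     # everything before the last colon
port=${endpoint##*:}    # everything after the last colon

echo "DNS target for server1.example.com: $host"
echo "exposed port: $port"

# The port must match fed_learn_port from project.yml (8002 in this guide).
if [ "$port" != "8002" ]; then
    echo "warning: the exposed port differs from fed_learn_port 8002" >&2
fi
```

Only the ``host`` part belongs in DNS; ``default_host`` in ``project.yml`` stays a bare name without a port.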
The environment also provides SSH instructions through the ``Access`` page:
.. figure:: ../../../resources/brev_ssh.png
:alt: Brev Access page showing Brev CLI install, login, and shell commands.
Use the Brev CLI commands shown in the UI to install the CLI, log in, and
open a shell on the Kubernetes environment.
Install and authenticate the Brev CLI on your local workstation if it is not
already available:
.. code-block:: shell
sudo bash -c "$(curl -fsSL https://raw.githubusercontent.com/brevdev/brev-cli/main/bin/install-latest.sh)"
brev login
Set environment variables on your local workstation for the rest of the guide:
.. code-block:: shell
export SERVER_BREV=nvflare-server-k8s
export CLIENT_BREV=nvflare-site-1-k8s
export NAMESPACE=nvflare
export SERVER_HOST=server1.example.com
export IMAGE=registry.example.com/nvflare:dev
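Because later commands silently expand unset variables to empty strings, it can help to check the variables before continuing. The helper below is a small sketch, not part of NVFlare or the Brev CLI; ``require_vars`` is a hypothetical name introduced here for illustration.

```shell
# require_vars NAME...: report every listed variable that is unset or empty.
# Returns 0 when all are set, 1 otherwise. Written for POSIX sh.
require_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Variable names are the ones exported in this guide.
if require_vars SERVER_BREV CLIENT_BREV NAMESPACE SERVER_HOST IMAGE; then
    echo "all deployment variables are set"
fi
```

The same check can be repeated inside each Brev shell, where the deployment sections export ``NAMESPACE``, ``IMAGE``, and ``SERVER_HOST`` again.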
Verify that you can SSH to both Brev Kubernetes environments:
.. code-block:: shell
brev shell "$SERVER_BREV"
exit
brev shell "$CLIENT_BREV"
exit
Inside each Brev Kubernetes environment, ``kubectl`` and ``helm`` should already
be configured for the local single-node cluster. You can verify this after SSH:
.. code-block:: shell
kubectl get nodes
kubectl get storageclass
.. _brev_build_push_flare_image:
Build and Push the FLARE Image
==============================
Build the FLARE runtime image from an NVFlare source checkout and push it to a
registry that both Brev Kubernetes clusters can pull from.
The ``ServerK8sJobLauncher`` and ``ClientK8sJobLauncher`` use the Kubernetes
Python client from inside the running FLARE container. If you use a custom
Dockerfile, install the dependency in the image:
.. code-block:: dockerfile
RUN pip install kubernetes
The repository ``docker/Dockerfile.parent`` already installs the NVFlare
``K8S`` extra, which includes this dependency. Keep that install line, or add
the explicit ``pip install kubernetes`` line above before building your image.
.. code-block:: shell
docker build -t "$IMAGE" -f docker/Dockerfile.parent .
docker push "$IMAGE"
If the registry is private, make sure both clusters can pull the image. Depending
on your registry and cluster configuration, this can mean configuring node-level
registry credentials or adding Kubernetes image pull secrets. The generated
chart does not add ``imagePullSecrets`` by default, so use a registry already
trusted by the nodes or customize the chart for your environment.
Edit project.yml
================
Generate a sample project file if you do not already have one:
.. code-block:: shell
nvflare provision -g
Edit ``project.yml`` with these deployment-specific goals:
#. Define only one client, ``site-1``.
#. Set the server ``default_host`` to the stable external DNS name that the
client cluster will use.
#. Include the same DNS name in ``host_names`` so the server certificate is
valid for that endpoint.
#. Leave ``admin_port`` unset so it defaults to ``fed_learn_port``. The Brev
server only needs to expose the ``fed_learn_port`` value.
#. Use ``nvflare deploy prepare`` after provisioning to generate Kubernetes
runtime files from the server and client startup kits.
#. Use a container image that both clusters can pull in the deploy prepare
runtime config.
Example:
.. code-block:: yaml

   api_version: 3
   name: example_project
   description: NVFlare Brev Kubernetes Helm deployment

   participants:
     - name: server1
       type: server
       org: nvidia
       default_host: server1.example.com
       host_names:
         - server1
         - server1.example.com
       fed_learn_port: 8002
     - name: site-1
       type: client
       org: nvidia
     - name: admin@nvidia.com
       type: admin
       org: nvidia
       role: project_admin

   builders:
     - path: nvflare.lighter.impl.workspace.WorkspaceBuilder
       args:
         template_file:
           - master_template.yml
     - path: nvflare.lighter.impl.static_file.StaticFileBuilder
       args:
         config_folder: config
         scheme: tcp
     - path: nvflare.lighter.impl.cert.CertBuilder
     - path: nvflare.lighter.impl.signature.SignatureBuilder
The value of ``default_host`` must be chosen before provisioning because it is
written into startup configuration and server certificates. Use a stable DNS
name that you control, such as ``server1.example.com``, in ``project.yml`` and
point that DNS name to the Brev server environment's exposed host after you
enable port access.
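Before provisioning, a quick consistency check can catch a ``default_host`` that is missing from ``host_names``. The sketch below is a line-based check with ``awk``, not a YAML parser, and it assumes the simple layout used in the example file (one ``default_host`` key and a ``host_names`` list with one ``- value`` per line). ``check_default_host`` is a hypothetical helper name.

```shell
# check_default_host FILE: succeed when the default_host value also appears
# as an item of the host_names list in FILE.
check_default_host() {
    file=$1
    host=$(awk '$1 == "default_host:" {print $2; exit}' "$file")
    if [ -z "$host" ]; then
        echo "no default_host found in $file" >&2
        return 2
    fi
    # Track whether the current line is inside the host_names list.
    if awk -v h="$host" '
        $1 == "host_names:" {in_list = 1; next}
        in_list && $1 == "-" {if ($2 == h) found = 1; next}
        {in_list = 0}
        END {exit !found}
    ' "$file"; then
        echo "ok: $host is listed in host_names"
    else
        echo "missing: $host is not listed in host_names" >&2
        return 1
    fi
}

# Run the check only when a project file is present in the current directory.
if [ -f project.yml ]; then
    check_default_host project.yml
fi
```

A failed check means the server certificate would not cover the name that clients dial, so fix ``project.yml`` before running ``nvflare provision``.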
The generated server and client charts mount only the configured
``workspace_pvc``. In this guide, that PVC is ``nvflws`` and it is mounted at
``/var/tmp/nvflare/workspace``. Create separate data PVCs, such as
``nvfldata``, only for launched Kubernetes job pods that need study data.
Run Provisioning
================
Run the provision command:
.. code-block:: shell
nvflare provision -p project.yml -w /tmp/nvflare/provision
Set ``PROD_DIR`` to the generated production folder:
.. code-block:: shell
PROJECT_NAME=$(grep '^name:' project.yml | awk '{print $2}')
PROD_DIR=$(find "/tmp/nvflare/provision/${PROJECT_NAME}" \
-maxdepth 1 -type d -name 'prod_*' | sort | tail -n 1)
if [ -z "$PROD_DIR" ]; then
echo "No prod_* folder found for project '${PROJECT_NAME}'" >&2
exit 1
fi
echo "$PROD_DIR"
Prepare the server and client startup kits for Kubernetes:
.. code-block:: shell

   cat >/tmp/nvflare-k8s.yaml <<'EOF'
   runtime: k8s
   namespace: nvflare
   parent:
     docker_image: registry.example.com/nvflare:dev
     parent_port: 8102
     workspace_pvc: nvflws
     workspace_mount_path: /var/tmp/nvflare/workspace
     python_path: /usr/local/bin/python3
   job_launcher:
     config_file_path:
     default_python_path: /usr/local/bin/python3
     pending_timeout: 300
   EOF

   nvflare deploy prepare "$PROD_DIR/server1" \
     --output /tmp/nvflare-prepared/server1 --config /tmp/nvflare-k8s.yaml
   nvflare deploy prepare "$PROD_DIR/site-1" \
     --output /tmp/nvflare-prepared/site-1 --config /tmp/nvflare-k8s.yaml
The example above only sets the keys this guide needs. ``parent`` also accepts
optional ``resources`` (parent pod CPU/memory requests and limits) and
``pod_security_context``, and ``job_launcher`` accepts optional
``job_pod_security_context``. See :ref:`deploy_prepare_command` for the full
runtime config schema and :ref:`helm_chart` for how the prepared chart is
installed.
The prepared folders should contain one ``helm_chart`` directory under the
server and client:
.. code-block:: shell
ls /tmp/nvflare-prepared/server1/helm_chart
ls /tmp/nvflare-prepared/site-1/helm_chart
Each participant folder has this structure:
.. code-block:: text

   server1/
     helm_chart/
       Chart.yaml
       values.yaml
       templates/
     local/
     startup/
     transfer/
During this step, ``nvflare deploy prepare`` updates
``local/resources.json.default`` to use the Kubernetes launcher, removes any
active ``local/resources.json`` override, updates runtime communication to use
the generated Kubernetes Service, creates a ``local/study_data.yaml`` template
when needed, removes the legacy ``startup/start.sh``, ``startup/sub_start.sh``,
and ``startup/stop_fl.sh`` scripts (the parent process is launched by the Helm
chart instead), and generates ``helm_chart/``. For server kits, it also
relocates the default ``job_manager`` and ``snapshot_persistor`` storage paths
under ``parent.workspace_mount_path``
(``/var/tmp/nvflare/workspace/jobs-storage`` and
``/var/tmp/nvflare/workspace/snapshot-storage``) so server job history and
snapshots persist on the workspace PVC. Do not edit the launcher in
``resources.json.default`` by hand after this step; change
``/tmp/nvflare-k8s.yaml`` and rerun ``nvflare deploy prepare`` instead.
If the input kit already configures a custom ``resource_manager``,
``resource_consumer``, or job launcher, ``nvflare deploy prepare`` prints a
warning and replaces those components with the runtime configuration shown
above.
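The effects described above can be spot-checked on the prepared folders before packaging them. The helper below is a sketch that only tests for the files and strings named in this section; ``check_prepared_kit`` is a hypothetical name, and the search for ``K8sJobLauncher`` assumes the launcher class name appears literally in ``resources.json.default``.

```shell
# check_prepared_kit DIR: verify a few visible results of the prepare step.
check_prepared_kit() {
    kit=$1
    status=0
    # The generated chart should exist.
    [ -f "$kit/helm_chart/Chart.yaml" ] ||
        { echo "missing: $kit/helm_chart/Chart.yaml" >&2; status=1; }
    # The default resources file should reference a Kubernetes job launcher.
    grep -q 'K8sJobLauncher' "$kit/local/resources.json.default" 2>/dev/null ||
        { echo "no K8s launcher in $kit/local/resources.json.default" >&2; status=1; }
    # The legacy start/stop scripts should have been removed.
    for script in start.sh sub_start.sh stop_fl.sh; do
        if [ -e "$kit/startup/$script" ]; then
            echo "unexpected legacy script: $kit/startup/$script" >&2
            status=1
        fi
    done
    [ "$status" -eq 0 ] && echo "ok: $kit"
    return "$status"
}

# Check whichever prepared folders exist on this machine.
for kit in /tmp/nvflare-prepared/server1 /tmp/nvflare-prepared/site-1; do
    if [ -d "$kit" ]; then
        check_prepared_kit "$kit"
    fi
done
```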
Copy Prepared Kits to Brev Environments
=======================================
Package the prepared server and client folders on your local workstation:
.. code-block:: shell
tar -czf /tmp/nvflare-server1.tgz -C /tmp/nvflare-prepared server1
tar -czf /tmp/nvflare-site-1.tgz -C /tmp/nvflare-prepared site-1
Use the ``Copy Files`` section of the Brev environment ``Access`` page, or run
the equivalent ``brev copy`` commands:
.. code-block:: shell
brev copy /tmp/nvflare-server1.tgz "$SERVER_BREV:/home/ubuntu/"
brev copy /tmp/nvflare-site-1.tgz "$CLIENT_BREV:/home/ubuntu/"
The archive contains the generated ``startup/``, ``local/``, and
``helm_chart/`` folders. The Helm chart is run from the Brev environment after
the archive is extracted. Only ``startup/`` and ``local/`` need to be staged in
the workspace PVC.
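Listing each archive before upload is a cheap way to confirm that it has the expected top-level layout. The following is a minimal sketch using ``tar -tzf``; ``check_kit_archive`` is a hypothetical helper, and the paths are the ones used in this guide.

```shell
# check_kit_archive ARCHIVE NAME: succeed when ARCHIVE contains
# NAME/startup/, NAME/local/, and NAME/helm_chart/ entries.
check_kit_archive() {
    archive=$1
    name=$2
    listing=$(tar -tzf "$archive") || return 2
    status=0
    for dir in startup local helm_chart; do
        if printf '%s\n' "$listing" | grep -q "^$name/$dir/"; then
            echo "found: $name/$dir/"
        else
            echo "missing: $name/$dir/" >&2
            status=1
        fi
    done
    return "$status"
}

# Check the archives created above, if they exist on this machine.
for kit in server1 site-1; do
    if [ -f "/tmp/nvflare-$kit.tgz" ]; then
        check_kit_archive "/tmp/nvflare-$kit.tgz" "$kit"
    fi
done
```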
Deploy the Server Environment
=============================
Open a shell on the server Brev environment:
.. code-block:: shell
brev shell "$SERVER_BREV"
Run the rest of this section from inside the server environment. First extract
the uploaded archive and set deployment variables:
.. code-block:: shell
export NAMESPACE=nvflare
export IMAGE=registry.example.com/nvflare:dev
mkdir -p ~/nvflare
tar -xzf ~/nvflare-server1.tgz -C ~/nvflare
kubectl get nodes
helm version
Create the namespace and PVCs. The generated server chart requires the
``nvflws`` workspace PVC. The ``nvfldata`` PVC is used later only by launched
Kubernetes job pods that need study data:
.. code-block:: shell

   kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -

   cat > ~/nvflare/nvflare-pvcs.yaml <<'EOF'
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: nvflws
   spec:
     accessModes:
       - ReadWriteOnce
     resources:
       requests:
         storage: 10Gi
   ---
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: nvfldata
   spec:
     accessModes:
       - ReadWriteOnce
     resources:
       requests:
         storage: 50Gi
   EOF

   kubectl -n "$NAMESPACE" apply -f ~/nvflare/nvflare-pvcs.yaml
   kubectl -n "$NAMESPACE" get pvc
If your Brev Kubernetes environment does not have a default storage class, add
``storageClassName: <storage-class-name>`` under each PVC ``spec``.
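One way to keep both variants of the manifest in sync is to generate it from a small shell function. This is a sketch only: ``render_pvc`` is a hypothetical helper, and ``local-path`` is a placeholder class name, not a statement about which storage class Brev installs.

```shell
# render_pvc NAME SIZE [STORAGE_CLASS]: print a PVC manifest on stdout.
# storageClassName is emitted only when STORAGE_CLASS is non-empty.
render_pvc() {
    name=$1
    size=$2
    storage_class=${3:-}
    cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $name
spec:
EOF
    if [ -n "$storage_class" ]; then
        echo "  storageClassName: $storage_class"
    fi
    cat <<EOF
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: $size
EOF
}

# "local-path" is a placeholder; use a name listed by `kubectl get storageclass`.
render_pvc nvflws 10Gi local-path
echo "---"
render_pvc nvfldata 50Gi local-path
```

The output can be redirected to ``~/nvflare/nvflare-pvcs.yaml`` and applied with the same ``kubectl apply`` command. Omitting the third argument produces the manifest that relies on the default storage class.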
The server folder is already prepared for Kubernetes. Its
``local/resources.json.default`` contains ``ServerK8sJobLauncher`` with
``namespace: nvflare``, ``default_python_path: /usr/local/bin/python3``,
``pending_timeout: 300``, and ``workspace_mount_path:
/var/tmp/nvflare/workspace`` from ``/tmp/nvflare-k8s.yaml``. The same namespace
must be used for the Helm release because the launcher creates dynamic job pods
in that namespace.
Copy the prepared server ``startup/`` and ``local/`` directories into the
``nvflws`` PVC. The chart starts the server with
``-m /var/tmp/nvflare/workspace``, so the PVC root must contain ``startup/``
and ``local/`` directly.
.. code-block:: shell

   cat > ~/nvflare/copy-to-pvcs.yaml <<'EOF'
   apiVersion: v1
   kind: Pod
   metadata:
     name: nvflare-pvc-copy
   spec:
     restartPolicy: Never
     containers:
       - name: copy
         image: busybox:1.36
         command:
           - sh
           - -c
           - sleep 3600
         volumeMounts:
           - name: nvflws
             mountPath: /mnt/nvflws
     volumes:
       - name: nvflws
         persistentVolumeClaim:
           claimName: nvflws
   EOF

   kubectl -n "$NAMESPACE" delete pod nvflare-pvc-copy --ignore-not-found=true
   kubectl -n "$NAMESPACE" apply -f ~/nvflare/copy-to-pvcs.yaml
   kubectl -n "$NAMESPACE" wait \
     --for=condition=Ready pod/nvflare-pvc-copy --timeout=120s

   kubectl -n "$NAMESPACE" exec nvflare-pvc-copy -- \
     rm -rf /mnt/nvflws/startup /mnt/nvflws/local
   kubectl -n "$NAMESPACE" cp ~/nvflare/server1/startup nvflare-pvc-copy:/mnt/nvflws/startup
   kubectl -n "$NAMESPACE" cp ~/nvflare/server1/local nvflare-pvc-copy:/mnt/nvflws/local
   kubectl -n "$NAMESPACE" exec nvflare-pvc-copy -- \
     ls -la /mnt/nvflws/startup /mnt/nvflws/local

   kubectl -n "$NAMESPACE" delete pod nvflare-pvc-copy
Copy ``startup/`` and ``local/`` directly into the PVC root. If the PVC root
only contains a nested ``server1/`` directory, the server pod will not find
``/var/tmp/nvflare/workspace/startup`` and
``/var/tmp/nvflare/workspace/local``.
Install the server Helm chart. Set ``hostPortEnabled=true`` so the server pod
binds ``fed_learn_port`` ``8002`` on the Brev host. This is the port exposed in
the Brev ``Using Ports`` UI.
.. code-block:: shell
helm upgrade --install server1 ~/nvflare/server1/helm_chart \
--namespace "$NAMESPACE" \
--set image.repository="${IMAGE%:*}" \
--set image.tag="${IMAGE##*:}" \
--set service.type=ClusterIP \
--set hostPortEnabled=true
kubectl -n "$NAMESPACE" rollout status deployment/server1 --timeout=300s
kubectl -n "$NAMESPACE" get pods
kubectl -n "$NAMESPACE" logs deploy/server1
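The ``${IMAGE%:*}`` and ``${IMAGE##*:}`` expansions above split at the last colon, which works for ``registry.example.com/nvflare:dev`` but would misread an untagged reference whose registry includes a port. A slightly more defensive split might look like the sketch below. ``split_image`` is a hypothetical helper, the fallback to ``latest`` is an assumption based on the usual container default, and digest references (``@sha256:...``) are not handled.

```shell
# split_image REF: print "<repository> <tag>" for an image reference.
# A colon only counts as a tag separator when it appears after the last
# slash, so a registry port such as registry.example.com:5000 is preserved.
split_image() {
    ref=$1
    last=${ref##*/}    # final path component, for example nvflare:dev
    case $last in
        *:*) echo "${ref%:*} ${last##*:}" ;;
        *)   echo "$ref latest" ;;    # no tag given; assume "latest"
    esac
}

split_image registry.example.com/nvflare:dev
split_image registry.example.com:5000/nvflare
split_image registry.example.com:5000/nvflare:dev
```

The two printed fields correspond to the values passed as ``image.repository`` and ``image.tag`` in the ``helm upgrade --install`` command.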
Deploy the site-1 Environment
=============================
Open a shell on the client Brev environment:
.. code-block:: shell
brev shell "$CLIENT_BREV"
Run the rest of this section from inside the client environment. First extract
the uploaded archive and set deployment variables:
.. code-block:: shell
export NAMESPACE=nvflare
export IMAGE=registry.example.com/nvflare:dev
export SERVER_HOST=server1.example.com
mkdir -p ~/nvflare
tar -xzf ~/nvflare-site-1.tgz -C ~/nvflare
kubectl get nodes
helm version
Create the namespace and PVCs. The generated client chart requires the
``nvflws`` workspace PVC. The ``nvfldata`` PVC is used later only by launched
Kubernetes job pods that need study data:
.. code-block:: shell

   kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -

   cat > ~/nvflare/nvflare-pvcs.yaml <<'EOF'
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: nvflws
   spec:
     accessModes:
       - ReadWriteOnce
     resources:
       requests:
         storage: 10Gi
   ---
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: nvfldata
   spec:
     accessModes:
       - ReadWriteOnce
     resources:
       requests:
         storage: 50Gi
   EOF

   kubectl -n "$NAMESPACE" apply -f ~/nvflare/nvflare-pvcs.yaml
   kubectl -n "$NAMESPACE" get pvc
The ``site-1`` folder is already prepared for Kubernetes. Its
``local/resources.json.default`` contains ``ClientK8sJobLauncher`` with the
same launcher settings from ``/tmp/nvflare-k8s.yaml``. Keep the Helm namespace
consistent with the ``namespace`` value used by ``nvflare deploy prepare``.
Copy the prepared ``site-1`` ``startup/`` and ``local/`` directories into the
client ``nvflws`` PVC:
.. code-block:: shell

   cat > ~/nvflare/copy-to-pvcs.yaml <<'EOF'
   apiVersion: v1
   kind: Pod
   metadata:
     name: nvflare-pvc-copy
   spec:
     restartPolicy: Never
     containers:
       - name: copy
         image: busybox:1.36
         command:
           - sh
           - -c
           - sleep 3600
         volumeMounts:
           - name: nvflws
             mountPath: /mnt/nvflws
     volumes:
       - name: nvflws
         persistentVolumeClaim:
           claimName: nvflws
   EOF

   kubectl -n "$NAMESPACE" delete pod nvflare-pvc-copy --ignore-not-found=true
   kubectl -n "$NAMESPACE" apply -f ~/nvflare/copy-to-pvcs.yaml
   kubectl -n "$NAMESPACE" wait \
     --for=condition=Ready pod/nvflare-pvc-copy --timeout=120s

   kubectl -n "$NAMESPACE" exec nvflare-pvc-copy -- \
     rm -rf /mnt/nvflws/startup /mnt/nvflws/local
   kubectl -n "$NAMESPACE" cp ~/nvflare/site-1/startup nvflare-pvc-copy:/mnt/nvflws/startup
   kubectl -n "$NAMESPACE" cp ~/nvflare/site-1/local nvflare-pvc-copy:/mnt/nvflws/local
   kubectl -n "$NAMESPACE" exec nvflare-pvc-copy -- \
     ls -la /mnt/nvflws/startup /mnt/nvflws/local

   kubectl -n "$NAMESPACE" delete pod nvflare-pvc-copy
Before installing the client chart, verify that the client environment can
resolve the server host:
.. code-block:: shell
kubectl -n "$NAMESPACE" run dns-test --rm -it \
--image=busybox:1.36 -- \
nslookup "$SERVER_HOST"
Install the ``site-1`` Helm chart:
.. code-block:: shell
helm upgrade --install site-1 ~/nvflare/site-1/helm_chart \
--namespace "$NAMESPACE" \
--set image.repository="${IMAGE%:*}" \
--set image.tag="${IMAGE##*:}"
kubectl -n "$NAMESPACE" rollout status deployment/site-1 --timeout=300s
kubectl -n "$NAMESPACE" get pods
kubectl -n "$NAMESPACE" logs deploy/site-1
If you reprovision later, back up or remove old PVC contents before copying the
new folders. Certificates, local config, and communication settings are tied to
the provisioned project state.
Connect an Admin Console
========================
Run the admin client from a network location that can reach
``server1.example.com:8002``:
.. code-block:: shell
cd "$PROD_DIR/admin@nvidia.com/startup"
./fl_admin.sh
The generated admin kit connects to the server host configured in
``project.yml``. If you used ``server1.example.com`` as ``default_host``, that
name must resolve to the Brev server environment endpoint.
Kubernetes Job Pods and nvfldata
================================
``nvflare deploy prepare`` writes the Kubernetes launcher into
``local/resources.json.default`` before the participant folders are copied to
Brev. The generated launcher config sets ``study_data_pvc_file_path`` to:
.. code-block:: text
/var/tmp/nvflare/workspace/local/study_data.yaml
When launched job pods need the ``nvfldata`` PVC, edit
``local/study_data.yaml`` in the prepared server and client folders before
copying those folders into ``nvflws``. This example maps the ``default`` study's
``data`` dataset to ``nvfldata``:
.. code-block:: yaml

   default:
     data:
       source: nvfldata
       mode: rw
Job pod image, Python, CPU, memory, and ephemeral storage settings should be
specified in the submitted job's ``meta.json`` under ``launcher_spec`` for the
``k8s`` launcher. GPU resource requests such as ``num_of_gpus`` should be
specified under ``resource_spec``, matching :ref:`helm_chart`.
Troubleshooting
===============
PVC stays ``Pending``
---------------------
Check that the Brev cluster has a default storage class, or add an explicit
``storageClassName`` to ``nvflare-pvcs.yaml``:
.. code-block:: shell
kubectl get storageclass
kubectl -n "$NAMESPACE" describe pvc nvflws
Pod has ``ImagePullBackOff``
----------------------------
Confirm the image exists and that both clusters can pull it:
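.. code-block:: shell

   # On the workstation that built the image: push it again if it is missing.
   docker push "$IMAGE"

   # Inside the affected Brev environment: read the pull error in the pod events.
   kubectl -n "$NAMESPACE" describe pod -l app.kubernetes.io/name=server1
   kubectl -n "$NAMESPACE" describe pod -l app.kubernetes.io/name=site-1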
Server pod cannot find ``startup`` or ``local``
-----------------------------------------------
The participant folder was copied to the wrong level in the PVC. The server
workspace root must contain:
.. code-block:: text
/var/tmp/nvflare/workspace/startup
/var/tmp/nvflare/workspace/local
Use the helper pod to inspect ``/mnt/nvflws`` and restage ``startup/`` and
``local/`` from the extracted prepared folder, such as
``~/nvflare/server1/startup`` and ``~/nvflare/server1/local``, if needed.
site-1 cannot connect to the server
-----------------------------------
Verify these items:
* ``default_host`` in ``project.yml`` matches the DNS name used by the client.
* The DNS name resolves from the client cluster.
* The server cluster exposes TCP port ``8002``.
* The server certificate includes the DNS name in ``host_names``.
Run a DNS check from the client cluster:
.. code-block:: shell
kubectl -n "$NAMESPACE" run dns-test --rm -it \
--image=busybox:1.36 -- \
nslookup "$SERVER_HOST"
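DNS resolution alone does not prove that the exposed port accepts connections. A TCP probe from the client Brev shell (on the host, not inside a pod) can be sketched as follows, assuming ``bash`` and the coreutils ``timeout`` command are available there. ``check_tcp`` is a hypothetical helper name.

```shell
# check_tcp HOST PORT: succeed when a TCP connection opens within 5 seconds.
# Relies on bash's /dev/tcp redirection and the coreutils `timeout` command.
check_tcp() {
    if timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null; then
        echo "reachable: $1:$2"
    else
        echo "unreachable: $1:$2" >&2
        return 1
    fi
}

# Probe the server endpoint used in this guide.
check_tcp "${SERVER_HOST:-server1.example.com}" 8002 ||
    echo "check DNS, the exposed Brev port, and the server pod status" >&2
```

A successful probe only shows that something is listening on the port; certificate or host name mismatches still surface in the ``site-1`` pod logs.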
If you change ``default_host`` or ``host_names``, reprovision, restage the
updated folders, and redeploy the charts.
Cleanup
========
Remove the Helm releases:
.. code-block:: shell
# Run inside the server Brev environment.
helm uninstall server1 -n "$NAMESPACE"
# Run inside the site-1 Brev environment.
helm uninstall site-1 -n "$NAMESPACE"
Delete the namespaces and PVCs:
.. code-block:: shell
# Run inside each Brev environment.
kubectl delete namespace "$NAMESPACE"
Delete the Brev clusters from the web UI when you no longer need them:
#. Open the Brev console.
#. Open the page that lists your environments, such as the ``GPUs`` page used
   to create them.
#. Select ``nvflare-server-k8s`` and delete it.
#. Select ``nvflare-site-1-k8s`` and delete it.
#. Confirm in the billing or usage page that the resources are no longer
running.