Helm Chart for NVIDIA FLARE

Sometimes, users would like to deploy NVIDIA FLARE to an existing Kubernetes cluster. The provisioning tool now includes a new builder, HelmChartBuilder, that can generate a reference Helm Chart for users to deploy NVIDIA FLARE to a local microk8s Kubernetes instance.

Note

The generated Helm Chart is a starting point and serves as a reference. Depending on the Kubernetes cluster, users may need to modify it and/or perform additional operations to successfully deploy the chart.

Note

The following document assumes users have microk8s (commonly bundled with Ubuntu Server 20.04 and above) running on their local machine. With the helm chart, users are able to start the overseer and servers in the k8s cluster after provisioning. The clients and admin console can then connect to the overseer and servers in the k8s cluster.

Update on provisioning tool

In order to generate the helm chart, add the HelmChartBuilder to the project.yml file.

- path: nvflare.lighter.impl.helm_chart.HelmChartBuilder
  args:
    docker_image: localhost:32000/nvfl-min:0.0.1

The docker_image is the actual image used for all pods running in the Kubernetes cluster. The provisioners have to build it separately and make sure it is available to the k8s cluster. For microk8s, enable the built-in docker registry server by running this:

microk8s enable registry

This will create a registry server listening on port 32000.
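
For example, assuming a minimal Dockerfile that installs NVIDIA FLARE (the Dockerfile below is only a hypothetical sketch; use whatever base image and dependencies your jobs need), the image can be built and pushed to that registry like this:

$ cat Dockerfile
FROM python:3.8
RUN pip install nvflare
$ docker build -t localhost:32000/nvfl-min:0.0.1 .
$ docker push localhost:32000/nvfl-min:0.0.1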

Provisioning results

Run the provision command as usual, either in the new format nvflare provision or just provision.
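
For example, assuming your project file is named project.yml:

$ nvflare provision -p project.yml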

After the command completes, there should be a folder with a structure similar to the following:

$ tree -L 1
.
├── admin@nvidia.com
├── compose.yaml
├── nvflare_compose
├── nvflare_hc
├── overseer
├── server1
├── server2
├── site-1
└── site-2

8 directories, 1 file

Note the nvflare_hc folder: this is the Helm Chart package.
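
Its layout follows the standard Helm chart convention; the exact contents depend on your project.yml and NVIDIA FLARE version, but it should look roughly like this:

$ tree nvflare_hc
nvflare_hc
├── Chart.yaml
├── templates
│   └── ...
└── values.yaml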

Preparing microk8s

Enabling microk8s addons

The NVIDIA FLARE Helm Chart depends on a few services (aka addons in microk8s) provided by the Kubernetes cluster. Please check that they are enabled.

$ microk8s status
microk8s is running
high-availability: no
datastore master nodes: 127.0.0.1:19001
datastore standby nodes: none
addons:
enabled:
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm3                # (core) Helm 3 - Kubernetes package manager
    hostpath-storage     # (core) Storage class; allocates storage from host directory
    ingress              # (core) Ingress controller for external access
    registry             # (core) Private image registry exposed on localhost:32000
    storage              # (core) Alias to hostpath-storage add-on, deprecated
disabled:
    community            # (core) The community addons repository
    dashboard            # (core) The Kubernetes dashboard
    gpu                  # (core) Automatic enablement of Nvidia CUDA
    helm                 # (core) Helm 2 - the package manager for Kubernetes
    host-access          # (core) Allow Pods connecting to Host services smoothly
    mayastor             # (core) OpenEBS MayaStor
    metallb              # (core) Loadbalancer for your Kubernetes cluster
    metrics-server       # (core) K8s Metrics Server for API access to service metrics
    prometheus           # (core) Prometheus operator for monitoring and logging
    rbac                 # (core) Role-Based Access Control for authorisation

If any of the addons listed under enabled above are not enabled in your environment, please enable them. The following example shows how to enable the helm3 addon.

$ microk8s enable helm3
Infer repository core for addon helm3
Enabling Helm 3
Fetching helm version v3.8.0.
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100 12.9M  100 12.9M    0     0  11.5M      0  0:00:01  0:00:01 --:--:-- 11.5M
Helm 3 is enabled
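
Several addons can also be enabled in a single command, for example:

$ microk8s enable dns ingress registry hostpath-storage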

Allowing network traffic

We have to change the cluster to allow incoming network traffic, such as traffic from admin consoles and NVIDIA FLARE clients, to enter the cluster. After the network traffic enters the cluster, the cluster also needs to know how to route the traffic to the deployed services.

Users have to enable the ingress controller and modify some configuration of the microk8s cluster.

Complete the following steps to enable microk8s to open and route network traffic to the overseer and servers.

Edit the ingress ConfigMap to route traffic

$ microk8s kubectl edit cm nginx-ingress-tcp-microk8s-conf -n ingress

Add this section to the ConfigMap:

data:
    "8002": default/server1:8002
    "8003": default/server1:8003
    "8102": default/server2:8102
    "8103": default/server2:8103
    "8443": default/overseer:8443

Edit the ingress DaemonSet to open ports

$ microk8s kubectl edit ds nginx-ingress-microk8s-controller -n ingress

Add this section under spec.template.spec.containers[0].ports:

- containerPort: 8443
  hostPort: 8443
  name: overseer
  protocol: TCP
- containerPort: 8002
  hostPort: 8002
  name: server1fl
  protocol: TCP
- containerPort: 8003
  hostPort: 8003
  name: server1adm
  protocol: TCP
- containerPort: 8102
  hostPort: 8102
  name: server2fl
  protocol: TCP
- containerPort: 8103
  hostPort: 8103
  name: server2adm
  protocol: TCP
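
Saving the edit should trigger a rolling restart of the ingress controller pods so they pick up the new ports. You can wait for the rollout to finish with, for example:

$ microk8s kubectl rollout status ds/nginx-ingress-microk8s-controller -n ingress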

Installing helm chart

To install the helm chart in the microk8s environment, run the following commands in the same directory as in the previous section.

$ mkdir -p /tmp/nvflare
$ microk8s helm3 install --set workspace=$(pwd) --set svc-persist=/tmp/nvflare nvflare-helm-chart-demo nvflare_hc/

NAME: nvflare-helm-chart-demo
LAST DEPLOYED: Fri Sep 23 12:28:24 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

Here nvflare-helm-chart-demo is the release name we chose for this installed application. You can choose a different name that makes the deployed application easy to recognize.

The nvflare_hc/ folder is the one the provisioning tool generated, as shown in the previous section. You can take a look at the files in that folder and feel free to change them for your own environment.

Note

Here we use the host's /tmp/nvflare as the persistent storage space for all pods in microk8s. Please make sure that directory exists before running the above command.
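
To confirm the release was created, you can also list the installed Helm releases:

$ microk8s helm3 list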

Verifying NVIDIA FLARE is up and running

You can use kubectl to check the status of the NVIDIA FLARE application installed by the chart. For example, in the microk8s environment, run the following command to see whether the overseer and servers have started.

$ microk8s kubectl get pods
NAME                        READY   STATUS    RESTARTS       AGE
dnsutils                    1/1     Running   74 (13m ago)   62d
server1-7675668544-xvfvp    1/1     Running   0              4m50s
overseer-6f9dd66c97-n7bkd   1/1     Running   0              4m50s
server2-86bc4fc87f-s9n2s    1/1     Running   0              4m50s

The dnsutils pod is a built-in addon for the DNS service inside microk8s. You can ignore it.
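
If a pod is not in the Running state, or to confirm that the servers started cleanly, you can also check the container logs by the system label that the chart sets on each pod (visible in the describe output further below), for example:

$ microk8s kubectl logs -l system=server1 --tail=20
$ microk8s kubectl logs -l system=overseer --tail=20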

For more details on the pods inside the Kubernetes cluster, you can run the following command.

$ microk8s kubectl describe pods
Name:         dnsutils
Namespace:    default
Priority:     0
Node:         demolaptop/192.168.1.96
Start Time:   Fri, 22 Jul 2022 13:36:54 -0700
Labels:       <none>
Annotations:  cni.projectcalico.org/containerID: 9cfa2cfbb4ef7b11b10c5793965e2a42682dea5d0b05b4454b4232da9ded6a8e
            cni.projectcalico.org/podIP: 10.1.179.67/32
            cni.projectcalico.org/podIPs: 10.1.179.67/32
Status:       Running
IP:           10.1.179.67
IPs:
IP:  10.1.179.67
Containers:
dnsutils:
    Container ID:  containerd://3c31a42f9c5dc10452d2af0a503682cd78e25a4b078877f96a1174d1156a23a5
    Image:         k8s.gcr.io/e2e-test-images/jessie-dnsutils:1.3
    Image ID:      k8s.gcr.io/e2e-test-images/jessie-dnsutils@sha256:8b03e4185ecd305bc9b410faac15d486a3b1ef1946196d429245cdd3c7b152eb
    Port:          <none>
    Host Port:     <none>
    Command:
    sleep
    3600
    State:          Running
    Started:      Fri, 23 Sep 2022 12:19:55 -0700
    Last State:     Terminated
    Reason:       Unknown
    Exit Code:    255
    Started:      Thu, 18 Aug 2022 11:18:34 -0700
    Finished:     Fri, 23 Sep 2022 12:19:25 -0700
    Ready:          True
    Restart Count:  74
    Environment:    <none>
    Mounts:
    /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f4sxs (ro)
Conditions:
Type              Status
Initialized       True
Ready             True
ContainersReady   True
PodScheduled      True
Volumes:
kube-api-access-f4sxs:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                            node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>


Name:         server1-7675668544-xvfvp
Namespace:    default
Priority:     0
Node:         demolaptop/192.168.1.96
Start Time:   Fri, 23 Sep 2022 12:28:25 -0700
Labels:       pod-template-hash=7675668544
            system=server1
Annotations:  cni.projectcalico.org/containerID: 7493a356143ad0c4e4fdbe781d995c01d52c4caa31e961066d4a8769dfa1d360
            cni.projectcalico.org/podIP: 10.1.179.94/32
            cni.projectcalico.org/podIPs: 10.1.179.94/32
Status:       Running
IP:           10.1.179.94
IPs:
IP:           10.1.179.94
Controlled By:  ReplicaSet/server1-7675668544
Containers:
server1:
    Container ID:  containerd://16928775549dbf9cb2d68eea6412e682a170f72b5dbcdbf8c56790c8b9a30fd5
    Image:         localhost:32000/nvfl-min:0.0.1
    Image ID:      localhost:32000/nvfl-min@sha256:71658dc82b15e6cd5a2580c78e56011d166a70e1ff098306c93584c82cb63821
    Ports:         8002/TCP, 8003/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
    /usr/local/bin/python3
    Args:
    -u
    -m
    nvflare.private.fed.app.server.server_train
    -m
    /workspace/server1
    -s
    fed_server.json
    --set
    secure_train=true
    config_folder=config
    State:          Running
    Started:      Fri, 23 Sep 2022 12:28:27 -0700
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
    /tmp/nvflare from svc-persist (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hkhhq (ro)
    /workspace from workspace (rw)
Conditions:
Type              Status
Initialized       True
Ready             True
ContainersReady   True
PodScheduled      True
Volumes:
workspace:
    Type:          HostPath (bare host directory volume)
    Path:          /home/nvflare/workspace/nvf_hc_test/demo
    HostPathType:  Directory
svc-persist:
    Type:          HostPath (bare host directory volume)
    Path:          /tmp/nvflare
    HostPathType:  Directory
kube-api-access-hkhhq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                            node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>


Name:         overseer-6f9dd66c97-n7bkd
Namespace:    default
Priority:     0
Node:         demolaptop/192.168.1.96
Start Time:   Fri, 23 Sep 2022 12:28:25 -0700
Labels:       pod-template-hash=6f9dd66c97
            system=overseer
Annotations:  cni.projectcalico.org/containerID: e9f6f2efb548c16217377eaaa8b79534a67e016277c3a0933d202d04904f46dc
            cni.projectcalico.org/podIP: 10.1.179.80/32
            cni.projectcalico.org/podIPs: 10.1.179.80/32
Status:       Running
IP:           10.1.179.80
IPs:
IP:           10.1.179.80
Controlled By:  ReplicaSet/overseer-6f9dd66c97
Containers:
overseer:
    Container ID:  containerd://82426e5e414b863fff1cc4c8963a3e18acd49ff1ccb51befaf5c984f3ad0f1a4
    Image:         localhost:32000/nvfl-min:0.0.1
    Image ID:      localhost:32000/nvfl-min@sha256:71658dc82b15e6cd5a2580c78e56011d166a70e1ff098306c93584c82cb63821
    Port:          8443/TCP
    Host Port:     0/TCP
    Command:
    /workspace/overseer/startup/start.sh
    State:          Running
    Started:      Fri, 23 Sep 2022 12:28:27 -0700
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
    /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dz7qz (ro)
    /workspace from workspace (rw)
Conditions:
Type              Status
Initialized       True
Ready             True
ContainersReady   True
PodScheduled      True
Volumes:
workspace:
    Type:          HostPath (bare host directory volume)
    Path:          /home/nvflare/workspace/nvf_hc_test/demo
    HostPathType:  Directory
kube-api-access-dz7qz:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                            node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>


Name:         server2-86bc4fc87f-s9n2s
Namespace:    default
Priority:     0
Node:         demolaptop/192.168.1.96
Start Time:   Fri, 23 Sep 2022 12:28:25 -0700
Labels:       pod-template-hash=86bc4fc87f
            system=server2
Annotations:  cni.projectcalico.org/containerID: 8ac76a0bfad2e4f0b1de9115f0d46c1a0dbacabb847c6160b1f144e82720fe99
            cni.projectcalico.org/podIP: 10.1.179.96/32
            cni.projectcalico.org/podIPs: 10.1.179.96/32
Status:       Running
IP:           10.1.179.96
IPs:
IP:           10.1.179.96
Controlled By:  ReplicaSet/server2-86bc4fc87f
Containers:
server2:
    Container ID:  containerd://c1e530fc6fc320d9b9388d81727440324cc11e0bb61e3b3e76a2362638f89357
    Image:         localhost:32000/nvfl-min:0.0.1
    Image ID:      localhost:32000/nvfl-min@sha256:71658dc82b15e6cd5a2580c78e56011d166a70e1ff098306c93584c82cb63821
    Ports:         8102/TCP, 8103/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
    /usr/local/bin/python3
    Args:
    -u
    -m
    nvflare.private.fed.app.server.server_train
    -m
    /workspace/server2
    -s
    fed_server.json
    --set
    secure_train=true
    config_folder=config
    State:          Running
    Started:      Fri, 23 Sep 2022 12:28:28 -0700
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
    /tmp/nvflare from svc-persist (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6cwbh (ro)
    /workspace from workspace (rw)
Conditions:
Type              Status
Initialized       True
Ready             True
ContainersReady   True
PodScheduled      True
Volumes:
workspace:
    Type:          HostPath (bare host directory volume)
    Path:          /home/nvflare/workspace/nvf_hc_test/demo
    HostPathType:  Directory
svc-persist:
    Type:          HostPath (bare host directory volume)
    Path:          /tmp/nvflare
    HostPathType:  Directory
kube-api-access-6cwbh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                            node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

Login with admin console

Now, on another terminal with nvflare installed and /etc/hosts modified to map overseer, server1, and server2 to the IP of the machine running the microk8s cluster, run fl_admin.sh from admin@nvidia.com/startup. Log in as admin@nvidia.com.

For example, /etc/hosts would be modified as follows (if microk8s is running at 192.168.1.123 and the clients and admin console are running on the slowdesktop machine):

$ cat /etc/hosts
127.0.0.1       localhost
127.0.1.1       slowdesktop
192.168.1.123 overseer server1 server2
# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
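
For example, on that machine, the admin console can then be started from the provisioned admin folder:

$ cd admin@nvidia.com/startup
$ ./fl_admin.sh

Then log in as admin@nvidia.com when prompted.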

Uninstalling helm chart

Users can uninstall the chart by running the following command (note that nvflare-helm-chart-demo is the release name we used when installing the chart):

$ microk8s helm3 uninstall nvflare-helm-chart-demo
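
If you also want to discard the persisted data, remove the host directory that was used as storage (only do this if nothing else uses it):

$ rm -rf /tmp/nvflare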