Helm Chart for NVIDIA FLARE¶
Sometimes, users would like to deploy NVIDIA FLARE to an existing Kubernetes cluster. The provisioning tool now includes a new builder, HelmChartBuilder, that can generate a reference Helm Chart for deploying NVIDIA FLARE to a local microk8s Kubernetes instance.
Note
The generated Helm Chart is a starting point and serves as a reference. Depending on the Kubernetes cluster, users may need to modify the chart and/or perform additional operations to successfully deploy it.
Note
The following document assumes users have microk8s (commonly bundled with Ubuntu Server 20.04 and above) running on their local machine. With the helm chart, users are able to start the overseer and servers in the k8s cluster after provisioning. The clients and admin console can then connect to the overseer and servers in the k8s cluster.
Update on provisioning tool¶
In order to generate the helm chart, add the HelmChartBuilder to the builders section of the project.yml file.
- path: nvflare.lighter.impl.helm_chart.HelmChartBuilder
args:
docker_image: localhost:32000/nvfl-min:0.0.1
The docker_image
is the actual image used for all pods running in the k8s cluster. The provisioners have
to build it separately and make sure it is available to the k8s cluster. For microk8s, enable the docker registry
server by running:
microk8s enable registry
This will create a registry server listening on port 32000.
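As a minimal sketch of how the image could be made available (the Dockerfile is an assumption here, not something generated by the provisioning tool), build an image that installs nvflare and push it to the microk8s registry:
$ docker build -t localhost:32000/nvfl-min:0.0.1 .
$ docker push localhost:32000/nvfl-min:0.0.1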
Provisioning results¶
Run the provision command as usual, either in the new format nvflare provision
or just provision
.
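For example, assuming the project file is named project.yml in the current directory:
$ nvflare provision -p project.yml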
After the command completes, there should be a folder with a structure similar to the following:
$ tree -L 1
.
├── admin@nvidia.com
├── compose.yaml
├── nvflare_compose
├── nvflare_hc
├── overseer
├── server1
├── server2
├── site-1
└── site-2
8 directories, 1 file
Note that there is an nvflare_hc folder. This folder is the Helm Chart package.
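You can inspect its contents, for example with:
$ tree nvflare_hc
A Helm chart contains at least a Chart.yaml and a templates directory holding the Kubernetes manifests, and usually a values.yaml with the configurable values (such as the workspace and svc-persist settings used during installation below).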
Preparing microk8s¶
Enabling microk8s addons¶
NVIDIA FLARE Helm Chart depends on a few services (aka addons in microk8s) provided by the Kubernetes cluster. Please check if they are enabled.
$ microk8s status
microk8s is running
high-availability: no
datastore master nodes: 127.0.0.1:19001
datastore standby nodes: none
addons:
enabled:
dns # (core) CoreDNS
ha-cluster # (core) Configure high availability on the current node
helm3 # (core) Helm 3 - Kubernetes package manager
hostpath-storage # (core) Storage class; allocates storage from host directory
ingress # (core) Ingress controller for external access
registry # (core) Private image registry exposed on localhost:32000
storage # (core) Alias to hostpath-storage add-on, deprecated
disabled:
community # (core) The community addons repository
dashboard # (core) The Kubernetes dashboard
gpu # (core) Automatic enablement of Nvidia CUDA
helm # (core) Helm 2 - the package manager for Kubernetes
host-access # (core) Allow Pods connecting to Host services smoothly
mayastor # (core) OpenEBS MayaStor
metallb # (core) Loadbalancer for your Kubernetes cluster
metrics-server # (core) K8s Metrics Server for API access to service metrics
prometheus # (core) Prometheus operator for monitoring and logging
rbac # (core) Role-Based Access Control for authorisation
If any of the services shown as enabled above is not enabled in your environment, please enable it. The following example shows how to enable the helm3 addon.
$ microk8s enable helm3
Infer repository core for addon helm3
Enabling Helm 3
Fetching helm version v3.8.0.
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12.9M 100 12.9M 0 0 11.5M 0 0:00:01 0:00:01 --:--:-- 11.5M
Helm 3 is enabled
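If several addons are missing, they can be enabled in one command, for example:
$ microk8s enable dns hostpath-storage ingress registry helm3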
Allowing network traffic¶
We have to change the cluster to allow incoming network traffic, such as that from admin consoles and NVIDIA FLARE clients, to enter the cluster. After the network traffic enters the cluster, the cluster also needs to know how to route the traffic to the deployed services.
Users have to enable the ingress controller and modify some configuration of the microk8s cluster.
Complete the following steps to enable microk8s to open and route network traffic to the overseer and servers.
Edit configmap of ingress to route traffic¶
$ microk8s kubectl edit cm nginx-ingress-tcp-microk8s-conf -n ingress
Add this section to the configmap:
data:
"8002": default/server1:8002
"8003": default/server1:8003
"8102": default/server2:8102
"8103": default/server2:8103
"8443": default/overseer:8443
Edit DaemonSet of ingress to open ports¶
$ microk8s kubectl edit ds nginx-ingress-microk8s-controller -n ingress
Add this section under spec.template.spec.containers[0].ports:
- containerPort: 8443
hostPort: 8443
name: overseer
protocol: TCP
- containerPort: 8002
hostPort: 8002
name: server1fl
protocol: TCP
- containerPort: 8003
hostPort: 8003
name: server1adm
protocol: TCP
- containerPort: 8102
hostPort: 8102
name: server2fl
protocol: TCP
- containerPort: 8103
hostPort: 8103
name: server2adm
protocol: TCP
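To confirm the ports were added, you can inspect the DaemonSet; the output should include the host ports configured above (8443, 8002, 8003, 8102, 8103):
$ microk8s kubectl get ds nginx-ingress-microk8s-controller -n ingress \
  -o jsonpath='{.spec.template.spec.containers[0].ports[*].hostPort}'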
Installing helm chart¶
To install the helm chart in the microk8s environment, run the following commands in the same directory as the previous section.
$ mkdir -p /tmp/nvflare
$ microk8s helm3 install --set workspace=$(pwd) --set svc-persist=/tmp/nvflare nvflare-helm-chart-demo nvflare_hc/
NAME: nvflare-helm-chart-demo
LAST DEPLOYED: Fri Sep 23 12:28:24 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
Here nvflare-helm-chart-demo
is the name we chose for this installed application. You can choose a different name so
that it is easy to recognize the deployed application.
The nvflare_hc/
folder is the one the provisioning tool generated, as shown in the previous section. You can take a look at the files in
that folder and feel free to change them for your own environment.
Note
Here we use the host's /tmp/nvflare as the persistent storage space for all pods in microk8s. Please make sure that directory exists before running the above command.
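You can confirm the release was deployed with helm itself, for example:
$ microk8s helm3 list
$ microk8s helm3 status nvflare-helm-chart-demo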
Verifying NVIDIA FLARE is up and running¶
You can use kubectl
to check the status of the NVIDIA FLARE application installed by the chart. For example, in the
microk8s environment, run the following command to see if the overseer and servers are started.
$ microk8s kubectl get pods
NAME READY STATUS RESTARTS AGE
dnsutils 1/1 Running 74 (13m ago) 62d
server1-7675668544-xvfvp 1/1 Running 0 4m50s
overseer-6f9dd66c97-n7bkd 1/1 Running 0 4m50s
server2-86bc4fc87f-s9n2s 1/1 Running 0 4m50s
The dnsutils
pod is a DNS utility pod inside microk8s and is not part of NVIDIA FLARE. You can ignore it.
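You can also check that the corresponding Kubernetes services were created (the service names and ports come from the generated chart):
$ microk8s kubectl get svc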
For more details on the pods inside the Kubernetes cluster, you can run the following command.
$ microk8s kubectl describe pods
Name: dnsutils
Namespace: default
Priority: 0
Node: demolaptop/192.168.1.96
Start Time: Fri, 22 Jul 2022 13:36:54 -0700
Labels: <none>
Annotations: cni.projectcalico.org/containerID: 9cfa2cfbb4ef7b11b10c5793965e2a42682dea5d0b05b4454b4232da9ded6a8e
cni.projectcalico.org/podIP: 10.1.179.67/32
cni.projectcalico.org/podIPs: 10.1.179.67/32
Status: Running
IP: 10.1.179.67
IPs:
IP: 10.1.179.67
Containers:
dnsutils:
Container ID: containerd://3c31a42f9c5dc10452d2af0a503682cd78e25a4b078877f96a1174d1156a23a5
Image: k8s.gcr.io/e2e-test-images/jessie-dnsutils:1.3
Image ID: k8s.gcr.io/e2e-test-images/jessie-dnsutils@sha256:8b03e4185ecd305bc9b410faac15d486a3b1ef1946196d429245cdd3c7b152eb
Port: <none>
Host Port: <none>
Command:
sleep
3600
State: Running
Started: Fri, 23 Sep 2022 12:19:55 -0700
Last State: Terminated
Reason: Unknown
Exit Code: 255
Started: Thu, 18 Aug 2022 11:18:34 -0700
Finished: Fri, 23 Sep 2022 12:19:25 -0700
Ready: True
Restart Count: 74
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f4sxs (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-f4sxs:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Name: server1-7675668544-xvfvp
Namespace: default
Priority: 0
Node: demolaptop/192.168.1.96
Start Time: Fri, 23 Sep 2022 12:28:25 -0700
Labels: pod-template-hash=7675668544
system=server1
Annotations: cni.projectcalico.org/containerID: 7493a356143ad0c4e4fdbe781d995c01d52c4caa31e961066d4a8769dfa1d360
cni.projectcalico.org/podIP: 10.1.179.94/32
cni.projectcalico.org/podIPs: 10.1.179.94/32
Status: Running
IP: 10.1.179.94
IPs:
IP: 10.1.179.94
Controlled By: ReplicaSet/server1-7675668544
Containers:
server1:
Container ID: containerd://16928775549dbf9cb2d68eea6412e682a170f72b5dbcdbf8c56790c8b9a30fd5
Image: localhost:32000/nvfl-min:0.0.1
Image ID: localhost:32000/nvfl-min@sha256:71658dc82b15e6cd5a2580c78e56011d166a70e1ff098306c93584c82cb63821
Ports: 8002/TCP, 8003/TCP
Host Ports: 0/TCP, 0/TCP
Command:
/usr/local/bin/python3
Args:
-u
-m
nvflare.private.fed.app.server.server_train
-m
/workspace/server1
-s
fed_server.json
--set
secure_train=true
config_folder=config
State: Running
Started: Fri, 23 Sep 2022 12:28:27 -0700
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/tmp/nvflare from svc-persist (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hkhhq (ro)
/workspace from workspace (rw)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
workspace:
Type: HostPath (bare host directory volume)
Path: /home/nvflare/workspace/nvf_hc_test/demo
HostPathType: Directory
svc-persist:
Type: HostPath (bare host directory volume)
Path: /tmp/nvflare
HostPathType: Directory
kube-api-access-hkhhq:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Name: overseer-6f9dd66c97-n7bkd
Namespace: default
Priority: 0
Node: demolaptop/192.168.1.96
Start Time: Fri, 23 Sep 2022 12:28:25 -0700
Labels: pod-template-hash=6f9dd66c97
system=overseer
Annotations: cni.projectcalico.org/containerID: e9f6f2efb548c16217377eaaa8b79534a67e016277c3a0933d202d04904f46dc
cni.projectcalico.org/podIP: 10.1.179.80/32
cni.projectcalico.org/podIPs: 10.1.179.80/32
Status: Running
IP: 10.1.179.80
IPs:
IP: 10.1.179.80
Controlled By: ReplicaSet/overseer-6f9dd66c97
Containers:
overseer:
Container ID: containerd://82426e5e414b863fff1cc4c8963a3e18acd49ff1ccb51befaf5c984f3ad0f1a4
Image: localhost:32000/nvfl-min:0.0.1
Image ID: localhost:32000/nvfl-min@sha256:71658dc82b15e6cd5a2580c78e56011d166a70e1ff098306c93584c82cb63821
Port: 8443/TCP
Host Port: 0/TCP
Command:
/workspace/overseer/startup/start.sh
State: Running
Started: Fri, 23 Sep 2022 12:28:27 -0700
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dz7qz (ro)
/workspace from workspace (rw)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
workspace:
Type: HostPath (bare host directory volume)
Path: /home/nvflare/workspace/nvf_hc_test/demo
HostPathType: Directory
kube-api-access-dz7qz:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Name: server2-86bc4fc87f-s9n2s
Namespace: default
Priority: 0
Node: demolaptop/192.168.1.96
Start Time: Fri, 23 Sep 2022 12:28:25 -0700
Labels: pod-template-hash=86bc4fc87f
system=server2
Annotations: cni.projectcalico.org/containerID: 8ac76a0bfad2e4f0b1de9115f0d46c1a0dbacabb847c6160b1f144e82720fe99
cni.projectcalico.org/podIP: 10.1.179.96/32
cni.projectcalico.org/podIPs: 10.1.179.96/32
Status: Running
IP: 10.1.179.96
IPs:
IP: 10.1.179.96
Controlled By: ReplicaSet/server2-86bc4fc87f
Containers:
server2:
Container ID: containerd://c1e530fc6fc320d9b9388d81727440324cc11e0bb61e3b3e76a2362638f89357
Image: localhost:32000/nvfl-min:0.0.1
Image ID: localhost:32000/nvfl-min@sha256:71658dc82b15e6cd5a2580c78e56011d166a70e1ff098306c93584c82cb63821
Ports: 8102/TCP, 8103/TCP
Host Ports: 0/TCP, 0/TCP
Command:
/usr/local/bin/python3
Args:
-u
-m
nvflare.private.fed.app.server.server_train
-m
/workspace/server2
-s
fed_server.json
--set
secure_train=true
config_folder=config
State: Running
Started: Fri, 23 Sep 2022 12:28:28 -0700
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/tmp/nvflare from svc-persist (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6cwbh (ro)
/workspace from workspace (rw)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
workspace:
Type: HostPath (bare host directory volume)
Path: /home/nvflare/workspace/nvf_hc_test/demo
HostPathType: Directory
svc-persist:
Type: HostPath (bare host directory volume)
Path: /tmp/nvflare
HostPathType: Directory
kube-api-access-6cwbh:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
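To look at the log output of a particular component, for example server1, you can use kubectl logs with the system label shown above:
$ microk8s kubectl logs -l system=server1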
Login with admin console¶
Now, on another terminal with nvflare installed and /etc/hosts modified to include the IPs of overseer, server1, and server2 (all of which are the IP of the machine running the microk8s cluster), run fl_admin.sh from admin@nvidia.com/startup. Log in as admin@nvidia.com.
For example, /etc/hosts is modified as follows (if microk8s is running at 192.168.1.123 and the clients and admin console are running on the slowdesktop machine):
$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 slowdesktop
192.168.1.123 overseer server1 server2
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
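Then start the admin console from the provisioned admin folder and log in as admin@nvidia.com when prompted, for example:
$ cd admin@nvidia.com/startup
$ ./fl_admin.sh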
Uninstalling helm chart¶
Users can uninstall the chart by running the following (note that nvflare-helm-chart-demo
is the release name we used when installing the chart):
$ microk8s helm3 uninstall nvflare-helm-chart-demo
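Uninstalling the chart does not remove the data persisted on the host under /tmp/nvflare; if you want a clean slate, remove that directory manually:
$ rm -rf /tmp/nvflare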