FL Experiment Tracking with TensorBoard Streaming¶
Introduction¶
In this exercise, you will learn how to stream TensorBoard events from the clients to the server in order to visualize live training metrics from a central place on the server.
This exercise works with the tensorboard example in the advanced examples folder under experiment-tracking, which builds upon Hello PyTorch by adding TensorBoard streaming.
The setup of this exercise consists of one server and two clients.
Note
This exercise differs from Hello PyTorch in that it uses the Learner API along with the LearnerExecutor.
In short, the execution flow is abstracted away into the LearnerExecutor, so you only need to implement the required methods in the Learner class.
This is not the focus of this guide, but you can learn more at Learner and LearnerExecutor.
Let’s get started. Make sure you have an environment with NVIDIA FLARE installed as described in Getting Started. First clone the repo:
$ git clone https://github.com/NVIDIA/NVFlare.git
Now remember to activate your NVIDIA FLARE Python virtual environment from the installation guide. Then install the required dependencies from the example folder (NVFlare/examples/advanced/experiment-tracking/tensorboard).
(nvflare-env) $ python3 -m pip install -r requirements.txt
Adding TensorBoard Streaming to Configurations¶
Inside the config folder there are two files: config_fed_client.json and config_fed_server.json.
1{
2 "format_version": 2,
3
4 "executors": [
5 {
6 "tasks": [
7 "train",
8 "submit_model",
9 "validate"
10 ],
11 "executor": {
12 "id": "Executor",
13 "path": "nvflare.app_common.executors.learner_executor.LearnerExecutor",
14 "args": {
15 "learner_id": "pt_learner"
16 }
17 }
18 }
19 ],
20 "task_result_filters": [
21 ],
22 "task_data_filters": [
23 ],
24 "components": [
25 {
26 "id": "pt_learner",
27 "path": "pt.learner_with_tb.PTLearner",
28 "args": {
29 "lr": 0.01,
30 "epochs": 5,
31 "analytic_sender_id": "log_writer"
32 }
33 },
34 {
35 "id": "log_writer",
36 "path": "nvflare.app_opt.tracking.tb.tb_writer.TBWriter",
37 "args": {"event_type": "analytix_log_stats"}
38 },
39 {
40 "id": "event_to_fed",
41 "name": "ConvertToFedEvent",
42 "args": {"events_to_convert": ["analytix_log_stats"], "fed_event_prefix": "fed."}
43 }
44 ]
45}
Take a look at the components section of the client config at line 24.
The first component is pt_learner, which contains the initialization, training, and validation logic.
learner_with_tb.py (under NVFlare/examples/advanced/experiment-tracking/pt) is where we will add our TensorBoard streaming changes.
Next is the TBWriter, which implements common methods that follow the signatures of the PyTorch SummaryWriter, making it easy for pt_learner to log metrics and send events.
Finally, there is the ConvertToFedEvent component, which converts local events to federated events: it turns the event analytix_log_stats into the fed event fed.analytix_log_stats, which is then streamed from the clients to the server.
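To make the renaming rule concrete, here is a minimal sketch in plain Python (illustrative only, not the actual NVFlare implementation) of what ConvertToFedEvent does with an event type, using the events_to_convert list and fed_event_prefix from the config above:

```python
def convert_event_type(event_type, events_to_convert, fed_event_prefix="fed."):
    """Sketch of ConvertToFedEvent's renaming rule: local event types listed in
    events_to_convert are re-emitted under the configured fed-event prefix."""
    if event_type in events_to_convert:
        return fed_event_prefix + event_type
    return event_type  # other events pass through unchanged

# With the client config above:
fed_type = convert_event_type("analytix_log_stats", ["analytix_log_stats"])
# fed_type == "fed.analytix_log_stats", the type the server-side receiver listens for
```

Note that the resulting name is exactly the event type the server config accepts, which is what ties the two configs together.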
1{
2 "format_version": 2,
3
4 "server": {
5 "heart_beat_timeout": 600
6 },
7 "task_data_filters": [],
8 "task_result_filters": [],
9 "components": [
10 {
11 "id": "persistor",
12 "name": "PTFileModelPersistor",
13 "args": {
14 "model": {
15 "path": "pt.simple_network.SimpleNetwork",
16 "args": {}
17 }
18 }
19 },
20 {
21 "id": "shareable_generator",
22 "name": "FullModelShareableGenerator",
23 "args": {}
24 },
25 {
26 "id": "aggregator",
27 "name": "InTimeAccumulateWeightedAggregator",
28 "args": {
29 "expected_data_kind": "WEIGHTS"
30 }
31 },
32 {
33 "id": "model_locator",
34 "name": "PTFileModelLocator",
35 "args": {
36 "pt_persistor_id": "persistor"
37 }
38 },
39 {
40 "id": "json_generator",
41 "path": "nvflare.app_common.widgets.validation_json_generator.ValidationJsonGenerator",
42 "args": {}
43 },
44 {
45 "id": "tb_analytics_receiver",
46 "name": "TBAnalyticsReceiver",
47 "args": {"events": ["fed.analytix_log_stats"]}
48 }
49 ],
50 "workflows": [
51 {
52 "id": "scatter_and_gather",
53 "name": "ScatterAndGather",
54 "args": {
55 "min_clients" : 2,
56 "num_rounds" : 1,
57 "start_round": 0,
58 "wait_time_after_min_received": 10,
59 "aggregator_id": "aggregator",
60 "persistor_id": "persistor",
61 "shareable_generator_id": "shareable_generator",
62 "train_task_name": "train",
63 "train_timeout": 0
64 }
65 },
66 {
67 "id": "cross_site_validate",
68 "name": "CrossSiteModelEval",
69 "args": {
70 "model_locator_id": "model_locator"
71 }
72 }
73 ]
74}
Under the components section in the server config, we have the TBAnalyticsReceiver of type AnalyticsReceiver.
This component receives TensorBoard events from the clients and saves them to a specified folder (default tb_events) under the server's run folder.
Notice how the accepted event type "fed.analytix_log_stats" matches the output of ConvertToFedEvent in the client config.
Adding TensorBoard Streaming to your Code¶
In this exercise, all of the TensorBoard code additions will be made in learner_with_tb.py.
First we must initialize our TensorBoard writer to the log_writer component we defined in the client config:
110 )
111
112 # Tensorboard streaming setup
113 self.writer = parts.get(self.analytic_sender_id) # user configuration from config_fed_client.json
The LearnerExecutor passes the component dictionary into the parts parameter of initialize().
We can then access the log_writer component we defined in config_fed_client.json by using self.analytic_sender_id as the key into the parts dictionary.
Note that self.analytic_sender_id defaults to "analytic_sender", but we can also define it in the client config to be passed into the constructor, as is done here with "log_writer".
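As a quick illustration (with hypothetical stand-in objects, not NVFlare code), the lookup in initialize() amounts to a dictionary access keyed by the component id from the client config:

```python
# Hypothetical stand-ins: in NVFlare, `parts` maps component ids (from
# config_fed_client.json) to the instantiated components.
class DummyWriter:
    """Placeholder for the configured TensorBoard writer component."""
    pass

parts = {"pt_learner": object(), "log_writer": DummyWriter()}

# With "analytic_sender_id": "log_writer" set in the PTLearner args,
# the learner resolves its writer like this:
analytic_sender_id = "log_writer"
writer = parts.get(analytic_sender_id)
# `writer` is the DummyWriter instance; with the default id "analytic_sender"
# and no matching component, parts.get() would return None instead.
```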
Now that our TensorBoard writer is set to the log_writer component, we can write and stream training metrics to the server in local_train():
151 return outgoing_dxo.to_shareable()
152
153 def local_train(self, fl_ctx, abort_signal):
154 # Basic training
155 current_round = fl_ctx.get_prop(FLContextKey.TASK_DATA).get_header("current_round")
156 for epoch in range(self.epochs):
157 self.model.train()
158 running_loss = 0.0
159 for i, batch in enumerate(self.train_loader):
160 if abort_signal.triggered:
161 return
162
163 images, labels = batch[0].to(self.device), batch[1].to(self.device)
164 self.optimizer.zero_grad()
165
166 predictions = self.model(images)
167 cost = self.loss(predictions, labels)
168 cost.backward()
169 self.optimizer.step()
170
171 running_loss += cost.cpu().detach().numpy() / images.size()[0]
172 if i % 3000 == 0:
173 self.log_info(
174 fl_ctx, f"Epoch: {epoch}/{self.epochs}, Iteration: {i}, " f"Loss: {running_loss/3000}"
175 )
176 running_loss = 0.0
177
178 # Stream training loss at each step
179 current_step = self.n_iterations * self.epochs * current_round + self.n_iterations * epoch + i
180 self.writer.add_scalar("train_loss", cost.item(), current_step)
181
We use add_scalar(tag, scalar, global_step) on line 180 to stream the training loss to the server at each step; the log_info call on lines 173-175, by contrast, only logs the running loss locally on the client.
You can learn more about the other supported writer methods in AnalyticsSender.
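The global step computed on line 179 is worth a closer look: TensorBoard plots scalars against a single step axis, so the formula flattens (round, epoch, iteration) into one monotonically increasing index. A small sketch of the same arithmetic, with illustrative numbers:

```python
def global_step(n_iterations, epochs, current_round, epoch, i):
    # Same arithmetic as line 179 of local_train(): rounds are the outermost
    # loop, epochs the middle loop, and batch iterations the innermost.
    return n_iterations * epochs * current_round + n_iterations * epoch + i

# With 100 iterations per epoch and 5 epochs per round:
assert global_step(100, 5, current_round=0, epoch=0, i=0) == 0
assert global_step(100, 5, current_round=0, epoch=1, i=0) == 100
# The next round continues the same axis rather than restarting at zero:
assert global_step(100, 5, current_round=1, epoch=0, i=0) == 500
```

This is why the loss curves from successive FL rounds appear as one continuous line per client in the TensorBoard dashboard.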
Train the Model, Federated!¶
Now you can use the admin command prompt to submit and start this example job. To do this on a proof-of-concept local FL system, follow the sections Setting Up the Application Environment in POC Mode and Starting the Application Environment in POC Mode if you have not already done so.
Running the FL System¶
With the admin client command prompt successfully connected and logged in, enter the command below.
> submit_job tensorboard-streaming
Pay close attention to what happens in each of the four terminals.
You can see how the admin client submits the job to the server and how the JobRunner on the server automatically picks up the job to deploy and start the run.
This command uploads the job configuration from the admin client to the server. A job id will be returned, and we can use that id to access job information.
Note
If we use submit_job [app] then that app will be treated as a single app job.
From time to time, you can issue check_status server in the admin client to check the overall training progress.
You should now see how the training is progressing in the very first terminal (the one that started the server).
Viewing the TensorBoard Dashboard during Training¶
On the client side, the AnalyticsSender works like a TensorBoard SummaryWriter: instead of writing to TB files, it generates NVFLARE events of type analytix_log_stats.
The ConvertToFedEvent widget then turns the event analytix_log_stats into the fed event fed.analytix_log_stats, which is delivered to the server side.
On the server side, the TBAnalyticsReceiver is configured to process fed.analytix_log_stats events and writes the received TB data into the appropriate TB files on the server (defaults to server/[JOB ID]/tb_events).
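Putting this together, the client-side behavior can be sketched as a writer that records events instead of files. This is a hedged sketch with made-up class names, not the actual AnalyticsSender implementation:

```python
class EventSender:
    """Sketch of a SummaryWriter-like object that emits events instead of
    writing TensorBoard files (the role AnalyticsSender plays on the client)."""

    def __init__(self, event_type="analytix_log_stats"):
        self.event_type = event_type
        self.fired = []  # stand-in for NVFlare's event bus

    def add_scalar(self, tag, scalar, global_step):
        # Each call becomes an event payload rather than a local file write.
        self.fired.append(
            (self.event_type, {"tag": tag, "value": scalar, "step": global_step})
        )

sender = EventSender()
sender.add_scalar("train_loss", 0.42, 7)
# sender.fired now holds one analytix_log_stats event; in the real system,
# ConvertToFedEvent would re-emit it as fed.analytix_log_stats and the
# server's TBAnalyticsReceiver would write it into tb_events.
```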
To view training metrics that are being streamed to the server, run:
tensorboard --logdir=poc/server/[JOB ID]/tb_events
Note
If the server is running on a remote machine, use port forwarding to view the TensorBoard dashboard in a browser. For example:
ssh -L {local_machine_port}:127.0.0.1:6006 user@server_ip
Attention
The server/[JOB ID] folder only exists while the job is running.
After the job has finished, use download_job [JOB ID] to get the workspace data as explained below.
Accessing the results¶
The results of each job are usually stored in the server-side workspace.
Please refer to access server-side workspace for details on accessing it.
Shutdown FL system¶
Once the FL run is complete, the server has successfully aggregated the clients' results for all rounds, and cross-site model evaluation has finished, run the following commands in the fl_admin to shut down the system (entering admin when prompted for a password):
> shutdown client
> shutdown server
> bye
Congratulations!
Now you will be able to see the live training metrics of each client from a central place on the server.
The full source code for this exercise can be found in examples/advanced/experiment-tracking/tensorboard.