Federated Logistic Regression with Second-Order Newton-Raphson optimization
This example shows how to implement federated binary classification via logistic regression with second-order Newton-Raphson optimization.
Install NVFLARE and Dependencies
For complete installation instructions, see Installation.
pip install nvflare
Get the example code from GitHub:
git clone https://github.com/NVIDIA/NVFlare.git
Then switch to the release branch and navigate to the hello-lr directory:
git switch <release branch>
cd examples/hello-world/hello-lr
Install the dependencies:
pip install -r requirements.txt
Code Structure
hello-lr
|
|-- client.py # client local training script
|-- job.py # job recipe that defines client and server configurations
|-- download_data.py # download dataset
|-- prepare_data.py # prepare data to convert to numpy
|-- requirements.txt # dependencies
Data
The UCI Heart Disease dataset is used in this example.
All attributes are numeric-valued. Each database has the same instance format. While the databases have 76 raw attributes, only 14 of them are actually used.
The authors of the databases have requested:
"...that any publications resulting from the use of the data include the
names of the principal investigator responsible for the data collection
at each institution. They would be:
1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation:
Robert Detrano, M.D., Ph.D. "
The dataset contains samples from 4 sites, split into training and testing sets as described below:

| Site | Sample split |
|---|---|
| Cleveland | train: 199 samples, test: 104 samples |
| Hungary | train: 172 samples, test: 89 samples |
| Switzerland | train: 30 samples, test: 16 samples |
| Long Beach V | train: 85 samples, test: 45 samples |

The number of features in each sample is 13.
Features
| Variable Name | Role | Type | Demographic | Description | Units | Missing Values |
|---|---|---|---|---|---|---|
| age | Feature | Integer | Age | | years | no |
| sex | Feature | Categorical | Sex | | | no |
| cp | Feature | Categorical | | | | no |
| trestbps | Feature | Integer | | resting blood pressure (on admission to the hospital) | mm Hg | no |
| chol | Feature | Integer | | serum cholesterol | mg/dl | no |
| fbs | Feature | Categorical | | fasting blood sugar > 120 mg/dl | | no |
| restecg | Feature | Categorical | | | | no |
| thalach | Feature | Integer | | maximum heart rate achieved | | no |
| exang | Feature | Categorical | | exercise induced angina | | no |
| oldpeak | Feature | Integer | | ST depression induced by exercise relative to rest | | no |
| slope | Feature | Categorical | | | | no |
| ca | Feature | Integer | | number of major vessels (0-3) colored by fluoroscopy | | yes |
| thal | Feature | Categorical | | | | yes |
| num | Target | Integer | | diagnosis of heart disease | | no |

Model
The Newton-Raphson optimization problem can be described as follows.
In a binary classification task with logistic regression, the probability of a data sample \(x\) being classified as positive is formulated as:

\[p(x) = \sigma(\beta \cdot x + \beta_{0})\]

where \(\sigma(.)\) denotes the sigmoid function. We can incorporate \(\beta_{0}\) and \(\beta\) into a single parameter vector \(\theta = ( \beta_{0}, \beta)\). Let \(d\) be the number of features for each data sample \(x\) and let \(N\) be the number of data samples. We then have the matrix version of the above probability equation:

\[p(X) = \sigma(X \theta)\]

Here \(X\) is the matrix of all samples, with shape \(N \times (d+1)\), having its first column filled with value 1 to account for the intercept \(\theta_{0}\).

The goal is to compute the parameter vector \(\theta\) that maximizes the likelihood function below:

\[L_{\theta} = \prod_{i=1}^{N} p(x_i)^{y_i} (1 - p(x_i))^{1 - y_i}\]

The Newton-Raphson method optimizes the likelihood function via quadratic approximation, working with the log-likelihood \(\log L_{\theta}\), which has the same maximizer. Omitting the maths, the theoretical update formula for parameter vector \(\theta\) is:

\[\theta^{n+1} = \theta^{n} - H_{\theta^{n}}^{-1} \nabla L_{\theta^{n}}\]

where

\[\nabla L_{\theta^{n}} = X^{T} (y - p(X))\]

is the gradient of the log-likelihood, with \(y\) being the vector of ground truth for sample data matrix \(X\), and

\[H_{\theta^{n}} = -X^{T} D X\]

is the Hessian of the log-likelihood, with \(D\) a diagonal matrix whose diagonal value at \((i,i)\) is \(D(i,i) = p(x_i) (1 - p(x_i))\).

In federated Newton-Raphson optimization, each client computes its own gradient \(\nabla L_{\theta^{n}}\) and Hessian \(H_{\theta^{n}}\) based on local training samples. The server aggregates the gradients and Hessians computed by all clients and updates the parameter \(\theta\) using the theoretical update formula described above. Because both quantities are sums over samples, adding the per-client results yields the gradient and Hessian of the pooled data.
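The aggregation and update can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the FedAvgLR implementation: the function names `local_stats` and `server_update` and the synthetic data are hypothetical, and the `damping` argument is assumed to simply scale the Newton step. Following the convention used in client.py, each client returns \(X^{T} D X\), that is, the Hessian with its minus sign dropped, so the step is added rather than subtracted.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def local_stats(X, y, theta):
    """Gradient X^T (y - p) and negated Hessian X^T D X on one client's data."""
    X = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)  # intercept column
    p = sigmoid(X @ theta)
    gradient = X.T @ (y - p)
    D = np.diag((p * (1.0 - p))[:, 0])
    neg_hessian = X.T @ D @ X
    return gradient, neg_hessian


def server_update(theta, client_stats, damping=1.0):
    """Sum client gradients and Hessians, then apply one damped Newton step."""
    gradient = sum(g for g, _ in client_stats)
    neg_hessian = sum(h for _, h in client_stats)
    return theta + damping * np.linalg.solve(neg_hessian, gradient)


# Synthetic stand-in for two client sites (hypothetical data, 3 features).
rng = np.random.default_rng(0)
true_theta = np.array([[0.5], [1.0], [-2.0], [0.3]])
sites = []
for n in (60, 40):
    X = rng.normal(size=(n, 3))
    logits = np.concatenate((np.ones((n, 1)), X), axis=1) @ true_theta
    y = (rng.uniform(size=(n, 1)) < sigmoid(logits)).astype(float)
    sites.append((X, y))

theta = np.zeros((4, 1))
for _ in range(10):
    stats = [local_stats(X, y, theta) for X, y in sites]
    theta = server_update(theta, stats, damping=0.8)

# Statistics of the pooled data, for comparison with the summed client results.
X_all = np.concatenate([X for X, _ in sites])
y_all = np.concatenate([y for _, y in sites])
pooled_gradient, pooled_hessian = local_stats(X_all, y_all, theta)
```

Summing the two clients' results at the final `theta` reproduces `pooled_gradient` and `pooled_hessian` up to floating-point error, which is why the server can perform the update without seeing any raw samples.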
Client Side
On the client side, the local training logic is implemented in client.py.
The implementation is based on the Client API. This allows users to turn a typical centralized training script into a federated client-side local training script by adding a minimal amount of NVFlare-specific code.
During local training, each client receives a copy of the global model, sent by the server, using the flare.receive() API. The received global model is an instance of FLModel.
A local validation is performed first, where validation metrics (accuracy and precision) are computed and logged through SummaryWriter.
Then each client computes its gradient and Hessian based on local training data, using the respective theoretical formulas described above. This is implemented in the train_newton_raphson() method. Each client then sends the computed results (always in FLModel format) to the server for aggregation, using the flare.send() API.
Each client site corresponds to a site listed in the data table above.
The training logic remains similar to the centralized logic: load data, perform training (Newton-Raphson updates), and validate the trained model. The only differences in the federated code relate to interaction with the FL system, such as receiving and sending FLModel.
import argparse
import os

import numpy as np
from sklearn.metrics import accuracy_score, precision_score

import nvflare.client as flare
from nvflare.apis.fl_constant import FLMetaKey
from nvflare.app_common.abstract.fl_model import FLModel, ParamsType
from nvflare.app_common.np.constants import NPConstants
from nvflare.client.tracking import SummaryWriter


def parse_arguments():
    """
    Parse command line args for client side training.
    """
    parser = argparse.ArgumentParser(description="Federated Logistic Regression with Second-Order Newton Raphson")

    parser.add_argument("--data_root", type=str, help="Path to load client side data.")

    return parser.parse_args()


def load_data(data_root, site_name):
    """
    Load the data for each client.

    Args:
        data_root: root directory storing client site data.
        site_name: client site name
    Returns:
        A dict with client site training and validation data.
    """
    print("loading data for client {} from: {}".format(site_name, data_root))
    train_x_path = os.path.join(data_root, "{}.train.x.npy".format(site_name))
    train_y_path = os.path.join(data_root, "{}.train.y.npy".format(site_name))
    test_x_path = os.path.join(data_root, "{}.test.x.npy".format(site_name))
    test_y_path = os.path.join(data_root, "{}.test.y.npy".format(site_name))

    train_X = np.load(train_x_path)
    train_y = np.load(train_y_path)
    valid_X = np.load(test_x_path)
    valid_y = np.load(test_y_path)

    return {"train_X": train_X, "train_y": train_y, "valid_X": valid_X, "valid_y": valid_y}


def sigmoid(inp):
    return 1.0 / (1.0 + np.exp(-inp))


def train_newton_raphson(data, theta):
    """
    Compute gradient and hessian on local data
    based on parameters received from server.
    """
    train_X = data["train_X"]
    train_y = data["train_y"]

    # Add intercept: prepend a column of 1s
    # as the first column of train_X
    train_X = np.concatenate((np.ones((train_X.shape[0], 1)), train_X), axis=1)

    # Compute probabilities from current weights
    proba = sigmoid(np.dot(train_X, theta))

    # The gradient is X^T . (y - proba)
    gradient = np.dot(train_X.T, (train_y - proba))

    # The hessian is X^T . D . X, where D is the
    # diagonal matrix with values proba * (1 - proba)
    D = np.diag((proba * (1 - proba))[:, 0])
    hessian = train_X.T.dot(D).dot(train_X)

    return {"gradient": gradient, "hessian": hessian}


def validate(data, theta):
    """
    Performs local validation.
    Computes accuracy and precision scores.
    """
    valid_X = data["valid_X"]
    valid_y = data["valid_y"]

    # Add intercept: prepend a column of 1s
    # as the first column of valid_X
    valid_X = np.concatenate((np.ones((valid_X.shape[0], 1)), valid_X), axis=1)

    # Compute probabilities from current weights
    proba = sigmoid(np.dot(valid_X, theta))

    # Rounding the probabilities applies a 0.5 decision threshold
    return {
        "accuracy": accuracy_score(valid_y.flatten(), proba.flatten().round()),
        "precision": precision_score(valid_y.flatten(), proba.flatten().round()),
    }


def main():
    """
    This is a typical ML training loop,
    augmented with Flare Client API to
    perform local training on each client
    side and send result to server.
    """
    args = parse_arguments()
    data_root = args.data_root

    flare.init()

    site_name = flare.get_site_name()
    print("training on client site: {}".format(site_name))

    # Load client site data.
    data = load_data(data_root, site_name)

    # Get metric summary writer for TensorBoard
    writer = SummaryWriter()

    while flare.is_running():
        # Receive global model (FLModel) from server.
        global_model = flare.receive()

        print(f"\n{global_model=}")

        curr_round = global_model.current_round
        print("current_round={}".format(curr_round))

        print(f"[ROUND {curr_round}] - client site: {site_name}, received global model: {global_model}")

        # Get the weights, aka parameter theta for
        # logistic regression.
        global_weights = global_model.params[NPConstants.NUMPY_KEY]
        print(f"[ROUND {curr_round}] - global model weights: {global_weights}")

        # Local validation before training
        print(f"[ROUND {curr_round}] - start validation of global model on client: {site_name}")
        validation_scores = validate(data, global_weights)
        print(f"[ROUND {curr_round}] - validation metric scores on client: {site_name} = {validation_scores}")

        # Write validation metrics to TensorBoard
        writer.add_scalar(f"{site_name}/accuracy", validation_scores["accuracy"], curr_round)
        writer.add_scalar(f"{site_name}/precision", validation_scores["precision"], curr_round)

        # Local training
        print(f"[ROUND {curr_round}] - start local training on client site: {site_name}")
        result_dict = train_newton_raphson(data, theta=global_weights)

        # Send result to server for aggregation.
        result_model = FLModel(params=result_dict, params_type=ParamsType.FULL)
        result_model.meta[FLMetaKey.NUM_STEPS_CURRENT_ROUND] = data["train_X"].shape[0]

        print(
            f"[ROUND {curr_round}] - local training from client: {site_name} complete,"
            f" sending results to server: {result_model}"
        )

        flare.send(result_model)


if __name__ == "__main__":
    main()
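The Hessian computation in train_newton_raphson() builds an explicit \(N \times N\) diagonal matrix. For larger local datasets this can be avoided. The sketch below is an optional variation, not part of client.py: it assumes `proba` has shape \((N, 1)\) as in the listing above, and the helper names `hessian_diag` and `hessian_broadcast` are hypothetical. It checks that scaling the rows of \(X\) by \(p(x_i)(1 - p(x_i))\) gives the same result as the explicit `np.diag` product.

```python
import numpy as np


def hessian_diag(X, proba):
    """X^T D X with an explicit N x N diagonal matrix, as in client.py."""
    D = np.diag((proba * (1 - proba))[:, 0])
    return X.T.dot(D).dot(X)


def hessian_broadcast(X, proba):
    """Same product, scaling each row of X instead of building D."""
    weights = proba * (1 - proba)  # shape (N, 1), broadcasts across columns
    return X.T @ (X * weights)


# Random stand-in data: 50 samples, 13 features plus the intercept column.
rng = np.random.default_rng(42)
X = np.concatenate((np.ones((50, 1)), rng.normal(size=(50, 13))), axis=1)
theta = rng.normal(scale=0.1, size=(14, 1))
proba = 1.0 / (1.0 + np.exp(-(X @ theta)))

same = np.allclose(hessian_diag(X, proba), hessian_broadcast(X, proba))
```

The broadcast form needs memory proportional to \(N \times (d+1)\) rather than \(N \times N\). With sites of at most a few hundred samples, as in this example, either form is adequate.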
Server Side
We leverage FLARE's built-in logistic regression workflow based on the Newton-Raphson method. The server-side FedAvg class is located at nvflare.app_common.workflows.lr.fedavg.FedAvgLR.
Job
import argparse

from nvflare.app_common.np.recipes.lr.fedavg import FedAvgLrRecipe
from nvflare.recipe import SimEnv

# from nvflare.recipe import PocEnv


def define_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_clients", type=int, default=2)
    parser.add_argument("--num_rounds", type=int, default=5)
    parser.add_argument("--data_root", type=str, default="/tmp/flare/dataset/heart_disease_data")

    return parser.parse_args()


def main():
    args = define_parser()

    n_clients = args.n_clients
    num_rounds = args.num_rounds
    data_root = args.data_root

    print("number of clients =", n_clients)
    recipe = FedAvgLrRecipe(
        min_clients=n_clients,
        num_rounds=num_rounds,
        damping_factor=0.8,
        num_features=13,  # Model is created internally based on num_features
        # For pre-trained weights: initial_ckpt="/server/path/to/lr_model.npy",
        train_script="client.py",
        train_args=f"--data_root {data_root}",
    )
    env = SimEnv(num_clients=n_clients, num_threads=n_clients)
    # env = PocEnv(num_clients=n_clients)
    run = recipe.execute(env)
    w = run.get_result()
    print("result location =", w)


if __name__ == "__main__":
    main()
Download and prepare data
Execute the following scripts:
python download_data.py
python prepare_data.py
This will download the heart disease dataset under
/tmp/flare/dataset/heart_disease_data/
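A quick sanity check of the prepared files can be sketched as follows. The file naming pattern (`<site>.train.x.npy`, `<site>.train.y.npy`, `<site>.test.x.npy`, `<site>.test.y.npy`) is taken from load_data() in client.py. The helper name `check_site` is hypothetical, and so is any site name passed to it, since the actual names depend on how prepare_data.py and the runtime label the sites.

```python
import os

import numpy as np


def check_site(data_root, site_name, num_features=13):
    """Load one site's arrays and verify that samples and labels line up."""
    shapes = {}
    for split in ("train", "test"):
        x = np.load(os.path.join(data_root, f"{site_name}.{split}.x.npy"))
        y = np.load(os.path.join(data_root, f"{site_name}.{split}.y.npy"))
        if x.shape[1] != num_features:
            raise ValueError(f"{split}: expected {num_features} features, got {x.shape[1]}")
        if x.shape[0] != y.shape[0]:
            raise ValueError(f"{split}: {x.shape[0]} samples but {y.shape[0]} labels")
        shapes[split] = x.shape
    return shapes
```

Calling `check_site("/tmp/flare/dataset/heart_disease_data", "site-1")` (with `"site-1"` as an assumed site name) would return the feature array shapes for that site, which can be compared against the sample counts in the data table.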
Running Job
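Execute the following command to launch federated logistic regression in NVFlare's simulation mode. With no arguments, job.py uses 2 clients, 5 rounds, and the data under /tmp/flare/dataset/heart_disease_data; these can be overridden with --n_clients, --num_rounds, and --data_root.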
python job.py