Federated Logistic Regression with Second-Order Newton-Raphson optimization¶
This example shows how to implement a federated binary classification via logistic regression with second-order Newton-Raphson optimization.
Install NVFLARE and Dependencies¶
For the complete installation instructions, see Installation.
pip install nvflare
Get the example code from GitHub:
git clone https://github.com/NVIDIA/NVFlare.git
Then, from the root of the cloned repository, switch to the release branch and navigate to the hello-lr directory:
git switch <release branch>
cd examples/hello-world/hello-lr
Install the dependencies:
pip install -r requirements.txt
Code Structure¶
hello-lr
|
|-- client.py # client local training script
|-- job.py # job recipe that defines client and server configurations
|-- download_data.py # download dataset
|-- prepare_data.py # prepare data to convert to numpy
|-- requirements.txt # dependencies
Data¶
The UCI Heart Disease dataset is used in this example.
All attributes are numeric-valued. Each database has the same instance format. While the databases have 76 raw attributes, only 14 of them are actually used.
The authors of the databases have requested:
"...that any publications resulting from the use of the data include the
names of the principal investigator responsible for the data collection
at each institution. They would be:
1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation:
Robert Detrano, M.D., Ph.D. "
The dataset contains samples from 4 sites, split into training and testing sets as described below:
| site | sample split |
|---|---|
| Cleveland | train: 199 samples, test: 104 samples |
| Hungary | train: 172 samples, test: 89 samples |
| Switzerland | train: 30 samples, test: 16 samples |
| Long Beach V | train: 85 samples, test: 45 samples |
The number of features in each sample is 13.
Features¶
| Variable Name | Role | Type | Demographic | Description | Units | Missing Values |
|---|---|---|---|---|---|---|
| age | Feature | Integer | Age | | years | no |
| sex | Feature | Categorical | Sex | | | no |
| cp | Feature | Categorical | | | | no |
| trestbps | Feature | Integer | | resting blood pressure (on admission to the hospital) | mm Hg | no |
| chol | Feature | Integer | | serum cholesterol | mg/dl | no |
| fbs | Feature | Categorical | | fasting blood sugar > 120 mg/dl | | no |
| restecg | Feature | Categorical | | | | no |
| thalach | Feature | Integer | | maximum heart rate achieved | | no |
| exang | Feature | Categorical | | exercise induced angina | | no |
| oldpeak | Feature | Integer | | ST depression induced by exercise relative to rest | | no |
| slope | Feature | Categorical | | | | no |
| ca | Feature | Integer | | number of major vessels (0-3) colored by fluoroscopy | | yes |
| thal | Feature | Categorical | | | | yes |
| num | Target | Integer | | diagnosis of heart disease | | no |
Model¶
The Newton-Raphson optimization problem can be described as follows.
In a binary classification task with logistic regression, the probability that a data sample \(x\) is classified as positive is formulated as:
\[p(x) = \sigma(\beta \cdot x + \beta_{0})\]
where \(\sigma(.)\) denotes the sigmoid function. We can incorporate \(\beta_{0}\) and \(\beta\) into a single parameter vector \(\theta = ( \beta_{0}, \beta)\). Let \(d\) be the number of features for each data sample \(x\) and let \(N\) be the number of data samples. We then have the matrix version of the above probability equation:
\[p(X) = \sigma(X \theta)\]
Here \(X\) is the matrix of all samples, with shape \(N \times (d+1)\), having its first column filled with value 1 to account for the intercept \(\theta_{0}\).
The goal is to compute the parameter vector \(\theta\) that maximizes the likelihood function below:
\[L_{\theta} = \prod_{i=1}^{N} p(x_i)^{y_i} (1 - p(x_i))^{1 - y_i}\]
The Newton-Raphson method optimizes the likelihood function via a quadratic approximation of its logarithm. Omitting the derivation, the theoretical update formula for the parameter vector \(\theta\) is:
\[\theta^{n+1} = \theta^{n} - H_{\theta^{n}}^{-1} \nabla L_{\theta^{n}}\]
where
\[\nabla L_{\theta^{n}} = X^{T}(y - p(X))\]
is the gradient of the log-likelihood, with \(y\) being the vector of ground truth for sample data matrix \(X\), and
\[H_{\theta^{n}} = -X^{T} D X\]
is the Hessian of the log-likelihood, with \(D\) a diagonal matrix whose diagonal value at \((i,i)\) is \(D(i,i) = p(x_i) (1 - p(x_i))\).
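As a sanity check on these closed forms, the following sketch compares the analytic gradient \(X^{T}(y - p(X))\) and Hessian \(-X^{T} D X\) of the log-likelihood with central finite-difference estimates. The synthetic data, the seed, and the helper names (`log_likelihood`, `gradient`, `hessian`) are assumptions made for illustration and are not part of the example code.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_likelihood(X, y, theta):
    """Log of the Bernoulli likelihood for logistic regression."""
    p = sigmoid(X @ theta)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))


def gradient(X, y, theta):
    """Analytic gradient of the log-likelihood: X^T (y - p)."""
    return X.T @ (y - sigmoid(X @ theta))


def hessian(X, theta):
    """Analytic Hessian of the log-likelihood: -X^T D X."""
    p = sigmoid(X @ theta)
    return -(X.T * (p * (1 - p))) @ X


# Synthetic data (illustrative only): N samples, d features, intercept column of 1s.
rng = np.random.default_rng(0)
N, d = 50, 3
X = np.concatenate((np.ones((N, 1)), rng.normal(size=(N, d))), axis=1)
y = rng.integers(0, 2, size=N).astype(float)
theta = rng.normal(scale=0.1, size=d + 1)

eps = 1e-6
basis = np.eye(d + 1)

# Central finite differences of the log-likelihood approximate the gradient.
numeric_grad = np.array(
    [(log_likelihood(X, y, theta + eps * e) - log_likelihood(X, y, theta - eps * e)) / (2 * eps) for e in basis]
)

# Central finite differences of the gradient approximate the Hessian (column by column).
numeric_hess = np.stack(
    [(gradient(X, y, theta + eps * e) - gradient(X, y, theta - eps * e)) / (2 * eps) for e in basis], axis=1
)

grad_error = float(np.max(np.abs(numeric_grad - gradient(X, y, theta))))
hess_error = float(np.max(np.abs(numeric_hess - hessian(X, theta))))
print("max abs gradient error:", grad_error)
print("max abs Hessian error:", hess_error)
```

Both errors should be small compared with the magnitudes of the gradient and Hessian entries, which is what one expects if the analytic expressions match the log-likelihood.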
In federated Newton-Raphson optimization, each client computes its own gradient \(\nabla L_{\theta^{n}}\) and Hessian \(H_{\theta^{n}}\) based on local training samples. The server aggregates the gradients and Hessians computed by all clients and updates the parameter \(\theta\) using the theoretical update formula described above.
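The aggregation step can be illustrated with a small single-process simulation. The sketch below assumes that the server simply sums the client gradients and Hessians and then applies a damped Newton step. The synthetic data, the `local_stats` and `newton_step` helpers, and the small ridge term added before solving are illustrative assumptions, not the FedAvgLR implementation. The damping value 0.8 mirrors `damping_factor` in job.py. Following client.py, each client reports \(X^{T} D X\) (the Hessian with its sign flipped), so the step is added rather than subtracted.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def local_stats(X, y, theta):
    """Client side: gradient X^T (y - p) and X^T D X on local data."""
    p = sigmoid(X @ theta)
    grad = X.T @ (y - p)
    neg_hess = (X.T * (p * (1 - p))) @ X
    return grad, neg_hess


def newton_step(theta, grad, neg_hess, damping=0.8, ridge=1e-6):
    """Server side (assumed): damped Newton update; the ridge keeps the solve well-posed."""
    reg = neg_hess + ridge * np.eye(len(theta))
    return theta + damping * np.linalg.solve(reg, grad)


# Three hypothetical clients with synthetic data of different sizes.
rng = np.random.default_rng(1)
d = 3
true_theta = np.array([0.5, 1.0, -1.0, 0.3])
clients = []
for n in (40, 60, 30):
    X = np.concatenate((np.ones((n, 1)), rng.normal(size=(n, d))), axis=1)
    y = (rng.random(n) < sigmoid(X @ true_theta)).astype(float)
    clients.append((X, y))

theta = np.zeros(d + 1)

# Summed client statistics equal the statistics computed on the pooled data.
stats = [local_stats(X, y, theta) for X, y in clients]
fed_grad = sum(g for g, _ in stats)
fed_hess = sum(h for _, h in stats)
X_all = np.concatenate([X for X, _ in clients])
y_all = np.concatenate([y for _, y in clients])
cen_grad, cen_hess = local_stats(X_all, y_all, theta)
print("federated == centralized:", np.allclose(fed_grad, cen_grad), np.allclose(fed_hess, cen_hess))

# A few federated rounds: clients compute statistics, the server aggregates and updates.
for rnd in range(10):
    stats = [local_stats(X, y, theta) for X, y in clients]
    theta = newton_step(theta, sum(g for g, _ in stats), sum(h for _, h in stats))

final_grad_norm = float(np.linalg.norm(sum(local_stats(X, y, theta)[0] for X, y in clients)))
print("gradient norm after 10 rounds:", final_grad_norm)
```

Because the gradient and Hessian are sums over samples, summing per-client results reproduces the centralized quantities exactly, which is why only these two statistics need to leave each site.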
Client Side¶
On the client side, the local training logic is implemented in client.py.
The implementation is based on the Client API. This allows users to turn a typical centralized training script into a federated client-side local training script by adding a minimal amount of nvflare-specific code.
During local training, each client receives a copy of the global model, sent by the server, using the flare.receive() API. The received global model is an instance of FLModel.
A local validation is first performed, where validation metrics (accuracy and precision) are computed on the client's local validation data.
Then each client computes its gradient and Hessian based on local training data, using their respective theoretical formulas described above. This is implemented in the train_newton_raphson() method. Each client then sends the computed results (always in FLModel format) to the server for aggregation, using the flare.send() API.
Each client site corresponds to a site listed in the data table above.
The training logic remains similar to the centralized logic: load data, perform training (Newton-Raphson updates), and validate the trained model. The only differences in the federated code relate to interaction with the FL system, such as receiving and sending FLModel.
import argparse
import os

import numpy as np
from sklearn.metrics import accuracy_score, precision_score

import nvflare.client as flare
from nvflare.apis.fl_constant import FLMetaKey
from nvflare.app_common.abstract.fl_model import FLModel, ParamsType
from nvflare.app_common.np.constants import NPConstants


def parse_arguments():
    """
    Parse command line args for client side training.
    """
    parser = argparse.ArgumentParser(description="Federated Logistic Regression with Second-Order Newton Raphson")

    parser.add_argument("--data_root", type=str, help="Path to load client side data.")

    return parser.parse_args()


def load_data(data_root, site_name):
    """
    Load the data for each client.

    Args:
        data_root: root directory storing client site data.
        site_name: client site name
    Returns:
        A dict with client site training and validation data.
    """
    print("loading data for client {} from: {}".format(site_name, data_root))
    train_x_path = os.path.join(data_root, "{}.train.x.npy".format(site_name))
    train_y_path = os.path.join(data_root, "{}.train.y.npy".format(site_name))
    test_x_path = os.path.join(data_root, "{}.test.x.npy".format(site_name))
    test_y_path = os.path.join(data_root, "{}.test.y.npy".format(site_name))

    train_X = np.load(train_x_path)
    train_y = np.load(train_y_path)
    valid_X = np.load(test_x_path)
    valid_y = np.load(test_y_path)

    return {"train_X": train_X, "train_y": train_y, "valid_X": valid_X, "valid_y": valid_y}


def sigmoid(inp):
    return 1.0 / (1.0 + np.exp(-inp))


def train_newton_raphson(data, theta):
    """
    Compute gradient and hessian on local data
    based on parameters received from server.

    """
    train_X = data["train_X"]
    train_y = data["train_y"]

    # Add intercept: prepend a column of 1s
    # as the first column of train_X
    train_X = np.concatenate((np.ones((train_X.shape[0], 1)), train_X), axis=1)

    # Compute probabilities from current weights
    proba = sigmoid(np.dot(train_X, theta))

    # The gradient is X^T . (y - proba)
    gradient = np.dot(train_X.T, (train_y - proba))

    # The hessian is X^T . D . X, where D is the
    # diagonal matrix with values proba * (1 - proba)
    D = np.diag((proba * (1 - proba))[:, 0])
    hessian = train_X.T.dot(D).dot(train_X)

    return {"gradient": gradient, "hessian": hessian}


def validate(data, theta):
    """
    Performs local validation.
    Computes accuracy and precision scores.

    """
    valid_X = data["valid_X"]
    valid_y = data["valid_y"]

    # Add intercept: prepend a column of 1s
    # as the first column of valid_X
    valid_X = np.concatenate((np.ones((valid_X.shape[0], 1)), valid_X), axis=1)

    # Compute probabilities from current weights
    proba = sigmoid(np.dot(valid_X, theta))

    return {"accuracy": accuracy_score(valid_y, proba.round()), "precision": precision_score(valid_y, proba.round())}


def main():
    """
    This is a typical ML training loop,
    augmented with Flare Client API to
    perform local training on each client
    side and send result to server.

    """
    args = parse_arguments()
    data_root = args.data_root

    flare.init()

    site_name = flare.get_site_name()
    print("training on client site: {}".format(site_name))

    # Load client site data.
    data = load_data(data_root, site_name)

    while flare.is_running():
        # Receive global model (FLModel) from server.
        global_model = flare.receive()

        print(f"\n{global_model=}")

        curr_round = global_model.current_round
        print("current_round={}".format(curr_round))

        # Both parts of the message are interpolated (the original second literal lacked the f prefix).
        print(f"[ROUND {curr_round}] - client site: {site_name}, received global model: {global_model}")

        # Get the weights, aka parameter theta for
        # logistic regression.
        global_weights = global_model.params[NPConstants.NUMPY_KEY]
        print(f"[ROUND {curr_round}] - global model weights: {global_weights}")

        # Local validation before training
        print(f"[ROUND {curr_round}] - start validation of global model on client: {site_name}")
        validation_scores = validate(data, global_weights)
        print(f"[ROUND {curr_round}] - validation metric scores on client: {site_name} = {validation_scores}")

        # Local training
        print(f"[ROUND {curr_round}] - start local training on client site: {site_name}")
        result_dict = train_newton_raphson(data, theta=global_weights)

        # Send result to server for aggregation.
        result_model = FLModel(params=result_dict, params_type=ParamsType.FULL)
        result_model.meta[FLMetaKey.NUM_STEPS_CURRENT_ROUND] = data["train_X"].shape[0]

        print(
            f"[ROUND {curr_round}] - local training from client: {site_name} complete,"
            f" sending results to server: {result_model}"
        )

        flare.send(result_model)


if __name__ == "__main__":
    main()
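One design note on train_newton_raphson(): `np.diag(...)` materializes an \(N \times N\) matrix, which is harmless for a few hundred samples but grows quadratically with the number of local samples. The sketch below shows an equivalent broadcasting form on synthetic data. It assumes column-vector shapes (`theta` of shape `(d+1, 1)`, hence `proba` of shape `(N, 1)`), as implied by the `[:, 0]` indexing in client.py; the `hessian_diag` and `hessian_broadcast` names are hypothetical.

```python
import numpy as np

# Synthetic inputs (illustrative only): 13 features plus an intercept column.
rng = np.random.default_rng(2)
N, d = 200, 13
X = np.concatenate((np.ones((N, 1)), rng.normal(size=(N, d))), axis=1)
theta = rng.normal(scale=0.1, size=(d + 1, 1))
proba = 1.0 / (1.0 + np.exp(-np.dot(X, theta)))  # shape (N, 1)


def hessian_diag(X, proba):
    """Form used in client.py: explicit N x N diagonal matrix."""
    D = np.diag((proba * (1 - proba))[:, 0])
    return X.T.dot(D).dot(X)


def hessian_broadcast(X, proba):
    """Equivalent form: scale each row of X by its weight, no N x N matrix."""
    w = proba * (1 - proba)  # shape (N, 1), broadcasts across the columns of X
    return X.T @ (w * X)


h_diag = hessian_diag(X, proba)
h_fast = hessian_broadcast(X, proba)
print(h_diag.shape, np.allclose(h_diag, h_fast))
```

For the dataset sizes in this example either form is fine; the broadcasting variant mainly matters if a site holds many more samples.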
Server Side¶
We leverage FLARE's built-in logistic regression with the Newton-Raphson method. The server-side FedAvg class is located at nvflare.app_common.workflows.lr.fedavg.FedAvgLR.
Job¶
import argparse

from nvflare.app_common.np.recipes.lr.fedavg import FedAvgLrRecipe
from nvflare.recipe import SimEnv

# from nvflare.recipe import PocEnv


def define_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_clients", type=int, default=2)
    parser.add_argument("--num_rounds", type=int, default=5)
    parser.add_argument("--data_root", type=str, default="/tmp/flare/dataset/heart_disease_data")

    return parser.parse_args()


def main():
    args = define_parser()

    n_clients = args.n_clients
    num_rounds = args.num_rounds
    data_root = args.data_root

    print("number of clients =", n_clients)
    recipe = FedAvgLrRecipe(
        num_rounds=num_rounds,
        damping_factor=0.8,
        num_features=13,
        train_script="client.py",
        train_args=f"--data_root {data_root}",
    )
    env = SimEnv(num_clients=n_clients, num_threads=n_clients)
    # env = PocEnv(num_clients=n_clients)
    run = recipe.execute(env)
    w = run.get_result()
    print("result location =", w)


if __name__ == "__main__":
    main()
Download and prepare data¶
Execute the following scripts:
python download_data.py
python prepare_data.py
This will download the heart disease dataset under
/tmp/flare/dataset/heart_disease_data/
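Before launching the job, it can help to confirm that the prepared files have the layout client.py expects. The sketch below follows the file-name pattern used by load_data() (`<site>.train.x.npy` and so on). The `describe_site` helper and the `site-1` name are hypothetical, and the demo writes synthetic arrays to a temporary directory rather than reading the real dataset, so the shapes shown are placeholders and not a statement about what prepare_data.py produces.

```python
import os
import tempfile

import numpy as np


def describe_site(data_root, site_name):
    """Return the array shapes of one site's train/test files."""
    shapes = {}
    for split in ("train", "test"):
        for part in ("x", "y"):
            path = os.path.join(data_root, f"{site_name}.{split}.{part}.npy")
            shapes[f"{split}.{part}"] = np.load(path).shape
    return shapes


# Demo on synthetic files (placeholder shapes, hypothetical site name).
with tempfile.TemporaryDirectory() as tmp:
    np.save(os.path.join(tmp, "site-1.train.x.npy"), np.zeros((30, 13)))
    np.save(os.path.join(tmp, "site-1.train.y.npy"), np.zeros((30, 1)))
    np.save(os.path.join(tmp, "site-1.test.x.npy"), np.zeros((16, 13)))
    np.save(os.path.join(tmp, "site-1.test.y.npy"), np.zeros((16, 1)))
    shapes = describe_site(tmp, "site-1")

print(shapes)
```

To inspect real data, call `describe_site` with the dataset directory and one of the site names used by the simulation.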
Running Job¶
Execute the following command to launch federated logistic regression. This will run in nvflare’s simulation mode.
python job.py