Federated Logistic Regression with Second-Order Newton-Raphson Optimization

This example shows how to implement federated binary classification via logistic regression with second-order Newton-Raphson optimization.

Install NVFLARE and Dependencies

For the complete installation instructions, see Installation.

pip install nvflare

Get the example code from GitHub:

git clone https://github.com/NVIDIA/NVFlare.git

Then switch to the release branch and navigate to the hello-lr directory:

git switch <release branch>
cd examples/hello-world/hello-lr

Install the dependencies:

pip install -r requirements.txt

Code Structure

hello-lr
|
|-- client.py         # client local training script
|-- job.py            # job recipe that defines client and server configurations
|-- download_data.py  # download dataset
|-- prepare_data.py   # prepare data to convert to numpy
|-- requirements.txt  # dependencies

Data

The UCI Heart Disease dataset is used in this example.

All attributes are numeric-valued. Each database has the same instance format. While the databases have 76 raw attributes, only 14 of them are actually used.

The authors of the databases have requested:

"...that any publications resulting from the use of the data include the
names of the principal investigator responsible for the data collection
at each institution.  They would be:

 1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
 2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
 3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
 4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation:
    Robert Detrano, M.D., Ph.D. "

The dataset contains samples from 4 sites, split into training and testing sets as described below:

site           sample split
-------------  -------------------------------------
Cleveland      train: 199 samples, test: 104 samples
Hungary        train: 172 samples, test: 89 samples
Switzerland    train: 30 samples, test: 16 samples
Long Beach V   train: 85 samples, test: 45 samples

The number of features in each sample is 13; the 14th attribute in use, num, is the prediction target.

Features

Variable Name  Role     Type         Demographic  Description                                            Units  Missing Values
-------------  -------  -----------  -----------  -----------------------------------------------------  -----  --------------
age            Feature  Integer      Age                                                                 years  no
sex            Feature  Categorical  Sex                                                                        no
cp             Feature  Categorical                                                                             no
trestbps       Feature  Integer                   resting blood pressure (on admission to the hospital)  mm Hg  no
chol           Feature  Integer                   serum cholesterol                                      mg/dl  no
fbs            Feature  Categorical               fasting blood sugar > 120 mg/dl                               no
restecg        Feature  Categorical                                                                             no
thalach        Feature  Integer                   maximum heart rate achieved                                   no
exang          Feature  Categorical               exercise induced angina                                       no
oldpeak        Feature  Integer                   ST depression induced by exercise relative to rest            no
slope          Feature  Categorical                                                                             no
ca             Feature  Integer                   number of major vessels (0-3) colored by fluoroscopy          yes
thal           Feature  Categorical                                                                             yes
num            Target   Integer                   diagnosis of heart disease                                    no

Model

The Newton-Raphson optimization problem can be described as follows.

In a binary classification task with logistic regression, the probability that a data sample \(x\) is classified as positive is formulated as:

\[p(x) = \sigma(\beta \cdot x + \beta_{0})\]

where \(\sigma(.)\) denotes the sigmoid function. We can incorporate \(\beta_{0}\) and \(\beta\) into a single parameter vector \(\theta = ( \beta_{0}, \beta)\). Let \(d\) be the number of features for each data sample \(x\) and let \(N\) be the number of data samples. We then have the matrix version of the above probability equation:

\[p(X) = \sigma( X \theta )\]

Here \(X\) is the matrix of all samples, with shape \(N \times (d+1)\), having its first column filled with value 1 to account for the intercept \(\theta_{0}\).
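As a minimal sketch of the matrix form above, the probabilities for all samples can be computed in one step with NumPy. The helper names add_intercept and predict_proba and the toy numbers are illustrative assumptions, not part of the example code.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def add_intercept(X):
    """Prepend a column of ones so that theta[0] acts as the intercept."""
    return np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)


def predict_proba(X, theta):
    """Return p(X) = sigmoid(X_aug @ theta) as an (N, 1) column vector."""
    return sigmoid(add_intercept(X) @ theta)


# Toy data: N = 3 samples, d = 2 features (made-up values).
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0]])
theta = np.zeros((X.shape[1] + 1, 1))  # all-zero start, shape (d + 1, 1)

proba = predict_proba(X, theta)
print(proba.shape)
print(proba.ravel())
```

With an all-zero parameter vector every sample gets probability 0.5, which is a convenient sanity check before any training round.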

The goal is to compute the parameter vector \(\theta\) that maximizes the likelihood function below:

\[L_{\theta} = \prod_{i=1}^{N} p(x_i)^{y_i} (1 - p(x_i))^{1-y_i}\]
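Since the logarithm is strictly increasing, the same \(\theta\) maximizes the log-likelihood, which is the form usually differentiated. As a worked step (standard for logistic regression, restated here rather than taken from the example code):

\[\log L_{\theta} = \sum_{i=1}^{N} \left[ y_i \log p(x_i) + (1 - y_i) \log\left(1 - p(x_i)\right) \right]\]

Using \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\), the contribution of sample \(i\) to the derivative with respect to \(\theta\) is \((y_i - p(x_i))\) times the \(i\)-th row of \(X\), so the gradient is a plain sum over samples:

\[\nabla \log L_{\theta} = \sum_{i=1}^{N} (y_i - p(x_i)) \, X_{i}^{T}\]

where \(X_{i}\) denotes the \(i\)-th row of \(X\).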

The Newton-Raphson method optimizes the likelihood function via quadratic approximation. Omitting the maths, the theoretical update formula for parameter vector \(\theta\) is:

\[\theta^{n+1} = \theta^{n} - H_{\theta^{n}}^{-1} \nabla L_{\theta^{n}}\]

where

\[\nabla L_{\theta^{n}} = X^{T}(y - p(X))\]

is the gradient of the log-likelihood \(\log L_{\theta}\) evaluated at \(\theta^{n}\), with \(y\) being the vector of ground-truth labels for the sample data matrix \(X\), and

\[H_{\theta^{n}} = -X^{T} D X\]

is the Hessian of the log-likelihood, with \(D\) a diagonal matrix whose diagonal value at \((i,i)\) is \(D(i,i) = p(x_i) (1 - p(x_i))\).
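The update formula can be tried out on a single machine before going federated. The sketch below is a centralized illustration on synthetic data, with a hypothetical helper newton_step; it is not the example's training code. It solves the linear system instead of forming the inverse explicitly, which is a common numerical choice.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def newton_step(X, y, theta):
    """One Newton-Raphson update; X already contains the intercept column."""
    p = sigmoid(X @ theta)              # (N, 1) probabilities
    gradient = X.T @ (y - p)            # X^T (y - p)
    d_diag = (p * (1.0 - p)).ravel()    # diagonal of D
    hessian = -(X.T * d_diag) @ X       # H = -X^T D X
    # theta - H^{-1} gradient, computed with a linear solve.
    return theta - np.linalg.solve(hessian, gradient)


# Synthetic toy data (made-up, seeded for repeatability).
rng = np.random.default_rng(0)
N, d = 200, 3
features = rng.normal(size=(N, d))
X = np.concatenate((np.ones((N, 1)), features), axis=1)
true_theta = np.array([[0.5], [1.0], [-2.0], [0.75]])
y = (rng.random((N, 1)) < sigmoid(X @ true_theta)).astype(float)

theta = np.zeros((d + 1, 1))
for _ in range(10):
    theta = newton_step(X, y, theta)

# At the maximizer the gradient vanishes, so this should be close to zero.
final_gradient = X.T @ (y - sigmoid(X @ theta))
print(np.abs(final_gradient).max())
```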

In federated Newton-Raphson optimization, each client computes its own gradient \(\nabla L_{\theta^{n}}\) and Hessian \(H_{\theta^{n}}\) based on its local training samples. A server aggregates the gradients and Hessians computed by all clients, and updates the parameter \(\theta\) using the theoretical update formula described above.
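This scheme works because both the gradient and the Hessian are sums over samples, so per-client results add up to the values that pooled data would give. The sketch below checks this on toy data split between two hypothetical clients; local_stats is an illustrative helper, and the code does not use or mirror FLARE's aggregation classes.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def local_stats(X, y, theta):
    """Gradient X^T (y - p) and Hessian -X^T D X for one data shard."""
    p = sigmoid(X @ theta)
    gradient = X.T @ (y - p)
    hessian = -(X.T * (p * (1.0 - p)).ravel()) @ X
    return gradient, hessian


rng = np.random.default_rng(1)
N, d = 60, 2
X = np.concatenate((np.ones((N, 1)), rng.normal(size=(N, d))), axis=1)
y = rng.integers(0, 2, size=(N, 1)).astype(float)
theta = rng.normal(size=(d + 1, 1))

# Two hypothetical clients holding disjoint row ranges of the same data.
shards = [(X[:25], y[:25]), (X[25:], y[25:])]
client_results = [local_stats(Xc, yc, theta) for Xc, yc in shards]

# "Server": element-wise sums of what the clients sent.
agg_gradient = sum(g for g, _ in client_results)
agg_hessian = sum(h for _, h in client_results)

pooled_gradient, pooled_hessian = local_stats(X, y, theta)
print(np.allclose(agg_gradient, pooled_gradient))
print(np.allclose(agg_hessian, pooled_hessian))

# The global update then follows the same formula as in the centralized case.
new_theta = theta - np.linalg.solve(agg_hessian, agg_gradient)
```

Only the (d+1)-vector and the (d+1) x (d+1) matrix leave each client; the raw samples stay local.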

Client Side

On the client side, the local training logic is implemented in client.py.

The implementation is based on the Client API. This allows users to turn a typical centralized training script into a federated client-side local training script with a minimal amount of NVFlare-specific code.

  • During local training, each client receives a copy of the global model from the server using the flare.receive() API. The received global model is an instance of FLModel.

  • A local validation is performed first: validation metrics (accuracy and precision) are computed for the received global model and written through SummaryWriter so they can be viewed in TensorBoard.

  • Each client then computes its gradient and Hessian from its local training data, using the formulas described above. This is implemented in the train_newton_raphson() method. The client sends the results (always in FLModel format) to the server for aggregation using the flare.send() API.

Each client site corresponds to a site listed in the data table above.

The training logic remains similar to the centralized logic: load data, perform training (Newton-Raphson updates), and validate the trained model. The only differences in the federated code relate to interaction with the FL system, such as receiving and sending FLModel objects.

Client code (client.py)
import argparse
import os

import numpy as np
from sklearn.metrics import accuracy_score, precision_score

import nvflare.client as flare
from nvflare.apis.fl_constant import FLMetaKey
from nvflare.app_common.abstract.fl_model import FLModel, ParamsType
from nvflare.app_common.np.constants import NPConstants
from nvflare.client.tracking import SummaryWriter


def parse_arguments():
    """
    Parse command line args for client side training.
    """
    parser = argparse.ArgumentParser(description="Federated Logistic Regression with Second-Order Newton Raphson")

    parser.add_argument("--data_root", type=str, help="Path to load client side data.")

    return parser.parse_args()


def load_data(data_root, site_name):
    """
    Load the data for each client.

    Args:
        data_root: root directory storing client site data.
        site_name: client site name
    Returns:
        A dict with client site training and validation data.
    """
    print("loading data for client {} from: {}".format(site_name, data_root))
    train_x_path = os.path.join(data_root, "{}.train.x.npy".format(site_name))
    train_y_path = os.path.join(data_root, "{}.train.y.npy".format(site_name))
    test_x_path = os.path.join(data_root, "{}.test.x.npy".format(site_name))
    test_y_path = os.path.join(data_root, "{}.test.y.npy".format(site_name))

    train_X = np.load(train_x_path)
    train_y = np.load(train_y_path)
    valid_X = np.load(test_x_path)
    valid_y = np.load(test_y_path)

    return {"train_X": train_X, "train_y": train_y, "valid_X": valid_X, "valid_y": valid_y}


def sigmoid(inp):
    return 1.0 / (1.0 + np.exp(-inp))


def train_newton_raphson(data, theta):
    """
    Compute gradient and hessian on local data
    based on parameters received from server.

    """
    train_X = data["train_X"]
    train_y = data["train_y"]

    # Add the intercept: prepend a column of 1s
    # as the first column of train_X
    train_X = np.concatenate((np.ones((train_X.shape[0], 1)), train_X), axis=1)

    # Compute probabilities from current weights
    proba = sigmoid(np.dot(train_X, theta))

    # The gradient is X^T . (y - proba)
    gradient = np.dot(train_X.T, (train_y - proba))

    # The value sent as "hessian" is X^T . D . X, where D is the
    # diagonal matrix with values proba * (1 - proba). This is the
    # negated Hessian (-H) of the log-likelihood.
    D = np.diag((proba * (1 - proba))[:, 0])
    hessian = train_X.T.dot(D).dot(train_X)

    return {"gradient": gradient, "hessian": hessian}


def validate(data, theta):
    """
    Performs local validation.
    Computes accuracy and precision scores.

    """
    valid_X = data["valid_X"]
    valid_y = data["valid_y"]

    # Add the intercept: prepend a column of 1s
    # as the first column of valid_X
    valid_X = np.concatenate((np.ones((valid_X.shape[0], 1)), valid_X), axis=1)

    # Compute probabilities from current weights
    proba = sigmoid(np.dot(valid_X, theta))

    # Rounding the probabilities thresholds them at 0.5
    return {
        "accuracy": accuracy_score(valid_y.flatten(), proba.flatten().round()),
        "precision": precision_score(valid_y.flatten(), proba.flatten().round()),
    }


def main():
    """
    This is a typical ML training loop,
    augmented with Flare Client API to
    perform local training on each client
    side and send result to server.

    """
    args = parse_arguments()
    data_root = args.data_root

    flare.init()

    site_name = flare.get_site_name()
    print("training on client site: {}".format(site_name))

    # Load client site data.
    data = load_data(data_root, site_name)

    # Get metric summary writer for TensorBoard
    writer = SummaryWriter()

    while flare.is_running():
        # Receive global model (FLModel) from server.
        global_model = flare.receive()

        print(f"\n{global_model=}")

        curr_round = global_model.current_round
        print("current_round={}".format(curr_round))

        print(f"[ROUND {curr_round}] - client site: {site_name}, received global model: {global_model}")

        # Get the weights, aka parameter theta for
        # logistic regression.
        global_weights = global_model.params[NPConstants.NUMPY_KEY]
        print(f"[ROUND {curr_round}] - global model weights: {global_weights}")

        # Local validation before training
        print(f"[ROUND {curr_round}] - start validation of global model on client: {site_name}")
        validation_scores = validate(data, global_weights)
        print(f"[ROUND {curr_round}] - validation metric scores on client: {site_name} = {validation_scores}")

        # Write validation metrics to TensorBoard
        writer.add_scalar(f"{site_name}/accuracy", validation_scores["accuracy"], curr_round)
        writer.add_scalar(f"{site_name}/precision", validation_scores["precision"], curr_round)

        # Local training
        print(f"[ROUND {curr_round}] - start local training on client site: {site_name}")
        result_dict = train_newton_raphson(data, theta=global_weights)

        # Send result to server for aggregation.
        result_model = FLModel(params=result_dict, params_type=ParamsType.FULL)
        result_model.meta[FLMetaKey.NUM_STEPS_CURRENT_ROUND] = data["train_X"].shape[0]

        print(
            f"[ROUND {curr_round}] - local training from client: {site_name} complete,"
            f" sending results to server: {result_model}"
        )

        flare.send(result_model)


if __name__ == "__main__":
    main()
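train_newton_raphson() materializes D as an N x N matrix with np.diag. For the site sizes in this example that is inexpensive, but the same product can be formed by scaling the columns of X^T instead. The sketch below compares the two on random data; both function names are hypothetical. It also illustrates the shape convention suggested by the [:, 0] indexing in the client code: the probabilities, and hence theta, are treated as 2-D column vectors.

```python
import numpy as np


def hessian_with_diag(X, proba):
    """X^T D X with D materialized as an (N, N) diagonal matrix."""
    D = np.diag((proba * (1 - proba))[:, 0])
    return X.T.dot(D).dot(X)


def hessian_with_broadcast(X, proba):
    """Same product, scaling the columns of X^T instead of building D."""
    weights = (proba * (1 - proba))[:, 0]  # shape (N,)
    return (X.T * weights) @ X


rng = np.random.default_rng(2)
N, d = 50, 13  # 13 features, as in this example; the values are random
X = np.concatenate((np.ones((N, 1)), rng.normal(size=(N, d))), axis=1)
theta = rng.normal(scale=0.1, size=(d + 1, 1))  # column vector, shape (d + 1, 1)
proba = 1.0 / (1.0 + np.exp(-(X @ theta)))      # shape (N, 1)

h_diag = hessian_with_diag(X, proba)
h_fast = hessian_with_broadcast(X, proba)
print(h_diag.shape, np.allclose(h_diag, h_fast))
```

The broadcast form uses O(N d) extra memory instead of O(N^2), which only starts to matter for sites with far more samples than the ones listed above.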

Server Side

We leverage FLARE's built-in logistic regression workflow with the Newton-Raphson method. The server-side FedAvg class is located at nvflare.app_common.workflows.lr.fedavg.FedAvgLR.
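The internals of FedAvgLR are not reproduced here. As a rough sketch of what a server-side Newton-Raphson aggregation step can look like, assume that clients send gradient = X^T (y - p) and hessian = X^T D X (the positive form computed in client.py), that the server sums them, and that the damping_factor passed in job.py scales the Newton step. These are assumptions for illustration; the function names aggregate_and_update and client_stats and the small ridge term epsilon are hypothetical.

```python
import numpy as np


def aggregate_and_update(theta, client_results, damping_factor=0.8, epsilon=1e-6):
    """Hypothetical server step: sum client statistics, take a damped Newton step.

    Each entry of client_results is a dict with keys "gradient" (X^T (y - p))
    and "hessian" (X^T D X, the negated Hessian of the log-likelihood).
    """
    total_gradient = sum(r["gradient"] for r in client_results)
    total_hessian = sum(r["hessian"] for r in client_results)
    # Assumption: a small ridge keeps the linear solve well-conditioned.
    ridge = epsilon * np.eye(total_hessian.shape[0])
    step = np.linalg.solve(total_hessian + ridge, total_gradient)
    # With hessian = -H, theta - H^{-1} gradient becomes theta + hessian^{-1} gradient.
    return theta + damping_factor * step


def client_stats(X, y, theta):
    """Stand-in for what one client would compute locally."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return {"gradient": X.T @ (y - p), "hessian": (X.T * (p * (1 - p)).ravel()) @ X}


rng = np.random.default_rng(3)
true_theta = np.array([[-0.5], [1.5], [-1.0]])
clients = []
for n in (80, 120):  # two made-up sites with different sample counts
    X = np.concatenate((np.ones((n, 1)), rng.normal(size=(n, 2))), axis=1)
    y = (rng.random((n, 1)) < 1.0 / (1.0 + np.exp(-(X @ true_theta)))).astype(float)
    clients.append((X, y))

theta = np.zeros((3, 1))
for _ in range(30):  # 30 simulated rounds
    results = [client_stats(Xc, yc, theta) for Xc, yc in clients]
    theta = aggregate_and_update(theta, results)

final = [client_stats(Xc, yc, theta)["gradient"] for Xc, yc in clients]
grad_norm = np.linalg.norm(sum(final))
print(grad_norm)
```

A damping factor below 1 shortens each step, trading a few extra rounds for more stable progress when the quadratic approximation is poor far from the optimum.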

Job

Job Recipe (job.py)
import argparse

from nvflare.app_common.np.recipes.lr.fedavg import FedAvgLrRecipe
from nvflare.recipe import SimEnv

# from nvflare.recipe import PocEnv


def define_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_clients", type=int, default=2)
    parser.add_argument("--num_rounds", type=int, default=5)
    parser.add_argument("--data_root", type=str, default="/tmp/flare/dataset/heart_disease_data")

    return parser.parse_args()


def main():
    args = define_parser()

    n_clients = args.n_clients
    num_rounds = args.num_rounds
    data_root = args.data_root

    print("number of clients =", n_clients)
    recipe = FedAvgLrRecipe(
        min_clients=n_clients,
        num_rounds=num_rounds,
        damping_factor=0.8,
        num_features=13,  # Model is created internally based on num_features
        # For pre-trained weights: initial_ckpt="/server/path/to/lr_model.npy",
        train_script="client.py",
        train_args=f"--data_root {data_root}",
    )
    env = SimEnv(num_clients=n_clients, num_threads=n_clients)
    # env = PocEnv(num_clients=n_clients)
    run = recipe.execute(env)
    w = run.get_result()
    print("result location =", w)


if __name__ == "__main__":
    main()

Download and prepare data

Execute the following scripts:

python download_data.py
python prepare_data.py

This will download the heart disease dataset and prepare it under:

/tmp/flare/dataset/heart_disease_data/
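client.py expects four NumPy files per site in this directory, named <site_name>.train.x.npy, <site_name>.train.y.npy, <site_name>.test.x.npy and <site_name>.test.y.npy. A quick way to check what was prepared is sketched below. The helper name describe_site_files is hypothetical, and "site-1" is an assumed site name; the actual names depend on how the job names its clients.

```python
import os

import numpy as np


def describe_site_files(data_root, site_name):
    """Return {file name: array shape, or None if the file is missing}."""
    shapes = {}
    for split in ("train", "test"):
        for part in ("x", "y"):
            name = "{}.{}.{}.npy".format(site_name, split, part)
            path = os.path.join(data_root, name)
            shapes[name] = np.load(path).shape if os.path.exists(path) else None
    return shapes


if __name__ == "__main__":
    # "site-1" is an assumed site name; replace it with one used by your job.
    root = "/tmp/flare/dataset/heart_disease_data"
    for name, shape in describe_site_files(root, "site-1").items():
        print(name, shape)
```

A feature file should have 13 columns, matching num_features=13 in job.py.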

Running Job

Execute the following command to launch federated logistic regression. This will run in nvflare’s simulation mode.

python job.py