Federated Logistic Regression with Second-Order Newton-Raphson Optimization

This example shows how to implement federated binary classification via logistic regression with second-order Newton-Raphson optimization.

Install NVFLARE and Dependencies

For complete installation instructions, see Installation.

pip install nvflare

Get the example code from GitHub:

git clone https://github.com/NVIDIA/NVFlare.git

Then, from the root of the cloned repository, switch to the release branch and navigate to the hello-lr directory:

git switch <release branch>
cd examples/hello-world/hello-lr

Install the dependencies:

pip install -r requirements.txt

Code Structure

hello-lr
|
|-- client.py         # client local training script
|-- job.py            # job recipe that defines client and server configurations
|-- download_data.py  # download dataset
|-- prepare_data.py   # convert the downloaded data to NumPy arrays
|-- requirements.txt  # dependencies

Data

The UCI Heart Disease dataset is used in this example.

All attributes are numeric-valued. Each database has the same instance format. While the databases have 76 raw attributes, only 14 of them are actually used.

The authors of the databases have requested:

"...that any publications resulting from the use of the data include the
names of the principal investigator responsible for the data collection
at each institution.  They would be:

 1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
 2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
 3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
 4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation:
    Robert Detrano, M.D., Ph.D. "

The dataset contains samples from 4 sites, split into training and testing sets as follows:

site	sample split
Cleveland	train: 199 samples, test: 104 samples
Hungary	train: 172 samples, test: 89 samples
Switzerland	train: 30 samples, test: 16 samples
Long Beach V	train: 85 samples, test: 45 samples

The number of features in each sample is 13.

Features

Variable Name	Role	Type	Demographic	Description	Units	Missing Values
age	Feature	Integer	Age		years	no
sex	Feature	Categorical	Sex			no
cp	Feature	Categorical				no
trestbps	Feature	Integer		resting blood pressure (on admission to the hospital)	mm Hg	no
chol	Feature	Integer		serum cholesterol	mg/dl	no
fbs	Feature	Categorical		fasting blood sugar > 120 mg/dl		no
restecg	Feature	Categorical				no
thalach	Feature	Integer		maximum heart rate achieved		no
exang	Feature	Categorical		exercise induced angina		no
oldpeak	Feature	Integer		ST depression induced by exercise relative to rest		no
slope	Feature	Categorical				no
ca	Feature	Integer		number of major vessels (0-3) colored by fluoroscopy		yes
thal	Feature	Categorical				yes
num	Target	Integer		diagnosis of heart disease		no

Model

The Newton-Raphson optimization problem can be described as follows.

In a binary classification task with logistic regression, the probability of a data sample $x$ being classified as positive is formulated as:

\[p(x) = \sigma(\beta \cdot x + \beta_{0})\]

where $\sigma(.)$ denotes the sigmoid function. We can incorporate $\beta_{0}$ and $\beta$ into a single parameter vector $\theta = ( \beta_{0}, \beta)$. Let $d$ be the number of features for each data sample $x$ and let $N$ be the number of data samples. We then have the matrix version of the above probability equation:

\[p(X) = \sigma( X \theta )\]

Here $X$ is the matrix of all samples, with shape $N \times (d+1)$, having its first column filled with value 1 to account for the intercept $\theta_{0}$.
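As a minimal sketch of the matrix form above, the snippet below builds $X$ by prepending a column of ones to a small feature matrix and evaluates $p(X) = \sigma(X\theta)$. The feature values are synthetic (not the heart disease data), and the helper name `add_intercept` is an illustrative assumption rather than part of the example code.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def add_intercept(features):
    """Prepend a column of ones so that theta[0] acts as the intercept."""
    ones = np.ones((features.shape[0], 1))
    return np.concatenate((ones, features), axis=1)


# Hypothetical data: N = 4 samples, d = 3 features.
features = np.arange(12, dtype=float).reshape(4, 3)
X = add_intercept(features)          # shape (N, d + 1)
theta = np.zeros((X.shape[1], 1))    # shape (d + 1, 1)

proba = sigmoid(X @ theta)           # shape (N, 1)
print(X.shape, proba.ravel())        # with theta = 0, every probability is 0.5
```

With an all-zero parameter vector every sample gets probability 0.5, which matches starting from an uninformed model.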

The goal is to compute the parameter vector $\theta$ that maximizes the likelihood function below:

\[L_{\theta} = \prod_{i=1}^{N} p(x_i)^{y_i} (1 - p(x_i))^{1-y_i}\]

The Newton-Raphson method optimizes the likelihood function via quadratic approximation. Omitting the maths, the theoretical update formula for parameter vector $\theta$ is:

\[\theta^{n+1} = \theta^{n} - H_{\theta^{n}}^{-1} \nabla L_{\theta^{n}}\]

where

\[\nabla L_{\theta^{n}} = X^{T}(y - p(X))\]

is the gradient of the log-likelihood function (maximizing the log-likelihood is equivalent to maximizing $L_{\theta}$), with $y$ being the vector of ground truth for sample data matrix $X$, and

\[H_{\theta^{n}} = -X^{T} D X\]

is the Hessian of the log-likelihood function, with $D$ a diagonal matrix whose diagonal value at $(i,i)$ is $D(i,i) = p(x_i) (1 - p(x_i))$.
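The update formula can be tried out on a single machine before any federation is involved. The following is a minimal NumPy sketch, assuming synthetic data generated from a made-up `true_theta`; the function name `newton_step` is hypothetical and this is not the code used by the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def newton_step(X, y, theta):
    """One Newton-Raphson update for logistic regression.

    X: (N, d + 1) with a leading column of ones, y: (N, 1), theta: (d + 1, 1).
    """
    p = sigmoid(X @ theta)
    gradient = X.T @ (y - p)                 # X^T (y - p(X))
    D = np.diag((p * (1 - p)).ravel())
    hessian = -X.T @ D @ X                   # H = -X^T D X
    # theta - H^{-1} gradient, solved as a linear system instead of inverting H
    return theta - np.linalg.solve(hessian, gradient)


# Synthetic data, an assumption for illustration only.
rng = np.random.default_rng(0)
N, d = 200, 3
X = np.concatenate((np.ones((N, 1)), rng.normal(size=(N, d))), axis=1)
true_theta = np.array([[0.5], [1.0], [-2.0], [0.3]])
y = (rng.uniform(size=(N, 1)) < sigmoid(X @ true_theta)).astype(float)

theta = np.zeros((d + 1, 1))
for _ in range(10):
    theta = newton_step(X, y, theta)

# At a maximum of the likelihood the gradient vanishes.
grad_norm = np.linalg.norm(X.T @ (y - sigmoid(X @ theta)))
print(theta.ravel(), grad_norm)
```

Solving the linear system with `np.linalg.solve` avoids forming $H^{-1}$ explicitly, which is generally the more stable choice.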

In federated Newton-Raphson optimization, each client will compute its own gradient $\nabla L_{\theta^{n}}$ and Hessian $H_{\theta^{n}}$ based on local training samples. A server will aggregate the gradients and Hessians computed from all clients, and perform the update of parameter $\theta$ based on the theoretical update formula described above.
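Because both the gradient and $X^{T} D X$ are sums over samples, adding up the per-client results reproduces what a single machine holding all the data would compute. The sketch below illustrates this with synthetic data and a hypothetical three-client split. The names `local_stats` and `server_update` are assumptions, and the way `damping_factor` scales the step is an assumption about how a damped update could look, not a description of the built-in FLARE workflow.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def local_stats(X, y, theta):
    """What a client computes: gradient and X^T D X on its own samples."""
    p = sigmoid(X @ theta)
    gradient = X.T @ (y - p)
    xtdx = X.T @ (X * (p * (1 - p)))  # X^T D X without materializing D
    return gradient, xtdx


def server_update(theta, client_results, damping_factor=1.0):
    """Sum client contributions, then apply a (damped) Newton-Raphson step."""
    gradient = sum(g for g, _ in client_results)
    xtdx = sum(h for _, h in client_results)
    # H = -X^T D X, so theta - H^{-1} grad = theta + (X^T D X)^{-1} grad
    return theta + damping_factor * np.linalg.solve(xtdx, gradient)


rng = np.random.default_rng(1)
N, d = 300, 3
X = np.concatenate((np.ones((N, 1)), rng.normal(size=(N, d))), axis=1)
true_theta = np.array([[0.2], [1.0], [-1.0], [0.5]])  # made-up values
y = (rng.uniform(size=(N, 1)) < sigmoid(X @ true_theta)).astype(float)
theta = np.zeros((d + 1, 1))

# Hypothetical split of the samples across three clients of unequal size.
splits = np.split(np.arange(N), [50, 170])
results = [local_stats(X[idx], y[idx], theta) for idx in splits]

federated_theta = server_update(theta, results)
centralized_theta = server_update(theta, [local_stats(X, y, theta)])
print(np.allclose(federated_theta, centralized_theta))
```

Only the $(d+1)$-vector and the $(d+1) \times (d+1)$ matrix leave each client in this sketch; the raw samples stay local.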

Client Side

On the client side, the local training logic is implemented in client.py.

The implementation is based on the Client API. This allows users to turn a typical centralized training script into a federated client-side local training script with minimal NVFlare-specific code.

During local training, each client receives a copy of the global model from the server using the flare.receive() API. The received global model is an instance of FLModel. A local validation is performed first, in which validation metrics (accuracy and precision) are computed on the local test split and written through SummaryWriter. Each client then computes its gradient and Hessian from its local training data, using the theoretical formulas described above. This is implemented in the train_newton_raphson() function. Finally, each client sends the computed results (always in FLModel format) to the server for aggregation using the flare.send() API.

Each client site corresponds to a site listed in the data table above.

The training logic remains similar to the centralized logic: load data, perform training (Newton-Raphson updates), and validate the trained model. The only differences in the federated code relate to interaction with the FL system, such as receiving and sending FLModel objects.

Client code (client.py)

import argparse
import os

import numpy as np
from sklearn.metrics import accuracy_score, precision_score

import nvflare.client as flare
from nvflare.apis.fl_constant import FLMetaKey
from nvflare.app_common.abstract.fl_model import FLModel, ParamsType
from nvflare.app_common.np.constants import NPConstants
from nvflare.client.tracking import SummaryWriter


def parse_arguments():
    """
    Parse command line args for client side training.
    """
    parser = argparse.ArgumentParser(description="Federated Logistic Regression with Second-Order Newton Raphson")

    parser.add_argument("--data_root", type=str, help="Path to load client side data.")

    return parser.parse_args()


def load_data(data_root, site_name):
    """
    Load the data for each client.

    Args:
        data_root: root directory storing client site data.
        site_name: client site name
    Returns:
        A dict with client site training and validation data.
    """
    print("loading data for client {} from: {}".format(site_name, data_root))
    train_x_path = os.path.join(data_root, "{}.train.x.npy".format(site_name))
    train_y_path = os.path.join(data_root, "{}.train.y.npy".format(site_name))
    test_x_path = os.path.join(data_root, "{}.test.x.npy".format(site_name))
    test_y_path = os.path.join(data_root, "{}.test.y.npy".format(site_name))

    train_X = np.load(train_x_path)
    train_y = np.load(train_y_path)
    valid_X = np.load(test_x_path)
    valid_y = np.load(test_y_path)

    return {"train_X": train_X, "train_y": train_y, "valid_X": valid_X, "valid_y": valid_y}


def sigmoid(inp):
    return 1.0 / (1.0 + np.exp(-inp))


def train_newton_raphson(data, theta):
    """
    Compute gradient and hessian on local data
    based on parameters received from server.

    """
    train_X = data["train_X"]
    train_y = data["train_y"]

    # Add intercept: prepend a column of 1s
    # to train_X
    train_X = np.concatenate((np.ones((train_X.shape[0], 1)), train_X), axis=1)

    # Compute probabilities from current weights
    proba = sigmoid(np.dot(train_X, theta))

    # The gradient is X^T . (y - proba)
    gradient = np.dot(train_X.T, (train_y - proba))

    # Compute X^T . D . X, where D is the diagonal matrix with
    # values proba * (1 - proba). This is the negated Hessian
    # (-H) in the notation of the Model section.
    D = np.diag((proba * (1 - proba))[:, 0])
    hessian = train_X.T.dot(D).dot(train_X)

    return {"gradient": gradient, "hessian": hessian}


def validate(data, theta):
    """
    Performs local validation.
    Computes accuracy and precision scores.

    """
    valid_X = data["valid_X"]
    valid_y = data["valid_y"]

    # Add intercept: prepend a column of 1s
    # to valid_X
    valid_X = np.concatenate((np.ones((valid_X.shape[0], 1)), valid_X), axis=1)

    # Compute probabilities from current weights
    proba = sigmoid(np.dot(valid_X, theta))

    return {
        "accuracy": accuracy_score(valid_y.flatten(), proba.flatten().round()),
        "precision": precision_score(valid_y.flatten(), proba.flatten().round()),
    }


def main():
    """
    This is a typical ML training loop,
    augmented with Flare Client API to
    perform local training on each client
    side and send result to server.

    """
    args = parse_arguments()
    data_root = args.data_root

    flare.init()

    site_name = flare.get_site_name()
    print("training on client site: {}".format(site_name))

    # Load client site data.
    data = load_data(data_root, site_name)

    # Get metric summary writer for TensorBoard
    writer = SummaryWriter()

    while flare.is_running():
        # Receive global model (FLModel) from server.
        global_model = flare.receive()

        print(f"\n{global_model=}")

        curr_round = global_model.current_round
        print("current_round={}".format(curr_round))

        print(f"[ROUND {curr_round}] - client site: {site_name}, received global model: {global_model}")

        # Get the weights, aka parameter theta for
        # logistic regression.
        global_weights = global_model.params[NPConstants.NUMPY_KEY]
        print(f"[ROUND {curr_round}] - global model weights: {global_weights}")

        # Local validation before training
        print(f"[ROUND {curr_round}] - start validation of global model on client: {site_name}")
        validation_scores = validate(data, global_weights)
        print(f"[ROUND {curr_round}] - validation metric scores on client: {site_name} = {validation_scores}")

        # Write validation metrics to TensorBoard
        writer.add_scalar(f"{site_name}/accuracy", validation_scores["accuracy"], curr_round)
        writer.add_scalar(f"{site_name}/precision", validation_scores["precision"], curr_round)

        # Local training
        print(f"[ROUND {curr_round}] - start local training on client site: {site_name}")
        result_dict = train_newton_raphson(data, theta=global_weights)

        # Send result to server for aggregation.
        result_model = FLModel(params=result_dict, params_type=ParamsType.FULL)
        result_model.meta[FLMetaKey.NUM_STEPS_CURRENT_ROUND] = data["train_X"].shape[0]

        print(
            f"[ROUND {curr_round}] - local training from client: {site_name} complete,"
            f" sending results to server: {result_model}"
        )

        flare.send(result_model)


if __name__ == "__main__":
    main()

Server Side

The server side uses FLARE's built-in logistic regression workflow with the Newton-Raphson method. The server-side FedAvg class is located at nvflare.app_common.workflows.lr.fedavg.FedAvgLR.

Job

Job Recipe (job.py)

import argparse

from nvflare.app_common.np.recipes.lr.fedavg import FedAvgLrRecipe
from nvflare.recipe import SimEnv

# from nvflare.recipe import PocEnv


def define_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_clients", type=int, default=2)
    parser.add_argument("--num_rounds", type=int, default=5)
    parser.add_argument("--data_root", type=str, default="/tmp/flare/dataset/heart_disease_data")

    return parser.parse_args()


def main():
    args = define_parser()

    n_clients = args.n_clients
    num_rounds = args.num_rounds
    data_root = args.data_root

    print("number of clients =", n_clients)
    recipe = FedAvgLrRecipe(
        min_clients=n_clients,
        num_rounds=num_rounds,
        damping_factor=0.8,
        num_features=13,  # Model is created internally based on num_features
        # For pre-trained weights: initial_ckpt="/server/path/to/lr_model.npy",
        train_script="client.py",
        train_args=f"--data_root {data_root}",
    )
    env = SimEnv(num_clients=n_clients, num_threads=n_clients)
    # env = PocEnv(num_clients=n_clients)
    run = recipe.execute(env)
    w = run.get_result()
    print("result location =", w)


if __name__ == "__main__":
    main()

Download and prepare data

Execute the following scripts:

python download_data.py
python prepare_data.py

This will download the heart disease dataset under:

/tmp/flare/dataset/heart_disease_data/
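Before launching the job, it can help to confirm that the prepared arrays are where client.py expects them. The sketch below assumes the file naming pattern read by load_data() in client.py (`<site>.train.x.npy`, `<site>.train.y.npy`, and the matching test files) and the 13 features stated above. The function name `check_site_data` is hypothetical.

```python
import os

import numpy as np


def check_site_data(data_root, site_name, num_features=13):
    """Load one site's arrays and verify that their shapes agree."""
    shapes = {}
    for split in ("train", "test"):
        x = np.load(os.path.join(data_root, f"{site_name}.{split}.x.npy"))
        y = np.load(os.path.join(data_root, f"{site_name}.{split}.y.npy"))
        assert x.shape[0] == y.shape[0], f"{split}: sample count mismatch"
        assert x.shape[1] == num_features, f"{split}: expected {num_features} features"
        shapes[split] = (x.shape, y.shape)
    return shapes
```

For instance, `check_site_data("/tmp/flare/dataset/heart_disease_data", "site-1")` would return the train and test shapes for a site named `site-1`; that site name is an assumption and should be replaced with whatever name the prepared files actually use.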

Running Job

Execute the following command to launch federated logistic regression. This will run in nvflare’s simulation mode.

python job.py