########################## NVFlare XGBoost User Guide ########################## Overview ======== NVFlare supports federated training with XGBoost. It provides the following advantages over doing the training natively with XGBoost: - Secure training with Homomorphic Encryption (HE) - Lifecycle management of XGBoost processes. - Reliable messaging which can overcome network glitches - Training over complicated networks with relays. It supports federated training in the following 4 modes: 1. Row split without encryption 2. Column split without encryption 3. Row split with HE (Requires at least 3 clients. With 2 clients, the other client's histogram can be deduced.) 4. Column split with HE When running with NVFlare, all the GRPC connections in XGBoost are local and the messages are forwarded to other clients through NVFlare's CellNet communication. The local GRPC ports are selected automatically by NVFlare. The encryption is handled in XGBoost by encryption plugins, which are external components that can be installed at runtime. Prerequisites ============= Required Python Packages ------------------------ NVFlare 2.5.0 or above, .. code-block:: bash pip install nvflare~=2.5.0 XGBoost 2.2 or above, which can be installed from the binary build using this command, .. code-block:: bash pip install https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/federated-secure/xgboost-2.2.0.dev0%2B4601688195708f7c31fcceeb0e0ac735e7311e61-py3-none-manylinux_2_28_x86_64.whl .. note:: The xgboost build environment may depend on specific numpy versions that require Python < 3.12. or in case you need to get the most current build of XGBoost, .. code-block:: bash pip install https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/federated-secure/`curl -s https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/federated-secure/meta.json | grep -o 'xgboost-2\.2.*whl'|sed -e 's/+/%2B/'` ``TenSEAL`` package is needed for horizontal secure training, .. code-block:: bash pip install tenseal ``ipcl_python`` package is required for vertical secure training if **nvflare** plugin is used. This package is not needed if **cuda_paillier** plugin is used. .. code-block:: bash pip install ipcl-python This package is only available for Python 3.8 on PyPI. For other versions of python, it needs to be installed from github, .. code-block:: bash pip install git+https://github.com/intel/pailliercryptolib_python.git@development System Environments ------------------- To support secure training, several homomorphic encryption libraries are used. Those libraries require Intel CPU or Nvidia GPU. Linux is the preferred OS. It's tested extensively under Ubuntu 22.4. The following docker image is recommended for GPU training: :: nvcr.io/nvidia/pytorch:24.03-py3 Building Encryption Plugins --------------------------- The secure training requires encryption plugins, which need to be built from the source code for your specific environment. To build the plugins, check out the NVFlare source code from https://github.com/NVIDIA/NVFlare and following the instructions in :github_nvflare_link:`this document. ` .. _xgb_provisioning: NVFlare Provisioning -------------------- For horizontal secure training, the NVFlare system must be provisioned with homomorphic encryption context. The HEBuilder in ``project.yml`` is used to achieve this. An example configuration can be found at :github_nvflare_link:`secure_project.yml `. This is a snippet of the ``secure_project.yml`` file with the HEBuilder: .. code-block:: yaml api_version: 3 name: secure_project description: NVIDIA FLARE sample project yaml file for CIFAR-10 example participants: ... builders: - path: nvflare.lighter.impl.workspace.WorkspaceBuilder args: template_file: master_template.yml - path: nvflare.lighter.impl.template.TemplateBuilder - path: nvflare.lighter.impl.static_file.StaticFileBuilder args: config_folder: config overseer_agent: path: nvflare.ha.dummy_overseer_agent.DummyOverseerAgent overseer_exists: false args: sp_end_point: localhost:8102:8103 heartbeat_interval: 6 - path: nvflare.lighter.impl.he.HEBuilder args: poly_modulus_degree: 8192 coeff_mod_bit_sizes: [60, 40, 40] scale_bits: 40 scheme: CKKS - path: nvflare.lighter.impl.cert.CertBuilder - path: nvflare.lighter.impl.signature.SignatureBuilder Data Preparation ================ Data must be properly formatted for federated XGBoost training based on split mode (row or column). For horizontal (row-split) training, the datasets on all clients must share the same columns. For vertical (column-split) training, the datasets on all clients contain different columns, but must share overlapping rows. For more details on vertical split preprocessing, refer to the :github_nvflare_link:`Vertical XGBoost Example `. XGBoost Plugin Configuration ============================ XGBoost requires an encryption plugin to handle secure training. - **cuda_paillier**: The default plugin. This plugin uses GPU for cryptographic operations. - **nvflare**: This plugin forwards data locally to NVFlare process for encryption. .. note:: All clients must use the same plugin. When different plugins are used in different clients, the behavior of federated XGBoost is undetermined, which can cause the job to crash. The **cuda_paillier** plugin requires NVIDIA GPUs that support compute capability 7.0 or higher. Also, CUDA 12.2 or 12.4 must be installed. Please refer to https://developer.nvidia.com/cuda-gpus for more information. The two included plugins are only different in vertical secure training. For horizontal secure training, both plugins work exactly the same by forwarding the data to NVFlare for encryption. Here are plugin configurations needed for each training mode. Vertical (Non-secure) --------------------- No plugin is needed. Horizontal (Non-secure) ----------------------- No plugin is needed. Vertical Secure --------------- Both plugins can be used for vertical secure training. The default cuda_paillier plugin is preferred because it uses GPU for faster cryptographic operations. .. note:: **cuda_paillier** plugin requires NVIDIA GPUs that support compute capability 7.0 or higher. Please refer to https://developer.nvidia.com/cuda-gpus for more information. If you see the following errors in the log, it means either no GPU is detected or the GPU does not meet the requirements: :: CUDA runtime API error no kernel image is available for execution on the device at line 241 in file /my_home/nvflare-internal/processor/src/cuda-plugin/paillier.h 2024-07-01 12:19:15,683 - SimulatorClientRunner - ERROR - run_client_thread error: EOFError: In this case, the nvflare plugin can be used to perform encryption on CPUs, which requires the ipcl-python package. The plugin can be configured in the ``local/resources.json`` file on clients: .. code-block:: json { "federated_plugin": { "name": "nvflare", "path": "/opt/libs/libnvflare.so" } } Where **name** is the plugin name and **path** is the full path of the plugin including the library file name. The **path** is optional, the default value is the library distributed with NVFlare for the plugin. The following environment variables can be used to override the values in the JSON, .. code-block:: bash export NVFLARE_XGB_PLUGIN_NAME=nvflare export NVFLARE_XGB_PLUGIN_PATH=/opt/libs/libnvflare.so .. note:: When running with the NVFlare simulator, the plugin must be configured using environment variables, as it does not support resources.json. Horizontal Secure ----------------- The plugin setup is the same as vertical secure. This mode requires the tenseal package for all plugins. The provisioning of NVFlare systems must include tenseal context. See :ref:`xgb_provisioning` for details. For simulator, the tenseal context generated by provisioning needs to be copied to the startup folder, ``simulator_workspace/startup/client_context.tenseal`` For example, .. code-block:: bash nvflare provision -p secure_project.yml -w /tmp/poc_workspace mkdir -p /tmp/simulator_workspace/startup cp /tmp/poc_workspace/example_project/prod_00/site-1/startup/client_context.tenseal /tmp/simulator_workspace/startup The server_context.tenseal file is not needed. Job Configuration ================= .. _secure_xgboost_controller: Controller ---------- On the server side, the following controller must be configured in workflows, ``nvflare.app_opt.xgboost.histogram_based_v2.fed_controller.XGBFedController`` Even though the XGBoost training is performed on clients, the parameters are configured on the server so all clients share the same configuration. XGBoost parameters are defined here, https://xgboost.readthedocs.io/en/stable/python/python_intro.html#setting-parameters - **num_rounds**: Number of training rounds. - **data_split_mode**: Same as XGBoost data_split_mode parameter, 0 for row-split, 1 for column-split. - **secure_training**: If true, XGBoost will train in secure mode using the plugin. - **xgb_params**: The training parameters defined in this dict are passed to XGBoost as **params**, the boost paramter. - **xgb_options**: This dict contains other optional parameters passed to XGBoost. Currently, only **early_stopping_rounds** is supported. - **client_ranks**: A dict that maps client name to rank. Executor -------- On the client side, the following executor must be configured in executors, ``nvflare.app_opt.xgboost.histogram_based_v2.fed_executor.FedXGBHistogramExecutor`` Only one parameter is required for executor, - **data_loader_id**: The component ID of Data Loader Data Loader ----------- On the client side, a data loader must be configured in the components. The CSVDataLoader can be used if the data is pre-processed. For example, .. code-block:: json { "id": "dataloader", "path": "nvflare.app_opt.xgboost.histogram_based_v2.csv_data_loader.CSVDataLoader", "args": { "folder": "/opt/dataset/vertical_xgb_data" } } If the data requires any special processing, a custom loader can be implemented. The loader must implement the XGBDataLoader interface. Job Example =========== Vertical Training ----------------- Here are the configuration files for a vertical secure training job. If encryption is not needed, just change the ``secure_training`` arg to false. .. code-block:: json :caption: config_fed_server.json { "format_version": 2, "num_rounds": 3, "workflows": [ { "id": "xgb_controller", "path": "nvflare.app_opt.xgboost.histogram_based_v2.fed_controller.XGBFedController", "args": { "num_rounds": "{num_rounds}", "data_split_mode": 1, "secure_training": true, "xgb_options": { "early_stopping_rounds": 2 }, "xgb_params": { "max_depth": 3, "eta": 0.1, "objective": "binary:logistic", "eval_metric": "auc", "tree_method": "hist", "nthread": 1 }, "client_ranks": { "site-1": 0, "site-2": 1 } } } ] } .. code-block:: json :caption: config_fed_client.json { "format_version": 2, "executors": [ { "tasks": [ "config", "start" ], "executor": { "id": "Executor", "path": "nvflare.app_opt.xgboost.histogram_based_v2.fed_executor.FedXGBHistogramExecutor", "args": { "data_loader_id": "dataloader" } } } ], "components": [ { "id": "dataloader", "path": "nvflare.app_opt.xgboost.histogram_based_v2.csv_data_loader.CSVDataLoader", "args": { "folder": "/opt/dataset/vertical_xgb_data" } } ] } Horizontal Training ------------------- The configuration for horizontal training is the same as vertical except ``data_split_mode`` is 0 and the data loader must point to horizontal split data. .. code-block:: json :caption: config_fed_server.json { "format_version": 2, "num_rounds": 3, "workflows": [ { "id": "xgb_controller", "path": "nvflare.app_opt.xgboost.histogram_based_v2.fed_controller.XGBFedController", "args": { "num_rounds": "{num_rounds}", "data_split_mode": 0, "secure_training": true, "xgb_options": { "early_stopping_rounds": 2 }, "xgb_params": { "max_depth": 3, "eta": 0.1, "objective": "binary:logistic", "eval_metric": "auc", "tree_method": "hist", "nthread": 1 }, "client_ranks": { "site-1": 0, "site-2": 1 }, "in_process": true } } ] } .. code-block:: json :caption: config_fed_client.json { "format_version": 2, "executors": [ { "tasks": [ "config", "start" ], "executor": { "id": "Executor", "path": "nvflare.app_opt.xgboost.histogram_based_v2.fed_executor.FedXGBHistogramExecutor", "args": { "data_loader_id": "dataloader", "in_process": true } } } ], "components": [ { "id": "dataloader", "path": "nvflare.app_opt.xgboost.histogram_based_v2.csv_data_loader.CSVDataLoader", "args": { "folder": "/data/xgboost_secure/dataset/horizontal_xgb_data" } } ] } Pre-Trained Models ================== To continue training using a pre-trained model, the model can be placed in the job folder with the path and name of ``custom/model.json``. Every site should share the same ``model.json``. The result of previous training with the same dataset can be used as the input model. When a pre-trained model is detected, NVFlare prints following line in the log: :: INFO - Pre-trained model is used: /tmp/nvflare/poc/example_project/prod_00/site-1/startup/../996ac44f-e784-4117-b365-24548f1c490d/app_site-1/custom/model.json Performance Tuning ================== Timeouts -------- For secure training, the HE operations are very slow. If a large dataset is used, several timeout values need to be adjusted. The XGBoost messages are transferred between client and server using Reliable Messages (:class:`ReliableMessage`). The following parameters in executor arguments control the timeout behavior: - **per_msg_timeout**: Timeout in seconds for each message. - **tx_timeout**: Timeout for the whole transaction in seconds. This is the total time to wait for a response, accounting for all retry attempts. .. code-block:: json :caption: config_fed_client.json { "format_version": 2, "executors": [ { "tasks": [ "config", "start" ], "executor": { "id": "Executor", "path": "nvflare.app_opt.xgboost.histogram_based_v2.fed_executor.FedXGBHistogramExecutor", "args": { "data_loader_id": "dataloader", "per_msg_timeout": 300.0, "tx_timeout": 900.0, "in_process": true } } } ], ... } Number of Clients ----------------- The default configuration can only handle 20 clients. This parameter needs to be adjusted if more clients are involved in the training: .. code-block:: json :caption: config_fed_client.json { "format_version": 2, "num_rounds": 3, "rm_max_request_workers": 100, ... }