Tabular Data Federated Statistics¶
In this example, we show how to generate federated statistics for tabular data that can be represented as a pandas DataFrame.
NVIDIA FLARE Installation¶
For the complete installation instructions, see installation.
pip install nvflare
Get the example code from GitHub:
git clone https://github.com/NVIDIA/NVFlare.git
Then, from the cloned NVFlare directory, switch to the release branch and navigate to the hello-tabular-stats directory:
git switch <release branch>
cd examples/hello-world/hello-tabular-stats
Install the dependency¶
pip install -r requirements.txt
Install Optional Quantile Dependency – fastdigest¶
If you intend to calculate quantiles, you need to install fastdigest.
Skip this step if you don’t need quantile statistics.
pip install fastdigest==0.4.0
On Ubuntu, you might get the following error:
Cargo, the Rust package manager, is not installed or is not on PATH.
This package requires Rust and Cargo to compile extensions. Install it through
the system's package manager or via https://rustup.rs/
Checking for Rust toolchain....
This is because fastdigest (or one of its dependencies) requires Rust and Cargo to build.
To install Rust and Cargo on your Ubuntu system, run the following script, which installs Rust using rustup:
./install_cargo.sh
Then install fastdigest again:
pip install fastdigest==0.4.0
Code Structure¶
hello-tabular-stats
|
├── client.py # client-side local statistics generator
├── job.py # job recipe that defines client and server configurations
├── prepare_data.py # utilities to download data
├── install_cargo.sh # script to install Rust and Cargo, only needed if you plan to install the quantile dependency
├── requirements.txt # dependencies
└── demo
    └── visualization.ipynb # visualization notebook
Data¶
In this example, we use the UCI (University of California, Irvine) [adult dataset](https://archive.ics.uci.edu/dataset/2/adult).
The original dataset already contains “training” and “test” sets. Here we simply assume that the training and test sets belong to different clients, so we assign the training data to site-1 and the test data to site-2.
We use the data utility to download the UCI dataset into a separate directory for each client under /tmp/nvflare/df_stats/data.
Please note that the UCI’s website may experience occasional downtime.
python prepare_data.py
It should print something like:
prepare data for data directory /tmp/nvflare/df_stats/data
download to /tmp/nvflare/df_stats/data/site-1/data.csv
skip empty line
download to /tmp/nvflare/df_stats/data/site-2/data.csv
skip empty line
done with prepare data
Client Code¶
The client code is a local statistics generator. The generator class, AdultStatistics, implements the Statistics spec.
from typing import Optional

import pandas as pd

from nvflare.apis.fl_context import FLContext
from nvflare.app_opt.statistics.df.df_core_statistics import DFStatisticsCore


class AdultStatistics(DFStatisticsCore):
    def __init__(self, filename, data_root_dir="/tmp/nvflare/df_stats/data"):
        super().__init__()
        self.data_root_dir = data_root_dir
        self.filename = filename
        self.data: Optional[dict[str, pd.DataFrame]] = None
        self.data_features = [
            "Age",
            "Workclass",
            "fnlwgt",
            "Education",
            "Education-Num",
            "Marital Status",
            "Occupation",
            "Relationship",
            "Race",
            "Sex",
            "Capital Gain",
            "Capital Loss",
            "Hours per week",
            "Country",
            "Target",
        ]

        # the original dataset has no header,
        # we will use the adult.train dataset for site-1, the adult.test dataset for site-2
        # the adult.test dataset has an incorrectly formatted row at the 1st line, we will skip it.
        self.skip_rows = {
            "site-1": [],
            "site-2": [0],
        }

    def load_data(self, fl_ctx: FLContext) -> dict[str, pd.DataFrame]:
        client_name = fl_ctx.get_identity_name()
        self.log_info(fl_ctx, f"load data for client {client_name}")
        try:
            skip_rows = self.skip_rows[client_name]
            data_path = f"{self.data_root_dir}/{fl_ctx.get_identity_name()}/{self.filename}"
            # example of loading data from CSV
            df: pd.DataFrame = pd.read_csv(
                data_path, names=self.data_features, sep=r"\s*,\s*", skiprows=skip_rows, engine="python", na_values="?"
            )
            train = df.sample(frac=0.8, random_state=200)  # random state is a seed value
            test = df.drop(train.index).sample(frac=1.0)

            self.log_info(fl_ctx, f"load data done for client {client_name}")
            return {"train": train, "test": test}

        except Exception as e:
            raise Exception(f"Load data for client {client_name} failed! {e}")

    def initialize(self, fl_ctx: FLContext):
        self.data = self.load_data(fl_ctx)
Many of the functions needed for tabular statistics are already implemented in DFStatisticsCore.
In the AdultStatistics class, we only need to provide the following:
data_features – the feature names, hard-coded here as a list.
load_data() -> dict[str, pd.DataFrame] – a method that returns a dictionary of pandas DataFrames, one for each data source (“train”, “test”). The data is read from:
data_path = <data_root_dir>/<site-name>/<filename>
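The parsing and splitting steps inside load_data() can be tried on a few in-memory rows. The snippet below is a minimal sketch and is not part of the example code: the sample rows and the shortened three-column feature list are made up for illustration. Only the pd.read_csv arguments and the 80/20 split mirror AdultStatistics.

```python
from io import StringIO

import pandas as pd

# Made-up sample in the style of adult.test: a malformed first line, then data rows.
raw = """|1x3 Cross validator
25, Private, 226802
38, ?, 89814
28, Local-gov, 336951
44, Private, 160323
18, ?, 103497
"""

features = ["Age", "Workclass", "fnlwgt"]  # shortened, hypothetical feature list

df = pd.read_csv(
    StringIO(raw),
    names=features,    # the file has no header row
    sep=r"\s*,\s*",    # drop the spaces around each comma
    skiprows=[0],      # skip the malformed first line (as done for site-2)
    engine="python",   # regex separators need the Python parsing engine
    na_values="?",     # "?" marks a missing value
)

train = df.sample(frac=0.8, random_state=200)  # 80% of the rows, fixed seed
test = df.drop(train.index).sample(frac=1.0)   # the remaining rows, shuffled

data = {"train": train, "test": test}
print({name: len(frame) for name, frame in data.items()})
```

Because test is built from the rows left over after dropping train.index, the two DataFrames never share a row.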
Server Code¶
The server-side aggregation is already implemented in the Statistics Controller.
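Conceptually, statistics such as count, sum, and mean can be combined from per-site values without any site sharing its rows. The sketch below only illustrates that idea. It is not the Statistics Controller implementation, and the function name and per-site numbers are hypothetical.

```python
def aggregate_mean(local_stats):
    """Combine per-site counts and sums into a global count, sum, and mean."""
    global_count = sum(s["count"] for s in local_stats.values())
    global_sum = sum(s["sum"] for s in local_stats.values())
    return {"count": global_count, "sum": global_sum, "mean": global_sum / global_count}


# Hypothetical per-site values for one numeric feature
local_stats = {
    "site-1": {"count": 4, "sum": 120.0},  # local mean 30.0
    "site-2": {"count": 6, "sum": 240.0},  # local mean 40.0
}

global_stats = aggregate_mean(local_stats)
print(global_stats)  # {'count': 10, 'sum': 360.0, 'mean': 36.0}
```

Note that the global mean (36.0) comes from the pooled sum and count. Averaging the two local means would give 35.0, which is wrong when sites hold different numbers of rows.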
Job Recipe¶
The job is defined via a recipe, and we run it in the simulation execution environment (SimEnv).
import argparse

from client import AdultStatistics

from nvflare.recipe.fedstats import FedStatsRecipe
from nvflare.recipe.sim_env import SimEnv


def define_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("-n", "--n_clients", type=int, default=2)
    parser.add_argument("-d", "--data_root_dir", type=str, nargs="?", default="/tmp/nvflare/df_stats/data")
    parser.add_argument("-o", "--stats_output_path", type=str, nargs="?", default="statistics/adults_stats.json")

    return parser.parse_args()


def main():
    args = define_parser()

    n_clients = args.n_clients
    data_root_dir = args.data_root_dir
    output_path = args.stats_output_path

    statistic_configs = {
        "count": {},
        "mean": {},
        "sum": {},
        "stddev": {},
        "histogram": {"*": {"bins": 20}, "Age": {"bins": 20, "range": [0, 100]}},
        "quantile": {"*": [0.1, 0.5, 0.9]},
    }
    # define local stats generator
    df_stats_generator = AdultStatistics(filename="data.csv", data_root_dir=data_root_dir)

    sites = [f"site-{i + 1}" for i in range(n_clients)]

    recipe = FedStatsRecipe(
        name="stats_df",
        stats_output_path=output_path,
        sites=sites,
        statistic_configs=statistic_configs,
        stats_generator=df_stats_generator,
    )

    env = SimEnv(clients=sites, num_threads=n_clients)
    recipe.execute(env=env)


if __name__ == "__main__":
    main()
The statistics configuration determines which statistics to generate. Here is the configuration used in this example:
statistic_configs = {
    "count": {},
    "mean": {},
    "sum": {},
    "stddev": {},
    "histogram": {"*": {"bins": 20}, "Age": {"bins": 20, "range": [0, 100]}},
    "quantile": {"*": [0.1, 0.5, 0.9]},
}
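One reason to give a feature such as Age an explicit range is that histograms computed with identical bin edges can be added bin by bin. The NumPy sketch below illustrates this general technique with made-up ages. It is an assumption about the idea behind the configuration, not a description of how NVFlare computes histograms internally.

```python
import numpy as np

# Made-up ages held by two sites
site_1_ages = np.array([23, 35, 47, 51, 62])
site_2_ages = np.array([19, 35, 38, 70, 88, 90])

# Mirrors "Age": {"bins": 20, "range": [0, 100]} from statistic_configs
bins, value_range = 20, (0, 100)

counts_1, edges = np.histogram(site_1_ages, bins=bins, range=value_range)
counts_2, _ = np.histogram(site_2_ages, bins=bins, range=value_range)

# Both sites use the same edges, so the counts can be added element-wise.
global_counts = counts_1 + counts_2

# The result matches a histogram computed over the pooled data.
pooled = np.concatenate([site_1_ages, site_2_ages])
pooled_counts, _ = np.histogram(pooled, bins=bins, range=value_range)
print(np.array_equal(global_counts, pooled_counts))
```

Without a shared range, each site would derive its bin edges from its own minimum and maximum, and the per-site counts could not be summed directly.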
Run Job¶
From a terminal, run the job:
python job.py
You should see something like:
2025-09-03 20:42:03,392 - INFO - save statistics result to persistence store
2025-09-03 20:42:03,392 - INFO - job dir = /tmp/nvflare/simulation/stats_df/server/simulate_job
2025-09-03 20:42:03,395 - INFO - trying to save data to /tmp/nvflare/simulation/stats_df/server/simulate_job/statistics/adults_stats.json
2025-09-03 20:42:03,395 - INFO - file /tmp/nvflare/simulation/stats_df/server/simulate_job/statistics/adults_stats.json saved
The results are stored in the workspace under /tmp/nvflare:
/tmp/nvflare/simulation/stats_df/server/simulate_job/statistics/adults_stats.json
Visualization¶
Because the results are in JSON format, they can be easily visualized with pandas DataFrames and plots. Download and copy the output adults_stats.json file to the demo directory, then run the Jupyter notebook visualization.ipynb.
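For a quick look at the results outside the notebook, nested JSON can be flattened into a pandas table. The sketch below uses a made-up JSON string as a stand-in. The nesting of the real adults_stats.json may differ, so treat the sample layout as hypothetical. pd.json_normalize is used because it flattens nested dictionaries whatever their layout.

```python
import json

import pandas as pd

# Made-up stand-in for adults_stats.json; the real file's nesting may differ.
sample_json = '{"count": {"site-1": {"train": {"Age": 100}}, "site-2": {"train": {"Age": 50}}}}'

stats = json.loads(sample_json)

# Flatten the nested dicts into a single row with "/"-joined column names.
flat = pd.json_normalize(stats, sep="/")

# Turn the single wide row into a two-column table: key path and value.
table = flat.T.reset_index()
table.columns = ["path", "value"]
print(table)
```

To apply the same steps to the real output, replace json.loads(sample_json) with json.load() on the opened adults_stats.json file.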