nvflare.app_opt.pt.recipes package

Submodules

Module contents

class CyclicRecipe(*, name: str = 'cyclic', model: Any | dict[str, Any] | None = None, initial_ckpt: str | None = None, num_rounds: int = 2, min_clients: int = 2, train_script: str, train_args: str = '', launch_external_process: bool = False, command: str = 'python3 -u', framework: FrameworkType = FrameworkType.PYTORCH, server_expected_format: ExchangeFormat = ExchangeFormat.NUMPY, params_transfer_type: TransferType = TransferType.FULL, server_memory_gc_rounds: int = 1, client_memory_gc_rounds: int = 0, cuda_empty_cache: bool = False, task_assignment_timeout: int = 10, shutdown_timeout: float = 0.0, server_config_overrides: Dict[str, Any] | None = None, client_config_overrides: Dict[str, Any] | None = None)

Bases: CyclicRecipe

PyTorch-specific Cyclic federated learning recipe.

Recipe parameters, including train_args and config override dictionaries, must never contain actual secret values. Read secrets from site environment variables or mounted files; references are supported only where documented in nvflare.recipe.secrets.

Parameters:

name – Name identifier for the federated learning job. Defaults to “cyclic”.
model – Starting model object to begin training. Can be:
    - nn.Module instance
    - Dict config: {"path": "module.ClassName", "args": {"param": value}}
    - PTModel instance (already wrapped)
    - None: no initial model
initial_ckpt – Path to a pre-trained checkpoint file. Can be:
    - Relative path: the file is bundled into the job's custom/ directory.
    - Absolute path: treated as a server-side path and used as-is at runtime.
    Note: PyTorch requires model when using initial_ckpt (for architecture).
num_rounds – Number of complete training rounds to execute. Defaults to 2.
min_clients – Minimum number of clients required to participate. Must be >= 2.
train_script – Path to the client training script to execute.
train_args – Additional command-line arguments to pass to the training script.
launch_external_process – Whether to run training in a separate process. Defaults to False.
command – Shell command to execute the training script. Defaults to “python3 -u”.
framework – ML framework type for compatibility. Defaults to FrameworkType.PYTORCH.
server_expected_format – Data exchange format between server and clients.
params_transfer_type – Method for transferring model parameters.
server_memory_gc_rounds – Run memory cleanup every N rounds on server. Defaults to 1.
task_assignment_timeout – Seconds to wait for the assigned client to request its task.
shutdown_timeout – Seconds to wait for an external client process during shutdown.
server_config_overrides – Advanced shallow overrides for the server controller.
client_config_overrides – Advanced shallow overrides for the client script runner.
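
As a rough illustration of the cyclic workflow, the sketch below relays a model from one client to the next for num_rounds rounds. It is a framework-free toy simulation: local_train, the site names, and the fixed relay order are assumptions made for illustration, not NVFlare APIs or NVFlare's scheduling logic.

```python
def local_train(weights, client_data, lr=0.1):
    """Hypothetical local update: nudge each weight toward the client's data mean."""
    target = sum(client_data) / len(client_data)
    return [w + lr * (target - w) for w in weights]


def run_cyclic(initial_weights, clients, num_rounds=2):
    """Relay the model through every client once per round, in dict order."""
    weights = list(initial_weights)
    history = []  # (round index, client name) in the order the model was handed over
    for round_idx in range(num_rounds):
        for name, data in clients.items():
            # Each client continues from the weights produced by the previous client.
            weights = local_train(weights, data)
            history.append((round_idx, name))
    return weights, history


clients = {"site-1": [1.0, 3.0], "site-2": [5.0, 7.0]}
final_weights, history = run_cyclic([0.0], clients, num_rounds=2)
```

Unlike averaging-based recipes, no aggregation step appears here: the model that leaves the last client in the last round is the final result.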

This is the base class of a recipe. Recipes are implemented by jobs. A concrete recipe must provide the job that implements it.

Security contract – no secrets in recipe parameters:

Recipe parameters (train_args, task_args, eval_args, per_site_config, config overrides, dicts passed to add_client_config/add_server_config, exec params, etc.) can be written in clear text into generated job configuration. These parameters and their nested values must never contain actual passwords, API keys, tokens, private keys, or other credentials. Instead, read secrets from site environment variables or mounted secret files inside your code, or pass a placeholder created with nvflare.recipe.secrets.secret_ref() or nvflare.recipe.secrets.secret_file_ref() at a supported runtime boundary. See nvflare.recipe.secrets for the supported parameter locations.

Before export or run, recipes scan their parameters with heuristics and emit nvflare.recipe.secrets.PotentialSecretWarning when a value looks like an actual secret. The scan is best-effort: absence of a warning does not prove a parameter is safe to share.

Parameters: job – the job that implements the recipe.
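
One way to follow the security contract is to resolve secrets inside the training or evaluation script itself. The sketch below is a minimal, standard-library-only helper under that assumption; load_secret and the environment variable names are hypothetical, and it does not use nvflare.recipe.secrets.

```python
import os
from pathlib import Path


def load_secret(env_var, file_env_var=None):
    """Return a secret from an environment variable, else from a mounted file.

    env_var: name of the variable holding the secret value (hypothetical name).
    file_env_var: optional name of a variable holding the path to a secret file.
    """
    value = os.environ.get(env_var)
    if value:
        return value
    if file_env_var:
        secret_path = os.environ.get(file_env_var)
        if secret_path and Path(secret_path).is_file():
            # Mounted secret files often end with a trailing newline.
            return Path(secret_path).read_text(encoding="utf-8").strip()
    raise RuntimeError(f"Secret not found: set {env_var} on this site")
```

With this pattern, train_args carries only non-sensitive options such as "--epochs 5", and the client script calls load_secret at runtime on each site.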

class FedAvgRecipe(*, name: str = 'fedavg', model: Any | dict[str, Any] | None = None, initial_ckpt: str | None = None, min_clients: int, num_rounds: int = 2, train_script: str, train_args: str = '', aggregator: Aggregator | None = None, aggregator_data_kind: DataKind | None = DataKind.WEIGHTS, launch_external_process: bool = False, command: str = 'python3 -u', server_expected_format: ExchangeFormat = ExchangeFormat.NUMPY, params_transfer_type: TransferType = TransferType.FULL, model_persistor: ModelPersistor | None = None, model_locator: ModelLocator | None = None, per_site_config: dict[str, dict] | None = None, launch_once: bool = True, shutdown_timeout: float = 0.0, key_metric: str = 'accuracy', stop_cond: str | None = None, patience: int | None = None, best_model_filename: str | None = None, save_filename: str | None = None, exclude_vars: str | None = None, aggregation_weights: dict[str, float] | None = None, server_memory_gc_rounds: int = 0, enable_tensor_disk_offload: bool = False, client_memory_gc_rounds: int = 0, cuda_empty_cache: bool = False)

Bases: FedAvgRecipe

A recipe for implementing Federated Averaging (FedAvg) for PyTorch.

Recipe parameters, including train_args and nested per_site_config values, must never contain actual secrets. Read secrets from site environment variables or mounted files; references are supported only where documented in nvflare.recipe.secrets.

FedAvg is a fundamental federated learning algorithm that aggregates model updates from multiple clients by computing a weighted average based on the amount of local training data. This recipe sets up a complete federated learning workflow with memory-efficient InTime aggregation.

The recipe configures:
    - A federated job with initial model (optional)
    - FedAvg controller with InTime aggregation for memory efficiency
    - Optional early stopping and model selection
    - Script runners for client-side training execution

Parameters:

name – Name of the federated learning job. Defaults to “fedavg”.
model – Initial model to start federated training with. Can be:
    - nn.Module instance
    - Dict config: {"class_path": "module.ClassName", "args": {"param": value}}
    - None: no initial model
initial_ckpt – Absolute path to a pre-trained checkpoint file. The file may not exist locally as it could be on the server. Used to load initial weights. Note: PyTorch requires model when using initial_ckpt (for architecture).
min_clients – Minimum number of clients required to start a training round.
num_rounds – Number of federated training rounds to execute. Defaults to 2.
train_script – Path to the training script that will be executed on each client.
train_args – Command line arguments to pass to the training script.
aggregator – Custom aggregator (ModelAggregator) for combining client model updates. Must implement accept_model(), aggregate_model(), reset_stats() methods. If None, uses built-in memory-efficient weighted averaging.
aggregator_data_kind – Data kind to use for the aggregator. When a custom aggregator declares expected_data_kind, the declaration must match. Defaults to DataKind.WEIGHTS.
launch_external_process (bool) – Whether to launch the script in external process. Defaults to False.
command (str) – If launch_external_process=True, command to run script (prepended to script). Defaults to “python3 -u”.
server_expected_format (str) – What format to exchange the parameters between server and client.
params_transfer_type (str) – How to transfer the parameters. DIFF enables automatic difference calculation for full-model client results. A client’s FLModel.params_type remains authoritative. Defaults to TransferType.FULL.
model_persistor – Custom model persistor. If None, PTFileModelPersistor will be used.
model_locator – Custom model locator. If None, PTFileModelLocator will be used.
per_site_config – Deprecated constructor form. New code should call set_per_site_config(recipe, config) immediately after construction.
launch_once – Whether external process is launched once or per task. Defaults to True.
shutdown_timeout – Seconds to wait before shutdown. Defaults to 0.0.
key_metric – Metric used to determine if the model is globally best. Defaults to “accuracy”.
stop_cond – Early stopping condition based on metric. String literal in the format of ‘<key> <op> <value>’ (e.g. “accuracy >= 80”). If None, early stopping is disabled.
patience – Number of rounds with no improvement after which FL will be stopped.
best_model_filename – Filename for saving the best model. If unset, the default PyTorch persistor uses DefaultCheckpointFileName.BEST_GLOBAL_MODEL.
save_filename – Deprecated alias for best_model_filename. If both are specified, they must match.
exclude_vars – Regex pattern for variables to exclude from aggregation.
aggregation_weights – Per-client aggregation weights dict. Defaults to equal weights.
enable_tensor_disk_offload – Enable disk-backed tensor offload for incoming streamed payloads.
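
To make the stop_cond format and the patience parameter concrete, here is a small sketch of how a condition such as "accuracy >= 80" could be parsed and how a patience counter behaves. It is an illustration only: the set of supported operators, the helper names, and the "higher is better" assumption are mine, not taken from NVFlare.

```python
import operator

# Assumed operator set; the operators NVFlare actually accepts may differ.
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
}


def parse_stop_cond(stop_cond):
    """Split '<key> <op> <value>' into (key, comparison function, float value)."""
    key, op, value = stop_cond.split()
    if op not in _OPS:
        raise ValueError(f"Unsupported operator: {op}")
    return key, _OPS[op], float(value)


def should_stop(stop_cond, metrics):
    """Return True when the metric named in stop_cond satisfies the condition."""
    if stop_cond is None:  # early stopping disabled
        return False
    key, compare, target = parse_stop_cond(stop_cond)
    return key in metrics and compare(metrics[key], target)


def round_patience_exhausted(scores, patience):
    """Return the round index where `patience` rounds passed without improvement."""
    best, stale = None, 0
    for round_idx, score in enumerate(scores):
        if best is None or score > best:  # assumes a higher metric is better
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return round_idx
    return None
```

For example, should_stop("accuracy >= 80", {"accuracy": 85.0}) is true, while a score sequence that keeps improving never exhausts its patience.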

Example

Basic usage with early stopping:

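```python
from nvflare.app_opt.pt.recipes import FedAvgRecipe

# pretrained_model is an nn.Module instance defined elsewhere.
recipe = FedAvgRecipe(
    name="my_fedavg_job",
    model=pretrained_model,
    min_clients=2,
    num_rounds=10,
    train_script="client.py",
    train_args="--epochs 5 --batch_size 32",
    stop_cond="accuracy >= 95",
    patience=3,
)
```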

Note

This recipe uses InTime (streaming) aggregation for memory efficiency - each client result is aggregated immediately upon receipt rather than collecting all results first. Memory usage is constant regardless of the number of clients.
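
The idea behind streaming aggregation can be sketched in plain Python: keep one running weighted sum per parameter and a running total weight, and fold each client result in as it arrives. The class below is an illustrative stand-in with hypothetical names, not NVFlare's aggregator; parameters are plain lists of floats rather than tensors.

```python
class StreamingWeightedAverage:
    """Accumulate a weighted average without storing individual client results."""

    def __init__(self):
        self._sums = {}           # parameter name -> running weighted sum
        self._total_weight = 0.0  # sum of all client weights seen so far

    def accept(self, params, weight):
        """Fold one client's parameters in; `weight` is e.g. its sample count."""
        for name, values in params.items():
            acc = self._sums.setdefault(name, [0.0] * len(values))
            for i, value in enumerate(values):
                acc[i] += weight * value
        self._total_weight += weight

    def result(self):
        """Return the weighted average of everything accepted so far."""
        if self._total_weight == 0:
            raise ValueError("No client results were accepted")
        return {
            name: [value / self._total_weight for value in acc]
            for name, acc in self._sums.items()
        }


agg = StreamingWeightedAverage()
agg.accept({"w": [1.0, 2.0]}, weight=100)  # client with 100 samples
agg.accept({"w": [3.0, 4.0]}, weight=300)  # client with 300 samples
averaged = agg.result()
```

Only the running sums are retained, so the memory held by the aggregator depends on the model size and not on how many clients report.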

class FedEvalRecipe(*, name: str = 'eval', model: Any | Dict[str, Any], eval_ckpt: str, min_clients: int, eval_script: str, eval_args: str = '', launch_external_process: bool = False, command: str = 'python3 -u', server_expected_format: ExchangeFormat = ExchangeFormat.NUMPY, validation_timeout: int = 6000, per_site_config: Dict[str, Dict] | None = None, client_memory_gc_rounds: int = 0, cuda_empty_cache: bool = False)

Bases: Recipe

A recipe for federated evaluation of a PyTorch model across multiple sites.

This recipe sets up a federated evaluation workflow where a global model from the server is sent to multiple clients for evaluation. Each client evaluates the model on their local data and reports metrics back to the server.

The recipe configures:
    - A federated job with an initial model to evaluate
    - EvalController for coordinating federated evaluation across clients
    - Script runners for client-side evaluation execution

Parameters:

name – Name of the federated evaluation job. Defaults to “eval”.
model – Model structure to evaluate. Can be:
    - An instantiated nn.Module (e.g., Net())
    - A dict config: {"class_path": "module.ClassName", "args": {…}}
eval_ckpt – Absolute path to pre-trained checkpoint file (.pt, .pth, etc.). Required for evaluation - specifies which weights to evaluate. The file may not exist locally (server-side path).
min_clients – Minimum number of clients required to start evaluation.
eval_script – Path to the evaluation script that will be executed on each client.
eval_args – Command line arguments to pass to the evaluation script. The string is stored in the job definition and must not contain actual secret values; see nvflare.recipe.secrets for safe runtime references. Defaults to “”.
launch_external_process – Whether to launch the script in external process. Defaults to False.
command – If launch_external_process=True, command to run script (prepended to script). Defaults to “python3 -u”.
server_expected_format – What format to exchange the parameters between server and client. Defaults to ExchangeFormat.NUMPY.
validation_timeout – Timeout for evaluation task in seconds. Defaults to 6000.
per_site_config – Deprecated constructor form of per-site configuration. New code should call set_per_site_config(recipe, config) immediately after construction. Each config dict can contain optional overrides: eval_script, eval_args, launch_external_process, command, server_expected_format. Values are stored in the job definition and must not contain actual secret values. If not provided, the same configuration will be used for all clients. Defaults to None.
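
The override semantics described for per_site_config can be pictured as a shallow merge of recipe-level defaults with a site's overrides. The sketch below illustrates that reading only; resolve_site_config, the default values, and the rejection of unknown keys are assumptions, not NVFlare's implementation.

```python
# Override keys listed in the per_site_config description above.
ALLOWED_KEYS = {
    "eval_script",
    "eval_args",
    "launch_external_process",
    "command",
    "server_expected_format",
}


def resolve_site_config(defaults, per_site_config, site_name):
    """Return the effective settings for one site (defaults plus its overrides)."""
    overrides = (per_site_config or {}).get(site_name, {})
    unknown = set(overrides) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"Unsupported override keys for {site_name}: {sorted(unknown)}")
    # Shallow merge: a site's value replaces the default for the same key.
    return {**defaults, **overrides}


defaults = {
    "eval_script": "client.py",
    "eval_args": "--batch_size 32",
    "launch_external_process": False,
    "command": "python3 -u",
}
per_site = {"site-1": {"eval_args": "--batch_size 64"}}

site1 = resolve_site_config(defaults, per_site, "site-1")
site2 = resolve_site_config(defaults, per_site, "site-2")
```

Here site-1 evaluates with a batch size of 64, while site-2, which has no entry, falls back to the shared configuration.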

Example

Basic usage with model instance:

```python
from nvflare.app_opt.pt.recipes.fedeval import FedEvalRecipe
from model import Net

recipe = FedEvalRecipe(
    name="eval_job",
    model=Net(),
    eval_ckpt="/path/to/pretrained_model.pt",
    min_clients=2,
    eval_script="client.py",
    eval_args="--batch_size 32",
)
```

Using dict config:

```python
recipe = FedEvalRecipe(
    name="eval_job",
    model={"class_path": "my_module.Net", "args": {"num_classes": 10}},
    eval_ckpt="/path/to/pretrained_model.pt",
    min_clients=2,
    eval_script="client.py",
)
```

class FedOptRecipe(*, name: str = 'fedopt', model: Any | dict[str, Any] | None = None, initial_ckpt: str | None = None, min_clients: int, num_rounds: int = 2, train_script: str, train_args: str = '', aggregator: Aggregator | None = None, launch_external_process: bool = False, command: str = 'python3 -u', server_expected_format: ExchangeFormat = ExchangeFormat.NUMPY, device: str | None = None, source_model: str = 'model', optimizer_args: dict | None = None, lr_scheduler_args: dict | None = None, server_memory_gc_rounds: int = 1, client_memory_gc_rounds: int = 0, cuda_empty_cache: bool = False, enable_tensor_disk_offload: bool = False)

Bases: Recipe

A recipe for implementing Federated Optimization (FedOpt) in NVFlare.

Recipe parameters, including train_args, optimizer_args, and lr_scheduler_args, must never contain actual secret values. Read secrets from site environment variables or mounted files; references are supported only where documented in nvflare.recipe.secrets.

FedOpt is a federated learning algorithm that optimizes the global model using a server-side optimizer and learning rate scheduler. After each round, the global model is updated using the specified optimizer and learning rate scheduler. The algorithm is proposed in Reddi et al. “Adaptive Federated Optimization.” arXiv preprint arXiv:2003.00295 (2020).

Note: FedOpt requires client weight differences and DataKind.WEIGHT_DIFF in the aggregator.
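
The server-side update can be sketched without any framework: each client reports the difference between its trained weights and the global weights, the server averages those differences, and the negated average is fed to an optimizer as a pseudo-gradient. The code below is a minimal illustration using SGD with momentum on plain lists; the function and class names are hypothetical and this is not NVFlare's implementation.

```python
def weight_diff(local_weights, global_weights):
    """Client side: difference between trained weights and the received global weights."""
    return [local - glob for local, glob in zip(local_weights, global_weights)]


def average_diffs(diffs, num_samples):
    """Server side: sample-weighted average of the client weight differences."""
    total = sum(num_samples)
    size = len(diffs[0])
    return [sum(n * d[i] for d, n in zip(diffs, num_samples)) / total for i in range(size)]


class ServerSGD:
    """SGD with momentum, applied to the negated average diff (pseudo-gradient)."""

    def __init__(self, lr=1.0, momentum=0.0):
        self.lr = lr
        self.momentum = momentum
        self._buf = None  # momentum buffer, created on the first step

    def step(self, global_weights, avg_diff):
        grad = [-d for d in avg_diff]
        if self._buf is None:
            self._buf = list(grad)
        else:
            self._buf = [self.momentum * b + g for b, g in zip(self._buf, grad)]
        return [w - self.lr * b for w, b in zip(global_weights, self._buf)]


global_w = [1.0, 2.0]
diffs = [weight_diff([2.0, 2.0], global_w), weight_diff([4.0, 4.0], global_w)]
avg = average_diffs(diffs, num_samples=[1, 3])
new_global = ServerSGD(lr=1.0, momentum=0.0).step(global_w, avg)
```

With lr=1.0 and no momentum the step simply adds the averaged difference to the global weights, which matches plain weighted averaging; other optimizers or a non-zero momentum change how strongly past rounds influence the update.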

Parameters:

name – Name of the federated learning job. Defaults to “fedopt”.
model – Initial model to start federated training with. Required in practice, although the signature default is None: FedOpt needs a model for the server-side optimizer to work. Can be:
    - nn.Module instance
    - Dict config: {"class_path": "module.ClassName", "args": {"param": value}}
initial_ckpt – Absolute path to a pre-trained checkpoint file. The file may not exist locally as it could be on the server. Used to load initial weights. Note: PyTorch requires model when using initial_ckpt (for architecture).
min_clients – Minimum number of clients required to start a training round.
num_rounds – Number of federated training rounds to execute. Defaults to 2.
train_script – Path to the training script that will be executed on each client.
train_args – Command line arguments to pass to the training script.
aggregator – Aggregator for combining client updates. If None, uses InTimeAccumulateWeightedAggregator with expected_data_kind=DataKind.WEIGHT_DIFF.
launch_external_process (bool) – Whether to launch the script in external process. Defaults to False.
command (str) – If launch_external_process=True, command to run script (prepended to script). Defaults to “python3 -u”.
server_expected_format (str) – What format to exchange the parameters between server and client.
source_model (str) – ID of the source model component. Defaults to “model”.
optimizer_args (dict) – Configuration for the server-side optimizer, with keys:
    - path: Fully qualified optimizer class (e.g., "torch.optim.SGD"). "class_path" is also accepted.
    - args: Dictionary of optimizer arguments (e.g., {"lr": 1.0, "momentum": 0.6}).
    - config_type: Optional; if omitted, it is set to "dict" so the config is not instantiated at load time.
lr_scheduler_args (dict) – Optional configuration for the learning rate scheduler, with keys:
    - path: Fully qualified scheduler class (e.g., "torch.optim.lr_scheduler.CosineAnnealingLR"). "class_path" is also accepted.
    - args: Dictionary of scheduler arguments (e.g., {"T_max": 100, "eta_min": 0.9}).
    - config_type: Optional; if omitted, it is set to "dict" so the config is not instantiated at load time.
device (str) – Device to use for server-side optimization, e.g. "cpu" or "cuda:0". Defaults to None, in which case CUDA is used if it is available.
server_memory_gc_rounds – Run memory cleanup (gc.collect + malloc_trim) every N rounds on server. Set to 0 to disable. Defaults to 1 (every round).
enable_tensor_disk_offload (bool) – Download streamed PyTorch tensors to disk on the server during FOBS deserialization instead of keeping all incoming client tensors in memory. Defaults to False.
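
The key handling described for optimizer_args and lr_scheduler_args can be summarized with a small normalization helper. This is a sketch of the documented behavior only: normalize_component_config is a hypothetical name, and renaming "class_path" to "path" and defaulting "args" to an empty dict are assumptions made for the example.

```python
def normalize_component_config(config):
    """Return a copy of a {path, args, config_type} config with defaults filled in."""
    if config is None:
        return None
    normalized = dict(config)  # do not mutate the caller's dict
    # Accept "class_path" as an alias for "path" (assumed to be renamed here).
    if "path" not in normalized and "class_path" in normalized:
        normalized["path"] = normalized.pop("class_path")
    if "path" not in normalized:
        raise ValueError("config needs a 'path' (or 'class_path') entry")
    normalized.setdefault("args", {})
    # Per the parameter description: an omitted config_type becomes "dict".
    normalized.setdefault("config_type", "dict")
    return normalized


optimizer_cfg = normalize_component_config(
    {"class_path": "torch.optim.SGD", "args": {"lr": 1.0, "momentum": 0.6}}
)
```

After normalization, optimizer_cfg carries "path", "args", and "config_type": "dict", which is the shape used in the example for this recipe.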

Example

```python
from nvflare.app_opt.pt.recipes import FedOptRecipe

# pretrained_model is an nn.Module instance defined elsewhere.
recipe = FedOptRecipe(
    name="my_fedopt_job",
    model=pretrained_model,
    min_clients=2,
    num_rounds=10,
    train_script="client.py",
    train_args="--epochs 5 --batch_size 32",
    device="cpu",
    source_model="model",
    optimizer_args={
        "path": "torch.optim.SGD",
        "args": {"lr": 1.0, "momentum": 0.6},
        "config_type": "dict",
    },
    lr_scheduler_args={
        "path": "torch.optim.lr_scheduler.CosineAnnealingLR",
        "args": {"T_max": "{num_rounds}", "eta_min": 0.9},
        "config_type": "dict",
    },
)
```

class ScaffoldRecipe(*, name: str = 'scaffold', model: Any | dict[str, Any] | None = None, initial_ckpt: str | None = None, min_clients: int, num_rounds: int = 2, train_script: str, train_args: str = '', launch_external_process: bool = False, command: str = 'python3 -u', server_expected_format: ExchangeFormat = ExchangeFormat.NUMPY, params_transfer_type: TransferType = TransferType.FULL, server_memory_gc_rounds: int = 0, client_memory_gc_rounds: int = 0, cuda_empty_cache: bool = False, enable_tensor_disk_offload: bool = False)

Bases: Recipe

A recipe for implementing Scaffold in NVFlare.

Recipe parameters, including train_args, become part of the generated job definition and must never contain actual secret values. Read secrets from site environment variables or mounted files; references are supported only where documented in nvflare.recipe.secrets.

Implements the training algorithm proposed in Karimireddy et al. “SCAFFOLD: Stochastic Controlled Averaging for Federated Learning” (https://arxiv.org/abs/1910.06378).

Client script requirement: Unlike FedAvgRecipe, the client script must use PTScaffoldHelper (nvflare.app_opt.pt.scaffold): call init(model), model_update() during training, terms_update() after training, and include meta[AlgorithmConstants.SCAFFOLD_CTRL_DIFF] = scaffold_helper.get_delta_controls() in the FLModel sent back to the server. A standard flare.receive/send loop without PTScaffoldHelper will cause server-side aggregation to fail.
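
For orientation, the arithmetic that a SCAFFOLD client performs can be written out on plain lists. The sketch follows the update rules from the cited paper (the control-variate refresh is the paper's "option II"); the function names are hypothetical, and PTScaffoldHelper's actual implementation may differ in detail.

```python
def scaffold_local_step(weights, grads, c_global, c_local, lr):
    """One corrected SGD step: y <- y - lr * (g - c_i + c)."""
    return [
        w - lr * (g - ci + c)
        for w, g, c, ci in zip(weights, grads, c_global, c_local)
    ]


def scaffold_control_update(global_weights, local_weights, c_global, c_local, num_steps, lr):
    """Refresh the client control variate: c_i+ = c_i - c + (x - y) / (K * lr).

    Returns (new client control variate, delta sent back to the server).
    """
    new_c_local = [
        ci - c + (x - y) / (num_steps * lr)
        for x, y, c, ci in zip(global_weights, local_weights, c_global, c_local)
    ]
    delta_c = [new - old for new, old in zip(new_c_local, c_local)]
    return new_c_local, delta_c


# Toy run: one weight, a constant gradient of 2.0, two local steps.
x = [1.0]               # global weights received from the server
c, c_i = [0.0], [0.0]   # server and client control variates
y = list(x)
for _ in range(2):
    y = scaffold_local_step(y, grads=[2.0], c_global=c, c_local=c_i, lr=0.1)
c_i_new, delta_c = scaffold_control_update(x, y, c, c_i, num_steps=2, lr=0.1)
```

In this toy run the refreshed control variate comes out close to the constant gradient of 2.0, which is the drift estimate the server needs; the delta is the quantity that corresponds to the SCAFFOLD_CTRL_DIFF entry in the returned FLModel metadata.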

This recipe sets up a complete federated learning workflow with Scaffold controller.

Parameters:

name – Name of the federated learning job. Defaults to “scaffold”.
model – Initial model to start federated training with. Can be:
    - nn.Module instance
    - Dict config: {"class_path": "module.ClassName", "args": {"param": value}}
    - None: no initial model
initial_ckpt – Absolute path to a pre-trained checkpoint file. The file may not exist locally as it could be on the server. Used to load initial weights. Note: PyTorch requires model when using initial_ckpt (for architecture).
min_clients – Minimum number of clients required to start a training round. Required; the signature defines no default.
num_rounds – Number of federated training rounds to execute. Defaults to 2.
train_script – Path to the training script that will be executed on each client. Required; the signature defines no default.
train_args – Command line arguments to pass to the training script. Defaults to “”.
server_memory_gc_rounds – Run memory cleanup (gc.collect + malloc_trim) every N rounds on server. Set to 0 to disable. Defaults to 0.
enable_tensor_disk_offload (bool) – Download streamed PyTorch tensors to disk on the server during receive/aggregation instead of holding them in memory, reducing server memory pressure for large models. Requires server_expected_format=ExchangeFormat.PYTORCH. Defaults to False.
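
The "every N rounds, 0 disables" rule for server_memory_gc_rounds can be expressed in a few lines. The sketch below assumes a zero-based round index and only calls gc.collect(); the malloc_trim part is platform specific and left out, and the function names are hypothetical.

```python
import gc


def should_run_gc(current_round, gc_rounds):
    """Return True when cleanup is due after the given zero-based round."""
    if gc_rounds <= 0:  # 0 disables periodic cleanup
        return False
    return (current_round + 1) % gc_rounds == 0


def maybe_cleanup(current_round, gc_rounds):
    """Run Python garbage collection when the schedule says so."""
    if should_run_gc(current_round, gc_rounds):
        gc.collect()
        return True
    return False


cleaned_rounds = [r for r in range(6) if maybe_cleanup(r, gc_rounds=2)]
```

With gc_rounds=2 and six rounds, cleanup runs after every second round; with gc_rounds=1 it runs after every round.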

Example

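```python
from nvflare.app_opt.pt.recipes import ScaffoldRecipe

# pretrained_model is an nn.Module instance defined elsewhere.
recipe = ScaffoldRecipe(
    name="my_scaffold_job",
    model=pretrained_model,
    min_clients=2,
    num_rounds=10,
    train_script="client.py",
    train_args="--epochs 5 --batch_size 32",
)
```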
