nvflare.app_common.ccwf.recipes.swarm module

class BaseSwarmLearningRecipe(name: str, server_config: SwarmServerConfig, client_config: SwarmClientConfig, cse_config: CrossSiteEvalConfig | None = None, job: CCWFJob | None = None)[source]

Bases: Recipe

Base recipe for Swarm Learning (framework-agnostic).

Server, client, and cross-site-evaluation config values become part of the generated job definition and must never contain actual secrets. Read secrets from site environment variables or mounted files; references are supported only where documented in nvflare.recipe.secrets.

Parameters:

name – Name of the federated learning job.
server_config – Swarm server configuration.
client_config – Swarm client configuration.
cse_config – Optional cross-site evaluation configuration.
job – Optional pre-created CCWFJob. If None, a new one is created. Subclasses may create the job early to add files before building configs.

This is the base class of a recipe. Recipes are implemented by jobs; a concrete recipe must provide the job that implements it.

Security contract – no secrets in recipe parameters:

Recipe parameters (train_args, task_args, eval_args, per_site_config, config overrides, dicts passed to add_client_config/add_server_config, exec params, etc.) can be written in clear text into generated job configuration. These parameters and their nested values must never contain actual passwords, API keys, tokens, private keys, or other credentials. Instead, read secrets from site environment variables or mounted secret files inside your code, or pass a placeholder created with nvflare.recipe.secrets.secret_ref() or nvflare.recipe.secrets.secret_file_ref() at a supported runtime boundary. See nvflare.recipe.secrets for the supported parameter locations.
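As a minimal sketch of the first option above (reading the secret inside your own code at the site), the training script can resolve a credential from an environment variable or a mounted file, while the recipe parameters carry only the non-sensitive variable name. The helper `load_secret`, the variable name `MY_SERVICE_API_KEY`, and the `api_key_env` key are hypothetical and are not part of NVFlare.

```python
import os
from pathlib import Path
from typing import Optional


def load_secret(env_var: str, file_path: Optional[str] = None) -> str:
    """Return a secret from an environment variable, else from a mounted file.

    Hypothetical helper for a training script; it is not an NVFlare API.
    """
    value = os.environ.get(env_var)
    if value:
        return value
    if file_path and Path(file_path).is_file():
        # Mounted secret files often end with a trailing newline.
        return Path(file_path).read_text(encoding="utf-8").strip()
    raise RuntimeError(f"secret not available: set {env_var} or mount a secret file")


# The recipe parameters name the variable; they never hold its value.
train_args = {"api_key_env": "MY_SERVICE_API_KEY"}
```

Inside the training script, `load_secret(train_args["api_key_env"])` would then be called at runtime on the site, so the generated job configuration contains only the variable name.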

Before export or run, recipes scan their parameters with heuristics and emit nvflare.recipe.secrets.PotentialSecretWarning when a value looks like an actual secret. The scan is best-effort: absence of a warning does not prove a parameter is safe to share.
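To illustrate why such a scan can only be best-effort, here is a toy heuristic that flags string values stored under suspicious-looking keys. It is an assumption-laden sketch, not NVFlare's implementation: the regular expression, the function names, and the `SketchSecretWarning` class are all hypothetical stand-ins.

```python
import re
import warnings
from typing import Any, Iterator, List, Tuple


class SketchSecretWarning(UserWarning):
    """Hypothetical stand-in; not nvflare.recipe.secrets.PotentialSecretWarning."""


# Key names that commonly hold credentials (illustrative, far from complete).
_SUSPICIOUS_KEY = re.compile(
    r"(password|passwd|secret|token|api[_-]?key|private[_-]?key)", re.IGNORECASE
)


def iter_leaves(value: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, leaf) pairs for every non-container value in a nested structure."""
    if isinstance(value, dict):
        for key, item in value.items():
            yield from iter_leaves(item, f"{path}.{key}" if path else str(key))
    elif isinstance(value, (list, tuple)):
        for index, item in enumerate(value):
            yield from iter_leaves(item, f"{path}[{index}]")
    else:
        yield path, value


def scan_params(params: Any) -> List[str]:
    """Warn about non-empty strings whose path contains a suspicious key name."""
    flagged = []
    for path, leaf in iter_leaves(params):
        if isinstance(leaf, str) and leaf and _SUSPICIOUS_KEY.search(path):
            flagged.append(path)
            warnings.warn(
                f"'{path}' looks like it may hold a secret",
                SketchSecretWarning,
                stacklevel=2,
            )
    return flagged
```

A credential stored under an innocuous key such as `note` passes this toy scan silently, which is the failure mode the best-effort caveat describes.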

Parameters:

job – The job that implements the recipe.

class SwarmLearningRecipe(name: str, model: Any | Dict[str, Any], num_rounds: int, train_script: str, min_clients: int, initial_ckpt: str | None = None, train_args: dict | None = None, do_cross_site_eval: bool = False, cross_site_eval_timeout: float = 300, launch_external_process: bool = False, command: str = 'python3 -u', memory_gc_rounds: int = 1, cuda_empty_cache: bool = False, expected_data_kind: str = DataKind.WEIGHTS, params_transfer_type: str = 'FULL', start_task_timeout: float = 300, progress_timeout: float = 3600, max_status_report_interval: float = 300, round_timeout: float = 3600, learn_task_timeout: float | None = None, max_concurrent_submissions: int = 1, learn_task_abort_timeout: float | None = None, learn_task_ack_timeout: float | None = None, final_result_ack_timeout: float | None = None, server_config_overrides: Dict[str, Any] | None = None, client_config_overrides: Dict[str, Any] | None = None, pipe_type: str = 'cell_pipe', pipe_root_path: str | None = None)[source]

Bases: BaseSwarmLearningRecipe

A simple recipe for Swarm Learning with PyTorch models.

Parameters:

name – Name of the federated learning job.
model – PyTorch model to use as the initial model. Can be:
- An nn.Module instance (e.g., MyModel())
- A dict config: {"class_path": "module.ClassName", "args": {"param": value}}
num_rounds – Number of training rounds.
train_script – Path to the training script.
min_clients – Minimum number of clients required.
initial_ckpt – Path to a pre-trained checkpoint file (.pt, .pth). Can be:
- Relative path: the file is bundled into the job's custom/ directory.
- Absolute path: treated as a server-side path and used as-is at runtime.
train_args – Additional arguments for the training script. The dictionary is stored in the job definition and must not contain actual secret values; see nvflare.recipe.secrets for safe runtime references.
do_cross_site_eval –
Whether to perform cross-site evaluation. When combined with launch_external_process=True, the trained model is loaded from the persistor on disk (saved by PTFileModelPersistor after each round). Two limitations apply in that combination:
1. Custom persistors: If your persistor streams models to a remote store without supporting local get(), the persistor path returns None and CSE falls back to the executor, which also fails for ext-process mode. Ensure your persistor’s get() can retrieve the model locally.
2. Cross-job evaluation: CSE against a model trained in a different job is not supported with launch_external_process=True because the current job’s persistor cannot locate the other job’s workspace. Use in-process mode or copy the trained model into the evaluating job’s workspace.
cross_site_eval_timeout – Timeout for cross-site evaluation.
launch_external_process – Whether to launch the training script in an external process. Defaults to False (in-process execution).
command – Shell command used to launch the script when launch_external_process=True. Defaults to "python3 -u".
memory_gc_rounds – Run gc.collect() + malloc_trim every N FL rounds on both the trainer and aggregator roles. Defaults to 1 (every round) to match legacy behavior where gc.collect() was called unconditionally after each trainer submission. Set to 0 to disable.
cuda_empty_cache – Call torch.cuda.empty_cache() during cleanup. Defaults to False.
expected_data_kind – The data kind the aggregator expects from clients. Defaults to DataKind.WEIGHTS for full-weight FedAvg. Clients returning differences must label their result with FLModel.params_type=ParamsType.DIFF.
params_transfer_type – How parameters are transferred between client script and NVFlare. DIFF enables automatic difference calculation for full-model client results. A client’s FLModel.params_type remains authoritative. Defaults to FULL.
start_task_timeout – Seconds to wait for the starting client to acknowledge the start task. Increase for large models that need time to load. Defaults to 300.
progress_timeout – Seconds of no progress from any client before the workflow is considered stalled. Defaults to 3600.
max_status_report_interval – Maximum seconds between consecutive status reports from a client before it is considered silent. Defaults to 300.
round_timeout – P2P model transfer ACK budget in seconds — how long the aggregator waits for a receiver to acknowledge the model download via tensor streaming. The “ACK” includes the full model download, so the hardcoded default of 10s in SwarmClientConfig is too short for models larger than ~2GB. Set higher for large models (7B+) where P2P transfer can take minutes. Does NOT cap per-round training time (learn_task_timeout remains unbounded by default). Defaults to 3600 (matching progress_timeout).
learn_task_timeout – Maximum seconds allowed for a learning task. None means no limit. Defaults to None.
max_concurrent_submissions – Maximum number of concurrent result submissions accepted by the aggregation client. Must be at least 1. Defaults to 1.
learn_task_abort_timeout – Seconds to wait for a learning task to stop after an abort request. Must be positive when specified. None uses the Swarm client controller default.
learn_task_ack_timeout – Seconds to wait for acknowledgment when dispatching a learning task. None uses round_timeout for backward compatibility.
final_result_ack_timeout – Seconds to wait for clients to acknowledge the final result. None uses round_timeout for backward compatibility.
server_config_overrides – Advanced shallow overrides for SwarmServerConfig. Values here take precedence over named constructor parameters, except min_clients, which must be set through the named parameter to keep scheduler, server-controller, and client aggregation quorums aligned. This dictionary is stored in the job definition and must not contain secrets.
client_config_overrides – Advanced shallow overrides for SwarmClientConfig. Values here take precedence over named constructor parameters. Recipe-managed fields (executor, aggregator, persistor, shareable generator, and min_responses_required) cannot be replaced through this dictionary; use BaseSwarmLearningRecipe for custom components or quorum settings. This dictionary is stored in the job definition and must not contain secrets.
pipe_type –
Pipe used for communication between the NVFlare client process and the external training process when launch_external_process=True. Accepted values:
- "cell_pipe" (default): CellPipe with zero-copy tensor forwarding — the NVFlare client process relays model tensors without loading them into memory (~1 GB RAM for large models).
- "file_pipe": FilePipe backed by a shared directory. The NVFlare client process fully loads and re-serializes the model (~2× model size in RAM). Use when cell networking is unavailable or for third-party integrations that cannot resolve NVFlare cell addresses.
Ignored when launch_external_process=False.
pipe_root_path – Base directory for FilePipe when pipe_type="file_pipe". None (default) uses {WORKSPACE}/{JOB_ID}/{SITE_NAME}, matching the sag_cse_ccwf_pt reference template. If provided, the path must be an absolute path (e.g. "/dev/shm/nvflare_pipes" for a RAM-backed tmpfs); the directory is treated as a runtime path and does not need to exist on the machine that builds or exports the job. {JOB_ID}/{SITE_NAME} is always appended so concurrent jobs and sites remain isolated. Ignored for "cell_pipe".
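The fallback and validation rules documented above for the timeout parameters can be sketched in plain Python. This mirrors only the behavior described in the parameter list (ACK timeouts default to round_timeout, max_concurrent_submissions must be at least 1, learn_task_abort_timeout must be positive when given); the names `EffectiveTimeouts` and `resolve_timeouts` are hypothetical, and the real recipe may resolve these values differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EffectiveTimeouts:
    """Hypothetical container for the resolved values; not an NVFlare type."""

    learn_task_ack_timeout: float
    final_result_ack_timeout: float
    learn_task_timeout: Optional[float]


def resolve_timeouts(
    round_timeout: float = 3600,
    learn_task_timeout: Optional[float] = None,
    learn_task_ack_timeout: Optional[float] = None,
    final_result_ack_timeout: Optional[float] = None,
    learn_task_abort_timeout: Optional[float] = None,
    max_concurrent_submissions: int = 1,
) -> EffectiveTimeouts:
    # Documented constraints on the two validated parameters.
    if max_concurrent_submissions < 1:
        raise ValueError("max_concurrent_submissions must be at least 1")
    if learn_task_abort_timeout is not None and learn_task_abort_timeout <= 0:
        raise ValueError("learn_task_abort_timeout must be positive when specified")

    def or_round(value: Optional[float]) -> float:
        # None falls back to round_timeout, as documented for both ACK timeouts.
        return round_timeout if value is None else value

    return EffectiveTimeouts(
        learn_task_ack_timeout=or_round(learn_task_ack_timeout),
        final_result_ack_timeout=or_round(final_result_ack_timeout),
        # None means no limit on the learning task itself.
        learn_task_timeout=learn_task_timeout,
    )
```

Under these rules, raising round_timeout for a large model also raises both ACK budgets unless they are set explicitly, while the learning task itself stays unbounded until learn_task_timeout is given.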

Example

Using nn.Module instance:

```python
from nvflare.app_common.ccwf.recipes.swarm import SwarmLearningRecipe

# MyModel is a user-defined nn.Module subclass.
recipe = SwarmLearningRecipe(
    name="swarm_job",
    model=MyModel(),
    min_clients=3,
    num_rounds=5,
    train_script="train.py",
)
```

Using dict config:

```python
recipe = SwarmLearningRecipe(
    name="swarm_job",
    model={"class_path": "my_module.MyModel", "args": {"num_classes": 10}},
    min_clients=3,
    num_rounds=5,
    train_script="train.py",
)
```
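For readers unfamiliar with the dict form of model, the sketch below shows how a {"class_path": ..., "args": ...} config can, in general, be turned into an object with importlib. This is an illustration of the convention only and is not how NVFlare necessarily resolves the config; `instantiate_from_config` is a hypothetical helper, and a standard-library class stands in for my_module.MyModel so the example runs anywhere.

```python
import importlib
from typing import Any, Dict


def instantiate_from_config(config: Dict[str, Any]) -> Any:
    """Import the class named by config["class_path"] and call it with config["args"]."""
    module_name, _, class_name = config["class_path"].rpartition(".")
    if not module_name:
        raise ValueError("class_path must look like 'module.ClassName'")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**config.get("args", {}))


# fractions.Fraction stands in for a model class such as my_module.MyModel.
obj = instantiate_from_config(
    {"class_path": "fractions.Fraction", "args": {"numerator": 3, "denominator": 4}}
)
```

One practical consequence of the dict form is that the class must be importable wherever the config is resolved, whereas the nn.Module form requires the class to be importable where the recipe is built.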
