nvflare.app_common.ccwf.recipes.swarm module
- class BaseSwarmLearningRecipe(name: str, server_config: SwarmServerConfig, client_config: SwarmClientConfig, cse_config: CrossSiteEvalConfig | None = None, job: CCWFJob | None = None)[source]
Bases: Recipe
Base recipe for Swarm Learning (framework-agnostic).
- Parameters:
name – Name of the federated learning job.
server_config – Swarm server configuration.
client_config – Swarm client configuration.
cse_config – Optional cross-site evaluation configuration.
job – Optional pre-created CCWFJob. If None, a new one is created. Subclasses may create the job early to add files before building configs.
This is the base class of a recipe. Recipes are implemented by jobs, so a concrete recipe must provide the job that implements it.
- Parameters:
job – the job that implements the recipe.
- class SwarmLearningRecipe(name: str, model: Any | Dict[str, Any], num_rounds: int, train_script: str, min_clients: int, initial_ckpt: str | None = None, train_args: dict | None = None, do_cross_site_eval: bool = False, cross_site_eval_timeout: float = 300, launch_external_process: bool = False, command: str = 'python3 -u', memory_gc_rounds: int = 1, cuda_empty_cache: bool = False, expected_data_kind: str = DataKind.WEIGHTS, params_transfer_type: str = 'FULL', start_task_timeout: float = 300, progress_timeout: float = 3600, max_status_report_interval: float = 300, round_timeout: float = 3600, pipe_type: str = 'cell_pipe', pipe_root_path: str | None = None)[source]
Bases: BaseSwarmLearningRecipe
A simple recipe for Swarm Learning with PyTorch models.
- Parameters:
name – Name of the federated learning job.
model – PyTorch model to use as the initial model. Can be:
- An nn.Module instance (e.g., MyModel())
- A dict config: {"class_path": "module.ClassName", "args": {"param": value}}
num_rounds – Number of training rounds.
train_script – Path to the training script.
min_clients – Minimum number of clients required.
initial_ckpt – Path to a pre-trained checkpoint file (.pt, .pth). Can be:
- Relative path: the file is bundled into the job's custom/ directory.
- Absolute path: treated as a server-side path and used as-is at runtime.
train_args – Additional arguments for the training script.
do_cross_site_eval – Whether to perform cross-site evaluation. When combined with launch_external_process=True, the trained model is loaded from the persistor on disk (saved by PTFileModelPersistor after each round). Two limitations apply in that combination:
- Custom persistors: If your persistor streams models to a remote store without supporting local get(), the persistor path returns None and CSE falls back to the executor, which also fails for ext-process mode. Ensure your persistor's get() can retrieve the model locally.
- Cross-job evaluation: CSE against a model trained in a different job is not supported with launch_external_process=True because the current job's persistor cannot locate the other job's workspace. Use in-process mode or copy the trained model into the evaluating job's workspace.
cross_site_eval_timeout – Timeout for cross-site evaluation. Defaults to 300.
launch_external_process – Whether to launch the training script in an external process. Defaults to False (in-process execution).
command – Shell command used to launch the script when launch_external_process=True. Defaults to "python3 -u".
memory_gc_rounds – Run gc.collect() + malloc_trim every N FL rounds on both the trainer and aggregator roles. Defaults to 1 (every round) to match legacy behavior where gc.collect() was called unconditionally after each trainer submission. Set to 0 to disable.
cuda_empty_cache – Call torch.cuda.empty_cache() during cleanup. Defaults to False.
expected_data_kind – The data kind the aggregator expects from clients. Defaults to DataKind.WEIGHTS for full-weight FedAvg. Use DataKind.WEIGHT_DIFF when clients send parameter deltas (e.g. LoRA adapter diffs with params_transfer_type=DIFF).
params_transfer_type – How parameters are transferred between client script and NVFlare. FULL sends the entire parameter state each round; DIFF sends only the delta. Defaults to FULL. Must match the ParamsType used in the training script.
start_task_timeout – Seconds to wait for the starting client to acknowledge the start task. Increase for large models that need time to load. Defaults to 300.
progress_timeout – Seconds of no progress from any client before the workflow is considered stalled. Defaults to 3600.
max_status_report_interval – Maximum seconds between consecutive status reports from a client before it is considered silent. Defaults to 300.
round_timeout – P2P model transfer ACK budget in seconds — how long the aggregator waits for a receiver to acknowledge the model download via tensor streaming. The “ACK” includes the full model download, so the hardcoded default of 10s in SwarmClientConfig is too short for models larger than ~2GB. Set higher for large models (7B+) where P2P transfer can take minutes. Does NOT cap per-round training time (learn_task_timeout remains unbounded by default). Defaults to 3600 (matching progress_timeout).
pipe_type – Pipe used for communication between the NVFlare client process and the external training process when launch_external_process=True. Accepted values:
- "cell_pipe" (default): CellPipe with zero-copy tensor forwarding. The NVFlare client process relays model tensors without loading them into memory (~1 GB RAM for large models).
- "file_pipe": FilePipe backed by a shared directory. The NVFlare client process fully loads and re-serializes the model (~2× model size in RAM). Use when cell networking is unavailable or for third-party integrations that cannot resolve NVFlare cell addresses.
Ignored when launch_external_process=False.
pipe_root_path – Base directory for FilePipe when pipe_type="file_pipe". None (default) uses {WORKSPACE}/{JOB_ID}/{SITE_NAME}, matching the sag_cse_ccwf_pt reference template. If provided, the path must be an absolute path (e.g. "/dev/shm/nvflare_pipes" for a RAM-backed tmpfs); the directory is treated as a runtime path and does not need to exist on the machine that builds or exports the job. {JOB_ID}/{SITE_NAME} is always appended so concurrent jobs and sites remain isolated. Ignored for "cell_pipe".
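The pipe_root_path rules described above can be sketched in plain Python. The helper name resolve_pipe_root and its arguments are hypothetical and chosen only for illustration; this is a reading of the documented behavior, not NVFlare's actual implementation.

```python
import posixpath
from typing import Optional


def resolve_pipe_root(
    pipe_root_path: Optional[str],
    workspace: str,
    job_id: str,
    site_name: str,
) -> str:
    """Hypothetical helper mirroring the documented pipe_root_path rules."""
    if pipe_root_path is None:
        # Default: fall back to the job workspace.
        base = workspace
    else:
        # A user-supplied base directory must be absolute.
        if not posixpath.isabs(pipe_root_path):
            raise ValueError(f"pipe_root_path must be absolute, got {pipe_root_path!r}")
        base = pipe_root_path
    # {JOB_ID}/{SITE_NAME} is always appended so jobs and sites stay isolated.
    return posixpath.join(base, job_id, site_name)


default_root = resolve_pipe_root(None, "/ws", "job-1", "site-1")
tmpfs_root = resolve_pipe_root("/dev/shm/nvflare_pipes", "/ws", "job-1", "site-1")
```

In this sketch the path is only computed, never created, which matches the note that the directory does not need to exist on the machine that builds or exports the job.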
Example
Using nn.Module instance:
```python
from nvflare.app_common.ccwf.recipes.swarm import SwarmLearningRecipe

recipe = SwarmLearningRecipe(
    name="swarm_job",
    model=MyModel(),  # MyModel is a user-defined nn.Module
    min_clients=3,
    num_rounds=5,
    train_script="train.py",
)
```
Using dict config:
```python
recipe = SwarmLearningRecipe(
    name="swarm_job",
    model={"class_path": "my_module.MyModel", "args": {"num_classes": 10}},
    min_clients=3,
    num_rounds=5,
    train_script="train.py",
)
```
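The difference between the FULL and DIFF settings of params_transfer_type (and the matching DataKind.WEIGHTS / DataKind.WEIGHT_DIFF values of expected_data_kind) can be illustrated without NVFlare at all. The sketch below is an assumption-laden toy: plain floats stand in for tensors, the function names are hypothetical, and an unweighted mean is used for simplicity rather than whatever weighting the real aggregator applies.

```python
from typing import Dict, List

Params = Dict[str, float]  # plain floats stand in for tensors in this toy


def to_diff(global_params: Params, local_params: Params) -> Params:
    """DIFF-style payload: only the delta from the current global model."""
    return {k: local_params[k] - global_params[k] for k in global_params}


def mean_params(updates: List[Params]) -> Params:
    """Unweighted mean over client payloads (a simplifying assumption)."""
    n = len(updates)
    return {k: sum(u[k] for u in updates) / n for k in updates[0]}


def aggregate_full(updates: List[Params]) -> Params:
    """WEIGHTS-style aggregation: the mean of full weights is the new model."""
    return mean_params(updates)


def aggregate_diff(global_params: Params, diffs: List[Params]) -> Params:
    """WEIGHT_DIFF-style aggregation: apply the mean delta to the global model."""
    mean_delta = mean_params(diffs)
    return {k: global_params[k] + mean_delta[k] for k in global_params}


global_model = {"w": 1.0, "b": 0.0}
client_models = [{"w": 2.0, "b": 1.0}, {"w": 4.0, "b": 3.0}]

new_from_full = aggregate_full(client_models)
client_diffs = [to_diff(global_model, m) for m in client_models]
new_from_diff = aggregate_diff(global_model, client_diffs)
```

Both paths arrive at the same new model in this toy, but only if the aggregator interprets the payload the way the client produced it. Averaging deltas as if they were full weights would give a wrong model, which is why the two settings need to agree.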