nvflare.edge.tools.edge_fed_buff_recipe module

class DeviceManagerConfig(device_selection_size: int = 100, min_hole_to_fill: int = 1, device_reuse: bool = True)[source]

Bases: object

Configuration class for the device manager in federated learning.

This class configures how the device manager selects and manages devices for participation in a federated learning workflow.

device_selection_size: Number of devices to select for each training round. Default: 100

min_hole_to_fill: Minimum number of model updates to wait for before sampling the next batch of devices and dispatching the current global model. - If set to 1, the server immediately dispatch the current global model to a sampled device. - Higher values cause the server to wait for more updates before dispatching. - If set to device_selection_size, we will have synchronous training since all devices’ responses need to be collected before dispatching the next global model. This parameter works with num_updates_for_model from model manager to achieve trade-off between global model versioning and local execution. Default: 1 (immediately dispatch the current global model)

device_reuse: Whether to allow devices to participate in multiple rounds. if False, devices will be selected only once, which could be realistic for real-world scenarios where the device pool is huge while participation is random. Default: True (always reuse / include the existing devices for further learning)

class EdgeFedBuffRecipe(job_name: str, model: Any | Dict, model_manager_config: ModelManagerConfig, device_manager_config: DeviceManagerConfig, initial_ckpt: str | None = None, evaluator_config: EvaluatorConfig | None = None, simulation_config: SimulationConfig | None = None, custom_source_root: str | None = None, device_wait_timeout: float | None = None)[source]

Bases: Recipe

Recipe class for cross-edge federated learning using NVFlare’s hierarchical edge system.

Recipe parameters and nested manager configuration values become part of the generated job definition and must never contain actual secrets. Read secrets from site environment variables or mounted files; references are supported only where documented in nvflare.recipe.secrets.

This class provides a high-level interface for configuring cross-edge federated learning jobs. It configures the necessary components including model managers, device managers, evaluators, and device simulation settings.

The recipe supports both real device connections and simulated device training, making it suitable for both production deployment and prototyping/testing.

Example usage:

```python
from nvflare.edge.tools.edge_fed_buff_recipe import (
    DeviceManagerConfig,
    EdgeFedBuffRecipe,
    ModelManagerConfig,
)

# Basic configuration with model instance (MyModel is a user-defined nn.Module)
recipe = EdgeFedBuffRecipe(
    job_name="my_edge_job",
    model=MyModel(),
    model_manager_config=ModelManagerConfig(...),
    device_manager_config=DeviceManagerConfig(...),
)

# With dict config and pre-trained checkpoint
recipe = EdgeFedBuffRecipe(
    job_name="my_edge_job",
    model={"class_path": "my_module.MyModel", "args": {"num_classes": 10}},
    initial_ckpt="/path/to/pretrained.pt",
    model_manager_config=ModelManagerConfig(...),
    device_manager_config=DeviceManagerConfig(...),
)
```

Parameters:

job_name – Name of the federated learning job.
model – PyTorch model to be trained. Can be:
- nn.Module instance: e.g., MyModel()
- Dict config: {"class_path": "module.ClassName", "args": {"param": value}}
initial_ckpt – Absolute path to a pre-trained checkpoint file (.pt, .pth). Because this is a server-side path, the file does not need to exist on the local machine. Used to resume training from pre-trained weights.
model_manager_config – Configuration for the model manager.
device_manager_config – Configuration for the device manager.
evaluator_config – Configuration for the global evaluator (optional).
simulation_config – Configuration for simulated device settings (optional).
custom_source_root – Path to custom source code (optional).
device_wait_timeout – Timeout in seconds for waiting for sufficient devices to join before stopping the job. None means wait indefinitely. WARNING: when device_reuse=False with a finite device pool, leaving this as None can cause the job to hang indefinitely once the pool is exhausted. In that case, set an explicit timeout (e.g., 300.0 seconds). Default: None
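The warning on device_wait_timeout can be turned into a pre-flight check that runs before a recipe is built. The helper below is a hypothetical convenience for user code, not something the recipe is documented to provide.

```python
from typing import Optional


def timeout_risk(device_reuse: bool, device_wait_timeout: Optional[float]) -> Optional[str]:
    """Return a warning message for the risky combination, or None if it looks fine."""
    if not device_reuse and device_wait_timeout is None:
        return (
            "device_reuse=False with device_wait_timeout=None can hang "
            "once the device pool is exhausted; set an explicit timeout"
        )
    return None


risky = timeout_risk(device_reuse=False, device_wait_timeout=None)
bounded = timeout_risk(device_reuse=False, device_wait_timeout=300.0)
reusing = timeout_risk(device_reuse=True, device_wait_timeout=None)
```

Only the first combination yields a message; adding a finite timeout or keeping device_reuse=True clears it.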

This is the base class of a recipe. Recipes are implemented by jobs. A concrete recipe must provide the job that implements it.

Security contract – no secrets in recipe parameters:

Recipe parameters (train_args, task_args, eval_args, per_site_config, config overrides, dicts passed to add_client_config/add_server_config, exec params, etc.) can be written in clear text into generated job configuration. These parameters and their nested values must never contain actual passwords, API keys, tokens, private keys, or other credentials. Instead, read secrets from site environment variables or mounted secret files inside your code, or pass a placeholder created with nvflare.recipe.secrets.secret_ref() or nvflare.recipe.secrets.secret_file_ref() at a supported runtime boundary. See nvflare.recipe.secrets for the supported parameter locations.

Before export or run, recipes scan their parameters with heuristics and emit nvflare.recipe.secrets.PotentialSecretWarning when a value looks like an actual secret. The scan is best-effort: absence of a warning does not prove a parameter is safe to share.
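As a sketch of the first recommended option (reading secrets inside your own code at run time rather than passing them as recipe parameters), the helper below looks up an environment variable and falls back to a mounted file. The function name, the variable name EXAMPLE_SERVICE_TOKEN, and the fallback order are assumptions for illustration; the placeholder helpers in nvflare.recipe.secrets are not used here because their signatures are not shown in this page.

```python
import os
from pathlib import Path
from typing import Optional


def load_secret(env_var: str, file_path: Optional[str] = None) -> str:
    """Read a secret at run time from an environment variable or a mounted file."""
    value = os.environ.get(env_var)
    if value:
        return value
    if file_path is not None and Path(file_path).is_file():
        # Mounted secret files often end with a trailing newline.
        return Path(file_path).read_text(encoding="utf-8").strip()
    raise RuntimeError(f"secret not found: set {env_var} or mount a secret file")


# Demo only: in a real deployment the site sets this variable, not the code.
os.environ["EXAMPLE_SERVICE_TOKEN"] = "demo-value"
token = load_secret("EXAMPLE_SERVICE_TOKEN")
```

Because the value is resolved inside the running code, it never appears in the generated job configuration.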

Parameters: job – the job that implements the recipe.

create_job() → EdgeJob[source]

Create a new EdgeJob instance for cross-edge federated learning.

Returns: A configured edge job instance
Return type: EdgeJob

process_env(env: ExecEnv)[source]

Process environment-specific configuration.

Subclasses can override to add environment-specific processing. Script validation is handled by each ExecEnv subclass in deploy().

class EvaluatorConfig(eval_frequency: int = 1, torchvision_dataset: Dict | None = None, custom_dataset: Dict | None = None)[source]

Bases: object

Configuration class for the global evaluator.

This class configures how the global model is evaluated during training, including dataset selection and evaluation frequency.

eval_frequency: Frequency of global model evaluation (every N new model versions). Default: 1

torchvision_dataset: Configuration for torchvision datasets. Should be a dict with ‘name’ and ‘path’ keys. Default: None

custom_dataset: Configuration for custom datasets. Default: None

class ModelManagerConfig(max_num_active_model_versions: int = 3, max_model_version: int = 20, update_timeout: int = 5, num_updates_for_model: int = 100, max_model_history: int = 10, staleness_weight: bool = False, global_lr: float = 0.01)[source]

Bases: object

Configuration class for the model manager in federated learning.

This class configures how the model manager handles model updates, versioning, and aggregation strategies for a federated learning workflow.

max_num_active_model_versions: Maximum number of active model versions that can be processed for the current model version. Default: 3

max_model_version: Maximum model version number before stopping training. We start with version 1 initial model, so the minimum for this arg is 2 to have at least one local update phase. Default: 20

update_timeout: Timeout in seconds for waiting for model updates. Default: 5.0

num_updates_for_model: Number of received updates required before generating a new global model. Default: 100

max_model_history: Maximum number of model versions to keep in history for staleness calculations and update aggregation. Default: 10

staleness_weight: Whether to apply staleness weighting to model updates. Default: False

global_lr: Global learning rate for model aggregation. Default: 0.01

class SimulationConfig(task_processor: DeviceTaskProcessor | None, job_timeout: float = 60.0, num_devices: int = 1000, num_workers: int = 10)[source]

Bases: object

Configuration class for simulation settings in federated learning.

This class configures the simulated devices for testing federated learning pipelines.

task_processor: Task processor for handling device training simulation.

job_timeout: Timeout in seconds for the entire job execution. Default: 60.0

num_devices: Total number of simulated devices for each leaf node. Default: 1000

num_workers: Number of worker processes for parallel device simulation on each leaf node. Default: 10