nvflare.recipe.utils module

add_cross_site_evaluation(recipe: Recipe, submit_model_timeout: int = 600, validation_timeout: int = 6000, participating_clients: List[str] | None = None)[source]

Add cross-site evaluation to an existing recipe.

This utility automatically configures cross-site evaluation by:

- Auto-detecting the framework from the recipe
- Adding the appropriate model locator
- Adding the CrossSiteModelEval controller
- Adding ValidationJsonGenerator for results
- Auto-adding the appropriate validator to clients (for NumPy recipes)

For standalone CSE without training, use NumpyCrossSiteEvalRecipe instead.

Note: This utility is designed for adding CSE to training recipes. Standalone CSE recipes such as NumpyCrossSiteEvalRecipe already configure their CSE workflow; calling this utility on them raises RuntimeError through the idempotency check.

WARNING: Do not call this function more than once on the same recipe instance. An idempotency check raises RuntimeError on a second call to prevent duplicate component registration.

IMPORTANT for PyTorch: Your client training script must handle validation tasks by checking flare.is_evaluate() and returning metrics without training. Example pattern:

```python
# In your client script:
while flare.is_running():
    input_model = flare.receive()
    model.load_state_dict(input_model.params)

    # Evaluate model (always required)
    metrics = evaluate(model, test_loader)

    # Handle CSE validation task
    if flare.is_evaluate():
        output_model = flare.FLModel(metrics=metrics)
        flare.send(output_model)
        continue  # Skip training for validation-only tasks

    # Normal training code here…
```

Example (NumPy - fully automatic):

```python
from nvflare.app_common.np.recipes import NumpyFedAvgRecipe
from nvflare.recipe.utils import add_cross_site_evaluation

recipe = NumpyFedAvgRecipe(
    name="my-job",
    model=[1.0, 2.0, 3.0],
    min_clients=2,
    num_rounds=3,
    train_script="client.py",
)

# That's it! Framework auto-detected, validator auto-added
add_cross_site_evaluation(recipe)
```

Example (PyTorch - requires client script support):

```python
from nvflare.app_opt.pt.recipes import FedAvgRecipe
from nvflare.recipe.utils import add_cross_site_evaluation

recipe = FedAvgRecipe(
    name="my-job",
    min_clients=2,
    num_rounds=3,
    model=MyModel(),
    train_script="client.py",
)

# Note: client.py must handle flare.is_evaluate() for validation
add_cross_site_evaluation(recipe)
```

Example (TensorFlow - Client API pattern, recommended):

```python
from nvflare.app_opt.tf.recipes import FedAvgRecipe
from nvflare.recipe.utils import add_cross_site_evaluation

recipe = FedAvgRecipe(
    name="my-job",
    min_clients=2,
    num_rounds=3,
    model=MyTFModel(),
    train_script="client.py",
)

# Note: client.py must handle flare.is_evaluate() for validation
add_cross_site_evaluation(recipe)
```

TensorFlow component-based validators are executors, not plain components. Use the lower-level Job API when explicit TFValidator placement is required; Recipe-based jobs should use the Client API pattern above.

Parameters:

recipe – Recipe instance to augment with cross-site evaluation.
submit_model_timeout – Timeout (seconds) for submitting models to clients. Defaults to 600.
validation_timeout – Timeout (seconds) for validation tasks on clients. Defaults to 6000.
participating_clients – Optional list of client names to include in cross-site evaluation. If not provided, all clients connected at controller start are used.

Raises:

ValueError – If the recipe doesn’t have a framework attribute or uses an unsupported framework.
RuntimeError – If cross-site evaluation has already been added to this recipe.

Note

Currently supports PyTorch, NumPy, and TensorFlow frameworks.
NumPy recipes using `NumpyFedAvgRecipe`: Validators (NPValidator) are automatically added to clients to handle validation tasks. The idempotency check prevents duplicate CSE augmentation and validator registration.
Unified `FedAvgRecipe` with `framework=FrameworkType.NUMPY`: Uses the same Client API validation pattern as PyTorch and TensorFlow. Your client script should handle flare.is_evaluate() and return metrics for validation tasks.
PyTorch recipes: No separate validator component is needed. The client training script handles validation tasks through the Client API’s flare.is_evaluate() check. See the hello-pt example for implementation pattern.
TensorFlow recipes: Similar to PyTorch, uses the Client API pattern. The client script should handle validation tasks via flare.is_evaluate() check.
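The guard against repeated calls described in the warning above can be pictured with a small stand-alone sketch. It does not import nvflare: `DemoRecipe`, `add_cse_once`, and the `_cse_added` marker attribute are hypothetical names chosen for illustration, and the real utility may record its state differently.

```python
class DemoRecipe:
    """Hypothetical stand-in for a Recipe; not an nvflare class."""


def add_cse_once(recipe):
    """Mimic the documented behavior: the second call on a recipe raises."""
    # Hypothetical marker attribute; the real utility may track this differently.
    if getattr(recipe, "_cse_added", False):
        raise RuntimeError("cross-site evaluation has already been added to this recipe")
    recipe._cse_added = True


recipe = DemoRecipe()
add_cse_once(recipe)  # first call configures CSE

try:
    add_cse_once(recipe)  # second call on the same instance is rejected
    second_call_error = None
except RuntimeError as exc:
    second_call_error = str(exc)
```

In a job script, the practical consequence is to call add_cross_site_evaluation exactly once per recipe instance, for example right after the recipe is constructed.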

add_experiment_tracking(recipe: Recipe, tracking_type: str, tracking_config: dict | None = None, client_side: bool = False, server_side: bool = True, clients: List[str] | None = None)[source]

Add experiment tracking to a recipe.

Adds tracking receivers to the server and/or clients to collect and log metrics during training.

Parameters:

recipe – Recipe instance to augment with experiment tracking.
tracking_type – Type of tracking to enable (“mlflow”, “tensorboard”, or “wandb”).
tracking_config – Optional configuration dict for the tracking receiver. For MLflow, omitting this uses a local file store and derives experiment_name and run_name from the recipe name. The configuration becomes part of the generated job definition and must never contain actual credentials; configure authentication through the executing site’s environment or a mounted secret instead.
client_side – If True, add tracking to clients (each client tracks locally).
server_side – If True, add tracking to server (aggregates metrics from all clients). Default: True.
clients – Optional list of client names for client-side tracking. If None, the client-side receiver is added to all clients. Only valid with client_side=True. To give sites different receiver configs (e.g. per-site tracking_uri), call this function once per site with that site’s tracking_config and clients=[site]. Targeting specific clients requires the recipe’s client apps to be per-site (call set_per_site_config immediately after constructing a supported recipe), and each name must match an existing per-site client app; with the default all-clients topology or unknown site names, targeted placement raises ValueError.

Examples

    # Server-side MLflow tracking with local storage and recipe-derived names
    add_experiment_tracking(recipe, "mlflow")

    # Client-side tracking only (each client tracks independently)
    add_experiment_tracking(recipe, "mlflow", client_side=True, server_side=False)

    # Both server and client tracking
    add_experiment_tracking(recipe, "mlflow", {…}, client_side=True, server_side=True)

    # Per-site client tracking configs (one call per site)
    add_experiment_tracking(
        recipe,
        "mlflow",
        {"tracking_uri": "file:///tmp/site-1/mlruns"},
        client_side=True,
        server_side=False,
        clients=["site-1"],
    )
    add_experiment_tracking(
        recipe,
        "mlflow",
        {"tracking_uri": "file:///tmp/site-2/mlruns"},
        client_side=True,
        server_side=False,
        clients=["site-2"],
    )
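When many sites need their own receiver configuration, the one-call-per-site pattern can be driven from a loop. The sketch below only builds the keyword arguments and does not import nvflare; `per_site_tracking_calls` and `base_dir` are hypothetical names, and the keyword names are taken from the signature documented above.

```python
from typing import Any, Dict, List


def per_site_tracking_calls(sites: List[str], base_dir: str) -> List[Dict[str, Any]]:
    """Build one set of add_experiment_tracking keyword arguments per site."""
    calls = []
    for site in sites:
        calls.append(
            {
                "tracking_type": "mlflow",
                # Each site gets its own local MLflow file store.
                "tracking_config": {"tracking_uri": f"file://{base_dir}/{site}/mlruns"},
                "client_side": True,
                "server_side": False,
                "clients": [site],
            }
        )
    return calls


calls = per_site_tracking_calls(["site-1", "site-2"], "/tmp")

# In a real job script, each entry would be applied to the recipe:
# for kwargs in calls:
#     add_experiment_tracking(recipe, **kwargs)
```

As noted for the clients parameter, targeted placement of this kind requires per-site client apps on the recipe.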

add_final_global_evaluation(recipe: Recipe, participating_clients: List[str] | None = None, validation_timeout: int = 6000) → None[source]

Evaluate a PyTorch recipe’s final global model on selected clients.

Unlike full cross-site evaluation, this helper does not ask clients to submit their local models. It locates the recipe’s persisted global model and sends only that model for validation after training.

Parameters:

recipe – PyTorch recipe to augment with final global model evaluation.
participating_clients – Optional client names to run validation. If not provided, all clients connected when the controller starts are used.
validation_timeout – Timeout in seconds for validation tasks. Defaults to 6000, matching CrossSiteModelEval’s existing default.

Raises:

TypeError – If participating_clients is not a list of strings.
ValueError – If participating_clients is empty, the recipe is not PyTorch, or the recipe has no model persistor.
RuntimeError – If a cross-site evaluation workflow is already configured.
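The argument checks listed under Raises for participating_clients can be sketched as follows. This is an illustration of the documented error conditions only, not the library's code; `check_participating_clients` is a hypothetical name, and the order in which the real function performs its checks is an assumption.

```python
from typing import List, Optional


def check_participating_clients(participating_clients: Optional[List[str]]) -> None:
    """Raise the documented error types for invalid client lists."""
    if participating_clients is None:
        return  # None means: use all clients connected when the controller starts
    is_str_list = isinstance(participating_clients, list) and all(
        isinstance(name, str) for name in participating_clients
    )
    if not is_str_list:
        raise TypeError("participating_clients must be a list of strings")
    if not participating_clients:
        raise ValueError("participating_clients must not be empty")


def error_name(value) -> Optional[str]:
    """Return the name of the exception raised for value, or None if accepted."""
    try:
        check_participating_clients(value)
        return None
    except (TypeError, ValueError) as exc:
        return type(exc).__name__


results = {
    "none": error_name(None),
    "valid": error_name(["site-1", "site-2"]),
    "tuple": error_name(("site-1",)),
    "mixed": error_name(["site-1", 2]),
    "empty": error_name([]),
}
```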

collect_non_local_scripts(job: FedJob) → List[str][source]

Collect scripts that don’t exist locally.

This utility function is used by ExecEnv subclasses to validate script resources before deployment. Scripts are considered “non-local” if they are absolute paths that don’t exist on the local machine.

Parameters:

job – The FedJob to check for non-local scripts.

Returns:

List of absolute script paths that don’t exist on the local machine.
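The "non-local" rule above (absolute path, missing on this machine) can be sketched on a plain list of paths. How the real function extracts script paths from a FedJob is not described here, so `find_non_local_scripts` is a hypothetical helper that starts from the paths themselves.

```python
import os
import tempfile
from typing import Iterable, List


def find_non_local_scripts(script_paths: Iterable[str]) -> List[str]:
    """Return the absolute paths that do not exist on this machine."""
    return [p for p in script_paths if os.path.isabs(p) and not os.path.exists(p)]


with tempfile.NamedTemporaryFile(suffix=".py") as existing:
    # An absolute path that is assumed not to exist locally.
    missing = os.path.join(tempfile.gettempdir(), "no_such_dir_for_demo", "train.py")
    candidates = ["client.py", existing.name, missing]
    non_local = find_non_local_scripts(candidates)
```

Relative paths such as "client.py" are never reported, and an absolute path that exists locally is not reported either; only `missing` ends up in `non_local`.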

ensure_config_type_dict(config: Dict[str, Any] | None) → Dict[str, Any] | None[source]

Ensure a component config dict has config_type ‘dict’ and is normalized for the config layer.

Used by FedOpt-style recipes for optimizer_args and lr_scheduler_args: those dicts have ‘path’ or ‘class_path’ plus ‘args’, and would otherwise be treated as component configs and instantiated during config scan (e.g. torch.optim.SGD without params). This function:

- Accepts either ‘path’ or ‘class_path’ (for consistency with recipe model_config); if only ‘class_path’ is set, copies it to ‘path’ so the component builder and runtime code work unchanged.
- Sets config_type to ‘dict’ when missing so the component builder does not instantiate at load time; the optimizer/scheduler is instantiated at runtime when params/optimizer are available.

Parameters:

config – A component-style config dict (e.g. {‘class_path’: ‘torch.optim.SGD’, ‘args’: {‘lr’: 1.0}} or {‘path’: ‘…’, ‘args’: {…}}) or None.

Returns:

A copy of config with config_type ‘dict’ if missing and path set from class_path if needed; None if config is None.
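The normalization described above can be approximated in a few lines. This is a sketch of the documented behavior, not the library's implementation; `normalize_component_dict` is a hypothetical name, and using a deep copy (rather than a shallow one) is an assumption.

```python
import copy
from typing import Any, Dict, Optional


def normalize_component_dict(config: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return a normalized copy of a component-style config dict."""
    if config is None:
        return None
    result = copy.deepcopy(config)  # assumption: the real function may copy shallowly
    if "path" not in result and "class_path" in result:
        # Mirror class_path into path for the component builder.
        result["path"] = result["class_path"]
    # Mark as a plain dict only when the caller has not set config_type.
    result.setdefault("config_type", "dict")
    return result


original = {"class_path": "torch.optim.SGD", "args": {"lr": 1.0}}
normalized = normalize_component_dict(original)
```

The input dict is left untouched, so the same optimizer_args value can be reused elsewhere in the job script.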

extract_persistor_id(result: Any) → str[source]

merge_config_overrides(defaults: Dict[str, Any], overrides: Dict[str, Any] | None, name: str) → Dict[str, Any][source]

Return a shallow merge of recipe defaults and user overrides.
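What "shallow" means in practice is easiest to see with a nested value. The sketch below uses a hypothetical `shallow_merge` helper and omits the name argument, whose role is not described here; the config keys are illustrative only.

```python
from typing import Any, Dict, Optional


def shallow_merge(defaults: Dict[str, Any], overrides: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Top-level keys from overrides replace those in defaults."""
    merged = dict(defaults)  # copy so the defaults are not mutated
    merged.update(overrides or {})
    return merged


defaults = {"lr": 0.01, "scheduler": {"step_size": 10, "gamma": 0.5}}
merged = shallow_merge(defaults, {"scheduler": {"gamma": 0.9}})
```

In this sketch the nested "scheduler" dict is replaced as a whole rather than merged key by key, so "step_size" is absent from the result. Under a shallow merge, an override of a nested value has to repeat every nested key that should be kept.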

prepare_initial_ckpt(initial_ckpt: str | None, job) → str | None[source]

Prepare initial_ckpt for job deployment.

Relative path: treated as a local file. The file is bundled into the server app’s custom directory and the basename is returned for runtime resolution.
Absolute path: treated as a server-side (remote) path and returned as-is. The file is expected to exist on the server at runtime.

Parameters:

initial_ckpt – Checkpoint file path (absolute or relative).
job – BaseFedJob instance to add the file to.

Returns:

The checkpoint path to pass to the persistor:

- None if initial_ckpt is None
- Basename for relative paths (file is bundled into app/custom/)
- Absolute path as-is for server-side checkpoints

Return type:

str | None
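The three return cases can be summarized in a short sketch. It covers only the path decision and leaves out bundling the file into the job; `checkpoint_runtime_path` is a hypothetical name and the example paths are illustrative.

```python
import os
from typing import Optional


def checkpoint_runtime_path(initial_ckpt: Optional[str]) -> Optional[str]:
    """Return the path a persistor would receive under the documented rules."""
    if initial_ckpt is None:
        return None
    if os.path.isabs(initial_ckpt):
        return initial_ckpt  # server-side path, passed through unchanged
    return os.path.basename(initial_ckpt)  # local file, resolved by basename at runtime


# Illustrative inputs: one relative (local) path and one absolute path.
local_ckpt = os.path.join("models", "init.pt")
server_ckpt = os.path.abspath(os.path.join("ckpts", "init.pt"))

resolved_local = checkpoint_runtime_path(local_ckpt)
resolved_server = checkpoint_runtime_path(server_ckpt)
```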

recipe_model_to_job_model(recipe_model: Dict[str, Any]) → Dict[str, Any][source]

Validate and convert recipe model dict to job/config format (path).

Calls validate_dict_model_config() internally so callers do not need to validate separately. Recipes accept {“class_path”: “module.Class”, “args”: {…}} or {“path”: “module.Class”, “args”: {…}}. The Job API and config parsing expect {“path”: “module.Class”, “args”: {…}}.

Parameters:

recipe_model – Dict with ‘class_path’ or ‘path’ and optional ‘args’.

Returns:

Dict with ‘path’ and ‘args’ for use by PTModel, persistors, etc.
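A minimal sketch of the conversion, assuming the rules stated above. `to_job_model` is a hypothetical name; the precedence when both keys are present and the empty-dict default for missing ‘args’ are assumptions, not documented behavior.

```python
from typing import Any, Dict


def to_job_model(recipe_model: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a recipe-style model dict into the path/args shape."""
    # Assumption: 'path' wins when both keys are present.
    path = recipe_model.get("path", recipe_model.get("class_path"))
    if not isinstance(path, str):
        raise ValueError("model dict needs a string 'class_path' or 'path'")
    # Assumption: a missing or None 'args' becomes an empty dict.
    return {"path": path, "args": dict(recipe_model.get("args") or {})}


from_class_path = to_job_model({"class_path": "module.Class", "args": {"hidden": 8}})
from_path = to_job_model({"path": "module.Class"})
```

Both accepted spellings produce the same job-side shape, which is why recipe authors can use either one.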

resolve_initial_ckpt(initial_ckpt: str | None, prepared_initial_ckpt: str | None, job) → str | None[source]

set_per_site_config(recipe: Recipe, config: Dict[str, Dict]) → None[source]

Set site-keyed configuration on a recipe.

Call this once, immediately after recipe construction and before adding client customizations. The helper validates the generic shape:

- top-level keys are site names
- values are recipe-specific dictionaries
- the mapping is not empty

Each recipe is responsible for validating and interpreting the fields inside each site’s dictionary. Supported recipes materialize client apps later, before the first client customization or before export or execution. The execution environment still controls which clients are present for a run. Per-site values become part of the generated job definition and must never contain actual secret values; see nvflare.recipe.secrets.
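The generic shape check described above can be sketched without nvflare. `check_per_site_shape` is a hypothetical name, raising ValueError for every violation is an assumption, and the `train_args` field is illustrative only, since each recipe defines its own per-site fields.

```python
from typing import Any, Dict


def check_per_site_shape(config: Dict[str, Dict[str, Any]]) -> None:
    """Validate only the outer shape: non-empty mapping of site name to dict."""
    if not isinstance(config, dict) or not config:
        raise ValueError("per-site config must be a non-empty dict")
    for site, site_config in config.items():
        if not isinstance(site, str) or not site:
            raise ValueError(f"site name must be a non-empty string, got {site!r}")
        if not isinstance(site_config, dict):
            raise ValueError(f"config for {site!r} must be a dict")


# Illustrative per-site values; the inner fields are recipe-specific.
per_site = {
    "site-1": {"train_args": "--epochs 2"},
    "site-2": {"train_args": "--epochs 4"},
}
check_per_site_shape(per_site)  # passes silently
```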

set_recipe_meta(recipe: Recipe, key: JobMetaKey, value: Any) → None[source]

Set one generated job metadata value through meta_props.

The key must be one of nvflare.apis.job_def.USER_SETTABLE_JOB_META_KEYS. Keys with dedicated FedJob constructor fields, such as MIN_CLIENTS and MANDATORY_CLIENTS, are not accepted here – set those through the recipe/FedJob constructor so the controller, scheduler, and metadata stay in sync. STUDY is not accepted either: the server assigns the study from the admin session’s active study at job submission, so a recipe-set value would be silently overwritten.

The value shape depends on the key: SCOPE takes a string; RESOURCE_SPEC and JOB_LAUNCHER_SPEC take a dict keyed by site name with dict values; CUSTOM_PROPS takes a dict. Dict values must be completely JSON-serializable, cannot contain non-finite floats, and have their keys coerced to strings as they will appear in meta.json. The value is stored in meta_props and replaces any existing meta_props value for that key. Metadata is emitted in clear text in meta.json and must never contain actual secret values; see nvflare.recipe.secrets.
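The constraints on dict values (JSON-serializable, no non-finite floats, string keys) match what a strict JSON round trip enforces. The sketch below shows that technique with the standard library; `normalize_meta_value` is a hypothetical name, the sample properties are illustrative, and whether the real function validates this way is an assumption.

```python
import json
from typing import Any


def normalize_meta_value(value: Any) -> Any:
    """Round-trip through strict JSON, as the value would appear in meta.json."""
    # allow_nan=False makes json.dumps raise ValueError for NaN and infinities;
    # non-serializable objects raise TypeError; non-string keys become strings.
    return json.loads(json.dumps(value, allow_nan=False))


custom_props = normalize_meta_value({"experiment": {1: "baseline", 2: "tuned"}})

try:
    normalize_meta_value({"loss_cap": float("inf")})
    rejected_non_finite = False
except ValueError:
    rejected_non_finite = True
```

After the round trip the integer keys 1 and 2 have become the strings "1" and "2", which is the form they would take in a JSON file.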

setup_custom_persistor(*, job, model_persistor=None) → str[source]

validate_aggregator_data_kind(*, data_kind: DataKind | None, recipe_name: str, data_kind_arg: str = 'aggregator_data_kind', aggregator: Any | None = None, require_data_kind: bool = False, fixed_data_kind: bool = False) → None[source]

Validate recipe-owned server aggregation data-kind settings.

fixed_data_kind is for recipes such as FedOpt that do not expose the update-kind settings. Its error guidance tells users to replace or reconfigure their custom aggregator instead of suggesting recipe arguments they cannot change.

This intentionally does not infer the client result kind from TransferType. FLModel.params_type is the authoritative description of a client result, and a recipe cannot inspect an arbitrary training script at construction time.

validate_ckpt(ckpt: str | None) → None[source]

Validate a checkpoint path if provided.

For absolute paths: no local existence check (file may be a server-side path). For relative paths: verifies the file exists locally (it will be bundled into the job).

Parameters:

ckpt – Checkpoint file path to validate (e.g. initial_ckpt or eval_ckpt).

Raises:

ValueError – If relative path does not exist locally.
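The asymmetry between absolute and relative paths can be sketched as below. `check_ckpt` is a hypothetical name, and using os.path.exists for the local check is an assumption about how existence is tested.

```python
import os
from typing import Optional


def check_ckpt(ckpt: Optional[str]) -> None:
    """Apply the documented rule: only relative paths are checked locally."""
    if ckpt is None or os.path.isabs(ckpt):
        return  # nothing to check, or a server-side path that may not exist here
    if not os.path.exists(ckpt):
        raise ValueError(f"checkpoint file does not exist locally: {ckpt}")


# An absolute path is accepted even though the file is not present locally.
absolute_missing = os.path.abspath(os.path.join("no_such_dir_for_demo", "model.pt"))
check_ckpt(absolute_missing)

try:
    check_ckpt(os.path.join("no_such_dir_for_demo", "model.pt"))
    relative_error = None
except ValueError as exc:
    relative_error = str(exc)
```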

validate_dict_model_config(model: Any) → None[source]

Validate recipe dict model config structure.

Recipes accept model config with class_path or the path alias. The job/config layer uses path.

Parameters:

model – Model input to validate.

Raises:

ValueError – If dict config is missing ‘class_path’/‘path’ or value is not a string.