Timeouts in NVIDIA FLARE (Reference)

This reference lists the timeout settings in NVIDIA FLARE, grouped by functional area, with their relationships, impacts, and usage examples.

Network Communication Timeouts

This section covers all network-related timeouts including the F3/CellNet communication layer, server configuration, and client communication settings.

F3/CellNet Layer

The F3 (Flare-Friendly Framework) and CellNet provide the core communication infrastructure. These timeouts are configured in comm_config.json.

CommConfigurator Settings

Low-level communication configuration (comm_config.py):

Parameter                  Default  Purpose
heartbeat_interval         varies   Interval for heartbeat messages
subnet_heartbeat_interval  5.0      Interval for subnet heartbeat checks
streaming_read_timeout     300      Timeout for reading streamed data
streaming_ack_interval     4MB      Bytes between ACK messages during streaming
streaming_ack_wait         varies   Time to wait for streaming ACK

CoreCell Settings

Core cell communication parameters (core_cell.py):

Parameter              Default  Purpose
max_timeout            3600     Default timeout for send_and_receive (1 hour)
bulk_check_interval    0.5      Interval for bulk message checking
bulk_process_interval  0.5      Interval for bulk message processing

Cell Request Timeouts

Cell-level request timeouts (cell.py):

Parameter  Default  Purpose
timeout    10.0     Default timeout for send_request/broadcast_request

Timeout phases: each request passes through three phases, each with its own timeout:

  1. Sending timeout: Time to complete message sending

  2. Remote processing timeout: Time for remote to process request

  3. Receiving timeout: Time to receive response

Example comm_config.json:

{
  "heartbeat_interval": 10,
  "subnet_heartbeat_interval": 5,
  "streaming_read_timeout": 300,
  "streaming_ack_interval": 4194304,
  "max_message_size": 1048576
}

Server Configuration

These timeouts are configured in fed_server.json or in the server configuration.

FedServer Timeouts

Server heartbeat and connection management (fed_server.py):

Parameter           Default  Purpose
heart_beat_timeout  600      Time without heartbeat before client considered dead
remove_interval     5.0      Interval for checking/removing dead clients
check_interval      0.2      Interval for connection checking loop

ServerRunner Timeouts

Server runner configuration (server_runner.py, server_json_config.py):

Parameter              Default  Purpose
heartbeat_timeout      60       Client heartbeat timeout in seconds
task_request_interval  2        Task request interval in seconds

Admin Server Timeouts

Admin server command timeouts (admin.py):

Parameter     Default  Purpose
timeout       10.0     Admin command timeout
timeout_secs  2.0      Timeout for send_requests to clients

Example (fed_server.json):

{
  "heart_beat_timeout": 600,
  "admin_timeout": 10.0
}

Client Configuration

Client heartbeat and retry configuration (client_train.py, base_client_deployer.py):

Parameter            Default  Purpose
heart_beat_interval  10.0     Interval for sending heartbeats to server
retry_timeout        30       Timeout for retry operations

Note: heart_beat_interval must be less than the server’s heart_beat_timeout for proper client status tracking.
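As a rough illustration of the note above, the sketch below checks a client interval against the server timeout and reports how many whole heartbeat intervals fit inside that timeout. The helper name missed_heartbeat_budget is hypothetical and not part of NVFLARE; the default arguments are the values listed in this document.

```python
def missed_heartbeat_budget(heart_beat_interval: float = 10.0,
                            heart_beat_timeout: float = 600.0) -> int:
    """Return how many whole heartbeat intervals fit inside the server timeout.

    Hypothetical helper for illustration only; not an NVFLARE API.
    """
    if heart_beat_interval <= 0:
        raise ValueError("heart_beat_interval must be positive")
    if heart_beat_interval >= heart_beat_timeout:
        raise ValueError(
            "heart_beat_interval must be less than the server's heart_beat_timeout"
        )
    # Floor division: only complete intervals count toward the budget.
    return int(heart_beat_timeout // heart_beat_interval)
```

With the documented defaults (10.0 and 600), the server would tolerate roughly 60 consecutive missed heartbeats before treating the client as dead.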

Client-to-Server Communication

Low-level client communication timeouts (communicator.py, fed_client_base.py):

Parameter              Default  Purpose
communication_timeout  300.0    General communication timeout
maint_msg_timeout      30.0     Maintenance message timeout
engine_create_timeout  30.0     Timeout for engine creation
retry_timeout          30.0     Retry timeout for operations

Flare Agent

FlareAgent for external process integration (flare_agent.py):

Parameter                  Default                                                  Purpose
heartbeat_timeout          60.0                                                     Time without heartbeat before peer is dead
submit_result_timeout      60.0                                                     Timeout for submitting task result to the client training process. 60 s is too short for large models; configure via add_client_config({"submit_result_timeout": 1800}).
max_resends                None in raw FlareAgent; 3 through Client API job config  Maximum send retries on failure. For ClientAPILauncherExecutor jobs, the default is the finite value 3; None is rejected at job initialization. Override via add_client_config({"max_resends": N}).
download_complete_timeout  1800.0                                                   Time the subprocess waits after result ACK while the server finishes downloading tensors from the subprocess DownloadService. Must not be None for ClientAPILauncherExecutor jobs.

Note: Raw FlareAgentWithCellPipe defaults to 60.0 s for submit_result_timeout and unlimited max_resends. When launched through ClientAPILauncherExecutor, the generated Client API config supplies the safer job defaults described above. Recipe-based external-process jobs also serialize max_resends=3 in the executor args, so reloaded jobs do not fall back to the raw unlimited retry default. Use recipe.add_client_config({"max_resends": N}) only when a job needs a different finite retry budget.

IPC Agent

IPC Agent for inter-process communication (ipc_agent.py):

Parameter                      Default  Purpose
submit_result_timeout          30.0     Timeout for submitting results
flare_site_connection_timeout  60.0     Timeout for CJ disconnection
flare_site_heartbeat_timeout   None     Timeout for missing CJ heartbeats

gRPC Utility Timeouts

gRPC connection establishment (grpc_utils.py):

Parameter      Default  Purpose
ready_timeout  varies   Time to wait for gRPC server to be ready

Reliable Message

Reliable Messages provide guaranteed delivery with retry logic (reliable_message.py):

Parameter        Default  Purpose
per_msg_timeout  varies   Timeout for each individual message attempt
tx_timeout       varies   Timeout for entire transaction including all retries

Behavior:

  • If tx_timeout <= per_msg_timeout, request is sent only once without retrying

  • Messages are retried until tx_timeout is reached

  • Completed requests are tracked for 2 × tx_timeout to handle late duplicates
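The behavior rules above can be sketched as a small planning helper. This is not the ReliableMessage implementation: the function name plan_reliable_send and the returned keys are hypothetical, and the attempt count assumes every attempt runs for the full per_msg_timeout with no pause between attempts.

```python
import math


def plan_reliable_send(per_msg_timeout: float, tx_timeout: float) -> dict:
    """Summarize how the documented rules play out for a pair of timeouts.

    Hypothetical helper for illustration only; not an NVFLARE API.
    """
    if per_msg_timeout <= 0 or tx_timeout <= 0:
        raise ValueError("both timeouts must be positive")
    # Rule 1: if tx_timeout <= per_msg_timeout, the request is sent once, no retry.
    retry_enabled = tx_timeout > per_msg_timeout
    # Rule 2: retries continue until tx_timeout is reached. Assumption: each
    # attempt consumes the whole per_msg_timeout before the next one starts.
    attempts = math.ceil(tx_timeout / per_msg_timeout) if retry_enabled else 1
    return {
        "retry_enabled": retry_enabled,
        "attempts_if_each_times_out": attempts,
        # Rule 3: completed requests are tracked for 2 x tx_timeout.
        "dedup_window": 2 * tx_timeout,
    }
```

For example, per_msg_timeout=30.0 and tx_timeout=300.0 allow about ten full-length attempts, and a completed request stays tracked for 600 seconds.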

Example:

from nvflare.apis.utils.reliable_message import ReliableMessage

ReliableMessage.send_request(
    target="site-1",
    topic="my_topic",
    request=shareable,
    per_msg_timeout=30.0,   # Each attempt times out after 30s
    tx_timeout=300.0,       # Total transaction timeout 5 minutes
    abort_signal=abort_signal,
    fl_ctx=fl_ctx,
)

Federated Event Timeouts

Fed event runner intervals (fed_event.py):

Parameter           Default  Purpose
regular_interval    0.01     Regular processing interval
grace_period        2.0      Grace period before shutdown
queue_empty_period  2.0      Period to wait when queue is empty

Simulator Timeouts

Simulator-specific timeouts (simulator_runner.py, simulator_worker.py):

Parameter                   Default  Purpose
simulator_worker_timeout    60.0     Timeout for simulator worker
app_runner_timeout          60.0     Timeout for app runner
CELL_CONNECT_CHECK_TIMEOUT  10.0     Timeout for cell connection check
FETCH_TASK_RUN_RETRY        3        Number of retry attempts for task fetch

Flare API Session Timeouts

Session management for programmatic API (flare_api.py):

Parameter              Default  Purpose
timeout (new_session)  10.0     Timeout to establish session
poll_interval          2.0      Interval for polling job status
set_timeout()          varies   Session-specific command timeout

Example:

from nvflare.fuel.flare_api.flare_api import new_secure_session

# Create session with timeout
sess = new_secure_session(
    username="admin@nvidia.com",
    startup_kit_location="/path/to/startup",
    timeout=30.0,
)

# Set command timeout
sess.set_timeout(60.0)

# Monitor job with timeout and poll interval
rc = sess.monitor_job(job_id, timeout=3600, poll_interval=5.0)

Heartbeat Timeouts

Executor Heartbeat

Heartbeat mechanisms ensure connectivity between components:

Timeout             Default  Location                                  Purpose
heartbeat_interval  5.0      LauncherExecutor launcher_executor.py:49  Interval for sending heartbeat messages
heartbeat_timeout   60.0     LauncherExecutor launcher_executor.py:50  Timeout for waiting for heartbeat from peer
peer_read_timeout   60.0     LauncherExecutor launcher_executor.py:46  Time to wait for peer to accept sent message

Client API Heartbeat

The Client API inherits heartbeat configuration from the task exchange settings (config.py:154-159):

def get_heartbeat_timeout(self):
    return self.config.get(ConfigKey.TASK_EXCHANGE, {}).get(
        ConfigKey.HEARTBEAT_TIMEOUT,
        self.config.get(ConfigKey.METRICS_EXCHANGE, {}).get(ConfigKey.HEARTBEAT_TIMEOUT, 60),
    )
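The lookup order in the method above (task exchange first, then metrics exchange, then 60) can be reproduced with plain dictionaries. The key strings below are assumptions standing in for the ConfigKey constants, whose actual values are not shown in this document, and resolve_heartbeat_timeout is a hypothetical name.

```python
# Assumed key names; the real ConfigKey values may differ.
TASK_EXCHANGE = "task_exchange"
METRICS_EXCHANGE = "metrics_exchange"
HEARTBEAT_TIMEOUT = "heartbeat_timeout"


def resolve_heartbeat_timeout(config: dict) -> float:
    """Mirror the fallback chain: task exchange, then metrics exchange, then 60."""
    metrics_value = config.get(METRICS_EXCHANGE, {}).get(HEARTBEAT_TIMEOUT, 60)
    return config.get(TASK_EXCHANGE, {}).get(HEARTBEAT_TIMEOUT, metrics_value)
```

A value under the task exchange section always wins; the metrics exchange value is consulted only when the task exchange section does not define one.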

Executor and Launcher Timeouts

LauncherExecutor Base Class

The LauncherExecutor class defines core timeout parameters for external process management (launcher_executor.py:38-58):

Parameter                     Default  Purpose
launch_timeout                None     Timeout for launcher’s “launch_task” method completion
task_wait_timeout             None     Timeout for retrieving task results
last_result_transfer_timeout  300.0    Timeout for transmitting final result from external process
external_pre_init_timeout     60.0     Time to wait for external process before flare.init() call

ClientAPILauncherExecutor

The Client API executor extends base timeouts with more conservative defaults (client_api_launcher_executor.py:29-53):

Parameter                  Default  Purpose
external_pre_init_timeout  300.0    Extended timeout for heavy library imports
peer_read_timeout          300.0    Timeout for peer message acceptance
heartbeat_timeout          300.0    Extended heartbeat timeout for Client API
submit_result_timeout      300.0    Subprocess-side wait for CJ to acknowledge each result message
max_resends                3        Maximum retries after the initial result send; None is rejected
download_complete_timeout  1800.0   Time the subprocess remains alive for server-side tensor download completion

For subprocess-mode Client API jobs with large payloads, FLARE validates the following at job start:

  • download_complete_timeout must not be None.

  • max_resends must be a finite non-negative integer. Recipe-based jobs serialize the default value 3 in executor args. Use 0 to disable retries; do not use None for unlimited retries.
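A minimal sketch of the two checks listed above, assuming they can be expressed as a standalone function. The name validate_subprocess_job_config and the error messages are hypothetical; FLARE's own validation code is not shown in this document and may differ.

```python
def validate_subprocess_job_config(download_complete_timeout, max_resends) -> None:
    """Raise ValueError if either documented rule is violated.

    Hypothetical helper for illustration only; not an NVFLARE API.
    """
    if download_complete_timeout is None:
        raise ValueError("download_complete_timeout must not be None")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_resends, bool) or not isinstance(max_resends, int):
        raise ValueError("max_resends must be a finite non-negative integer")
    if max_resends < 0:
        raise ValueError("max_resends must be a finite non-negative integer")
```

Passing max_resends=0 is accepted and disables retries, while None (unlimited retries) is rejected, matching the rules above.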

Values supplied through recipe.add_client_config() are top-level entries in config_fed_client.json. For subprocess-mode Client API jobs, ClientAPILauncherExecutor applies these overrides before writing the subprocess client_api_config.json, so submit_result_timeout, download_complete_timeout, and max_resends are seen by both the parent client job process and the external training process.

When tensor_streaming_per_request_timeout or np_streaming_per_request_timeout is explicitly configured, FLARE also warns if PEER_READ_TIMEOUT or download_complete_timeout is shorter than that streaming timeout. Set PEER_READ_TIMEOUT through add_client_config when the parent client job needs a larger pipe-read budget:

recipe.add_client_config({
    "tensor_streaming_per_request_timeout": 600,
    "tensor_min_download_timeout": 600,
    "PEER_READ_TIMEOUT": 600,
    "download_complete_timeout": 1800,
    "max_resends": 3,
})

External Pre-Init Override

Jobs can override the external pre-init timeout via client configuration (constants.py:20-22):

# Configuration key for overriding external_pre_init_timeout in ClientAPILauncherExecutor
EXTERNAL_PRE_INIT_TIMEOUT = "EXTERNAL_PRE_INIT_TIMEOUT"

TaskExchanger

The TaskExchanger base class manages pipe-based task exchange with external processes (task_exchanger.py:38-68):

Parameter             Default  Purpose
read_interval         0.5      How often to read from pipe
heartbeat_interval    5.0      How often to send heartbeat to peer
heartbeat_timeout     60.0     Time to wait for heartbeat from peer (None = disable)
resend_interval       2.0      How often to resend a message if failing to send
peer_read_timeout     60.0     Time to wait for peer to accept sent message
result_poll_interval  0.5      How often to poll for task result

IPCExchanger

The IPCExchanger manages IPC-based communication with Flare Agents (ipc_exchanger.py:50-82):

Parameter                 Default  Purpose
send_task_timeout         5.0      How long to wait for response when sending task to Agent
resend_task_interval      2.0      How often to resend task if failed
agent_connection_timeout  60.0     Time allowed to miss heartbeat before considering agent disconnected
agent_heartbeat_timeout   None     Time allowed to miss heartbeat before stopping (None = disabled)
agent_heartbeat_interval  5.0      How often to send heartbeats to the agent
agent_ack_timeout         5.0      How long to wait for agent ack (heartbeat and bye messages)

InProcessClientAPIExecutor

The in-process executor for Client API (in_process_client_api_executor.py:50-70):

Parameter             Default  Purpose
result_pull_interval  0.5      How often to poll for task result
log_pull_interval     None     How often to pull logs (None = same as result_pull_interval)

Pipe Handler

Inter-process communication pipe timeouts for Client API (pipe_handler.py):

Parameter                Default  Purpose
heartbeat_interval       5.0      Interval for sending heartbeats
heartbeat_timeout        30.0     Max time without heartbeat before peer is dead
default_request_timeout  5.0      Default timeout for requests
resend_interval          2.0      Interval between message resends

Important: heartbeat_interval must be less than heartbeat_timeout.

P2P Executor

Peer-to-peer sync executor (sync_executor.py):

Parameter     Default  Purpose
sync_timeout  10       Timeout waiting for values from neighbors

Admin Client Timeouts

Admin client timeouts control session management and command execution:

Timeout                   Default  Location           Purpose
idle_timeout              900.0    Admin config       Automatic shutdown after idle period
login_timeout             10.0     Admin config       Max time to attempt login
authenticate_msg_timeout  2.0      Admin config       Timeout for authentication messages
Command timeout           5.0      FLARE API session  Default timeout for admin commands

Session-Specific Timeouts

Admin API supports session-specific command timeouts (api_spec.py:305-318):

def set_timeout(self, value: float):
    """Set a session-specific command timeout. This is the amount of time the server
    will wait for responses after sending commands to FL clients.
    Note that this value is only effective for the current API session."""

Task Communication and Messaging

These timeouts control task assignment and result collection between server and clients.

WfCommServer (Workflow Communication Server)

Server-side workflow communication (wf_comm_server.py):

Parameter                Default  Purpose
task.timeout             varies   Overall task timeout
task_assignment_timeout  0        Time to wait for client to pick task
task_result_timeout      0        Time to wait for client to return result
task_check_period        0.2      Interval for checking task status

Validation Rules:

  • task_assignment_timeout must be <= task.timeout

  • task_result_timeout must be <= task.timeout

WfCommClient (Workflow Communication Client)

Client-side workflow communication (wf_comm_client.py):

Parameter         Default  Purpose
max_task_timeout  3600     Maximum single task execution time; used as the effective timeout when the controller sets task.timeout = 0 (i.e., “no timeout”)
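The fallback described for max_task_timeout can be written as a one-line rule. The sketch below covers only the documented case (task.timeout = 0); how values above max_task_timeout are treated is not stated here, so the function leaves them unchanged. The name effective_task_timeout is hypothetical.

```python
def effective_task_timeout(task_timeout: float, max_task_timeout: float = 3600) -> float:
    """Return the timeout the client would apply to a single task.

    Hypothetical helper for illustration only; not an NVFLARE API.
    """
    if task_timeout < 0:
        raise ValueError("task_timeout must be non-negative")
    # task.timeout == 0 means "no timeout" on the controller side, so the
    # client-side max_task_timeout is used instead.
    return max_task_timeout if task_timeout == 0 else task_timeout
```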

Task Pull/Fetch Timeouts

Client-side task fetching from server (client_runner.py, communicator.py, fed_client_base.py):

Parameter                   Default  Purpose
get_task_timeout            None     Timeout for client to fetch task from server
submit_task_result_timeout  None     Timeout for client to submit result to server
timeout (pull_task)         None     Timeout for pull_task communication

Configuration: Set via ConfigVarName.GET_TASK_TIMEOUT and ConfigVarName.SUBMIT_TASK_RESULT_TIMEOUT in client config.

Example (client params in job):

recipe.add_client_config({
    "get_task_timeout": 300,  # 5 minutes
})

Task Manager Timeouts

Task managers control sequential and relay task distribution (send_manager.py, seq_relay_manager.py, any_relay_manager.py):

Parameter                Default  Purpose
task_assignment_timeout  0        Time window for client to request task
task_result_timeout      0        Time to wait for client result before moving to next

Behavior:

  • For SendOrder.SEQUENTIAL: Clients are assigned in order with sliding time window

  • For SendOrder.ANY: First available client gets the task

  • Timeout of 0 means no timeout (wait indefinitely)

Workflow and Controller Timeouts

Client-Controlled Workflows (Server-Side)

Server-side controller timeouts for workflow management (common.py:79-92):

Timeout                     Default  Purpose
configure_task_timeout      300      Time for clients to respond to config task
start_task_timeout          10       Time for starting client to begin workflow
end_workflow_timeout        2.0      Timeout for ending workflow message
progress_timeout            3600.0   Max time without workflow progress
max_status_report_interval  90.0     Max time for client to miss status report

Client-Controlled Workflows (Client-Side)

Client-side timeouts for task coordination (common.py:87-92):

Timeout                    Default  Purpose
learn_task_check_interval  1.0      Interval for checking new learning tasks
learn_task_ack_timeout     10       P2P model-transfer ACK budget (seconds). 10 s is too short for models >2 GB. Set via SwarmLearningRecipe(round_timeout=3600) which wires both learn_task_ack_timeout and final_result_ack_timeout.
learn_task_abort_timeout   5.0      Timeout for task abortion
final_result_ack_timeout   10       Timeout for final result acknowledgment. See learn_task_ack_timeout note above.
get_model_timeout          10       Timeout for getting model from peers
max_task_timeout           3600     Maximum single task execution time

ScatterAndGather Controller

The SAG controller manages aggregation timing (scatter_and_gather.py:37-67):

Parameter                     Default  Purpose
train_timeout                 0        Time to wait for clients to do local training (0 = no timeout)
wait_time_after_min_received  10       Time to wait for additional responses after min_clients
task_check_interval           0.5      Interval for checking task completion

ModelController-Based Workflows

FedAvg, Cyclic, Scaffold, and other ModelController-based workflows (model_controller.py, base_model_controller.py):

Parameter  Default  Purpose
timeout    0        Time to wait for clients to perform task (0 = no timeout)

Note: FedAvg, Scaffold, and Cyclic all inherit from ModelController and use the same timeout parameter.

CyclicController

Cyclic workflow controller (cyclic_ctl.py):

Parameter                Default  Purpose
task_assignment_timeout  10       Timeout for client to request its assigned task

CrossSiteModelEval / CrossSiteEval

Cross-site model evaluation workflows (cross_site_model_eval.py, cross_site_eval.py):

Parameter                      Default  Purpose
submit_model_timeout           600      Timeout for submit_model_task (10 min)
validation_timeout             6000     Timeout for validate_model task (100 min)
wait_for_clients_timeout       300      Timeout for clients to appear (5 min)
eval_task_timeout (CCWF)       1200+    Time for model evaluation by clients
configure_task_timeout (CCWF)  300      Timeout for configuration task
progress_timeout (CCWF)        7200+    Overall workflow progress timeout

Example configuration:

from nvflare.app_common.np.recipes import NumpyCrossSiteEvalRecipe

recipe = NumpyCrossSiteEvalRecipe(
    submit_model_timeout=600,
    validation_timeout=6000,
)

GlobalModelEval

Global model evaluation controller (global_model_eval.py):

Parameter                 Default  Purpose
validation_timeout        6000     Timeout for validate_model task
wait_for_clients_timeout  300      Timeout for clients to appear

BroadcastAndProcess / InitializeGlobalWeights

Broadcast workflows (broadcast_and_process.py, initialize_global_weights.py):

Parameter                     Default  Purpose
timeout / task_timeout        0        Task timeout (0 = no timeout)
wait_time_after_min_received  0-10     Wait time after min responses received

StatisticsController / HierarchicalStatisticsController

Statistics workflow controllers (statistics_controller.py, hierarchical_statistics_controller.py):

Parameter                     Default  Purpose
result_wait_timeout           10       Seconds to wait for results per statistic
wait_time_after_min_received  1        Seconds to wait after min clients received

Note: result_wait_timeout is reset for each statistic, not an overall timeout.

SplitNNController

Split learning controller (splitnn_workflow.py:47-79):

Parameter                 Default  Purpose
task_timeout              10       Timeout for client to request its assigned task
TIMEOUT (class constant)  60.0     Timeout for auxiliary message requests

TIE Controller (Third-party Integration)

Base controller for third-party integration (tie/controller.py, tie/defs.py):

Parameter                  Default  Purpose
configure_task_timeout     10       Time to wait for clients to complete config task
start_task_timeout         10       Time to wait for clients to complete start task
job_status_check_interval  2.0      How often to check client job statuses
max_client_op_interval     90.0     Max time allowed between app ops from a client
progress_timeout           3600.0   Max time allowed with no workflow progress

Note: TIE is used by XGBoost, Flower, and other third-party framework integrations.

Flower Integration Timeouts

Flower-specific controller and executor timeouts (flower/controller.py, flower/executor.py):

Parameter                     Default  Purpose
superlink_ready_timeout       10.0     Time to wait for Flower superlink to become ready
superlink_min_query_interval  10.0     Minimal interval for querying superlink status
monitor_interval              0.5      How often to check Flower run status
per_msg_timeout               10.0     Per-message timeout for ReliableMessage
tx_timeout                    100.0    Transaction timeout for ReliableMessage
client_shutdown_timeout       5.0      Max time for graceful client shutdown

Private Set Intersection (PSI)

PSI workflows do not have explicit timeout parameters at the PSI controller level. PSI inherits general workflow timeouts from the underlying task system.

For PSI operations, timeouts are controlled at lower levels:

  • Task-level timeouts: Use controller’s general timeout parameter

  • Communication timeouts: Inherited from system heartbeat_timeout and peer_read_timeout

Note: For large-scale PSI operations, ensure adequate system-level timeouts in application.conf to handle the iterative Diffie-Hellman protocol exchanges.

Aggregator Timeouts

LazyAggregator

Lazy aggregator for async aggregation (lazy.py):

Parameter       Default  Purpose
accept_timeout  600.0    Max time to wait for accept to finish

Job Scheduler Timeouts

The DefaultJobScheduler controls job scheduling frequency (job.rst:255-270):

Parameter              Default  Purpose
min_schedule_interval  10.0     Minimum interval between schedule attempts
max_schedule_interval  600.0    Maximum interval between schedule attempts
max_schedule_count     10       Maximum times to try scheduling a job

Scheduling strategy: the scheduler adapts its frequency, doubling the interval after each failed attempt until it reaches the maximum.
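To make the doubling rule concrete, the sketch below lists the wait before each scheduling attempt under the defaults in the table. It assumes the first wait equals min_schedule_interval and that the interval doubles after every failed attempt; the function name schedule_intervals is hypothetical, and the real scheduler may differ in detail.

```python
def schedule_intervals(min_schedule_interval: float = 10.0,
                       max_schedule_interval: float = 600.0,
                       max_schedule_count: int = 10) -> list:
    """Return the assumed wait (seconds) before each scheduling attempt.

    Hypothetical helper for illustration only; not an NVFLARE API.
    """
    intervals = []
    interval = min_schedule_interval
    for _ in range(max_schedule_count):
        intervals.append(interval)
        # Double after each failure, but never exceed the configured maximum.
        interval = min(interval * 2, max_schedule_interval)
    return intervals


print(schedule_intervals())
# [10.0, 20.0, 40.0, 80.0, 160.0, 320.0, 600.0, 600.0, 600.0, 600.0]
```

Under these assumptions the interval reaches the 600-second cap on the seventh attempt and stays there for the remaining attempts.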

Recipe Timeouts

Standard Recipe Timeouts

All standard recipes support these timeout parameters (fedavg.py, cyclic.py):

Parameter                Default  Purpose
shutdown_timeout         0.0      Wait time before shutdown for cleanup
task_assignment_timeout  10       Timeout for cyclic task assignment (CyclicRecipe only)

Evaluation Recipe Timeouts

Evaluation recipes have specific timeout requirements (fedeval.py, cross_site_eval.py):

Parameter             Default  Purpose
validation_timeout    6000     Time allowed for model validation
submit_model_timeout  600      Time for clients to submit models for evaluation

Large Model and Streaming Timeouts

File Streaming Timeouts

File streaming for large files (file_streamer.py):

Parameter      Default   Purpose
chunk_timeout  5.0       Timeout for each chunk sent to targets
chunk_size     1M bytes  Size of each chunk streamed

Example:

from nvflare.app_common.streamers.file_streamer import FileStreamer

FileStreamer.stream_file(
    targets=["site-1", "site-2"],
    file_name="/path/to/large_file.bin",
    fl_ctx=fl_ctx,
    chunk_size=1024 * 1024,  # 1MB chunks
    chunk_timeout=10.0,      # 10 seconds per chunk
)

Container Streaming Timeouts

Container/object streaming (container_streamer.py):

Parameter      Default  Purpose
entry_timeout  60.0     Timeout for each entry sent to targets

Example:

from nvflare.app_common.streamers.container_streamer import ContainerStreamer

ContainerStreamer.stream_container(
    targets=["site-1"],
    container=my_large_container,
    fl_ctx=fl_ctx,
    entry_timeout=120.0,  # 2 minutes per entry
)

Object Retrieval Timeouts

Retrieving files/containers from remote sites (file_retriever.py, container_retriever.py):

Parameter      Default  Purpose
timeout        varies   Max seconds to wait for data retrieval
chunk_timeout  varies   Timeout per chunk during file retrieval

Byte Streaming Timeouts

Byte streaming timeouts and intervals (byte_receiver.py, byte_streamer.py):

Parameter               Default  Purpose
streaming_read_timeout  300      Timeout for reading streamed data
ack_interval            4MB      Bytes between acknowledgment messages
ack_wait                varies   Time to wait for ACK before timing out

Note: ACK timeout triggers StreamError and stops the stream.

Download Transaction Timeouts

Object download transaction timeouts (download_service.py, obj_downloader.py):

Parameter            Default  Purpose
timeout              varies   Transaction timeout (time since last activity)
per_request_timeout  varies   Timeout for each request to object owner

Note: A transaction times out if no receiver shows any activity for the specified duration. Download refs that finish normally are tombstoned temporarily, so a late retry from the same receiver can receive the original EOF or error status instead of a fatal missing-ref response. Timed-out and deleted transactions are not tombstoned.

Tensor Streaming Timeouts

Tensor streaming provides efficient transfer of large model weights. These timeouts control the streaming behavior (tensor_stream/server.py, client.py):

Parameter                                Default  Purpose
tensor_send_timeout                      30.0     Timeout for each tensor entry transfer operation
wait_send_task_data_all_clients_timeout  300.0    Timeout for sending tensors to all clients
wait_for_tensors timeout                 5.0      Time to wait for tensors to be received

Server-side configuration (TensorServerStreamer):

from nvflare.app_opt.tensor_stream.server import TensorServerStreamer

streamer = TensorServerStreamer(
    format="pytorch",
    tensor_send_timeout=60.0,  # Per-tensor timeout
    wait_send_task_data_all_clients_timeout=600.0,  # All clients timeout
)

Client-side configuration (TensorClientStreamer):

from nvflare.app_opt.tensor_stream.client import TensorClientStreamer

streamer = TensorClientStreamer(
    format="pytorch",
    tensor_send_timeout=60.0,  # Per-tensor timeout
)

Warning

Critical Timeout Relationship for Tensor Streaming

When using tensor streaming, set get_task_timeout explicitly, to a value greater than or equal to wait_send_task_data_all_clients_timeout. If get_task_timeout is not set, the communicator’s timeout applies, and it may be shorter than the tensor streaming timeout.

Problem: If the streaming timeout exceeds the communicator timeout and get_task_timeout is not set, some clients may receive weights while others are still waiting. The server may then fail to send the task in time, and the resulting timeout restarts the tensor streaming process. Clients can end up receiving empty tensors, and the job can fail.

Solution: Always set get_task_timeout when using tensor streaming:

# Ensure get_task_timeout >= wait_send_task_data_all_clients_timeout
recipe.add_client_config({
    "get_task_timeout": 600,  # Must be >= streaming timeout
})
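As a sanity check before submitting a job, the relationship above can be verified with a few lines of Python. This is a sketch only: check_get_task_timeout is a hypothetical helper, not an NVFLARE function, and the default of 300.0 is the wait_send_task_data_all_clients_timeout default listed in this document.

```python
from typing import List, Optional


def check_get_task_timeout(
    get_task_timeout: Optional[float],
    wait_send_task_data_all_clients_timeout: float = 300.0,
) -> List[str]:
    """Return a list of problems; an empty list means the settings are consistent."""
    problems = []
    if get_task_timeout is None:
        problems.append(
            "get_task_timeout is not set; the communicator timeout would apply"
        )
    elif get_task_timeout < wait_send_task_data_all_clients_timeout:
        problems.append(
            f"get_task_timeout={get_task_timeout} is shorter than "
            f"wait_send_task_data_all_clients_timeout="
            f"{wait_send_task_data_all_clients_timeout}"
        )
    return problems
```

For instance, check_get_task_timeout(600, 600.0) returns an empty list, while leaving get_task_timeout unset reports one problem.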

Streaming Download Timeouts

Framework-level settings for large payload transfers (fl_constant.py:553, comm_config.py:41):

Parameter                      Default  Purpose
streaming_per_request_timeout  600      Per-request timeout for streaming chunks
streaming_read_timeout         300      Timeout for reading streaming data
np_min_download_timeout        300      Minimum idle time (seconds) before an inactive NumPy array download transaction is declared dead. Applies to NumPy/sklearn-based models. Increase to 600 s for 70B+ models on congested networks. Configure via add_client_config({"np_min_download_timeout": 600}).
tensor_min_download_timeout    300      Minimum idle time (seconds) before an inactive PyTorch tensor download transaction is declared dead. Applies to PyTorch-based models. Increase to 600 s for 70B+ models on congested networks. Configure via add_client_config({"tensor_min_download_timeout": 600}).
np_download_chunk_size         2097152  Chunk size for NumPy array downloads (bytes)
tensor_download_chunk_size     2097152  Chunk size for PyTorch tensor downloads (bytes)

For Client API subprocess jobs, keep these download settings aligned with the subprocess pipe settings:

  • tensor_min_download_timeout / np_min_download_timeout should be at least tensor_streaming_per_request_timeout / np_streaming_per_request_timeout.

  • PEER_READ_TIMEOUT should be at least the configured streaming per-request timeout so the parent client job does not resend the task while the subprocess is still downloading a large payload.

  • download_complete_timeout should be at least the configured streaming per-request timeout and long enough for the server to pull large tensor results from the subprocess after result ACK.

  • max_resends should stay finite. The recipe default is 3; raise it only when the network is expected to recover after a small number of delayed result acknowledgments.
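The first three alignment rules above compare several settings against the streaming per-request timeout, which lends itself to a small pre-flight check. The sketch below operates on a plain dictionary of the kind passed to add_client_config; the function name check_download_alignment is hypothetical, and settings that are absent from the dictionary are skipped rather than compared against defaults.

```python
from typing import List


def check_download_alignment(cfg: dict, prefix: str = "tensor") -> List[str]:
    """Return warnings for settings shorter than the streaming per-request timeout.

    prefix is "tensor" or "np", matching the two families of keys in this document.
    Hypothetical helper for illustration only; not an NVFLARE API.
    """
    per_request_key = f"{prefix}_streaming_per_request_timeout"
    per_request = cfg.get(per_request_key)
    warnings = []
    if per_request is None:
        # Nothing to compare against when the streaming timeout is not configured.
        return warnings
    for key in (f"{prefix}_min_download_timeout",
                "PEER_READ_TIMEOUT",
                "download_complete_timeout"):
        value = cfg.get(key)
        if value is not None and value < per_request:
            warnings.append(
                f"{key}={value} is shorter than {per_request_key}={per_request}"
            )
    return warnings
```

Applied to a configuration with tensor_streaming_per_request_timeout, tensor_min_download_timeout, and PEER_READ_TIMEOUT all at 600 and download_complete_timeout at 1800, the check returns no warnings.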

Swarm Learning Large Model Setup

Recommended timeouts for large models in Swarm Learning:

recipe = SwarmLearningRecipe(
    name="swarm",
    model=MyModel(),
    min_clients=3,
    num_rounds=5,
    train_script="client.py",
    round_timeout=7200,   # P2P ACK budget; covers learn_task_ack_timeout + final_result_ack_timeout
    progress_timeout=7200,
    start_task_timeout=300,
)

# Server-side streaming configuration
recipe.add_server_config({
    "np_download_chunk_size": 2097152,
    "streaming_per_request_timeout": 600,
})

# Subprocess-mode timeouts (when launch_external_process=True)
recipe.add_client_config({
    "submit_result_timeout": 1800,
    "download_complete_timeout": 1800,
    "tensor_min_download_timeout": 600,
    "PEER_READ_TIMEOUT": 600,
    "max_resends": 5,
})

XGBoost-Specific Timeouts

XGBoost Histogram-Based Controller

XGBoost histogram-based controller timeouts (histogram_based_v2/controller.py):

Parameter               Default  Purpose
configure_task_timeout  300      Timeout for configuration task
start_task_timeout      10       Timeout for start task
progress_timeout        3600.0   Overall workflow progress timeout

Note: XGBoost uses Reliable Messages for secure training. See the Reliable Message section for per_msg_timeout and tx_timeout configuration.

XGBoost gRPC Client

gRPC client for XGBoost communication (grpc_client.py, grpc_server_adaptor.py):

Parameter                 Default  Purpose
ready_timeout             10       Timeout for gRPC server to be ready
xgb_server_ready_timeout  varies   Timeout for XGBoost server readiness
aggr_timeout              10.0     Aggregation timeout for mock servicer

Example configuration for large datasets:

"per_msg_timeout": 300.0,
"tx_timeout": 900.0,

Confidential Computing Timeouts

SNP Authorizer Timeouts

AMD SEV-SNP attestation timeouts (snp_authorizer.py):

Parameter       Default  Purpose
cmd_timeout     60       SNPGuest command execution timeout
retry_interval  10       Wait time between retry attempts
max_retries     5        Maximum retry attempts

CC Manager Timeouts

Cross-site CC verification timeouts (cc_manager.py):

Parameter                  Default  Purpose
get_site_request_timeout   10.0     Timeout for get site request
get_token_request_timeout  10.0     Timeout for get token request
verify_frequency           600      CC token verification interval (seconds)
cross_validation_interval  varies   Interval between cross-site validation cycles

Interval between cross-site validation cycles

Note: Other CC authorizers (ACI, TDX, GPU, Azure CVM) do not have explicit timeout parameters and rely on system defaults.

Job Launcher Timeouts

Kubernetes Launcher

K8s job launcher timeouts (k8s_launcher.py):

Parameter  Default  Purpose
timeout    None     Timeout for pod to enter RUNNING/TERMINATED state

Docker Launcher

Docker container launcher timeouts (docker_launcher.py):

Parameter  Default  Purpose
timeout    None     Timeout for container to enter target state

Edge Device Timeouts

This section covers all edge device, mobile client, and hierarchical FL timeouts.

Edge Device General

Edge devices have specific timeout requirements:

Parameter            Default  Purpose
update_timeout       5        Timeout for model updates from devices
device_wait_timeout  None     Time to wait for sufficient devices to join
job_timeout          60.0     Overall timeout for edge job execution

Example:

from nvflare.edge.tools.edge_fed_buff_recipe import EdgeFedBuffRecipe

recipe = EdgeFedBuffRecipe(
    model=MyModel(),
    update_timeout=10,
    job_timeout=120.0,
)

Hierarchical FL

Hierarchical FL enables multi-tier federation with edge devices organized in a tree structure.

ScatterAndGatherForEdge (SAGE) Controller

Server-side controller for hierarchical edge FL (edge/controllers/sage.py):

Parameter

Default

Purpose

assess_interval

0.5

Interval for invoking the assessor during task execution

update_interval

1.0

Interval for children to send updates

task_check_period

0.5

Interval for checking status of tasks

HierarchicalUpdateGatherer (HUG) Executor

Executor for hierarchical update gathering (edge/executors/hug.py):

Parameter

Default

Purpose

update_timeout

required

Timeout for update messages sent to parent

EdgeTaskExecutor (ETE)

Edge task executor for leaf nodes (edge/executors/ete.py):

Parameter

Default

Purpose

update_timeout

required

Timeout for update messages sent to parent

Example:

from nvflare.edge.controllers.sage import ScatterAndGatherForEdge
from nvflare.edge.executors.hug import HierarchicalUpdateGatherer

# Server-side controller
sage = ScatterAndGatherForEdge(
    num_rounds=5,
    assess_interval=0.5,
    update_interval=1.0,
    task_check_period=0.5,
)

# Client-side executor
hug = HierarchicalUpdateGatherer(
    learner_id="learner",
    updater_id="updater",
    update_timeout=30.0,
)

Mobile Client

The Android SDK includes a job operation timeout (mobile_android.rst:43-58):

AndroidFlareRunner(
    // ... other parameters
    jobTimeout: Float,  // Timeout in seconds for job operations
)

SubprocessLauncher Timeouts

Subprocess launcher timeout (subprocess_launcher.py):

Parameter

Default

Purpose

shutdown_timeout

0.0

Time to wait before forcefully stopping subprocess
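
A grace period of this kind can be sketched with the standard library alone. The example below is an illustration, not the SubprocessLauncher implementation. It assumes the launcher waits up to shutdown_timeout seconds for the process to finish on its own and force-stops it afterwards. The function name stop_process is hypothetical.

```python
import subprocess
import sys


def stop_process(proc, shutdown_timeout=0.0):
    """Wait up to shutdown_timeout seconds for proc to exit, then kill it.

    Returns the process return code.
    """
    try:
        # Give the process a chance to finish on its own.
        return proc.wait(timeout=shutdown_timeout)
    except subprocess.TimeoutExpired:
        # Grace period used up: stop the process forcefully and reap it.
        proc.kill()
        return proc.wait()


if __name__ == "__main__":
    # A child that would run for 30s is stopped after a 0.2s grace period.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    print(stop_process(child, shutdown_timeout=0.2))
```

With the default of 0.0 the wait returns immediately, so a process that is still running is stopped without delay.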

Experiment Tracking Timeouts

WandB Receiver

Weights & Biases integration timeouts (wandb_receiver.py):

Parameter

Default

Purpose

process_timeout

10.0

Timeout for joining WandB processes at shutdown

login timeout

1.0

Internal timeout for WandB login verification

MLflow Receiver

MLflow integration timing (mlflow_receiver.py):

Parameter

Default

Purpose

buffer_flush_time

1

Seconds between deliveries to MLflow tracking server

Note: Reducing buffer_flush_time increases traffic to the MLflow tracking server, which can add latency.
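
The trade-off behind buffer_flush_time can be illustrated with a small time-based buffer. This is a sketch of the general technique, not the receiver's code. The class FlushBuffer and its deliver callback are hypothetical names, and the clock is injectable so the behavior can be checked without waiting.

```python
import time


class FlushBuffer:
    """Collect items and hand them to `deliver` at most once per flush interval."""

    def __init__(self, deliver, buffer_flush_time=1.0, clock=time.monotonic):
        self.deliver = deliver
        self.buffer_flush_time = buffer_flush_time
        self.clock = clock
        self.items = []
        self.last_flush = clock()

    def add(self, item):
        self.items.append(item)
        # Deliver only when the flush interval has elapsed since the last flush.
        if self.clock() - self.last_flush >= self.buffer_flush_time:
            self.flush()

    def flush(self):
        if self.items:
            self.deliver(list(self.items))
            self.items.clear()
        self.last_flush = self.clock()
```

A smaller interval makes add() call deliver more often with smaller batches, which is the extra traffic the note refers to.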

TensorBoard Receiver

TensorBoard receiver (tb_receiver.py) does not have explicit timeout parameters. Events are written directly to disk without buffering.

Metrics Relay and Sender

Metrics exchange timeouts for experiment tracking (metric_relay.py, metrics_sender.py):

Parameter

Default

Purpose

heartbeat_timeout

30.0-60.0

Timeout for peer heartbeat (MetricRelay: 60s, MetricsSender: 30s)

heartbeat_interval

5.0

Interval between heartbeats

read_interval

0.1

Interval for reading from pipe

Example:

from nvflare.app_common.widgets.metric_relay import MetricRelay

metric_relay = MetricRelay(
    heartbeat_interval=5.0,
    heartbeat_timeout=60.0,
    read_interval=0.1,
)
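
How heartbeat_interval and heartbeat_timeout interact can be sketched as follows. This is a simplified model under the assumption that a peer is treated as lost once no heartbeat has arrived for heartbeat_timeout seconds. The class HeartbeatMonitor is hypothetical and is not part of NVFlare.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat from a peer and report whether it is still alive."""

    def __init__(self, heartbeat_timeout=60.0, clock=time.monotonic):
        self.heartbeat_timeout = heartbeat_timeout
        self.clock = clock
        self.last_seen = clock()

    def on_heartbeat(self):
        # Called whenever a heartbeat arrives from the peer.
        self.last_seen = self.clock()

    def peer_alive(self):
        # The peer is alive while the silence is no longer than the timeout.
        return self.clock() - self.last_seen <= self.heartbeat_timeout
```

With heartbeat_interval=5.0 and heartbeat_timeout=60.0, the timeout spans 12 heartbeat intervals, so several heartbeats can be lost before the peer is declared dead.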

Timeout Relationships and Dependencies

Hierarchical Relationships

┌─────────────────────────────────────────────────────────────────┐
│                    SYSTEM-LEVEL TIMEOUTS                        │
├─────────────────────────────────────────────────────────────────┤
│  Server Configuration (fed_server.json)                        │
│  ├── heart_beat_timeout (600s) - Client liveness detection     │
│  ├── admin_timeout (10s) - Admin command processing            │
│  └── task_request_interval (2s) - Task polling rate            │
│                                                                 │
│  Client Configuration                                           │
│  ├── heart_beat_interval (10s) - Keep-alive to server          │
│  ├── retry_timeout (30s) - Operation retry                      │
│  └── communication_timeout (300s) - Network operations          │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    F3/CELLNET LAYER                             │
├─────────────────────────────────────────────────────────────────┤
│  CommConfigurator (comm_config.json)                            │
│  ├── heartbeat_interval < heartbeat_timeout (REQUIRED)          │
│  ├── subnet_heartbeat_interval (5s)                             │
│  ├── streaming_read_timeout (300s)                              │
│  └── max_timeout (3600s) - CoreCell default                     │
│                                                                 │
│  Cell Requests                                                  │
│  └── timeout (10s) → Sending → Processing → Receiving           │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    TASK COMMUNICATION                           │
├─────────────────────────────────────────────────────────────────┤
│  Task Lifecycle                                                 │
│  ├── task_assignment_timeout ≤ task.timeout (REQUIRED)          │
│  ├── task_result_timeout ≤ task.timeout (REQUIRED)              │
│  ├── get_task_timeout - Client fetching task                    │
│  └── submit_task_result_timeout - Client submitting result      │
│                                                                 │
│  max_task_timeout (3600s) - Applied when task.timeout = 0       │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    WORKFLOW LAYER                               │
├─────────────────────────────────────────────────────────────────┤
│  ModelController-Based (FedAvg, Cyclic, Scaffold, etc.)         │
│  └── timeout (0 = no timeout) - Per-task timeout                │
│                                                                 │
│  ScatterAndGather / ScatterAndGatherScaffold                    │
│  ├── train_timeout (0 = no timeout)                             │
│  └── wait_time_after_min_received (10s)                         │
│                                                                 │
│  CyclicController                                               │
│  └── task_assignment_timeout (10s)                              │
│                                                                 │
│  CrossSiteModelEval / CrossSiteEval                             │
│  ├── submit_model_timeout (600s)                                │
│  ├── validation_timeout (6000s)                                 │
│  └── wait_for_clients_timeout (300s)                            │
│                                                                 │
│  GlobalModelEval                                                │
│  ├── validation_timeout (6000s)                                 │
│  └── wait_for_clients_timeout (300s)                            │
│                                                                 │
│  BroadcastAndProcess / InitializeGlobalWeights                  │
│  ├── timeout / task_timeout (0 = no timeout)                    │
│  └── wait_time_after_min_received (0-10s)                       │
│                                                                 │
│  StatisticsController / HierarchicalStatisticsController        │
│  └── result_wait_timeout (10s) - Per-statistic timeout          │
│                                                                 │
│  SplitNNController                                              │
│  └── task_timeout (10s)                                         │
│                                                                 │
│  TIE Controller (XGBoost, Flower, etc.)                         │
│  ├── configure_task_timeout (10s)                               │
│  ├── start_task_timeout (10s)                                   │
│  ├── job_status_check_interval (2s)                             │
│  ├── max_client_op_interval (90s)                               │
│  └── progress_timeout (3600s)                                   │
│                                                                 │
│  Flower-Specific                                                │
│  ├── superlink_ready_timeout (10s)                              │
│  ├── per_msg_timeout (10s)                                      │
│  ├── tx_timeout (100s)                                          │
│  └── client_shutdown_timeout (5s)                               │
│                                                                 │
│  CCWF Server-Side                                               │
│  ├── configure_task_timeout (300s)                              │
│  ├── start_task_timeout (10s)                                   │
│  └── progress_timeout (3600s) - Overall workflow                │
│                                                                 │
│  CCWF Client-Side (Swarm Learning)                              │
│  ├── learn_task_ack_timeout (10s)                               │
│  ├── learn_task_abort_timeout (5s)                              │
│  └── final_result_ack_timeout (10s)                             │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    EXECUTOR LAYER                               │
├─────────────────────────────────────────────────────────────────┤
│  LauncherExecutor / ClientAPILauncherExecutor                   │
│  ├── launch_timeout                                             │
│  ├── external_pre_init_timeout (60-300s)                        │
│  ├── task_wait_timeout                                          │
│  ├── last_result_transfer_timeout (300s)                        │
│  └── heartbeat_timeout (60-300s)                                │
│                                                                 │
│  TaskExchanger (Pipe Handler)                                   │
│  ├── heartbeat_interval < heartbeat_timeout (REQUIRED)          │
│  ├── read_interval (0.5s)                                       │
│  ├── resend_interval (2s)                                       │
│  ├── peer_read_timeout (60s)                                    │
│  └── result_poll_interval (0.5s)                                │
│                                                                 │
│  IPCExchanger (Agent-based)                                     │
│  ├── send_task_timeout (5s)                                     │
│  ├── resend_task_interval (2s)                                  │
│  ├── agent_connection_timeout (60s)                             │
│  ├── agent_heartbeat_timeout (None)                             │
│  └── agent_ack_timeout (5s)                                     │
│                                                                 │
│  InProcessClientAPIExecutor                                     │
│  ├── result_pull_interval (0.5s)                                │
│  └── log_pull_interval (None)                                   │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    STREAMING LAYER                              │
├─────────────────────────────────────────────────────────────────┤
│  Reliable Message                                               │
│  └── per_msg_timeout ≤ tx_timeout (for retries to work)         │
│                                                                 │
│  File/Container Streaming                                       │
│  ├── chunk_timeout (5s per chunk)                               │
│  └── entry_timeout (60s per entry)                              │
│                                                                 │
│  Tensor Streaming (CRITICAL RELATIONSHIP)                       │
│  ├── tensor_send_timeout (30s)                                  │
│  ├── wait_send_task_data_all_clients_timeout (300s)             │
│  └── get_task_timeout >= wait_send_task_data_all_clients_timeout│
│      (REQUIRED to prevent task fetch timeout during streaming)  │
└─────────────────────────────────────────────────────────────────┘
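
The Task Communication box states that max_task_timeout is applied when task.timeout is 0. A minimal sketch of that fallback, assuming 0 is the only sentinel value, could look like this. The function name is hypothetical.

```python
def effective_task_timeout(task_timeout, max_task_timeout=3600):
    """Return the timeout that applies to a task, in seconds."""
    if task_timeout < 0:
        raise ValueError("task timeout must not be negative")
    # 0 means "not set": fall back to the system-wide ceiling.
    return max_task_timeout if task_timeout == 0 else task_timeout


print(effective_task_timeout(0))    # falls back to max_task_timeout
print(effective_task_timeout(120))  # explicit value is kept
```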

Impact Analysis

Too Short Timeouts:

Timeout Category

Impact of Too Short Value

heart_beat_timeout

Clients incorrectly marked dead, frequent reconnections

task.timeout / train_timeout

Training interrupted before completion, lost work

external_pre_init_timeout

Large model loading fails, external processes killed

streaming_read_timeout

Large file transfers fail mid-stream

per_msg_timeout

Reliable messages fail on slow networks

get_task_timeout

Clients fail to receive tasks, job stalls

admin_timeout

Admin commands fail, poor CLI experience

task_assignment_timeout (Cyclic)

Client fails to fetch task in time, job aborts

submit_model_timeout (CrossSiteEval)

Model submission fails, evaluation incomplete

validation_timeout (CrossSiteEval)

Validation tasks fail prematurely

result_wait_timeout (Statistics)

Statistics collection aborted before all clients respond

agent_connection_timeout (IPC)

External agent incorrectly marked disconnected

send_task_timeout (IPC)

Task delivery to agent fails, triggers resends

superlink_ready_timeout (Flower)

Flower integration fails to initialize

configure_task_timeout (TIE)

Third-party framework configuration fails

max_client_op_interval (TIE)

Healthy clients marked as stuck

Too Long Timeouts:

Timeout Category

Impact of Too Long Value

heart_beat_timeout

Dead clients not detected, resources wasted

task_assignment_timeout

Slow failover to backup clients

progress_timeout

Hung workflows not detected for hours

retry_timeout

Long delays before retry attempts

shutdown_timeout

Slow job termination, resource cleanup delayed

wait_for_clients_timeout (CrossSiteEval)

Long wait for clients that won’t join

agent_heartbeat_timeout (IPC)

Hung agents not detected, job stalls

resend_task_interval (IPC/TaskExchanger)

Slow recovery from transient failures

result_poll_interval (Executor)

Delayed result detection, slower job completion

job_status_check_interval (TIE)

Delayed detection of job completion or failure

tx_timeout (ReliableMessage)

Long waits for failed transactions

Configuration File Locations

This section describes where timeout configuration files are located and which timeouts each file controls. Configuration is divided into system-level (startup kit) and job-level (application) settings.

System-Level Configuration (Startup Kit)

System-level timeouts are configured in the startup kit and apply to all jobs. These files are located in the local/ directory of each participant.

Startup Kit Structure:

startup_kit/
├── server/
│   └── local/
│       ├── fed_server.json          # Server heartbeat, admin timeouts
│       ├── comm_config.json         # F3/CellNet communication layer
│       └── resources.json           # Resource configuration
│
├── site-1/ (client)
│   └── local/
│       ├── fed_client.json          # Client heartbeat, retry timeouts
│       ├── comm_config.json         # F3/CellNet communication layer
│       └── resources.json           # Resource configuration
│
└── admin/
    └── local/
        └── admin.json               # Admin session timeouts

Deployed System Paths:

After deployment, these files are located at:

Component

Startup Kit Path

Deployed Path

Server

startup_kit/server/local/

/opt/nvflare/workspace/server/local/ or ~/nvflare/workspace/server/local/

Client (Site)

startup_kit/site-*/local/

/opt/nvflare/workspace/site-*/local/ or ~/nvflare/workspace/site-*/local/

Admin

startup_kit/admin/local/

/opt/nvflare/workspace/admin/local/ or ~/nvflare/workspace/admin/local/

System-Level Configuration Files:

File

Location

Timeouts Controlled

fed_server.json

server/local/

heart_beat_timeout, admin_timeout, task_request_interval, heartbeat_timeout

fed_client.json

site-*/local/

heart_beat_interval, retry_timeout, communication_timeout

comm_config.json

server/local/, site-*/local/

heartbeat_interval, subnet_heartbeat_interval, streaming_read_timeout, streaming_ack_interval, max_message_size

resources.json

server/local/, site-*/local/

Resource allocation and limits

admin.json

admin/local/

idle_timeout, login_timeout, command_timeout

Note: Changes to system-level files require restarting the affected FLARE components.
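
When reviewing a deployment, it can help to list the timeout-related keys that are actually set in a participant's local/ directory. The helper below is a hypothetical sketch using only the standard library. It inspects top-level keys only and selects them by the _timeout and _interval suffixes, which is a naming assumption based on the tables in this document.

```python
import json
from pathlib import Path


def collect_timeouts(local_dir):
    """Return {file name: {key: value}} for top-level timeout/interval keys."""
    found = {}
    for path in sorted(Path(local_dir).glob("*.json")):
        with path.open() as f:
            data = json.load(f)
        if not isinstance(data, dict):
            continue
        # Keep only keys that look like timeouts or intervals.
        keys = {k: v for k, v in data.items() if k.endswith(("_timeout", "_interval"))}
        if keys:
            found[path.name] = keys
    return found
```

For example, collect_timeouts("startup_kit/server/local") would report the keys set in fed_server.json and comm_config.json. Nested settings, such as entries under "servers", are not visited by this sketch.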

Job-Level Configuration

Job-level timeouts are configured per job and override defaults for that specific job. These files are located in the job’s app/config/ directory.

Job Configuration Files:

File

Location

Timeouts Controlled

application.conf

app/config/

Task timeouts, streaming timeouts, runner sync timeouts

config_fed_client.json

app/config/

Executor timeouts, Client API task exchange, pipe handler settings

config_fed_server.json

app/config/

Controller timeouts, workflow component configurations

Ways to Configure Job-Level Timeouts:

  1. Recipe API - Using recipe.add_client_config() to pass client parameters:

    # Apply to all clients
    recipe.add_client_config({
        "get_task_timeout": 300,
        "submit_task_result_timeout": 300,
    })
    
    # Apply to specific clients
    recipe.add_client_config({
        "get_task_timeout": 600,
    }, clients=["site-1", "site-2"])
    
  2. Job config files - In app/config/ directory:

    • config_fed_client.json - Client-side executor and task exchange settings

    • config_fed_server.json - Server-side controller and workflow settings

Configuration Examples

fed_server.json (Server Configuration)

{
  "heart_beat_timeout": 600,
  "admin_timeout": 10.0,
  "servers": [
    {
      "heart_beat_timeout": 600
    }
  ]
}

comm_config.json (F3/CellNet Layer)

{
  "heartbeat_interval": 10,
  "subnet_heartbeat_interval": 5,
  "streaming_read_timeout": 300,
  "streaming_ack_interval": 4194304,
  "streaming_chunk_size": 1048576,
  "max_message_size": 1048576
}

Client API Configuration (config_fed_client.json)

{
  "TASK_EXCHANGE": {
    "heartbeat_timeout": 60.0,
    "heartbeat_interval": 5.0,
    "resend_interval": 2.0,
    "pipe": {
      "ARG": {
        "root_url": "tcp://localhost:8002"
      }
    }
  }
}

application.conf Settings

# Task communication timeouts
get_task_timeout = 60.0
submit_task_result_timeout = 120.0
task_check_timeout = 5.0

# Cell/messaging timeouts
cell_wait_timeout = 5.0

# Streaming timeouts
streaming_per_request_timeout = 600.0
np_download_chunk_size = 4194304
tensor_download_chunk_size = 4194304

# Runner sync timeouts
runner_sync_timeout = 10.0
max_runner_sync_timeout = 60.0

# Shutdown
end_run_readiness_timeout = 10.0

# Server startup/dead-job safety flags
strict_start_job_reply_check = false
sync_client_jobs_require_previous_report = true

Server Startup and Dead-Job Safety Flags

These application.conf flags are server-side safety controls used during job startup and client heartbeat synchronization:

Parameter

Default

Purpose

strict_start_job_reply_check

false

Enables strict START_JOB reply validation (detects missing or timed-out replies and non-OK return codes).

sync_client_jobs_require_previous_report

true

Requires a prior positive heartbeat report before treating “missing job on client” as a dead-job signal.

Recommended usage:

  • strict_start_job_reply_check defaults to false for backward compatibility. In non-strict mode, clients whose START_JOB reply times out are silently excluded from the active set and the job continues; min_sites / required_sites constraints are not enforced for those timeouts, so startup problems can go undetected. In strict mode, timeouts are detected and reported, required_sites and min_sites are then checked, and the job continues (with a warning) only if the constraints are still satisfied. Enable strict mode when timeouts should be visible and constraints should be enforced at startup.

  • Keep sync_client_jobs_require_previous_report=true (default) to prevent false dead-job reports during startup races and transient heartbeat delays.

  • Set sync_client_jobs_require_previous_report=false only to restore legacy behavior where the first missing-job heartbeat immediately triggers dead-job detection.
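
The difference between the two modes of strict_start_job_reply_check can be modeled in a few lines. This is a simplified sketch of the behavior described above, not the server's code. It covers timeouts only (not non-OK return codes), and the function name, the reply values "ok" and "timeout", and the return shape are all assumptions made for illustration.

```python
def evaluate_start_job_replies(replies, strict, min_sites=0, required_sites=()):
    """Decide whether a job may proceed after START_JOB replies are collected.

    replies maps client name -> "ok" or "timeout".
    Returns (proceed, active_clients, warnings).
    """
    active = sorted(c for c, r in replies.items() if r == "ok")
    timed_out = sorted(c for c, r in replies.items() if r == "timeout")
    if not strict or not timed_out:
        # Non-strict: timed-out clients are dropped silently and the job continues.
        return True, active, []
    # Strict: surface every timeout, then enforce the site constraints.
    warnings = [f"no START_JOB reply from {c}" for c in timed_out]
    missing_required = [c for c in required_sites if c not in active]
    if missing_required or len(active) < min_sites:
        return False, active, warnings
    return True, active, warnings
```

With replies from site-1 and site-3 but a timeout from site-2, a job with min_sites=3 proceeds in non-strict mode and is stopped in strict mode.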

Admin Client Session (Python API)

from nvflare.fuel.flare_api.flare_api import new_secure_session

# Create session with connection timeout
sess = new_secure_session(
    username="admin@nvidia.com",
    startup_kit_location="/path/to/startup",
    timeout=30.0,
)

# Set session-specific command timeout
sess.set_timeout(60.0)  # 60 seconds for commands

# Monitor job with timeout (job_id identifies a previously submitted job)
rc = sess.monitor_job(
    job_id,
    timeout=3600,       # 1 hour max
    poll_interval=5.0,  # Check every 5 seconds
)

# Reset to server default
sess.unset_timeout()
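
The relationship between an overall timeout and a poll_interval, as used when monitoring a job, follows a common polling pattern. The sketch below shows that pattern in isolation. It is not the implementation of monitor_job, and poll_until with its injectable clock and sleep parameters is a hypothetical helper.

```python
import time


def poll_until(done, timeout=None, poll_interval=5.0, clock=time.monotonic, sleep=time.sleep):
    """Call done() every poll_interval seconds.

    Returns True once done() succeeds, or False if `timeout` seconds pass first.
    timeout=None waits indefinitely.
    """
    deadline = None if timeout is None else clock() + timeout
    while True:
        if done():
            return True
        if deadline is None:
            sleep(poll_interval)
            continue
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline, so the final check happens on time.
        sleep(min(poll_interval, remaining))
```

A shorter poll_interval detects completion sooner at the cost of more status requests; the timeout bounds the total wait regardless of the interval.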

Recipe with Extended Timeouts

from nvflare.app_opt.pt.recipes import FedAvgRecipe

recipe = FedAvgRecipe(
    name="large_model_training",
    model={"class_path": "model.LargeModel", "args": {}},
    min_clients=8,
    num_rounds=100,
    shutdown_timeout=120.0,
    train_script="client.py",
)

# Client timeout parameters
recipe.add_client_config({
    "get_task_timeout": 300,
    "submit_task_result_timeout": 300,
})

CCWF/Swarm Learning Configuration

from nvflare.app_opt.pt.recipes.swarm import SwarmLearningRecipe

recipe = SwarmLearningRecipe(
    min_clients=3,
    num_rounds=10,
    model=model,
    train_script="train.py",
    cross_site_eval_timeout=600.0,
    round_timeout=3600,   # P2P model-transfer ACK budget; increase for large models
)

Flower Integration

from flwr.client import ClientApp
from flwr.server import ServerApp
from nvflare.app_opt.flower.recipe import FlowerRecipe

recipe = FlowerRecipe(
    server_app=ServerApp(...),
    client_app=ClientApp(...),
    superlink_ready_timeout=30.0,
    configure_task_timeout=300,
    start_task_timeout=30,
    progress_timeout=7200,
    per_msg_timeout=30.0,
    tx_timeout=300.0,
    client_shutdown_timeout=10.0,
)

Edge Device Configuration

from nvflare.edge.tools.edge_fed_buff_recipe import EdgeFedBuffRecipe

recipe = EdgeFedBuffRecipe(
    model=MyModel(),
    update_timeout=30,
    job_timeout=600.0,
    device_wait_timeout=120.0,
)

TaskExchanger Configuration

from nvflare.app_common.executors.task_exchanger import TaskExchanger

executor = TaskExchanger(
    read_interval=0.5,
    heartbeat_interval=5.0,
    heartbeat_timeout=120.0,
    resend_interval=5.0,
    peer_read_timeout=120.0,
    result_poll_interval=1.0,
)

LauncherExecutor Configuration

from nvflare.app_common.executors.launcher_executor import LauncherExecutor

executor = LauncherExecutor(
    launch_timeout=60.0,
    task_wait_timeout=3600.0,
    last_result_transfer_timeout=600.0,
    external_pre_init_timeout=300.0,
    peer_read_timeout=120.0,
    monitor_interval=0.5,
    read_interval=0.5,
    heartbeat_interval=10.0,
    heartbeat_timeout=120.0,
)

ModelController-Based Workflow

from nvflare.app_common.workflows.fedavg import FedAvg

controller = FedAvg(
    num_clients=8,
    num_rounds=100,
)

# Task with timeout (model is the global model object, defined elsewhere)
controller.send_model_and_wait(
    targets=None,
    data=model,
    timeout=3600,  # 1 hour per round
)

ScatterAndGather Configuration

from nvflare.app_common.workflows.scatter_and_gather import ScatterAndGather

controller = ScatterAndGather(
    min_clients=4,
    num_rounds=50,
    train_timeout=7200,              # 2 hours per round
    wait_time_after_min_received=30, # Wait 30s for stragglers
    task_check_interval=1.0,
)

CyclicController Configuration

from nvflare.app_common.workflows.cyclic_ctl import CyclicController

controller = CyclicController(
    num_rounds=10,
    task_assignment_timeout=30,  # 30 sec to request task
)

TIE Controller Configuration

from nvflare.app_common.tie.controller import TieController

controller = TieController(
    configure_task_timeout=60,
    start_task_timeout=30,
    job_status_check_interval=5.0,
    max_client_op_interval=120.0,
    progress_timeout=7200.0,
)

Notes and Best Practices

General Rules:

  • Timeout values are in seconds unless otherwise specified

  • None or 0 often means no timeout limit (wait indefinitely)

  • Chunk size values of 0 disable streaming and use native serialization

Critical Constraints:

  • heartbeat_interval must be less than heartbeat_timeout

  • task_assignment_timeout must be less than or equal to task.timeout

  • task_result_timeout must be less than or equal to task.timeout

  • per_msg_timeout should be less than or equal to tx_timeout for retries to work

  • agent_heartbeat_interval must be less than agent_connection_timeout

  • IMPORTANT: When using tensor streaming, get_task_timeout must be greater than or equal to wait_send_task_data_all_clients_timeout to prevent task fetch timeouts while waiting for all clients to receive tensors
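
The pairwise constraints above can be checked mechanically before a job is submitted. The following is a minimal sketch, assuming the relevant values have been gathered into a flat dictionary. The function name find_violations and the key task_timeout (standing in for task.timeout) are hypothetical, and a task_timeout of 0 is treated as "no per-task limit".

```python
def find_violations(cfg):
    """Return a message for each violated constraint.

    A constraint is skipped when either of its keys is missing from cfg.
    """
    rules = [
        ("heartbeat_interval", "<", "heartbeat_timeout"),
        ("task_assignment_timeout", "<=", "task_timeout"),
        ("task_result_timeout", "<=", "task_timeout"),
        ("per_msg_timeout", "<=", "tx_timeout"),
        ("agent_heartbeat_interval", "<", "agent_connection_timeout"),
    ]
    problems = []
    for left, op, right in rules:
        if left not in cfg or right not in cfg:
            continue
        a, b = cfg[left], cfg[right]
        if right == "task_timeout" and b == 0:
            continue  # 0 means no per-task limit, so nothing to compare against
        ok = a < b if op == "<" else a <= b
        if not ok:
            problems.append(f"{left} ({a}) must be {op} {right} ({b})")
    return problems


print(find_violations({"heartbeat_interval": 60, "heartbeat_timeout": 60}))
```

An empty list means none of the listed constraints is violated for the keys that were supplied.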

Tensor Streaming Timeout Warning:

When tensor streaming is enabled and get_task_timeout is not explicitly set, it defaults to the communicator’s timeout. If the streaming timeout (wait_send_task_data_all_clients_timeout) exceeds the communicator timeout, clients may time out while waiting for other clients to receive weights. The tensor streaming process can then restart and clients may receive empty tensors, which causes the job to fail.

Recommended relationship for tensor streaming:

get_task_timeout >= wait_send_task_data_all_clients_timeout >= tensor_send_timeout * num_clients
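
The recommended chain can be turned into a quick check. This is a sketch under the assumption that all three values are known when the job is configured. The function name and the shortened argument name wait_all_clients_timeout (for wait_send_task_data_all_clients_timeout) are hypothetical.

```python
def check_tensor_streaming(get_task_timeout, wait_all_clients_timeout, tensor_send_timeout, num_clients):
    """Return messages for each link of the recommended chain that does not hold."""
    problems = []
    needed = tensor_send_timeout * num_clients
    if wait_all_clients_timeout < needed:
        problems.append(
            f"wait_send_task_data_all_clients_timeout ({wait_all_clients_timeout}) "
            f"should be >= tensor_send_timeout * num_clients ({needed})"
        )
    if get_task_timeout < wait_all_clients_timeout:
        problems.append(
            f"get_task_timeout ({get_task_timeout}) should be >= "
            f"wait_send_task_data_all_clients_timeout ({wait_all_clients_timeout})"
        )
    return problems


# Documented defaults (30s per send, 300s overall wait) with 12 clients.
print(check_tensor_streaming(300, 300, 30, 12))
```

With the documented defaults of 30s and 300s, the chain holds for up to 10 clients; larger federations need a longer wait_send_task_data_all_clients_timeout and a matching get_task_timeout.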

Hierarchy:

  • Session-specific timeouts override server defaults

  • Client config overrides can be set via recipe.add_client_config()

  • comm_config.json settings apply to all F3/CellNet communication

Best Practices by Component:

Controllers:

  • Start with timeout=0 (no timeout) during development

  • Set appropriate train_timeout based on expected round duration

  • For cross-site eval, validation_timeout should exceed longest validation time

  • Use wait_for_clients_timeout to limit waiting for slow clients

Executors:

  • external_pre_init_timeout should cover model loading + library imports

  • heartbeat_timeout should be a multiple of heartbeat_interval (3-6x, matching the heartbeat ratios under Common Timeout Patterns)

  • Set last_result_transfer_timeout based on result size

  • For IPC: agent_connection_timeout > agent_heartbeat_interval * 3

Workflows:

  • progress_timeout catches hung jobs; set to 2-3x expected round time

  • job_status_check_interval trades responsiveness against polling overhead

  • For statistics: result_wait_timeout applies per statistic, not to the whole collection

Network/Streaming:

  • Increase per_msg_timeout and tx_timeout for high-latency networks

  • streaming_read_timeout should handle slowest expected transfer

  • Use a longer streaming_ack_wait for unreliable connections

Debugging Tips:

  • Enable debug logging to see timeout-related messages

  • Check num_timeout_reqs counter in CoreCell for timeout statistics

  • Monitor heartbeat status to detect connectivity issues early

  • Look for “timeout” in logs to identify which timeouts are triggering

  • For IPC issues, check agent_connection_timeout and agent logs

  • For third-party integration (TIE), monitor max_client_op_interval triggers

Common Timeout Patterns:

  1. Layered Timeouts: Higher-level timeouts should exceed lower-level ones

    • progress_timeout > train_timeout > task_wait_timeout

    • validation_timeout > per-batch validation time * num_batches

  2. Heartbeat Relationships: Always maintain proper ratios

    • heartbeat_timeout = 3-6x heartbeat_interval

    • agent_heartbeat_timeout = 3-6x agent_heartbeat_interval

  3. Retry Allowance: Leave room for retries

    • tx_timeout > per_msg_timeout * expected_retries

    • task.timeout > task_assignment_timeout + actual_work_time
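
The three patterns above reduce to simple arithmetic. The helpers below are a hedged sketch for deriving or checking values. All function names are hypothetical, and the 3-6x band is taken directly from the heartbeat pattern above.

```python
def heartbeat_timeout_range(heartbeat_interval):
    """Return the (low, high) band given by the 3-6x rule of thumb."""
    return 3 * heartbeat_interval, 6 * heartbeat_interval


def attempts_within(tx_timeout, per_msg_timeout):
    """Number of full per-message timeouts that fit into the transaction budget."""
    return int(tx_timeout // per_msg_timeout)


def is_layered(*timeouts):
    """True if each timeout is strictly greater than the one after it."""
    return all(a > b for a, b in zip(timeouts, timeouts[1:]))


print(heartbeat_timeout_range(5.0))    # band for a 5s heartbeat interval
print(attempts_within(900.0, 300.0))   # attempts for tx_timeout=900, per_msg_timeout=300
print(is_layered(7200, 3600, 600))     # progress > train > task wait
```

For example, per_msg_timeout=300.0 with tx_timeout=900.0 leaves room for three full attempts within one transaction.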