Timeouts in NVIDIA FLARE (Reference)

This document is a reference for the timeout settings in NVIDIA FLARE, organized by functional category. Each section lists defaults and purposes and, where relevant, notes how timeouts relate to one another and gives usage examples.

Network Communication Timeouts 

This section covers all network-related timeouts including the F3/CellNet communication layer, server configuration, and client communication settings.

F3/CellNet Layer 

The F3 (Flare-Friendly Framework) and CellNet provide the core communication infrastructure. These timeouts are configured in comm_config.json.

CommConfigurator Settings

Low-level communication configuration (comm_config.py):

Parameter	Default	Purpose
heartbeat_interval	varies	Interval for heartbeat messages
subnet_heartbeat_interval	5.0	Interval for subnet heartbeat checks
streaming_read_timeout	300	Timeout for reading streamed data
streaming_ack_interval	4MB	Bytes between ACK messages during streaming
streaming_ack_wait	varies	Time to wait for streaming ACK

CoreCell Settings

Core cell communication parameters (core_cell.py):

Parameter	Default	Purpose
max_timeout	3600	Default timeout for send_and_receive (1 hour)
bulk_check_interval	0.5	Interval for bulk message checking
bulk_process_interval	0.5	Interval for bulk message processing

Cell Request Timeouts

Cell-level request timeouts (cell.py):

Parameter	Default	Purpose
timeout	10.0	Default timeout for send_request/broadcast_request

Timeout Phases: Requests go through three timeout phases:

Sending timeout: Time to complete message sending
Remote processing timeout: Time for remote to process request
Receiving timeout: Time to receive response

Example comm_config.json:

{
  "heartbeat_interval": 10,
  "subnet_heartbeat_interval": 5,
  "streaming_read_timeout": 300,
  "streaming_ack_interval": 4194304,
  "max_message_size": 1048576
}

Server Configuration 

These timeouts are configured in fed_server.json or the equivalent server configuration.

FedServer Timeouts

Server heartbeat and connection management (fed_server.py):

Parameter	Default	Purpose
heart_beat_timeout	600	Time without heartbeat before client considered dead
remove_interval	5.0	Interval for checking/removing dead clients
check_interval	0.2	Interval for connection checking loop

ServerRunner Timeouts

Server runner configuration (server_runner.py, server_json_config.py):

Parameter	Default	Purpose
heartbeat_timeout	60	Client heartbeat timeout in seconds
task_request_interval	2	Task request interval in seconds

Admin Server Timeouts

Admin server command timeouts (admin.py):

Parameter	Default	Purpose
timeout	10.0	Admin command timeout
timeout_secs	2.0	Timeout for send_requests to clients

Example (fed_server.json):

{
  "heart_beat_timeout": 600,
  "admin_timeout": 10.0
}

Client Configuration 

Client heartbeat and retry configuration (client_train.py, base_client_deployer.py):

Parameter	Default	Purpose
heart_beat_interval	10.0	Interval for sending heartbeats to server
retry_timeout	30	Timeout for retry operations

Note: heart_beat_interval must be less than the server’s heart_beat_timeout for proper client status tracking.
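
The constraint in the note above can be checked before deployment. The following is a minimal sketch, not FLARE code: check_heartbeat_settings is a hypothetical helper, and its default arguments only mirror the values listed in this document.

```python
def check_heartbeat_settings(heart_beat_interval: float = 10.0,
                             heart_beat_timeout: float = 600.0) -> float:
    """Return how many heartbeat intervals fit into the server timeout.

    Hypothetical helper for illustration; it is not part of NVIDIA FLARE.
    """
    if heart_beat_interval <= 0:
        raise ValueError("heart_beat_interval must be positive")
    if heart_beat_interval >= heart_beat_timeout:
        raise ValueError(
            f"heart_beat_interval ({heart_beat_interval}) must be less than "
            f"heart_beat_timeout ({heart_beat_timeout})"
        )
    return heart_beat_timeout / heart_beat_interval


# With the documented defaults, 600 / 10 = 60 intervals fit into the timeout.
missed_budget = check_heartbeat_settings()
```

A larger ratio tolerates more lost heartbeats before the server marks the client dead, at the cost of slower detection of a client that really is gone.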

Client-to-Server Communication

Low-level client communication timeouts (communicator.py, fed_client_base.py):

Parameter	Default	Purpose
communication_timeout	300.0	General communication timeout
maint_msg_timeout	30.0	Maintenance message timeout
engine_create_timeout	30.0	Timeout for engine creation
retry_timeout	30.0	Retry timeout for operations

Flare Agent

FlareAgent for external process integration (flare_agent.py):

Parameter	Default	Purpose
heartbeat_timeout	60.0	Time without heartbeat before peer is dead
submit_result_timeout	60.0	Timeout for submitting task result to the client training process. 60 s is too short for large models; configure via `add_client_config({"submit_result_timeout": 1800})`.
max_resends	None in raw `FlareAgent`; 3 through Client API job config	Maximum send retries on failure. For `ClientAPILauncherExecutor` jobs, the default is the finite value `3`; `None` is rejected at job initialization. Override via `add_client_config({"max_resends": N})`.
download_complete_timeout	1800.0	Time the subprocess waits after result ACK while the server finishes downloading tensors from the subprocess `DownloadService`. Must not be `None` for `ClientAPILauncherExecutor` jobs.

Note: Raw FlareAgentWithCellPipe defaults to 60.0 s for submit_result_timeout and unlimited max_resends. When launched through ClientAPILauncherExecutor, the generated Client API config supplies the safer job defaults described above. Recipe-based external-process jobs also serialize max_resends=3 in the executor args, so reloaded jobs do not fall back to the raw unlimited retry default. Use recipe.add_client_config({"max_resends": N}) only when a job needs a different finite retry budget.

IPC Agent

IPC Agent for inter-process communication (ipc_agent.py):

Parameter	Default	Purpose
submit_result_timeout	30.0	Timeout for submitting results
flare_site_connection_timeout	60.0	Timeout for detecting disconnection from the CJ (client job process)
flare_site_heartbeat_timeout	None	Timeout for missing CJ heartbeats

gRPC Utility Timeouts 

gRPC connection establishment (grpc_utils.py):

Parameter	Default	Purpose
ready_timeout	varies	Time to wait for gRPC server to be ready

Reliable Message 

Reliable Messages provide guaranteed delivery with retry logic (reliable_message.py):

Parameter	Default	Purpose
per_msg_timeout	varies	Timeout for each individual message attempt
tx_timeout	varies	Timeout for entire transaction including all retries

Behavior:

If tx_timeout <= per_msg_timeout, request is sent only once without retrying
Messages are retried until tx_timeout is reached
Completed requests are tracked for 2 × tx_timeout to handle late duplicates
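
The rules above can be turned into a rough estimate of how a per_msg_timeout / tx_timeout pair behaves. This is a sketch under a stated assumption, not the ReliableMessage implementation: it assumes every attempt runs for its full per_msg_timeout and that attempts follow each other without delay. retry_plan is a hypothetical helper name.

```python
import math


def retry_plan(per_msg_timeout: float, tx_timeout: float) -> dict:
    """Estimate attempts and result-tracking time for one transaction.

    Assumption (not taken from the FLARE source): each attempt consumes its
    full per_msg_timeout and there is no pause between attempts.
    """
    if per_msg_timeout <= 0 or tx_timeout <= 0:
        raise ValueError("timeouts must be positive")
    if tx_timeout <= per_msg_timeout:
        attempts = 1  # the request is sent once, without retrying
    else:
        attempts = math.ceil(tx_timeout / per_msg_timeout)
    return {
        "attempts": attempts,
        # completed requests are tracked for 2 x tx_timeout
        "result_tracking_secs": 2 * tx_timeout,
    }


# 30 s per attempt within a 300 s transaction: 10 attempts, tracked for 600 s.
plan = retry_plan(per_msg_timeout=30.0, tx_timeout=300.0)
```

If attempts fail faster than per_msg_timeout, more attempts can fit into the same tx_timeout, so the attempt count is only an estimate for the case where every attempt times out.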

Example:

from nvflare.apis.utils.reliable_message import ReliableMessage

ReliableMessage.send_request(
    target="site-1",
    topic="my_topic",
    request=shareable,
    per_msg_timeout=30.0,   # Each attempt times out after 30s
    tx_timeout=300.0,       # Total transaction timeout 5 minutes
    abort_signal=abort_signal,
    fl_ctx=fl_ctx,
)

Federated Event Timeouts 

Fed event runner intervals (fed_event.py):

Parameter	Default	Purpose
regular_interval	0.01	Regular processing interval
grace_period	2.0	Grace period before shutdown
queue_empty_period	2.0	Period to wait when queue is empty

Simulator Timeouts 

Simulator-specific timeouts (simulator_runner.py, simulator_worker.py):

Parameter	Default	Purpose
simulator_worker_timeout	60.0	Timeout for simulator worker
app_runner_timeout	60.0	Timeout for app runner
CELL_CONNECT_CHECK_TIMEOUT	10.0	Timeout for cell connection check
FETCH_TASK_RUN_RETRY	3	Number of retry attempts for task fetch

Flare API Session Timeouts 

Session management for programmatic API (flare_api.py):

Parameter	Default	Purpose
timeout (new_session)	10.0	Timeout to establish session
poll_interval	2.0	Interval for polling job status
set_timeout()	varies	Session-specific command timeout

Example:

from nvflare.fuel.flare_api.flare_api import new_secure_session

# Create session with timeout
sess = new_secure_session(
    username="admin@nvidia.com",
    startup_kit_location="/path/to/startup",
    timeout=30.0,
)

# Set command timeout
sess.set_timeout(60.0)

# Monitor job with timeout and poll interval
rc = sess.monitor_job(job_id, timeout=3600, poll_interval=5.0)

Heartbeat Timeouts 

Executor Heartbeat 

Heartbeat mechanisms ensure connectivity between components:

Timeout	Default	Location	Purpose
heartbeat_interval	5.0	`LauncherExecutor` launcher_executor.py:49	Interval for sending heartbeat messages
heartbeat_timeout	60.0	`LauncherExecutor` launcher_executor.py:50	Timeout for waiting for heartbeat from peer
peer_read_timeout	60.0	`LauncherExecutor` launcher_executor.py:46	Time to wait for peer to accept sent message

Client API Heartbeat

The Client API reads heartbeat_timeout from the task exchange settings, falls back to the metrics exchange settings, and otherwise uses 60 (config.py:154-159):

def get_heartbeat_timeout(self):
    return self.config.get(ConfigKey.TASK_EXCHANGE, {}).get(
        ConfigKey.HEARTBEAT_TIMEOUT,
        self.config.get(ConfigKey.METRICS_EXCHANGE, {}).get(ConfigKey.HEARTBEAT_TIMEOUT, 60),
    )
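
The lookup order in the method above can be reproduced with plain dictionaries. This is a standalone sketch: the string keys stand in for the ConfigKey constants, whose actual values are not shown in this document, and resolve_heartbeat_timeout is a hypothetical name.

```python
def resolve_heartbeat_timeout(config: dict, default: float = 60) -> float:
    """Task exchange value first, then metrics exchange value, then default."""
    # Placeholder keys; the real code uses ConfigKey constants.
    metrics_value = config.get("METRICS_EXCHANGE", {}).get("HEARTBEAT_TIMEOUT", default)
    return config.get("TASK_EXCHANGE", {}).get("HEARTBEAT_TIMEOUT", metrics_value)


# Nothing configured: the built-in default applies.
no_config = resolve_heartbeat_timeout({})

# Only the metrics exchange sets a value: it is used as the fallback.
metrics_only = resolve_heartbeat_timeout(
    {"METRICS_EXCHANGE": {"HEARTBEAT_TIMEOUT": 120}}
)

# Both are set: the task exchange value wins.
both = resolve_heartbeat_timeout(
    {
        "TASK_EXCHANGE": {"HEARTBEAT_TIMEOUT": 300},
        "METRICS_EXCHANGE": {"HEARTBEAT_TIMEOUT": 120},
    }
)
```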

Executor and Launcher Timeouts 

LauncherExecutor Base Class 

The LauncherExecutor class defines core timeout parameters for external process management (launcher_executor.py:38-58):

Parameter	Default	Purpose
launch_timeout	None	Timeout for launcher’s “launch_task” method completion
task_wait_timeout	None	Timeout for retrieving task results
last_result_transfer_timeout	300.0	Timeout for transmitting final result from external process
external_pre_init_timeout	60.0	Time to wait for external process before `flare.init()` call

ClientAPILauncherExecutor 

The Client API executor extends base timeouts with more conservative defaults (client_api_launcher_executor.py:29-53):

Parameter	Default	Purpose
external_pre_init_timeout	300.0	Extended timeout for heavy library imports
peer_read_timeout	300.0	Timeout for peer message acceptance
heartbeat_timeout	300.0	Extended heartbeat timeout for Client API
submit_result_timeout	300.0	Subprocess-side wait for CJ to acknowledge each result message
max_resends	3	Maximum retries after the initial result send; `None` is rejected
download_complete_timeout	1800.0	Time the subprocess remains alive for server-side tensor download completion

For subprocess-mode Client API jobs with large payloads, FLARE validates the following at job start:

download_complete_timeout must not be None.
max_resends must be a finite non-negative integer. Recipe-based jobs serialize the default value 3 in executor args. Use 0 to disable retries; do not use None for unlimited retries.
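
The two rules above can be expressed as a small pre-submission check. This is a sketch of the documented rules only, not FLARE's own validation code; validate_subprocess_settings is a hypothetical helper.

```python
def validate_subprocess_settings(download_complete_timeout, max_resends) -> None:
    """Raise ValueError if the settings break the rules listed above."""
    if download_complete_timeout is None:
        raise ValueError("download_complete_timeout must not be None")
    # bool is a subclass of int in Python, so reject it explicitly.
    is_int = isinstance(max_resends, int) and not isinstance(max_resends, bool)
    if not is_int or max_resends < 0:
        raise ValueError(
            "max_resends must be a finite non-negative integer "
            "(use 0 to disable retries, not None)"
        )


# The documented defaults pass the check.
validate_subprocess_settings(download_complete_timeout=1800.0, max_resends=3)
```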

Values supplied through recipe.add_client_config() are top-level entries in config_fed_client.json. For subprocess-mode Client API jobs, ClientAPILauncherExecutor applies these overrides before writing the subprocess client_api_config.json, so submit_result_timeout, download_complete_timeout, and max_resends are seen by both the parent client job process and the external training process.

When tensor_streaming_per_request_timeout or np_streaming_per_request_timeout is explicitly configured, FLARE also warns if PEER_READ_TIMEOUT or download_complete_timeout is shorter than that streaming timeout. Set PEER_READ_TIMEOUT through add_client_config when the parent client job needs a larger pipe-read budget:

recipe.add_client_config({
    "tensor_streaming_per_request_timeout": 600,
    "tensor_min_download_timeout": 600,
    "PEER_READ_TIMEOUT": 600,
    "download_complete_timeout": 1800,
    "max_resends": 3,
})

External Pre-Init Override

Jobs can override the external pre-init timeout via client configuration (constants.py:20-22):

# Configuration key for overriding external_pre_init_timeout in ClientAPILauncherExecutor
EXTERNAL_PRE_INIT_TIMEOUT = "EXTERNAL_PRE_INIT_TIMEOUT"

TaskExchanger 

The TaskExchanger base class manages pipe-based task exchange with external processes (task_exchanger.py:38-68):

Parameter	Default	Purpose
read_interval	0.5	How often to read from pipe
heartbeat_interval	5.0	How often to send heartbeat to peer
heartbeat_timeout	60.0	Time to wait for heartbeat from peer (None = disable)
resend_interval	2.0	How often to resend a message if failing to send
peer_read_timeout	60.0	Time to wait for peer to accept sent message
result_poll_interval	0.5	How often to poll for task result

IPCExchanger 

The IPCExchanger manages IPC-based communication with Flare Agents (ipc_exchanger.py:50-82):

Parameter	Default	Purpose
send_task_timeout	5.0	How long to wait for response when sending task to Agent
resend_task_interval	2.0	How often to resend task if failed
agent_connection_timeout	60.0	Time allowed to miss heartbeat before considering agent disconnected
agent_heartbeat_timeout	None	Time allowed to miss heartbeat before stopping (None = disabled)
agent_heartbeat_interval	5.0	How often to send heartbeats to the agent
agent_ack_timeout	5.0	How long to wait for agent ack (heartbeat and bye messages)

InProcessClientAPIExecutor 

The in-process executor for Client API (in_process_client_api_executor.py:50-70):

Parameter	Default	Purpose
result_pull_interval	0.5	How often to poll for task result
log_pull_interval	None	How often to pull logs (None = same as result_pull_interval)

Pipe Handler 

Inter-process communication pipe timeouts for Client API (pipe_handler.py):

Parameter	Default	Purpose
heartbeat_interval	5.0	Interval for sending heartbeats
heartbeat_timeout	30.0	Max time without heartbeat before peer is dead
default_request_timeout	5.0	Default timeout for requests
resend_interval	2.0	Interval between message resends

Important: heartbeat_interval must be less than heartbeat_timeout.

P2P Executor 

Peer-to-peer sync executor (sync_executor.py):

Parameter	Default	Purpose
sync_timeout	10	Timeout waiting for values from neighbors

Admin Client Timeouts 

Admin client timeouts control session management and command execution:

Timeout	Default	Location	Purpose
idle_timeout	900.0	Admin config	Automatic shutdown after idle period
login_timeout	10.0	Admin config	Max time to attempt login
authenticate_msg_timeout	2.0	Admin config	Timeout for authentication messages
Command timeout	5.0	FLARE API session	Default timeout for admin commands

Session-Specific Timeouts 

Admin API supports session-specific command timeouts (api_spec.py:305-318):

def set_timeout(self, value: float):
    """Set a session-specific command timeout. This is the amount of time the server
    will wait for responses after sending commands to FL clients.
    Note that this value is only effective for the current API session."""

Task Communication and Messaging 

These timeouts control task assignment and result collection between server and clients.

WfCommServer (Workflow Communication Server)

Server-side workflow communication (wf_comm_server.py):

Parameter	Default	Purpose
task.timeout	varies	Overall task timeout
task_assignment_timeout	0	Time to wait for client to pick task
task_result_timeout	0	Time to wait for client to return result
task_check_period	0.2	Interval for checking task status

Validation Rules:

task_assignment_timeout must be <= task.timeout
task_result_timeout must be <= task.timeout
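
A minimal sketch of these validation rules follows. It assumes that a task.timeout of 0 means "no timeout" and that the comparison is skipped in that case; that handling is an assumption made for illustration, and validate_task_timeouts is a hypothetical helper rather than the WfCommServer code.

```python
def validate_task_timeouts(task_timeout: float,
                           task_assignment_timeout: float = 0,
                           task_result_timeout: float = 0) -> None:
    """Raise ValueError if a sub-timeout exceeds the overall task timeout."""
    for name, value in (
        ("task_timeout", task_timeout),
        ("task_assignment_timeout", task_assignment_timeout),
        ("task_result_timeout", task_result_timeout),
    ):
        if value < 0:
            raise ValueError(f"{name} must not be negative")
    if task_timeout == 0:
        return  # assumption: no overall timeout, so nothing to compare against
    if task_assignment_timeout > task_timeout:
        raise ValueError("task_assignment_timeout must be <= task.timeout")
    if task_result_timeout > task_timeout:
        raise ValueError("task_result_timeout must be <= task.timeout")


# 60 s to pick up the task and 300 s to return a result fit within 600 s.
validate_task_timeouts(task_timeout=600, task_assignment_timeout=60, task_result_timeout=300)
```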

WfCommClient (Workflow Communication Client)

Client-side workflow communication (wf_comm_client.py):

Parameter	Default	Purpose
max_task_timeout	3600	Maximum single task execution time; used as the effective timeout when the controller sets task.timeout = 0 (i.e., “no timeout”)
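
The fallback described in the table can be written in one line. This sketch covers only the documented case (task.timeout = 0); how a task.timeout larger than max_task_timeout is handled is not stated here, so the helper leaves positive values unchanged. effective_task_timeout is a hypothetical name.

```python
def effective_task_timeout(task_timeout: float, max_task_timeout: float = 3600) -> float:
    """Use max_task_timeout when the controller sets task.timeout to 0."""
    return task_timeout if task_timeout > 0 else max_task_timeout


no_timeout_case = effective_task_timeout(0)    # falls back to max_task_timeout
explicit_case = effective_task_timeout(120)    # keeps the controller's value
```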

Task Pull/Fetch Timeouts 

Client-side task fetching from server (client_runner.py, communicator.py, fed_client_base.py):

Parameter	Default	Purpose
get_task_timeout	None	Timeout for client to fetch task from server
submit_task_result_timeout	None	Timeout for client to submit result to server
timeout (pull_task)	None	Timeout for pull_task communication

Configuration: Set via ConfigVarName.GET_TASK_TIMEOUT and ConfigVarName.SUBMIT_TASK_RESULT_TIMEOUT in client config.

Example (client params in job):

recipe.add_client_config({
    "get_task_timeout": 300,  # 5 minutes
})

Task Manager Timeouts 

Task managers control sequential and relay task distribution (send_manager.py, seq_relay_manager.py, any_relay_manager.py):

Parameter	Default	Purpose
task_assignment_timeout	0	Time window for client to request task
task_result_timeout	0	Time to wait for client result before moving to next

Behavior:

For SendOrder.SEQUENTIAL: Clients are assigned in order with a sliding time window
For SendOrder.ANY: First available client gets the task
Timeout of 0 means no timeout (wait indefinitely)

Workflow and Controller Timeouts 

Client-Controlled Workflows (Server-Side)

Server-side controller timeouts for workflow management (common.py:79-92):

Timeout	Default	Purpose
configure_task_timeout	300	Time for clients to respond to config task
start_task_timeout	10	Time for starting client to begin workflow
end_workflow_timeout	2.0	Timeout for ending workflow message
progress_timeout	3600.0	Max time without workflow progress
max_status_report_interval	90.0	Max time for client to miss status report

Client-Controlled Workflows (Client-Side)

Client-side timeouts for task coordination (common.py:87-92):

Timeout	Default	Purpose
learn_task_check_interval	1.0	Interval for checking new learning tasks
learn_task_ack_timeout	10	P2P model-transfer ACK budget (seconds). 10 s is too short for models >2 GB. Set via `SwarmLearningRecipe(round_timeout=3600)` which wires both `learn_task_ack_timeout` and `final_result_ack_timeout`.
learn_task_abort_timeout	5.0	Timeout for task abortion
final_result_ack_timeout	10	Timeout for final result acknowledgment. See `learn_task_ack_timeout` note above.
get_model_timeout	10	Timeout for getting model from peers
max_task_timeout	3600	Maximum single task execution time

ScatterAndGather Controller 

The SAG controller manages aggregation timing (scatter_and_gather.py:37-67):

Parameter	Default	Purpose
train_timeout	0	Time to wait for clients to do local training (0 = no timeout)
wait_time_after_min_received	10	Time to wait for additional responses after min_clients
task_check_interval	0.5	Interval for checking task completion

ModelController-Based Workflows 

FedAvg, Cyclic, Scaffold, and other ModelController-based workflows (model_controller.py, base_model_controller.py):

Parameter	Default	Purpose
timeout	0	Time to wait for clients to perform task (0 = no timeout)

Note: FedAvg, Scaffold, Cyclic all inherit from ModelController and use the same timeout parameter.

CyclicController 

Cyclic workflow controller (cyclic_ctl.py):

Parameter	Default	Purpose
task_assignment_timeout	10	Timeout for client to request its assigned task

CrossSiteModelEval / CrossSiteEval 

Cross-site model evaluation workflows (cross_site_model_eval.py, cross_site_eval.py):

Parameter	Default	Purpose
submit_model_timeout	600	Timeout for submit_model_task (10 min)
validation_timeout	6000	Timeout for validate_model task (100 min)
wait_for_clients_timeout	300	Timeout for clients to appear (5 min)
eval_task_timeout (CCWF)	1200+	Time for model evaluation by clients
configure_task_timeout (CCWF)	300	Timeout for configuration task
progress_timeout (CCWF)	7200+	Overall workflow progress timeout

Example configuration:

from nvflare.app_common.np.recipes import NumpyCrossSiteEvalRecipe

recipe = NumpyCrossSiteEvalRecipe(
    submit_model_timeout=600,
    validation_timeout=6000,
)

GlobalModelEval 

Global model evaluation controller (global_model_eval.py):

Parameter	Default	Purpose
validation_timeout	6000	Timeout for validate_model task
wait_for_clients_timeout	300	Timeout for clients to appear

BroadcastAndProcess / InitializeGlobalWeights 

Broadcast workflows (broadcast_and_process.py, initialize_global_weights.py):

Parameter	Default	Purpose
timeout / task_timeout	0	Task timeout (0 = no timeout)
wait_time_after_min_received	0-10	Wait time after min responses received

StatisticsController / HierarchicalStatisticsController 

Statistics workflow controllers (statistics_controller.py, hierarchical_statistics_controller.py):

Parameter	Default	Purpose
result_wait_timeout	10	Seconds to wait for results per statistic
wait_time_after_min_received	1	Seconds to wait after min clients received

Note: result_wait_timeout is reset for each statistic, not an overall timeout.
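
Because the timer restarts for every statistic, the time spent waiting for results can add up. The sketch below gives an upper bound on result waiting alone; it ignores wait_time_after_min_received and any processing time, and max_result_wait is a hypothetical helper.

```python
def max_result_wait(num_statistics: int, result_wait_timeout: float = 10) -> float:
    """Upper bound on total result waiting when the timer resets per statistic."""
    if num_statistics < 0:
        raise ValueError("num_statistics must not be negative")
    return num_statistics * result_wait_timeout


# Four statistics with the default timeout can wait up to 40 seconds in total.
total_wait = max_result_wait(4)
```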

SplitNNController 

Split learning controller (splitnn_workflow.py:47-79):

Parameter	Default	Purpose
task_timeout	10	Timeout for client to request its assigned task
TIMEOUT (class constant)	60.0	Timeout for auxiliary message requests

TIE Controller (Third-party Integration)

Base controller for third-party integration (tie/controller.py, tie/defs.py):

Parameter	Default	Purpose
configure_task_timeout	10	Time to wait for clients to complete config task
start_task_timeout	10	Time to wait for clients to complete start task
job_status_check_interval	2.0	How often to check client job statuses
max_client_op_interval	90.0	Max time allowed between app ops from a client
progress_timeout	3600.0	Max time allowed with no workflow progress

Note: TIE is used by XGBoost, Flower, and other third-party framework integrations.

Flower Integration Timeouts 

Flower-specific controller and executor timeouts (flower/controller.py, flower/executor.py):

Parameter	Default	Purpose
superlink_ready_timeout	10.0	Time to wait for Flower superlink to become ready
superlink_min_query_interval	10.0	Minimal interval for querying superlink status
monitor_interval	0.5	How often to check Flower run status
per_msg_timeout	10.0	Per-message timeout for ReliableMessage
tx_timeout	100.0	Transaction timeout for ReliableMessage
client_shutdown_timeout	5.0	Max time for graceful client shutdown

Private Set Intersection (PSI)

PSI workflows do not have explicit timeout parameters at the PSI controller level. PSI inherits general workflow timeouts from the underlying task system.

For PSI operations, timeouts are controlled at lower levels:

Task-level timeouts: Use the controller’s general timeout parameter
Communication timeouts: Inherited from system heartbeat_timeout and peer_read_timeout

Note: For large-scale PSI operations, ensure adequate system-level timeouts in application.conf to handle the iterative Diffie-Hellman protocol exchanges.

Aggregator Timeouts 

LazyAggregator

Lazy aggregator for async aggregation (lazy.py):

Parameter	Default	Purpose
accept_timeout	600.0	Max time to wait for accept to finish

Job Scheduler Timeouts 

The DefaultJobScheduler controls job scheduling frequency (job.rst:255-270):

Parameter	Default	Purpose
min_schedule_interval	10.0	Minimum interval between schedule attempts
max_schedule_interval	600.0	Maximum interval between schedule attempts
max_schedule_count	10	Maximum times to try scheduling a job

Scheduling Strategy: The scheduler adapts its frequency, doubling the interval after each failed scheduling attempt until it reaches max_schedule_interval.
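
The doubling strategy can be illustrated with a short sketch. It assumes the first interval equals min_schedule_interval, that one interval precedes each of the max_schedule_count attempts, and that the growth factor is exactly 2; these details are assumptions, and schedule_intervals is a hypothetical helper rather than the DefaultJobScheduler code.

```python
def schedule_intervals(min_schedule_interval: float = 10.0,
                       max_schedule_interval: float = 600.0,
                       max_schedule_count: int = 10) -> list:
    """List the wait before each scheduling attempt under doubling backoff."""
    intervals = []
    interval = min_schedule_interval
    for _ in range(max_schedule_count):
        intervals.append(interval)
        # Double after each failure, capped at the configured maximum.
        interval = min(interval * 2, max_schedule_interval)
    return intervals


# Defaults: 10, 20, 40, 80, 160, 320, then capped at 600 for the remaining attempts.
waits = schedule_intervals()
```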

Recipe Timeouts 

Standard Recipe Timeouts 

Standard recipes support these timeout parameters; task_assignment_timeout applies to CyclicRecipe only (fedavg.py, cyclic.py):

Parameter	Default	Purpose
shutdown_timeout	0.0	Wait time before shutdown for cleanup
task_assignment_timeout	10	Timeout for cyclic task assignment (CyclicRecipe only)

Evaluation Recipe Timeouts 

Evaluation recipes have specific timeout requirements (fedeval.py, cross_site_eval.py):

Parameter	Default	Purpose
validation_timeout	6000	Time allowed for model validation
submit_model_timeout	600	Time for clients to submit models for evaluation

Large Model and Streaming Timeouts 

File Streaming Timeouts 

File streaming for large files (file_streamer.py):

Parameter	Default	Purpose
chunk_timeout	5.0	Timeout for each chunk sent to targets
chunk_size	1M bytes	Size of each chunk streamed

Example:

from nvflare.app_common.streamers.file_streamer import FileStreamer

FileStreamer.stream_file(
    targets=["site-1", "site-2"],
    file_name="/path/to/large_file.bin",
    fl_ctx=fl_ctx,
    chunk_size=1024 * 1024,  # 1MB chunks
    chunk_timeout=10.0,      # 10 seconds per chunk
)
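
When choosing chunk_size and chunk_timeout, it can help to estimate how many chunks a file produces and how long the transfer could take in the worst case. The sketch below assumes chunks are sent one after another and that each may take up to chunk_timeout; both are assumptions, and chunk_budget is a hypothetical helper.

```python
def chunk_budget(file_size_bytes: int,
                 chunk_size: int = 1024 * 1024,
                 chunk_timeout: float = 5.0) -> tuple:
    """Return (number of chunks, worst-case seconds if every chunk hits its timeout)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if file_size_bytes < 0:
        raise ValueError("file_size_bytes must not be negative")
    num_chunks = -(-file_size_bytes // chunk_size)  # integer ceiling division
    return num_chunks, num_chunks * chunk_timeout


# A 1 GiB file in 1 MiB chunks: 1024 chunks, at most 5120 seconds of chunk waits.
chunks, worst_case_secs = chunk_budget(1024 ** 3)
```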

Container Streaming Timeouts 

Container/object streaming (container_streamer.py):

Parameter	Default	Purpose
entry_timeout	60.0	Timeout for each entry sent to targets

Example:

from nvflare.app_common.streamers.container_streamer import ContainerStreamer

ContainerStreamer.stream_container(
    targets=["site-1"],
    container=my_large_container,
    fl_ctx=fl_ctx,
    entry_timeout=120.0,  # 2 minutes per entry
)

Object Retrieval Timeouts 

Retrieving files/containers from remote sites (file_retriever.py, container_retriever.py):

Parameter	Default	Purpose
timeout	varies	Max seconds to wait for data retrieval
chunk_timeout	varies	Timeout per chunk during file retrieval

Byte Streaming Timeouts 

Byte streaming timeouts and intervals (byte_receiver.py, byte_streamer.py):

Parameter	Default	Purpose
streaming_read_timeout	300	Timeout for reading streamed data
ack_interval	4MB	Bytes between acknowledgment messages
ack_wait	varies	Time to wait for ACK before timing out

Note: ACK timeout triggers StreamError and stops the stream.

Download Transaction Timeouts 

Object download transaction timeouts (download_service.py, obj_downloader.py):

Parameter	Default	Purpose
timeout	varies	Transaction timeout (time since last activity)
per_request_timeout	varies	Timeout for each request to object owner

Note: A transaction times out if no receiver shows any activity for the specified duration. Download refs that finish normally are temporarily tombstoned, so a late retry from the same receiver gets the original EOF or error status instead of a fatal missing-ref response. Timed-out and deleted transactions are not tombstoned.

Tensor Streaming Timeouts 

Tensor streaming provides efficient transfer of large model weights. These timeouts control the streaming behavior (tensor_stream/server.py, client.py):

Parameter	Default	Purpose
tensor_send_timeout	30.0	Timeout for each tensor entry transfer operation
wait_send_task_data_all_clients_timeout	300.0	Timeout for sending tensors to all clients
wait_for_tensors timeout	5.0	Time to wait for tensors to be received

Server-side configuration (TensorServerStreamer):

from nvflare.app_opt.tensor_stream.server import TensorServerStreamer

streamer = TensorServerStreamer(
    format="pytorch",
    tensor_send_timeout=60.0,  # Per-tensor timeout
    wait_send_task_data_all_clients_timeout=600.0,  # All clients timeout
)

Client-side configuration (TensorClientStreamer):

from nvflare.app_opt.tensor_stream.client import TensorClientStreamer

streamer = TensorClientStreamer(
    format="pytorch",
    tensor_send_timeout=60.0,  # Per-tensor timeout
)

Warning

Critical Timeout Relationship for Tensor Streaming

When using tensor streaming, you must ensure that get_task_timeout is set and is greater than or equal to wait_send_task_data_all_clients_timeout. If get_task_timeout is not set, it defaults to the communicator’s timeout, which may be shorter than the tensor streaming timeout.

Problem: If streaming timeout > communicator timeout and no get_task_timeout is set, some clients may receive weights while others are still waiting. The server may not send the task in time, causing a timeout that restarts the tensor streaming process. This can result in clients receiving empty tensors and job failure.

Solution: Always set get_task_timeout when using tensor streaming:

# Ensure get_task_timeout >= wait_send_task_data_all_clients_timeout
recipe.add_client_config({
    "get_task_timeout": 600,  # Must be >= streaming timeout
})
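
The relationship in this warning can be checked before a job is submitted. This is a sketch only: check_get_task_timeout is a hypothetical helper, and the default argument mirrors the wait_send_task_data_all_clients_timeout value listed above.

```python
from typing import Optional


def check_get_task_timeout(get_task_timeout: Optional[float],
                           wait_send_task_data_all_clients_timeout: float = 300.0) -> Optional[str]:
    """Return a problem description, or None if the settings are consistent."""
    if get_task_timeout is None:
        return "get_task_timeout is not set; the communicator timeout will be used"
    if get_task_timeout < wait_send_task_data_all_clients_timeout:
        return (
            f"get_task_timeout ({get_task_timeout}) is shorter than "
            f"wait_send_task_data_all_clients_timeout "
            f"({wait_send_task_data_all_clients_timeout})"
        )
    return None


# Matches the example above: a 600 s fetch timeout covers a 600 s streaming timeout.
problem = check_get_task_timeout(600, wait_send_task_data_all_clients_timeout=600.0)
```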

Streaming Download Timeouts 

Framework-level settings for large payload transfers (fl_constant.py:553, comm_config.py:41):

Parameter	Default	Purpose
streaming_per_request_timeout	600	Per-request timeout for streaming chunks
streaming_read_timeout	300	Timeout for reading streaming data
np_min_download_timeout	300	Minimum idle time (seconds) before an inactive NumPy array download transaction is declared dead. Applies to NumPy/sklearn-based models. Increase to 600 s for 70B+ models on congested networks. Configure via `add_client_config({"np_min_download_timeout": 600})`.
tensor_min_download_timeout	300	Minimum idle time (seconds) before an inactive PyTorch tensor download transaction is declared dead. Applies to PyTorch-based models. Increase to 600 s for 70B+ models on congested networks. Configure via `add_client_config({"tensor_min_download_timeout": 600})`.
np_download_chunk_size	2097152	Chunk size for NumPy array downloads (bytes)
tensor_download_chunk_size	2097152	Chunk size for PyTorch tensor downloads (bytes)

For Client API subprocess jobs, keep these download settings aligned with the subprocess pipe settings:

tensor_min_download_timeout / np_min_download_timeout should be at least tensor_streaming_per_request_timeout / np_streaming_per_request_timeout.
PEER_READ_TIMEOUT should be at least the configured streaming per-request timeout so the parent client job does not resend the task while the subprocess is still downloading a large payload.
download_complete_timeout should be at least the configured streaming per-request timeout and long enough for the server to pull large tensor results from the subprocess after result ACK.
max_resends should stay finite. The recipe default is 3; raise it only when the network is expected to recover after a small number of delayed result acknowledgments.
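
The alignment rules above can be checked against a client config dictionary. The sketch below compares only keys that are present in the dictionary and does not fill in defaults; it is an illustration of the listed rules, not FLARE's own warning logic, and check_streaming_alignment is a hypothetical helper.

```python
def check_streaming_alignment(cfg: dict) -> list:
    """Return warning strings for settings shorter than the streaming timeout."""
    warnings = []
    for prefix in ("tensor", "np"):
        per_request = cfg.get(f"{prefix}_streaming_per_request_timeout")
        if per_request is None:
            continue  # only explicitly configured streaming timeouts are checked
        dependents = (
            f"{prefix}_min_download_timeout",
            "PEER_READ_TIMEOUT",
            "download_complete_timeout",
        )
        for key in dependents:
            value = cfg.get(key)
            if value is not None and value < per_request:
                warnings.append(
                    f"{key}={value} is shorter than "
                    f"{prefix}_streaming_per_request_timeout={per_request}"
                )
    if "max_resends" in cfg and cfg["max_resends"] is None:
        warnings.append("max_resends should stay finite; None means unlimited retries")
    return warnings


aligned = {
    "tensor_streaming_per_request_timeout": 600,
    "tensor_min_download_timeout": 600,
    "PEER_READ_TIMEOUT": 600,
    "download_complete_timeout": 1800,
    "max_resends": 3,
}
problems = check_streaming_alignment(aligned)  # no warnings for this config
```

The same dictionary can then be passed to recipe.add_client_config() once the returned list is empty.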

Swarm Learning Large Model Setup 

Recommended timeouts for large models in Swarm Learning:

recipe = SwarmLearningRecipe(
    name="swarm",
    model=MyModel(),
    min_clients=3,
    num_rounds=5,
    train_script="client.py",
    round_timeout=7200,   # P2P ACK budget; covers learn_task_ack_timeout + final_result_ack_timeout
    progress_timeout=7200,
    start_task_timeout=300,
)

# Server-side streaming configuration
recipe.add_server_config({
    "np_download_chunk_size": 2097152,
    "streaming_per_request_timeout": 600,
})

# Subprocess-mode timeouts (when launch_external_process=True)
recipe.add_client_config({
    "submit_result_timeout": 1800,
    "download_complete_timeout": 1800,
    "tensor_min_download_timeout": 600,
    "PEER_READ_TIMEOUT": 600,
    "max_resends": 5,
})

XGBoost-Specific Timeouts 

XGBoost Histogram-Based Controller 

XGBoost histogram-based controller timeouts (histogram_based_v2/controller.py):

Parameter	Default	Purpose
configure_task_timeout	300	Timeout for configuration task
start_task_timeout	10	Timeout for start task
progress_timeout	3600.0	Overall workflow progress timeout

Note: XGBoost uses Reliable Messages for secure training. See the Reliable Message section for per_msg_timeout and tx_timeout configuration.

XGBoost gRPC Client 

gRPC client for XGBoost communication (grpc_client.py, grpc_server_adaptor.py):

Parameter	Default	Purpose
ready_timeout	10	Timeout for gRPC server to be ready
xgb_server_ready_timeout	varies	Timeout for XGBoost server readiness
aggr_timeout	10.0	Aggregation timeout for mock servicer

Example Reliable Message timeouts for large datasets:

"per_msg_timeout": 300.0,
"tx_timeout": 900.0,

Confidential Computing Timeouts 

SNP Authorizer Timeouts 

AMD SEV-SNP attestation timeouts (snp_authorizer.py):

Parameter	Default	Purpose
cmd_timeout	60	SNPGuest command execution timeout
retry_interval	10	Wait time between retry attempts
max_retries	5	Maximum retry attempts

CC Manager Timeouts 

Cross-site CC verification timeouts (cc_manager.py):

Parameter	Default	Purpose
get_site_request_timeout	10.0	Timeout for get site request
get_token_request_timeout	10.0	Timeout for get token request
verify_frequency	600	CC token verification interval (seconds)
cross_validation_interval	varies	Interval between cross-site validation cycles

Note: Other CC authorizers (ACI, TDX, GPU, Azure CVM) do not have explicit timeout parameters and rely on system defaults.

Job Launcher Timeouts 

Kubernetes Launcher 

K8s job launcher timeouts (k8s_launcher.py):

Parameter	Default	Purpose
timeout	None	Timeout for pod to enter RUNNING/TERMINATED state

Docker Launcher 

Docker container launcher timeouts (docker_launcher.py):

Parameter	Default	Purpose
timeout	None	Timeout for container to enter target state

Edge Device Timeouts 

This section covers all edge device, mobile client, and hierarchical FL timeouts.

Edge Device General 

Edge devices have specific timeout requirements:

Parameter	Default	Purpose
update_timeout	5	Timeout for model updates from devices
device_wait_timeout	None	Time to wait for sufficient devices to join
job_timeout	60.0	Overall timeout for edge job execution

Example:

from nvflare.edge.tools.edge_fed_buff_recipe import EdgeFedBuffRecipe

recipe = EdgeFedBuffRecipe(
    model=MyModel(),
    update_timeout=10,
    job_timeout=120.0,
)

Hierarchical FL 

Hierarchical FL enables multi-tier federation with edge devices organized in a tree structure.

ScatterAndGatherForEdge (SAGE) Controller

Server-side controller for hierarchical edge FL (edge/controllers/sage.py):

Parameter	Default	Purpose
assess_interval	0.5	Interval for invoking the assessor during task execution
update_interval	1.0	Interval for children to send updates
task_check_period	0.5	Interval for checking status of tasks

HierarchicalUpdateGatherer (HUG) Executor

Executor for hierarchical update gathering (edge/executors/hug.py):

Parameter	Default	Purpose
update_timeout	required	Timeout for update messages sent to parent

EdgeTaskExecutor (ETE)

Edge task executor for leaf nodes (edge/executors/ete.py):

Parameter	Default	Purpose
update_timeout	required	Timeout for update messages sent to parent

Example:

from nvflare.edge.controllers.sage import ScatterAndGatherForEdge
from nvflare.edge.executors.hug import HierarchicalUpdateGatherer

# Server-side controller
sage = ScatterAndGatherForEdge(
    num_rounds=5,
    assess_interval=0.5,
    update_interval=1.0,
    task_check_period=0.5,
)

# Client-side executor
hug = HierarchicalUpdateGatherer(
    learner_id="learner",
    updater_id="updater",
    update_timeout=30.0,
)

Mobile Client 

The Android SDK includes a job operation timeout (mobile_android.rst:43-58):

AndroidFlareRunner(
    // ... other parameters
    jobTimeout: Float,  // Timeout in seconds for job operations
)

SubprocessLauncher Timeouts 

Subprocess launcher timeout (subprocess_launcher.py):

Parameter	Default	Purpose
shutdown_timeout	0.0	Time to wait before forcefully stopping subprocess
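One common way to implement a grace period like this is to request termination, wait up to shutdown_timeout, and only then kill the process. The sketch below shows that generic pattern with the standard library. It illustrates the general technique and is not a copy of what subprocess_launcher.py does; stop_process is a hypothetical helper.

```python
import subprocess
import sys


def stop_process(proc, shutdown_timeout=0.0):
    """Ask the process to exit, then force-kill it if the grace period expires."""
    proc.terminate()  # polite request (SIGTERM on POSIX)
    try:
        return proc.wait(timeout=shutdown_timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # grace period exhausted: force stop
        return proc.wait()


# Usage: a child that would otherwise sleep for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
stop_process(child, shutdown_timeout=2.0)
print(child.poll() is not None)  # True once the child has exited
```

With the default of 0.0 there is effectively no grace period, so a subprocess that needs to flush results or release resources may benefit from a larger value.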

Experiment Tracking Timeouts 

WandB Receiver 

Weights & Biases integration timeouts (wandb_receiver.py):

Parameter	Default	Purpose
process_timeout	10.0	Timeout for joining WandB processes at shutdown
login timeout	1.0	Internal timeout for WandB login verification

MLflow Receiver 

MLflow integration timing (mlflow_receiver.py):

Parameter	Default	Purpose
buffer_flush_time	1	Seconds between deliveries to MLflow tracking server

Note: Reducing buffer_flush_time increases traffic to the MLflow tracking server and may add latency.

TensorBoard Receiver 

TensorBoard receiver (tb_receiver.py) does not have explicit timeout parameters. Events are written directly to disk without buffering.

Metrics Relay and Sender 

Metrics exchange timeouts for experiment tracking (metric_relay.py, metrics_sender.py):

Parameter	Default	Purpose
heartbeat_timeout	30.0-60.0	Timeout for peer heartbeat (MetricRelay: 60s, MetricsSender: 30s)
heartbeat_interval	5.0	Interval between heartbeats
read_interval	0.1	Interval for reading from pipe

Example:

from nvflare.app_common.widgets.metric_relay import MetricRelay

metric_relay = MetricRelay(
    heartbeat_interval=5.0,
    heartbeat_timeout=60.0,
    read_interval=0.1,
)

Timeout Relationships and Dependencies 

Hierarchical Relationships 

┌─────────────────────────────────────────────────────────────────┐
│                    SYSTEM-LEVEL TIMEOUTS                        │
├─────────────────────────────────────────────────────────────────┤
│  Server Configuration (fed_server.json)                        │
│  ├── heart_beat_timeout (600s) - Client liveness detection     │
│  ├── admin_timeout (10s) - Admin command processing            │
│  └── task_request_interval (2s) - Task polling rate            │
│                                                                 │
│  Client Configuration                                           │
│  ├── heart_beat_interval (10s) - Keep-alive to server          │
│  ├── retry_timeout (30s) - Operation retry                      │
│  └── communication_timeout (300s) - Network operations          │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    F3/CELLNET LAYER                             │
├─────────────────────────────────────────────────────────────────┤
│  CommConfigurator (comm_config.json)                            │
│  ├── heartbeat_interval < heartbeat_timeout (REQUIRED)          │
│  ├── subnet_heartbeat_interval (5s)                             │
│  ├── streaming_read_timeout (300s)                              │
│  └── max_timeout (3600s) - CoreCell default                     │
│                                                                 │
│  Cell Requests                                                  │
│  └── timeout (10s) → Sending → Processing → Receiving           │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    TASK COMMUNICATION                           │
├─────────────────────────────────────────────────────────────────┤
│  Task Lifecycle                                                 │
│  ├── task_assignment_timeout ≤ task.timeout (REQUIRED)          │
│  ├── task_result_timeout ≤ task.timeout (REQUIRED)              │
│  ├── get_task_timeout - Client fetching task                    │
│  └── submit_task_result_timeout - Client submitting result      │
│                                                                 │
│  max_task_timeout (3600s) - Applied when task.timeout = 0       │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    WORKFLOW LAYER                               │
├─────────────────────────────────────────────────────────────────┤
│  ModelController-Based (FedAvg, Cyclic, Scaffold, etc.)         │
│  └── timeout (0 = no timeout) - Per-task timeout                │
│                                                                 │
│  ScatterAndGather / ScatterAndGatherScaffold                    │
│  ├── train_timeout (0 = no timeout)                             │
│  └── wait_time_after_min_received (10s)                         │
│                                                                 │
│  CyclicController                                               │
│  └── task_assignment_timeout (10s)                              │
│                                                                 │
│  CrossSiteModelEval / CrossSiteEval                             │
│  ├── submit_model_timeout (600s)                                │
│  ├── validation_timeout (6000s)                                 │
│  └── wait_for_clients_timeout (300s)                            │
│                                                                 │
│  GlobalModelEval                                                │
│  ├── validation_timeout (6000s)                                 │
│  └── wait_for_clients_timeout (300s)                            │
│                                                                 │
│  BroadcastAndProcess / InitializeGlobalWeights                  │
│  ├── timeout / task_timeout (0 = no timeout)                    │
│  └── wait_time_after_min_received (0-10s)                       │
│                                                                 │
│  StatisticsController / HierarchicalStatisticsController        │
│  └── result_wait_timeout (10s) - Per-statistic timeout          │
│                                                                 │
│  SplitNNController                                              │
│  └── task_timeout (10s)                                         │
│                                                                 │
│  TIE Controller (XGBoost, Flower, etc.)                         │
│  ├── configure_task_timeout (10s)                               │
│  ├── start_task_timeout (10s)                                   │
│  ├── job_status_check_interval (2s)                             │
│  ├── max_client_op_interval (90s)                               │
│  └── progress_timeout (3600s)                                   │
│                                                                 │
│  Flower-Specific                                                │
│  ├── superlink_ready_timeout (10s)                              │
│  ├── per_msg_timeout (10s)                                      │
│  ├── tx_timeout (100s)                                          │
│  └── client_shutdown_timeout (5s)                               │
│                                                                 │
│  CCWF Server-Side                                               │
│  ├── configure_task_timeout (300s)                              │
│  ├── start_task_timeout (10s)                                   │
│  └── progress_timeout (3600s) - Overall workflow                │
│                                                                 │
│  CCWF Client-Side (Swarm Learning)                              │
│  ├── learn_task_ack_timeout (10s)                               │
│  ├── learn_task_abort_timeout (5s)                              │
│  └── final_result_ack_timeout (10s)                             │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    EXECUTOR LAYER                               │
├─────────────────────────────────────────────────────────────────┤
│  LauncherExecutor / ClientAPILauncherExecutor                   │
│  ├── launch_timeout                                             │
│  ├── external_pre_init_timeout (60-300s)                        │
│  ├── task_wait_timeout                                          │
│  ├── last_result_transfer_timeout (300s)                        │
│  └── heartbeat_timeout (60-300s)                                │
│                                                                 │
│  TaskExchanger (Pipe Handler)                                   │
│  ├── heartbeat_interval < heartbeat_timeout (REQUIRED)          │
│  ├── read_interval (0.5s)                                       │
│  ├── resend_interval (2s)                                       │
│  ├── peer_read_timeout (60s)                                    │
│  └── result_poll_interval (0.5s)                                │
│                                                                 │
│  IPCExchanger (Agent-based)                                     │
│  ├── send_task_timeout (5s)                                     │
│  ├── resend_task_interval (2s)                                  │
│  ├── agent_connection_timeout (60s)                             │
│  ├── agent_heartbeat_timeout (None)                             │
│  └── agent_ack_timeout (5s)                                     │
│                                                                 │
│  InProcessClientAPIExecutor                                     │
│  ├── result_pull_interval (0.5s)                                │
│  └── log_pull_interval (None)                                   │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    STREAMING LAYER                              │
├─────────────────────────────────────────────────────────────────┤
│  Reliable Message                                               │
│  └── per_msg_timeout ≤ tx_timeout (for retries to work)         │
│                                                                 │
│  File/Container Streaming                                       │
│  ├── chunk_timeout (5s per chunk)                               │
│  └── entry_timeout (60s per entry)                              │
│                                                                 │
│  Tensor Streaming (CRITICAL RELATIONSHIP)                       │
│  ├── tensor_send_timeout (30s)                                  │
│  ├── wait_send_task_data_all_clients_timeout (300s)             │
│  └── get_task_timeout >= wait_send_task_data_all_clients_timeout│
│      (REQUIRED to prevent task fetch timeout during streaming)  │
└─────────────────────────────────────────────────────────────────┘

Impact Analysis 

Too Short Timeouts:

Timeout Category	Impact of Too Short Value
heart_beat_timeout	Clients incorrectly marked dead, frequent reconnections
task.timeout / train_timeout	Training interrupted before completion, lost work
external_pre_init_timeout	Large model loading fails, external processes killed
streaming_read_timeout	Large file transfers fail mid-stream
per_msg_timeout	Reliable messages fail on slow networks
get_task_timeout	Clients fail to receive tasks, job stalls
admin_timeout	Admin commands fail, poor CLI experience
task_assignment_timeout (Cyclic)	Client fails to fetch task in time, job aborts
submit_model_timeout (CrossSiteEval)	Model submission fails, evaluation incomplete
validation_timeout (CrossSiteEval)	Validation tasks fail prematurely
result_wait_timeout (Statistics)	Statistics collection aborted before all clients respond
agent_connection_timeout (IPC)	External agent incorrectly marked disconnected
send_task_timeout (IPC)	Task delivery to agent fails, triggers resends
superlink_ready_timeout (Flower)	Flower integration fails to initialize
configure_task_timeout (TIE)	Third-party framework configuration fails
max_client_op_interval (TIE)	Healthy clients marked as stuck

Too Long Timeouts:

Timeout Category	Impact of Too Long Value
heart_beat_timeout	Dead clients not detected, resources wasted
task_assignment_timeout	Slow failover to backup clients
progress_timeout	Hung workflows not detected for hours
retry_timeout	Long delays before retry attempts
shutdown_timeout	Slow job termination, resource cleanup delayed
wait_for_clients_timeout (CrossSiteEval)	Long wait for clients that won’t join
agent_heartbeat_timeout (IPC)	Hung agents not detected, job stalls
resend_task_interval (IPC/TaskExchanger)	Slow recovery from transient failures
result_poll_interval (Executor)	Delayed result detection, slower job completion
job_status_check_interval (TIE)	Delayed detection of job completion or failure
tx_timeout (ReliableMessage)	Long waits for failed transactions

Recommended Settings by Use Case 

Development Environment 

Fast iteration with quick feedback:

# Server (fed_server.json)
heart_beat_timeout = 60        # Quick dead client detection
admin_timeout = 5.0            # Fast admin commands

# Client parameters
heartbeat_timeout = 30.0
task_wait_timeout = 60.0
external_pre_init_timeout = 60.0

# Flare API
login_timeout = 5.0
poll_interval = 1.0

Production - Standard Training 

Balanced settings for typical federated learning:

# Server (fed_server.json)
heart_beat_timeout = 600       # 10 min before client considered dead
admin_timeout = 10.0
task_request_interval = 2.0

# comm_config.json
heartbeat_interval = 10
subnet_heartbeat_interval = 5
streaming_read_timeout = 300

# Executor
external_pre_init_timeout = 300.0
heartbeat_timeout = 300.0
last_result_transfer_timeout = 300.0

Production - Large Models (100M+ parameters)

Extended timeouts for large model training:

# Server
heart_beat_timeout = 1200      # 20 min for large model operations

# Executor/Launcher
external_pre_init_timeout = 600.0   # 10 min for model loading
task_wait_timeout = 3600.0          # 1 hour for training

# Streaming
streaming_per_request_timeout = 900  # 15 min per streaming request
tensor_send_timeout = 120.0

# CCWF
progress_timeout = 14400       # 4 hours
learn_task_timeout = 7200      # 2 hours

LLM/Foundation Model Training 

For billion-parameter models (examples/advanced/llm_hf):

from nvflare.app_opt.pt.recipes import FedAvgRecipe

# Recipe configuration
recipe = FedAvgRecipe(
    name="llm_training",
    model=None,  # Use dict config for large models
    shutdown_timeout=120.0,
)

# Client parameters - CRITICAL for LLM
recipe.add_client_config({
    "get_task_timeout": 600,            # 10 min to receive task
    "submit_task_result_timeout": 600,  # 10 min to submit results
    "external_pre_init_timeout": 900,   # 15 min for model init
})

Unreliable/High-Latency Networks 

Conservative settings for challenging network conditions:

# Less frequent heartbeats with longer tolerance
heartbeat_interval = 15.0      # Less frequent to reduce traffic
heartbeat_timeout = 180.0      # 3 min tolerance

# Extended communication timeouts
communication_timeout = 600.0
peer_read_timeout = 180.0
maint_msg_timeout = 60.0

# Reliable message settings
per_msg_timeout = 60.0
tx_timeout = 600.0             # Long transaction timeout for retries

# Streaming with larger windows
streaming_read_timeout = 600
streaming_ack_wait = 30

Edge/Hierarchical FL 

Settings for edge device deployments:

# Edge device timeouts
update_timeout = 30
job_timeout = 300.0
device_wait_timeout = 120.0

# Hierarchical FL
assess_interval = 1.0
update_interval = 2.0

XGBoost Secure Training 

Settings for histogram-based XGBoost:

# Controller
configure_task_timeout = 300
start_task_timeout = 30
progress_timeout = 7200

# Reliable messaging for large histograms
per_msg_timeout = 120.0
tx_timeout = 600.0
xgb_server_ready_timeout = 30

Cross-Site Model Evaluation 

Settings for model evaluation across sites:

from nvflare.app_common.workflows.cross_site_model_eval import CrossSiteModelEval

controller = CrossSiteModelEval(
    submit_model_timeout=900,        # 15 min for large model submission
    validation_timeout=7200,         # 2 hours for thorough validation
    wait_for_clients_timeout=600,    # 10 min for clients to connect
)

Federated Statistics 

Settings for statistics computation:

from nvflare.app_common.workflows.statistics_controller import StatisticsController

controller = StatisticsController(
    result_wait_timeout=60,          # 1 min per statistic
    min_clients=2,
)

Split Learning 

Settings for split neural network training:

from nvflare.app_common.workflows.splitnn_workflow import SplitNNController

controller = SplitNNController(
    task_timeout=30,                 # 30 sec for task assignment
    num_rounds=10,
)

Flower Integration 

Settings for Flower framework integration:

from nvflare.app_opt.flower.flower_job import FlowerJob

job = FlowerJob(
    superlink_ready_timeout=30.0,    # 30 sec for Flower server
    configure_task_timeout=60,
    start_task_timeout=30,
    progress_timeout=7200,           # 2 hours for training
    per_msg_timeout=30.0,
    tx_timeout=300.0,
    client_shutdown_timeout=10.0,
)

Configuration File Locations 

This section describes where timeout configuration files are located and which timeouts each file controls. Configuration is divided into system-level (startup kit) and job-level (application) settings.

System-Level Configuration (Startup Kit)

System-level timeouts are configured in the startup kit and apply to all jobs. These files are located in the local/ directory of each participant.

Startup Kit Structure:

startup_kit/
├── server/
│   └── local/
│       ├── fed_server.json          # Server heartbeat, admin timeouts
│       ├── comm_config.json         # F3/CellNet communication layer
│       └── resources.json           # Resource configuration
│
├── site-1/ (client)
│   └── local/
│       ├── fed_client.json          # Client heartbeat, retry timeouts
│       ├── comm_config.json         # F3/CellNet communication layer
│       └── resources.json           # Resource configuration
│
└── admin/
    └── local/
        └── admin.json               # Admin session timeouts

Deployed System Paths:

After deployment, these files are located at:

Component	Startup Kit Path	Deployed Path
Server	`startup_kit/server/local/`	`/opt/nvflare/workspace/server/local/` or `~/nvflare/workspace/server/local/`
Client (Site)	`startup_kit/site-*/local/`	`/opt/nvflare/workspace/site-*/local/` or `~/nvflare/workspace/site-*/local/`
Admin	`startup_kit/admin/local/`	`/opt/nvflare/workspace/admin/local/` or `~/nvflare/workspace/admin/local/`

System-Level Configuration Files:

File	Location	Timeouts Controlled
fed_server.json	server/local/	`heart_beat_timeout`, `admin_timeout`, `task_request_interval`, `heartbeat_timeout`
fed_client.json	site-*/local/	`heart_beat_interval`, `retry_timeout`, `communication_timeout`
comm_config.json	server/local/, site-*/local/	`heartbeat_interval`, `subnet_heartbeat_interval`, `streaming_read_timeout`, `streaming_ack_interval`, `max_message_size`
resources.json	server/local/, site-*/local/	Resource allocation and limits
admin.json	admin/local/	`idle_timeout`, `login_timeout`, `command_timeout`

Note: Changes to system-level files require restarting the affected FLARE components.

Job-Level Configuration 

Job-level timeouts are configured per job and override defaults for that specific job. These files are located in the job’s app/config/ directory.

Job Configuration Files:

File	Location	Timeouts Controlled
application.conf	app/config/	Task timeouts, streaming timeouts, runner sync timeouts
config_fed_client.json	app/config/	Executor timeouts, Client API task exchange, pipe handler settings
config_fed_server.json	app/config/	Controller timeouts, workflow component configurations

Ways to Configure Job-Level Timeouts:

Recipe API - Using recipe.add_client_config() to pass client parameters:

# Apply to all clients
recipe.add_client_config({
    "get_task_timeout": 300,
    "submit_task_result_timeout": 300,
})

# Apply to specific clients
recipe.add_client_config({
    "get_task_timeout": 600,
}, clients=["site-1", "site-2"])

Job config files - In app/config/ directory:
- config_fed_client.json - Client-side executor and task exchange settings
- config_fed_server.json - Server-side controller and workflow settings

Configuration Examples 

fed_server.json (Server Configuration)

{
  "heart_beat_timeout": 600,
  "admin_timeout": 10.0,
  "servers": [
    {
      "heart_beat_timeout": 600
    }
  ]
}

comm_config.json (F3/CellNet Layer)

{
  "heartbeat_interval": 10,
  "subnet_heartbeat_interval": 5,
  "streaming_read_timeout": 300,
  "streaming_ack_interval": 4194304,
  "streaming_chunk_size": 1048576,
  "max_message_size": 1048576
}
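The byte-valued settings are easier to sanity-check once converted to mebibytes. The following sketch parses the example above with the standard library and prints the size fields in MiB. The helper name and the choice of which keys count as byte sizes are assumptions made for illustration.

```python
import json

COMM_CONFIG = """
{
  "heartbeat_interval": 10,
  "subnet_heartbeat_interval": 5,
  "streaming_read_timeout": 300,
  "streaming_ack_interval": 4194304,
  "streaming_chunk_size": 1048576,
  "max_message_size": 1048576
}
"""

# Keys assumed (for this sketch) to hold sizes in bytes rather than seconds.
BYTE_KEYS = ("streaming_ack_interval", "streaming_chunk_size", "max_message_size")


def sizes_in_mib(config):
    """Return the byte-valued settings converted to MiB."""
    return {key: config[key] / (1024 * 1024) for key in BYTE_KEYS if key in config}


config = json.loads(COMM_CONFIG)
print(sizes_in_mib(config))
# {'streaming_ack_interval': 4.0, 'streaming_chunk_size': 1.0, 'max_message_size': 1.0}
```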

Client API Configuration (config_fed_client.json)

{
  "TASK_EXCHANGE": {
    "heartbeat_timeout": 60.0,
    "heartbeat_interval": 5.0,
    "resend_interval": 2.0,
    "pipe": {
      "ARG": {
        "root_url": "tcp://localhost:8002"
      }
    }
  }
}

application.conf Settings 

# Task communication timeouts
get_task_timeout = 60.0
submit_task_result_timeout = 120.0
task_check_timeout = 5.0

# Cell/messaging timeouts
cell_wait_timeout = 5.0

# Streaming timeouts
streaming_per_request_timeout = 600.0
np_download_chunk_size = 4194304
tensor_download_chunk_size = 4194304

# Runner sync timeouts
runner_sync_timeout = 10.0
max_runner_sync_timeout = 60.0

# Shutdown
end_run_readiness_timeout = 10.0

# Server startup/dead-job safety flags
strict_start_job_reply_check = false
sync_client_jobs_require_previous_report = true

Server Startup and Dead-Job Safety Flags 

These application.conf flags are server-side safety controls used during job startup and client heartbeat synchronization:

Parameter	Default	Purpose
strict_start_job_reply_check	false	Enables strict START_JOB reply validation (detects missing/timeout replies and non-OK return codes).
sync_client_jobs_require_previous_report	true	Requires a prior positive heartbeat report before treating “missing job on client” as a dead-job signal.

Recommended usage:

strict_start_job_reply_check defaults to false for backward compatibility. In non-strict mode, timed-out clients are silently excluded from the active set and the job continues. However, min_sites / required_sites constraints are not enforced for those timeouts, so startup problems can go undetected. In strict mode, timeouts are detected and surfaced, required_sites and min_sites are then checked, and the job continues (with a warning) only if the constraints are still satisfied. Enable strict mode when you want timeouts to be visible and constraints to be enforced at startup.
Keep sync_client_jobs_require_previous_report=true (default) to prevent false dead-job reports during startup races and transient heartbeat delays.
Set sync_client_jobs_require_previous_report=false only to restore legacy behavior where the first missing-job heartbeat immediately triggers dead-job detection.
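The difference between the two strict_start_job_reply_check modes can be modelled in a few lines. The sketch below is a simplified illustration of the behavior described above, not the server's actual code: the function name, the shape of the replies dictionary, and the "ok" / "timeout" strings are all assumptions.

```python
def evaluate_start_job(replies, strict, min_sites=1, required_sites=()):
    """Return (active_clients, job_continues, warnings) for a START_JOB round.

    replies maps client name -> "ok" or "timeout" (hypothetical encoding).
    """
    active = [name for name, status in replies.items() if status == "ok"]
    not_ok = [name for name, status in replies.items() if status != "ok"]
    warnings = []

    if not strict:
        # Non-strict: failed clients are dropped silently and no
        # min_sites / required_sites check is applied to them.
        return active, True, warnings

    if not_ok:
        warnings.append(f"START_JOB failed for: {', '.join(sorted(not_ok))}")
    # Strict: constraints are enforced against the clients that replied OK.
    missing_required = [s for s in required_sites if s not in active]
    job_continues = len(active) >= min_sites and not missing_required
    return active, job_continues, warnings


replies = {"site-1": "ok", "site-2": "timeout", "site-3": "ok"}
print(evaluate_start_job(replies, strict=False, min_sites=3))
# (['site-1', 'site-3'], True, [])
print(evaluate_start_job(replies, strict=True, min_sites=3))
# (['site-1', 'site-3'], False, ['START_JOB failed for: site-2'])
```

In the non-strict call the job proceeds with two sites even though three were requested, which is exactly the kind of silent degradation strict mode is meant to expose.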

Admin Client Session (Python API)

from nvflare.fuel.flare_api.flare_api import new_secure_session

# Create session with connection timeout
sess = new_secure_session(
    username="admin@nvidia.com",
    startup_kit_location="/path/to/startup",
    timeout=30.0,
)

# Set session-specific command timeout
sess.set_timeout(60.0)  # 60 seconds for commands

# Monitor job with timeout
rc = sess.monitor_job(
    job_id,
    timeout=3600,       # 1 hour max
    poll_interval=5.0,  # Check every 5 seconds
)

# Reset to server default
sess.unset_timeout()
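The timeout and poll_interval arguments follow the usual polling pattern: check, sleep, and give up once the deadline passes. The sketch below shows that pattern in isolation. poll_until and its clock and sleep parameters are hypothetical and are not part of the FLARE API.

```python
import time


def poll_until(check, timeout, poll_interval, clock=time.monotonic, sleep=time.sleep):
    """Call check() every poll_interval seconds until it returns True.

    Returns True on success and False if timeout seconds elapse first.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(poll_interval, remaining))


# Usage: a fake job that reports completion on the third status check.
checks = iter([False, False, True])
print(poll_until(lambda: next(checks), timeout=1.0, poll_interval=0.01))  # True
```

A shorter poll_interval detects completion sooner at the cost of more status requests; the timeout bounds the total wait regardless of the interval chosen.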

Recipe with Extended Timeouts 

from nvflare.app_opt.pt.recipes import FedAvgRecipe

recipe = FedAvgRecipe(
    name="large_model_training",
    model={"class_path": "model.LargeModel", "args": {}},
    min_clients=8,
    num_rounds=100,
    shutdown_timeout=120.0,
    train_script="client.py",
)

# Client timeout parameters
recipe.add_client_config({
    "get_task_timeout": 300,
    "submit_task_result_timeout": 300,
})

CCWF/Swarm Learning Configuration 

from nvflare.app_opt.pt.recipes.swarm import SwarmLearningRecipe

recipe = SwarmLearningRecipe(
    min_clients=3,
    num_rounds=10,
    model=model,
    train_script="train.py",
    cross_site_eval_timeout=600.0,
    round_timeout=3600,   # P2P model-transfer ACK budget; increase for large models
)

Flower Integration 

from nvflare.app_opt.flower.recipe import FlowerRecipe

recipe = FlowerRecipe(
    server_app=ServerApp(...),
    client_app=ClientApp(...),
    superlink_ready_timeout=30.0,
    configure_task_timeout=300,
    start_task_timeout=30,
    progress_timeout=7200,
    per_msg_timeout=30.0,
    tx_timeout=300.0,
    client_shutdown_timeout=10.0,
)

Edge Device Configuration 

from nvflare.edge.tools.edge_fed_buff_recipe import EdgeFedBuffRecipe

recipe = EdgeFedBuffRecipe(
    model=MyModel(),
    update_timeout=30,
    job_timeout=600.0,
    device_wait_timeout=120.0,
)

TaskExchanger Configuration 

from nvflare.app_common.executors.task_exchanger import TaskExchanger

executor = TaskExchanger(
    read_interval=0.5,
    heartbeat_interval=5.0,
    heartbeat_timeout=120.0,
    resend_interval=5.0,
    peer_read_timeout=120.0,
    result_poll_interval=1.0,
)

LauncherExecutor Configuration 

from nvflare.app_common.executors.launcher_executor import LauncherExecutor

executor = LauncherExecutor(
    launch_timeout=60.0,
    task_wait_timeout=3600.0,
    last_result_transfer_timeout=600.0,
    external_pre_init_timeout=300.0,
    peer_read_timeout=120.0,
    monitor_interval=0.5,
    read_interval=0.5,
    heartbeat_interval=10.0,
    heartbeat_timeout=120.0,
)

ModelController-Based Workflow 

from nvflare.app_common.workflows.fedavg import FedAvg

controller = FedAvg(
    num_clients=8,
    num_rounds=100,
)

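# Task with timeout (typically issued from within the controller's run() logic;
# `model` is assumed to be the model object prepared earlier)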
controller.send_model_and_wait(
    targets=None,
    data=model,
    timeout=3600,  # 1 hour per round
)

ScatterAndGather Configuration 

from nvflare.app_common.workflows.scatter_and_gather import ScatterAndGather

controller = ScatterAndGather(
    min_clients=4,
    num_rounds=50,
    train_timeout=7200,              # 2 hours per round
    wait_time_after_min_received=30, # Wait 30s for stragglers
    task_check_interval=1.0,
)

CyclicController Configuration 

from nvflare.app_common.workflows.cyclic_ctl import CyclicController

controller = CyclicController(
    num_rounds=10,
    task_assignment_timeout=30,  # 30 sec to request task
)

TIE Controller Configuration 

from nvflare.app_common.tie.controller import TieController

controller = TieController(
    configure_task_timeout=60,
    start_task_timeout=30,
    job_status_check_interval=5.0,
    max_client_op_interval=120.0,
    progress_timeout=7200.0,
)

Notes and Best Practices 

General Rules:

Timeout values are in seconds unless otherwise specified
None or 0 often means no timeout limit (wait indefinitely)
Chunk size values of 0 disable streaming and use native serialization

Critical Constraints:

heartbeat_interval must be less than heartbeat_timeout
task_assignment_timeout must be less than or equal to task.timeout
task_result_timeout must be less than or equal to task.timeout
per_msg_timeout should be less than or equal to tx_timeout for retries to work
agent_heartbeat_interval must be less than agent_connection_timeout
IMPORTANT: When using tensor streaming, get_task_timeout must be greater than or equal to wait_send_task_data_all_clients_timeout to prevent task fetch timeouts while waiting for all clients to receive tensors
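These constraints are easy to check mechanically before submitting a job. The following sketch validates a flat dictionary of settings against the rules listed above. The function name and the flat-dictionary layout are assumptions for illustration (task_timeout stands in for task.timeout), and FLARE itself may enforce these rules differently or in different places. Values of 0 or None that mean "no timeout" are not special-cased here.

```python
def check_critical_constraints(cfg):
    """Return a list of human-readable violations (empty if none found).

    Rules whose parameters are missing from cfg are skipped.
    """
    # (left key, right key, strict) means left < right, or left <= right.
    rules = [
        ("heartbeat_interval", "heartbeat_timeout", True),
        ("task_assignment_timeout", "task_timeout", False),
        ("task_result_timeout", "task_timeout", False),
        ("per_msg_timeout", "tx_timeout", False),
        ("agent_heartbeat_interval", "agent_connection_timeout", True),
        ("wait_send_task_data_all_clients_timeout", "get_task_timeout", False),
    ]
    problems = []
    for left, right, strict in rules:
        if left not in cfg or right not in cfg:
            continue
        ok = cfg[left] < cfg[right] if strict else cfg[left] <= cfg[right]
        if not ok:
            op = "<" if strict else "<="
            problems.append(f"{left} ({cfg[left]}) must be {op} {right} ({cfg[right]})")
    return problems


settings = {
    "heartbeat_interval": 5.0,
    "heartbeat_timeout": 60.0,
    "per_msg_timeout": 300.0,
    "tx_timeout": 100.0,  # too small: retries cannot fit
}
print(check_critical_constraints(settings))
# ['per_msg_timeout (300.0) must be <= tx_timeout (100.0)']
```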

Tensor Streaming Timeout Warning:

When tensor streaming is enabled and get_task_timeout is not explicitly set, it defaults to the communicator's timeout. If the streaming timeout (wait_send_task_data_all_clients_timeout) exceeds the communicator timeout, clients may time out while waiting for other clients to receive weights. The tensor streaming process can then restart, and clients may receive empty tensors, causing the job to fail.

Recommended relationship for tensor streaming:

get_task_timeout >= wait_send_task_data_all_clients_timeout >= tensor_send_timeout * num_clients
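As a worked example of this inequality, the sketch below derives the smallest values that satisfy it for a given number of clients. The helper name is hypothetical, and treating the bounds as exact minimums (with no safety margin) is a simplifying assumption.

```python
def min_tensor_streaming_timeouts(tensor_send_timeout, num_clients):
    """Smallest timeouts satisfying
    get_task_timeout >= wait_send_task_data_all_clients_timeout
                     >= tensor_send_timeout * num_clients.
    """
    wait_all = tensor_send_timeout * num_clients
    return {
        "wait_send_task_data_all_clients_timeout": wait_all,
        "get_task_timeout": wait_all,  # must be at least the wait above
    }


# With a 30 s tensor_send_timeout and 8 clients:
print(min_tensor_streaming_timeouts(30.0, 8))
# {'wait_send_task_data_all_clients_timeout': 240.0, 'get_task_timeout': 240.0}
```

With 20 clients the same calculation gives 600 s, which exceeds the 300 s wait_send_task_data_all_clients_timeout shown in the streaming-layer diagram, so both values would need to be raised.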

Hierarchy:

Session-specific timeouts override server defaults
Client config overrides can be set via recipe.add_client_config()
comm_config.json settings apply to all F3/CellNet communication
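The first rule amounts to a simple fallback. The sketch below models it with a hypothetical CommandTimeout class (not a FLARE class): a session value wins while it is set, and clearing it falls back to the server default.

```python
from typing import Optional


class CommandTimeout:
    """Toy model of a session-level timeout that overrides a server default."""

    def __init__(self, server_default: float):
        self.server_default = server_default
        self.session_value: Optional[float] = None

    def set_timeout(self, value: float) -> None:
        self.session_value = value

    def unset_timeout(self) -> None:
        self.session_value = None

    def effective(self) -> float:
        # The session value takes precedence only while it is set.
        if self.session_value is not None:
            return self.session_value
        return self.server_default


t = CommandTimeout(server_default=10.0)
t.set_timeout(60.0)
print(t.effective())  # 60.0
t.unset_timeout()
print(t.effective())  # 10.0
```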

Best Practices by Component:

Controllers:

Start with timeout=0 (no timeout) during development
Set appropriate train_timeout based on expected round duration
For cross-site eval, validation_timeout should exceed longest validation time
Use wait_for_clients_timeout to limit waiting for slow clients

Executors:

external_pre_init_timeout should cover model loading + library imports
heartbeat_timeout should be at least 2-3x heartbeat_interval (3-6x is the ratio recommended under Common Timeout Patterns)
Set last_result_transfer_timeout based on result size
For IPC: agent_connection_timeout > agent_heartbeat_interval * 3

Workflows:

progress_timeout catches hung jobs; set to 2-3x expected round time
job_status_check_interval trades responsiveness against polling overhead
For statistics: result_wait_timeout applies per statistic, not to the total collection time

Network/Streaming:

Increase per_msg_timeout and tx_timeout for high-latency networks
streaming_read_timeout should handle slowest expected transfer
Use a longer streaming_ack_wait for unreliable connections

Debugging Tips:

Enable debug logging to see timeout-related messages
Check num_timeout_reqs counter in CoreCell for timeout statistics
Monitor heartbeat status to detect connectivity issues early
Look for “timeout” in logs to identify which timeouts are triggering
For IPC issues, check agent_connection_timeout and agent logs
For third-party integration (TIE), monitor max_client_op_interval triggers

Common Timeout Patterns:

Layered Timeouts: Higher-level timeouts should exceed lower-level ones
- progress_timeout > train_timeout > task_wait_timeout
- validation_timeout > per-batch validation time * num_batches
Heartbeat Relationships: Always maintain proper ratios
- heartbeat_timeout = 3-6x heartbeat_interval
- agent_heartbeat_timeout = 3-6x agent_heartbeat_interval
Retry Allowance: Leave room for retries
- tx_timeout > per_msg_timeout * expected_retries
- task.timeout > task_assignment_timeout + actual_work_time
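
The heartbeat and retry patterns reduce to small calculations. The sketch below, using hypothetical helper names, suggests a heartbeat_timeout range from an interval and estimates how many full per-message attempts fit inside one transaction timeout. It assumes attempts run one after another and that each may consume its full per_msg_timeout.

```python
def suggested_heartbeat_timeout(heartbeat_interval):
    """Return the (low, high) range given by the 3-6x guideline."""
    return 3 * heartbeat_interval, 6 * heartbeat_interval


def attempts_within_tx(per_msg_timeout, tx_timeout):
    """Number of full per-message attempts that fit in one transaction."""
    if per_msg_timeout <= 0:
        raise ValueError("per_msg_timeout must be positive")
    return int(tx_timeout // per_msg_timeout)


print(suggested_heartbeat_timeout(5.0))   # (15.0, 30.0)
print(attempts_within_tx(60.0, 600.0))    # 10
print(attempts_within_tx(300.0, 900.0))   # 3
```

If attempts_within_tx returns 1 or less, there is no room for a retry, which is the situation the retry-allowance pattern is meant to avoid.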