.. _timeouts_programming_guide:

####################################
Timeouts in NVIDIA FLARE (Reference)
####################################

This document provides a comprehensive overview of all timeout configurations in NVIDIA FLARE,
organized by functional categories with relationships, impacts, and usage examples.

.. contents:: Table of Contents
   :local:
   :depth: 2

Network Communication Timeouts
==============================

This section covers all network-related timeouts including the F3/CellNet communication layer,
server configuration, and client communication settings.

F3/CellNet Layer
----------------

The F3 (Flare-Friendly Framework) and CellNet provide the core communication infrastructure.
These timeouts are configured in ``comm_config.json``.

CommConfigurator Settings
^^^^^^^^^^^^^^^^^^^^^^^^^

Low-level communication configuration (comm_config.py):

.. list-table::
   :header-rows: 1
   :widths: 32 10 58

   * - Parameter
     - Default
     - Purpose
   * - heartbeat_interval
     - varies
     - Interval for heartbeat messages
   * - subnet_heartbeat_interval
     - 5.0
     - Interval for subnet heartbeat checks
   * - streaming_read_timeout
     - 300
     - Timeout for reading streamed data
   * - streaming_ack_interval
     - 4MB
     - Bytes between ACK messages during streaming
   * - streaming_ack_wait
     - varies
     - Time to wait for streaming ACK


CoreCell Settings
^^^^^^^^^^^^^^^^^

Core cell communication parameters (core_cell.py):

.. list-table::
   :header-rows: 1
   :widths: 28 12 60

   * - Parameter
     - Default
     - Purpose
   * - max_timeout
     - 3600
     - Default timeout for send_and_receive (1 hour)
   * - bulk_check_interval
     - 0.5
     - Interval for bulk message checking
   * - bulk_process_interval
     - 0.5
     - Interval for bulk message processing


Cell Request Timeouts
^^^^^^^^^^^^^^^^^^^^^

Cell-level request timeouts (cell.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - timeout
     - 10.0
     - Default timeout for send_request/broadcast_request

**Timeout Phases**: Requests go through three timeout phases:

1. **Sending timeout**: Time to complete message sending
2. **Remote processing timeout**: Time for remote to process request
3. **Receiving timeout**: Time to receive response


Example ``comm_config.json``:

.. code-block:: json

   {
     "heartbeat_interval": 10,
     "subnet_heartbeat_interval": 5,
     "streaming_read_timeout": 300,
     "streaming_ack_interval": 4194304,
     "max_message_size": 1048576
   }


Server Configuration
--------------------

These timeouts are configured in ``fed_server.json`` or server configuration.

FedServer Timeouts
^^^^^^^^^^^^^^^^^^

Server heartbeat and connection management (fed_server.py):

.. list-table::
   :header-rows: 1
   :widths: 28 12 60

   * - Parameter
     - Default
     - Purpose
   * - heart_beat_timeout
     - 600
     - Time without heartbeat before client considered dead
   * - remove_interval
     - 5.0
     - Interval for checking/removing dead clients
   * - check_interval
     - 0.2
     - Interval for connection checking loop


ServerRunner Timeouts
^^^^^^^^^^^^^^^^^^^^^

Server runner configuration (server_runner.py, server_json_config.py):

.. list-table::
   :header-rows: 1
   :widths: 28 12 60

   * - Parameter
     - Default
     - Purpose
   * - heartbeat_timeout
     - 60
     - Client heartbeat timeout in seconds
   * - task_request_interval
     - 2
     - Task request interval in seconds


Admin Server Timeouts
^^^^^^^^^^^^^^^^^^^^^

Admin server command timeouts (admin.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - timeout
     - 10.0
     - Admin command timeout
   * - timeout_secs
     - 2.0
     - Timeout for send_requests to clients

**Example** (fed_server.json):

.. code-block:: json

   {
     "heart_beat_timeout": 600,
     "admin_timeout": 10.0
   }


Client Configuration
--------------------

Client heartbeat and retry configuration (client_train.py, base_client_deployer.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - heart_beat_interval
     - 10.0
     - Interval for sending heartbeats to server
   * - retry_timeout
     - 30
     - Timeout for retry operations

**Note**: ``heart_beat_interval`` must be less than the server's ``heart_beat_timeout`` for
proper client status tracking.

Client-to-Server Communication
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Low-level client communication timeouts (communicator.py, fed_client_base.py):

.. list-table::
   :header-rows: 1
   :widths: 28 12 60

   * - Parameter
     - Default
     - Purpose
   * - communication_timeout
     - 300.0
     - General communication timeout
   * - maint_msg_timeout
     - 30.0
     - Maintenance message timeout
   * - engine_create_timeout
     - 30.0
     - Timeout for engine creation
   * - retry_timeout
     - 30.0
     - Retry timeout for operations

Flare Agent
^^^^^^^^^^^

FlareAgent for external process integration (flare_agent.py):

.. list-table::
   :header-rows: 1
   :widths: 28 12 60

   * - Parameter
     - Default
     - Purpose
   * - heartbeat_timeout
     - 60.0
     - Time without heartbeat before peer is dead
   * - submit_result_timeout
     - 60.0
     - Timeout for submitting task result to the client training process. 60 s is too short
       for large models; configure via ``add_client_config({"submit_result_timeout": 1800})``.
   * - max_resends
     - None in raw ``FlareAgent``; 3 through Client API job config
     - Maximum send retries on failure. For ``ClientAPILauncherExecutor`` jobs,
       the default is the finite value ``3``; ``None`` is rejected at job
       initialization. Override via ``add_client_config({"max_resends": N})``.
   * - download_complete_timeout
     - 1800.0
     - Time the subprocess waits after result ACK while the server finishes
       downloading tensors from the subprocess ``DownloadService``. Must not be
       ``None`` for ``ClientAPILauncherExecutor`` jobs.

**Note**: Raw ``FlareAgentWithCellPipe`` defaults to 60.0 s for
``submit_result_timeout`` and unlimited ``max_resends``. When launched through
``ClientAPILauncherExecutor``, the generated Client API config supplies the
safer job defaults described above. Recipe-based external-process jobs also
serialize ``max_resends=3`` in the executor args, so reloaded jobs do not fall
back to the raw unlimited retry default. Use
``recipe.add_client_config({"max_resends": N})`` only when a job needs a
different finite retry budget.

IPC Agent
^^^^^^^^^

IPC Agent for inter-process communication (ipc_agent.py):

.. list-table::
   :header-rows: 1
   :widths: 32 10 58

   * - Parameter
     - Default
     - Purpose
   * - submit_result_timeout
     - 30.0
     - Timeout for submitting results
   * - flare_site_connection_timeout
     - 60.0
     - Timeout for CJ disconnection
   * - flare_site_heartbeat_timeout
     - None
     - Timeout for missing CJ heartbeats


gRPC Utility Timeouts
---------------------

gRPC connection establishment (grpc_utils.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - ready_timeout
     - varies
     - Time to wait for gRPC server to be ready


Reliable Message
----------------

Reliable Messages provide guaranteed delivery with retry logic (reliable_message.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - per_msg_timeout
     - varies
     - Timeout for each individual message attempt
   * - tx_timeout
     - varies
     - Timeout for entire transaction including all retries

**Behavior**:

- If ``tx_timeout <= per_msg_timeout``, request is sent only once without retrying
- Messages are retried until ``tx_timeout`` is reached
- Completed requests are tracked for ``2 × tx_timeout`` to handle late duplicates

**Example**:

.. code-block:: python

   from nvflare.apis.utils.reliable_message import ReliableMessage

   ReliableMessage.send_request(
       target="site-1",
       topic="my_topic",
       request=shareable,
       per_msg_timeout=30.0,   # Each attempt times out after 30s
       tx_timeout=300.0,       # Total transaction timeout 5 minutes
       abort_signal=abort_signal,
       fl_ctx=fl_ctx,
   )


Federated Event Timeouts
========================

Fed event runner intervals (fed_event.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - regular_interval
     - 0.01
     - Regular processing interval
   * - grace_period
     - 2.0
     - Grace period before shutdown
   * - queue_empty_period
     - 2.0
     - Period to wait when queue is empty


Simulator Timeouts
==================

Simulator-specific timeouts (simulator_runner.py, simulator_worker.py):

.. list-table::
   :header-rows: 1
   :widths: 30 10 60

   * - Parameter
     - Default
     - Purpose
   * - simulator_worker_timeout
     - 60.0
     - Timeout for simulator worker
   * - app_runner_timeout
     - 60.0
     - Timeout for app runner
   * - CELL_CONNECT_CHECK_TIMEOUT
     - 10.0
     - Timeout for cell connection check
   * - FETCH_TASK_RUN_RETRY
     - 3
     - Number of retry attempts for task fetch


Flare API Session Timeouts
==========================

Session management for programmatic API (flare_api.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - timeout (new_session)
     - 10.0
     - Timeout to establish session
   * - poll_interval
     - 2.0
     - Interval for polling job status
   * - set_timeout()
     - varies
     - Session-specific command timeout

**Example**:

.. code-block:: python

   from nvflare.fuel.flare_api.flare_api import new_secure_session

   # Create session with timeout
   sess = new_secure_session(
       username="admin@nvidia.com",
       startup_kit_location="/path/to/startup",
       timeout=30.0,
   )

   # Set command timeout
   sess.set_timeout(60.0)

   # Monitor job with timeout and poll interval
   rc = sess.monitor_job(job_id, timeout=3600, poll_interval=5.0)


Heartbeat Timeouts
==================

Executor Heartbeat
------------------

Heartbeat mechanisms ensure connectivity between components:

.. list-table::
   :header-rows: 1
   :widths: 25 10 35 30

   * - Timeout
     - Default
     - Location
     - Purpose
   * - heartbeat_interval
     - 5.0
     - ``LauncherExecutor`` launcher_executor.py:49
     - Interval for sending heartbeat messages
   * - heartbeat_timeout
     - 60.0
     - ``LauncherExecutor`` launcher_executor.py:50
     - Timeout for waiting for heartbeat from peer
   * - peer_read_timeout
     - 60.0
     - ``LauncherExecutor`` launcher_executor.py:46
     - Time to wait for peer to accept sent message

Client API Heartbeat
^^^^^^^^^^^^^^^^^^^^

The Client API inherits heartbeat configuration from the task exchange settings (config.py:154-159):

.. code-block:: python

   def get_heartbeat_timeout(self):
       return self.config.get(ConfigKey.TASK_EXCHANGE, {}).get(
           ConfigKey.HEARTBEAT_TIMEOUT,
           self.config.get(ConfigKey.METRICS_EXCHANGE, {}).get(ConfigKey.HEARTBEAT_TIMEOUT, 60),
       )

Executor and Launcher Timeouts
==============================

LauncherExecutor Base Class
---------------------------

The ``LauncherExecutor`` class defines core timeout parameters for external process management
(launcher_executor.py:38-58):

.. list-table::
   :header-rows: 1
   :widths: 30 12 58

   * - Parameter
     - Default
     - Purpose
   * - launch_timeout
     - None
     - Timeout for launcher's "launch_task" method completion
   * - task_wait_timeout
     - None
     - Timeout for retrieving task results
   * - last_result_transfer_timeout
     - 300.0
     - Timeout for transmitting final result from external process
   * - external_pre_init_timeout
     - 60.0
     - Time to wait for external process before ``flare.init()`` call

ClientAPILauncherExecutor
-------------------------

The Client API executor extends base timeouts with more conservative defaults
(client_api_launcher_executor.py:29-53):

.. list-table::
   :header-rows: 1
   :widths: 30 12 58

   * - Parameter
     - Default
     - Purpose
   * - external_pre_init_timeout
     - 300.0
     - Extended timeout for heavy library imports
   * - peer_read_timeout
     - 300.0
     - Timeout for peer message acceptance
   * - heartbeat_timeout
     - 300.0
     - Extended heartbeat timeout for Client API
   * - submit_result_timeout
     - 300.0
     - Subprocess-side wait for CJ to acknowledge each result message
   * - max_resends
     - 3
     - Maximum retries after the initial result send; ``None`` is rejected
   * - download_complete_timeout
     - 1800.0
     - Time the subprocess remains alive for server-side tensor download completion

For subprocess-mode Client API jobs with large payloads, FLARE validates the
following at job start:

- ``download_complete_timeout`` must not be ``None``.
- ``max_resends`` must be a finite non-negative integer. Recipe-based jobs
  serialize the default value ``3`` in executor args. Use ``0`` to disable
  retries; do not use ``None`` for unlimited retries.

Values supplied through ``recipe.add_client_config()`` are top-level entries in
``config_fed_client.json``. For subprocess-mode Client API jobs,
``ClientAPILauncherExecutor`` applies these overrides before writing the
subprocess ``client_api_config.json``, so ``submit_result_timeout``,
``download_complete_timeout``, and ``max_resends`` are seen by both the parent
client job process and the external training process.

When ``tensor_streaming_per_request_timeout`` or
``np_streaming_per_request_timeout`` is explicitly configured, FLARE also warns
if ``PEER_READ_TIMEOUT`` or ``download_complete_timeout`` is shorter than that
streaming timeout. Set ``PEER_READ_TIMEOUT`` through ``add_client_config`` when
the parent client job needs a larger pipe-read budget:

.. code-block:: python

   recipe.add_client_config({
       "tensor_streaming_per_request_timeout": 600,
       "tensor_min_download_timeout": 600,
       "PEER_READ_TIMEOUT": 600,
       "download_complete_timeout": 1800,
       "max_resends": 3,
   })

External Pre-Init Override
^^^^^^^^^^^^^^^^^^^^^^^^^^

Jobs can override the external pre-init timeout via client configuration (constants.py:20-22):

.. code-block:: python

   # Configuration key for overriding external_pre_init_timeout in ClientAPILauncherExecutor
   EXTERNAL_PRE_INIT_TIMEOUT = "EXTERNAL_PRE_INIT_TIMEOUT"


TaskExchanger
-------------

The ``TaskExchanger`` base class manages pipe-based task exchange with external processes
(task_exchanger.py:38-68):

.. list-table::
   :header-rows: 1
   :widths: 28 12 60

   * - Parameter
     - Default
     - Purpose
   * - read_interval
     - 0.5
     - How often to read from pipe
   * - heartbeat_interval
     - 5.0
     - How often to send heartbeat to peer
   * - heartbeat_timeout
     - 60.0
     - Time to wait for heartbeat from peer (None = disable)
   * - resend_interval
     - 2.0
     - How often to resend a message if failing to send
   * - peer_read_timeout
     - 60.0
     - Time to wait for peer to accept sent message
   * - result_poll_interval
     - 0.5
     - How often to poll for task result


IPCExchanger
------------

The ``IPCExchanger`` manages IPC-based communication with Flare Agents
(ipc_exchanger.py:50-82):

.. list-table::
   :header-rows: 1
   :widths: 28 12 60

   * - Parameter
     - Default
     - Purpose
   * - send_task_timeout
     - 5.0
     - How long to wait for response when sending task to Agent
   * - resend_task_interval
     - 2.0
     - How often to resend task if failed
   * - agent_connection_timeout
     - 60.0
     - Time allowed to miss heartbeat before considering agent disconnected
   * - agent_heartbeat_timeout
     - None
     - Time allowed to miss heartbeat before stopping (None = disabled)
   * - agent_heartbeat_interval
     - 5.0
     - How often to send heartbeats to the agent
   * - agent_ack_timeout
     - 5.0
     - How long to wait for agent ack (heartbeat and bye messages)


InProcessClientAPIExecutor
--------------------------

The in-process executor for Client API (in_process_client_api_executor.py:50-70):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - result_pull_interval
     - 0.5
     - How often to poll for task result
   * - log_pull_interval
     - None
     - How often to pull logs (None = same as result_pull_interval)


Pipe Handler
------------

Inter-process communication pipe timeouts for Client API (pipe_handler.py):

.. list-table::
   :header-rows: 1
   :widths: 28 12 60

   * - Parameter
     - Default
     - Purpose
   * - heartbeat_interval
     - 5.0
     - Interval for sending heartbeats
   * - heartbeat_timeout
     - 30.0
     - Max time without heartbeat before peer is dead
   * - default_request_timeout
     - 5.0
     - Default timeout for requests
   * - resend_interval
     - 2.0
     - Interval between message resends

**Important**: ``heartbeat_interval`` must be less than ``heartbeat_timeout``.


P2P Executor
------------

Peer-to-peer sync executor (sync_executor.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - sync_timeout
     - 10
     - Timeout waiting for values from neighbors


Admin Client Timeouts
=====================

Admin client timeouts control session management and command execution:

.. list-table::
   :header-rows: 1
   :widths: 28 10 32 30

   * - Timeout
     - Default
     - Location
     - Purpose
   * - idle_timeout
     - 900.0
     - Admin config
     - Automatic shutdown after idle period
   * - login_timeout
     - 10.0
     - Admin config
     - Max time to attempt login
   * - authenticate_msg_timeout
     - 2.0
     - Admin config
     - Timeout for authentication messages
   * - Command timeout
     - 5.0
     - FLARE API session
     - Default timeout for admin commands

Session-Specific Timeouts
-------------------------

Admin API supports session-specific command timeouts (api_spec.py:305-318):

.. code-block:: python

   def set_timeout(self, value: float):
       """Set a session-specific command timeout. This is the amount of time the server
       will wait for responses after sending commands to FL clients.
       Note that this value is only effective for the current API session."""


Task Communication and Messaging
================================

These timeouts control task assignment and result collection between server and clients.

WfCommServer (Workflow Communication Server)
--------------------------------------------

Server-side workflow communication (wf_comm_server.py):

.. list-table::
   :header-rows: 1
   :widths: 28 12 60

   * - Parameter
     - Default
     - Purpose
   * - task.timeout
     - varies
     - Overall task timeout
   * - task_assignment_timeout
     - 0
     - Time to wait for client to pick task
   * - task_result_timeout
     - 0
     - Time to wait for client to return result
   * - task_check_period
     - 0.2
     - Interval for checking task status

**Validation Rules**:

- ``task_assignment_timeout`` must be <= ``task.timeout``
- ``task_result_timeout`` must be <= ``task.timeout``


WfCommClient (Workflow Communication Client)
--------------------------------------------

Client-side workflow communication (wf_comm_client.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - max_task_timeout
     - 3600
     - Maximum single task execution time; used as the effective timeout when the controller sets task.timeout = 0 (i.e., "no timeout")


Task Pull/Fetch Timeouts
------------------------

Client-side task fetching from server (client_runner.py, communicator.py, fed_client_base.py):

.. list-table::
   :header-rows: 1
   :widths: 28 12 60

   * - Parameter
     - Default
     - Purpose
   * - get_task_timeout
     - None
     - Timeout for client to fetch task from server
   * - submit_task_result_timeout
     - None
     - Timeout for client to submit result to server
   * - timeout (pull_task)
     - None
     - Timeout for pull_task communication

**Configuration**: Set via ``ConfigVarName.GET_TASK_TIMEOUT`` and ``ConfigVarName.SUBMIT_TASK_RESULT_TIMEOUT`` in client config.

**Example** (client params in job):

.. code-block:: python

   recipe.add_client_config({
       "get_task_timeout": 300,  # 5 minutes
   })


Task Manager Timeouts
---------------------

Task managers control sequential and relay task distribution (send_manager.py, seq_relay_manager.py, any_relay_manager.py):

.. list-table::
   :header-rows: 1
   :widths: 28 12 60

   * - Parameter
     - Default
     - Purpose
   * - task_assignment_timeout
     - 0
     - Time window for client to request task
   * - task_result_timeout
     - 0
     - Time to wait for client result before moving to next

**Behavior**:

- For SendOrder.SEQUENTIAL: Clients are assigned in order with sliding time window
- For SendOrder.ANY: First available client gets the task
- Timeout of 0 means no timeout (wait indefinitely)


Workflow and Controller Timeouts
================================

Client-Controlled Workflows (Server-Side)
-----------------------------------------

Server-side controller timeouts for workflow management (common.py:79-92):

.. list-table::
   :header-rows: 1
   :widths: 30 12 58

   * - Timeout
     - Default
     - Purpose
   * - configure_task_timeout
     - 300
     - Time for clients to respond to config task
   * - start_task_timeout
     - 10
     - Time for starting client to begin workflow
   * - end_workflow_timeout
     - 2.0
     - Timeout for ending workflow message
   * - progress_timeout
     - 3600.0
     - Max time without workflow progress
   * - max_status_report_interval
     - 90.0
     - Max time for client to miss status report

Client-Controlled Workflows (Client-Side)
-----------------------------------------

Client-side timeouts for task coordination (common.py:87-92):

.. list-table::
   :header-rows: 1
   :widths: 30 12 58

   * - Timeout
     - Default
     - Purpose
   * - learn_task_check_interval
     - 1.0
     - Interval for checking new learning tasks
   * - learn_task_ack_timeout
     - 10
     - P2P model-transfer ACK budget (seconds). 10 s is too short for models >2 GB.
       Set via ``SwarmLearningRecipe(round_timeout=3600)`` which wires both
       ``learn_task_ack_timeout`` and ``final_result_ack_timeout``.
   * - learn_task_abort_timeout
     - 5.0
     - Timeout for task abortion
   * - final_result_ack_timeout
     - 10
     - Timeout for final result acknowledgment. See ``learn_task_ack_timeout`` note above.
   * - get_model_timeout
     - 10
     - Timeout for getting model from peers
   * - max_task_timeout
     - 3600
     - Maximum single task execution time

ScatterAndGather Controller
---------------------------

The SAG controller manages aggregation timing (scatter_and_gather.py:37-67):

.. list-table::
   :header-rows: 1
   :widths: 35 12 53

   * - Parameter
     - Default
     - Purpose
   * - train_timeout
     - 0
     - Time to wait for clients to do local training (0 = no timeout)
   * - wait_time_after_min_received
     - 10
     - Time to wait for additional responses after min_clients
   * - task_check_interval
     - 0.5
     - Interval for checking task completion


ModelController-Based Workflows
-------------------------------

FedAvg, Cyclic, Scaffold, and other ModelController-based workflows (model_controller.py, base_model_controller.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - timeout
     - 0
     - Time to wait for clients to perform task (0 = no timeout)

**Note**: FedAvg, Scaffold, Cyclic all inherit from ModelController and use the same ``timeout`` parameter.


CyclicController
----------------

Cyclic workflow controller (cyclic_ctl.py):

.. list-table::
   :header-rows: 1
   :widths: 28 12 60

   * - Parameter
     - Default
     - Purpose
   * - task_assignment_timeout
     - 10
     - Timeout for client to request its assigned task


CrossSiteModelEval / CrossSiteEval
----------------------------------

Cross-site model evaluation workflows (cross_site_model_eval.py, cross_site_eval.py):

.. list-table::
   :header-rows: 1
   :widths: 28 12 60

   * - Parameter
     - Default
     - Purpose
   * - submit_model_timeout
     - 600
     - Timeout for submit_model_task (10 min)
   * - validation_timeout
     - 6000
     - Timeout for validate_model task (100 min)
   * - wait_for_clients_timeout
     - 300
     - Timeout for clients to appear (5 min)
   * - eval_task_timeout (CCWF)
     - 1200+
     - Time for model evaluation by clients
   * - configure_task_timeout (CCWF)
     - 300
     - Timeout for configuration task
   * - progress_timeout (CCWF)
     - 7200+
     - Overall workflow progress timeout

Example configuration:

.. code-block:: python

   from nvflare.app_common.np.recipes import NumpyCrossSiteEvalRecipe

   recipe = NumpyCrossSiteEvalRecipe(
       submit_model_timeout=600,
       validation_timeout=6000,
   )


GlobalModelEval
---------------

Global model evaluation controller (global_model_eval.py):

.. list-table::
   :header-rows: 1
   :widths: 28 12 60

   * - Parameter
     - Default
     - Purpose
   * - validation_timeout
     - 6000
     - Timeout for validate_model task
   * - wait_for_clients_timeout
     - 300
     - Timeout for clients to appear


BroadcastAndProcess / InitializeGlobalWeights
---------------------------------------------

Broadcast workflows (broadcast_and_process.py, initialize_global_weights.py):

.. list-table::
   :header-rows: 1
   :widths: 30 10 60

   * - Parameter
     - Default
     - Purpose
   * - timeout / task_timeout
     - 0
     - Task timeout (0 = no timeout)
   * - wait_time_after_min_received
     - 0-10
     - Wait time after min responses received


StatisticsController / HierarchicalStatisticsController
--------------------------------------------------------

Statistics workflow controllers (statistics_controller.py, hierarchical_statistics_controller.py):

.. list-table::
   :header-rows: 1
   :widths: 30 12 58

   * - Parameter
     - Default
     - Purpose
   * - result_wait_timeout
     - 10
     - Seconds to wait for results per statistic
   * - wait_time_after_min_received
     - 1
     - Seconds to wait after min clients received

**Note**: ``result_wait_timeout`` is reset for each statistic, not an overall timeout.


SplitNNController
-----------------

Split learning controller (splitnn_workflow.py:47-79):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - task_timeout
     - 10
     - Timeout for client to request its assigned task
   * - TIMEOUT (class constant)
     - 60.0
     - Timeout for auxiliary message requests


TIE Controller (Third-party Integration)
----------------------------------------

Base controller for third-party integration (tie/controller.py, tie/defs.py):

.. list-table::
   :header-rows: 1
   :widths: 28 12 60

   * - Parameter
     - Default
     - Purpose
   * - configure_task_timeout
     - 10
     - Time to wait for clients to complete config task
   * - start_task_timeout
     - 10
     - Time to wait for clients to complete start task
   * - job_status_check_interval
     - 2.0
     - How often to check client job statuses
   * - max_client_op_interval
     - 90.0
     - Max time allowed between app ops from a client
   * - progress_timeout
     - 3600.0
     - Max time allowed with no workflow progress

**Note**: TIE is used by XGBoost, Flower, and other third-party framework integrations.


Flower Integration Timeouts
---------------------------

Flower-specific controller and executor timeouts (flower/controller.py, flower/executor.py):

.. list-table::
   :header-rows: 1
   :widths: 28 12 60

   * - Parameter
     - Default
     - Purpose
   * - superlink_ready_timeout
     - 10.0
     - Time to wait for Flower superlink to become ready
   * - superlink_min_query_interval
     - 10.0
     - Minimal interval for querying superlink status
   * - monitor_interval
     - 0.5
     - How often to check Flower run status
   * - per_msg_timeout
     - 10.0
     - Per-message timeout for ReliableMessage
   * - tx_timeout
     - 100.0
     - Transaction timeout for ReliableMessage
   * - client_shutdown_timeout
     - 5.0
     - Max time for graceful client shutdown


Private Set Intersection (PSI)
------------------------------

PSI workflows do not have explicit timeout parameters at the PSI controller level. 
PSI inherits general workflow timeouts from the underlying task system.

For PSI operations, timeouts are controlled at lower levels:

- **Task-level timeouts**: Use controller's general ``timeout`` parameter
- **Communication timeouts**: Inherited from system ``heartbeat_timeout`` and ``peer_read_timeout``

**Note**: For large-scale PSI operations, ensure adequate system-level timeouts in 
``application.conf`` to handle the iterative Diffie-Hellman protocol exchanges.


Aggregator Timeouts
-------------------

LazyAggregator
^^^^^^^^^^^^^^

Lazy aggregator for async aggregation (lazy.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - accept_timeout
     - 600.0
     - Max time to wait for accept to finish


Job Scheduler Timeouts
----------------------

The ``DefaultJobScheduler`` controls job scheduling frequency (job.rst:255-270):

.. list-table::
   :header-rows: 1
   :widths: 30 12 58

   * - Parameter
     - Default
     - Purpose
   * - min_schedule_interval
     - 10.0
     - Minimum interval between schedule attempts
   * - max_schedule_interval
     - 600.0
     - Maximum interval between schedule attempts
   * - max_schedule_count
     - 10
     - Maximum times to try scheduling a job

**Scheduling Strategy**: The scheduler uses adaptive frequency - doubling interval after each
failure up to the maximum.

Recipe Timeouts
===============

Standard Recipe Timeouts
------------------------

All standard recipes support these timeout parameters (fedavg.py, cyclic.py):

.. list-table::
   :header-rows: 1
   :widths: 30 12 58

   * - Parameter
     - Default
     - Purpose
   * - shutdown_timeout
     - 0.0
     - Wait time before shutdown for cleanup
   * - task_assignment_timeout
     - 10
     - Timeout for cyclic task assignment (CyclicRecipe only)

Evaluation Recipe Timeouts
--------------------------

Evaluation recipes have specific timeout requirements (fedeval.py, cross_site_eval.py):

.. list-table::
   :header-rows: 1
   :widths: 30 12 58

   * - Parameter
     - Default
     - Purpose
   * - validation_timeout
     - 6000
     - Time allowed for model validation
   * - submit_model_timeout
     - 600
     - Time for clients to submit models for evaluation


Large Model and Streaming Timeouts
==================================

File Streaming Timeouts
-----------------------

File streaming for large files (file_streamer.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - chunk_timeout
     - 5.0
     - Timeout for each chunk sent to targets
   * - chunk_size
     - 1M bytes
     - Size of each chunk streamed

**Example**:

.. code-block:: python

   from nvflare.app_common.streamers.file_streamer import FileStreamer

   FileStreamer.stream_file(
       targets=["site-1", "site-2"],
       file_name="/path/to/large_file.bin",
       fl_ctx=fl_ctx,
       chunk_size=1024 * 1024,  # 1MB chunks
       chunk_timeout=10.0,      # 10 seconds per chunk
   )


Container Streaming Timeouts
----------------------------

Container/object streaming (container_streamer.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - entry_timeout
     - 60.0
     - Timeout for each entry sent to targets

**Example**:

.. code-block:: python

   from nvflare.app_common.streamers.container_streamer import ContainerStreamer

   ContainerStreamer.stream_container(
       targets=["site-1"],
       container=my_large_container,
       fl_ctx=fl_ctx,
       entry_timeout=120.0,  # 2 minutes per entry
   )


Object Retrieval Timeouts
-------------------------

Retrieving files/containers from remote sites (file_retriever.py, container_retriever.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - timeout
     - varies
     - Max seconds to wait for data retrieval
   * - chunk_timeout
     - varies
     - Timeout per chunk during file retrieval


Byte Streaming Timeouts
-----------------------

Byte streaming timeouts and intervals (byte_receiver.py, byte_streamer.py):

.. list-table::
   :header-rows: 1
   :widths: 28 12 60

   * - Parameter
     - Default
     - Purpose
   * - streaming_read_timeout
     - 300
     - Timeout for reading streamed data
   * - ack_interval
     - 4MB
     - Bytes between acknowledgment messages
   * - ack_wait
     - varies
     - Time to wait for ACK before timing out

**Note**: ACK timeout triggers ``StreamError`` and stops the stream.


Download Transaction Timeouts
-----------------------------

Object download transaction timeouts (download_service.py, obj_downloader.py):

.. list-table::
   :header-rows: 1
   :widths: 28 12 60

   * - Parameter
     - Default
     - Purpose
   * - timeout
     - varies
     - Transaction timeout (time since last activity)
   * - per_request_timeout
     - varies
     - Timeout for each request to object owner

**Note**: Transaction times out if no activity from any receiver for the specified duration.
Normally finished download refs are tombstoned temporarily so a late retry from
the same receiver can receive the original EOF or error status instead of a
fatal missing-ref response. Timeout and deleted transactions are not tombstoned.


Tensor Streaming Timeouts
-------------------------

Tensor streaming provides efficient transfer of large model weights. These timeouts control
the streaming behavior (tensor_stream/server.py, client.py):

.. list-table::
   :header-rows: 1
   :widths: 38 12 50

   * - Parameter
     - Default
     - Purpose
   * - tensor_send_timeout
     - 30.0
     - Timeout for each tensor entry transfer operation
   * - wait_send_task_data_all_clients_timeout
     - 300.0
     - Timeout for sending tensors to all clients
   * - wait_for_tensors timeout
     - 5.0
     - Time to wait for tensors to be received

**Server-side configuration** (TensorServerStreamer):

.. code-block:: python

   from nvflare.app_opt.tensor_stream.server import TensorServerStreamer

   streamer = TensorServerStreamer(
       format="pytorch",
       tensor_send_timeout=60.0,  # Per-tensor timeout
       wait_send_task_data_all_clients_timeout=600.0,  # All clients timeout
   )

**Client-side configuration** (TensorClientStreamer):

.. code-block:: python

   from nvflare.app_opt.tensor_stream.client import TensorClientStreamer

   streamer = TensorClientStreamer(
       format="pytorch",
       tensor_send_timeout=60.0,  # Per-tensor timeout
   )

.. warning::

   **Critical Timeout Relationship for Tensor Streaming**
   
   When using tensor streaming, you **must** ensure that ``get_task_timeout`` is set and is 
   greater than or equal to ``wait_send_task_data_all_clients_timeout``. If ``get_task_timeout`` 
   is not set, it defaults to the communicator's timeout, which may be shorter than the tensor 
   streaming timeout.
   
   **Problem**: If streaming timeout > communicator timeout and no ``get_task_timeout`` is set, 
   some clients may receive weights while others are still waiting. The server may not send the 
   task in time, causing a timeout that restarts the tensor streaming process. This can result 
   in clients receiving empty tensors and job failure.
   
   **Solution**: Always set ``get_task_timeout`` when using tensor streaming:
   
   .. code-block:: python
   
      # Ensure get_task_timeout >= wait_send_task_data_all_clients_timeout
      recipe.add_client_config({
          "get_task_timeout": 600,  # Must be >= streaming timeout
      })


Streaming Download Timeouts
---------------------------

Framework-level settings for large payload transfers (fl_constant.py:553, comm_config.py:41):

.. list-table::
   :header-rows: 1
   :widths: 35 15 50

   * - Parameter
     - Default
     - Purpose
   * - streaming_per_request_timeout
     - 600
     - Per-request timeout for streaming chunks
   * - streaming_read_timeout
     - 300
     - Timeout for reading streaming data
   * - np_min_download_timeout
     - 300
     - Minimum idle time (seconds) before an inactive NumPy array download transaction
       is declared dead. Applies to NumPy/sklearn-based models.
       Increase to 600 s for 70B+ models on congested networks.
       Configure via ``add_client_config({"np_min_download_timeout": 600})``.
   * - tensor_min_download_timeout
     - 300
     - Minimum idle time (seconds) before an inactive PyTorch tensor download transaction
       is declared dead. Applies to PyTorch-based models.
       Increase to 600 s for 70B+ models on congested networks.
       Configure via ``add_client_config({"tensor_min_download_timeout": 600})``.
   * - np_download_chunk_size
     - 2097152
     - Chunk size for NumPy array downloads (bytes)
   * - tensor_download_chunk_size
     - 2097152
     - Chunk size for PyTorch tensor downloads (bytes)

For Client API subprocess jobs, keep these download settings aligned with the
subprocess pipe settings:

- ``tensor_min_download_timeout`` / ``np_min_download_timeout`` should be at
  least ``tensor_streaming_per_request_timeout`` /
  ``np_streaming_per_request_timeout``.
- ``PEER_READ_TIMEOUT`` should be at least the configured streaming per-request
  timeout so the parent client job does not resend the task while the subprocess
  is still downloading a large payload.
- ``download_complete_timeout`` should be at least the configured streaming
  per-request timeout and long enough for the server to pull large tensor
  results from the subprocess after result ACK.
- ``max_resends`` should stay finite. The recipe default is ``3``; raise it
  only when the network is expected to recover after a small number of delayed
  result acknowledgments.

Swarm Learning Large Model Setup
--------------------------------

Recommended timeouts for large models in Swarm Learning:

.. code-block:: python

   recipe = SwarmLearningRecipe(
       name="swarm",
       model=MyModel(),
       min_clients=3,
       num_rounds=5,
       train_script="client.py",
       round_timeout=7200,   # P2P ACK budget; covers learn_task_ack_timeout + final_result_ack_timeout
       progress_timeout=7200,
       start_task_timeout=300,
   )

   # Server-side streaming configuration
   recipe.add_server_config({
       "np_download_chunk_size": 2097152,
       "streaming_per_request_timeout": 600,
   })

   # Subprocess-mode timeouts (when launch_external_process=True)
   recipe.add_client_config({
       "submit_result_timeout": 1800,
       "download_complete_timeout": 1800,
       "tensor_min_download_timeout": 600,
       "PEER_READ_TIMEOUT": 600,
       "max_resends": 5,
   })


XGBoost-Specific Timeouts
=========================

XGBoost Histogram-Based Controller
----------------------------------

XGBoost histogram-based controller timeouts (histogram_based_v2/controller.py):

.. list-table::
   :header-rows: 1
   :widths: 30 12 58

   * - Parameter
     - Default
     - Purpose
   * - configure_task_timeout
     - 300
     - Timeout for configuration task
   * - start_task_timeout
     - 10
     - Timeout for start task
   * - progress_timeout
     - 3600.0
     - Overall workflow progress timeout

**Note**: XGBoost uses Reliable Messages for secure training. See the `Reliable Message`_ section
for ``per_msg_timeout`` and ``tx_timeout`` configuration.

XGBoost gRPC Client
-------------------

gRPC client for XGBoost communication (grpc_client.py, grpc_server_adaptor.py):

.. list-table::
   :header-rows: 1
   :widths: 28 12 60

   * - Parameter
     - Default
     - Purpose
   * - ready_timeout
     - 10
     - Timeout for gRPC server to be ready
   * - xgb_server_ready_timeout
     - varies
     - Timeout for XGBoost server readiness
   * - aggr_timeout
     - 10.0
     - Aggregation timeout for mock servicer

Example configuration for large datasets:

.. code-block:: python

   "per_msg_timeout": 300.0,
   "tx_timeout": 900.0,


Confidential Computing Timeouts
===============================

SNP Authorizer Timeouts
-----------------------

AMD SEV-SNP attestation timeouts (snp_authorizer.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - cmd_timeout
     - 60
     - SNPGuest command execution timeout
   * - retry_interval
     - 10
     - Wait time between retry attempts
   * - max_retries
     - 5
     - Maximum retry attempts

CC Manager Timeouts
-------------------

Cross-site CC verification timeouts (cc_manager.py):

.. list-table::
   :header-rows: 1
   :widths: 30 12 58

   * - Parameter
     - Default
     - Purpose
   * - get_site_request_timeout
     - 10.0
     - Timeout for get site request
   * - get_token_request_timeout
     - 10.0
     - Timeout for get token request
   * - verify_frequency
     - 600
     - CC token verification interval (seconds)
   * - cross_validation_interval
     - varies
     - Interval between cross-site validation cycles

**Note**: Other CC authorizers (ACI, TDX, GPU, Azure CVM) do not have explicit timeout parameters
and rely on system defaults.


Job Launcher Timeouts
=====================

Kubernetes Launcher
-------------------

K8s job launcher timeouts (k8s_launcher.py):

.. list-table::
   :header-rows: 1
   :widths: 20 12 68

   * - Parameter
     - Default
     - Purpose
   * - timeout
     - None
     - Timeout for pod to enter RUNNING/TERMINATED state

Docker Launcher
---------------

Docker container launcher timeouts (docker_launcher.py):

.. list-table::
   :header-rows: 1
   :widths: 20 12 68

   * - Parameter
     - Default
     - Purpose
   * - timeout
     - None
     - Timeout for container to enter target state


Edge Device Timeouts
====================

This section covers all edge device, mobile client, and hierarchical FL timeouts.

Edge Device General
-------------------

Edge devices have specific timeout requirements:

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - update_timeout
     - 5
     - Timeout for model updates from devices
   * - device_wait_timeout
     - None
     - Time to wait for sufficient devices to join
   * - job_timeout
     - 60.0
     - Overall timeout for edge job execution

Example:

.. code-block:: python

   from nvflare.edge.tools.edge_fed_buff_recipe import EdgeFedBuffRecipe

   recipe = EdgeFedBuffRecipe(
       model=MyModel(),
       update_timeout=10,
       job_timeout=120.0,
   )


Hierarchical FL
---------------

Hierarchical FL enables multi-tier federation with edge devices organized in a tree structure.

ScatterAndGatherForEdge (SAGE) Controller
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Server-side controller for hierarchical edge FL (edge/controllers/sage.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - assess_interval
     - 0.5
     - Interval for invoking the assessor during task execution
   * - update_interval
     - 1.0
     - Interval for children to send updates
   * - task_check_period
     - 0.5
     - Interval for checking status of tasks

HierarchicalUpdateGatherer (HUG) Executor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Executor for hierarchical update gathering (edge/executors/hug.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - update_timeout
     - required
     - Timeout for update messages sent to parent

EdgeTaskExecutor (ETE)
^^^^^^^^^^^^^^^^^^^^^^

Edge task executor for leaf nodes (edge/executors/ete.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - update_timeout
     - required
     - Timeout for update messages sent to parent

**Example**:

.. code-block:: python

   from nvflare.edge.controllers.sage import ScatterAndGatherForEdge
   from nvflare.edge.executors.hug import HierarchicalUpdateGatherer

   # Server-side controller
   sage = ScatterAndGatherForEdge(
       num_rounds=5,
       assess_interval=0.5,
       update_interval=1.0,
       task_check_period=0.5,
   )

   # Client-side executor
   hug = HierarchicalUpdateGatherer(
       learner_id="learner",
       updater_id="updater",
       update_timeout=30.0,
   )


Mobile Client
-------------

Android SDK includes job operation timeout (mobile_android.rst:43-58):

.. code-block:: kotlin

   AndroidFlareRunner(
       // ... other parameters
       jobTimeout: Float,  // Timeout in seconds for job operations
   )


SubprocessLauncher Timeouts
===========================

Subprocess launcher timeout (subprocess_launcher.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - shutdown_timeout
     - 0.0
     - Time to wait before forcefully stopping subprocess


Experiment Tracking Timeouts
============================

WandB Receiver
--------------

Weights & Biases integration timeouts (wandb_receiver.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - process_timeout
     - 10.0
     - Timeout for joining WandB processes at shutdown
   * - login timeout
     - 1.0
     - Internal timeout for WandB login verification


MLflow Receiver
---------------

MLflow integration timing (mlflow_receiver.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - buffer_flush_time
     - 1
     - Seconds between deliveries to MLflow tracking server

**Note**: Reducing ``buffer_flush_time`` increases traffic to MLflow server and may cause latency.


TensorBoard Receiver
--------------------

TensorBoard receiver (tb_receiver.py) does not have explicit timeout parameters.
Events are written directly to disk without buffering.


Metrics Relay and Sender
------------------------

Metrics exchange timeouts for experiment tracking (metric_relay.py, metrics_sender.py):

.. list-table::
   :header-rows: 1
   :widths: 25 12 63

   * - Parameter
     - Default
     - Purpose
   * - heartbeat_timeout
     - 30.0-60.0
     - Timeout for peer heartbeat (MetricRelay: 60s, MetricsSender: 30s)
   * - heartbeat_interval
     - 5.0
     - Interval between heartbeats
   * - read_interval
     - 0.1
     - Interval for reading from pipe

**Example**:

.. code-block:: python

   from nvflare.app_common.widgets.metric_relay import MetricRelay

   metric_relay = MetricRelay(
       heartbeat_interval=5.0,
       heartbeat_timeout=60.0,
       read_interval=0.1,
   )


Timeout Relationships and Dependencies
======================================

Hierarchical Relationships
--------------------------

.. code-block:: text

   ┌─────────────────────────────────────────────────────────────────┐
   │                    SYSTEM-LEVEL TIMEOUTS                        │
   ├─────────────────────────────────────────────────────────────────┤
   │  Server Configuration (fed_server.json)                        │
   │  ├── heart_beat_timeout (600s) - Client liveness detection     │
   │  ├── admin_timeout (10s) - Admin command processing            │
   │  └── task_request_interval (2s) - Task polling rate            │
   │                                                                 │
   │  Client Configuration                                           │
   │  ├── heart_beat_interval (10s) - Keep-alive to server          │
   │  ├── retry_timeout (30s) - Operation retry                      │
   │  └── communication_timeout (300s) - Network operations          │
   └─────────────────────────────────────────────────────────────────┘

   ┌─────────────────────────────────────────────────────────────────┐
   │                    F3/CELLNET LAYER                             │
   ├─────────────────────────────────────────────────────────────────┤
   │  CommConfigurator (comm_config.json)                            │
   │  ├── heartbeat_interval < heartbeat_timeout (REQUIRED)          │
   │  ├── subnet_heartbeat_interval (5s)                             │
   │  ├── streaming_read_timeout (300s)                              │
   │  └── max_timeout (3600s) - CoreCell default                     │
   │                                                                 │
   │  Cell Requests                                                  │
   │  └── timeout (10s) → Sending → Processing → Receiving           │
   └─────────────────────────────────────────────────────────────────┘

   ┌─────────────────────────────────────────────────────────────────┐
   │                    TASK COMMUNICATION                           │
   ├─────────────────────────────────────────────────────────────────┤
   │  Task Lifecycle                                                 │
   │  ├── task_assignment_timeout ≤ task.timeout (REQUIRED)          │
   │  ├── task_result_timeout ≤ task.timeout (REQUIRED)              │
   │  ├── get_task_timeout - Client fetching task                    │
   │  └── submit_task_result_timeout - Client submitting result      │
   │                                                                 │
   │  max_task_timeout (3600s) - Applied when task.timeout = 0       │
   └─────────────────────────────────────────────────────────────────┘

   ┌─────────────────────────────────────────────────────────────────┐
   │                    WORKFLOW LAYER                               │
   ├─────────────────────────────────────────────────────────────────┤
   │  ModelController-Based (FedAvg, Cyclic, Scaffold, etc.)         │
   │  └── timeout (0 = no timeout) - Per-task timeout                │
   │                                                                 │
   │  ScatterAndGather / ScatterAndGatherScaffold                    │
   │  ├── train_timeout (0 = no timeout)                             │
   │  └── wait_time_after_min_received (10s)                         │
   │                                                                 │
   │  CyclicController                                               │
   │  └── task_assignment_timeout (10s)                              │
   │                                                                 │
   │  CrossSiteModelEval / CrossSiteEval                             │
   │  ├── submit_model_timeout (600s)                                │
   │  ├── validation_timeout (6000s)                                 │
   │  └── wait_for_clients_timeout (300s)                            │
   │                                                                 │
   │  GlobalModelEval                                                │
   │  ├── validation_timeout (6000s)                                 │
   │  └── wait_for_clients_timeout (300s)                            │
   │                                                                 │
   │  BroadcastAndProcess / InitializeGlobalWeights                  │
   │  ├── timeout / task_timeout (0 = no timeout)                    │
   │  └── wait_time_after_min_received (0-10s)                       │
   │                                                                 │
   │  StatisticsController / HierarchicalStatisticsController        │
   │  └── result_wait_timeout (10s) - Per-statistic timeout          │
   │                                                                 │
   │  SplitNNController                                              │
   │  └── task_timeout (10s)                                         │
   │                                                                 │
   │  TIE Controller (XGBoost, Flower, etc.)                         │
   │  ├── configure_task_timeout (10s)                               │
   │  ├── start_task_timeout (10s)                                   │
   │  ├── job_status_check_interval (2s)                             │
   │  ├── max_client_op_interval (90s)                               │
   │  └── progress_timeout (3600s)                                   │
   │                                                                 │
   │  Flower-Specific                                                │
   │  ├── superlink_ready_timeout (10s)                              │
   │  ├── per_msg_timeout (10s)                                      │
   │  ├── tx_timeout (100s)                                          │
   │  └── client_shutdown_timeout (5s)                               │
   │                                                                 │
   │  CCWF Server-Side                                               │
   │  ├── configure_task_timeout (300s)                              │
   │  ├── start_task_timeout (10s)                                   │
   │  └── progress_timeout (3600s) - Overall workflow                │
   │                                                                 │
   │  CCWF Client-Side (Swarm Learning)                              │
   │  ├── learn_task_ack_timeout (10s)                               │
   │  ├── learn_task_abort_timeout (5s)                              │
   │  └── final_result_ack_timeout (10s)                             │
   └─────────────────────────────────────────────────────────────────┘

   ┌─────────────────────────────────────────────────────────────────┐
   │                    EXECUTOR LAYER                               │
   ├─────────────────────────────────────────────────────────────────┤
   │  LauncherExecutor / ClientAPILauncherExecutor                   │
   │  ├── launch_timeout                                             │
   │  ├── external_pre_init_timeout (60-300s)                        │
   │  ├── task_wait_timeout                                          │
   │  ├── last_result_transfer_timeout (300s)                        │
   │  └── heartbeat_timeout (60-300s)                                │
   │                                                                 │
   │  TaskExchanger (Pipe Handler)                                   │
   │  ├── heartbeat_interval < heartbeat_timeout (REQUIRED)          │
   │  ├── read_interval (0.5s)                                       │
   │  ├── resend_interval (2s)                                       │
   │  ├── peer_read_timeout (60s)                                    │
   │  └── result_poll_interval (0.5s)                                │
   │                                                                 │
   │  IPCExchanger (Agent-based)                                     │
   │  ├── send_task_timeout (5s)                                     │
   │  ├── resend_task_interval (2s)                                  │
   │  ├── agent_connection_timeout (60s)                             │
   │  ├── agent_heartbeat_timeout (None)                             │
   │  └── agent_ack_timeout (5s)                                     │
   │                                                                 │
   │  InProcessClientAPIExecutor                                     │
   │  ├── result_pull_interval (0.5s)                                │
   │  └── log_pull_interval (None)                                   │
   └─────────────────────────────────────────────────────────────────┘

   ┌─────────────────────────────────────────────────────────────────┐
   │                    STREAMING LAYER                              │
   ├─────────────────────────────────────────────────────────────────┤
   │  Reliable Message                                               │
   │  └── per_msg_timeout ≤ tx_timeout (for retries to work)         │
   │                                                                 │
   │  File/Container Streaming                                       │
   │  ├── chunk_timeout (5s per chunk)                               │
   │  └── entry_timeout (60s per entry)                              │
   │                                                                 │
   │  Tensor Streaming (CRITICAL RELATIONSHIP)                       │
   │  ├── tensor_send_timeout (30s)                                  │
   │  ├── wait_send_task_data_all_clients_timeout (300s)             │
   │  └── get_task_timeout >= wait_send_task_data_all_clients_timeout│
   │      (REQUIRED to prevent task fetch timeout during streaming)  │
   └─────────────────────────────────────────────────────────────────┘


Impact Analysis
---------------

**Too Short Timeouts:**

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Timeout Category
     - Impact of Too Short Value
   * - heart_beat_timeout
     - Clients incorrectly marked dead, frequent reconnections
   * - task.timeout / train_timeout
     - Training interrupted before completion, lost work
   * - external_pre_init_timeout
     - Large model loading fails, external processes killed
   * - streaming_read_timeout
     - Large file transfers fail mid-stream
   * - per_msg_timeout
     - Reliable messages fail on slow networks
   * - get_task_timeout
     - Clients fail to receive tasks, job stalls
   * - admin_timeout
     - Admin commands fail, poor CLI experience
   * - task_assignment_timeout (Cyclic)
     - Client fails to fetch task in time, job aborts
   * - submit_model_timeout (CrossSiteEval)
     - Model submission fails, evaluation incomplete
   * - validation_timeout (CrossSiteEval)
     - Validation tasks fail prematurely
   * - result_wait_timeout (Statistics)
     - Statistics collection aborted before all clients respond
   * - agent_connection_timeout (IPC)
     - External agent incorrectly marked disconnected
   * - send_task_timeout (IPC)
     - Task delivery to agent fails, triggers resends
   * - superlink_ready_timeout (Flower)
     - Flower integration fails to initialize
   * - configure_task_timeout (TIE)
     - Third-party framework configuration fails
   * - max_client_op_interval (TIE)
     - Healthy clients marked as stuck

**Too Long Timeouts:**

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Timeout Category
     - Impact of Too Long Value
   * - heart_beat_timeout
     - Dead clients not detected, resources wasted
   * - task_assignment_timeout
     - Slow failover to backup clients
   * - progress_timeout
     - Hung workflows not detected for hours
   * - retry_timeout
     - Long delays before retry attempts
   * - shutdown_timeout
     - Slow job termination, resource cleanup delayed
   * - wait_for_clients_timeout (CrossSiteEval)
     - Long wait for clients that won't join
   * - agent_heartbeat_timeout (IPC)
     - Hung agents not detected, job stalls
   * - resend_task_interval (IPC/TaskExchanger)
     - Slow recovery from transient failures
   * - result_poll_interval (Executor)
     - Delayed result detection, slower job completion
   * - job_status_check_interval (TIE)
     - Delayed detection of job completion or failure
   * - tx_timeout (ReliableMessage)
     - Long waits for failed transactions


Recommended Settings by Use Case
================================

Development Environment
-----------------------

Fast iteration with quick feedback:

.. code-block:: python

   # Server (fed_server.json)
   heart_beat_timeout = 60        # Quick dead client detection
   admin_timeout = 5.0            # Fast admin commands

   # Client parameters
   heartbeat_timeout = 30.0
   task_wait_timeout = 60.0
   external_pre_init_timeout = 60.0

   # Flare API
   login_timeout = 5.0
   poll_interval = 1.0


Production - Standard Training
------------------------------

Balanced settings for typical federated learning:

.. code-block:: python

   # Server (fed_server.json)
   heart_beat_timeout = 600       # 10 min before client considered dead
   admin_timeout = 10.0
   task_request_interval = 2.0

   # comm_config.json
   heartbeat_interval = 10
   subnet_heartbeat_interval = 5
   streaming_read_timeout = 300

   # Executor
   external_pre_init_timeout = 300.0
   heartbeat_timeout = 300.0
   last_result_transfer_timeout = 300.0


Production - Large Models (100M+ parameters)
--------------------------------------------

Extended timeouts for large model training:

.. code-block:: python

   # Server
   heart_beat_timeout = 1200      # 20 min for large model operations

   # Executor/Launcher
   external_pre_init_timeout = 600.0   # 10 min for model loading
   task_wait_timeout = 3600.0          # 1 hour for training

   # Streaming
   streaming_per_request_timeout = 900  # 15 min per chunk
   tensor_send_timeout = 120.0

   # CCWF
   progress_timeout = 14400       # 4 hours
   learn_task_timeout = 7200      # 2 hours


LLM/Foundation Model Training
-----------------------------

For billion-parameter models (examples/advanced/llm_hf):

.. code-block:: python

   # Recipe configuration
   recipe = FedAvgRecipe(
       name="llm_training",
       model=None,  # Use dict config for large models
       shutdown_timeout=120.0,
   )

   # Client parameters - CRITICAL for LLM
   recipe.add_client_config({
       "get_task_timeout": 600,            # 10 min to receive task
       "submit_task_result_timeout": 600,  # 10 min to submit results
       "external_pre_init_timeout": 900,   # 15 min for model init
   })


Unreliable/High-Latency Networks
--------------------------------

Conservative settings for challenging network conditions:

.. code-block:: python

   # More frequent heartbeats with longer tolerance
   heartbeat_interval = 15.0      # Less frequent to reduce traffic
   heartbeat_timeout = 180.0      # 3 min tolerance

   # Extended communication timeouts
   communication_timeout = 600.0
   peer_read_timeout = 180.0
   maint_msg_timeout = 60.0

   # Reliable message settings
   per_msg_timeout = 60.0
   tx_timeout = 600.0             # Long transaction timeout for retries

   # Streaming with larger windows
   streaming_read_timeout = 600
   ack_wait = 30


Edge/Hierarchical FL
--------------------

Settings for edge device deployments:

.. code-block:: python

   # Edge device timeouts
   update_timeout = 30
   job_timeout = 300.0
   device_wait_timeout = 120.0

   # Hierarchical FL
   assess_interval = 1.0
   update_interval = 2.0


XGBoost Secure Training
-----------------------

Settings for histogram-based XGBoost:

.. code-block:: python

   # Controller
   configure_task_timeout = 300
   start_task_timeout = 30
   progress_timeout = 7200

   # Reliable messaging for large histograms
   per_msg_timeout = 120.0
   tx_timeout = 600.0
   xgb_server_ready_timeout = 30


Cross-Site Model Evaluation
---------------------------

Settings for model evaluation across sites:

.. code-block:: python

   from nvflare.app_common.workflows.cross_site_model_eval import CrossSiteModelEval

   controller = CrossSiteModelEval(
       submit_model_timeout=900,        # 15 min for large model submission
       validation_timeout=7200,         # 2 hours for thorough validation
       wait_for_clients_timeout=600,    # 10 min for clients to connect
   )


Federated Statistics
--------------------

Settings for statistics computation:

.. code-block:: python

   from nvflare.app_common.workflows.statistics_controller import StatisticsController

   controller = StatisticsController(
       result_wait_timeout=60,          # 1 min per statistic
       min_clients=2,
   )


Split Learning
--------------

Settings for split neural network training:

.. code-block:: python

   from nvflare.app_common.workflows.splitnn_workflow import SplitNNController

   controller = SplitNNController(
       task_timeout=30,                 # 30 sec for task assignment
       num_rounds=10,
   )


Flower Integration
------------------

Settings for Flower framework integration:

.. code-block:: python

   from nvflare.app_opt.flower.flower_job import FlowerJob

   job = FlowerJob(
       superlink_ready_timeout=30.0,    # 30 sec for Flower server
       configure_task_timeout=60,
       start_task_timeout=30,
       progress_timeout=7200,           # 2 hours for training
       per_msg_timeout=30.0,
       tx_timeout=300.0,
       client_shutdown_timeout=10.0,
   )


Configuration File Locations
============================

This section describes where timeout configuration files are located and which timeouts 
each file controls. Configuration is divided into **system-level** (startup kit) and 
**job-level** (application) settings.

System-Level Configuration (Startup Kit)
----------------------------------------

System-level timeouts are configured in the startup kit and apply to all jobs.
These files are located in the ``local/`` directory of each participant.

**Startup Kit Structure:**

.. code-block:: text

   startup_kit/
   ├── server/
   │   └── local/
   │       ├── fed_server.json          # Server heartbeat, admin timeouts
   │       ├── comm_config.json         # F3/CellNet communication layer
   │       └── resources.json           # Resource configuration
   │
   ├── site-1/ (client)
   │   └── local/
   │       ├── fed_client.json          # Client heartbeat, retry timeouts
   │       ├── comm_config.json         # F3/CellNet communication layer
   │       └── resources.json           # Resource configuration
   │
   └── admin/
       └── local/
           └── admin.json               # Admin session timeouts

**Deployed System Paths:**

After deployment, these files are located at:

.. list-table::
   :header-rows: 1
   :widths: 20 40 40

   * - Component
     - Startup Kit Path
     - Deployed Path
   * - Server
     - ``startup_kit/server/local/``
     - ``/opt/nvflare/workspace/server/local/`` or ``~/nvflare/workspace/server/local/``
   * - Client (Site)
     - ``startup_kit/site-\*/local/``
     - ``/opt/nvflare/workspace/site-\*/local/`` or ``~/nvflare/workspace/site-\*/local/``
   * - Admin
     - ``startup_kit/admin/local/``
     - ``/opt/nvflare/workspace/admin/local/`` or ``~/nvflare/workspace/admin/local/``

**System-Level Configuration Files:**

.. list-table::
   :header-rows: 1
   :widths: 22 22 56

   * - File
     - Location
     - Timeouts Controlled
   * - fed_server.json
     - server/local/
     - ``heart_beat_timeout``, ``admin_timeout``, ``task_request_interval``, ``heartbeat_timeout``
   * - fed_client.json
     - site-\*/local/
     - ``heart_beat_interval``, ``retry_timeout``, ``communication_timeout``
   * - comm_config.json
     - server/local/, site-\*/local/
     - ``heartbeat_interval``, ``subnet_heartbeat_interval``, ``streaming_read_timeout``, ``streaming_ack_interval``, ``max_message_size``
   * - resources.json
     - server/local/, site-\*/local/
     - Resource allocation and limits
   * - admin.json
     - admin/local/
     - ``idle_timeout``, ``login_timeout``, ``command_timeout``

**Note**: Changes to system-level files require restarting the affected FLARE components.


Job-Level Configuration
-----------------------

Job-level timeouts are configured per job and override defaults for that specific job.
These files are located in the job's ``app/config/`` directory.

**Job Configuration Files:**

.. list-table::
   :header-rows: 1
   :widths: 25 25 50

   * - File
     - Location
     - Timeouts Controlled
   * - application.conf
     - app/config/
     - Task timeouts, streaming timeouts, runner sync timeouts
   * - config_fed_client.json
     - app/config/
     - Executor timeouts, Client API task exchange, pipe handler settings
   * - config_fed_server.json
     - app/config/
     - Controller timeouts, workflow component configurations

**Ways to Configure Job-Level Timeouts:**

1. **Recipe API** - Using ``recipe.add_client_config()`` to pass client parameters:

   .. code-block:: python

      # Apply to all clients
      recipe.add_client_config({
          "get_task_timeout": 300,
          "submit_task_result_timeout": 300,
      })

      # Apply to specific clients
      recipe.add_client_config({
          "get_task_timeout": 600,
      }, clients=["site-1", "site-2"])

2. **Job config files** - In ``app/config/`` directory:

   - ``config_fed_client.json`` - Client-side executor and task exchange settings
   - ``config_fed_server.json`` - Server-side controller and workflow settings

Configuration Examples
======================

fed_server.json (Server Configuration)
--------------------------------------

.. code-block:: json

   {
     "heart_beat_timeout": 600,
     "admin_timeout": 10.0,
     "servers": [
       {
         "heart_beat_timeout": 600
       }
     ]
   }


comm_config.json (F3/CellNet Layer)
-----------------------------------

.. code-block:: json

   {
     "heartbeat_interval": 10,
     "subnet_heartbeat_interval": 5,
     "streaming_read_timeout": 300,
     "streaming_ack_interval": 4194304,
     "streaming_chunk_size": 1048576,
     "max_message_size": 1048576
   }


Client API Configuration (config_fed_client.json)
-------------------------------------------------

.. code-block:: json

   {
     "TASK_EXCHANGE": {
       "heartbeat_timeout": 60.0,
       "heartbeat_interval": 5.0,
       "resend_interval": 2.0,
       "pipe": {
         "ARG": {
           "root_url": "tcp://localhost:8002"
         }
       }
     }
   }


application.conf Settings
-------------------------

.. code-block::

   # Task communication timeouts
   get_task_timeout = 60.0
   submit_task_result_timeout = 120.0
   task_check_timeout = 5.0

   # Cell/messaging timeouts
   cell_wait_timeout = 5.0

   # Streaming timeouts
   streaming_per_request_timeout = 600.0
   np_download_chunk_size = 4194304
   tensor_download_chunk_size = 4194304

   # Runner sync timeouts
   runner_sync_timeout = 10.0
   max_runner_sync_timeout = 60.0

   # Shutdown
   end_run_readiness_timeout = 10.0

   # Server startup/dead-job safety flags
   strict_start_job_reply_check = false
   sync_client_jobs_require_previous_report = true


.. _server_startup_dead_job_safety_flags:

Server Startup and Dead-Job Safety Flags
----------------------------------------

These ``application.conf`` flags are server-side safety controls used during job startup
and client heartbeat synchronization:

.. list-table::
   :header-rows: 1
   :widths: 36 12 52

   * - Parameter
     - Default
     - Purpose
   * - strict_start_job_reply_check
     - false
     - Enables strict START_JOB reply validation (detects missing/timeout replies and non-OK return codes).
   * - sync_client_jobs_require_previous_report
     - true
     - Requires a prior positive heartbeat report before treating "missing job on client" as a dead-job signal.

Recommended usage:

- ``strict_start_job_reply_check`` defaults to ``false`` for backward compatibility.
  In non-strict mode, timed-out clients are silently excluded from the active set and the
  job continues — but ``min_sites`` / ``required_sites`` constraints are **not enforced**
  for those timeouts, so startup problems can go undetected.
  In strict mode, timeouts are detected and surfaced: ``required_sites`` and ``min_sites``
  are then checked, and the job only continues (with a warning) if constraints are still
  satisfied. Enable strict mode when you want timeouts to be visible and constraints to be
  enforced at startup.
- Keep ``sync_client_jobs_require_previous_report=true`` (default) to prevent false
  dead-job reports during startup races and transient heartbeat delays.
- Set ``sync_client_jobs_require_previous_report=false`` only to restore legacy behavior
  where the first missing-job heartbeat immediately triggers dead-job detection.


Admin Client Session (Python API)
---------------------------------

.. code-block:: python

   from nvflare.fuel.flare_api.flare_api import new_secure_session

   # Create session with connection timeout
   sess = new_secure_session(
       username="admin@nvidia.com",
       startup_kit_location="/path/to/startup",
       timeout=30.0,
   )

   # Set session-specific command timeout
   sess.set_timeout(60.0)  # 60 seconds for commands

   # Monitor job with timeout
   rc = sess.monitor_job(
       job_id,
       timeout=3600,       # 1 hour max
       poll_interval=5.0,  # Check every 5 seconds
   )

   # Reset to server default
   sess.unset_timeout()


Recipe with Extended Timeouts
-----------------------------

.. code-block:: python

   from nvflare.app_opt.pt.recipes import FedAvgRecipe

   recipe = FedAvgRecipe(
       name="large_model_training",
       model={"class_path": "model.LargeModel", "args": {}},
       min_clients=8,
       num_rounds=100,
       shutdown_timeout=120.0,
       train_script="client.py",
   )

   # Client timeout parameters
   recipe.add_client_config({
       "get_task_timeout": 300,
       "submit_task_result_timeout": 300,
   })


CCWF/Swarm Learning Configuration
---------------------------------

.. code-block:: python

   from nvflare.app_opt.pt.recipes.swarm import SwarmLearningRecipe

   recipe = SwarmLearningRecipe(
       min_clients=3,
       num_rounds=10,
       model=model,
       train_script="train.py",
       cross_site_eval_timeout=600.0,
       round_timeout=3600,   # P2P model-transfer ACK budget; increase for large models
   )


Flower Integration
------------------

.. code-block:: python

   from nvflare.app_opt.flower.recipe import FlowerRecipe

   recipe = FlowerRecipe(
       server_app=ServerApp(...),
       client_app=ClientApp(...),
       superlink_ready_timeout=30.0,
       configure_task_timeout=300,
       start_task_timeout=30,
       progress_timeout=7200,
       per_msg_timeout=30.0,
       tx_timeout=300.0,
       client_shutdown_timeout=10.0,
   )


Edge Device Configuration
-------------------------

.. code-block:: python

   from nvflare.edge.tools.edge_fed_buff_recipe import EdgeFedBuffRecipe

   recipe = EdgeFedBuffRecipe(
       model=MyModel(),
       update_timeout=30,
       job_timeout=600.0,
       device_wait_timeout=120.0,
   )


TaskExchanger Configuration
---------------------------

.. code-block:: python

   from nvflare.app_common.executors.task_exchanger import TaskExchanger

   executor = TaskExchanger(
       read_interval=0.5,
       heartbeat_interval=5.0,
       heartbeat_timeout=120.0,
       resend_interval=5.0,
       peer_read_timeout=120.0,
       result_poll_interval=1.0,
   )


LauncherExecutor Configuration
------------------------------

.. code-block:: python

   from nvflare.app_common.executors.launcher_executor import LauncherExecutor

   executor = LauncherExecutor(
       launch_timeout=60.0,
       task_wait_timeout=3600.0,
       last_result_transfer_timeout=600.0,
       external_pre_init_timeout=300.0,
       peer_read_timeout=120.0,
       monitor_interval=0.5,
       read_interval=0.5,
       heartbeat_interval=10.0,
       heartbeat_timeout=120.0,
   )


ModelController-Based Workflow
------------------------------

.. code-block:: python

   from nvflare.app_common.workflows.fedavg import FedAvg

   controller = FedAvg(
       num_clients=8,
       num_rounds=100,
   )

   # Task with timeout
   controller.send_model_and_wait(
       targets=None,
       data=model,
       timeout=3600,  # 1 hour per round
   )


ScatterAndGather Configuration
------------------------------

.. code-block:: python

   from nvflare.app_common.workflows.scatter_and_gather import ScatterAndGather

   controller = ScatterAndGather(
       min_clients=4,
       num_rounds=50,
       train_timeout=7200,              # 2 hours per round
       wait_time_after_min_received=30, # Wait 30s for stragglers
       task_check_interval=1.0,
   )


CyclicController Configuration
------------------------------

.. code-block:: python

   from nvflare.app_common.workflows.cyclic_ctl import CyclicController

   controller = CyclicController(
       num_rounds=10,
       task_assignment_timeout=30,  # 30 sec to request task
   )


TIE Controller Configuration
----------------------------

.. code-block:: python

   from nvflare.app_common.tie.controller import TieController

   controller = TieController(
       configure_task_timeout=60,
       start_task_timeout=30,
       job_status_check_interval=5.0,
       max_client_op_interval=120.0,
       progress_timeout=7200.0,
   )


Notes and Best Practices
========================

**General Rules:**

- Timeout values are in **seconds** unless otherwise specified
- ``None`` or ``0`` often means no timeout limit (wait indefinitely)
- Chunk size values of ``0`` disable streaming and use native serialization

**Critical Constraints:**

- ``heartbeat_interval`` must be **less than** ``heartbeat_timeout``
- ``task_assignment_timeout`` must be **less than or equal to** ``task.timeout``
- ``task_result_timeout`` must be **less than or equal to** ``task.timeout``
- ``per_msg_timeout`` should be **less than or equal to** ``tx_timeout`` for retries to work
- ``agent_heartbeat_interval`` must be **less than** ``agent_connection_timeout``
- **IMPORTANT**: When using tensor streaming, ``get_task_timeout`` must be **greater than or equal to** 
  ``wait_send_task_data_all_clients_timeout`` to prevent task fetch timeouts while waiting for all 
  clients to receive tensors

**Tensor Streaming Timeout Warning:**

When tensor streaming is enabled, if ``get_task_timeout`` is not explicitly set, it defaults to the 
communicator's timeout. If the streaming timeout (``wait_send_task_data_all_clients_timeout``) exceeds 
the communicator timeout, clients may timeout while waiting for other clients to receive weights. This 
can cause the tensor streaming process to restart and clients may receive empty tensors, causing the 
job to fail.

**Recommended relationship for tensor streaming:**

.. code-block:: text

   get_task_timeout >= wait_send_task_data_all_clients_timeout >= tensor_send_timeout * num_clients

**Hierarchy:**

- Session-specific timeouts override server defaults
- Client config overrides can be set via ``recipe.add_client_config()``
- ``comm_config.json`` settings apply to all F3/CellNet communication

**Best Practices by Component:**

*Controllers:*

- Start with ``timeout=0`` (no timeout) during development
- Set appropriate ``train_timeout`` based on expected round duration
- For cross-site eval, ``validation_timeout`` should exceed longest validation time
- Use ``wait_for_clients_timeout`` to limit waiting for slow clients

*Executors:*

- ``external_pre_init_timeout`` should cover model loading + library imports
- ``heartbeat_timeout`` should be 2-3x ``heartbeat_interval``
- Set ``last_result_transfer_timeout`` based on result size
- For IPC: ``agent_connection_timeout`` > ``agent_heartbeat_interval`` * 3

*Workflows:*

- ``progress_timeout`` catches hung jobs; set to 2-3x expected round time
- ``job_status_check_interval`` trades responsiveness vs overhead
- For statistics: ``result_wait_timeout`` per statistic, not total

*Network/Streaming:*

- Increase ``per_msg_timeout`` and ``tx_timeout`` for high-latency networks
- ``streaming_read_timeout`` should handle slowest expected transfer
- Use longer ``ack_wait`` for unreliable connections

**Debugging Tips:**

- Enable debug logging to see timeout-related messages
- Check ``num_timeout_reqs`` counter in CoreCell for timeout statistics
- Monitor heartbeat status to detect connectivity issues early
- Look for "timeout" in logs to identify which timeouts are triggering
- For IPC issues, check ``agent_connection_timeout`` and agent logs
- For third-party integration (TIE), monitor ``max_client_op_interval`` triggers

**Common Timeout Patterns:**

1. **Layered Timeouts**: Higher-level timeouts should exceed lower-level ones
   
   - ``progress_timeout`` > ``train_timeout`` > ``task_wait_timeout``
   - ``validation_timeout`` > per-batch validation time * num_batches

2. **Heartbeat Relationships**: Always maintain proper ratios
   
   - ``heartbeat_timeout`` = 3-6x ``heartbeat_interval``
   - ``agent_heartbeat_timeout`` = 3-6x ``agent_heartbeat_interval``

3. **Retry Allowance**: Leave room for retries
   
   - ``tx_timeout`` > ``per_msg_timeout`` * expected_retries
   - ``task.timeout`` > ``task_assignment_timeout`` + actual_work_time