.. _timeout_troubleshooting:

#############################
Timeout Troubleshooting Guide
#############################

This guide covers the most common timeout-related job failures and how to resolve them.
For a comprehensive reference of all timeouts, see :ref:`timeouts_programming_guide`.

.. contents:: Table of Contents
   :local:
   :depth: 2

Common Job Failure Scenarios
============================

Task Fetch Timeout
------------------

**Symptom**: Client fails to receive tasks from server; logs show "timeout" during task fetch.

**Common Causes**:

- Large model weights take too long to transfer
- Network latency exceeds default timeout
- Tensor streaming timeout exceeds task fetch timeout

**Solution**: Set ``get_task_timeout`` in client config:

.. code-block:: python

   recipe.add_client_config({
       "get_task_timeout": 300,  # 5 minutes
   })

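One way to choose a starting value is to estimate the transfer time from the payload size and the slowest link.
The sketch below is a rule of thumb, not an NVFlare formula: ``estimate_transfer_timeout``, the bandwidth figure,
the safety factor, and the 60-second floor are all illustrative assumptions.

.. code-block:: python

   import math


   def estimate_transfer_timeout(
       payload_bytes: int,
       bandwidth_bytes_per_s: float,
       safety_factor: float = 3.0,  # assumed headroom for latency and contention
       floor_s: int = 60,           # assumed lower bound, never go below this
   ) -> int:
       """Return a timeout in whole seconds: ideal transfer time times a safety factor."""
       ideal_s = payload_bytes / bandwidth_bytes_per_s
       return max(floor_s, math.ceil(ideal_s * safety_factor))


   # Hypothetical case: 2 GB of weights over a 25 MB/s link.
   timeout_s = estimate_transfer_timeout(2_000_000_000, 25_000_000)
   print(timeout_s)  # 240

The result could then be passed as ``get_task_timeout`` in ``recipe.add_client_config``.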

External Process Pre-Init Timeout (Client API Only)
----------------------------------------------------

**Applies to**: Client API with subprocess launcher (``ScriptRunner``, ``ClientAPILauncherExecutor``)

**Symptom**: Job fails before training starts with "external_pre_init_timeout" error.

This timeout controls how long NVFlare waits for your external training script to call ``flare.init()``.
When using Client API, NVFlare launches your script as a subprocess and waits for it to connect back.

**Common Causes**:

- Large models (LLMs) take time to load before ``flare.init()`` is called
- Heavy library imports (PyTorch, TensorFlow, transformers)
- Slow disk I/O reading model weights

**Solution**: Increase ``external_pre_init_timeout`` in the executor configuration:

.. code-block:: python

   from nvflare.app_common.executors.client_api_launcher_executor import ClientAPILauncherExecutor

   executor = ClientAPILauncherExecutor(
       external_pre_init_timeout=600,  # 10 minutes for LLMs
       ...
   )


Heartbeat Timeout
-----------------

**Symptom**: Client marked as dead; logs show "heartbeat timeout" or "client not responding".

**Common Causes**:

- Long-running training blocks heartbeat thread
- Network issues causing missed heartbeats
- Client overwhelmed with compute

**Solution**: Adjust heartbeat settings:

.. code-block:: python

   # In executor configuration
   heartbeat_timeout = 300.0   # 5 minutes
   heartbeat_interval = 10.0   # Send every 10 seconds

**Rule**: ``heartbeat_interval`` must be less than ``heartbeat_timeout``.
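
The rule can be checked before a job is submitted. The following is a minimal sketch;
``heartbeat_settings_ok`` is a hypothetical helper, not part of NVFlare, and only compares the two values in seconds.

.. code-block:: python

   def heartbeat_settings_ok(heartbeat_interval: float, heartbeat_timeout: float) -> bool:
       """Return True when the interval is strictly below the timeout."""
       return heartbeat_interval < heartbeat_timeout


   # Values from the example above.
   print(heartbeat_settings_ok(heartbeat_interval=10.0, heartbeat_timeout=300.0))  # True
   # An interval equal to the timeout violates the rule.
   print(heartbeat_settings_ok(heartbeat_interval=300.0, heartbeat_timeout=300.0))  # False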


Training Task Timeout
---------------------

**Symptom**: Training interrupted before completion; logs show task timeout.

**Common Causes**:

- Training round takes longer than expected
- Data loading is slow
- Hardware is slower than anticipated

**Solution**: Set an appropriate task timeout in the controller:

.. code-block:: python

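   from nvflare.app_common.workflows.fedavg import FedAvg
   from nvflare.app_common.workflows.scatter_and_gather import ScatterAndGather

   # ScatterAndGather controller
   controller = ScatterAndGather(
       train_timeout=7200,  # 2 hours per round
       wait_time_after_min_received=60,
   )

   # Or via ModelController
   controller = FedAvg(
       num_rounds=100,
       timeout=7200,  # 2 hours per round
   )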


Result Submission Timeout
-------------------------

**Symptom**: Training completes but result submission fails.

**Common Causes**:

- Large model results take time to transfer
- Network congestion

**Solution**: Set ``submit_task_result_timeout``:

.. code-block:: python

   recipe.add_client_config({
       "submit_task_result_timeout": 300,  # 5 minutes
   })


Subprocess Large-Model Result Submission Timeout
-------------------------------------------------

**Applies to**: Subprocess-mode clients (``launch_external_process=True``) with large models

**Symptom**: Training completes in the subprocess but the job hangs or fails immediately
after, with no result acknowledgment received.

**Cause**: ``submit_result_timeout`` (default 60 s) is the time the training subprocess
waits for the client training process to acknowledge its result. For large models (5 GB+),
the transfer alone can exceed this limit.

**Solution**:

.. code-block:: python

   recipe.add_client_config({
       "submit_result_timeout": 1800,      # 30 min for LLM-scale results
       "tensor_min_download_timeout": 600, # PyTorch: increase if inter-chunk gaps exceed 300s default
       # "np_min_download_timeout": 600,   # NumPy/sklearn: same, use instead of tensor variant
   })

.. note::
   ``submit_result_timeout`` is the subprocess-side wait for acknowledgment.
   It is distinct from ``submit_task_result_timeout``, which is the server-side wait
   for the client to deliver a result.  For large models, set ``submit_task_result_timeout``
   (server-side) to be at least as large as ``submit_result_timeout`` (subprocess-side)
   so the server is still listening when the subprocess finishes sending.

Swarm Learning P2P Transfer Timeout
------------------------------------

**Applies to**: ``SwarmLearningRecipe`` with large models

**Symptom**: Swarm Learning job fails with P2P ACK timeout during model scatter between peers.

**Cause**: ``round_timeout`` (which sets the P2P model-transfer ACK budget between peers)
defaults to 3600 s.  For very large models (7B+) on congested networks, peer-to-peer
tensor streaming can approach this limit.

**Solution**: Set ``round_timeout`` directly on the recipe:

.. code-block:: python

   recipe = SwarmLearningRecipe(
       name="swarm",
       model=MyModel(),
       min_clients=3,
       num_rounds=5,
       train_script="client.py",
       round_timeout=7200,  # 2 hours for 70B+ models
   )

Cross-Site Evaluation Timeout
-----------------------------

**Symptom**: Model evaluation fails or times out during cross-site validation.

**Solution**: Adjust evaluation timeouts:

.. code-block:: python

   from nvflare.app_common.np.recipes import NumpyCrossSiteEvalRecipe

   recipe = NumpyCrossSiteEvalRecipe(
       submit_model_timeout=900,      # 15 min for model submission
       validation_timeout=7200,       # 2 hours for validation
   )


Quick Reference Table
=====================

Most Commonly Adjusted Timeouts
-------------------------------

.. list-table::
   :header-rows: 1
   :widths: 30 15 55

   * - Timeout
     - Default
     - When to Increase
   * - get_task_timeout
     - None
     - Large models, slow networks, tensor streaming
   * - submit_task_result_timeout
     - None
     - Large result payloads
   * - submit_result_timeout (subprocess mode only)
     - 60 s
     - Large model result transfers from subprocess; set 1800 s for LLMs
   * - tensor_min_download_timeout / np_min_download_timeout (subprocess mode only)
     - 300 s
     - 70B+ models on congested networks; increase to 600 s (tensor = PyTorch, np = NumPy/sklearn)
   * - max_resends (subprocess mode only)
     - 3
     - Persistent network failures; increase to 5–10
   * - round_timeout (Swarm Learning only)
     - 3600 s
     - 7B+ model P2P transfers between Swarm peers
   * - external_pre_init_timeout (Client API subprocess only)
     - 60–300 s
     - LLMs, heavy imports before ``flare.init()``
   * - heartbeat_timeout
     - 60–300 s
     - Long training iterations, slow networks
   * - train_timeout
     - 0
     - Long training rounds
   * - validation_timeout
     - 6000 s
     - Large validation datasets
   * - progress_timeout
     - 3600 s
     - Complex multi-round workflows


Configuration Methods
=====================

Via Recipe API
--------------

.. code-block:: python

   # Client-side timeouts (applies to all clients)
   recipe.add_client_config({
       "get_task_timeout": 300,
       "submit_task_result_timeout": 300,
   })

   # Or for specific clients
   recipe.add_client_config({
       "get_task_timeout": 600,
   }, clients=["site-1", "site-2"])


Via Configuration Files
-----------------------

**application.conf** (job-level):

.. code-block::

   get_task_timeout = 300.0
   submit_task_result_timeout = 300.0

   # Server startup/dead-job safety flags
   strict_start_job_reply_check = false
   sync_client_jobs_require_previous_report = true

Server-side safety flags guidance (see :ref:`server_startup_dead_job_safety_flags` for full details):

- ``strict_start_job_reply_check`` (default ``false``): keep default for backward-compatible startup behavior;
  set to ``true`` to enforce stricter START_JOB reply checks.
- ``sync_client_jobs_require_previous_report`` (default ``true``): keep enabled to avoid false dead-job reports
  caused by transient startup or sync races.

**comm_config.json** (system-level, in startup kit):

.. code-block:: json

   {
     "heartbeat_interval": 10,
     "streaming_read_timeout": 600
   }


Streaming Stall Guardrail (``comm_config.json``)
------------------------------------------------

For large payload/model transfers, configure F3 stream stall detection in
``comm_config.json`` (server and client startup kits).

**Runtime defaults** (if not set explicitly):

- ``streaming_send_timeout``: ``30.0`` seconds
- ``streaming_ack_progress_timeout``: ``60.0`` seconds
- ``streaming_ack_progress_check_interval``: ``5.0`` seconds
- ``sfm_send_stall_timeout``: ``45.0`` seconds
- ``sfm_close_stalled_connection``: ``false`` (warn-only)
- ``sfm_send_stall_consecutive_checks``: ``3``

**Recommended deployment guideline**:

1. Start with **warn-only** to observe behavior safely.
2. If repeated stall warnings are observed during large-model streaming, enable auto-close.
3. Keep the guard enabled with consecutive checks to reduce false alarms.

Warn-only baseline:

.. code-block:: json

   {
     "sfm_close_stalled_connection": false,
     "sfm_send_stall_timeout": 75,
     "sfm_send_stall_consecutive_checks": 3
   }

Auto-recovery mode (when needed):

.. code-block:: json

   {
     "sfm_close_stalled_connection": true,
     "sfm_send_stall_timeout": 75,
     "sfm_send_stall_consecutive_checks": 3
   }

**Timing relationship (important)**:

- ``sfm_send_stall_timeout`` is compared against the total continuous blocked-send duration.
- ``sfm_send_stall_consecutive_checks`` counts consecutive heartbeat monitor ticks (every 5 seconds),
  not multiples of ``sfm_send_stall_timeout``.

Approximate auto-close window (when ``sfm_close_stalled_connection=true``):

.. code-block:: text

   close_lower_bound ~= sfm_send_stall_timeout
   close_upper_bound ~= sfm_send_stall_timeout + (HEARTBEAT_TICK * sfm_send_stall_consecutive_checks)

With ``sfm_send_stall_timeout=75`` and ``sfm_send_stall_consecutive_checks=3``, close typically occurs
around ``75``-``90`` seconds of continuous stall (not 225 seconds).

**Outer-timeout guideline**:

Set higher-layer timeouts (for example ``communication_timeout`` or task/request timeouts that include
message transfer time) greater than ``close_upper_bound`` plus a safety margin.

Example: ``communication_timeout=300`` is safely larger than the ~``90`` second stall auto-close window.
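
The window arithmetic and the outer-timeout check can be worked through in a few lines. This is a sketch under
stated assumptions: ``stall_close_window`` and ``outer_timeout_is_safe`` are hypothetical helpers, the 5-second tick
is taken from the text above, and the 30-second margin is an arbitrary illustrative value.

.. code-block:: python

   HEARTBEAT_TICK = 5.0  # seconds between monitor ticks, as stated above


   def stall_close_window(sfm_send_stall_timeout: float, consecutive_checks: int) -> tuple:
       """Return approximate (lower, upper) bounds in seconds for the auto-close window."""
       lower = sfm_send_stall_timeout
       upper = sfm_send_stall_timeout + HEARTBEAT_TICK * consecutive_checks
       return lower, upper


   def outer_timeout_is_safe(outer_timeout: float, upper: float, margin: float = 30.0) -> bool:
       """True if the outer timeout exceeds the close window plus an assumed safety margin."""
       return outer_timeout > upper + margin


   lower, upper = stall_close_window(75, 3)
   print(lower, upper)                       # 75 90.0
   print(outer_timeout_is_safe(300, upper))  # True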

**How to interpret logs**:

- Expected warning on real stalls:
  ``Detected stalled send on ... (N/3)``
- In healthy/normal streaming, no stall warning should be emitted.
- Intermittent stalls should not close the connection unless the threshold is reached in consecutive checks.

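A log scan can show how close a run came to the consecutive-check threshold. The sketch below assumes only the
``Detected stalled send on ... (N/3)`` pattern shown above; ``max_stall_streak`` is a hypothetical helper, and the
sample lines are made up for illustration, so real messages may carry more detail.

.. code-block:: python

   import re

   # Matches the "(N/M)" counter at the end of a stall warning.
   STALL_RE = re.compile(
       r"Detected stalled send on (?P<target>.+?) \((?P<count>\d+)/(?P<limit>\d+)\)"
   )


   def max_stall_streak(log_lines):
       """Return the highest N seen in '(N/M)' stall warnings, or 0 if there are none."""
       highest = 0
       for line in log_lines:
           match = STALL_RE.search(line)
           if match:
               highest = max(highest, int(match.group("count")))
       return highest


   # Hypothetical log lines; "conn-1" stands in for whatever the real message contains.
   sample = [
       "WARNING Detected stalled send on conn-1 (1/3)",
       "WARNING Detected stalled send on conn-1 (2/3)",
       "INFO unrelated message",
   ]
   print(max_stall_streak(sample))  # 2

A result below ``sfm_send_stall_consecutive_checks`` points to intermittent stalls that never reached the threshold.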

Recommended Settings by Scenario
================================

Standard Training
-----------------

.. code-block:: python

   recipe.add_client_config({
       "get_task_timeout": 120,
   })


Large Model Training (100M+ parameters)
---------------------------------------

.. code-block:: python

   recipe.add_client_config({
       "get_task_timeout": 600,
       "submit_task_result_timeout": 600,
       "submit_result_timeout": 600,        # subprocess mode only
       "tensor_min_download_timeout": 300,  # subprocess mode only; use np_min_download_timeout for NumPy
   })


LLM/Foundation Model Training
-----------------------------

.. code-block:: python

   recipe.add_client_config({
       "get_task_timeout": 1200,
       "submit_task_result_timeout": 1800,  # server-side; must be >= submit_result_timeout
       "submit_result_timeout": 1800,       # subprocess mode only
       "tensor_min_download_timeout": 600,  # PyTorch; use np_min_download_timeout for NumPy
       "max_resends": 5,                    # subprocess mode only
   })

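The ``>=`` relationship noted in the comments can be verified before the config is applied. A minimal sketch
follows; ``check_submit_timeouts`` is a hypothetical helper that only inspects a plain dictionary and does not
call NVFlare.

.. code-block:: python

   def check_submit_timeouts(client_config: dict) -> list:
       """Return a list of human-readable problems; an empty list means none were found."""
       problems = []
       outer = client_config.get("submit_task_result_timeout")
       inner = client_config.get("submit_result_timeout")
       # Only compare when both values are set explicitly.
       if outer is not None and inner is not None and outer < inner:
           problems.append(
               f"submit_task_result_timeout ({outer}) should be >= "
               f"submit_result_timeout ({inner})"
           )
       return problems


   llm_config = {
       "get_task_timeout": 1200,
       "submit_task_result_timeout": 1800,
       "submit_result_timeout": 1800,
       "tensor_min_download_timeout": 600,
       "max_resends": 5,
   }
   print(check_submit_timeouts(llm_config))  # []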

High-Latency Networks
---------------------

.. code-block:: python

   # Longer communication timeouts
   recipe.add_client_config({
       "get_task_timeout": 600,
       "submit_task_result_timeout": 600,
   })

System-level (``comm_config.json`` in startup kit):

.. code-block:: json

   {
     "heartbeat_interval": 15,
     "streaming_read_timeout": 600
   }


Large-Scale Hierarchical / HPC Deployments (Slurm, Lustre)
------------------------------------------------------------

When running 100+ FL clients in a hierarchical topology on HPC systems with shared
filesystems (Lustre, GPFS), two settings significantly improve startup reliability:

**1. Set a minimum-client tolerance in** ``config_fed_server.json``

Allow a small number of clients to be late or unavailable at startup without aborting
the job. For a 144-client job, tolerating up to ~4% stragglers is safe:

.. code-block:: json

   {
     "workflows": [{
       "id": "controller",
       "path": "nvflare.app_common.workflows.fedavg.FedAvg",
       "args": {
         "num_clients": 144,
         "min_clients": 138
       }
     }]
   }

**2. Extend the runner sync timeout in** ``config_fed_client.json``

The default 50-second timeout is too tight when many clients contend for Lustre I/O
at job launch. Raise it to give each client time to initialize:

.. code-block:: json

   {
     "runner_sync_timeout": 120,
     "max_runner_sync_timeout": 7200,
     "max_runner_sync_tries": 120
   }

These two changes address the most common startup race conditions in large hierarchical
deployments and are compatible with the startup stability fixes in FLARE 2.7.2.
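
The ``min_clients`` value in step 1 can be derived from the client count and the tolerated straggler fraction.
This is a sketch only: ``min_clients_with_tolerance`` is a hypothetical helper, and rounding the straggler
allowance up is an assumption chosen because it reproduces the 138 used in the example.

.. code-block:: python

   import math


   def min_clients_with_tolerance(num_clients: int, straggler_fraction: float) -> int:
       """Return min_clients after allowing a fraction of clients to be late or absent."""
       allowed_stragglers = math.ceil(num_clients * straggler_fraction)
       return num_clients - allowed_stragglers


   # 144 clients with ~4% stragglers tolerated: 6 may be missing.
   print(min_clients_with_tolerance(144, 0.04))  # 138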


Debugging Timeout Issues
========================

1. **Check logs** for "timeout" messages to identify which timeout triggered.
2. **Enable debug logging** to see detailed timing information.
3. **Monitor heartbeat status** in the admin console.
4. **Start with longer timeouts** during development, then tighten them once the job runs reliably.

For timeout hierarchies, relationships, and all available timeout parameters,
see the comprehensive :ref:`timeouts_programming_guide`.