Timeout Troubleshooting Guide

This guide covers the most common timeout-related job failures and how to resolve them. For a comprehensive reference of all timeouts, see Timeouts in NVIDIA FLARE (Reference).

Common Job Failure Scenarios

Task Fetch Timeout

Symptom: Client fails to receive tasks from server; logs show “timeout” during task fetch.

Common Causes:

  • Large model weights take too long to transfer

  • Network latency exceeds default timeout

  • Tensor streaming timeout exceeds task fetch timeout

Solution: Set get_task_timeout in client config:

recipe.add_client_config({
    "get_task_timeout": 300,  # 5 minutes
})
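When transfer time is the cause, a rough lower bound for get_task_timeout can be estimated from the payload size and the sustained bandwidth between server and client. The sketch below is plain arithmetic, not an NVFlare API: the helper name estimate_transfer_timeout, the 3x safety factor, and the 60-second floor are all illustrative assumptions.

```python
import math


def estimate_transfer_timeout(payload_bytes, bandwidth_bits_per_s,
                              safety_factor=3.0, floor_s=60):
    """Return a timeout in whole seconds for moving payload_bytes over a link.

    Hypothetical helper: safety_factor and floor_s are arbitrary starting
    points, not NVFlare defaults.
    """
    if payload_bytes < 0:
        raise ValueError("payload_bytes must be non-negative")
    if bandwidth_bits_per_s <= 0:
        raise ValueError("bandwidth_bits_per_s must be positive")
    # Ideal transfer time, ignoring protocol overhead and serialization cost.
    transfer_s = payload_bytes * 8 / bandwidth_bits_per_s
    return max(floor_s, math.ceil(transfer_s * safety_factor))


# Example: 2 GB of weights over a sustained 200 Mbit/s link.
timeout_s = estimate_transfer_timeout(2 * 10**9, 200 * 10**6)
print(timeout_s)  # 240
```

The result is only a starting point; measured transfer times from your own logs are a better guide than the estimate.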

External Process Pre-Init Timeout (Client API Only)

Applies to: Client API with subprocess launcher (ScriptRunner, ClientAPILauncherExecutor)

Symptom: Job fails before training starts with “external_pre_init_timeout” error.

This timeout controls how long NVFlare waits for your external training script to call flare.init(). When using Client API, NVFlare launches your script as a subprocess and waits for it to connect back.

Common Causes:

  • Large models (LLMs) take time to load before flare.init() is called

  • Heavy library imports (PyTorch, TensorFlow, transformers)

  • Slow disk I/O reading model weights

Solution: Increase external_pre_init_timeout in the executor configuration:

from nvflare.app_common.executors.client_api_launcher_executor import ClientAPILauncherExecutor

executor = ClientAPILauncherExecutor(
    external_pre_init_timeout=600,  # 10 minutes for LLMs
    ...
)
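To choose a value, it can help to measure how long the training script spends before it reaches flare.init(). The sketch below times an arbitrary startup callable and adds headroom. The helper names, the 2x headroom, and the 60-second floor are assumptions for illustration, and load_everything is a placeholder for your real imports and model loading.

```python
import math
import time


def time_startup(startup_fn):
    """Run startup_fn once and return the wall-clock seconds it took."""
    start = time.perf_counter()
    startup_fn()
    return time.perf_counter() - start


def suggest_pre_init_timeout(measured_s, headroom=2.0, floor_s=60):
    """Scale a measured startup time by headroom, never going below floor_s."""
    return max(floor_s, math.ceil(measured_s * headroom))


def load_everything():
    # Placeholder for heavy imports and model loading that run before
    # flare.init() in the real training script.
    time.sleep(0.05)


elapsed = time_startup(load_everything)
suggested = suggest_pre_init_timeout(elapsed)
print(f"startup took {elapsed:.2f}s; suggested external_pre_init_timeout={suggested}")
```

Measure on the slowest site you expect to participate, since cold disk caches and shared GPUs can make the first load much slower than later ones.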

Heartbeat Timeout

Symptom: Client marked as dead; logs show “heartbeat timeout” or “client not responding”.

Common Causes:

  • Long-running training blocks heartbeat thread

  • Network issues causing missed heartbeats

  • Client overwhelmed with compute

Solution: Adjust heartbeat settings:

# In executor configuration
heartbeat_timeout = 300.0   # 5 minutes
heartbeat_interval = 10.0   # Send every 10 seconds

Rule: heartbeat_interval must be less than heartbeat_timeout.
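This rule is easy to check before a job is submitted. A minimal sketch, assuming both values are available as plain numbers; validate_heartbeat is a hypothetical helper, not part of NVFlare.

```python
def validate_heartbeat(heartbeat_interval, heartbeat_timeout):
    """Raise ValueError if the pair violates the rule above.

    Returns how many heartbeat intervals fit into one timeout window.
    """
    if heartbeat_interval <= 0 or heartbeat_timeout <= 0:
        raise ValueError("heartbeat_interval and heartbeat_timeout must be positive")
    if heartbeat_interval >= heartbeat_timeout:
        raise ValueError("heartbeat_interval must be less than heartbeat_timeout")
    return heartbeat_timeout / heartbeat_interval


print(validate_heartbeat(10.0, 300.0))  # 30.0
```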

Training Task Timeout

Symptom: Training interrupted before completion; logs show task timeout.

Common Causes:

  • Training round takes longer than expected

  • Data loading is slow

  • Hardware is slower than anticipated

Solution: Set appropriate task timeout in controller:

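from nvflare.app_common.workflows.fedavg import FedAvg
from nvflare.app_common.workflows.scatter_and_gather import ScatterAndGather

# ScatterAndGather controller
controller = ScatterAndGather(
    train_timeout=7200,  # 2 hours per round
    wait_time_after_min_received=60,
)

# Or via ModelController
controller = FedAvg(
    num_rounds=100,
    timeout=7200,  # 2 hours per round
)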

Result Submission Timeout

Symptom: Training completes but result submission fails.

Common Causes:

  • Large model results take time to transfer

  • Network congestion

Solution: Set submit_task_result_timeout:

recipe.add_client_config({
    "submit_task_result_timeout": 300,  # 5 minutes
})

Subprocess Large-Model Result Submission Timeout

Applies to: Subprocess-mode clients (launch_external_process=True) with large models

Symptom: Training completes in the subprocess but the job hangs or fails immediately after, with no result acknowledgment received. With very large payloads and many clients, logs may also show repeated “no ref found” messages from DownloadService after delayed retries.

Cause: Two waits on the pipe between the subprocess and its parent client job (CJ) process are involved. submit_result_timeout is how long the training subprocess waits for the CJ process to acknowledge its result. PEER_READ_TIMEOUT is the client config key for the corresponding wait on the parent side, that is, how long the CJ process waits for the subprocess to read a task. For large models (5 GB+) and many clients, either wait can exceed its default if streaming request timeouts are configured higher than the pipe timeout. The subprocess must also stay alive long enough for the server to finish pulling tensors from its DownloadService after the result ACK.

Solution:

recipe.add_client_config({
    "submit_result_timeout": 1800,      # 30 min for LLM-scale results
    "download_complete_timeout": 1800,  # keep subprocess alive for server tensor download
    "PEER_READ_TIMEOUT": 600,           # parent CJ read budget; match configured streaming timeout
    "tensor_min_download_timeout": 600, # PyTorch: increase if inter-chunk gaps exceed 300s default
    # "np_min_download_timeout": 600,   # NumPy/sklearn: same, use instead of tensor variant
    "max_resends": 3,                   # finite value; 0 disables retries, None is rejected
})

Note

submit_result_timeout is the subprocess-side wait for acknowledgment. It is distinct from submit_task_result_timeout, which is the server-side wait for the client to deliver a result. For large models, set submit_task_result_timeout (server-side) to be at least as large as submit_result_timeout (subprocess-side) so the server is still listening when the subprocess finishes sending.

Note

In FLARE 2.8.0, ClientAPILauncherExecutor rejects download_complete_timeout=None and max_resends=None at job initialization. Use a positive download_complete_timeout and a finite non-negative max_resends value. Recipe-based external-process jobs serialize the default max_resends=3 in executor args; use recipe.add_client_config({"max_resends": N}) only to override that default.
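The constraints in the two notes above can be checked on a plain dict before it is passed to recipe.add_client_config. This is a minimal sketch: check_subprocess_timeouts is a hypothetical helper, not an NVFlare function, and it inspects only keys that are present, because omitted keys fall back to their defaults.

```python
import math


def check_subprocess_timeouts(cfg):
    """Return a list of human-readable problems found in a client config dict."""
    problems = []

    # An explicit None or a non-positive value is rejected at job initialization.
    if "download_complete_timeout" in cfg:
        value = cfg["download_complete_timeout"]
        if value is None or value <= 0:
            problems.append("download_complete_timeout must be a positive number")

    # max_resends must be finite and non-negative; 0 disables retries.
    if "max_resends" in cfg:
        value = cfg["max_resends"]
        if value is None or not math.isfinite(value) or value < 0:
            problems.append("max_resends must be a finite, non-negative number")

    # submit_task_result_timeout should be at least submit_result_timeout.
    sub = cfg.get("submit_result_timeout")
    task = cfg.get("submit_task_result_timeout")
    if sub is not None and task is not None and task < sub:
        problems.append("submit_task_result_timeout should be >= submit_result_timeout")

    return problems


bad_cfg = {
    "submit_result_timeout": 1800,
    "submit_task_result_timeout": 300,
    "download_complete_timeout": 1800,
    "max_resends": None,
}
for problem in check_subprocess_timeouts(bad_cfg):
    print(problem)
```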

Swarm Learning P2P Transfer Timeout

Applies to: SwarmLearningRecipe with large models

Symptom: Swarm Learning job fails with P2P ACK timeout during model scatter between peers.

Cause: round_timeout (which sets the P2P model-transfer ACK budget between peers) defaults to 3600 s. For very large models (7B+) on congested networks, peer-to-peer tensor streaming can approach this limit.

Solution: Set round_timeout directly on the recipe:

recipe = SwarmLearningRecipe(
    name="swarm",
    model=MyModel(),
    min_clients=3,
    num_rounds=5,
    train_script="client.py",
    round_timeout=7200,  # 2 hours for 70B+ models
)

Cross-Site Evaluation Timeout

Symptom: Model evaluation fails or times out during cross-site validation.

Solution: Adjust evaluation timeouts:

from nvflare.app_common.np.recipes import NumpyCrossSiteEvalRecipe

recipe = NumpyCrossSiteEvalRecipe(
    submit_model_timeout=900,      # 15 min for model submission
    validation_timeout=7200,       # 2 hours for validation
)

Quick Reference Table

Most Commonly Adjusted Timeouts

| Timeout | Default | When to Increase |
|---|---|---|
| get_task_timeout | None | Large models, slow networks, tensor streaming |
| submit_task_result_timeout | None | Large result payloads |
| submit_result_timeout (subprocess mode only) | 300 s through Client API job config; 60 s in raw FlareAgent | Large model result transfers from subprocess; set 1800 s for LLMs |
| tensor_min_download_timeout / np_min_download_timeout (subprocess mode only) | 300 s | 70B+ models on congested networks; increase to 600 s (tensor = PyTorch, np = NumPy/sklearn) |
| PEER_READ_TIMEOUT (Client API subprocess only) | 300 s | Large task payloads when streaming per-request timeout is explicitly increased |
| download_complete_timeout (subprocess mode only) | 1800 s | Keep subprocess alive while the server downloads large tensor results |
| max_resends (subprocess mode only) | 3 | Persistent network failures; keep finite; use 0 to disable retries |
| round_timeout (Swarm Learning only) | 3600 s | 7B+ model P2P transfers between Swarm peers |
| external_pre_init_timeout (Client API subprocess only) | 60-300 s | LLMs, heavy imports before flare.init() |
| heartbeat_timeout | 60-300 s | Long training iterations, slow networks |
| train_timeout | 0 | Long training rounds |
| validation_timeout | 6000 s | Large validation datasets |
| progress_timeout | 3600 s | Complex multi-round workflows |

Configuration Methods

Via Recipe API

# Client-side timeouts (applies to all clients)
recipe.add_client_config({
    "get_task_timeout": 300,
    "submit_task_result_timeout": 300,
})

# Or for specific clients
recipe.add_client_config({
    "get_task_timeout": 600,
}, clients=["site-1", "site-2"])

Via Configuration Files

application.conf (job-level):

get_task_timeout = 300.0
submit_task_result_timeout = 300.0

# Server startup/dead-job safety flags
strict_start_job_reply_check = false
sync_client_jobs_require_previous_report = true

Server-side safety flags guidance (see Server Startup and Dead-Job Safety Flags for full details):

  • strict_start_job_reply_check (default false): in non-strict mode, start-job timeouts are silently excluded from the active set with no min_sites/required_sites enforcement; set to true to make timeouts visible and have min_sites/required_sites constraints enforced at startup.

  • sync_client_jobs_require_previous_report (default true): keep enabled to avoid false dead-job reports caused by transient startup or sync races.

comm_config.json (system-level, in startup kit):

{
  "heartbeat_interval": 10,
  "streaming_read_timeout": 600
}

Debugging Timeout Issues

  1. Check logs for “timeout” messages to identify which timeout triggered

  2. Enable debug logging to see detailed timing information

  3. Monitor heartbeat status in admin console

  4. Start with longer timeouts during development, then optimize
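For step 1, a short script can tally which timeout names appear in a log. This is a sketch under stated assumptions: the parameter names are the ones listed in this guide, the sample lines are made up for illustration, and the wording of real NVFlare log messages may differ.

```python
import re
from collections import Counter

# Timeout parameter names taken from this guide.
KNOWN_TIMEOUTS = [
    "get_task_timeout",
    "submit_task_result_timeout",
    "submit_result_timeout",
    "external_pre_init_timeout",
    "heartbeat_timeout",
    "train_timeout",
    "validation_timeout",
    "progress_timeout",
    "round_timeout",
    "download_complete_timeout",
]
NAME_RE = re.compile(r"\b(" + "|".join(map(re.escape, KNOWN_TIMEOUTS)) + r")\b")


def tally_timeouts(lines):
    """Count timeout mentions per parameter name across an iterable of lines."""
    counts = Counter()
    for line in lines:
        if "timeout" not in line.lower():
            continue
        names = NAME_RE.findall(line)
        if names:
            counts.update(names)
        else:
            # The line mentions a timeout without naming a known parameter.
            counts["(unnamed timeout)"] += 1
    return counts


# Made-up sample lines; a real log will look different.
sample = [
    "2024-01-01 10:00:00 ERROR external_pre_init_timeout (60s) exceeded",
    "2024-01-01 10:05:00 WARNING heartbeat timeout for client site-1",
    "2024-01-01 10:06:00 INFO round 3 started",
    "2024-01-01 10:07:00 ERROR external_pre_init_timeout (60s) exceeded",
]
counts = tally_timeouts(sample)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

To scan a real file, pass an open file object instead of the sample list, since tally_timeouts accepts any iterable of lines.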

For timeout hierarchies, relationships, and the full list of timeout parameters, see Timeouts in NVIDIA FLARE (Reference).