.. _live_log_streaming:

####################
Live Log Streaming
####################

FLARE can stream a job's log files from each client to the server **as the job
runs**, so an operator can ``tail -f`` the server-side copy in real time.
Streamed logs are written under the server workspace and, when a job manager
is available, automatically attached to the job's persisted artifacts.

This page describes how the feature works and how to enable it. To opt out at
a particular site, see :ref:`allow_log_streaming` in
:ref:`site_config`.

Overview
========

The feature is a simple producer / consumer pair built on top of FLARE's
existing object-streaming machinery:

- The **producer** runs inside the client's job subprocess. It tails one or
  more log files (``log.txt``, ``error_log.txt``, custom files) and pushes
  new bytes to the server as they are written.
- The **consumer** runs in the server process. It opens a destination file
  per stream and writes incoming chunks directly, so the file is readable
  with ``tail -f`` while the job is still running. When the stream closes
  (normal end, abort, or idle timeout) the file is handed to the job
  manager for storage.

A third widget, the **system streamer**, lives in the client's
``resources.json`` and saves users from declaring a streamer in every job
config — it auto-injects a ``JobLogStreamer`` into each job before launch.

Components
==========

JobLogStreamer (client, in-job)
-------------------------------

:class:`~nvflare.app_common.logging.job_log_streamer.JobLogStreamer` runs
inside the job subprocess (``CLIENT_JOB``). It belongs in the **job-level**
client configuration (``config_fed_client.json``):

.. code-block:: json

    {
      "components": [
        {
          "id": "log_streamer",
          "path": "nvflare.app_common.logging.job_log_streamer.JobLogStreamer",
          "args": {}
        }
      ]
    }

To stream more than one log file, declare one component per file:

.. code-block:: json

    {
      "components": [
        {
          "id": "log_streamer",
          "path": "nvflare.app_common.logging.job_log_streamer.JobLogStreamer",
          "args": {"log_file_name": "log.txt"}
        },
        {
          "id": "error_log_streamer",
          "path": "nvflare.app_common.logging.job_log_streamer.JobLogStreamer",
          "args": {"log_file_name": "error_log.txt"}
        }
      ]
    }

**Constructor arguments**

``log_file_name`` (str, default ``"log.txt"``)
    Base name of the log file to stream. Must be a relative file name; absolute
    paths and ``..`` traversal are rejected. The actual file is located by
    inspecting the active Python file handler and reusing its directory, so
    streaming works the same way under the simulator and in production
    without any workspace path arithmetic.

``liveness_interval`` (float, default ``10.0``)
    Seconds between heartbeat messages when the log file has produced no new
    bytes. Must be **strictly less than** the receiver's ``idle_timeout`` so
    heartbeats keep the stream alive during quiet periods.

``poll_interval`` (float, default ``0.5``)
    Seconds between polls when the log file has no new data.

**Lifecycle**

The streamer fires on three events:

- ``START_RUN`` — opens the stream and starts a daemon tailing thread.
- ``ABOUT_TO_END_RUN`` — signals the streaming thread to drain and stop, but
  does not block, so post-event log lines still land in the file and are
  picked up by the drain.
- ``END_RUN`` — joins the streaming thread (with a 60-second timeout) so the
  server has received the EOF before the client's ``client_run`` returns.

JobLogReceiver (server)
-----------------------

:class:`~nvflare.app_common.logging.job_log_receiver.JobLogReceiver` opens a
destination file per incoming stream and writes chunks as they arrive. It can
be placed either in **site-level resources** so every job is covered, or
in **job-level configuration** to scope the receiver to a single job.

Site-level (``resources.json`` on the server, recommended)::

    {
      "components": [
        {
          "id": "log_receiver",
          "path": "nvflare.app_common.logging.job_log_receiver.JobLogReceiver",
          "args": {}
        }
      ]
    }

Job-level (``config_fed_server.json``) — declare the same component there if
you want the receiver to register only for that job. The widget keys off the
``SYSTEM_START`` event in system mode and ``START_RUN`` in job mode, so the
underlying stream handler is registered exactly once.

In the Job API, you can attach a job-level receiver with:

.. code-block:: python

    job.to_server(JobLogReceiver())

**Constructor arguments**

``dest_dir`` (str, default ``None``)
    Directory where incoming log files are written. Defaults to the system
    temporary directory.

``idle_timeout`` (float, default ``30.0``)
    Seconds without any message (data or heartbeat) before the receiver
    declares the sender dead and closes the stream. Set to ``0`` to disable.

**File layout**

Each chunk is appended to::

    {dest_dir}/{job_id}/{client_name}/{log_file_name}

so an operator can find logs while the job is still running. When the stream
ends successfully, the file is handed to the job manager (if registered) for
permanent storage; otherwise — for example under the simulator — it is
moved into the job's workspace run directory alongside the other artifacts.

If the stream ends with a non-OK return code (e.g. idle timeout), the
**partial** file is retained at the path above and a warning is logged.

SiteLogStreamer (client, site widget)
-------------------------------------

:class:`~nvflare.app_common.logging.site_log_streamer.SiteLogStreamer` is
a convenience widget that lives in the **client's** ``resources.json`` and
removes the need to declare a streamer in every job:

.. code-block:: json

    {
      "components": [
        {
          "id": "site_log_streamer",
          "path": "nvflare.app_common.logging.site_log_streamer.SiteLogStreamer",
          "args": {}
        }
      ]
    }

On ``BEFORE_JOB_LAUNCH`` (after the deployed job config has been written to
disk but before the subprocess starts), it reads the deployed
``config_fed_client.json``, and if no ``JobLogStreamer`` is already declared
it appends one with the configured arguments. The job subprocess then loads
the modified config and runs ``JobLogStreamer`` as if the user had declared
it explicitly.

When configured for ``error_log.txt``, ``SiteLogStreamer`` also uploads a
post-mortem snapshot from the client parent process on ``JOB_COMPLETED``.
This guarantees error-log delivery for failures that happen so early in the
job that ``JobLogStreamer`` never reaches ``START_RUN``.

The constructor takes the same ``log_file_name``, ``liveness_interval`` and
``poll_interval`` arguments as :class:`JobLogStreamer`; any non-default values
are forwarded to the injected component.

Site control
============

Live log streaming is **enabled by default**. A site can opt out by setting
``"allow_log_streaming": false`` in its ``resources.json``; see
:ref:`allow_log_streaming` for the full description of how each component
behaves when streaming is disabled, including the server-side check that
logs an error if a chunk arrives from a site that has disabled it.

Wire protocol
=============

Streaming uses FLARE's :class:`LogStreamer` over the
``log_streaming`` channel with topic ``live_log``. The stream context carries
the trusted client name and job ID derived from the peer FL context, so
filenames on disk reflect the actual sender — they cannot be spoofed by the
streaming client through the stream payload. Each chunk is a sequence of
log bytes; heartbeats are sent every ``liveness_interval`` seconds when no
bytes have been written.

Behavior under abort
====================

The streaming thread runs in a fresh FL context whose abort signal is never
triggered, so an aborted job's still-buffered log bytes can drain to the
server before the run actually shuts down. Graceful shutdown is signaled
exclusively via the streamer's stop event, set in ``ABOUT_TO_END_RUN`` and
joined in ``END_RUN``. If the join exceeds 60 seconds, a warning is logged
and the run continues to shut down — the server will see the stream close
with an idle-timeout return code rather than EOF, and will retain whatever
partial log has been written so far.