Diagnostic Commands

NVIDIA FLARE provides diagnostic commands for monitoring and debugging communication statistics in the CellNet layer. These commands are particularly useful for troubleshooting network issues, analyzing message patterns, and understanding system performance characteristics.

Note

These diagnostic commands are only available when the system is configured with diagnose mode enabled in the NetManager component.

Overview

The diagnostic commands allow administrators to:

Discover active cells in the CellNet system
View statistics about message sizes and timing
Monitor communication patterns between cells
Inspect available statistics pools
Analyze histogram data with different statistical modes

These commands query the CellNet layer’s statistics tracking system, which maintains various statistics pools for monitoring different aspects of system communication.

Statistics Pools

NVIDIA FLARE’s statistics system uses “pools” to organize different types of metrics:

Histogram Pools: Track distributions of values (e.g., message sizes, timing) with configurable bins
Counter Pools: Track simple counters for specific events

Each pool has a name, type, and description. The system automatically creates pools for tracking message statistics, and applications can create custom pools for tracking domain-specific metrics.

Configuring Statistics Pool Saving

By default, statistics pools are maintained in memory during job execution. However, you can configure NVFLARE to save pool statistics to disk for later analysis and record-keeping.

Configuration in meta.json

To enable statistics pool saving for a job, add the following configuration to your job’s meta.json file:

{
  "stats_pool_config": {
    "save_pools": [
      "request_processing",
      "request_response",
      "*"
    ]
  }
}

Configuration Options:

save_pools: A list of pool names to save. Supports:
- Specific pool names: e.g., "request_processing", "msg_sizes"
- Wildcard: Use "*" to save all pools
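- Mixed: Specific names and the wildcard can be combined, although "*" already selects every pool, so any names listed alongside it are redundant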

Examples:

Save only specific pools:

{
  "stats_pool_config": {
    "save_pools": ["request_processing", "request_response"]
  }
}

Save all pools:

{
  "stats_pool_config": {
    "save_pools": ["*"]
  }
}
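The configuration above can also be applied with a small script when many jobs need the same setting. This is a minimal sketch: the helper name enable_pool_saving is hypothetical and not part of NVFLARE; it only edits the stats_pool_config key described in this section and leaves the rest of meta.json untouched.

```python
import json
from pathlib import Path


def enable_pool_saving(meta_path, pools):
    """Add or replace stats_pool_config.save_pools in a job's meta.json.

    Hypothetical helper for illustration; it only rewrites the JSON file.
    """
    path = Path(meta_path)
    meta = json.loads(path.read_text())
    # Keep any other keys that may already exist under stats_pool_config.
    config = meta.setdefault("stats_pool_config", {})
    config["save_pools"] = list(pools)
    path.write_text(json.dumps(meta, indent=2) + "\n")
    return meta
```

For example, enable_pool_saving("my_job_folder/meta.json", ["*"]) would request saving of all pools for that job, assuming the file exists at that path.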

Output Files

When statistics pool saving is enabled, NVFLARE generates two files for each job at the end of job execution:

stats_pool_summary.json

Location: Job workspace directory

Content: Contains histogram summaries and aggregated statistics for each saved pool.

Format: JSON file with the following structure:

{
  "pool_name": {
    "name": "pool_name",
    "type": "hist",
    "description": "Pool description",
    "marks": [0, 10, 100, 1000],
    "bins": [
      {"range": "0-10", "count": 150, "total": 750.5, "min": 0.1, "max": 9.9},
      {"range": "10-100", "count": 80, "total": 4200.0, "min": 10.2, "max": 99.8},
      {"range": "100-1000", "count": 20, "total": 12000.0, "min": 105.0, "max": 950.0}
    ]
  },
  "another_pool": {
    ...
  }
}
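As a rough illustration of how this structure can be consumed, the sketch below derives a per-bin percentage and average from the count and total fields. It assumes the file follows the layout shown above; the sample values are copied from that example, and bin_report is a hypothetical helper name.

```python
import json

# Sample data mirroring the structure shown above; values are illustrative.
SAMPLE_SUMMARY = json.loads("""
{
  "request_processing": {
    "name": "request_processing",
    "type": "hist",
    "description": "Pool description",
    "marks": [0, 10, 100, 1000],
    "bins": [
      {"range": "0-10", "count": 150, "total": 750.5, "min": 0.1, "max": 9.9},
      {"range": "10-100", "count": 80, "total": 4200.0, "min": 10.2, "max": 99.8},
      {"range": "100-1000", "count": 20, "total": 12000.0, "min": 105.0, "max": 950.0}
    ]
  }
}
""")


def bin_report(pool):
    """Return (range, count, percent, average) rows for one histogram pool."""
    bins = pool["bins"]
    grand_total = sum(b["count"] for b in bins)
    rows = []
    for b in bins:
        # Guard against empty pools and empty bins to avoid division by zero.
        percent = 100.0 * b["count"] / grand_total if grand_total else 0.0
        average = b["total"] / b["count"] if b["count"] else 0.0
        rows.append((b["range"], b["count"], round(percent, 1), round(average, 3)))
    return rows


for row in bin_report(SAMPLE_SUMMARY["request_processing"]):
    print(row)
```

To run this against a real job, replace SAMPLE_SUMMARY with the result of json.load on stats_pool_summary.json.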

Use Cases:

Post-job analysis of communication patterns
Historical comparison across multiple job runs
Generating reports for system performance
Identifying trends over time

stats_pool_records.csv

Location: Job workspace directory

Content: Contains raw, timestamped recordings for each data point collected in the saved pools.

Format: CSV file with columns that vary by pool type:

For histogram pools:

timestamp,pool_name,value,additional_metadata
2024-01-15T10:30:45.123Z,request_processing,0.025,
2024-01-15T10:30:45.456Z,request_processing,0.031,
2024-01-15T10:30:46.789Z,request_response,0.120,

For counter pools:

timestamp,pool_name,counter_name,value
2024-01-15T10:30:45.123Z,event_counts,task_received,1
2024-01-15T10:30:46.456Z,event_counts,task_completed,1

Use Cases:

Detailed timeline analysis
Custom data processing and visualization
Integration with external analytics tools
Machine learning on system behavior patterns
Debugging specific events or anomalies
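For timeline analysis, one possible approach is to bucket the raw records by second. The sketch below uses only the standard library and assumes the histogram-pool column layout shown above (timestamp, pool_name, value); the sample rows are copied from that example, and the function names are hypothetical.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Sample rows copied from the histogram-pool example above.
SAMPLE_CSV = """timestamp,pool_name,value,additional_metadata
2024-01-15T10:30:45.123Z,request_processing,0.025,
2024-01-15T10:30:45.456Z,request_processing,0.031,
2024-01-15T10:30:46.789Z,request_response,0.120,
"""


def parse_timestamp(text):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is rewritten as a UTC offset
    because datetime.fromisoformat rejects 'Z' before Python 3.11."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def records_per_second(csv_text):
    """Count records per (pool_name, whole-second) bucket."""
    buckets = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        second = parse_timestamp(row["timestamp"]).replace(microsecond=0)
        buckets[(row["pool_name"], second.isoformat())] += 1
    return buckets


for key, count in sorted(records_per_second(SAMPLE_CSV).items()):
    print(key, count)
```

For a real file, pass the text read from stats_pool_records.csv instead of SAMPLE_CSV. Counter-pool rows use different columns and would need their own reader.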

Workflow Example

Step 1: Configure Your Job

Create or edit your job’s meta.json:

{
  "name": "my_federated_job",
  "resource_spec": {},
  "min_clients": 2,
  "stats_pool_config": {
    "save_pools": ["*"]
  }
}

Step 2: Submit the Job

> submit_job my_job_folder

Step 3: Run the Job

The job executes normally, with statistics being collected in the background.

Step 4: Retrieve Statistics After Job Completion

> download_job job_abc-123-def

Step 5: Analyze the Output

Navigate to the downloaded job workspace and examine:

cd downloaded_job/workspace/

# View summary statistics
cat stats_pool_summary.json

# Analyze raw records
cat stats_pool_records.csv

Step 6: Use Statistics for Analysis

import json
import pandas as pd

# Load summary data
with open('stats_pool_summary.json', 'r') as f:
    summary = json.load(f)

# Load raw records
records = pd.read_csv('stats_pool_records.csv')

# Analyze timing patterns
timing_data = records[records['pool_name'] == 'request_processing']
print(f"Average: {timing_data['value'].mean()}")
print(f"95th percentile: {timing_data['value'].quantile(0.95)}")

Using the Stats Viewer Tool

NVFLARE provides a convenient command-line tool called stats_viewer for interactively exploring statistics files. This tool allows you to view and analyze the stats_pool_summary.json files without writing custom scripts.

Starting the Stats Viewer:

python -m nvflare.fuel.f3.qat.stats_viewer -f stats_pool_summary.json

This launches an interactive shell where you can explore the statistics data.

Available Commands:

The stats viewer provides the following commands:

list_pools: Display all available statistics pools with their types and descriptions
show_pool <pool_name> [mode]: Display detailed statistics for a specific pool
- pool_name: Name of the pool to display
- mode (optional): Histogram display mode - one of: count, percent, avg, min, max
help or ?: List available commands
bye: Exit the stats viewer

Example Session:

$ python -m nvflare.fuel.f3.qat.stats_viewer -f stats_pool_summary.json
Type help or ? to list commands.

> list_pools
Name                 Type    Description
-------------------- ------- ------------------------------
request_processing   hist    Request processing time
request_response     hist    Request-response round trip
msg_sizes            hist    Message size distribution

> show_pool request_processing avg
Range         Count    Average
------------ ------- -----------
0-10ms           150     5.2ms
10-100ms          80    45.3ms
100-1000ms        20   425.8ms

> show_pool msg_sizes count
Range         Count
------------ -------
0-1KB           200
1KB-10KB        150
10KB-100KB       50

> bye

Server-Side vs Client-Side Statistics:

The stats_viewer tool can analyze statistics from both server and client sides:

Server-side statistics: Available in the server’s job workspace after job completion. Can be retrieved using the download_job command.
Client-side statistics: Currently stored locally on each client site in their respective job workspaces.

Note

Currently, client-side statistics files are not automatically sent to the server after job completion. To analyze client statistics, you need to access the stats_pool_summary.json file directly in each client site’s job workspace.

Common Pool Names

The following pools are commonly available in NVFLARE jobs:

Communication Pools

request_processing: Time spent processing requests
request_response: End-to-end request-response times
msg_sizes: Distribution of message sizes
msg_travel_time: Message transmission times

Job-Specific Pools

Different jobs may create custom pools based on their workflows. Use the list_pools command during job execution to discover available pools:

> cells
server.job_abc-123
site1.job_abc-123

> list_pools server.job_abc-123

Integration with Monitoring

Statistics pool data complements external monitoring systems:

Statistics Pools: Detailed, job-specific metrics saved with job artifacts
External Monitoring (Prometheus/Grafana): Real-time system-wide monitoring

Use both approaches together:

External monitoring for real-time alerting and dashboards
Statistics pool saving for detailed post-job analysis and historical records

See Monitoring for information on setting up external monitoring.

Available Commands

cells

Description: Lists all active cells in the CellNet system with their FQCNs (Fully Qualified Cell Names). This command is essential for discovering available targets to use with other diagnostic commands.

Usage:

cells

Parameters:

None. This command takes no parameters.

Output:

Displays a list of all active cells in the system, showing each cell’s FQCN on a separate line, followed by a summary line showing the total number of valid cells.

Example:

> cells

Example Output:

server
site1
site2
site3
server.abc-123-def
site1.abc-123-def
site2.abc-123-def
Total Cells: 7

Understanding the Output:

The cells listed include:

Parent Cells: Base cells for each site (e.g., server, site1, site2)
- The server’s parent cell is always named server
- Client parent cells use their site names
Job Cells: Cells created for active jobs (e.g., server.abc-123-def, site1.abc-123-def)
- Format: <site_name>.<job_id>
- Created when a job is deployed
- Removed when the job completes
Relay Cells: In hierarchical deployments, relay nodes (e.g., relay1, relay1.site1)
- Intermediate nodes in the communication hierarchy
- Can have their own job cells when jobs are running
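When the output of cells is captured as text, it can be split into FQCNs and the reported total, and job cells can be picked out by job ID. The sketch below assumes the output layout shown in the example above (one FQCN per line followed by a "Total Cells:" line); both function names are hypothetical. Because a relay child such as relay1.site1 has the same dotted shape as a job cell, the job ID is passed in explicitly instead of being guessed.

```python
# Sample text copied from the example output above.
SAMPLE_OUTPUT = """server
site1
site2
site3
server.abc-123-def
site1.abc-123-def
site2.abc-123-def
Total Cells: 7
"""


def parse_cells_output(text):
    """Split captured `cells` output into a list of FQCNs and the reported total."""
    fqcns, total = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.lower().startswith("total cells:"):
            total = int(line.split(":", 1)[1])
        else:
            fqcns.append(line)
    return fqcns, total


def job_cells(fqcns, job_id):
    """Return the FQCNs whose last dotted component equals the given job ID."""
    return [f for f in fqcns if "." in f and f.rsplit(".", 1)[1] == job_id]


fqcns, total = parse_cells_output(SAMPLE_OUTPUT)
print(total == len(fqcns), job_cells(fqcns, "abc-123-def"))
```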

Use Cases:

Discover Available Targets: Find valid FQCNs to use with list_pools, show_pool, msg_stats, and other diagnostic commands
Verify System Topology: Confirm all expected sites are connected and active
Monitor Job Cells: See which jobs are currently running by identifying job cell FQCNs
Troubleshoot Connectivity: Identify missing or disconnected cells
Understand Hierarchy: In hierarchical deployments, visualize the cell structure

Examples with Follow-up Commands:

After running cells to discover targets, you can use the FQCNs with other commands:

# First, discover all cells
> cells
server
site1
site2
server.job123
site1.job123
Total Cells: 5

# Then query specific cells
> msg_stats server
> msg_stats site1
> msg_stats server.job123
> list_pools site1.job123

Interpreting Different Cell Types:

Server Parent Cell (server):
- Always present when the FL system is running
- Handles administrative operations
- Parent for all job cells on the server
Client Parent Cells (site1, site2, etc.):
- One per connected FL client site
- Active as long as the client is connected
- Persist across multiple jobs
Job Server Cell (server.<job_id>):
- Created when a job is deployed on the server
- Contains job-specific server workflows
- Removed when job completes
Job Client Cells (<site_name>.<job_id>):
- One per client participating in a job
- Execute the client-side job logic
- Communicate with the corresponding server job cell
Hierarchical Cells (relay1, relay1.site1):
- Relay nodes in hierarchical deployments
- Can be nested (e.g., relay1.relay2.site1)
- Help manage large-scale deployments

Tips:

Run cells before other diagnostic commands to identify valid targets
Compare cell lists over time to track system changes
If expected cells are missing, check connectivity and site status
Job cells appear when jobs start and disappear when they complete
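Comparing cell lists over time amounts to a set difference between two snapshots. A minimal sketch, assuming each snapshot is a list of FQCN strings collected from the cells command at different times (diff_cells is a hypothetical name):

```python
def diff_cells(before, after):
    """Return (appeared, disappeared) FQCNs between two snapshots, each sorted."""
    before_set, after_set = set(before), set(after)
    appeared = sorted(after_set - before_set)
    disappeared = sorted(before_set - after_set)
    return appeared, disappeared


earlier = ["server", "site1", "site2"]
later = ["server", "site1", "server.job123", "site1.job123"]
appeared, disappeared = diff_cells(earlier, later)
print("appeared:", appeared)
print("disappeared:", disappeared)
```

In this made-up example, the job cells show up as newly appeared, while site2 is reported as disappeared, which would be a cue to check that site's connectivity.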

list_pools

Description: Lists all statistics pools available on a target cell.

Usage:

list_pools target

Parameters:

target - The FQCN (Fully Qualified Cell Name) of the target cell to query (e.g., “server”, “client1”, “server.job_id”)

Output:

Displays a table with three columns:

pool - The name of the statistics pool
type - The type of pool (“hist” for histogram, “counter” for counter)
description - A description of what the pool tracks

Example:

> list_pools server

Example Output:

+------------------+----------+--------------------------------+
| pool             | type     | description                    |
+------------------+----------+--------------------------------+
| msg_travel_time  | hist     | Message travel time in seconds |
| msg_sizes        | hist     | Message size distribution      |
| request_counts   | counter  | Request counts by channel      |
+------------------+----------+--------------------------------+
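If the table output is saved to a file or captured from a console session, it can be turned into records for scripting. The parser below is a sketch that assumes the bordered layout shown above and that cell text never contains a "|" character; parse_ascii_table is a hypothetical helper, not an NVFLARE function.

```python
# Sample table copied from the example output above.
SAMPLE_TABLE = """
+------------------+----------+--------------------------------+
| pool             | type     | description                    |
+------------------+----------+--------------------------------+
| msg_travel_time  | hist     | Message travel time in seconds |
| msg_sizes        | hist     | Message size distribution      |
| request_counts   | counter  | Request counts by channel      |
+------------------+----------+--------------------------------+
"""


def parse_ascii_table(text):
    """Parse a '+---+'-bordered table into a list of dicts keyed by header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines and any surrounding text
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


hist_pools = [r["pool"] for r in parse_ascii_table(SAMPLE_TABLE) if r["type"] == "hist"]
print(hist_pools)
```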

Use Cases:

Discover available statistics pools on a cell
Verify that expected statistics tracking is configured
Identify pools for detailed inspection with show_pool

show_pool

Description: Shows detailed statistics for a specific pool on a target cell.

Usage:

show_pool target pool_name [mode]

Parameters:

target - The FQCN of the target cell to query
pool_name - The name of the statistics pool to display
mode - (Optional) The display mode for histogram pools. Valid values:
- count - Show the count of values in each bin (default)
- percent - Show the percentage of values in each bin
- avg - Show the average value in each bin
- min - Show the minimum value in each bin
- max - Show the maximum value in each bin

Output:

For histogram pools, displays a table showing the distribution of values across bins. The exact columns depend on the pool type and configuration.

For counter pools, displays a table with counter names and their current values.

Examples:

# Show message size distribution with counts
> show_pool server msg_sizes count

# Show message timing with averages
> show_pool server msg_travel_time avg

# Show message size percentages
> show_pool site1 msg_sizes percent

Example Output (Count Mode):

+---------------+-------+
| Range         | Count |
+---------------+-------+
| 0-1KB         | 150   |
| 1KB-10KB      | 450   |
| 10KB-100KB    | 80    |
| 100KB-1MB     | 20    |
| >1MB          | 5     |
+---------------+-------+

Example Output (Average Mode):

+---------------+-----------+
| Range         | Avg (sec) |
+---------------+-----------+
| 0-10ms        | 5.2e-03   |
| 10ms-100ms    | 4.5e-02   |
| 100ms-1s      | 3.2e-01   |
| >1s           | 2.1e+00   |
+---------------+-----------+

Use Cases:

Analyze message size distributions to identify outliers
Monitor timing characteristics of requests
Compare statistics across different cells
Identify performance bottlenecks or unusual patterns

msg_stats

Description: Shows message request statistics for a target cell. This is a convenience command that displays the pre-configured message statistics pool.

Usage:

msg_stats target [mode]

Parameters:

target - The FQCN of the target cell to query
mode - (Optional) The display mode. Valid values:
- count - Show the count of messages (default)
- percent - Show the percentage of messages
- avg - Show the average message size or timing
- min - Show the minimum values
- max - Show the maximum values

Output:

Displays statistics about request messages, typically showing distributions of message sizes and/or timing information. The exact format depends on how the message statistics pool is configured in the system.

Examples:

# Show message counts
> msg_stats server

# Show average message characteristics
> msg_stats server avg

# Show maximum values
> msg_stats client1 max

Example Output:

Message Statistics for server:
+---------------+-------+----------+
| Size Range    | Count | Avg Time |
+---------------+-------+----------+
| 0-1KB         | 245   | 12ms     |
| 1KB-10KB      | 180   | 25ms     |
| 10KB-100KB    | 45    | 150ms    |
| >100KB        | 10    | 500ms    |
+---------------+-------+----------+

Use Cases:

Quick overview of message traffic patterns
Monitor communication health
Identify unusual message patterns
Baseline system performance characteristics

Common Workflows

Discovering Available Targets

Before using diagnostic commands, discover available cells:

List all active cells:
```
> cells
```

Identify target cells of interest:
- Parent cells for overall system monitoring (server, site1, etc.)
- Job cells for job-specific monitoring (server.job_id, site1.job_id)
- Relay cells in hierarchical deployments

Verify cell connectivity:

Check that expected cells appear in the list. Missing cells may indicate connectivity issues.

Investigating Communication Issues

When investigating communication problems between cells:

Discover active cells:
```
> cells
```

List available pools:

> list_pools server
> list_pools client1

Check message statistics:

> msg_stats server count
> msg_stats client1 count

Examine specific pools:

> show_pool server msg_travel_time avg
> show_pool client1 msg_sizes percent

Performance Analysis

To analyze system performance characteristics:

Check message timing distribution:

> show_pool server msg_travel_time count
> show_pool server msg_travel_time avg

Analyze message size patterns:

> show_pool server msg_sizes count
> show_pool server msg_sizes max

Compare across cells:

> msg_stats server avg
> msg_stats client1 avg
> msg_stats client2 avg

Monitoring Job Execution

During job execution, monitor communication patterns:

Identify job cells:

> cells
# Look for cells with format: <site_name>.<job_id>

Check job cell statistics:

> list_pools server.job_abc123
> msg_stats server.job_abc123 count

Compare parent and job cells:

> msg_stats server avg
> msg_stats server.job_abc123 avg

Statistical Modes Explained

The different statistical modes provide different views of the data:

count

Shows the number of data points in each bin. This is useful for understanding the distribution and identifying where most values fall.

Use case: “How many messages are in the 1KB-10KB range?”

percent

Shows what percentage of all data points fall in each bin. This normalizes the distribution and makes it easier to compare across different time periods or cells.

Use case: “What percentage of messages are larger than 100KB?”

avg

Shows the average value of data points within each bin. This helps understand the typical characteristics within each range.

Use case: “For messages in the 10ms-100ms latency range, what’s the typical latency?”

min

Shows the minimum value encountered in each bin. Useful for understanding best-case scenarios.

Use case: “What’s the fastest response time we’ve seen in the 1KB-10KB message range?”

max

Shows the maximum value encountered in each bin. Useful for identifying worst-case scenarios or outliers.

Use case: “What’s the longest latency we’ve seen for small messages?”
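The relationship between the five modes can be sketched in a few lines. This is an illustration under the assumption that each bin keeps a count, a running total, a minimum, and a maximum, as in the stats_pool_summary.json example earlier; it is not NVFLARE's actual implementation, and the function name is hypothetical.

```python
VALID_MODES = ("count", "percent", "avg", "min", "max")


def bin_value(bin_stats, mode, total_count):
    """Return the value a given mode would report for a single bin.

    bin_stats: dict with "count", "total", "min", "max" for the bin.
    total_count: number of data points across all bins of the pool.
    """
    count = bin_stats["count"]
    if mode == "count":
        return count
    if mode == "percent":
        return 100.0 * count / total_count if total_count else 0.0
    if mode == "avg":
        return bin_stats["total"] / count if count else 0.0
    if mode == "min":
        return bin_stats["min"]
    if mode == "max":
        return bin_stats["max"]
    # Modes are matched exactly, so "Count" or "AVG" is rejected.
    raise ValueError(f"invalid mode: {mode!r}; expected one of {VALID_MODES}")


example_bin = {"count": 80, "total": 4200.0, "min": 10.2, "max": 99.8}
for mode in VALID_MODES:
    print(mode, bin_value(example_bin, mode, total_count=250))
```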

Target Cell Addressing

The target parameter in these commands uses FQCN (Fully Qualified Cell Name) addressing:

Server Cell

> msg_stats server

Client Cell

> msg_stats site1
> msg_stats client_alpha

Job Cells

When a job is running, each site has a dedicated job cell with FQCN in the format <site_name>.<job_id>:

> msg_stats server.abc-123-def
> msg_stats site1.abc-123-def

Hierarchical Cells

In hierarchical deployments with relays:

> msg_stats relay1
> msg_stats relay1.site1

See Hierarchical Communication and Clients for more information on communication hierarchies.
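Since FQCNs are dot-separated names, targets for these commands can be assembled with plain string operations. The helpers below are a sketch based only on the naming patterns described in this section; they are hypothetical and do not use any NVFLARE API.

```python
def job_cell_fqcn(site_fqcn, job_id):
    """Build the FQCN of a job cell in the <site_name>.<job_id> format."""
    return f"{site_fqcn}.{job_id}"


def fqcn_parts(fqcn):
    """Split an FQCN into its dot-separated components."""
    return fqcn.split(".")


def parent_fqcn(fqcn):
    """Return everything before the last '.', or None for a top-level cell."""
    head, sep, _ = fqcn.rpartition(".")
    return head if sep else None


# Build msg_stats targets for one job across several sites.
for site in ["server", "site1", "relay1.site1"]:
    print("msg_stats", job_cell_fqcn(site, "abc-123-def"))
```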

Tips and Best Practices

Regular Monitoring: Establish baseline statistics during normal operation to help identify anomalies.
Compare Cells: Compare statistics across different cells to identify inconsistencies or issues specific to certain sites.
Use Different Modes: Switch between statistical modes to get different insights into the same data.
Track Over Time: Run commands periodically and save output to track trends over time.
Job-Specific Analysis: Monitor job cells separately from parent cells to understand job-specific communication patterns.
Correlate with Logs: Use diagnostic commands in conjunction with log analysis for comprehensive troubleshooting.
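To track trends between two saved count-mode snapshots, the per-range difference is often enough. A minimal sketch, assuming each snapshot has already been reduced to a mapping from range label to count (count_deltas is a hypothetical name and the numbers are made up):

```python
def count_deltas(earlier, later):
    """Return later-minus-earlier counts for every range seen in either snapshot."""
    # Keep the order of the later snapshot, then append ranges that vanished.
    ranges = list(later) + [r for r in earlier if r not in later]
    return {r: later.get(r, 0) - earlier.get(r, 0) for r in ranges}


snapshot_1 = {"0-1KB": 150, "1KB-10KB": 450}
snapshot_2 = {"0-1KB": 200, "1KB-10KB": 450, ">1MB": 5}
print(count_deltas(snapshot_1, snapshot_2))
```

A range that grows much faster than the others between snapshots is a reasonable place to start a closer look with show_pool.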

Troubleshooting

Command Not Found

If diagnostic commands are not available:

Verify that the NetManager component is configured with diagnose=True
Check that you have appropriate permissions to run these commands
Ensure you’re using a version of NVIDIA FLARE that includes these commands

Cells Command Shows Fewer Cells Than Expected

If the cells command doesn’t show all expected cells:

Check connectivity: Verify that all sites are connected to the server
Check site status: Use check_status to see if clients are properly connected
Wait for initialization: Sites may take a few moments to appear after starting
Check logs: Review server and client logs for connection errors
Verify network: Ensure there are no network issues or firewall blocks

Cells Command Shows Old Job Cells

If job cells remain listed after a job completes:

There may be a delay in cleanup - wait a few moments and run cells again
Check if the job is actually still running with list_jobs
Review logs for any errors during job shutdown

Invalid Mode Error

If you receive an “invalid mode” error:

Ensure you’re using one of the valid modes: count, percent, avg, min, max
Check for typos in the mode parameter
Note that mode is case-sensitive (use lowercase)

Target Not Found

If the target cell cannot be reached:

Verify the FQCN is correct
Check that the target cell is running and connected
Use the cells command to list available cells

Pool Does Not Exist

If you receive a “pool does not exist” error:

Use list_pools to see available pools on that cell
Verify the pool name is spelled correctly
Note that pool names are case-sensitive
