Diagnostic Commands
NVIDIA FLARE provides diagnostic commands for monitoring and debugging communication statistics in the CellNet layer. These commands are particularly useful for troubleshooting network issues, analyzing message patterns, and understanding system performance characteristics.
Note
These diagnostic commands are only available when the system is configured with diagnose mode enabled in the NetManager component.
Overview
The diagnostic commands allow administrators to:
Discover active cells in the CellNet system
View statistics about message sizes and timing
Monitor communication patterns between cells
Inspect available statistics pools
Analyze histogram data with different statistical modes
These commands query the CellNet layer’s statistics tracking system, which maintains various statistics pools for monitoring different aspects of system communication.
Statistics Pools
NVIDIA FLARE’s statistics system uses “pools” to organize different types of metrics:
Histogram Pools: Track distributions of values (e.g., message sizes, timing) with configurable bins
Counter Pools: Track simple counters for specific events
Each pool has a name, type, and description. The system automatically creates pools for tracking message statistics, and applications can create custom pools for tracking domain-specific metrics.
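To make the two pool types concrete, here is a minimal sketch of how a histogram pool with configurable bins and a counter pool might be modeled. The names ToyBin and ToyHistPool are hypothetical and are not NVFLARE classes. The binning rule (half-open ranges, with an open-ended last bin) is an assumption made for illustration.

```python
from bisect import bisect_right
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyBin:
    """Running aggregates for one histogram bin."""
    count: int = 0
    total: float = 0.0
    min: Optional[float] = None
    max: Optional[float] = None


@dataclass
class ToyHistPool:
    """Hypothetical histogram pool; not NVFLARE's actual class."""
    name: str
    description: str
    marks: List[float]  # ascending bin boundaries, e.g. [0, 10, 100, 1000]
    bins: List[ToyBin] = field(init=False)

    def __post_init__(self) -> None:
        # Assumption: bin i covers [marks[i], marks[i + 1]); the last bin is open-ended.
        self.bins = [ToyBin() for _ in self.marks]

    def record(self, value: float) -> None:
        # Values below the first mark are clamped into the first bin.
        idx = max(bisect_right(self.marks, value) - 1, 0)
        b = self.bins[idx]
        b.count += 1
        b.total += value
        b.min = value if b.min is None else min(b.min, value)
        b.max = value if b.max is None else max(b.max, value)


# A counter pool reduces to named tallies; Counter is enough for a sketch.
events = Counter()
events["task_received"] += 1

timing = ToyHistPool("request_processing", "Request processing time", [0, 10, 100, 1000])
for sample in (0.5, 9.9, 42.0):
    timing.record(sample)
print([(b.count, b.total) for b in timing.bins])
```

Keeping count, total, min, and max per bin is what allows the different display modes described later to be derived without storing every raw value.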
Configuring Statistics Pool Saving
By default, statistics pools are maintained in memory during job execution. However, you can configure NVFLARE to save pool statistics to disk for later analysis and record-keeping.
Configuration in meta.json
To enable statistics pool saving for a job, add the following configuration to your job’s meta.json file:
{
  "stats_pool_config": {
    "save_pools": [
      "request_processing",
      "request_response",
      "*"
    ]
  }
}
Configuration Options:
save_pools: A list of pool names to save. Supports:
Specific pool names: e.g., "request_processing", "msg_sizes"
Wildcard: Use "*" to save all pools
Mixed: Combine specific names and wildcards
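The options above can be sketched in Python. The helper names with_stats_pool_saving and selected_pools are hypothetical. The selection logic is only one reading of the documented semantics ("*" selects every pool, other entries are exact names), not NVFLARE's implementation.

```python
import json
from typing import Iterable, List


def with_stats_pool_saving(meta: dict, pools: Iterable[str] = ("*",)) -> dict:
    """Return a copy of a job meta dict with stats_pool_config.save_pools set."""
    updated = dict(meta)  # shallow copy; the caller's dict is not modified
    updated["stats_pool_config"] = {"save_pools": list(pools)}
    return updated


def selected_pools(available: List[str], save_pools: List[str]) -> List[str]:
    """Pools that would be saved, assuming "*" means all and other entries are exact names."""
    if "*" in save_pools:
        return list(available)
    return [name for name in available if name in save_pools]


meta = {"name": "my_federated_job", "resource_spec": {}, "min_clients": 2}
print(json.dumps(with_stats_pool_saving(meta, ["request_processing"]), indent=2))
print(selected_pools(["request_processing", "msg_sizes"], ["request_processing", "*"]))
```

Under this reading, a mixed list that contains "*" behaves the same as the wildcard alone, so listing specific names next to it mainly documents intent.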
Examples:
Save only specific pools:
{
  "stats_pool_config": {
    "save_pools": ["request_processing", "request_response"]
  }
}
Save all pools:
{
  "stats_pool_config": {
    "save_pools": ["*"]
  }
}
Output Files
When statistics pool saving is enabled, NVFLARE generates two files for each job at the end of job execution:
stats_pool_summary.json
Location: Job workspace directory
Content: Contains histogram summaries and aggregated statistics for each saved pool.
Format: JSON file with the following structure:
{
  "pool_name": {
    "name": "pool_name",
    "type": "hist",
    "description": "Pool description",
    "marks": [0, 10, 100, 1000],
    "bins": [
      {"range": "0-10", "count": 150, "total": 750.5, "min": 0.1, "max": 9.9},
      {"range": "10-100", "count": 80, "total": 4200.0, "min": 10.2, "max": 99.8},
      {"range": "100-1000", "count": 20, "total": 12000.0, "min": 105.0, "max": 950.0}
    ]
  },
  "another_pool": {
    ...
  }
}
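As a rough illustration of working with this structure, the sketch below computes an overall sample count and mean per pool from the per-bin count and total fields. The function name pool_overview is hypothetical. It assumes every histogram pool carries a bins list shaped like the example above, and it skips pools without one.

```python
from typing import Dict, Optional

# Same shape as the documented example (the elided "another_pool" is left out).
SAMPLE_SUMMARY = {
    "pool_name": {
        "name": "pool_name",
        "type": "hist",
        "description": "Pool description",
        "marks": [0, 10, 100, 1000],
        "bins": [
            {"range": "0-10", "count": 150, "total": 750.5, "min": 0.1, "max": 9.9},
            {"range": "10-100", "count": 80, "total": 4200.0, "min": 10.2, "max": 99.8},
            {"range": "100-1000", "count": 20, "total": 12000.0, "min": 105.0, "max": 950.0},
        ],
    }
}


def pool_overview(summary: dict) -> Dict[str, Dict[str, Optional[float]]]:
    """Total sample count and overall mean for each pool that has bins."""
    overview = {}
    for pool_name, pool in summary.items():
        bins = pool.get("bins", [])
        if not bins:
            continue  # e.g. a pool type without histogram bins
        count = sum(b["count"] for b in bins)
        total = sum(b["total"] for b in bins)
        overview[pool_name] = {"count": count, "mean": total / count if count else None}
    return overview


print(pool_overview(SAMPLE_SUMMARY))
```

For a real file, the same function can be applied to the result of json.load on stats_pool_summary.json.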
Use Cases:
Post-job analysis of communication patterns
Historical comparison across multiple job runs
Generating reports for system performance
Identifying trends over time
stats_pool_records.csv
Location: Job workspace directory
Content: Contains raw, timestamped recordings for each data point collected in the saved pools.
Format: CSV file with columns that vary by pool type:
For histogram pools:
timestamp,pool_name,value,additional_metadata
2024-01-15T10:30:45.123Z,request_processing,0.025,
2024-01-15T10:30:45.456Z,request_processing,0.031,
2024-01-15T10:30:46.789Z,request_response,0.120,
For counter pools:
timestamp,pool_name,counter_name,value
2024-01-15T10:30:45.123Z,event_counts,task_received,1
2024-01-15T10:30:46.456Z,event_counts,task_completed,1
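For counter pools, the raw records can be tallied with the standard library alone. This is a minimal sketch that assumes the counter-pool column layout shown above. The function name tally_counters is hypothetical, and the sample text reuses the two rows from the example.

```python
import csv
import io
from collections import Counter

SAMPLE_CSV = """timestamp,pool_name,counter_name,value
2024-01-15T10:30:45.123Z,event_counts,task_received,1
2024-01-15T10:30:46.456Z,event_counts,task_completed,1
"""


def tally_counters(csv_text: str, pool_name: str) -> Counter:
    """Sum the recorded values per counter_name for one counter pool."""
    totals: Counter = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["pool_name"] == pool_name:
            totals[row["counter_name"]] += int(row["value"])
    return totals


print(tally_counters(SAMPLE_CSV, "event_counts"))
```

To process a real file, read its contents and pass the text in. If histogram and counter rows share one file, the rows would need to be separated by pool before applying a single column layout.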
Use Cases:
Detailed timeline analysis
Custom data processing and visualization
Integration with external analytics tools
Machine learning on system behavior patterns
Debugging specific events or anomalies
Workflow Example
Step 1: Configure Your Job
Create or edit your job’s meta.json:
{
  "name": "my_federated_job",
  "resource_spec": {},
  "min_clients": 2,
  "stats_pool_config": {
    "save_pools": ["*"]
  }
}
Step 2: Submit the Job
> submit_job my_job_folder
Step 3: Run the Job
The job executes normally, with statistics being collected in the background.
Step 4: Retrieve Statistics After Job Completion
> download_job job_abc-123-def
Step 5: Analyze the Output
Navigate to the downloaded job workspace and examine:
cd downloaded_job/workspace/
# View summary statistics
cat stats_pool_summary.json
# Analyze raw records
cat stats_pool_records.csv
Step 6: Use Statistics for Analysis
import json
import pandas as pd

# Load summary data (histogram summaries per pool)
with open('stats_pool_summary.json', 'r') as f:
    summary = json.load(f)

# Load raw, timestamped records
records = pd.read_csv('stats_pool_records.csv')

# Analyze timing patterns for a single pool
timing_data = records[records['pool_name'] == 'request_processing']
print(f"Average: {timing_data['value'].mean()}")
print(f"95th percentile: {timing_data['value'].quantile(0.95)}")
Using the Stats Viewer Tool
NVFLARE provides a convenient command-line tool called stats_viewer for interactively exploring statistics files. This tool allows you to view and analyze the stats_pool_summary.json files without writing custom scripts.
Starting the Stats Viewer:
python -m nvflare.fuel.f3.qat.stats_viewer -f stats_pool_summary.json
This launches an interactive shell where you can explore the statistics data.
Available Commands:
The stats viewer provides the following commands:
list_pools: Display all available statistics pools with their types and descriptions
show_pool <pool_name> [mode]: Display detailed statistics for a specific pool
pool_name: Name of the pool to display
mode (optional): Histogram display mode, one of: count, total, min, max, avg
help or ?: List available commands
bye: Exit the stats viewer
Example Session:
$ python -m nvflare.fuel.f3.qat.stats_viewer -f stats_pool_summary.json
Type help or ? to list commands.
> list_pools
Name                 Type    Description
-------------------- ------- ------------------------------
request_processing   hist    Request processing time
request_response     hist    Request-response round trip
msg_sizes            hist    Message size distribution
> show_pool request_processing avg
Range        Count   Average
------------ ------- -----------
0-10ms       150     5.2ms
10-100ms     80      45.3ms
100-1000ms   20      425.8ms
> show_pool msg_sizes count
Range        Count
------------ -------
0-1KB        200
1KB-10KB     150
10KB-100KB   50
> bye
Server-Side vs Client-Side Statistics:
The stats_viewer tool can analyze statistics from both server and client sides:
Server-side statistics: Available in the server’s job workspace after job completion. Can be retrieved using the download_job command.
Client-side statistics: Currently stored locally on each client site in their respective job workspaces.
Note
Currently, client-side statistics files are not automatically sent to the server after job completion. To analyze client statistics, you need to access the stats_pool_summary.json file directly on each client site’s job workspace.
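Once summary files have been collected by hand from several sites, their bins can be combined for a cross-site view. The sketch below is one possible approach, not an NVFLARE feature. The function name merge_bins is hypothetical, the per-site numbers are made up for illustration, and it assumes all sites use the same bin ranges for a given pool.

```python
from typing import Dict, List


def merge_bins(per_site_bins: List[List[dict]]) -> List[dict]:
    """Combine bins with the same range label across several sites."""
    merged: Dict[str, dict] = {}
    for bins in per_site_bins:
        for b in bins:
            m = merged.get(b["range"])
            if m is None:
                merged[b["range"]] = dict(b)  # copy so the inputs stay untouched
            else:
                m["count"] += b["count"]
                m["total"] += b["total"]
                m["min"] = min(m["min"], b["min"])
                m["max"] = max(m["max"], b["max"])
    return list(merged.values())


# Illustrative (made-up) bins for the same pool on two sites.
site1_bins = [{"range": "0-10", "count": 150, "total": 750.5, "min": 0.1, "max": 9.9}]
site2_bins = [
    {"range": "0-10", "count": 50, "total": 200.0, "min": 0.05, "max": 8.0},
    {"range": "10-100", "count": 5, "total": 150.0, "min": 12.0, "max": 60.0},
]
for row in merge_bins([site1_bins, site2_bins]):
    print(row)
```

Because each bin stores count, total, min, and max, the merge is exact for those aggregates. Percentiles cannot be recovered this way and would need the raw records.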
Common Pool Names
The following pools are commonly available in NVFLARE jobs:
Communication Pools
request_processing: Time spent processing requests
request_response: End-to-end request-response times
msg_sizes: Distribution of message sizes
msg_travel_time: Message transmission times
Job-Specific Pools
Different jobs may create custom pools based on their workflows. Use the list_pools command during job execution to discover available pools:
> cells
server.job_abc-123
site1.job_abc-123
> list_pools server.job_abc-123
Integration with Monitoring
Statistics pool data complements external monitoring systems:
Statistics Pools: Detailed, job-specific metrics saved with job artifacts
External Monitoring (Prometheus/Grafana): Real-time system-wide monitoring
Use both approaches together:
External monitoring for real-time alerting and dashboards
Statistics pool saving for detailed post-job analysis and historical records
See Monitoring for information on setting up external monitoring.
Available Commands
cells
Description: Lists all active cells in the CellNet system with their FQCNs (Fully Qualified Cell Names). This command is essential for discovering available targets to use with other diagnostic commands.
Usage:
cells
Parameters:
None. This command takes no parameters.
Output:
Displays a list of all active cells in the system, showing each cell’s FQCN on a separate line, followed by a summary line showing the total number of valid cells.
Example:
> cells
Example Output:
server
site1
site2
site3
server.abc-123-def
site1.abc-123-def
site2.abc-123-def
Total Cells: 7
Understanding the Output:
The cells listed include:
Parent Cells: Base cells for each site (e.g., server, site1, site2)
The server’s parent cell is always named server
Client parent cells use their site names
Job Cells: Cells created for active jobs (e.g., server.abc-123-def, site1.abc-123-def)
Format: <site_name>.<job_id>
Created when a job is deployed
Removed when the job completes
Relay Cells: In hierarchical deployments, relay nodes (e.g., relay1, relay1.site1)
Intermediate nodes in the communication hierarchy
Can have their own job cells when jobs are running
Use Cases:
Discover Available Targets: Find valid FQCNs to use with list_pools, show_pool, msg_stats, and other diagnostic commands
Verify System Topology: Confirm all expected sites are connected and active
Monitor Job Cells: See which jobs are currently running by identifying job cell FQCNs
Troubleshoot Connectivity: Identify missing or disconnected cells
Understand Hierarchy: In hierarchical deployments, visualize the cell structure
Examples with Follow-up Commands:
After running cells to discover targets, you can use the FQCNs with other commands:
# First, discover all cells
> cells
server
site1
site2
server.job123
site1.job123
Total Cells: 5
# Then query specific cells
> msg_stats server
> msg_stats site1
> msg_stats server.job123
> list_pools site1.job123
Interpreting Different Cell Types:
Server Parent Cell (server):
Always present when the FL system is running
Handles administrative operations
Parent for all job cells on the server
Client Parent Cells (site1, site2, etc.):
One per connected FL client site
Active as long as the client is connected
Persist across multiple jobs
Job Server Cell (server.<job_id>):
Created when a job is deployed on the server
Contains job-specific server workflows
Removed when the job completes
Job Client Cells (<site_name>.<job_id>):
One per client participating in a job
Execute the client-side job logic
Communicate with the corresponding server job cell
Hierarchical Cells (relay1, relay1.site1):
Relay nodes in hierarchical deployments
Can be nested (e.g., relay1.relay2.site1)
Help manage large-scale deployments
Tips:
Run cells before other diagnostic commands to identify valid targets
Compare cell lists over time to track system changes
If expected cells are missing, check connectivity and site status
Job cells appear when jobs start and disappear when they complete
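When the cell list is long, saved output of the cells command can be grouped by top-level cell to spot sites with no nested cells. This is a sketch under the assumption that the output looks like the example above, with one FQCN per line and a trailing "Total Cells:" line. The function name group_cells is hypothetical. It relies only on the dot-separated FQCN hierarchy and does not try to tell job cells from relay children.

```python
from collections import defaultdict
from typing import Dict, List

SAMPLE_OUTPUT = """server
site1
site2
site3
server.abc-123-def
site1.abc-123-def
site2.abc-123-def
Total Cells: 7
"""


def group_cells(output: str) -> Dict[str, List[str]]:
    """Map each top-level cell to the FQCNs nested under it."""
    tree: Dict[str, List[str]] = defaultdict(list)
    for line in output.splitlines():
        fqcn = line.strip()
        if not fqcn or fqcn.startswith("Total Cells:"):
            continue
        root, _, rest = fqcn.partition(".")
        children = tree[root]  # creates the root entry even if it has no children
        if rest:
            children.append(fqcn)
    return dict(tree)


for root, children in group_cells(SAMPLE_OUTPUT).items():
    print(root, children)
```

In the sample, site3 ends up with an empty list, which in this example means it has a parent cell but no job cell.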
list_pools
Description: Lists all statistics pools available on a target cell.
Usage:
list_pools target
Parameters:
target - The FQCN (Fully Qualified Cell Name) of the target cell to query (e.g., “server”, “client1”, “server.job_id”)
Output:
Displays a table with three columns:
pool - The name of the statistics pool
type - The type of pool (“hist” for histogram, “counter” for counter)
description - A description of what the pool tracks
Example:
> list_pools server
Example Output:
+------------------+----------+--------------------------------+
| pool             | type     | description                    |
+------------------+----------+--------------------------------+
| msg_travel_time  | hist     | Message travel time in seconds |
| msg_sizes        | hist     | Message size distribution      |
| request_counts   | counter  | Request counts by channel      |
+------------------+----------+--------------------------------+
Use Cases:
Discover available statistics pools on a cell
Verify that expected statistics tracking is configured
Identify pools for detailed inspection with show_pool
show_pool
Description: Shows detailed statistics for a specific pool on a target cell.
Usage:
show_pool target pool_name [mode]
Parameters:
target - The FQCN of the target cell to query
pool_name - The name of the statistics pool to display
mode - (Optional) The display mode for histogram pools. Valid values:
count - Show the count of values in each bin (default)
percent - Show the percentage of values in each bin
avg - Show the average value in each bin
min - Show the minimum value in each bin
max - Show the maximum value in each bin
Output:
For histogram pools, displays a table showing the distribution of values across bins. The exact columns depend on the pool type and configuration.
For counter pools, displays a table with counter names and their current values.
Examples:
# Show message size distribution with counts
> show_pool server msg_sizes count
# Show message timing with averages
> show_pool server msg_travel_time avg
# Show message size percentages
> show_pool site1 msg_sizes percent
Example Output (Count Mode):
+---------------+-------+
| Range         | Count |
+---------------+-------+
| 0-1KB         | 150   |
| 1KB-10KB      | 450   |
| 10KB-100KB    | 80    |
| 100KB-1MB     | 20    |
| >1MB          | 5     |
+---------------+-------+
Example Output (Average Mode):
+---------------+-----------+
| Range         | Avg (sec) |
+---------------+-----------+
| 0-10ms        | 5.2e-03   |
| 10ms-100ms    | 4.5e-02   |
| 100ms-1s      | 3.2e-01   |
| >1s           | 2.1e+00   |
+---------------+-----------+
Use Cases:
Analyze message size distributions to identify outliers
Monitor timing characteristics of requests
Compare statistics across different cells
Identify performance bottlenecks or unusual patterns
msg_stats
Description: Shows message request statistics for a target cell. This is a convenience command that displays the pre-configured message statistics pool.
Usage:
msg_stats target [mode]
Parameters:
target - The FQCN of the target cell to query
mode - (Optional) The display mode. Valid values:
count - Show the count of messages (default)
percent - Show the percentage of messages
avg - Show the average message size or timing
min - Show the minimum values
max - Show the maximum values
Output:
Displays statistics about request messages, typically showing distributions of message sizes and/or timing information. The exact format depends on how the message statistics pool is configured in the system.
Examples:
# Show message counts
> msg_stats server
# Show average message characteristics
> msg_stats server avg
# Show maximum values
> msg_stats client1 max
Example Output:
Message Statistics for server:
+---------------+-------+----------+
| Size Range    | Count | Avg Time |
+---------------+-------+----------+
| 0-1KB         | 245   | 12ms     |
| 1KB-10KB      | 180   | 25ms     |
| 10KB-100KB    | 45    | 150ms    |
| >100KB        | 10    | 500ms    |
+---------------+-------+----------+
Use Cases:
Quick overview of message traffic patterns
Monitor communication health
Identify unusual message patterns
Baseline system performance characteristics
Common Workflows
Discovering Available Targets
Before using diagnostic commands, discover available cells:
List all active cells:
> cells
Identify target cells of interest:
Parent cells for overall system monitoring (server, site1, etc.)
Job cells for job-specific monitoring (server.job_id, site1.job_id)
Relay cells in hierarchical deployments
Verify cell connectivity:
Check that expected cells appear in the list. Missing cells may indicate connectivity issues.
Investigating Communication Issues
When investigating communication problems between cells:
Discover active cells:
> cells
List available pools:
> list_pools server
> list_pools client1
Check message statistics:
> msg_stats server count
> msg_stats client1 count
Examine specific pools:
> show_pool server msg_travel_time avg
> show_pool client1 msg_sizes percent
Performance Analysis
To analyze system performance characteristics:
Check message timing distribution:
> show_pool server msg_travel_time count
> show_pool server msg_travel_time avg
Analyze message size patterns:
> show_pool server msg_sizes count
> show_pool server msg_sizes max
Compare across cells:
> msg_stats server avg
> msg_stats client1 avg
> msg_stats client2 avg
Monitoring Job Execution
During job execution, monitor communication patterns:
Identify job cells:
> cells
# Look for cells with format: <site_name>.<job_id>
Check job cell statistics:
> list_pools server.job_abc123
> msg_stats server.job_abc123 count
Compare parent and job cells:
> msg_stats server avg
> msg_stats server.job_abc123 avg
Statistical Modes Explained
The different statistical modes provide different views of the data:
count
Shows the number of data points in each bin. This is useful for understanding the distribution and identifying where most values fall.
Use case: “How many messages are in the 1KB-10KB range?”
percent
Shows what percentage of all data points fall in each bin. This normalizes the distribution and makes it easier to compare across different time periods or cells.
Use case: “What percentage of messages are larger than 100KB?”
avg
Shows the average value of data points within each bin. This helps understand the typical characteristics within each range.
Use case: “For messages in the 10ms-100ms latency range, what’s the typical latency?”
min
Shows the minimum value encountered in each bin. Useful for understanding best-case scenarios.
Use case: “What’s the fastest response time we’ve seen in the 1KB-10KB message range?”
max
Shows the maximum value encountered in each bin. Useful for identifying worst-case scenarios or outliers.
Use case: “What’s the longest latency we’ve seen for small messages?”
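The five modes can be derived from the same per-bin aggregates. The sketch below shows one way to compute each view from bins shaped like those in stats_pool_summary.json. The function name bin_view is hypothetical and this is not the code behind show_pool. The sample bins are taken from the summary example earlier in this page.

```python
from typing import List, Optional, Tuple

SAMPLE_BINS = [
    {"range": "0-10", "count": 150, "total": 750.5, "min": 0.1, "max": 9.9},
    {"range": "10-100", "count": 80, "total": 4200.0, "min": 10.2, "max": 99.8},
    {"range": "100-1000", "count": 20, "total": 12000.0, "min": 105.0, "max": 950.0},
]


def bin_view(bins: List[dict], mode: str = "count") -> List[Tuple[str, Optional[float]]]:
    """Return (range, value) rows for one statistical mode."""
    all_count = sum(b["count"] for b in bins)
    rows = []
    for b in bins:
        if mode == "count":
            value = b["count"]
        elif mode == "percent":
            # Share of all data points that fall in this bin.
            value = 100.0 * b["count"] / all_count if all_count else 0.0
        elif mode == "avg":
            # Mean of the values inside this bin only.
            value = b["total"] / b["count"] if b["count"] else None
        elif mode in ("min", "max"):
            value = b[mode]
        else:
            raise ValueError(f"invalid mode: {mode!r}")
        rows.append((b["range"], value))
    return rows


for mode in ("count", "percent", "avg", "min", "max"):
    print(mode, bin_view(SAMPLE_BINS, mode))
```

Note that percent is computed against the total count across all bins, while avg is local to each bin, which is why the two answer different questions about the same data.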
Target Cell Addressing
The target parameter in these commands uses FQCN (Fully Qualified Cell Name) addressing:
Server Cell
> msg_stats server
Client Cell
> msg_stats site1
> msg_stats client_alpha
Job Cells
When a job is running, each site has a dedicated job cell with FQCN in the format <site_name>.<job_id>:
> msg_stats server.abc-123-def
> msg_stats site1.abc-123-def
Hierarchical Cells
In hierarchical deployments with relays:
> msg_stats relay1
> msg_stats relay1.site1
See Hierarchical Communication and Clients for more information on communication hierarchies.
Tips and Best Practices
Regular Monitoring: Establish baseline statistics during normal operation to help identify anomalies.
Compare Cells: Compare statistics across different cells to identify inconsistencies or issues specific to certain sites.
Use Different Modes: Switch between statistical modes to get different insights into the same data.
Track Over Time: Run commands periodically and save output to track trends over time.
Job-Specific Analysis: Monitor job cells separately from parent cells to understand job-specific communication patterns.
Correlate with Logs: Use diagnostic commands in conjunction with log analysis for comprehensive troubleshooting.
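For the "Track Over Time" tip, two saved snapshots of the same pool can be compared bin by bin. This is a minimal sketch. The function name count_deltas is hypothetical, the snapshot numbers are made up for illustration, and the bins are assumed to follow the summary file layout with matching range labels.

```python
from typing import Dict, List


def count_deltas(before: List[dict], after: List[dict]) -> Dict[str, int]:
    """Change in per-bin counts between two snapshots of one pool."""
    previous = {b["range"]: b["count"] for b in before}
    return {b["range"]: b["count"] - previous.get(b["range"], 0) for b in after}


# Illustrative (made-up) snapshots taken at two points in time.
snapshot_1 = [{"range": "0-10", "count": 150}, {"range": "10-100", "count": 80}]
snapshot_2 = [{"range": "0-10", "count": 160}, {"range": "10-100", "count": 130}]
print(count_deltas(snapshot_1, snapshot_2))
```

A jump concentrated in a slower or larger bin between snapshots is the kind of shift that a baseline makes visible.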
Troubleshooting
Command Not Found
If diagnostic commands are not available:
Verify that the NetManager component is configured with diagnose=True
Check that you have appropriate permissions to run these commands
Ensure you’re using a version of NVIDIA FLARE that includes these commands
Cells Command Shows Fewer Cells Than Expected
If the cells command doesn’t show all expected cells:
Check connectivity: Verify that all sites are connected to the server
Check site status: Use check_status to see if clients are properly connected
Wait for initialization: Sites may take a few moments to appear after starting
Check logs: Review server and client logs for connection errors
Verify network: Ensure there are no network issues or firewall blocks
Cells Command Shows Old Job Cells
If job cells remain listed after a job completes:
There may be a delay in cleanup; wait a few moments and run cells again
Check if the job is actually still running with list_jobs
Review logs for any errors during job shutdown
Invalid Mode Error
If you receive an “invalid mode” error:
Ensure you’re using one of the valid modes: count, percent, avg, min, max
Check for typos in the mode parameter
Note that mode is case-sensitive (use lowercase)
Target Not Found
If the target cell cannot be reached:
Verify the FQCN is correct
Check that the target cell is running and connected
Use the cells command to list available cells
Pool Does Not Exist
If you receive a “pool does not exist” error:
Use list_pools to see available pools on that cell
Verify the pool name is spelled correctly
Note that pool names are case-sensitive
See Also
CellNet Architecture - Learn about FLARE’s communication layer
Communication Configuration - Configure communication settings
Monitoring - Set up external monitoring with Prometheus and Grafana
Hierarchical Communication and Clients - Understand hierarchical cell topologies