Memory Management
This guide describes memory management techniques for long-running federated learning jobs using Python, PyTorch, and glibc/jemalloc.
Overview
Federated learning jobs can run for hours or days. Without proper memory management, RSS (Resident Set Size) can grow continuously due to:
- Long-lived references that keep large model params alive between rounds (primary cause on clients)
- glibc memory arena fragmentation (freed memory not returned to the OS)
- PyTorch CUDA cache retention
- Cyclic references delaying Python garbage collection (supplementary; usually not the main driver)
NVFlare provides utilities and configuration options to manage memory effectively on both server and client sides. The framework automatically detects the memory allocator in use (glibc or jemalloc) and adapts its cleanup strategy accordingly.
Allocator Support
NVFlare supports two memory allocators:
- glibc (default on most Linux): Uses malloc_trim() to release free heap pages to the OS. Requires MALLOC_ARENA_MAX for optimal memory behavior.
- jemalloc (recommended for PyTorch): Uses auto-decay for memory management. Configure via MALLOC_CONF. No malloc_trim() calls are needed (jemalloc handles this automatically).
NVFlare automatically detects which allocator is in use at runtime.
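One way such runtime detection can work on Linux is to look at the shared libraries mapped into the current process. The sketch below illustrates that idea only; it is not NVFlare's detection logic, and classify_allocator and detect_allocator_sketch are hypothetical names introduced for this example.

```python
from pathlib import Path


def classify_allocator(maps_text: str) -> str:
    """Classify the allocator from the text of /proc/<pid>/maps."""
    # A preloaded jemalloc overrides malloc even though glibc is still
    # mapped, so jemalloc has to be checked first.
    if "jemalloc" in maps_text:
        return "jemalloc"
    # glibc is mapped as libc.so.6 (older releases: libc-2.xx.so).
    if "libc.so.6" in maps_text or "/libc-2." in maps_text:
        return "glibc"
    return "unknown"


def detect_allocator_sketch(maps_path: str = "/proc/self/maps") -> str:
    """Return "glibc", "jemalloc", or "unknown" for the current process."""
    try:
        text = Path(maps_path).read_text(errors="replace")
    except OSError:
        # No /proc filesystem (macOS, Windows) or the file is unreadable.
        return "unknown"
    return classify_allocator(text)


if __name__ == "__main__":
    print(detect_allocator_sketch())
```

This heuristic misses statically linked allocators, which is one reason to treat "unknown" as a valid result and fall back to gc.collect() only.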
Platform Compatibility
Not all memory management features work on all platforms. The table below summarizes compatibility:
| Feature | Linux/glibc | Linux/musl | macOS |
|---|---|---|---|
| gc.collect() | ✓ | ✓ | ✓ |
| malloc_trim() | ✓ | ✗ | ✗ |
| MALLOC_ARENA_MAX | ✓ | ✗ | ✗ |
| torch.cuda.empty_cache() | ✓ | ✓ | ✓ |
Notes:
- Linux/glibc: Standard Linux distributions (Ubuntu, RHEL, Debian, etc.)
- Linux/musl: Alpine Linux and other musl-based distributions
- macOS: malloc_trim() is silently skipped (safe no-op)
Warning
For maximum memory efficiency, use Linux with glibc. Alpine Linux (musl) and
macOS still benefit from client-side parameter reference release (and optional
gc.collect()), but cannot release fragmented heap memory back to the OS
via malloc_trim().
Environment Variables
Set these environment variables before starting NVFlare processes:
Client (Training Nodes)
export MALLOC_ARENA_MAX=2
Why: Clients typically have limited CPU memory. Setting MALLOC_ARENA_MAX=2
prevents arena explosion and reduces memory fragmentation.
Server (Aggregation Node)
export MALLOC_ARENA_MAX=4
Why: Servers are CPU memory heavy (4-7× model size) with multi-threaded networking.
MALLOC_ARENA_MAX=4 balances throughput vs memory. Use 8 for high parallelism.
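glibc reads MALLOC_ARENA_MAX while the process is starting, so the variable generally has to be present in the environment of the process being launched; assigning it from inside an already-running Python interpreter is too late for that interpreter. The sketch below shows a launcher-side helper that applies the role-based values from this section. env_for_role is a hypothetical name, not an NVFlare API.

```python
import os
from typing import Mapping, Optional

# Values recommended in this guide; adjust for your deployment.
ARENA_MAX_BY_ROLE = {"client": "2", "server": "4"}


def env_for_role(role: str, base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with MALLOC_ARENA_MAX set for `role`.

    An explicit MALLOC_ARENA_MAX already present in the environment wins,
    so operators can still override the default (for example, 8 on a
    highly parallel server).
    """
    env = dict(os.environ if base_env is None else base_env)
    env.setdefault("MALLOC_ARENA_MAX", ARENA_MAX_BY_ROLE[role])
    return env


if __name__ == "__main__":
    # Pass the result as env= to subprocess.run / subprocess.Popen when
    # launching the NVFlare process.
    print(env_for_role("client")["MALLOC_ARENA_MAX"])
```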
Server-Side Memory Cleanup
The FedAvg controller supports automatic memory cleanup via the server_memory_gc_rounds parameter.
Server-Side Configuration
from nvflare.recipe.fedavg import FedAvgRecipe

recipe = FedAvgRecipe(
    name="my_job",
    min_clients=4,
    num_rounds=100,
    train_script="client.py",
    server_memory_gc_rounds=5,  # Cleanup every 5 rounds
)
Values:
- 0 = Disabled (default for FedAvg-based recipes)
- 1 = Cleanup every round (default for FedOpt, FedAvgHE, and Cyclic recipes)
- 5 = Cleanup every 5 rounds (recommended for server)
Server Cleanup Effects
When enabled, at the end of every N rounds:
- Runs Python garbage collection (gc.collect())
- Returns free heap pages to OS (malloc_trim(), Linux/glibc only)
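The "every N rounds" rule can be sketched as a small predicate inside a round loop. This is an illustration under the assumption that rounds are numbered from 0; it is not the FedAvg controller's code, and should_cleanup and run_rounds are hypothetical names.

```python
import gc


def should_cleanup(current_round: int, gc_rounds: int) -> bool:
    """True when cleanup is due at the end of `current_round`.

    Assumption: rounds are numbered from 0, so with gc_rounds=5 cleanup
    falls on rounds 4, 9, 14, ... A value of 0 disables cleanup.
    """
    return gc_rounds > 0 and (current_round + 1) % gc_rounds == 0


def run_rounds(num_rounds: int, gc_rounds: int) -> list:
    """Run a dummy round loop and return the rounds that triggered cleanup."""
    cleaned = []
    for current_round in range(num_rounds):
        # ... aggregate client updates for this round ...
        if should_cleanup(current_round, gc_rounds):
            gc.collect()  # a heap trim would follow here on Linux/glibc
            cleaned.append(current_round)
    return cleaned


if __name__ == "__main__":
    print(run_rounds(10, 5))
```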
Performance Impact
Memory cleanup has minimal overhead in typical federated learning workloads:
| Operation | Typical Duration | Notes |
|---|---|---|
| gc.collect() | 10-500 ms | Depends on Python object count |
| malloc_trim() | < 1 ms | Very fast (page table ops) |
Overhead analysis:
- Training round duration: Typically 30 seconds to 10+ minutes
- Cleanup duration: 10-500 ms total
- Overhead per round: Usually < 1%
With server_memory_gc_rounds=5:
- Cleanup runs once every 5 rounds
- Total overhead: < 0.2% of training time
Recommendation: Using server_memory_gc_rounds=5 provides good memory
management with negligible performance impact. Only disable (=0) if you’ve
measured and confirmed RSS is stable without cleanup.
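The overhead figures follow from dividing cleanup time by the training time between cleanups. The short calculation below makes that explicit; cleanup_overhead is a hypothetical helper, and the sample durations are illustrative picks from the ranges quoted in this section rather than measurements.

```python
def cleanup_overhead(cleanup_s: float, round_s: float, gc_rounds: int = 1) -> float:
    """Fraction of training time spent in cleanup.

    cleanup_s: duration of one cleanup pass, in seconds
    round_s:   duration of one training round, in seconds
    gc_rounds: cleanup interval in rounds (0 = disabled)
    """
    if gc_rounds <= 0:
        return 0.0
    return cleanup_s / (gc_rounds * round_s)


# Typical case: 300 ms cleanup, 60 s rounds.
print(f"{cleanup_overhead(0.3, 60.0):.2%}")     # 0.50%
print(f"{cleanup_overhead(0.3, 60.0, 5):.2%}")  # 0.10%

# Extremes of the quoted ranges: 500 ms cleanup, 30 s rounds.
print(f"{cleanup_overhead(0.5, 30.0):.2%}")     # 1.67%
print(f"{cleanup_overhead(0.5, 30.0, 5):.2%}")  # 0.33%
```

The quoted "< 1%" and "< 0.2%" figures therefore hold for typical workloads; with the slowest cleanup and the shortest rounds the overhead is somewhat higher, though still small.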
Client-Side Memory Cleanup
The primary client-side memory control is clear_cache=True (the default) in
flare.send(), which immediately releases parameter references after serialization.
In CPython, this reference release is what actually frees large tensor/array memory —
no explicit GC call is needed for that.
client_memory_gc_rounds and cuda_empty_cache provide supplemental cleanup on
top of the reference release: periodic gc.collect() for cyclic objects,
malloc_trim() to return freed pages to the OS, and optional CUDA cache clearing.
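The difference between reference release and garbage collection can be observed directly in CPython. The sketch below uses a small stand-in class (Params, hypothetical) and weak references to show that an object without cycles is freed the moment its last reference goes away, while an object caught in a reference cycle survives until gc.collect() runs.

```python
import gc
import weakref


class Params:
    """Stand-in for a large parameter container."""

    def __init__(self, size: int) -> None:
        self.data = bytearray(size)


def demo() -> dict:
    was_enabled = gc.isenabled()
    gc.disable()  # keep the cyclic collector from running on its own
    try:
        # Case 1: no cycle. Dropping the last reference frees the object
        # immediately through reference counting.
        params = Params(1024)
        probe = weakref.ref(params)
        del params
        freed_by_refcount = probe() is None

        # Case 2: the object refers to itself, so its reference count
        # never reaches zero and only the cyclic collector can free it.
        node = Params(1024)
        node.me = node
        probe = weakref.ref(node)
        del node
        cycle_alive_before_gc = probe() is not None
        gc.collect()
        cycle_alive_after_gc = probe() is not None
    finally:
        if was_enabled:
            gc.enable()
    return {
        "freed_by_refcount": freed_by_refcount,
        "cycle_alive_before_gc": cycle_alive_before_gc,
        "cycle_alive_after_gc": cycle_alive_after_gc,
    }


if __name__ == "__main__":
    print(demo())
```

This behavior is specific to CPython's reference counting; other Python implementations may defer reclamation.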
Client-Side Configuration
from nvflare.recipe.fedavg import FedAvgRecipe

recipe = FedAvgRecipe(
    name="my_job",
    min_clients=4,
    num_rounds=100,
    train_script="client.py",
    # Server-side cleanup
    server_memory_gc_rounds=5,
    # Client-side cleanup
    client_memory_gc_rounds=1,  # Cleanup every round
    cuda_empty_cache=True,      # Clear GPU cache
)
Swarm Learning Configuration
Swarm Learning uses memory_gc_rounds (not client_memory_gc_rounds) and
cuda_empty_cache on SwarmLearningRecipe:
from nvflare.app_opt.pt.recipes.swarm import SwarmLearningRecipe

recipe = SwarmLearningRecipe(
    name="swarm_job",
    model=MyModel(),
    min_clients=3,
    num_rounds=10,
    train_script="train.py",
    memory_gc_rounds=1,   # Cleanup every round on trainer and aggregator roles
    cuda_empty_cache=True,
    round_timeout=3600,   # P2P model-transfer ACK budget; increase for large models
)
Note
memory_gc_rounds and cuda_empty_cache are top-level Swarm recipe arguments.
Do not pass them inside train_args (they are reserved keys).
Parameters:
- client_memory_gc_rounds: Run supplemental cleanup (gc.collect() + malloc_trim()) every N rounds on client (0 = disabled). The primary cleanup is reference release via clear_cache=True in flare.send().
- cuda_empty_cache: If True, call torch.cuda.empty_cache() on cleanup
- memory_gc_rounds (Swarm): Run supplemental cleanup every N rounds (0 = disabled)
When to use client_memory_gc_rounds > 1
Use values greater than 1 only when memory is already stable and you are tuning
for lower cleanup overhead:
- RSS trend is flat/bounded across rounds
- No CPU/GPU OOM pressure
- You want slightly better throughput/latency
Start with client_memory_gc_rounds=1, then tune to 2 and optionally 5 while monitoring RSS.
If RSS begins to climb or OOM risk increases, revert to 1.
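Monitoring RSS while tuning can be as simple as sampling it once per round and comparing early rounds with late ones. The sketch below reads the resident set size from /proc (Linux only) and applies a crude stability check. Both helpers are hypothetical, and the 5% tolerance is an arbitrary starting point, not an NVFlare default.

```python
import os
from statistics import mean
from typing import Optional, Sequence


def current_rss_bytes() -> Optional[int]:
    """Resident set size of this process, or None where /proc is unavailable."""
    try:
        with open("/proc/self/statm") as f:
            # Second field of statm is the resident set size in pages.
            resident_pages = int(f.read().split()[1])
    except (OSError, IndexError, ValueError):
        return None
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def rss_is_stable(samples: Sequence[float], tolerance: float = 0.05) -> bool:
    """True if the later half of the samples has not grown beyond `tolerance`.

    Compares the mean of the second half of the per-round samples with the
    mean of the first half. Needs at least four samples to say anything.
    """
    if len(samples) < 4:
        return False
    half = len(samples) // 2
    early = mean(samples[:half])
    late = mean(samples[half:])
    return late <= early * (1 + tolerance)


if __name__ == "__main__":
    print(current_rss_bytes())
    print(rss_is_stable([100, 101, 99, 100]))
    print(rss_is_stable([100, 120, 140, 160]))
```

If the check reports growth after raising the interval, that is the signal to go back to client_memory_gc_rounds=1.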
Client Cleanup Effects
After each flare.send() on the client (with default clear_cache=True):
- FLARE releases references to sent and received model params.
- In CPython, this reference release is the primary mechanism that reclaims large tensors/arrays.
Supplemental cleanup is also available and configurable:
- Runs Python garbage collection (gc.collect()), mainly for cyclic references.
- For glibc: returns free heap pages to OS (malloc_trim()).
- For jemalloc: relies on auto-decay (no manual action needed).
- Optionally clears PyTorch CUDA cache.
Note
RSS may not drop immediately even after object release because allocators can retain memory for reuse. A flat RSS trend across rounds is typically the expected healthy behavior.
Note
The lifecycle handling is transparent to user training scripts. No code changes are required
in train.py for default behavior.
Client Training Process Memory Cleanup
For subprocess-mode jobs (launch_external_process=True), memory cleanup covers
every stage of the client training process, not just the training subprocess.
After each round result is forwarded, the same GC and heap-trim cycle runs in
each stage, which keeps RSS from growing over long jobs.
The cleanup frequency and GPU cache behavior are controlled by the same
memory_gc_rounds / client_memory_gc_rounds and cuda_empty_cache parameters
already documented above.
RSS profiling across all stages can be enabled with the environment variable
NVFLARE_CLIENT_MEMORY_PROFILE=1, which emits per-stage RSS log markers after each
send and receive for easy grep-based analysis.
External Process Settings
For external process execution (launch_external_process=True), memory settings
are passed via environment variables:
- NVFLARE_CLIENT_MEMORY_GC_ROUNDS: Cleanup interval
- NVFLARE_CUDA_EMPTY_CACHE: GPU cache cleanup (true/false)
- NVFLARE_CLIENT_MEMORY_PROFILE: Set to 1 to enable per-round RSS logging
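For debugging a launch script, it can help to check how these variables would be interpreted. The sketch below parses them from a mapping. The variable names come from this section, but the parsing rules and the defaults used when a variable is unset are assumptions, and MemorySettings and read_memory_settings are hypothetical names rather than NVFlare APIs.

```python
import os
from typing import Mapping, NamedTuple, Optional


class MemorySettings(NamedTuple):
    gc_rounds: int          # cleanup interval in rounds (0 = disabled)
    cuda_empty_cache: bool  # clear the GPU cache on cleanup
    profile: bool           # emit RSS log markers


def read_memory_settings(env: Optional[Mapping[str, str]] = None) -> MemorySettings:
    """Parse the memory-related variables from `env` (default: os.environ).

    Assumed defaults when a variable is unset: interval 0, cache clearing
    off, profiling off.
    """
    env = os.environ if env is None else env
    return MemorySettings(
        gc_rounds=int(env.get("NVFLARE_CLIENT_MEMORY_GC_ROUNDS", "0")),
        cuda_empty_cache=env.get("NVFLARE_CUDA_EMPTY_CACHE", "false").strip().lower() == "true",
        profile=env.get("NVFLARE_CLIENT_MEMORY_PROFILE", "0").strip() == "1",
    )


if __name__ == "__main__":
    print(read_memory_settings())
```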
Recommended Settings
| Role | server_memory_gc_rounds | client_memory_gc_rounds | MALLOC_ARENA_MAX | cuda_empty_cache |
|---|---|---|---|---|
| Server | 5 | N/A | 4 | N/A |
| Client | N/A | 1 | 2 | True (for GPU) |
Using jemalloc
For PyTorch workloads, jemalloc is recommended over glibc malloc. NVFlare startup
scripts preload jemalloc only when explicitly enabled via
NVFLARE_ENABLE_JEMALLOC_PRELOAD=true and jemalloc is available.
Startup Script
The generated sub_start.sh script includes opt-in jemalloc preload:
# Enable jemalloc preload only when opted in
if [ "${NVFLARE_ENABLE_JEMALLOC_PRELOAD:-false}" = "true" ]; then
    for JEMALLOC in /usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
                    /usr/lib64/libjemalloc.so.2 \
                    /usr/local/lib/libjemalloc.so; do
        if [ -f "$JEMALLOC" ]; then
            export LD_PRELOAD="${LD_PRELOAD:+$LD_PRELOAD:}$JEMALLOC"
            export MALLOC_CONF="${MALLOC_CONF:-dirty_decay_ms:5000,muzzy_decay_ms:5000}"
            break
        fi
    done
fi
Installing jemalloc
# Ubuntu/Debian
apt-get install libjemalloc2
# RHEL/CentOS
yum install jemalloc
API Reference
cleanup_memory
from nvflare.fuel.utils.memory_utils import cleanup_memory
cleanup_memory(cuda_empty_cache=True)
Signature: cleanup_memory(cuda_empty_cache: bool = False) -> None
Performs allocator-aware memory cleanup:
- Runs gc.collect()
- For glibc: Calls malloc_trim(0)
- For jemalloc: Relies on auto-decay (no action needed)
- Optionally calls torch.cuda.empty_cache()
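The steps above can be sketched as a single function. This is an illustration of the described sequence, not the source of cleanup_memory: the allocator name and the trim callable are passed in explicitly so the control flow is visible, and cleanup_memory_sketch is a hypothetical name.

```python
import gc
from typing import Callable, Optional, Tuple


def cleanup_memory_sketch(
    allocator: str,
    trim: Optional[Callable[[], Optional[int]]] = None,
    cuda_empty_cache: bool = False,
) -> Tuple[int, Optional[int]]:
    """Run the allocator-aware cleanup sequence.

    Returns (objects found unreachable by gc.collect(), trim result or None).
    """
    # 1. Collect cyclic garbage.
    collected = gc.collect()

    # 2. Only glibc needs an explicit trim; jemalloc returns memory to the
    #    OS on its own through decay.
    trimmed = trim() if allocator == "glibc" and trim is not None else None

    # 3. Optionally release PyTorch's cached CUDA blocks.
    if cuda_empty_cache:
        try:
            import torch

            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except ImportError:
            pass  # PyTorch not installed; nothing to clear

    return collected, trimmed


if __name__ == "__main__":
    print(cleanup_memory_sketch("jemalloc"))
```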
get_allocator_type
from nvflare.fuel.utils.memory_utils import get_allocator_type
allocator = get_allocator_type() # "glibc", "jemalloc", or "unknown"
Signature: get_allocator_type() -> str
Detects which memory allocator is in use at runtime. Result is cached.
try_malloc_trim
from nvflare.fuel.utils.memory_utils import try_malloc_trim
result = try_malloc_trim()
Signature: try_malloc_trim() -> Optional[int]
Low-level function to return free heap pages to OS.
Returns:
- 1 if memory was released
- 0 if no memory to release
- None if not available (non-Linux or non-glibc)
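A function with this return contract can be written with ctypes against glibc's malloc_trim(). The sketch below is one possible approach and is not taken from NVFlare's source; malloc_trim_sketch is a hypothetical name.

```python
import ctypes
import sys
from typing import Optional


def malloc_trim_sketch(pad: int = 0) -> Optional[int]:
    """Call glibc's malloc_trim(pad).

    Returns 1 if memory was released, 0 if there was nothing to release,
    and None where malloc_trim is unavailable (non-Linux or non-glibc).
    """
    if not sys.platform.startswith("linux"):
        return None
    try:
        libc = ctypes.CDLL("libc.so.6")
        trim = libc.malloc_trim  # AttributeError if the symbol is missing
    except (OSError, AttributeError):
        return None
    trim.argtypes = [ctypes.c_size_t]
    trim.restype = ctypes.c_int
    return int(trim(pad))


if __name__ == "__main__":
    print(malloc_trim_sketch())
```

Note that this sketch does not check which allocator is active. When jemalloc is preloaded, the call reaches glibc's own heap, which holds little of the application's memory, so an allocator check before trimming avoids pointless work.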
Troubleshooting
High RSS on Server
- Check MALLOC_ARENA_MAX is set
- Enable server_memory_gc_rounds=5
- Consider using jemalloc (LD_PRELOAD)
- Monitor with top or htop
High RSS on Client
- Confirm flare.send() uses default clear_cache=True (or explicitly set it)
- Check MALLOC_ARENA_MAX=2 is set
- Start with client_memory_gc_rounds=1
- Increase to 2 or 5 only if RSS is already stable and you are tuning performance
- Enable cuda_empty_cache=True for GPU
- Consider using jemalloc
OOM Errors
- Reduce batch size
- Confirm flare.send() uses default clear_cache=True (this is the primary client fix)
- Enable supplemental cleanup every round (client_memory_gc_rounds=1 or server_memory_gc_rounds=1)
- Check for memory leaks in training code
- Use jemalloc with appropriate decay settings
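For the leak check, Python's standard tracemalloc module can show which source lines account for memory growth between two points in a training script. The sketch below wraps that pattern in a hypothetical helper (measure_growth) and exercises it with a deliberately leaky stand-in for a training round. tracemalloc only sees allocations made through Python's allocator, so native tensor storage may not appear in its output.

```python
import tracemalloc
from typing import Callable


def measure_growth(workload: Callable[[], object], limit: int = 5) -> list:
    """Return the source lines with the largest memory growth during `workload`."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        retained = workload()  # keep the result alive so it counts as growth
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    del retained
    # Entries are sorted with the largest absolute size change first.
    return after.compare_to(before, "lineno")[:limit]


def leaky_round() -> list:
    """Stand-in for a training round that keeps about 1 MB of buffers alive."""
    return [bytearray(1024) for _ in range(1000)]


if __name__ == "__main__":
    for stat in measure_growth(leaky_round):
        print(stat)
```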