Data Preparation & Heterogeneity

Overview

Data preparation in federated learning differs from centralized ML because:

  • Each site has its own local dataset that cannot be inspected centrally

  • Data distributions across sites are often non-IID (not identically distributed)

  • Feature schemas must be aligned without sharing raw data

  • Data quality varies across sites

This guide covers practical approaches to these challenges.

Data Heterogeneity (Non-IID Data)

In federated learning, data across sites is often non-IID (not independently and identically distributed). Sites may have different label distributions, feature distributions, or dataset sizes. This heterogeneity can slow convergence and reduce model accuracy compared to centralized training.

Mitigation strategies with examples:

  • FedProx – Adds a proximal regularization term to prevent client models from drifting too far from the global model during local training

  • SCAFFOLD – Uses control variates to correct for client drift, significantly improving convergence on highly heterogeneous data

Federated Data Exploration

Before training, use Federated Statistics to understand data distributions across sites without sharing raw data:

User Alignment for Vertical FL

In vertical federated learning, different sites hold different features for overlapping users. Before training, sites must identify their common users without revealing their full datasets.

  • Private Set Intersection (PSI) – User alignment for vertical federated learning; find common users/entities across sites without exposing private data