Data Preparation & Heterogeneity
Overview
Data preparation in federated learning differs from centralized ML because:
Each site has its own local dataset that cannot be inspected centrally
Data distributions across sites are often non-IID (not independently and identically distributed)
Feature schemas must be aligned without sharing raw data
Data quality varies across sites
This guide covers practical approaches to these challenges.
Data Heterogeneity (Non-IID Data)
In federated learning, data across sites is often non-IID (not independently and identically distributed). Sites may have different label distributions, feature distributions, or dataset sizes. This heterogeneity can slow convergence and reduce model accuracy compared to centralized training.
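One way to make label-distribution skew concrete is to compare each site's label proportions with the pooled proportions, using only per-site label counts rather than raw records. The following is a minimal sketch under stated assumptions: the site names, class names, counts, and the choice of total variation distance are all illustrative and do not come from any particular library.

```python
from collections import Counter


def to_distribution(counts, classes):
    """Convert label counts into proportions, ordered by `classes`."""
    total = sum(counts[c] for c in classes)
    return [counts[c] / total for c in classes]


def total_variation(p, q):
    """Total variation distance: 0 means identical, 1 means disjoint."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical per-site label counts; each site shares only these aggregates.
classes = ["cat", "dog"]
site_counts = {
    "site-1": Counter({"cat": 90, "dog": 10}),
    "site-2": Counter({"cat": 50, "dog": 50}),
    "site-3": Counter({"cat": 10, "dog": 90}),
}

# Pooled counts are the element-wise sum of the per-site counts.
global_counts = Counter()
for counts in site_counts.values():
    global_counts.update(counts)
global_dist = to_distribution(global_counts, classes)

# Distance of each site's label distribution from the pooled distribution.
skew = {
    site: total_variation(to_distribution(counts, classes), global_dist)
    for site, counts in site_counts.items()
}

for site, value in skew.items():
    print(f"{site}: label skew = {value:.2f}")
```

A value near 0 suggests a site's labels resemble the pooled data, while larger values flag sites whose label mix differs substantially.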
Mitigation strategies with examples:
Federated Data Exploration
Before training, use Federated Statistics to understand data distributions across sites without sharing raw data:
Hello Tabular Statistics – Compute statistics (mean, std, histogram) across federated tabular data
Federated Image Statistics – Compute image histogram statistics across sites
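The idea behind federated statistics can be sketched in a few lines: each site computes local aggregates (count, sum, sum of squares, histogram counts over agreed bin edges), and only those aggregates are combined centrally. This is an illustration of the general technique, not the Federated Statistics API itself; the function names `local_stats` and `aggregate`, the sample values, and the bin edges are assumptions made for the example.

```python
import math


def local_stats(values, bin_edges):
    """Runs at a site: summarize local values without exposing them."""
    n_bins = len(bin_edges) - 1
    hist = [0] * n_bins
    for v in values:
        for i in range(n_bins):
            is_last = i == n_bins - 1
            # Bins are half-open [lo, hi); the last bin also includes its upper edge.
            if bin_edges[i] <= v < bin_edges[i + 1] or (is_last and v == bin_edges[-1]):
                hist[i] += 1
                break
    return {
        "count": len(values),
        "sum": sum(values),
        "sum_sq": sum(v * v for v in values),
        "hist": hist,
    }


def aggregate(site_stats):
    """Runs at the server: combine per-site aggregates into global statistics."""
    count = sum(s["count"] for s in site_stats)
    total = sum(s["sum"] for s in site_stats)
    total_sq = sum(s["sum_sq"] for s in site_stats)
    mean = total / count
    # Population variance via E[x^2] - E[x]^2, clamped to guard against rounding.
    variance = max(total_sq / count - mean * mean, 0.0)
    hist = [sum(col) for col in zip(*(s["hist"] for s in site_stats))]
    return {"count": count, "mean": mean, "std": math.sqrt(variance), "hist": hist}


# Hypothetical local data at two sites; raw values never leave local_stats().
bin_edges = [0, 2, 4, 6]  # assumed to be agreed on by all sites beforehand
site_a = local_stats([1.0, 2.0, 3.0], bin_edges)
site_b = local_stats([4.0, 5.0], bin_edges)

global_stats = aggregate([site_a, site_b])
print(global_stats)
```

Because the histogram bin edges must match across sites, a real deployment typically needs an extra round to agree on global minimum and maximum values first; this sketch simply assumes the edges are known.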
User Alignment for Vertical FL
In vertical federated learning, different sites hold different features for overlapping users. Before training, sites must identify their common users without revealing their full datasets.
Private Set Intersection (PSI) – Find the users/entities common to all sites without exposing each site's private data
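To give a feel for how PSI can work, the sketch below simulates a toy Diffie-Hellman-style intersection between two sites in a single process. It is an assumption-laden illustration, not the PSI implementation referenced above: the modulus, the helper names (`make_key`, `blind`, `reblind`, `psi`), and the user IDs are hypothetical, and the parameters are not suitable for production use.

```python
import hashlib
import math
import secrets

# Toy modulus: the Mersenne prime 2**127 - 1. Illustrative only, not secure.
P = 2**127 - 1


def make_key():
    """Pick a secret exponent coprime to P - 1 so exponentiation is a bijection."""
    while True:
        key = secrets.randbelow(P - 3) + 2
        if math.gcd(key, P - 1) == 1:
            return key


def hash_to_group(item):
    """Map an identifier to a nonzero element modulo P."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def blind(items, key):
    """First pass: hash each ID and raise it to the site's secret key."""
    return [pow(hash_to_group(item), key, P) for item in items]


def reblind(values, key):
    """Second pass: raise the other site's blinded values to our own key."""
    return [pow(v, key, P) for v in values]


def psi(site_a_ids, site_b_ids):
    """Return the IDs of site A that also appear at site B."""
    key_a, key_b = make_key(), make_key()
    a_once = blind(site_a_ids, key_a)        # sent from A to B
    b_once = blind(site_b_ids, key_b)        # sent from B to A
    a_twice = reblind(a_once, key_b)         # B re-blinds in order, returns to A
    b_twice = set(reblind(b_once, key_a))    # A re-blinds B's values
    # H(x)^(a*b) matches on both sides exactly when the underlying IDs match.
    return [uid for uid, v in zip(site_a_ids, a_twice) if v in b_twice]


common = psi(["u1", "u2", "u3", "u4"], ["u3", "u5", "u1"])
print(common)
```

Each site only ever sees the other's IDs in blinded form, so it learns which of its own users are shared but not the rest of the other site's list. A production system would rely on a vetted PSI library and properly chosen cryptographic parameters.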