Training Machine Learning Models Across Decentralized Data Without Sharing Raw Information
Authored by: Loveleen Narang
Date: April 2, 2025
Introduction: The Need for Collaborative Privacy
Machine learning (ML) thrives on data. Traditionally, this meant collecting vast amounts of user data into centralized servers for model training. However, growing concerns about data privacy, coupled with regulations like GDPR and HIPAA, make centralization increasingly problematic, especially for sensitive data generated on personal devices, in hospitals, or within financial institutions. Federated Learning (FL) emerges as a groundbreaking alternative paradigm.
FL enables multiple clients (e.g., mobile devices, hospitals) to collaboratively train a shared ML model under the coordination of a central server, critically, without exchanging their raw local data. Instead of bringing data to the model, FL brings the model to the data. Clients train the model locally on their data and share only the resulting model updates (such as parameter gradients or weights) with the server. The server then aggregates these updates to improve the global model. This approach inherently enhances privacy and data minimization, unlocking collaborative ML possibilities previously hindered by privacy constraints.
Fig 1: Comparison of data flow in Centralized vs. Federated Learning.
The Federated Learning Process: Federated Averaging (FedAvg)
The most common FL algorithm is Federated Averaging (FedAvg). It operates in rounds, typically involving these steps:
Initialization: The central server initializes a global model \( \theta^0 \).
Client Selection: The server selects a subset of available clients \( S_t \) (e.g., randomly) to participate in the current round \( t \). Let \( K \) be the total number of clients and \( n_k \) be the number of data points on client \( k \). Total data \( n = \sum_{k=1}^K n_k \). Formula (1): \(K\), Formula (2): \(n_k\), Formula (3): \(n\).
Distribution: The server sends the current global model \( \theta^t \) to the selected clients in \( S_t \).
Local Training: Each selected client \( k \in S_t \) updates the model based on its local data \( \mathcal{P}_k \). It typically performs multiple local epochs (\( E \)) of optimization (e.g., Stochastic Gradient Descent - SGD) starting from \( \theta^t \) to obtain a local model \( \theta_k^{t+1} \). Local SGD update: Formula (4): \( \theta \leftarrow \theta - \eta \nabla L(x_i, y_i; \theta) \). Formula (5): Local Epochs \(E\).
Communication: Each selected client \( k \) sends its updated model parameters \( \theta_k^{t+1} \) (or the update \( \Delta_k^t = \theta_k^{t+1} - \theta^t \)) back to the server. Crucially, raw data \( \mathcal{P}_k \) is never sent. Formula (6): \( \Delta_k^t \).
Aggregation: The server aggregates the updates from the selected clients, typically using a weighted average based on the amount of data each client used for training, to produce the new global model \( \theta^{t+1} \). FedAvg aggregation: Formula (7): \( \theta^{t+1} = \sum_{k \in S_t} \frac{n_k}{n_{S_t}} \theta_k^{t+1} \), where \( n_{S_t} = \sum_{k \in S_t} n_k \). Equivalently, in terms of updates, Formula (8): \( \theta^{t+1} = \theta^t + \sum_{k \in S_t} \frac{n_k}{n_{S_t}} \Delta_k^t \).
Iteration: Repeat steps 2-6 for a set number of communication rounds \( T \) or until convergence. Formula (9): \(T\).
Mathematically, the goal is to minimize a global objective function \( F(\theta) \), which is often the weighted average of local loss functions \( F_k(\theta) \). Formula (10): \( F(\theta) = \sum_{k=1}^K \frac{n_k}{n} F_k(\theta) \). Formula (11): \( F_k(\theta) = \frac{1}{n_k} \sum_{(x_i, y_i) \in \mathcal{P}_k} L(x_i, y_i; \theta) \), where \( L \) is the loss function (e.g., cross-entropy). Formula (12): \( L \).
Fig 2: The iterative cycle of the Federated Averaging (FedAvg) algorithm.
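To make the local-training and aggregation steps concrete, here is a minimal simulation of FedAvg in Python with NumPy. The linear model, toy client data, and function names are illustrative assumptions rather than any particular FL framework; it is a sketch of the weighted averaging in Formula (7) with all clients participating each round, not a production implementation.

```python
import numpy as np

def local_sgd(theta, X, y, epochs=5, lr=0.05):
    """Run E local epochs of plain SGD for a linear model on one client's data."""
    theta = theta.copy()
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            grad = (xi @ theta - yi) * xi      # gradient of squared error for one sample
            theta -= lr * grad
    return theta

def fedavg_round(theta_global, clients, epochs=5, lr=0.05):
    """One FedAvg round: distribute, train locally, aggregate weighted by data size (Formula 7)."""
    n_total = sum(len(y) for _, y in clients)
    theta_new = np.zeros_like(theta_global)
    for X, y in clients:
        theta_k = local_sgd(theta_global, X, y, epochs, lr)   # local training on client k
        theta_new += (len(y) / n_total) * theta_k             # weighted average of local models
    return theta_new

# Toy setup: 3 clients with different amounts of data drawn from the same linear model.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for n_k in (20, 50, 30):
    X = rng.normal(size=(n_k, 2))
    y = X @ true_w + 0.01 * rng.normal(size=n_k)
    clients.append((X, y))

theta = np.zeros(2)
for t in range(10):                 # T communication rounds
    theta = fedavg_round(theta, clients)
print(theta)                        # approaches [2.0, -1.0]
```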
Federated Learning Variants
Depending on how data is distributed across clients, FL can be categorized:
| Variant | Data Partitioning | Description | Example Use Case |
|---|---|---|---|
| Horizontal FL (HFL) | Same feature space, different samples/users. | Clients have datasets with the same features but different instances (e.g., different users' phone data). This is the setting for FedAvg. | Mobile keyboard prediction across different users. |
| Vertical FL (VFL) | Different feature spaces, same samples/users. | Clients have datasets with different features but covering the same set of instances (e.g., a bank and an e-commerce company hold different data about the same customers). Requires more complex coordination and often involves encryption. | Collaborative credit scoring between a bank and an online retailer. |
| Federated Transfer Learning (FTL) | Different feature spaces, different samples/users (with some overlap). | Applies when datasets differ in both samples and features. Leverages transfer learning techniques within the federated setting. | Using knowledge from a model trained on retail data in one region to help train a model for a different region with a partially overlapping user base. |
Privacy-Preserving Techniques in Federated Learning
While FL prevents direct data sharing, the model updates themselves can potentially leak information about the client's local data through various attacks (e.g., membership inference, property inference, reconstruction attacks). Therefore, additional Privacy-Enhancing Technologies (PETs) are often integrated into FL.
Differential Privacy (DP)
DP provides strong, mathematically provable privacy guarantees by adding carefully calibrated noise to data or computations. It ensures that the outcome of an analysis is statistically similar whether or not any single individual's data is included in the dataset.
Definition (\((\epsilon, \delta)\)-DP): A randomized mechanism \( M \) provides \((\epsilon, \delta)\)-DP if for any two adjacent datasets \( D, D' \) (differing by one record) and any set of outcomes \( S \), Formula (13): \( \Pr[M(D) \in S] \leq e^{\epsilon} \Pr[M(D') \in S] + \delta \).
Where \( \epsilon \) (epsilon) is the privacy budget (lower means more privacy), and \( \delta \) (delta) is the probability of random failure (should be very small). Pure \( \epsilon \)-DP is when \( \delta = 0 \). Formula (14): \( \epsilon \). Formula (15): \( \delta \).
Sensitivity: Measures the maximum change in a function's output when one record is changed. Needed to calibrate noise. \(L_1\) sensitivity: Formula (16): \( \Delta_1 f = \max_{D, D'} ||f(D) - f(D')||_1 \). \(L_2\) sensitivity: Formula (17): \( \Delta_2 f = \max_{D, D'} ||f(D) - f(D')||_2 \).
Mechanisms:
Laplace Mechanism (for \( \epsilon \)-DP): Adds noise drawn from a Laplace distribution scaled by \( \Delta_1 f / \epsilon \). Formula (18): \( M(D) = f(D) + \text{Lap}(\Delta_1 f / \epsilon) \). Laplace PDF: Formula (19): \( \text{Lap}(x | b) = \frac{1}{2b} \exp(-|x|/b) \). Formula (20): scale \( b \).
Gaussian Mechanism (for \((\epsilon, \delta)\)-DP): Adds noise drawn from a Gaussian distribution \( \mathcal{N}(0, \sigma^2) \), calibrated to the \( L_2 \) sensitivity. Formula (21): \( M(D) = f(D) + \mathcal{N}(0, \sigma^2) \). Formula (22): \( \sigma \geq \frac{\Delta_2 f \sqrt{2 \ln(1.25/\delta)}}{\epsilon} \).
Local DP: Clients add noise to their updates \( \theta_k^{t+1} \) or gradients \( \nabla F_k \) before sending them to the server. This protects individual client data even from the server.
Central DP: The server adds noise to the aggregated update \( \sum w_k \theta_k^{t+1} \) after receiving them. This protects against leakage from the final model but assumes a trusted server regarding individual updates.
DP-SGD in FL: Clients perform DP-SGD locally: gradients are clipped to bound sensitivity (Formula (23): \( \tilde{g}_t = g_t / \max(1, ||g_t||_2 / C) \)) and then noise is added before averaging/updating. Formula (24): \( ||g_t||_2 \). Formula (25): Clipping Threshold \(C\).
DP introduces a trade-off: stronger privacy (lower \( \epsilon \)) requires more noise, which can negatively impact model utility (accuracy).
Fig 3: Conceptual illustration of adding noise via Differential Privacy.
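As a concrete sketch of how a client might privatize its update before sending it, the snippet below clips the update to an \( L_2 \) norm \( C \) (Formula (23)) and adds Gaussian noise with the standard Gaussian-mechanism scale. The parameter values and function name are illustrative assumptions; a full DP-SGD implementation would instead clip and noise per-example gradients during local training.

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, epsilon=1.0, delta=1e-5, rng=None):
    """Clip an update to L2 norm C (Formula 23), then add Gaussian noise whose scale
    follows the Gaussian mechanism (sensitivity equals C after clipping)."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update / max(1.0, norm / clip_norm)                    # bound sensitivity
    sigma = clip_norm * np.sqrt(2 * np.log(1.25 / delta)) / epsilon  # noise scale
    return clipped + rng.normal(0.0, sigma, size=update.shape)

raw_update = np.array([0.8, -2.3, 1.1])
noisy_update = privatize_update(raw_update, clip_norm=1.0, epsilon=2.0)
print(noisy_update)   # what the client actually sends to the server
```

Lower \( \epsilon \) increases the noise scale \( \sigma \), which is exactly the privacy-utility trade-off described above.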
Homomorphic Encryption (HE)
HE allows computations (like addition or multiplication) to be performed directly on encrypted data (ciphertexts) without decrypting it first. The decrypted result matches the result of computations performed on the original plaintext.
Fully Homomorphic Encryption (FHE) supports both addition and multiplication on ciphertexts; Partially Homomorphic Encryption (PHE) supports only one. Formula (26): \( \text{Dec}(\text{Enc}(a) \oplus \text{Enc}(b)) = a + b \). Formula (27): \( \text{Dec}(\text{Enc}(a) \otimes \text{Enc}(b)) = a \cdot b \).
Application in FL: Clients encrypt their model updates \( \theta_k^{t+1} \) or \( \Delta_k^t \) using the server's public key before sending them. The server can then sum these encrypted updates (using the additive property of HE) to get an encrypted aggregated update \( \text{Enc}(\sum w_k \theta_k^{t+1}) \). This encrypted result is sent back to clients, who can decrypt it using their shared private key (or individual keys in some schemes).
Pros & Cons: Provides strong privacy against the server (it never sees plaintext updates) without adding noise (preserving accuracy). However, HE operations are computationally very expensive, significantly increasing training time and communication overhead. Often practical only for simpler aggregation schemes (like addition) or specific model types.
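The additive property used for FL aggregation can be illustrated with the Paillier cryptosystem, assuming the third-party phe (python-paillier) package is installed. This is a toy, scalar-valued sketch; real deployments would encrypt entire parameter vectors and keep the private key away from the aggregating server.

```python
# pip install phe
from phe import paillier

# Keypair for the toy example; in practice the private key is held by the clients
# (or a separate key authority), not by the aggregating server.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Each client encrypts its (scalar) weighted update with the public key.
client_updates = [0.12, -0.05, 0.33]
encrypted = [public_key.encrypt(u) for u in client_updates]

# The aggregator sums ciphertexts without ever seeing a plaintext update (Formula 26).
encrypted_sum = encrypted[0]
for c in encrypted[1:]:
    encrypted_sum = encrypted_sum + c

print(private_key.decrypt(encrypted_sum))   # ~0.40, recoverable only by the key holder
```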
Secure Multi-Party Computation (SMPC or SMC)
SMPC protocols allow multiple parties to jointly compute a function over their private inputs without revealing those inputs to each other or any other party.
Concept: Instead of sending updates directly, clients use cryptographic techniques like secret sharing to mask their individual updates.
Secret Sharing (e.g., Shamir's): A secret \( s \) is split into \( N \) shares \( s_1, \dots, s_N \) such that any \( t \) (threshold) shares can reconstruct \( s \), but \( t-1 \) shares reveal no information. Based on polynomial interpolation: construct polynomial \( q(x) \) of degree \( t-1 \) with \( q(0)=s \). Formula (28): \( q(x) = s + a_1 x + \dots + a_{t-1} x^{t-1} \). Share \( i \) is \( (i, q(i)) \). Formula (29): Share \((i, s_i)\).
Application in FL (Secure Aggregation): Clients mask their updates (e.g., by adding pairwise random masks that cancel out when summed, or using secret sharing). The server receives only masked/shared updates and performs the aggregation protocol. It learns the sum \( \sum w_k \Delta_k^t \) but not the individual \( \Delta_k^t \).
Pros & Cons: Provides strong privacy against the server without adding noise like DP. Can be more efficient than HE for complex computations but often involves significant communication overhead due to multiple rounds of interaction between clients or between clients and cryptographic servers. Requires a minimum number of non-colluding clients.
Fig 4: Secure Aggregation prevents the server from seeing individual updates.
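A toy sketch of the pairwise-masking idea behind secure aggregation: each pair of clients shares a random mask that one adds and the other subtracts, so every individual upload looks random to the server while the masks cancel in the sum. This simplified version (written for illustration, not taken from any specific protocol) ignores dropouts, key agreement, and the finite-field arithmetic that real protocols require.

```python
import numpy as np

def mask_updates(updates, rng=None):
    """Apply cancelling pairwise masks to each client's update vector."""
    rng = rng or np.random.default_rng()
    n = len(updates)
    masked = [u.astype(float).copy() for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            mask = rng.normal(size=updates[i].shape)   # secret shared by clients i and j
            masked[i] += mask                          # client i adds the mask
            masked[j] -= mask                          # client j subtracts it
    return masked

updates = [np.array([0.1, 0.2]), np.array([-0.3, 0.5]), np.array([0.4, -0.1])]
masked = mask_updates(updates)

# The server sees only masked vectors, yet their sum equals the true sum.
print(np.sum(masked, axis=0))    # [0.2, 0.6]
print(np.sum(updates, axis=0))   # [0.2, 0.6]
```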
Comparison of Privacy Techniques in Federated Learning
| Technique | Core Mechanism | Pros | Cons | Primary Threat Addressed |
|---|---|---|---|---|
| Differential Privacy (DP) | Add calibrated noise to updates or aggregates | Strong, mathematically provable guarantees | Privacy-utility trade-off: noise can reduce model accuracy | Inference attacks on individual contributions/data points |
| Homomorphic Encryption (HE) | Compute on encrypted data | No accuracy loss due to noise, strong privacy against server | Very high computational overhead, often limited operation types (PHE vs FHE) | Server snooping on intermediate updates |
| Secure Multi-Party Computation (SMPC) | Cryptographic protocols (e.g., secret sharing) | No accuracy loss due to noise, strong privacy against server | High communication overhead, requires coordination/non-collusion assumptions | Server snooping on intermediate updates |
These techniques are not mutually exclusive and can sometimes be combined (e.g., using SMPC for aggregation and applying DP locally).
Challenges in Federated Learning
Despite its promise, FL faces significant practical challenges:
Statistical Heterogeneity: Client data is often non-IID (not independent and identically distributed), varying significantly in size and distribution. This complicates model convergence and can lead to biased global models. Metrics such as the KL divergence can quantify distribution differences (see the short sketch after this list). Formula (30): \( D_{KL}(P||Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} \).
Systems Heterogeneity: Clients (especially mobile devices) vary greatly in hardware (CPU, memory), network connectivity (WiFi, 3G/4G), and power availability. This leads to variability in training speed and potential dropouts.
Communication Bottlenecks: Communication between clients and the server is often slow and expensive compared to local computation. Reducing the number of rounds and the size of updates (e.g., via model compression, quantization) is crucial.
Security & Privacy Threats: Even with PETs, risks remain. Malicious clients could send poisoned updates to corrupt the global model. Inference attacks might still try to extract information from aggregated results or gradients, especially without strong DP guarantees.
Fairness: The global model might perform well on average but poorly for specific subgroups or clients, particularly those with under-represented data. Ensuring fairness across diverse clients is an active research area.
Productionization Complexity: Managing large-scale FL systems, ensuring robustness to dropouts, coordinating PETs, and debugging distributed processes is complex.
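As referenced under statistical heterogeneity, here is a small sketch of using the KL divergence (Formula (30)) to quantify how far one client's label distribution deviates from the global distribution; the distributions below are made up purely for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for discrete distributions, with smoothing to avoid log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

global_labels = [0.25, 0.25, 0.25, 0.25]   # balanced 4-class label distribution
client_labels = [0.70, 0.20, 0.05, 0.05]   # heavily skewed (non-IID) client

print(kl_divergence(client_labels, global_labels))  # ~0.52; larger means more heterogeneous
```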
Applications of Federated Learning
FL is being explored and deployed in various domains where data privacy is paramount:
Mobile Devices: Smart keyboard prediction (e.g., Gboard), voice assistant personalization, on-device item ranking, without uploading sensitive user interactions.
Healthcare: Training diagnostic models (e.g., for medical imaging) across multiple hospitals without sharing sensitive patient records.
Finance: Collaborative fraud detection or credit risk modeling across different financial institutions without revealing customer data.
Industrial IoT: Predictive maintenance models trained across sensors in different factories without sharing potentially proprietary operational data.
Automotive: Training models for autonomous driving features using data from fleets of vehicles.
Conclusion
Federated Learning represents a paradigm shift in machine learning, enabling collaborative model training while respecting data privacy and locality. By keeping raw data decentralized and leveraging privacy-enhancing technologies like Differential Privacy, Homomorphic Encryption, and Secure Multi-Party Computation, FL opens doors to applications previously blocked by privacy hurdles. However, realizing its full potential requires overcoming significant challenges related to data heterogeneity, system constraints, communication efficiency, security, and fairness. As research progresses and frameworks mature, Federated Learning is poised to become an increasingly vital tool for building intelligent systems responsibly in a data-sensitive world.
Formula count includes simple definitions/parameters like: (1) K, (2) nk, (3) n, (4) Local SGD, (5) E, (6) Δk, (7) FedAvg Aggregation (weights), (8) FedAvg Aggregation (updates), (9) T, (10) Global Objective, (11) Local Objective, (12) L, (13) (ε,δ)-DP Def, (14) ε, (15) δ, (16) L1 Sensitivity, (17) L2 Sensitivity, (18) Laplace Mechanism, (19) Laplace PDF, (20) Laplace Scale b, (21) Gaussian Mechanism, (22) Gaussian Sigma σ, (23) DP-SGD Clipping, (24) L2 Norm, (25) Clipping Threshold C, (26) HE Addition, (27) HE Multiplication, (28) Shamir Polynomial, (29) Shamir Share, (30) KL Divergence, (31) Basic Learning Rate η (used in formula 4), (32) Number of clients sampled |St|. Total > 30.
About the Author, Architect & Developer
Loveleen Narang is a seasoned leader in the field of Data Science, Machine Learning, and Artificial Intelligence. With extensive experience in architecting and developing cutting-edge AI solutions, Loveleen focuses on applying advanced technologies to solve complex real-world problems, driving efficiency, enhancing compliance, and creating significant value across various sectors, particularly within government and public administration. His work emphasizes building robust, scalable, and secure systems aligned with industry best practices.