Notes on Federated Learning and Differential Privacy
2026-05-31 · privacy-preserving ML
Working notes on building federated learning (FL) from scratch, what actually breaks under Non-IID data, and how differential privacy (DP) and secure aggregation fit on top — including the honest negative results that the marketing slides leave out. They follow the implementation in federated-learning-lab (FedAvg / FedProx / SCAFFOLD, DP-SGD, secure aggregation; 33/33 tests, literature cross-validated).
1. What federated learning actually is
The data never moves. Instead of pooling everyone's data on one server, each client trains locally and sends model updates to a server that aggregates them. The canonical loop (FedAvg) is:
- Server broadcasts the global model.
- Each client does a few local SGD epochs on its own data.
- Each client sends back its updated weights.
- Server averages the weights (weighted by client data size) → new global model.
That's it. The elegance is that raw data stays on-device; the difficulty is that the clients' data distributions are not identical.
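The loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lab's implementation: the names `local_sgd`, `fedavg_aggregate`, and `fedavg_round`, the least-squares objective, and the toy data are all hypothetical and chosen only to keep the example self-contained.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Sample = Tuple[Sequence[float], float]  # (features x, target y)


def local_sgd(w: Vector, data: Sequence[Sample], lr: float = 0.1, epochs: int = 5) -> Vector:
    """Run a few local SGD epochs on a least-squares loss (assumed objective)."""
    w = list(w)  # copy so the caller's global model is not mutated
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def fedavg_aggregate(client_weights: Sequence[Vector], client_sizes: Sequence[int]) -> Vector:
    """Average client models, weighting each by its number of local samples."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(n * w[d] for w, n in zip(client_weights, client_sizes)) / total
        for d in range(dim)
    ]


def fedavg_round(global_w: Vector, client_datasets: Sequence[Sequence[Sample]]) -> Vector:
    """One round: broadcast, local training, send back, weighted average."""
    updates = [local_sgd(global_w, data) for data in client_datasets]
    sizes = [len(data) for data in client_datasets]
    return fedavg_aggregate(updates, sizes)


# Two toy clients whose data both follow y = 2 * x (an IID-like setting).
clients = [
    [((1.0,), 2.0), ((2.0,), 4.0)],
    [((0.5,), 1.0), ((1.5,), 3.0), ((3.0,), 6.0)],
]
w = [0.0]
for _ in range(20):
    w = fedavg_round(w, clients)
print(w)
```

Only the model vector crosses the client boundary here; the `clients` lists are read solely inside `local_sgd`, which mirrors the "data never moves" property.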
2. The Non-IID problem (where FedAvg starts to hurt)
FedAvg implicitly assumes every client sees roughly the same distribution. Real clients don't — one hospital sees different cases than another, one phone's keyboard sees different language. Under Non-IID data, each client's local optimum pulls in a different direction, so averaging their updates produces client drift: the global model lands somewhere none of them wanted.
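Client drift can be reproduced on a deliberately tiny problem. The sketch below is an assumption-laden toy, not a result from the lab: two clients each minimise a one-dimensional quadratic with a different curvature and a different local optimum, and `run_fedavg` (a hypothetical helper) averages their models with equal weights after a configurable number of local gradient steps.

```python
def run_fedavg(local_steps: int, rounds: int = 300, lr: float = 0.05) -> float:
    """Equal-weight FedAvg on two 1-D quadratic clients.

    Client i minimises 0.5 * a_i * (w - c_i) ** 2, so its gradient is
    a_i * (w - c_i) and its local optimum is c_i.
    """
    clients = [(1.0, 0.0), (10.0, 1.0)]  # (curvature a_i, local optimum c_i)
    w = 0.0
    for _ in range(rounds):
        local_models = []
        for a, c in clients:
            w_local = w  # start from the broadcast global model
            for _ in range(local_steps):
                w_local -= lr * a * (w_local - c)
            local_models.append(w_local)
        w = sum(local_models) / len(local_models)
    return w


# Minimiser of the average loss: the curvature-weighted mean of the local optima.
true_optimum = (1.0 * 0.0 + 10.0 * 1.0) / (1.0 + 10.0)

one_step = run_fedavg(local_steps=1)
many_steps = run_fedavg(local_steps=50)
print(true_optimum, one_step, many_steps)
```

With one local step per round, averaging the client models is the same as taking a gradient step on the average loss, so the global model reaches the true optimum (10/11). With 50 local steps each client gets close to its own optimum first, and the average settles near 0.52 instead: closer to the plain mean of the two local optima than to the point that minimises the combined loss.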
Two well-known fixes, both implemented and measured in the lab:
- FedProx — add a proximal term that penalises drifting too far from the global model. Stabilises training when clients are heterogeneous.
- SCAFFOLD — track control variates (correction terms) that estimate and subtract the drift direction. More state to communicate, but corrects the bias FedProx only damps.
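The FedProx idea fits in one extra term in the local gradient. The following is a minimal sketch, assuming a scalar model and a hypothetical helper `fedprox_local_update`; the parameter name `mu` for the proximal strength and the toy quadratic client are illustrative choices, not the lab's code. SCAFFOLD is not shown because its control variates need per-client state across rounds.

```python
from typing import Callable


def fedprox_local_update(
    w_global: float,
    grad_fn: Callable[[float], float],
    mu: float,
    lr: float = 0.05,
    steps: int = 50,
) -> float:
    """Local gradient descent on F_i(w) + (mu / 2) * (w - w_global) ** 2.

    With mu = 0 this reduces to the plain FedAvg local update.
    """
    w = w_global
    for _ in range(steps):
        # The proximal term contributes mu * (w - w_global) to the gradient,
        # pulling the local model back toward the broadcast global model.
        w -= lr * (grad_fn(w) + mu * (w - w_global))
    return w


# Toy client: F_i(w) = 0.5 * 10 * (w - 1) ** 2, so its local optimum is w = 1.
def client_grad(w: float) -> float:
    return 10.0 * (w - 1.0)


plain = fedprox_local_update(0.0, client_grad, mu=0.0)
prox = fedprox_local_update(0.0, client_grad, mu=10.0)
print(plain, prox)
```

Starting from a global model at 0, the plain update runs all the way to the client's own optimum at 1, while `mu=10.0` stops it at 0.5, halfway between the global model and the local optimum. The term damps how far a client can drift; it does not estimate the drift direction itself.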
The honest finding worth repeating: on a strongly Non-IID split (e.g. label-skewed MNIST), the fancy methods don't always beat plain FedAvg by much — and sometimes the dominant lever is just more communication rounds. Reporting the case where your method doesn't win is what separates a lab from a brochure.
3. Differential privacy: the model still leaks
Keeping data on-device is not privacy. Model updates leak information about the data that produced them: membership-inference attacks reveal whether a given example was in the training set, and gradient-inversion attacks can reconstruct training samples from gradients. To get a real guarantee you add differential privacy.
DP-SGD makes each training step private by:
- Per-sample gradient clipping: bound each example's contribution to a max norm C.
- Gaussian noise: add noise calibrated to C to the summed gradients.
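The two steps above can be sketched as a single function. This is an illustration under assumptions, not the lab's DP-SGD: `clip` and `dp_sgd_gradient` are hypothetical names, the noise standard deviation is taken to be `noise_multiplier * max_norm`, and Python's `random` module stands in for a proper noise source. The sketch does not compute ε; that requires a privacy accountant on top.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def clip(grad: Sequence[float], max_norm: float) -> Vector:
    """Scale a per-sample gradient so its L2 norm is at most max_norm (C)."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(
    per_sample_grads: Sequence[Sequence[float]],
    max_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> Vector:
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    summed = [sum(g[d] for g in clipped) for d in range(dim)]
    # Assumption: noise std is noise_multiplier * C, added to the SUM.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


# `random` is not a cryptographically secure generator; it is used here
# only so the example is reproducible.
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # norms 5.0 and 0.5
noiseless = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=0.0, rng=rng)
private = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
print(noiseless, private)
```

The order matters: clipping must happen per example before summation, because the noise scale is only meaningful if every example's contribution is bounded by C. Clipping the already-averaged batch gradient is the kind of subtle bug that leaves the code running but the guarantee void.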
The result is a formal (ε, δ) guarantee: the probability of ending up with any given trained model changes by at most a factor of e^ε, plus a small slack δ, whether or not any single example was in the data. The cost is the privacy–utility trade-off: smaller ε (stronger privacy) means more noise and lower accuracy. There is no free lunch; the contribution is measuring the curve, not claiming privacy is costless.
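To get a feel for how quickly the noise grows as ε shrinks, the sketch below uses the classical Gaussian-mechanism calibration σ = sqrt(2 ln(1.25/δ)) · Δ / ε, which is valid for 0 < ε < 1 and a single noisy release. That is a simplification: a full DP-SGD run composes many noisy steps and needs a privacy accountant, so these numbers illustrate the shape of the trade-off rather than the ε of any trained model. The function name `gaussian_sigma` and the chosen δ are assumptions.

```python
import math


def gaussian_sigma(epsilon: float, delta: float, sensitivity: float = 1.0) -> float:
    """Noise std for one (epsilon, delta)-DP release via the classical bound.

    Valid only for 0 < epsilon < 1; sensitivity is the L2 sensitivity,
    which plays the role of the clipping norm C.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("classical Gaussian bound requires 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    return math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / epsilon


# Smaller epsilon (stronger privacy) requires proportionally more noise.
sigmas = {eps: gaussian_sigma(eps, delta=1e-5) for eps in (0.1, 0.25, 0.5, 0.9)}
for eps, sigma in sigmas.items():
    print(f"epsilon={eps:<5} sigma={sigma:.2f}")
```

Under this bound σ scales as 1/ε, so halving ε doubles the noise standard deviation, while tightening δ costs only a square-root-of-log factor.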
4. Secure aggregation: hide the individual update
DP bounds what the final model leaks. Secure aggregation addresses a different threat: a curious server seeing each client's individual update. With secure aggregation, clients mask their updates so the server can compute only the sum — no single client's contribution is visible — yet the masks cancel in aggregate. DP (what the model leaks) and secure aggregation (what the server sees) are complementary, not substitutes.
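The cancelling-masks idea can be shown with pairwise additive masks over integers modulo a fixed value. This is a toy sketch, not a secure protocol and not the lab's implementation: `mask_updates` and `aggregate` are hypothetical names, updates are assumed to be already quantised to integers, and all masks are generated in one place for brevity. In a real deployment each pair of clients would derive its shared mask through key agreement, and the protocol would also have to handle clients that drop out.

```python
import random
from typing import List, Sequence

MODULUS = 2 ** 32  # assumed working modulus for quantised updates


def mask_updates(updates: Sequence[Sequence[int]], rng: random.Random) -> List[List[int]]:
    """Add pairwise masks: for i < j, client i adds the mask and client j subtracts it."""
    n = len(updates)
    dim = len(updates[0])
    masked = [[v % MODULUS for v in u] for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            # Stand-in for a secret shared only by clients i and j.
            pair_mask = [rng.randrange(MODULUS) for _ in range(dim)]
            for d in range(dim):
                masked[i][d] = (masked[i][d] + pair_mask[d]) % MODULUS
                masked[j][d] = (masked[j][d] - pair_mask[d]) % MODULUS
    return masked


def aggregate(masked: Sequence[Sequence[int]]) -> List[int]:
    """Server side: sum the masked vectors; every pairwise mask cancels."""
    dim = len(masked[0])
    return [sum(m[d] for m in masked) % MODULUS for d in range(dim)]


updates = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]
# Seeded `random` keeps the example reproducible; it is not a secure source.
masked = mask_updates(updates, random.Random(0))
total = aggregate(masked)
print(masked[0], total)
```

Each masked vector looks like uniform noise on its own, yet `total` equals the element-wise sum of the original updates, because every mask is added exactly once and subtracted exactly once.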
5. Why "from scratch" and "33/33 tests"
Privacy ML is exactly the domain where a subtly wrong implementation gives a false sense of safety — a clipping bug or a miscalibrated noise multiplier silently voids the ε guarantee. So the lab:
- implements each algorithm from scratch (FedAvg / FedProx / SCAFFOLD, plus FedPer / Byzantine-robust / FedAdam / FedLoRA),
- cross-validates against the literature so behaviour matches published results, and
- ships 33/33 passing tests and explicit negative results.
For privacy and security work, the test suite and the reproduction are the credibility.
Takeaway
Federated learning moves the model, not the data — but on-device ≠ private. Non-IID data breaks naive averaging (FedProx/SCAFFOLD help, sometimes only a little); DP-SGD buys a formal (ε, δ) guarantee at a measurable accuracy cost; secure aggregation hides individual updates from the server. The trustworthy version of all three is the one with the tests and the honest curves.
→ From-scratch implementations, tests, and negative results: github.com/waynehacking8/federated-learning-lab