Federated learning has attracted growing interest in healthcare, where data sharing is often challenging due to concerns over user privacy and data ownership. Without adequate techniques to utilize this "isolated" data, we end up with data silos: hospitals are eager to leverage their data but lack effective ways to do so. Introduced by Google in 2016, federated learning was designed to tackle this issue by adhering to the core principle that only model gradients or weights, not raw data, are exchanged between clients and the server.
Yet, in recent years, we have not seen this technology adopted on a large scale, despite increasing societal concern for data privacy from both regulatory agencies and the public. One possible reason is that tech companies may lack the incentive to allow users to keep their data on local devices, as this data is a lucrative resource for them. However, we are interested in investigating, from a technical standpoint, what factors are impeding the broader adoption of this technology.
Federated Learning is a distributed learning technique that aggregates locally computed updates to learn a shared model without accessing local raw data, making it a popular technique for preserving privacy. Instead of all training happening on a central server, training is distributed across multiple 'client' nodes. Distributed learning itself is nothing new: iteratively averaging locally trained models has been extensively studied and practically applied in the cluster/data-center setting.
How, then, is Federated Learning different? Unlike conventional distributed training, which often overlooks privacy concerns, Federated Learning also distinguishes itself through its unique environmental considerations. In a data-center cluster, we typically assume balanced data (each client node holds a dataset of similar size), IID (independent and identically distributed) inputs, and fast, stable, synchronized networks. In the federated setting, we usually deal with anywhere from dozens to millions of local datasets characterized by unbalanced, non-IID data and bandwidth-constrained networks. This shift results in a change of focus: from computational costs in cluster settings to communication costs in federated environments. The Federated Averaging algorithm introduced in this paper leverages additional local computation to reduce the number of communication rounds required for model training. The paper also describes weighted-averaging aggregation of local updates, as opposed to simple averaging.
How can we leverage more local computation to address the issue of high communication costs? What is weighted average aggregation in the federated learning setting, and what are its benefits compared to simple averaging? With these questions in mind, let's dive into the paper!
The paper Communication-Efficient Learning of Deep Networks from Decentralized Data is seminal in introducing federated learning. It outlines the fundamental architecture of an FL system and an optimization algorithm, FedAvg (Federated Averaging), for updating the global model on a central server based on local gradient or weight updates from client devices. Although this work primarily focuses on optimization properties within an FL setting and offers limited discussion of practical issues that might arise in real-world applications, we believe it provides an excellent foundation for understanding the technology and setting the stage for future research and practical applications.
The predominant optimization algorithm in use is stochastic gradient descent (SGD), which adapts naturally to federated optimization (FedSGD) by processing a single batch gradient per communication round. Although this method is computationally frugal, it requires a large number of training rounds to produce good models; for example, even with techniques like batch normalization, MNIST training required on the order of 50,000 steps with small minibatches. Given that communication is a costly resource in federated learning, reducing the number of communication rounds is imperative. This is what distinguishes FedAvg from FedSGD: FedAvg significantly curtails communication demand by performing multiple local updates before transmitting them, whereas FedSGD communicates after every batch, increasing the communication overhead. This efficiency in reducing communication rounds makes FedAvg particularly advantageous in federated settings.
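To make this contrast concrete, here is the relationship in the paper's notation, where \( g_k = \nabla F_k(w_t) \) is the average gradient over client \(k\)'s local data:

$$ \text{FedSGD:} \quad w_{t+1} \leftarrow w_t - \eta \sum_{k=1}^{K} \frac{n_k}{n} g_k, \quad \text{or equivalently} \quad w_{t+1}^{k} \leftarrow w_t - \eta g_k, \;\; w_{t+1} \leftarrow \sum_{k=1}^{K} \frac{n_k}{n} w_{t+1}^{k}. $$

FedAvg keeps the same weighted averaging step on the server but lets each client iterate the local update \( w^{k} \leftarrow w^{k} - \eta \nabla F_k(w^{k}) \) many times (E epochs over its local data with minibatch size B) before communicating, so each round carries far more local computation.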
The key steps of FedAvg are as follows:
Loss function:
$$ f(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w) \quad \text{where} \quad F_k(w) = \frac{1}{n_k} \sum_{i \in P_k} f_i(w). $$
The key hyperparameters are:
C: the fraction of clients participating in each training round.
K: the total number of clients.
B: the local minibatch size for client updates.
E: the number of local epochs for client updates.
η: the learning rate for gradient descent.

More specifically, the server executes the following steps:
1. Initialize the global model w_0.
2. For each round t, randomly sample a set S_t of m clients from the total K clients, where m is determined by C, the fraction of the total number of clients.
3. Each selected client k performs a ClientUpdate on the current global model w_t.
4. Compute m_t, the sum of the data points n_k from all selected clients.
5. Update the global model w_{t+1} using a weighted sum of the client updates, with weights proportional to their data points.

ClientUpdate (run on client k):
1. Split the local data P_k into batches of size B.
2. For each local epoch i from 1 to E, iterate over each batch b and update the local model parameters w using gradient descent.
3. After E epochs, return the updated parameters w to the server.
Uniqueness of the Federated Averaging algorithm: FedAvg addresses training on unbalanced and non-IID data distributions across numerous clients. It is specifically designed to optimize the trade-off between local computation and communication efficiency, which is critical in scenarios where preserving data privacy, managing limited bandwidth, and dealing with uneven data distribution are primary concerns.
Having understood the basics of FedAvg, including the meanings of various variables in the algorithm such as B, E, C, and K, as well as its general approach, let us now dive into the experiments conducted by the authors. As previously mentioned, FedAvg aims to address the issue of high communication costs in the traditional FedSGD setting by allowing more local computation before aggregating updates to the central server. Therefore, the authors have conducted extensive empirical studies to test the effectiveness of this approach, especially under conditions of unbalanced and non-IID data input.
Before highlighting FedAvg's remarkable efficiency in reducing communication rounds and costs, note that the following experiments were all conducted with C = 0.1, a value chosen based on earlier empirical results. This setting means that only 10% of all clients participate in each round. To increase computation per client per round, the authors either decreased B (the local minibatch size), increased E (the number of local epochs), or both.
The expected number of updates per client per round, \( u = \frac{nE}{KB} \), was increased by varying both E (the number of local epochs) and B (the local minibatch size).
Here, u is an average over clients: n/K is the average number of examples per client, so each client runs roughly n/(KB) minibatch updates per epoch, times E epochs. Since each client may hold a different amount of data, u measures only the average computational load per client per round and does not imply equal data distribution among clients. In practice, data heterogeneity, where data distribution varies significantly among clients, remains a crucial consideration in federated learning scenarios.
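As a quick worked example (our own numbers, not the paper's): with \( n = 60{,}000 \) examples spread over \( K = 100 \) clients, \( E = 5 \) local epochs, and \( B = 10 \), each client performs on average \( u = \frac{60{,}000 \cdot 5}{100 \cdot 10} = 300 \) updates per round, whereas FedSGD (\( E = 1 \), \( B = \infty \)) performs a single update per client per round.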
Below is a summary of the experimental results:
A point worth noting is that although FedAvg is effective across balanced, unbalanced, IID, and non-IID settings, the number of communication rounds needed in the non-IID, unbalanced setting still far exceeds that in the balanced, IID setting (except for the language-model task). This remains an ongoing challenge in combating data heterogeneity in the federated learning world.
Having observed the remarkable performance of the FedAvg algorithm in reducing communication costs, it's crucial to also assess the model's performance. This is a key indicator of the effectiveness of federated learning. After all, it wouldn't be sensible from various perspectives to prioritize privacy preservation at the expense of model performance.
The figures on the right show the performance of the MNIST CNN model under both IID and non-IID settings. Models with different hyperparameters converge at different rates, but they ultimately reach very promising accuracy.
Now that we see that more local training epochs can help reduce communication rounds, a question arises: can we make E (the number of local epochs) arbitrarily large? The experiments show that for certain tasks, with a very large number of local epochs, FedAvg can plateau or even diverge.
Based solely on the experiments in the paper, it's difficult to assert definitively that an excessively large E will inevitably result in the model's failure to converge. As observed in the figures above, for an LSTM task, a large E did cause the model to diverge, but this was not the case in a CNN task. We believe that there are more nuances and factors that need to be considered when determining the negative effects of a large E on model training.
The social implications of this research are significant.
The following are some examples of real-world federated learning applications.
Several potential issues have been identified that can hinder the effective implementation of federated learning, and some of these have now become popular open research areas.
This paper presents a clear problem and proposes a clear, straightforward solution. As with most seminal papers, it lacks real-world validation and does not address the practical obstacles that might occur.
Overall Impression: The paper contributes significantly to federated learning, particularly in its practical application in reducing communication cost. However, its impact is somewhat limited by the lack of comprehensive theoretical analysis and an oversimplified approach to model complexity. Future research could benefit from a more rigorous theoretical foundation and comparative studies between traditional and federated learning frameworks.
The primary objective of this code implementation is to demonstrate the efficacy of the FedAvg algorithm in reducing communication costs within a federated learning framework. This process comprises three key components:
Please Explore our Federated Averaging Experiment Colab Notebook
Note that due to resource constraints, only the vanilla SGD and FedAvg experiments have been completed, and these required hours of training. Since the FedSGD algorithm is expected to take significantly longer due to its high communication cost in the federated setting (possibly more than an order of magnitude longer than FedAvg), its experiment has been deferred.
This study utilizes the YelpReview sentiment analysis dataset, which contains 10,000 Yelp reviews, each with review text and a star rating. The primary task is to predict the star rating (ranging from 1 to 5) from the review text.
The datasets for the three models mentioned are prepared as follows, with each dataset first undergoing tokenization and padding before splitting:
For both FedSGD and FedAvg models, to emulate non-IID and unbalanced data conditions, the raw dataset is sorted into classes before being divided into shards, with each shard containing varying numbers of data examples.
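A minimal sketch of this sort-and-shard partitioning is shown below, assuming the ratings are available as an integer array; the function name, shard counts, and random seed are illustrative, not taken from our notebook.

```python
import numpy as np

def shard_non_iid(labels, num_clients, shards_per_client=2, rng=None):
    """Sort-and-shard partitioning to emulate non-IID, unbalanced clients.
    Examples are sorted by label (rating), cut into contiguous shards of
    unequal size, and each client receives a few shards, so most clients
    see only one or two classes and differing amounts of data."""
    rng = rng or np.random.default_rng(0)
    order = np.argsort(labels)                           # group example indices by class
    num_shards = num_clients * shards_per_client
    cuts = np.sort(rng.choice(len(labels), num_shards - 1, replace=False))
    shards = np.split(order, cuts)                       # unequal-sized shards
    rng.shuffle(shards)                                  # assign shards to clients at random
    return {c: np.concatenate(shards[c * shards_per_client:(c + 1) * shards_per_client])
            for c in range(num_clients)}

# Example: partition the 10,000-review dataset across 10 clients
# client_indices = shard_non_iid(ratings, num_clients=10)
```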
The vanilla SGD model establishes a baseline for this sentiment analysis task, since all three models share the same network architecture. After training for 100 epochs, we clearly observe overfitting, and the validation accuracy of around 40% is not particularly good. Nevertheless, it serves as the benchmark for the FedAvg and FedSGD experiments: specifically, we examine how many communication rounds it takes for both FedAvg and FedSGD to reach this threshold.
The experimental results reported in the paper were not fully replicated in our implementation, primarily due to time and computational constraints. The FedAvg algorithm ran for 4 hours over 238 rounds with 10 clients and reached an accuracy of 25.458% before training was intentionally stopped due to computational limits. We are eager to continue this experiment in the future. Despite these challenges, the notebook still merits examination, and any feedback is welcome!
The FedAvg algorithm is a cornerstone in federated learning, addressing both the efficiency of model training in decentralized environments and the privacy concerns associated with data sharing. Its design and implementation have opened up new possibilities in the application of machine learning models, especially in scenarios where data privacy and communication efficiency are paramount. However, the FedAvg algorithm is not without its limitations. It still faces challenges from data heterogeneity and system heterogeneity, which are obstacles that impede this technology from being widely adopted in real-world applications. Nevertheless, these limitations also open up new research directions in large-scale machine learning systems.
Gabriel Cuchacovich
Zhongwei Zhang