Optimal Communication Structures for Big Data Aggregation
Date
2015-04-26
Abstract
Aggregation of computed result sets underlies the distillation of information in many of today’s big data applications. Many systems have been introduced that let users obtain aggregate results by aggregating along communication structures such as trees, but they do not focus on improving performance by optimizing the underlying aggregation structure. We consider two cases of the problem: aggregation of (1) single blocks of data and (2) streaming input. For each case we determine which metric of “fast” completion is most relevant and mathematically model the resulting aggregation-tree systems to optimize that metric. Our assumptions and model are laid out in depth. From our model we determine how to create a provably ideal aggregation tree (i.e., with optimal fanin) using only limited information about the aggregation function being applied. Experiments in the Amazon Elastic Compute Cloud (EC2) confirm the validity of our models in practice.
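
To make the notion of fanin concrete, the following is a minimal illustrative sketch and not the model developed in the paper. It assumes a deliberately simplified cost model: every node receives its children's inputs one after another, each input costs one time unit, and the aggregation output is the same size as a single input (as with a sum or a fixed-size top-k). The function names `tree_height`, `completion_time`, and `best_fanin` are hypothetical and chosen only for this example.

```python
def tree_height(n_leaves: int, fanin: int) -> int:
    """Smallest h such that fanin**h >= n_leaves (integer arithmetic only)."""
    height, capacity = 0, 1
    while capacity < n_leaves:
        capacity *= fanin
        height += 1
    return height


def completion_time(n_leaves: int, fanin: int, unit_cost: float = 1.0) -> float:
    """Toy completion time: each level waits for `fanin` serial inputs.

    Assumption (not from the paper): output size equals input size, so
    every level of the tree costs the same amount of time.
    """
    return tree_height(n_leaves, fanin) * fanin * unit_cost


def best_fanin(n_leaves: int) -> int:
    """Fanin in [2, n_leaves] minimizing the toy completion time.

    Ties are broken in favor of the smallest fanin.
    """
    return min(range(2, n_leaves + 1),
               key=lambda f: completion_time(n_leaves, f))


if __name__ == "__main__":
    n = 81
    f = best_fanin(n)
    print(f, completion_time(n, f))
```

Under these assumptions a flat tree (fanin equal to the number of leaves) is far from ideal, while a small fanin balances tree height against per-node work. An aggregation function whose output grows or shrinks relative to its input would shift this balance, which is the kind of information about the function that the abstract refers to.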