TY - GEN
T1 - Finding the right cloud configuration for analytics clusters
AU - Bilal, Muhammad
AU - Canini, Marco
AU - Rodrigues, Rodrigo
N1 - KAUST Repository Item: Exported on 2020-11-20
PY - 2020/10/13
Y1 - 2020/10/13
N2 - Finding good cloud configurations for deploying a single distributed system is already a challenging task, and it becomes substantially harder when a data analytics cluster is formed by multiple distributed systems since the search space becomes exponentially larger. In particular, recent proposals for single system deployments rely on benchmarking runs that become prohibitively expensive as we shift to joint optimization of multiple systems, as users have to wait until the end of a long optimization run to start the production run of their job. We propose Vanir, an optimization framework designed to operate in an ecosystem of multiple distributed systems forming an analytics cluster. To deal with this large search space, Vanir takes the approach of quickly finding a good enough configuration and then attempts to further optimize the configuration during production runs. This is achieved by combining a series of techniques in a novel way, namely a metrics-based optimizer for the benchmarking runs, and a Mondrian forest-based performance model and transfer learning during production runs. Our results show that Vanir can find deployments that perform comparably to the ones found by state-of-the-art single-system cloud configuration optimizers while spending 2X fewer benchmarking runs. This leads to an overall search cost that is 1.3 - 24X lower compared to the state-of-the-art. Additionally, when transfer learning can be used, Vanir can minimize the benchmarking runs even further, and use online optimization to achieve a performance comparable to the deployments found by today's single-system frameworks.
AB - Finding good cloud configurations for deploying a single distributed system is already a challenging task, and it becomes substantially harder when a data analytics cluster is formed by multiple distributed systems since the search space becomes exponentially larger. In particular, recent proposals for single system deployments rely on benchmarking runs that become prohibitively expensive as we shift to joint optimization of multiple systems, as users have to wait until the end of a long optimization run to start the production run of their job. We propose Vanir, an optimization framework designed to operate in an ecosystem of multiple distributed systems forming an analytics cluster. To deal with this large search space, Vanir takes the approach of quickly finding a good enough configuration and then attempts to further optimize the configuration during production runs. This is achieved by combining a series of techniques in a novel way, namely a metrics-based optimizer for the benchmarking runs, and a Mondrian forest-based performance model and transfer learning during production runs. Our results show that Vanir can find deployments that perform comparably to the ones found by state-of-the-art single-system cloud configuration optimizers while spending 2X fewer benchmarking runs. This leads to an overall search cost that is 1.3 - 24X lower compared to the state-of-the-art. Additionally, when transfer learning can be used, Vanir can minimize the benchmarking runs even further, and use online optimization to achieve a performance comparable to the deployments found by today's single-system frameworks.
UR - http://hdl.handle.net/10754/666044
UR - https://dl.acm.org/doi/10.1145/3419111.3421305
UR - http://www.scopus.com/inward/record.url?scp=85095439102&partnerID=8YFLogxK
U2 - 10.1145/3419111.3421305
DO - 10.1145/3419111.3421305
M3 - Conference contribution
SN - 9781450381376
SP - 208
EP - 222
BT - Proceedings of the 11th ACM Symposium on Cloud Computing
PB - ACM
ER -