Title | : | Optimizing Big Data Processing in Heterogeneous Clusters |
Speaker | : | Nishank Garg (IITM) |
Details | : | Thu, 27 Jun, 2019 11:00 AM @ A M Turing Hall |
Abstract | : | Spark has emerged as a de-facto distributed processing framework for a large number of Big Data applications. The Spark framework assumes that the underlying environment is homogeneous. However, this assumption does not hold for present-day clusters, because hardware and other resources are added to the cluster over time. Consequently, the heterogeneous execution environment affects the performance of the Spark jobs running on the cluster. This thesis work mainly addresses the environment heterogeneity arising from memory and disk latencies and provides an optimizing framework for Spark job execution.
We propose Sparker, a framework that identifies idle resources and resizes the executors launched by Spark so that the spare capacity available on each node is utilized optimally for processing Spark jobs. The executors launched by the Spark framework are of uniform size; since the nodes in the cluster are heterogeneous, such executors cannot exploit the resources of the nodes optimally. Sparker works by identifying the available spare capacity on each node and then resizing the executors appropriately. We have implemented Sparker by modifying the Spark code and evaluated it extensively on a 15-node heterogeneous cluster, with each node running BOSS MOOL OS. On the studied SparkBench benchmark applications, Sparker reduces execution time by 32.5% on average compared to Spark. Further, we have incorporated Tula, an HDFS block balancing strategy for Hadoop, to address the heterogeneity of disk read/write latencies; this improves the performance of the Spark jobs further. The heterogeneity in disk latencies arises because disk performance degrades over time. Eventually, a disk will fail, affecting the execution of jobs, the stored data, and service availability. It therefore becomes essential to identify when a disk drive is likely to fail, so that the effects on the execution of the Spark jobs can be mitigated. As part of this work, we have also designed a deep neural network model based on a Long Short-Term Memory (LSTM) network with an attention mechanism to predict the Remaining Useful Life (RUL) of hard disk drives. We have used publicly available datasets on disk failures to train the LSTM network. The predicted RUL values can be used by Tula to move the blocks appropriately. The developed model can also be used to predict failures of other components with time-series data. |
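
As a rough illustration of the executor-resizing idea behind Sparker, the Python sketch below sizes each executor to the spare memory and cores observed on its node instead of using a single uniform executor size. The node statistics, reservation defaults, and function names here are illustrative assumptions, not the actual Sparker code inside Spark.

    # Illustrative sketch (not the actual Sparker implementation): size each
    # executor to the spare memory and cores observed on its node rather than
    # using Spark's single uniform executor size.
    from dataclasses import dataclass

    @dataclass
    class NodeStats:
        hostname: str
        free_memory_mb: int   # memory currently unused on the node
        free_cores: int       # CPU cores currently unused on the node

    def plan_executor_sizes(nodes, reserve_memory_mb=1024, reserve_cores=1):
        """Return a per-node executor size that uses the node's spare capacity,
        keeping a small reservation for the OS and other daemons."""
        plan = {}
        for node in nodes:
            mem = max(node.free_memory_mb - reserve_memory_mb, 0)
            cores = max(node.free_cores - reserve_cores, 0)
            if mem > 0 and cores > 0:
                plan[node.hostname] = {"executor_memory_mb": mem,
                                       "executor_cores": cores}
        return plan

    # Example: a heterogeneous 3-node cluster whose nodes differ in spare capacity.
    cluster = [
        NodeStats("node-01", free_memory_mb=8192, free_cores=4),
        NodeStats("node-02", free_memory_mb=16384, free_cores=8),
        NodeStats("node-03", free_memory_mb=4096, free_cores=2),
    ]
    for host, size in plan_executor_sizes(cluster).items():
        print(host, size)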
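
Similarly, a minimal sketch of an LSTM-with-attention RUL regressor, written with the Keras API, is shown below; the window length, feature count, and layer sizes are assumptions chosen for illustration and do not reflect the exact architecture or training data used in the thesis.

    # Illustrative sketch (hypothetical layer sizes and feature count): an LSTM
    # with a simple attention layer that maps a window of SMART-style disk
    # telemetry to a Remaining Useful Life (RUL) estimate.
    import tensorflow as tf
    from tensorflow.keras import layers, Model

    def build_rul_model(window_len=30, n_features=12):
        inputs = layers.Input(shape=(window_len, n_features))
        # Encode the telemetry window, keeping the per-timestep outputs.
        hidden = layers.LSTM(64, return_sequences=True)(inputs)
        # Attention: score each timestep, softmax the scores, and take the
        # weighted sum of the LSTM outputs as a context vector.
        scores = layers.Dense(1, activation="tanh")(hidden)
        weights = layers.Softmax(axis=1)(scores)
        context = layers.Lambda(
            lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([hidden, weights])
        # Regress the RUL (e.g., days until failure) from the context vector.
        rul = layers.Dense(1, activation="relu")(context)
        model = Model(inputs, rul)
        model.compile(optimizer="adam", loss="mse")
        return model

    model = build_rul_model()
    model.summary()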