Layered performance modelling and evaluation for cloud topic detection and tracking based big data applications

Wang, Meisong

Date

2016

Authors

Wang, Meisong

Abstract

“Big Data”, best characterized by its three features, namely “Variety”, “Volume” and “Velocity”, is revolutionizing nearly every aspect of our lives, from enterprises to consumers and from science to government. A fourth characteristic, “Value”, is delivered through the use of smart data analytics over Big Data. One such Big Data analytics application considered in this thesis is Topic Detection and Tracking (TDT). The characteristics of Big Data bring unprecedented challenges: data too large for traditional devices to process and store (volume), data arriving too fast for traditional methods to scale (velocity), and heterogeneous data (variety). In recent times, cloud computing has emerged as a practical and technical solution for processing Big Data. However, when deploying Big Data analytics applications such as TDT in the cloud (called cloud-based TDT), the challenge is to cost-effectively orchestrate and provision cloud resources to meet performance Service Level Agreements (SLAs). Although there exists limited work on performance modeling of cloud-based TDT applications, none of these methods can be directly applied to guarantee the performance SLA of cloud-based TDT applications. For instance, the current literature lacks a systematic, reliable and accurate methodology to measure, predict and finally guarantee the performance of TDT applications. Furthermore, existing performance models fail to consider the end-to-end complexity of TDT applications and focus only on individual processing components (e.g. MapReduce). To tackle this challenge, in this thesis we develop a layered performance model of cloud-based TDT applications that takes into account Big Data characteristics, the data and event flow across myriad cloud software and hardware resources, and diverse SLA considerations.
In particular, we propose and develop models that capture, in detail and with great accuracy, the factors that play a pivotal role in the performance of cloud-based TDT applications, identify the ways in which these factors affect performance, and determine the dependencies between the factors. Further, we have developed models to predict the performance of cloud-based TDT applications under the uncertainty imposed by Big Data characteristics. The model developed in this thesis is intended to be generic, allowing its application to other cloud-based data analytics applications. We have demonstrated the feasibility, efficiency, validity and prediction accuracy of the proposed models via experimental evaluations using a real-world flu detection use case on the Apache Hadoop MapReduce, HDFS and Mahout frameworks.