Layered performance modelling and evaluation for cloud topic detection and tracking based big data applications
Abstract
“Big Data”, best characterized by its three features, namely
“Variety”, “Volume” and “Velocity”, is revolutionizing
nearly every aspect of our lives, from enterprises to
consumers and from science to government. A fourth characteristic,
namely “Value”, is delivered through the use of smart data
analytics over Big Data. One such Big Data analytics application,
considered in this thesis, is Topic Detection and Tracking (TDT).
The characteristics of Big Data bring with them unprecedented
challenges: data too large for traditional devices to process
and store (volume), arriving too fast for traditional methods to
scale (velocity), and heterogeneous in form (variety). In recent times,
cloud computing has emerged as a practical and technical solution
for processing Big Data. However, when deploying Big Data
analytics applications such as TDT in the cloud (called cloud-based
TDT), the challenge is to cost-effectively orchestrate and
provision cloud resources to meet performance Service Level
Agreements (SLAs). Although there exists limited work on
performance modeling of cloud-based TDT applications, none of
these methods can be directly applied to guarantee the
performance SLAs of cloud-based TDT applications. For instance,
current literature lacks a systematic, reliable and accurate
methodology to measure, predict and finally guarantee
the performance of TDT applications. Furthermore, existing
performance models fail to consider the end-to-end complexity of
TDT applications and focus only on the individual processing
components (e.g. MapReduce).
To tackle this challenge, in this thesis, we develop a layered
performance model of cloud-based TDT applications that takes into
account Big Data characteristics, the data and event flow across
myriad cloud software and hardware resources, and diverse SLA
considerations. In particular, we propose and develop models to
capture, in detail and with great accuracy, the factors that play a
pivotal role in the performance of cloud-based TDT applications,
identify the ways in which these factors affect performance, and
determine the dependencies between the factors. Further, we have
developed models to predict the performance of cloud-based TDT
applications under the uncertainty imposed by Big Data
characteristics. The model developed in this thesis is intended to
be generic, allowing its application to other cloud-based data
analytics applications. We have demonstrated the feasibility,
efficiency, validity and prediction accuracy of the proposed
models via experimental evaluations using a real-world flu
detection use case on the Apache Hadoop MapReduce, HDFS and Mahout
frameworks.