Accepted Manuscript

Hora: Architecture-aware Online Failure Prediction

Teerat Pitakrat, Dušan Okanović, André van Hoorn, Lars Grunske

PII: S0164-1212(17)30039-0
DOI: 10.1016/j.jss.2017.02.041
Reference: JSS 9928

To appear in: The Journal of Systems & Software

Received date: 15 April 2016
Revised date: 6 October 2016
Accepted date: 24 February 2017

Please cite this article as: Teerat Pitakrat, Dušan Okanović, André van Hoorn, Lars Grunske, Hora: Architecture-aware Online Failure Prediction, The Journal of Systems & Software (2017), doi: 10.1016/j.jss.2017.02.041

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Highlights

An online failure prediction approach which employs architectural knowledge

An automatic extraction of the architectural dependency model

A proof-of-concept implementation of the approach

An extensive evaluation on three typical failure scenarios

Hora: Architecture-aware Online Failure Prediction

Teerat Pitakrat a,*, Dušan Okanović a, André van Hoorn a, Lars Grunske b

a University of Stuttgart, Institute of Software Technology, Reliable Software Systems
b Humboldt University Berlin, Department of Computer Science, Software Engineering

Abstract

Complex software systems experience failures at runtime even though a lot of effort is put into the development and operation. Reactive approaches detect these failures after they have occurred and already caused serious consequences. In order to execute proactive actions, the goal of online failure prediction is to detect these failures in advance by monitoring the quality of service or the system events. Current failure prediction approaches look at the system or individual components as a monolith without considering the architecture of the system. They disregard the fact that the failure in one component can propagate through the system and cause problems in other components. In this paper, we propose a hierarchical online failure prediction approach, called Hora, which combines component failure predictors with architectural knowledge. The failure propagation is modeled using Bayesian networks which incorporate both prediction results and component dependencies extracted from the architectural models. Our approach is evaluated using Netflix's server-side distributed RSS reader application to predict failures caused by three representative types of faults: memory leak, system overload, and sudden node crash. We compare Hora to a monolithic approach and the results show that our approach can improve the area under the ROC curve by 9.9%.

Keywords:

online failure prediction, reliability, component-based software systems

1. Introduction

Quality of service (QoS) problems in software systems at runtime, such as performance degradation and service outage, can lead to frustrated customers and losses in revenue. Software service providers need to make sure that the system satisfies its QoS requirements, e.g., in terms of response times not exceeding a certain threshold. During operation, such QoS properties are continuously monitored at the system boundary to assess QoS compliance. Based on Avizienis et al. [1], we use the term failure in this paper to refer to the state of a component or system violating expected QoS levels.

Online failure prediction [2] aims to foresee looming QoS problems at runtime before they manifest themselves. Accurate failure predictions are a prerequisite for preemptive maintenance actions, reducing the effect of problems or even completely preventing them from occurring [3, 4, 5, 6].

* Corresponding author

Email address: pitakrat@informatik.uni-stuttgart.de (Teerat Pitakrat)

Existing online failure prediction approaches predict failures either of the whole system or of specific parts of the system. These approaches employ monolithic models which view the whole system or the service as one entity and predict failure events based on externally observable measures, e.g., response time [7, 8], event logs [9, 10], or system metrics [11, 12]. However, without considering the architecture or dependencies between components, these approaches are able to predict only the component failures but not their consequences on other parts of the system.

When faced with complex software systems comprised of a large number of internal and external components, the existing approaches may not be able to properly analyze all the influencing measures. For instance, system failures, which are visible to users, usually originate from complex interactions of erroneous components inside the system. These internal errors, which can be regarded as failures on component level, can propagate to other parts of the system through the architectural dependencies. This causes a chain of errors up to the system boundary and results in a failure on the system level [1, 13, 14, 15, 16]. This concept of


failure propagation follows the definition by Avizienis et al. [1]. The notion of fault, error, and failure depends on the perspective of the viewer. For example, a failure from a component perspective can be considered as an error from the system perspective.

To overcome the challenges of predicting failures in complex systems, we hypothesize that online failure prediction can be improved by including architectural information about software systems. In this paper, we propose a hierarchical online failure prediction approach, called Hora1. The core idea is to combine architectural models with component failure prediction techniques. We use two types of architectural models: 1) Architectural Dependency Model (ADM) and 2) Failure Propagation Model (FPM). The ADM captures the dependencies between architectural entities. The FPM employs Bayesian network theory [17] to represent the propagation paths of failures with corresponding probabilities. Both ADM and FPM can be automatically generated from an existing architectural model. At runtime, the FPM is constantly updated with individual failure probabilities, which are computed by component failure predictors. These predictors apply techniques, such as time series forecasting or machine learning, on monitoring data obtained from each component to predict component failures. In the last step, the FPM is solved to combine individual and propagated failure probabilities. The concept of separating the prediction into component failure prediction and the propagation prediction allows suitable prediction techniques to be used (and reused) for different types of components.

Our evaluation investigates the prediction quality of Hora with respect to the following research question: RQ1: Does Hora improve the prediction quality compared to a monolithic approach? If yes, to what degree? RQ2: How do the parameters of Hora affect the prediction quality? RQ3: How does ADM affect the prediction quality? RQ4: What is the runtime overhead of Hora? The evaluation includes three typical types of faults, which are memory leak, system overload, and node crash, applied to Netflix's distributed RSS reader application. The results of our approach are compared to those of a monolithic approach, which does not consider architectural knowledge for the prediction but rather only the failure probability of each individual component. The results show that our approach can predict runtime failures with a higher prediction quality, with respect to standard evaluation metrics [2].

To summarize, the paper contributes a novel online failure prediction approach, called Hora, which employs

1 Hora is a Thai word, which means an oracle

architectural knowledge together with a configurable and reusable set of failure prediction techniques. The approach is accompanied by a proof-of-concept implementation. The evaluation shows Hora's benefits over a monolithic approach.

This paper is an extended version of our previous work [18]. The approach is extended by an automatic extraction of the ADM. Moreover, this paper presents a comprehensive evaluation of the approach with respect to the previously mentioned research questions.

The remainder of the paper is structured as follows. Section 2 emphasizes the challenge of online failure prediction in distributed software systems. Section 3 details the Hora approach. The evaluation and the discussion of the results are presented in Section 4. Section 5 describes the related work. Finally, Section 6 draws the conclusion and outlines future work. Additionally, we provide supplementary material for this paper [19].

2. Motivating Example

Figure 1 provides a high-level view on a typical distributed enterprise application system [20]. The example system conforms to the common three-tier architectural style of enterprise applications. Each of the tiers comprises a number of instances, to which requests are distributed over load balancers. Each instance comprises a complex stack of software architecture, middleware services, operating system, virtualization, and hardware components.

In this example, it can be observed that at 4:05 PM QoS problems manifest themselves at the system boundary as a prompt increase in response times and failing requests for the provided service. Online failure prediction approaches aim to predict failures before they occur in order to allow timely actions, such as preventive maintenance, to decrease or completely prevent system downtime. However, in this case, neither of the two metrics measured at the system boundary gives an indication about the upcoming problem. The traditional approaches for online failure prediction, e.g., time series forecasting based on service response time, are not appropriate in this case because the data does not contain any symptom that precedes the failure.

In addition to the system architecture, Figure 1 includes three system-internal measures of the business-tier instance BT2, namely the utilization of CPU, system memory, and heap space of the Java Virtual Machine (JVM). It can be observed that the CPU utilization increases abruptly at 4:05 PM—the same time as the increase of the service response time. The utilization of system memory increases linearly until 3:55 PM

when it reaches a level close to 100% and remains stable. The JVM heap space utilization shows an increasing trend until reaching almost 100%. In this scenario, we can conclude that the increase of the response times is caused by the increase of the CPU utilization. The increase of the CPU utilization is in turn caused by garbage collection activity inside the JVM—a common problem in Java systems. In this scenario, the root cause of the failure could be a memory leak in the BT2, which causes a chain of errors [14] that propagates to the end users.

The online failure prediction approach, introduced in this paper, aims to predict this kind of problems by incorporating the failure probabilities of the internal components (in this case CPU, memory, and JVM) along with the failure propagation paths through the software system architecture.

3. The Hora Prediction Approach

The main idea of Hora is that if the failure of each component in the software system can be predicted and the dependencies among the components are known, the consequence, i.e., the propagation, of the failures can also be predicted.

The Hora approach is comprised of two integral parts: two types of architectural models (Section 3.1) and hierarchical online failure predictions based on these models (Section 3.2).

3.1. Architectural Models and Transformation

This section introduces two types of architectural models used in Hora, namely Architectural Dependency Model and Failure Propagation Model, as well as the ADM extraction and transformation.

3.1.1. Architectural Dependency Model

There are already a number of existing architectural modeling languages and model extraction mechanisms [21, 22], e.g., PCM [23], Descartes [24], and SLAstic [25]. However, these models are designed for different purposes, i.e., performance modeling, resource allocation, and capacity management. The information regarding component dependencies is not directly available and still needs to be extracted from these models. Thus, we introduce Architectural Dependency Model (ADM)—an intermediate model representing only the dependencies between architectural entities as needed by Hora.

The Architectural Dependency Model (ADM) relates every component to every other component along with

Table 1: Table representation of the ADM for the system in Figure 1

Component Required components and weights

BT1 {(DB, 1.0)}

BT2 {(DB, 1.0)}

BT3 {(DB, 1.0)}

PT1 {(BT1, 0.33), (BT2, 0.33), (BT3, 0.33)}

PT2 {(BT1, 0.33), (BT2, 0.33), (BT3, 0.33)}

LB {(PT1, 0.5), (PT2, 0.5)}

the corresponding degrees of dependency. It can be viewed as an n × n matrix for a system with n components. Each element in the matrix represents how much one component requires another component to function correctly. In other words, it is the probability that a failure of one component will affect another one based on the QoS metrics. For example, an element a_{ij} in the matrix represents the degree to which component c_i depends on c_j, where 1 ≤ i, j ≤ n and i ≠ j. The ADM acts as an intermediate step which contains the information regarding the failure dependencies and allows us to focus on one type of model.

Table 1 shows the table representation of the ADM for the example introduced in Section 2. Following a topological order, the database (DB) has no dependencies on other components; the business-tier instances BT1-3 depend on DB; the presentation-tier instances PT1-2 depend on the business-tier instances BT1-3; and the load balancer (LB) depends on PT1-2. Additionally, Table 1 includes the weights associated with these dependencies—in this case, assuming that requests among the tiers are equally load-balanced to the instances of the next tier. Note that for the sake of simplicity, we consider the six nodes from the example as monolithic components. For realistic scenarios (e.g., in the evaluation in Section 4), these components can be further decomposed into software and hardware components with measures, such as service response time, method response time, or resource utilization. Moreover, the network devices, e.g., routers that connect these components, can also be included to model connector failures.
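To make this representation concrete, the following Python sketch encodes Table 1 as a mapping from each component to its required components and dependency weights. It is an illustration only and not part of the Java-based Hora implementation; the component names follow the running example.

# Illustrative sketch of the ADM in Table 1: each component is mapped to the
# components it requires and the corresponding degrees of dependency.
adm = {
    "DB":  {},                                         # no required components
    "BT1": {"DB": 1.0},
    "BT2": {"DB": 1.0},
    "BT3": {"DB": 1.0},
    "PT1": {"BT1": 0.33, "BT2": 0.33, "BT3": 0.33},
    "PT2": {"BT1": 0.33, "BT2": 0.33, "BT3": 0.33},
    "LB":  {"PT1": 0.5, "PT2": 0.5},
}

def dependency_degree(dependent, required):
    """Return a_ij, the degree to which 'dependent' requires 'required' (0.0 if none)."""
    return adm[dependent].get(required, 0.0)

print(dependency_degree("PT1", "BT2"))   # 0.33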

3.1.2. Automatic Extraction of Architectural Dependency Model

The ADM can be manually created by experts who have knowledge about the dependencies of the system. However, in practice, most systems contain a large number of components. Thus, the manual creation of the model becomes tedious, time-consuming, and error-prone.

In this section, we introduce an automatic extraction of the ADM from an existing architectural model. The architectural model used as the source is the SLAstic model [25] which contains information regarding the structure and behavior of the system, and can be automatically extracted from monitoring data. The monitoring data required for the automatic model extraction comprises system-level resource usage, e.g., CPU utilization and memory consumption, and detailed execution traces of the application for each user request. From this data, the hardware and software components are discovered and linked together according to their observed relationship. The result of this automatic extraction is a SLAstic model which includes software components and their relationships, deployment information, and number of invocations of each component. This model extraction is described in detail in our previous work [25]. By combining the information from the SLAstic model, we obtain sufficient knowledge about the dependencies of the components to create the ADM.

Figure 2 visualizes the number of invocations between components extracted from the SLAstic model based on the running example in Figure 1. The degree of dependency or dependency weight, denoted as w, between two software components A and B can be computed as

w_{AB} = i_{AB} / i_{A}

where i_{AB} is the number of invocations of component B by component A and i_A is the number of invocations of component A. In Figure 2, PT1 is invoked 300 times and BT2 is invoked by PT1 100 times. Thus, the degree of dependency between PT1 and BT2 is 100/300 ≈ 0.33. Since the calling probability of PT1 to BT2 is 0.33, we hypothesize that if BT2 fails, the probability of PT1 failing due to the cascading failure is also 0.33. The degrees of dependency of other components can be computed in the same manner and the resulting ADM is similar to that in Table 1.
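As a small illustration of this computation, the following sketch derives the weights from monitored invocation counts. The counts for PT1 are taken from the running example; splitting the 300 invocations evenly across BT1-3 is an assumption that matches the weights in Table 1.

# Sketch: dependency weights w_AB = i_AB / i_A from monitored invocation counts.
invocations = {"PT1": 300}                              # i_A: how often A was invoked
calls = {"PT1": {"BT1": 100, "BT2": 100, "BT3": 100}}   # i_AB: how often A invoked B

def dependency_weights(caller):
    """Compute the dependency weight of 'caller' on each of its callees."""
    i_a = invocations[caller]
    return {callee: i_ab / i_a for callee, i_ab in calls[caller].items()}

print(dependency_weights("PT1"))   # {'BT1': 0.333..., 'BT2': 0.333..., 'BT3': 0.333...}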

In addition to the dependencies between software components, there are also dependencies between software and hardware components. For example, software components that are deployed on a physical node require that the hardware components of that node operate correctly. Each hardware component includes a set of measures of that component, such as utilization or temperature. In our experiment (Section 4), we consider the dependencies to the load average, CPU utilization, memory utilization, and swap utilization of all physical machines. For simplicity reasons, the hardware components are not shown in the example in Table 1.

The automatic extraction of the ADM can be configured so that some software or hardware components can be excluded. Moreover, the degrees of dependencies between components can also be adjusted. This is useful in cases when a component appears in the model due to the monitoring setup but it is known not to have any effect on the QoS and the failure propagation.

3.1.3. Failure Propagation Model

Although the ADM contains the component dependencies and the corresponding weights, it is not suitable for failure prediction because the probabilities of cascading failures are not explicitly included. To model these probabilities, we transform the ADM into another representation, called Failure Propagation Model (FPM), which is an abstraction that represents the concept of the inference of failure propagations.

The FPM employs the formalism of Bayesian networks [17] which is a probabilistic directed acyclic graph that can represent random variables and their conditional dependencies. These conditional dependencies are essential for the inference of failure propagation. The inference that includes only the health of the immediate dependent components will not be able to predict failures that propagate through many layers of the system. For instance, without conditional dependencies, the failure of DB in Figure 1 would not have any effect on the failure probability of LB. Thus, conditional dependencies take into account the failure probabilities of all components in the dependency chain for the inference.

Figure 3 depicts the Bayesian network which illustrates the relationship of node failures for the example system. Each node in the graph represents a software/hardware component and an arrow represents a causal relationship between components. The relationship implies that a failure of a parent component can cause a failure in the child component. For simplicity reasons, in this example we only consider each physical machine as a node in the graph without going into the details of each machine.

The conditional dependencies between the nodes in the graph are represented by a Conditional Probability Table (CPT). Each node in the graph has a corresponding CPT which contains conditional probabilities of possible failures occurring, given the failure probability of its parent components. For instance, the database is a node that does not depend on any other nodes. Therefore, its CPT contains only two failure probabilities that represent the probability of failure occurring, and not occurring, from inside the database itself. The

table is shown in Figure 3 as CPTDB. The failure probability is denoted by P(DBF) which is computed regularly at runtime by the corresponding component failure predictor, as will be detailed in Section 3.2.1. On the other hand, the operation of a business tier (BT1-3) requires a database (DB) with a dependency weight of 1.0, according to the ADM in Table 1. This means if the database fails, the business-tier instances will also fail. The CPT of the business-tier instance BT3 is presented in Figure 3 (CPTBT3). The first row indicates the failure probability of the business tier itself, if the database is operating properly. The second row indicates the probability of BT3 failing if the database fails.

A more complex relationship can be seen from the presentation tier which requires at least one business-tier instance. If one business-tier instance fails, the presentation tier can still operate by forwarding requests to the remaining business-tier instances. As listed in Table 1, the dependency weight of each presentation-tier instance to each business-tier instance is approximately 0.33. This implies that, for each business-tier failure, the failure probability of the presentation-tier instances will increase by approximately 0.33. Hence, if all business-tier instances fail, this failure probability will sum up to 1.0 which means that the presentation-tier instances will also fail. The CPT of PT2 is presented in Figure 3 as CPTPT2.

Assuming that a component c_0 depends on n other components with n ≥ 1, the CPT of c_0 can be expressed as the multiplication of a truth table matrix T of size 2^n × n and the weight matrix W_{c_0} of size n × 2:

CPT_{c_0} = T × W_{c_0}

where T enumerates all 2^n combinations of operational (0) and failed (1) states of the required components c_1, ..., c_n, one combination per row, ranging from (0, 0, ..., 0) to (1, 1, ..., 1), and

W_{c_0} = [ w_{c_0 c_1}  1 - w_{c_0 c_1} ; ... ; w_{c_0 c_n}  1 - w_{c_0 c_n} ]

with c_i, 1 ≤ i ≤ n, being the required components and w_{c_0 c_i} the corresponding dependency weights from component c_0 to component c_i.

Figure 4: Example prediction results: monolithic approach vs. Hora. (a) Memory failure prediction (monolithic approach); (b) service response time failure prediction (monolithic approach); (c) service response time failure prediction (Hora). Each panel covers the period around 3:35 PM to 3:55 PM and marks the failure threshold.

The CPTs of other nodes are also created in the same manner as those for the database, business tier, and presentation tier. The complete model with all CPTs is used as a core model to infer about the failure probability of each component and failure propagation. The failure prediction and inference of the model at runtime will be discussed in detail in Section 3.2.
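As an illustration of this construction, the following sketch builds such a CPT in Python. It is not the actual Hora implementation (which is Java-based); the failure column is computed as the weighted sum of the failed required components, as in the presentation-tier example above, and the clipping to [0, 1] as well as the handling of the all-operational row are assumptions made for the sketch.

import itertools
import numpy as np

def build_cpt(weights, own_failure_prob):
    """Sketch of CPT construction for a component with required components c_1..c_n.

    weights: dependency weights w_{c0 ci} taken from the ADM.
    own_failure_prob: failure probability of the component itself (from its
    component failure predictor); it fills the row in which no parent fails.
    Returns a (2^n x 2) array: column 0 = P(failure), column 1 = P(no failure).
    """
    w = np.asarray(weights, dtype=float)
    n = len(w)
    truth_table = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    p_fail = np.clip(truth_table @ w, 0.0, 1.0)   # weighted sum of failed parents
    p_fail[0] = own_failure_prob                  # row in which all parents operate
    return np.column_stack([p_fail, 1.0 - p_fail])

# CPT of PT2, which requires BT1-3 with weight 0.33 each (cf. Table 1 and Figure 3)
print(build_cpt([0.33, 0.33, 0.33], own_failure_prob=0.05))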


Figure 5: Probability density function of memory utilization at 3:55 PM and memory failure probability

3.2. Hierarchical Online Failure Prediction

The prediction of the failures and the inference of their propagation at runtime are comprised of two main steps. The first step is the prediction of individual component failure probabilities (Section 3.2.1). The second step is the inference of failure propagation based on the individual component failures and the FPM (Section 3.2.2).

3.2.1. Component Failure Prediction

The purpose of component failure predictors is to predict failures of each individual component. Each predictor monitors the runtime measurements of one component and makes a prediction whether the current state might lead to a component failure.

At runtime, when a component failure predictor produces a new prediction, the result is used to update the FPM to keep the model up-to-date. Since the prediction result of the component failure predictor indicates the probability of a failure originating from the component itself, this probability replaces the first row of the CPT of the corresponding node in the model. For example, if the predictor of BT3 predicts that it is going to fail with a probability of 0.8, the probabilities in the first row of CPTBT3 in Figure 3, in which DB failure is False, will be set to 0.8 and 0.2, respectively. This process is periodically performed for all component failure predictors.

Figure 4a illustrates the concept of a component failure predictor based on the memory consumption of business tier BT2 in Figure 1. Since the memory consumption is time series data, we can, for instance, employ autoregressive integrated moving average (ARIMA) [26] as a component failure prediction technique. The goal of the prediction is to predict when the memory utilization will reach the 100% threshold, assuming that the machine will have a performance degradation when the memory is depleted, which can cause a service failure. The thin solid line in the graph indicates the monitoring data of memory utilization up

to 3:35 PM. The dash-dotted line indicates the prediction of the memory utilization in the next 20 minutes with a prediction interval in light grey.

The probability of the monitoring data crossing the failure threshold a can be computed using the probability density function f(x) of the predicted performance measure:

P(X > a) = ∫_a^∞ f(x) dx    (5)

Figure 5 depicts the probability density function of the memory utilization at 3:55 PM. Assuming that the input data is normally distributed, the prediction error is also normally distributed [27]. Thus, the prediction interval resembles a normal distribution. The predicted value of 93% indicates the mean of the distribution. The 95% prediction interval covers the ±1.96σ area of the distribution [27]. The probability of the memory utilization crossing the failure threshold at 100% can be computed using Equation 5 with a = 100. It is worth noting that, in the case of non-normally distributed data, the probability density function f(x) in Equation 5 can be replaced by a suitable distribution to compute the component failure probability.

In this section, we show how Hora employs ARIMA as a component failure predictor. Hora is designed to utilize any other prediction method, such as other time series forecasting [7] or machine learning techniques [28, 29], as long as they can 1) predict the failure probability and 2) provide the expected time of the failure.
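A minimal sketch of this step is shown below (illustrative Python, not the Hora implementation). Only the predicted mean of 93% is taken from Figure 5; the prediction interval bound in the example is hypothetical.

from scipy.stats import norm

def failure_probability(predicted_mean, interval_upper, threshold, confidence=0.95):
    """Probability that the predicted measure exceeds the failure threshold a,
    assuming a normally distributed prediction error (cf. Equation 5)."""
    z = norm.ppf(0.5 + confidence / 2.0)           # 1.96 for a 95% prediction interval
    sigma = (interval_upper - predicted_mean) / z  # standard deviation of the forecast
    return norm.sf(threshold, loc=predicted_mean, scale=sigma)   # P(X > a)

# Predicted memory utilization of 93%, hypothetical 95% interval up to 105%,
# failure threshold a = 100%
print(failure_probability(93.0, 105.0, 100.0))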

3.2.2. Inference of Failure Propagation

The inference of the failure propagation is the last step of Hora, which aims to predict what can be the effects of component failures. Once the component failure probabilities are updated in the model, we use Bayesian inference [17] to obtain the failure probabilities of all components. The inference of the components' failure probabilities takes into account not only their own failure probabilities but also those of their

parents and ancestors. If a node's ancestors have high failure probabilities and the corresponding degree of dependency w is also high, its failure probability will also be high. Therefore, the inference allows us to model and predict failure propagation from inside to the outside of the system. At runtime, the inference is done at regular intervals to provide the current failure probabilities of all components. The lead time of the inference is the same as those of the component failure predictors and can be set in the configuration of Hora.
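To make the inference step concrete, the following sketch computes the marginal failure probabilities for the running example by brute-force enumeration of the joint distribution. The Hora implementation uses the Jayes Bayesian network library instead; the component failure probabilities below are illustrative values, and the CPT semantics follow Section 3.1.3.

from itertools import product

# Component failure probabilities from the individual predictors (illustrative values)
own = {"DB": 0.01, "BT1": 0.01, "BT2": 0.3, "BT3": 0.01,
       "PT1": 0.01, "PT2": 0.01, "LB": 0.01}
# Required components and dependency weights from the ADM (Table 1)
parents = {"DB": {}, "BT1": {"DB": 1.0}, "BT2": {"DB": 1.0}, "BT3": {"DB": 1.0},
           "PT1": {"BT1": 0.33, "BT2": 0.33, "BT3": 0.33},
           "PT2": {"BT1": 0.33, "BT2": 0.33, "BT3": 0.33},
           "LB": {"PT1": 0.5, "PT2": 0.5}}
nodes = ["DB", "BT1", "BT2", "BT3", "PT1", "PT2", "LB"]   # topological order

def p_fail_given(node, failed):
    """CPT entry: own failure probability if no parent failed, otherwise the
    (clipped) sum of the weights of the failed parents."""
    w = sum(wgt for p, wgt in parents[node].items() if p in failed)
    return own[node] if w == 0 else min(1.0, w)

def marginal_failure_probabilities():
    """Exact marginal P(component fails) by enumerating the joint distribution."""
    marginals = {n: 0.0 for n in nodes}
    for states in product([False, True], repeat=len(nodes)):
        failed = {n for n, s in zip(nodes, states) if s}
        prob = 1.0
        for n in nodes:
            p = p_fail_given(n, failed)
            prob *= p if n in failed else 1.0 - p
        for n in failed:
            marginals[n] += prob
    return marginals

print(marginal_failure_probabilities())   # e.g., P(LB fails) reflects the leak in BT2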

Figure 4b shows the result of an ARIMA component failure predictor which predicts service failures based only on the observed response time at the system boundary, i.e., the load balancer. The predictor does not take into account the failure probabilities of other components. It is obvious that this predictor cannot predict the first few occurrences of the service failure. This is because the response time starts increasing exactly when the first failure occurs. Figure 4c illustrates the prediction of Hora which takes into account the architectural dependencies of the components. Hora considers not only the failure probability of the response time but also the failure probability of memory utilization in Figure 4a and the failure propagation model in Figure 3. By comparing failure probabilities in Figure 4a and Figure 4c, it can be observed that the failure probability at 3:55 PM is scaled down from approximately 0.2 to 0.1. This effect is due to the weight of the dependency presented in Table 1. This inference provides a better estimation of the failure probability because the system contains three instances of the business tier and a failure in one instance does not necessarily have to cause a service failure. Should the prediction be made without the inference, the failure probability would be higher and cause a false alarm.

4. Evaluation

This section describes the evaluation of Hora and aims to answer the following research questions: RQ1: Does Hora improve prediction quality compared to a monolithic approach? If yes, to what degree?

This research question refers to our main hypothesis that we can improve the quality of online failure prediction using Hora. In order to investigate the degree of improvement, we compare Hora with a monolithic approach which is a set of component failure predictors that do not use the architectural knowledge to propagate the failure probabilities. The evaluation is conducted as a lab experiment with a distributed enterprise application under a synthetic workload with three different types of faults.

RQ2: How do the parameters of component failure predictors affect the prediction quality?

The configuration of Hora includes various parameters, such as the aggregation window size, the amount of historical data, and the lead time. When collecting response times of the service and its components, the monitoring data does not form an equidistant time series because the operations are not executed at regular intervals. Thus, the data have to be preprocessed by computing the average response time in a time window, whose length is denoted here as the aggregation window size. The historical data length defines how much of the past data is used to predict future observations. The lead time is the time between when the prediction is made and when the failure is expected to occur. A longer lead time allows the operator to be informed about pending failures earlier than a shorter one.

This research question aims to investigate the sensitivity of Hora's prediction quality with respect to different parameter configurations. RQ3: How does ADM affect the prediction quality?

A system can be represented by multiple ADMs. For example, different configurations of the monitoring result in different ADMs when using the automatic extraction described in Section 3.1.2. This research question aims to investigate the impact of the ADM with different numbers of components and degrees of dependencies on the prediction quality.

RQ4: What is the runtime overhead of Hora?

The goal of Hora is to predict failures at runtime. Therefore, the runtime overhead of Hora should be taken into account. This research question aims to investigate how much time is required for Hora to analyze the monitoring data and to infer the FPM to make predictions.

4.1. Experimental Methodology and Settings

4.1.1. Hora Framework Implementation

According to the concept presented in Section 3.1 and Section 3.2, we have developed a Java-based proof-of-concept implementation. For the Bayesian network, we employ the Jayes library [30]. The runtime measurements of the system, including application-level performance and execution traces, as well as resource utilization measurements, are collected by the Kieker monitoring framework [31]. For time series forecasting, the statistical library R [32] is used. Hora can be configured by using a configuration file. The configuration parameters relevant to the evaluation will be presented in Section 4.1.3. The implementation of Hora is available as part of the supplementary material [19].

4.1.2. System Under Analysis

The Hora approach is evaluated using an extended version of a distributed RSS feed reader application developed by Netflix.2 This microservice-based application [33] provides a web service where users can view, add, or delete RSS feeds.

Our setup contains two instances in the presentation tier, three instances in the business tier, and one database. Additionally, the system has one frontend load balancer, one service discovery node, and two RSS feed servers. The workload driver is set up on a separate node and uses Apache JMeter [34] to generate user requests. The workload generated by each user includes view, add, and delete operations of the RSS feed with an average 3-second think time between two requests. The number of concurrent users is set to 150 throughout the experiment. On average, the workload generates approximately 90 requests per second.

The described system is deployed on Emulab [35], which is a large-scale virtualized network testbed. Each of the instances is a physical machine type pc3000 which is equipped with a 3-GHz 64-bit Xeon processor and 2 GB of physical memory, running Ubuntu 14.04.1 LTS and Java 1.7.0 update 75.

4.1.3. Experiment Configuration

This section describes the configurations of parameters and models that we use for the evaluation.

ADM and Monitored Variables. The ADM used in the evaluation is similar to that in Table 1. Other than the physical nodes in the system, we include additional software and hardware components with the following measures:

• Response times of view, add, and delete operations at the system boundary (frontend load balancer),

• Response times of methods involved in processing requests in all presentation- and business-tier instances,

• Load average, CPU utilization, memory utilization, and swap utilization of all physical machines.

Failure Definition. A service failure is defined as an event that occurs when the service deviates from the correct service [1]. For example, the deviation of the service can be regarded as an increase in the response time, a service outage, or an incorrect result. In our experiment, we classify a service to be in a healthy or failure state by observing the response time and the response


2 https://github.com/hora-prediction/recipes-rss

itself in 2-minute windows. We consider a service to fail if the following conditions occur within a certain time window: 1) the 95th percentile of the server-side response times of all of the requests in that window exceeds 1 second and/or 2) the ratio of successful requests over all requests falls below 99.99%.
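A minimal sketch of this failure labeling is given below (illustrative Python; the thresholds are the ones stated above, and the request counts in the example are hypothetical).

import numpy as np

def is_service_failure(response_times_ms, successful_requests, total_requests):
    """Label a 2-minute window as a failure if the 95th percentile of the
    server-side response times exceeds 1 second and/or the ratio of successful
    requests falls below 99.99%."""
    p95 = np.percentile(response_times_ms, 95)
    success_ratio = successful_requests / total_requests
    return p95 > 1000.0 or success_ratio < 0.9999

# Hypothetical window: 10,000 requests, 3 of them failed
response_times_ms = np.random.lognormal(mean=5.5, sigma=0.4, size=10_000)
print(is_service_failure(response_times_ms, successful_requests=9_997, total_requests=10_000))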

The response time threshold of all methods is set to 1 second in the same manner as the server-side response time. We select this value because it eliminates the need of a training phase while still allowing the component failure predictors to make predictions. An alternative to this is to have the thresholds set manually. However, it is infeasible in practice when the system contains a large number of components. A second alternative is to determine the thresholds by learning the response times of all methods. However, this would introduce a learning phase and the response time could vary depending on the context in which the method would be used. In our future work, we plan to investigate the feasibility of prediction techniques that do not require the definition of a failure threshold, e.g., control charts.

The failure definitions of other architectural entities are set according to the types of the entities. The memory utilization threshold is set to 100% according to its physical limit because after this point the operating system will start swapping which uses the space on the hard drive. The heap utilization threshold is set to 90%, since the garbage collector is triggered automatically when utilization becomes too high. The load average represents the number of tasks in the CPU queue over time and provides more information than the CPU utilization [36]. For example, a 1-minute load average of 1.0 means that there is one task in the CPU queue on average in the past minute. Thus, we set the failure threshold of the load average to 1.0.

Prediction Technique. At runtime, the monitoring data, containing execution traces and resource measurements, are aggregated into windows of size 2 minutes, which are then pre-processed according to the type of architectural entity measure. The 95th percentile is calculated for the response time and method response time while the mean is calculated for the load average, memory utilization, and heap utilization.

As stated in Section 3.2, we use ARIMA as a component failure predictor. The size of the historical data for ARIMA is set to 10 minutes. The prediction lead time is 10 minutes with a 95% confidence level. The lead time of the FPM is the same as the component failure predictors, which is 10 minutes.
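A sketch of the pre-processing step is shown below (illustrative Python; the window length and the choice of aggregation functions follow the description above).

import numpy as np

def aggregate(timestamps_s, values, window_s=120, percentile=None):
    """Aggregate raw measurements into fixed-length windows (default: 2 minutes).
    The 95th percentile is used for (method) response times, the mean for load
    average, memory utilization, and heap utilization."""
    start = timestamps_s[0]
    buckets = {}
    for t, v in zip(timestamps_s, values):
        buckets.setdefault(int((t - start) // window_s), []).append(v)
    if percentile is None:
        return [float(np.mean(buckets[k])) for k in sorted(buckets)]
    return [float(np.percentile(buckets[k], percentile)) for k in sorted(buckets)]

# rt_windows  = aggregate(timestamps, response_times_ms, percentile=95)
# mem_windows = aggregate(timestamps, memory_utilization)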

Table 2: Contingency table

Prediction \ Actual       Failure                  Non-failure
Failure                   True positive (TP)       False positive (FP)
Non-failure               False negative (FN)      True negative (TN)

Table 3: Selected derived evaluation metrics

Metric                               Formula
Precision                            TP / (TP + FP)
Recall, True-positive rate (TPR)     TP / (TP + FN)
False-positive rate (FPR)            FP / (FP + TN)
Accuracy                             (TP + TN) / (TP + TN + FP + FN)

4.1.4. Evaluation Metrics

This section introduces the metrics used to evaluate the quality of the prediction. A complete list of metrics and detailed descriptions can be found in [2].

Table 2 presents the four basic evaluation metrics as a contingency table. A true positive (TP) is a correct prediction of a failure. A false positive (FP) is a prediction of a failure that never occurs. A false negative (FN) is a miss which means the failure occurs but it was not predicted. A true negative (TN) is a correct prediction that a failure does not occur. Moreover, we consider four derived metrics, which are precision, recall or true-positive rate (TPR), false-positive rate (FPR), and accuracy, as listed in Table 3.
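For reference, these derived metrics can be computed from the contingency counts as follows (illustrative Python with hypothetical counts).

def derived_metrics(tp, fp, fn, tn):
    """Derived evaluation metrics of Table 3 computed from the contingency counts."""
    return {
        "precision": tp / (tp + fp),
        "recall_tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

print(derived_metrics(tp=40, fp=10, fn=5, tn=945))   # hypothetical counts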

A Receiver Operating Characteristic (ROC) curve [37] represents the quality of the prediction by relating TPRs to FPRs for different prediction thresholds, as shown in Figure 9 (on page 24). A perfect predictor has a curve that covers the entire area of the plot because the TPR is 1 while the FPR is 0. Thus, the closer the curve is to the (0,1) point, the better the prediction is.

Area Under ROC Curve (AUC) measures the area that is covered by a ROC curve and allows comparison between different ROC curves. A perfect predictor would have an AUC of 1. The AUC is recommended to be used as a single-number metric for evaluating learning algorithms [38, 39]. Thus, in our evaluation, the AUC is used as a representative metric for the comparison of the prediction quality of Hora and the monolithic approach. Furthermore, to evaluate the significance of the improvement, we use two-sided hypothesis testing [40] to compare the AUCs with the following

null and alternative hypotheses:

H_0: AUC_Hora = AUC_Monolithic
H_1: AUC_Hora ≠ AUC_Monolithic

The p-value is reported as the result of the test. The method used in testing is introduced by DeLong et al. [41] to compare two or more AUCs. The ROC curves, AUCs, and the results of the tests presented in Section 4.2 are generated using the pROC package [42] available in R [32].


4.1.5. Fault Injection

In our evaluation, we consider three types of faults from real-world incidents [43]: memory leak, system overload, and node crash. We inject one of these fault types into each experiment run and apply Hora and the monolithic prediction approach to predict the failures. Each experiment lasts two hours and is repeated 10 times. The reported evaluation metrics are obtained by combining and analyzing the raw prediction results of all runs. The details of each fault type are described as follows.

Memory Leak—In the experiment, we introduce a memory leak in one of the business tiers. Every time a request is sent from the presentation tier to this specific instance, 1024 bytes of memory will be allocated and never released.

System Overload—System overloads occur when the workload increases, either gradually or abruptly, until the system is not able to handle all the incoming requests. In this scenario, instead of using a constant workload, we increase the number of users until service failures occur (detailed in Section 4.1.3).

Node Crash—Unexpected node crashes are not uncommon in real systems. They can be caused by both software and hardware, such as operating system crashes, hardware failures, or power outages. We introduce this problem by intentionally shutting down two of the business tier instances at 90 and 95 minutes into the experiment.

4.2. Results

The following sections (Section 4.2.1-Section 4.2.4) present the results for the four research questions.

4.2.1. Investigation of Prediction Quality (RQ1)

In this section, we provide the results and explanation for the experiments with different types of faults. The detailed evaluation metrics of all scenarios are summarized in Table 4.

Memory Leak—The memory leak in one of the business-tier instances causes the memory utilization to increase over time. As shown in Figure 6a, at approximately the 52th minute, the component failure predictor for memory in one of the business tier instances predicts that the memory utilization will cross the threshold at the 62th minute with a high probability. This failure probability propagates to other parts of the business tier and, consequently, to the operations of the presentation tier and the load balancer, as can be seen in Figure 6c and Figure 6d. Since the memory leak is occurring in only one of the three business tier instances, the failure probability is reduced by the inference by a factor of 3. At approximately the 75th minute, the garbage collector starts continuously freeing up heap space, which causes the load average to increase (Figure 6b). The failure probability from the load average further increases the failure probability of the service and can be seen in Figure 6c and Figure 6d.

As can be expected, the memory leak causes a sudden increase in the service response time, which cannot be predicted by the response time predictor. On the other hand, Hora considers the memory utilization of the business tier and propagates the failure probability to the service boundary. The result shows that Hora can predict the service failure 10 minutes before it occurs with a failure probability of approximately 0.3. The ROC curves of Hora and the monolithic approach are depicted in Figure 9a.

System Overload—The failures caused by overloading the system start occurring at approximately 80 minutes into the experiment. The increasing number of concurrent users causes the load average of the business-tier instances to exceed the failure threshold. As a result, some of the requests sent from the presentation tier to the business tier are rejected. After a pre-defined number of unsuccessful retries, the presentation tier responds with a page indicating that an error has occurred.

Figure 7b depicts the load average of one business-tier instance which gradually increases over time, while the memory utilization remains stable in Figure 7a. The failure probability from the load average propagates through the architecture to the presentation tier and load balancer, as seen in Figure 7c and Figure 7d.

The result shows that Hora can predict this type of service failure since it takes into account the dependency of the presentation tier on the business tier. On the other hand, the predictor that observes only the increase in the response time at the system boundary is not able to predict this type of service failure since the response time does not exceed the threshold. The ROC curves of both approaches are presented in Figure 9b.

Node Crash—In this scenario, we intentionally crash the second instance of the business tier at 90 minutes, and the third instance at 95 minutes into the experiment. The one remaining business-tier instance has to take over the workload from those that failed. This causes the load average of the remaining instance to increase abruptly, as can be seen in Figure 8b. As a consequence, the service response time of the presentation tier and load balancer, shown in Figure 8c and Figure 8d, also increases unexpectedly.

The result in Figure 9c shows that Hora performs as well as the monolithic approach. This is because the crash occurs unexpectedly without any preceding symptom. The component failure predictors cannot predict this and, therefore, the FPM does not have any failure information to propagate.

Overall—We evaluate the overall prediction quality of Hora by analyzing the combined raw prediction data of all three scenarios. The results in Figure 9d and Table 4 show that Hora improves the overall AUC by 9.9%, compared to the monolithic approach.

4.2.2. Investigation of Parameter Impact (RQ2)

We investigate three prediction parameters of the component failure predictors which are the aggregation window size, historical data size for ARIMA, and lead time. The reported metrics are computed by evaluating the raw prediction results of all fault types.

Aggregation Window Length—We vary the length of the aggregation window from 2 to 8 minutes. Figure 10a shows that the precision increases when the window length increases from 2 to 6 minutes. However, when the length reaches 8 minutes, the precision, TPR, and AUC drop significantly. This is because small aggregation windows preserve the small variations in the data which causes the predictor to produce a lot of false positives. On the other hand, as the aggregation window gets larger, the variations in the data get lost. The predictor cannot detect the changes and, therefore, misses the failures.

Historical Data Length—The length of historical data denotes how many data points further back in the past ARIMA considers for the prediction. Figure 10b shows that the AUC and TPR exhibit a decreasing trend with the data length, while the FPR shows a slightly increasing trend. This is because the prediction quality is reduced when the predictors consider more data from the past because the short-term variations are neglected by the moving average part of ARIMA [26].

Lead Time—When increasing the lead time, the result in Figure 10c shows a decreasing trend of precision, TPR, and AUC. The decrease is caused by the

Table 4: Comparison of all evaluation metrics for the different types of faults

Fault type        Prediction approach   Precision   Recall (TPR)   FPR     Accuracy   AUC     AUC improvement   p-value
Memory leak       Hora                  0.612       0.945          0.096   0.91       0.931   5.7%              < 2.22 x 10^-16
                  Monolithic            0.84        0.758          0.024   0.945      0.881
System overload   Hora                  0.181       0.876          0.216   0.789      0.894   17%               < 2.22 x 10^-16
                  Monolithic            0.352       0.564          0.059   0.92       0.764
Node crash        Hora                  0.059       0.73           0.298   0.703      0.771   < 0.1%
                  Monolithic            0.209       0.582          0.085   0.902      0.766
Overall           Hora                  0.419       0.833          0.091   0.903      0.92    9.9%              < 2.22 x 10^-16
                  Monolithic            0.475       0.692          0.065   0.916      0.837

uncertainty of the prediction. As ARIMA is used to forecast the time series data, the confidence interval becomes larger as predictions are made further into the future. Therefore, choosing an optimal lead time depends on the application and the costs of a false positive and a false negative.

4.2.3. Investigation of Architectural Dependency Model Impact (RQ3)

The automatic extraction of the ADM (Section 3.1.2) allows fine-tuning of the model, e.g., adjusting the degrees of dependencies and excluding some components. This section investigates the impact of the ADM configurations on the prediction quality.

Degree of dependency—The degree of dependency between software components can be directly computed from the number of invocations as described in Section 3.1.2. However, the computation of the degree of dependency from software components to some hardware measures is not straightforward. For example, the load average measure contains three different values, i.e., 1-minute, 5-minute, and 15-minute averages. There are no clear guidelines on what the effect on the operation of the physical machine might be if these measures exceed the threshold.

To investigate the impact of the degree of dependency, we vary the degree of dependency of the load average from 0.2 to 1.0. The results in Figure 11a show that the AUC, TPR, precision, and FPR remain almost constant. This is because the failure probabilities are always propagated to other parts of the system by the FPM regardless of how small they are. Those small propagated probabilities at the system boundary still provide the signs that the failure is imminent which are sufficient to trigger the warning. Therefore, we can conclude that varying the degree of dependency does not significantly affect the prediction quality.

Size of ADM—We evaluate four different ADMs which contain different numbers of architectural components:

• Auto-Large—The model is automatically generated and contains 98 components which include software components, as well as CPU utilization, memory utilization, swap utilization, and load average of all physical nodes.

• Auto-Medium—The model is automatically generated and contains 80 components which include software components, as well as memory utilization and load average. Compared to Auto-Large, the CPU utilization and swap utilization are removed.

• Auto-Small—The model is automatically generated and contains 56 components which include only important software components, as well as memory utilization and load average. Compared to Auto-Medium, the intermediate software operations are removed.

• Manual—The model is manually created by system experts and contains 48 components which include only the important software components, as well as memory utilization, and load average.

The complete details of these four models can be found in the supplementary material [19]. The evaluation results of the models are presented in Figure 11b. Although all ADMs have approximately the same AUC, TPR, and FPR, the manually created ADM achieves the highest precision. This is because the automatic extraction adds the components that do not improve the prediction and even decrease the precision. In other words, they do not help predict more failures, but rather produce more false alarms. However, the automatic extraction of the model has an advantage that it can create the ADM for a large system which is infeasible for a manual creation. Therefore, it is a trade-off between

Figure 10: Prediction quality of Hora with different parameter configurations: (a) aggregation window length, (b) historical data length, (c) lead time. The curves show AUC, TPR, precision, and FPR.

Figure 12: Hora's average analysis and prediction duration (including 95% confidence interval) for a 2-hour monitoring data, shown for the Architectural Dependency Models Auto-Large, Auto-Medium, Auto-Small, and Manual.

the ease in the model creation and the prediction quality. Nonetheless, the automatically generated model can still be fine-tuned by the system experts to produce even better prediction results.

4.2.4. Investigation of Runtime Overhead (RQ4)

The prediction results are obtained by collecting the monitoring data from the system described in Section 4.1.2 and executing the offline analysis on a separate machine equipped with 3.10 GHz Intel Xeon E31220 running Ubuntu 12.04.5 LTS. Figure 12 illustrates the runtime overhead of Hora for the four different ADMs. It can be observed that the larger the model is, the more time it requires for the analysis and prediction. On average, the analysis for a 2-hour monitoring data is completed in less than six minutes. This demonstrates that Hora can be deployed and make timely predictions at runtime.

It needs to be emphasized that Hora's prediction process is triggered by every new data point. For the model with 98 components, this leads to 98 predictions every 2 minutes, i.e., 5880 predictions in two hours. In this work, we focus on investigating the prediction quality of Hora rather than its prediction efficiency. Particularly, Hora has not been optimized for performance. A possible future optimization could be to configure it to make predictions at regular time intervals, e.g., every 1 minute.

4.3. Discussion

Our prediction approach exploits the knowledge of the component dependencies and a set of predictors, which can predict individual component failures, to infer the failure propagation. Our results show that in the memory leak and system overload scenarios, Hora can predict the failures with high TPR. This demonstrates that the problems that develop internally can be detected early and the failure probability can be propagated to other parts of the system.

Although the results in Figure 9 and Table 4 show that Hora achieves higher TPR and higher AUC, the number of false positives is also high. This results in a low precision and high false-positive rate. In other words, the monolithic approach performs better in the low false-positive-rate range between 0 and 0.1. On the other hand, if a higher false-positive rate is acceptable, Hora will be able to correctly predict more failures than the monolithic approach.

4.4. Threats to Validity

We inject faults that trigger application failures in our experiments, which is a common practice in assessing dependability [44, 45], e.g., fault tolerance or failure prediction. It is possible that the failures that occur at runtime may be caused by other hidden problems rather than those that we inject. In our evaluation, the failures that occur in the memory leak scenario could also be caused by a system overload if the workload is too high. As a result, an attempt to predict failures caused by a memory leak would also predict failures of the system overload problem. Therefore, we carefully choose a workload that is low enough so that it does not cause system overload while we inject other types of faults.

In order to evaluate our approach, we need datasets that include architectural dependencies, detailed runtime measurements for the architectural entities, and information regarding the types and times of the failures. Publicly available datasets exist,3 but they are not appropriate for our evaluation because they lack the architectural information and runtime measurements. To systematically evaluate our approach, a controlled environment is needed, which includes a usage profile and the types of failures. We conduct a lab study with fault injection, which presents two main threats to external validity. First, we consider only one system. Therefore, we select an open-source application that is representative of state-of-the-art enterprise systems in terms of architectural style (microservice-based [33]) and technology (NetflixOSS ecosystem). Second, our experiment does not cover all possible types of faults. Since covering all possible fault types is practically infeasible, we select three representative fault types from real-world incidents based on Pertet et al. [43]. These threats could be reduced in future studies if the community made suitable data available and developed a benchmark for online failure prediction techniques. For this paper, we provide both the system setup and the data so that researchers can replicate our experiment and build on our research.

3https://www.usenix.org/cfdr

In the evaluation, we compare the prediction results of the Hora approach with those of the monolithic approach, which employs ARIMA as predictors and does not consider the architecture of the system. There are existing works for predicting component failures; however, they are applicable only to specific types of monitoring data or specific types of components. For instance, Fronza et al. [46], Liang et al. [47], and Salfner and Malek [10] employ machine learning techniques to analyze event logs and classify the system into healthy and failure states. These techniques are not directly applicable as the monitoring data obtained in our experiment are time series data. Therefore, we utilize ARIMA, which is a suitable and commonly used prediction technique for this type of data. The relation to other prediction techniques is discussed in detail in the following related work section.
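To make the role of the ARIMA-based component predictors concrete, the following minimal sketch fits an ARIMA model to a component's response-time series and derives a failure indicator from the forecast. It is written in Python with the statsmodels library purely for illustration and is independent of our actual implementation; the series values, the ARIMA order (1, 1, 1), and the 500 ms threshold are assumptions, not the configuration used in our experiments.

    # Minimal sketch (illustration only, not the Hora implementation): fit an ARIMA
    # model to a component's response-time series and flag an imminent failure if
    # the forecast exceeds a hypothetical failure threshold.
    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    # Hypothetical response-time measurements (ms), one data point every 2 minutes.
    response_times = np.array([120, 125, 131, 140, 152, 160, 175, 190, 210, 240], dtype=float)
    FAILURE_THRESHOLD_MS = 500.0  # assumed threshold, e.g., derived from an SLA

    # Fit an ARIMA(1, 1, 1) model; the order is an assumption for this sketch.
    fitted = ARIMA(response_times, order=(1, 1, 1)).fit()

    # Forecast the next 5 intervals, i.e., a 10-minute lead time at 2-minute sampling.
    forecast = fitted.forecast(steps=5)

    # Crude component-level failure indicator: fraction of forecast points above the threshold.
    failure_probability = float(np.mean(forecast > FAILURE_THRESHOLD_MS))
    print("forecast:", np.round(forecast, 1), "failure probability:", failure_probability)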

5. Related Work

In this paper, we propose a novel approach for hierarchical online failure prediction. This work is related to QoS and failure prediction and can be categorized along two dimensions: (1) online vs. offline, and (2) monolithic vs. hierarchical.

1. Online prediction approaches aim at providing information regarding the near-future state of the running system based on runtime observations [2]. In contrast, offline prediction approaches are not used to trigger runtime actions but focus on providing QoS measures to reason about system design and evolution decisions [48, 49].

2. Monolithic prediction approaches consider the system as a black box, of which the architectural information is not known. A prediction model can be created using different techniques, such as time series forecasting or machine learning. On the other hand, hierarchical prediction approaches consider the architecture of software systems, including the components and their inter-dependencies. Each component has its own specification that can be combined with the others' to form a model that represents the whole system. The relevant measures of the system can then be obtained by solving the combined model.

The remainder of this section describes related work in three categories based on the two dimensions discussed previously. Due to the lack of relevance to our approach, we do not discuss works on monolithic offline prediction.

Monolithic Online Prediction. Similar to our approach, Cavallo et al. [8] and Amin et al. [7, 50] use ARIMA and GARCH models, respectively, to forecast response times and time between failures of web services. In contrast, other approaches use statistical analysis with adaptive thresholds [51], complex event processing [11], rule-based classification [52], linear regression and decision trees [53], support vector machine [47, 46], nearest neighbor methods [47], Bayesian networks and submodels [54, 55, 56], and hidden semi-Markov models [10].

Bovenzi et al. [51] apply statistical analysis with adaptive threshold to predict threshold violation of performance metrics in complex software systems. Michlmayr et al. [57] present a framework to predict SLA violations in web services, which combines the advantages of both client- and server-side QoS monitoring.

For black-box failure prediction, Baldoni et al. [11] employ complex event processing and hidden Markov models to predict failures of a distributed system based on the sniffed network traffic. On the other hand, Williams et al. [12] monitor performance metrics including response time and predict failures using anomaly detection in combination with dispersion frame technique.

Regarding failures in large-scale systems, Sahoo et al. [52] predict critical events in computer clusters by analyzing event logs using rule-based classification algorithms. Zheng et al. [58] predict failures in Blue Gene/P supercomputers by analyzing the log messages and using genetic algorithms to create rules for the log patterns that precede the failures. Liang et al. [47] predict failures based on the event log from Blue Gene/L using different machine learning techniques, i.e., RIPPER, support vector machines, and the nearest neighbor method. Guan et al. [54] employ Bayesian submodels to detect anomalies in the performance metrics of cloud computing systems. Watanabe et al. [55] collect log files of a cloud data center and analyze the similarity between the records. The Bayesian method is then used to compute the probability of a failure following certain error sequences. In our previous work [9], we apply machine learning techniques to analyze the correlation of event logs of HPC systems and use the model to predict failure events. Fu et al. [56] employ Bayesian networks to predict failure events in HPC systems. The failure information is forwarded to a master node in order to adjust the job allocation.

For other types of systems, Salfner and Malek [10] propose an online failure prediction approach using hidden semi-Markov models. These models utilize both time and type of error messages to predict failures in a telecommunication system. Alonso et al. [53] predict web server crashes by monitoring the memory usage and applying linear regression and decision trees to predict the time to memory exhaustion. Fronza et al. [46] apply random indexing to log messages and use support vector machine to predict the patterns that can lead to software failures.

The Hora approach differs from these system-level monolithic approaches in that we explicitly take architectural information into account. Although we apply similar techniques to predict individual component failures, the architectural knowledge is employed to predict how a failure can propagate and affect other components.

Hierarchical Offline Prediction. The approaches in this category employ architecture-based system models annotated with specific quality evaluation models or scenarios [59, 60, 61], e.g., performance [62, 63], reliability [64], and safety [65] attributes. The model can be solved using analytical solutions or simulations to obtain the relevant properties of the whole system.

Cheung [66] proposes a seminal software reliability model that takes into account the reliability of individual components along with the probability of calling other components. A Markov model is employed to combine the reliability of the components and represent the reliability of the whole system. Cortellessa and Grassi [13] present an approach for reliability analysis of component-based software systems. Based on the system architecture, they consider the error propagation probability between components in addition to the reliability of individual components. Becker et al. [23] introduce the Palladio Component Model (PCM), which enables performance prediction of component-based software systems. Brosch et al. [67] extend the PCM by annotating the components with corresponding reliability attributes. The model is transformed into a discrete-time Markov chain and solved to obtain the reliability of the system.

The approaches in this category have shown that prediction capability can be significantly improved by combining traditional, monolithic approaches with architecture-based models. Our approach applies the same concept and incorporates the architecture of the system, specifically the dependencies between components, into online failure prediction. In combination with monolithic prediction approaches, our approach is able to predict both failures of individual components and the probabilities that those failures will propagate to other parts of the system.

Hierarchical Online Prediction. As one recent example for performance, Brosig et al. [24] employ an architecture-based performance model to predict system performance at runtime for capacity planning and online resource provisioning. The performance characteristics are captured in an architectural performance model which is then solved by transforming it to an analytical model or by simulation, similar to Becker et al. [23]. As opposed to this, Hora focuses on predicting failure occurrences using an extensible set of tailored prediction techniques.

Chalermarrewong et al. [68] predict system unavailability in data centers using a set of component predictors and fault tree analysis. The component predictors employ autoregressive moving average (ARMA) models to predict failures of hardware components. These component failures are leaf nodes in the fault tree, which is evaluated to conclude whether the current set of component failures will lead to system unavailability. Even though this work does not consider software, it shares the same basic idea as Hora by having a dedicated failure predictor for each component. However, the fault tree does not incorporate conditional probabilities, which are needed to represent complex architectural relationships in software. Hora, on the other hand, employs Bayesian networks, which can represent conditional dependencies and infer the probabilities of failures and their propagation.
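To illustrate the kind of inference such a propagation model performs, the following minimal sketch (in Python, independent of the Jayes library used in our implementation) marginalizes over a small conditional probability table to obtain the failure probability of a component that depends on two others. All component names and probability values are hypothetical.

    # Minimal sketch of Bayesian failure propagation (hypothetical values, not the
    # actual Hora model): the failure probability of a dependent component PT is
    # obtained by summing over the failure states of its parents BT and DB.
    from itertools import product

    # Marginal failure probabilities of the parent components, e.g., produced by
    # their individual component predictors.
    p_fail = {"DB": 0.7, "BT": 0.2}

    # Conditional probability table: P(PT fails | BT fails, DB fails).
    cpt_pt = {
        (False, False): 0.01,
        (False, True): 0.60,
        (True, False): 0.60,
        (True, True): 0.95,
    }

    # Enumerate all parent states and marginalize; for simplicity this sketch
    # assumes that BT and DB fail independently.
    p_pt_fails = 0.0
    for bt_fails, db_fails in product([False, True], repeat=2):
        p_bt = p_fail["BT"] if bt_fails else 1.0 - p_fail["BT"]
        p_db = p_fail["DB"] if db_fails else 1.0 - p_fail["DB"]
        p_pt_fails += p_bt * p_db * cpt_pt[(bt_fails, db_fails)]

    print("P(PT fails) =", round(p_pt_fails, 3))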

6. Conclusion

Failures in software systems usually develop inside the system and propagate to its boundary. Existing online failure prediction approaches do not explicitly consider the software system architecture and failure propagation paths. They view the system as a monolith and make predictions based only on the available measurements, such as response time.

In this paper, we introduce our hierarchical online failure prediction approach, Hora, which employs a combination of a failure propagation model and component failure prediction techniques. The failure propagation model uses Bayesian networks and is extracted from an architectural dependency model. The component failure predictors are updated with continuous measurements of the running system. Our evaluation shows that Hora provides a significantly higher prediction quality than the monolithic approach. In addition to the improved prediction quality, Hora allows a higher degree of modularity, as different failure prediction techniques can be applied and reused among similar types of system components. Furthermore, the automatic extraction allows the architectural dependency model to be generated from an existing architectural model. However, Hora can only be used if detailed monitoring data is continuously collected at runtime. Nonetheless, detailed application performance monitoring has become common practice in large-scale systems, providing the data needed by Hora.

In our future work, we plan to investigate and improve Hora's suitability for rapidly changing systems. For instance, emerging software engineering paradigms, such as DevOps [69], aim for faster release cycles and rely on dynamic cloud environments. The structures and behaviors of these systems change frequently, which requires continuous relearning of the failure models in order to retain the prediction quality [70]. We expect that Hora's modular failure prediction approach is suitable under these conditions, because each predictor can be retrained independently of the other predictors. However, architectural changes can alter the failure propagation paths; thus, the failure propagation model needs to be kept up-to-date. Moreover, we will optimize the predictors to minimize the number of false positives, and we plan to extend our evaluation setup into a benchmark for online failure prediction approaches.

Acknowledgment

This work has been partly funded by the German Federal Ministry of Education and Research (grant no. 01IS15004).

References

[1] A. Avizienis, J.-C. Laprie, B. Randell, C. Landwehr, Basic concepts and taxonomy of dependable and secure computing, IEEE Trans. on Dependable and Secure Computing 1 (1) (2004) 11-33.

[2] F. Salfner, M. Lenk, M. Malek, A survey of online failure prediction methods, ACM Computing Surveys 42 (3) (2010) 10:1-10:42.

[3] Y. Brun, J. Y. Bang, G. Edwards, N. Medvidovic, Self-adapting reliability in distributed software systems, IEEE Trans. on Software Engineering 41 (8) (2015) 764-780.

[4] R. Calinescu, L. Grunske, M. Z. Kwiatkowska, R. Mirandola, G. Tamburrelli, Dynamic QoS management and optimization in service-based systems, IEEE Trans. on Software Eng. 37 (3) (2011) 387-409.

[5] G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, A. Fox, Microreboot—a technique for cheap recovery, in: Proc. Symposium on Operating Systems Design & Implementation (OSDI '04), 2004, pp. 31-44.

[6] Y. Li, Z. Lan, Exploit failure prediction for adaptive fault-tolerance in cluster computing, in: Proc. 6th IEEE Int. Symposium on Cluster Computing and the Grid (CCGRID '06), 2006, 8 pp.

[7] A. Amin, A. Colman, L. Grunske, An approach to forecasting QoS attributes of web services based on ARIMA and GARCH models, in: Proc. 19th Int. Conf. on Web Services (ICWS '12), 2012, pp. 74-81.

[8] B. Cavallo, M. D. Penta, G. Canfora, An empirical comparison of methods to support QoS-aware service selection, in: Proc. 2nd Int. Workshop on Principles of Engineering Service-Oriented Systems (PESOS '10), ACM, 2010, pp. 64-70.

[9] T. Pitakrat, J. Grunert, O. Kabierschke, F. Keller, A. van Hoorn, A framework for system event classification and prediction by means of machine learning, in: Proc. 8th Int. Conf. on Performance Evaluation Methodologies and Tools (VALUETOOLS '14), 2014.

[10] F. Salfner, M. Malek, Using hidden semi-Markov models for effective online failure prediction, in: Proc. 26th Int. Symposium on Reliable Distributed Systems (SRDS '07), 2007, pp. 161-174.

[11] R. Baldoni, G. Lodi, L. Montanari, G. Mariotta, M. Rizzuto, Online black-box failure prediction for mission critical distributed systems, in: Proc. 31st Int. Conf. on Computer Safety, Reliability and Security (SAFECOMP '12), 2012, pp. 185-197.

[12] A. W. Williams, S. M. Pertet, P. Narasimhan, Tiresias: Black-box failure prediction in distributed systems, in: Proc. of Parallel and Distributed Processing Symposium (IPDPS '07), 2007, pp. 1-8.

[13] V. Cortellessa, V. Grassi, A modeling approach to analyze the impact of error propagation on reliability of component-based systems, in: Proc. 10th Int. Conf. on Comp.-Based Soft. Eng. (CBSE '07), 2007, pp. 140-156.

[14] M. Nygard, Release It!: Design and Deploy Production-Ready Software, Pragmatic Bookshelf, 2007.

[15] M. Hiller, A. Jhumka, N. Suri, An approach for analysing the propagation of data errors in software, in: Proc. 2001 Int. Conf. on Dependable Systems and Networks (DSN '01), 2001, pp. 161-172.

[16] A. Johansson, N. Suri, Error propagation profiling of operating systems, in: Proc. 2005 Int. Conf. on Dependable Systems and Networks (DSN '05), 2005, pp. 86-95.

[17] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

[18] T. Pitakrat, D. Okanovic, A. van Hoorn, L. Grunske, An architecture-aware approach to hierarchical online failure prediction, in: Proc. 12th Int. ACM SIGSOFT Conference on the Quality of Software Architectures (QoSA '16), IEEE, 2016, pp. 60-69.

[19] T. Pitakrat, D. Okanovic, A. van Hoorn, L. Grunske, Hora: Architecture-aware Online Failure Prediction (Apr. 2016). URL http://www.iste.uni-stuttgart.de/rss/people/pitakrat/jss-si-issre

[20] M. Fowler, Patterns of Enterprise Application Architecture, Addison-Wesley, 2002.

[21] D. Lorenzoli, L. Mariani, M. Pezze, Automatic generation of software behavioral models, in: Proc. 30th Int. Conf. on Software Engineering (ICSE '08), ACM, 2008, pp. 501-510.

[22] B. R. Schmerl, J. Aldrich, D. Garlan, R. Kazman, H. Yan, Discovering architectures from running systems, IEEE Trans. on Software Engineering 32 (7) (2006) 454-466.

[23] S. Becker, H. Koziolek, R. Reussner, The Palladio component model for model-driven performance prediction, Journal of Systems and Software 82 (1) (2009) 3-22.

[24] F. Brosig, N. Huber, S. Kounev, Architecture-level software performance abstractions for online performance prediction, Science of Computer Programming 90, Part B (2014) 71-92.

[25] A. van Hoorn, Model-driven online capacity management for component-based software systems, Ph.D. thesis, Faculty of Engineering, Kiel University (2014).

[26] R. H. Shumway, D. S. Stoffer, Time series analysis and its applications, Springer Science, 2013.

[27] D. C. Montgomery, G. C. Runger, N. F. Hubele, Engineering statistics, John Wiley & Sons, 2009.

[28] T. Pitakrat, A. van Hoorn, L. Grunske, A comparison of machine learning algorithms for proactive hard disk drive failure detection, in: Proc. 4th Int. Conf. on Architecting Critical Systems (ISARCS '13), ACM, 2013, pp. 1-10.

[29] T. Pitakrat, A. van Hoorn, L. Grunske, Increasing dependability of component-based software systems by online failure prediction, in: Proc. Euro. Dependable Computing Conf. (EDCC '14), 2014, pp. 66-69.

[30] M. Kutschke, Jayes: Bayesian network library. URL https://github.com/kutschkem/Jayes

[31] A. van Hoorn, J. Waller, W. Hasselbring, Kieker: A framework for application performance monitoring and dynamic software analysis, in: Proc. Int. Conf. on Performance Eng. (ICPE '12), 2012, pp. 247-248.

[32] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria (2015). URL http://www.R-project.org/

[33] S. Newman, Building Microservices, O'Reilly Media, Inc., 2015.

[34] E. Halili, Apache JMeter, Packt Publishing, 2008.

[35] M. Hibler, R. Ricci, L. Stoller, J. Duerig, S. Guruprasad, T. Stack, K. Webb, J. Lepreau, Large-scale virtualization in the Emulab network testbed., in: USENIX Annual Technical Conference, 2008, pp. 113-128.

[36] R. Walker, Examining load average, Linux Journal 2006 (152) (2006) 5-16.

[37] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (8) (2006) 861-874.

[38] A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition 30 (7) (1997) 1145-1159.

[39] J. Huang, C. Ling, Using AUC and accuracy in evaluating learning algorithms, IEEE Trans. on Knowledge and Data Engineering 17 (3) (2005) 299-310.

[40] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, A. Wesslén, Experimentation in software engineering, Springer Science & Business Media, 2012.

[41] E. R. DeLong, D. M. DeLong, D. L. Clarke-Pearson, Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach, Biometrics 44 (3) (1988) 837-845.

[42] X. Robin, N. Turck, A. Hainard, N. Tiberti, F. Lisacek, J.-C. Sanchez, M. Müller, pROC: An open-source package for R and S+ to analyze and compare ROC curves, BMC Bioinformatics 12 (1) (2011) 77.

[43] S. Pertet, P. Narasimhan, Causes of failure in web applications, Tech. rep., Parallel Data Laboratory, Carnegie Mellon University (2005).

[44] I. Irrera, M. Vieira, A practical approach for generating failure data for assessing and comparing failure prediction algorithms, in: Proc. 20th IEEE Pacific Rim Int. Symposium on Dependable Computing (PRDC '14), 2014, pp. 86-95.

[45] R. Natella, D. Cotroneo, H. S. Madeira, Assessing dependability with software fault injection: A survey, ACM Comput. Surv. 48 (3) (2016) 44:1-44:55.

[46] I. Fronza, A. Sillitti, G. Succi, M. Terho, J. Vlasenko, Failure prediction based on log files using random indexing and support vector machines, Journal of Systems and Software 86 (1) (2013) 2-11.

[47] Y. Liang, Y. Zhang, H. Xiong, R. K. Sahoo, Failure prediction in IBM BlueGene/L Event Logs, in: Proc. Int. Conf. on Data Mining (ICDM '07), 2007, pp. 583-588.

[48] V. Cortellessa, A. Di Marco, P. Inverardi, Model-based software performance analysis, Springer, 2011.

[49] J. D. Musa, Software reliability engineering, McGraw-Hill, 1998.

[50] A. Amin, L. Grunske, A. Colman, An automated Approach to Forecasting QoS Attributes based on linear and non-linear Time Series Modeling, in: Proc. 27th IEEE/ACM Int. Conf. on Automated Software Engineering (ASE '12), 2012, pp. 130-139.

[51] A. Bovenzi, F. Brancati, S. Russo, A. Bondavalli, A statistical anomaly-based algorithm for on-line fault detection in complex software critical systems, in: Proc. 30th Int. Conf. on Computer Safety, Reliability, and Security (SAFECOMP '11), 2011, pp. 128-142.

[52] R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, A. Sivasubramaniam, Critical event prediction for proactive management in large-scale computer clusters, in: Proc. 9th Int. Conf. on Knowledge Discovery and Data Mining (KDD '03), 2003, pp. 426-435.

[53] J. Alonso, J. Torres, R. Gavalda, Predicting web server crashes: A case study in comparing prediction algorithms, in: Proc. 5th Int. Conf. on Autonomic and Autonomous Systems (ICAS '09), 2009, pp. 264-269.

[54] Q. Guan, Z. Zhang, S. Fu, Ensemble of Bayesian predictors and decision trees for proactive failure management in cloud computing systems, Journal of Communications 7 (1) (2012) 52-61.

[55] Y. Watanabe, H. Otsuka, M. Sonoda, S. Kikuchi, Y. Matsumoto, Online failure prediction in cloud datacenters by real-time message pattern learning, in: Proc. 4th Int. Conf. on Cloud Computing Technology and Science (CloudCom '12), 2012, pp. 504-511.

[56] S. Fu, C.-Z. Xu, Exploring event correlation for failure prediction in coalitions of clusters, in: Proc. 2007 Conf. on Supercomputing (SC '07), 2007, pp. 41:1-41:12.

[57] A. Michlmayr, F. Rosenberg, P. Leitner, S. Dustdar, Comprehensive QoS monitoring of web services and event-based SLA violation detection, in: Proc. 4th Int. Workshop on Middleware for Service Oriented Computing (MWSOC '09), 2009, pp. 1-6.

[58] Z. Zheng, Z. Lan, R. Gupta, S. Coghlan, P. Beckman, A practical failure prediction with location and lead time for Blue Gene/P, in: Proc. 2010 Int. Conf. on Dependable Systems and Networks Workshops (DSN-W '10), 2010, pp. 15-22.

[59] M. A. Babar, I. Gorton, Comparison of scenario-based software architecture evaluation methods, in: Proc. 11th Asia-Pacific Software Engineering Conf. (APSEC '04), 2004, pp. 600-607.

[60] M. A. Babar, L. Zhu, D. R. Jeffery, A framework for classifying and comparing software architecture evaluation methods, in: Australian Soft. Eng. Conf. (ASWEC 2004), IEEE Computer Society, 2004, pp. 309-319.

[61] L. Grunske, Early quality prediction of component-based systems - A generic framework, Journal of Systems and Software 80 (5) (2007) 678-686.

[62] S. Balsamo, A. D. Marco, P. Inverardi, M. Simeoni, Model-based performance prediction in software development: A survey, IEEE Trans. on Software Engineering 30 (5) (2004) 295-310.

[63] H. Koziolek, Performance evaluation of component-based software systems: A survey, Performance Evaluation 67 (8) (2010) 634-658.

[64] K. Goseva-Popstojanova, K. S. Trivedi, Architecture-based approach to reliability assessment of software systems, Perf. Eval. 45 (2-3) (2001) 179-204.

[65] L. Grunske, J. Han, A comparative study into architecture-based safety evaluation methodologies using AADL's error annex and failure propagation models, in: Proc. 11th IEEE High Assurance Systems Engineering Symposium (HASE '08), 2008, pp. 283-292.

[66] R. C. Cheung, A user-oriented software reliability model, IEEE Trans. on Software Engineering 6 (2) (1980) 118-125.

[67] F. Brosch, H. Koziolek, B. Buhnova, R. Reussner, Architecture-based reliability prediction with the Palladio Component Model, IEEE Trans. on Software Engineering 38 (6) (2012) 1319-1339.

[68] T. Chalermarrewong, T. Achalakul, S. See, Failure prediction of data centers using time series and fault tree analysis, in: Proc. 18th Int. Conf. on Parallel and Distributed Systems (ICPADS '12), 2012, pp. 794-799.

[69] L. Bass, I. Weber, L. Zhu, DevOps: A Software Architect's Perspective, Addison-Wesley Professional, 2015.

[70] I. Irrera, J. Duraes, M. Vieira, On the need for training failure prediction algorithms in evolving software systems, in: Proc. 15th IEEE Int. Symposium on High-Assurance Systems Engineering (HASE '14), IEEE Computer Society, Washington, DC, USA, 2014, pp. 216-223.

[Figure: Distributed enterprise application system (clients, load balancer, presentation tier, business tier, database tier)]

[Figure: Failure propagation model for the system in Figure 1 with selected CPTs]

[Figure 7: Timeline plots of selected components for system overload scenario]

[Figure 8: Timeline plots of selected components for node crash scenario]

[Figure 9: Comparison of ROC curves for the different types of faults]

[Figure 11: Prediction quality of Hora with different ADM configurations]

Teerat Pitakrat is a PhD student at the University of Stuttgart, Germany. He received a BSc in Telecommunication Engineering from King Mongkut's Institute of Technology Ladkrabang in 2006 and an MSc in Communication Engineering from the University of Ulm in 2010. His research interests include online failure prediction and application performance monitoring and prediction.

Dusan Okanovic is a postdoctoral research assistant in the Reliable Software Systems group at the University of Stuttgart (Institute of Software Technology), Germany. He received his PhD degree from the University of Novi Sad, Serbia (2012), as well as MSc and BSc degrees (2006 and 2002). From 2002 he held a teaching assistant position at the Faculty of Technical Sciences at the University of Novi Sad, and in 2013 he became an assistant professor. He was also a guest lecturer at other universities in Serbia, teaching courses on various software engineering subjects. In 2015 he moved to the University of Stuttgart. His main research areas are software and performance engineering. He is interested in performance monitoring, management, and analysis, as well as enterprise and web application design and development.

André van Hoorn is the interim professor for Reliable Software Systems (RSS) at the University of Stuttgart (Institute of Software Technology), Germany. He received his PhD degree from Kiel University, Germany (2014), and his Master's degree (Dipl.-Inform.) from the University of Oldenburg, Germany (2007). André held a PhD scholarship with the Graduate School on Trustworthy Software Systems (TrustSoft) at the University of Oldenburg (2008-2011), was a member of the Software Engineering Group at Kiel University (2008-2012), and has been a postdoctoral researcher (Akademischer Rat) with the University of Stuttgart since 2013. His research interests are in the area of architecture-based software performance engineering and software reengineering. Particularly, he is interested in application performance monitoring, modeling, and management, as well as dynamic/hybrid analysis of legacy systems for architecture recovery and evolution.

Lars Grunske is currently a Professor at the Humboldt University Berlin, Germany. He received his PhD degree in computer science from the University of Potsdam (Hasso-Plattner-Institute for Software Systems Engineering) in 2004. He was a Professor at the University of Stuttgart, Junior Professor at the University of Kaiserslautern, Boeing Postdoctoral Research Fellow at the University of Queensland from 2004-2007, and a lecturer at the Swinburne University of Technology, Australia, from 2008-2011. He has active research interests in the areas of modelling and verification of systems and software. His main focus is on automated analysis, mainly probabilistic and timed model checking and model-based dependability evaluation of complex software-intensive systems.