CrossMark

Available online at www.sciencedirect.com

ScienceDirect

Procedía Computer Science 95 (2016) 450 - 456

Complex Adaptive Systems, Publication 6 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology

2016 - Los Angeles, CA

Sparse Causal Temporal Modeling to Inform Power System Defense

Michail Misyrlis*, Rajgopal Kannan, Charalampos Chelmis, Viktor K. Prasanna

Ming-Hsieh Dept. of Electrical Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, CA 90007, USA

Abstract

The accurate estimation of system state variables at buses in the power-grid is crucial for determining the operational state of the power system. Spoofing attacks on meters at buses can bypass bad data detectors in Supervisory Control and Data Acquisition (SCADA) systems and undetectably manipulate state estimates. Existing methods for protection of the state estimate of critical buses against spoofing attacks assume the existence of a given set of set of critical buses/meters to be protected. Given budget constraints, determining the critical set of meters/buses to be protected against spoofing attacks is a crucial problem. In this paper, we address the issue of how best to determine the set of meters to be protected. We suggest the use of two sparse temporal modeling methods from the machine learning literature to evaluate the influence of each meter measurement on the power grid network. Based on the influence distribution as indexed by these methods we can populate a set of meter measurements which serves a dual purpose. First, this set can serve as the initial collection of nodes that is a required input for methods developed to defend against false injection attacks on power system state estimation. Second, the high influential power of meters in the set incentivizes their protection since they are pivotal for making real-time prediction in the absence of complete real-time data from other meters. Thus, we introduce influence as one of the primary criteria to be considered in the process of selecting nodes to protect. We also suggest a novel way to measure influence based on sparse structure learning. We provide results on a publicly available simulated dataset and discuss how to use the notion of the N50 statistic to calibrate the number of meter measurements that should be protected under a specified budget.

© 2016PublishedbyElsevierB.V. Thisisanopenaccess article under the CC BY-NC-ND license (http://creativecommons.Org/licenses/by-nc-nd/4.0/).

Peer-review under responsibility of scientific committee of Missouri University of Science and Technology Keywords: smart grid; cyber security; causal temporal modeling; time series prediction

* Corresponding author E-mail address: misyrlis@usc.edu

1877-0509 © 2016 Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.Org/licenses/by-nc-nd/4.0/).

Peer-review under responsibility of scientific committee of Missouri University of Science and Technology doi:10.1016/j.procs.2016.09.316

1. Introduction

The multiple entry points in the communication system that will be introduced to power systems as they transition into future smart grids provide malicious attackers with opportunities to infiltrate and exploit the power system. In particular, the need for accurate estimation of complex voltage phase angles at buses in the power-grid provides adversaries opportunities for man-in-the-middle false-data-injection attacks. It has been shown that such spoofing attacks can bypass bad data detection algorithms in the current SCADA (Supervisory Control and Data

1 2 3 13 14

Acquisition) system and undetectably manipulate state estimates ' ' ' ' . This would enable the malicious attacker to inject forged values into the meter measurements, thus allowing the attacker to control electricity price or to cause cascading blackouts. To counter these attacks, multiple studies14'15 have suggested various ways of protecting meter measurements by physically or digitally monitoring them. Under limited budgets, only a subset of those meter measurements can be secured. Existing methods for protection of the state estimate of critical buses against data injection attacks assume the existence of a given set of set of critical buses/meters to be protected. Recent

3 16 17 18

studies ' ' ' have suggested multiple ways to determine the subset of buses based on the meter measurements' socio-economic importance or the degree of observability into the power system that they provide. One recent study suggests graphical methods3 to defend any subset of state variables by securing a minimum number of meter measurements starting with an initial set of nodes and gradually expanding the set of protected state variables until the entire set of state estimates is protected.

Prioritizing meter measurement protection has been based on factoring in the socio-economic impact of meter

3161718 8

measurements and the observability into the power system that they provide ' ' ' . Recent studies have surfaced another issue pertaining real time prediction, the partial data problem, in which only partial data from sensors is available in real-time. In this real world scenario, some sensors' data is not readily available due to technological limitations (latency, bandwith) or consumer's preferences. Real-time prediction is essential for decision making including curtailment of power consumption8. In order to mitigate the damaging effects of partial data on prediction, the proposed method leverages time series data from multiple other sensors by employing a precomputed Dependency Matrix which captures information about the influence of each time series feature on other features. Doing so allows the use of the most influential meters' complete data to predict time series data for any other sensor with comparable accuracy especially with an increasing temporal horizon.

Under constrained budget, the goal is to minimize the subset of state nodes that need to be protected so that the system is observable3. Furthermore, in the partial data scenario, basing predictions on fewer time series features incurs less computational cost and decreased transmission loads8. Both Lasso-Granger7 and Graphical Lasso10 are methods based on -norm regularization9, which inherently tends to only assign a non-zero coefficient to a single variable out of a cluster of correlated variables, as opposed to other penalty functions. For example,-norm regularization (ridge) returns a model in which none of the coefficients is shrunk to zero, whereas the elastic net19, which is a combination of and -norm regularization, produces a model in which the weight is more evenly distributed among each cluster of correlated features while shrinking many of them to zero. In our setting, where a minimal set of nodes is desirable, the parsimony inducing properties of the -norm penalty term are naturally more fitting.

Our contribution in this paper is threefold. First, we propose a methodology based on sparse causal temporal modeling techniques aimed to provide already existing defense mechanisms with their required initial input, a starting set of nodes to protect. Second, we introduce the influential power of each node as an attribute to be considered when deciding which nodes to protect, in a setting where each node is associated with a certain value calculated using various attributes. Third, to the best of our knowledge this is the first application of Graphical Lasso attempting to model the influence of each time series feature on other time series features.

The rest of this paper is organized as follows. In Section 2 we provide mathematical background for LassoGranger and Graphical Lasso. In Section 3 we present the simulated dataset that we used in our experiments as well as the experimental setting. In Section 4 we show our results and finally in Section 5 we conclude and discuss potential directions for furthering this line of research.

2. Methods

In this section we introduce the two Lasso8 based methods that we will be using to identify the set of nodes to be protected. Lasso refers to -¿\-norm regularization, a method used widely in regression settings, where shrinking a large fraction of coefficients to zero leads to higher predictive accuracy and generalizability.

We denote by X E RmXn a design matrix with m samples (time dimension) and nfeatures and we denote by y E Rm the vector of targets. A basis of statistical inference is the application of regularized risk, in which a loss function is evaluated over a sample of data and is linearly combined with a regularizer that penalizes some norm of the prediction function as in (1):

/3 = argmin f(fi,X,y) + ^||£||„ (1)

where the first term is the loss function which is typically the squared error loss, f X,y) =|| y — X¡5 ||2, and the second term is a penalty term.

2.1 Lasso -Granger

Lasso-Granger6 is a temporal causal modeling method which combines Granger causality and l norm regularization.

A time series xl "Granger causes"6,8 another time series feature xJ', if and only if predicting xJ' using both past values of xJ and xl leads to statistically significantly higher accuracy than predicting based solely on past values of xJ.

The notion of "Granger Causality" has been used with time series data to model dependencies between multiple time series7. It can be used to test each pair of time series features in order to determine which features to incorporate into the prediction of xJ, but that entails 0(n2) number of tests6, which renders this exhaustive method computationally expensive. By using Lasso regression on the full set of available features in order to predict x] one can narrow down the features on which xJ is conditionally dependent and exclude the rest - the ones with zero coefficients, which is essentially what Lasso-Granger does.

More specifically, consider a set of sensors S = {s1(..., sn) which collect real time data, where we need to make predictions for each sensor. A dependency matrix7 M is a n X n matrix where M[i, j] reflects the dependency of sensor Sj 's time series data xl on sensor Sj 's time series data x>. This matrix can be used as a look-up table in order to select the most useful features to use for predicting time series data x', by choosing the features that correspond to the non-zero elements on the t-th row.

Moreover, the dependency matrix can be used to rank sensors by their degree of influence on other features. The influence Ik of time series xk is defined as the summation of all values in the fc-th column of the dependency matrix M,

Ik = X [ j , k ] . (2)

The summation of time series data xls corresponding column values in matrix M represents its total influential power over the other features, and one can thus obtain a ranking of the time series features based on that.

2.2 Graphical Lasso

Graphical Lasso9 is a method which computes sparse undirected Gaussian graphical models through the use of Lasso8 (-¿\-norm) regularization. A Gaussian graphical model is a graph in which each node represents continuous and jointly Gaussian random variables. Assuming a multivariate normal distribution for n variables with covariance matrix X E RnXn, the Gaussian graphical model encodes the distribution with the inverse covariance matrix ft = E_1. Zero entries in the inverse covariance matrix reflect conditional independence. That is, ft[t, j] = 0 if and only if xl and xi are conditionally independent. Due to the fact that finding the sparsest inverse covariance matrix is an NP-hard problem10, -¿\-norm regularization methods have been developed. The problem is to maximize

log det D.-tr(m) n||j,

where tr denotes the trace of a matrix, \\ Q. \\j denotes the £1 -norm regularization, and X is the regularization parameter which calibrates the degree to which the regularization penalty is enforced. The Graphical Lasso9 method solves the maximization of (2) by fitting a modified lasso regression to each variable in a cyclical fashion.

Using the learnt structure encoded in matrix Q which is the output of Graphical Lasso as the dependency matrix M =\\ Q \\j defined in Sec. 3.1., we can again define the influence Ik of time series xk as in eq. (2). In this case, elements of Q. can be negative. Thus the dependency matrix is the absolute value of Q .

3. Dataset and Experimental Set-up

For our experiments we employ a publicly available dataset3,4 developed at Pacific Northwest National Laboratory (PNNL)4. The dataset comprises 16 distribution taxonomy feeder models which were simulated5 using GridLAB-D version 2.2. For this study, each of the load points in the 16 taxonomy feeders was populated with hourly averaged load data from a utility in the feeder's geographical region. Each taxonomy feeder file contains a load value per hour over the course of one year (8760 values) for each residential and commercial load for that feeder.

For each method we learn one model per week, which amounts to 168 hourly values, in order to construct network topologies that reveal information about how the nodes interact with each other from a dependence (causality) perspective over time. We use the AIC criterion11 for model selection in both cases. We thus obtain a learned structure per each week of the year.

In order to populate the initial set of nodes to be protected, we calculate the N50 statistic of the summation of the weights of edges adjacent to each node. This represents the minimum number of nodes such that their edge weights add up to 50% of the summation of all the edge weights in the graph. In the case of Graphical Lasso, we use the absolute value of the weights since they can be negative but still reflect the degree of influence.

Equipped with two versions of Dependency Matrices, one from each method, we then use regression tree models,8,20 to make predictions and to observe the predictive accuracy advantage provided by the use of the major influencers especially with a longer temporal horizon. The first predictive model is an Autoregressive Tree, which uses a time series own recent values to make future predictions. The other two models are also regression trees but only use time series data from the top major influencers yielded by each method (Lasso Granger, Graphical Lasso) as predictors using the N50 statistic concept as described before. For the three aforementioned experiments, we train each model using data from the first half of each week of the year with a time lag of 24 hours. We then use the learned model to test on the second half of each week. We vary the temporal horizon (look-ahead) from 1 hour to 24 hours. We use the MAPE8,20 (Mean Absolute Percentage Error) to assess the predictive accuracy of our models.

4. Results

We show the mean N50 statistic for each of the 16 feeders in Fig. 1 for both methods. The number of nodes to be protected using the N50 statistic varies more with Lasso-Granger than with Graphical Lasso.

We provide the N50 statistic over the course of one year for each method for one specific feeder (R1-1247-3) in Fig. 2. Using the Lasso-Granger learned weights to calculate the N50 statistic results in a larger amount of nodes to be protected. In Fig. 3 we plot the distribution of influence across nodes as indexed by the summation of their edge weights for the two methods for a randomly chosen feeder R1-1247-3 for a randomly chosen week. Lasso-Granger seems to be distributing weights more evenly than Graphical Lasso thus leading to a higher N50 statistic.

In Fig. 4 we plot the average MAPE for predicting each meter's time series data for feeder R1-1247-3 across a varying temporal horizon spanning 24h look-ahead values. There are three Regression Tree models, where ART denotes the Autoregressive Tree model using each time series own past data only, Lasso-Granger denotes the prediction results of the Regression Tree using only major influencers indexed by the Lasso-Granger Dependency Matrix and Graphical Lasso denotes the results using the Regression tree using only major influencers indexed by the Graphical Lasso Dependency Matrix. The results obtained by the latter two models are better in terms of

predictive accuracy despite not including each meter's own past data as predictors. The difference between ART and the other two models becomes more pronounced as the temporal horizon reaches 17h look ahead prediction. Furthermore, Graphical Lasso influencers' data seem to lead to lower MAPE than those of Lasso Granger.

Mean(std) N50 per Feeder

í¡'!

////>V///XW///

Mean N50 as a percentage of total number of Loads per Feeder

/I» /V» sfl .

<f- <f- <¡5- <j> <Sí <Si <f> ^

№ Lasso-Granger ♦ Graphical Lasso

♦ Lasso-Granger ♦Graphical Laso

Fig. 1. (a) Left: Mean and standard deviation of the N50 statistic for each feeder for the two methods. The value of the N50 calculated for the Graphical Lasso learned structure deviates less from the mean. (b) Right: The distribution of the N50 as a percentage of the total number of loads per feeder. In both (a), (b), the feeders are sorted by total number of loads in ascending order. (Best viewed in color).

N50 statistic of weight distribution over 52 weeks for Feeder R1-1247-3

• Lasso-Granger

♦ ♦ ♦ ♦ ♦♦ ♦ ♦♦ ♦ ♦♦ ♦♦ ♦ ♦♦♦ ♦ ♦ ♦ 4 Graphical Lasso

0 10 20 30 40 50

Fig. 2. The N50 statistic of the weight distribution for 52 weeks for Feeder R1-1247-3. Lasso-Granger returns a N50 that amounts to larger set of nodes than that of Graphical Lasso in this case. (Best viewed in color).

Influence Distribution for Feeder R1-1247-3 for one week

I Lasso-Granger I Graphical Lasso

commercial and residential loads

Fig. 3. Distribution of weights as a percentage of the total summation of weights for feeder R1-1247-3 for the 7 week, chosen randomly. Both sparsity inducing methods shrink the last 12 influence weights to zero. Lasso-Granger seems to distribute weights more evenly than Graphical Lasso (rl_2_1, cl_16B). (Best viewed in color).

Mean MAPE Prediction Error across increasing temporal horizon

uu 0.14

< s 0.138

c ra <u 0.136 ■

S 0.134 ■

0.132 0.13 • ♦ • ♦

• Lasso Granger

♦ Graphical Lasso

0 5 10 15 20 25 30

Temporal horizon (hours)

Fig. 4. Average MAPE for prediction of meter measurements time-series data for each meter in Feeder R1-1247-3. ART denotes the Autoregressive Tree using each meter's own past values while the other two models use only past values from the major influencers as indexed by the dependency matrices learned from each method. Even from the first hour, the latter two models lead to lower MAPE and the difference becomes more pronounced as the temporal horizon increases to 17 hours look-ahead. (Best viewed in color).

5. Discussion and Future Work

We demonstrated the applicability of Lasso-Granger and Graphical Lasso in a cyber security context for power systems and showed that they can be used in a dynamic manner. Indeed, both methods are among the fastest ones in each one's family of methods6,9 thus facilitating efficient real-time computations for large datasets.

In our experiments we used the N50 statistic to choose the top major influencers but one can calibrate the cutoff percentage of weights according to a certain budget. For example, instead of using 50% as the cutoff percentage, one can set it to a lesser value thus reducing the amount of protected meters, if the budget is tighter, while at the

0201020202000202010101000100

same time incurring an increase in the probability that a meter will be attacked thus sacrificing the degree to which the power system is secured.

Our results indicate that utilizing these Lasso based sparse machine learning methods merits further investigation in real world datasets. Furthermore, the two methods should be thoroughly compared in terms of optimality of the set of nodes and in terms of predictive accuracy based on the influence-based node ordering that each of them yields.

Acknowledgements

The work described in this paper was performed with support from grants NSF XScala: ACI-1339756 and NSF SDN: CCF-1320211. The authors are grateful to NSF for their support in performing this work.

References

1. Krumpholz G, Clements K, Davis P. Power system observability: a practical algorithm using network topology. IEEE Transactions on Power

Apparatus and Systems. 1980;4(PAS-99):1534-42.

2. Kosut O, Jia L, Thomas RJ, Tong L. Malicious data attacks on smart grid state estimation: Attack strategies and countermeasures. In GridCommunications (SmartGridComm), 2010 First IEEE International Conference on 2010 Oct 4 (pp. 220-225). IEEE.

3. Bi S, Zhang YJ. Graphical methods for defense against false-data injection attacks on power system state estimation. Smart Grid, IEEE

Transactions on. 2014 May;5(3):1216-27.

4. Hoke A, Butler R, Hambrick J, Kroposki B. Steady-state analysis of maximum photovoltaic penetration levels on typical distribution feeders.

Sustainable Energy, IEEE Transactions on. 2013 Apr;4(2):350-7.

5. Thompson S. Modern grid initiative: Distribution taxonomy final report. Pacific Northwest National Laboratory; 2008 Nov 15.

6. Schneider KP, Chassin D, Chen Y, Fuller JC. Distribution power flow for smart grid technologies. In Power Systems Conference and Exposition, 2009. PSCE'09. IEEE/PES 2009 Mar 15 (pp. 1-7). IEEE.

7. Arnold A, Liu Y, Abe N. Temporal causal modelling with graphical granger methods. In Proceedings of the 13th ACM SIGKDD international

conference on Knowledge discovery and data mining 2007 Aug 12 (pp. 66-75). ACM.

8. Aman S, Chelmis C, Prasanna VK. Influence-Driven Model for Time Series Prediction from Partial Observations. InAAAI 2015 Jan 25 (pp.

601-607).

9. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological). 1996 Jan

1:267-88.

10. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008 Jul 1;9(3):432-41.

11. Banerjee O, Ghaoui LE, d'Aspremont A, Natsoulis G. Convex optimization techniques for fitting sparse Gaussian graphical models. InProceedings of the 23rd international conference on Machine learning 2006 Jun 25 (pp. 89-96). ACM.

12. Akaike H. A new look at the statistical model identification. Automatic Control, IEEE Transactions on. 1974 Dec;19(6):716-23.

13. Deka D, Baldick R, Vishwanath S. Data attack on strategic buses in the power grid: Design and protection. InPES General Meeting Conference & Exposition, 2014 IEEE 2014 Jul 27 (pp. 1-5). IEEE.

14. Bobba RB, Rogers KM, Wang Q, Khurana H, Nahrstedt K, Overbye TJ. Detecting false data injection attacks on dc state estimation. InPreprints of the First Workshop on Secure Control Systems, CPSWEEK 2010 Apr (Vol. 2010).

15. Vukovic O, Sou KC, Dan G, Sandberg H. Network-aware mitigation of data integrity attacks on power system state estimation. Selected Areas in Communications, IEEE Journal on. 2012 Jul;30(6):1108-18.

16. Chakrabarti S, Kyriakides E. Optimal placement of phasor measurement units for power system observability. Power Systems, IEEE Transactions on. 2008 Aug;23(3):1433-40.

17. NERC, "An approach to action for the electricity sector". Available online: http://www.iwar.org.uk/cip/resources/nerc/cip-nerc.pdf, Jun 2001

18.NERC, "Security guidelines for the electricity sector". Available online: http://www.optellios.com/pdf/secguide_ps-s_1.0_BOTapprvd15oct2004.pdf, Oct 2004

19. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005 Apr 1;67(2):301-20.

20. Meek C, Chickering DM, Heckerman D. Autoregressive Tree Models for Time-Series Analysis. InSDM 2002 Jan 10 (pp. 229-244).

21. Aman S, Simmhan Y, Prasanna VK. Holistic measures for evaluating prediction models in smart grids. Knowledge and Data Engineering. IEEE Transactions on. 2015 Feb 1;27(2):475-88.