
Procedia Computer Science 5 (2011) 705-712

The 8th International Conference on Mobile Web Information Systems (MobiWIS)

Empirical Comparisons of Attack and Protection Algorithms for Online Social Networks

Mingzhen Mo^a, Irwin King^{a,b} and Kwong-Sak Leung^a

a Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong

b AT&T Labs Research, USA

Abstract

Online social networks such as Facebook are popular websites on which hundreds of millions of users make friends and interact. These sites hold a large amount of personal information, and their security concerns both users and researchers, because valuable private information can bring great profit to certain people or groups. In the real world, such profit motivates people and groups to obtain personal data unlawfully, and many attacks are launched on social networks. Facing these attacks, researchers have proposed a variety of protective strategies to reduce their negative effects. However, the practical performance of these protections against real attacks is unknown, and we likewise understand little about how strong attacks remain when they face protections. This paper therefore proposes an Attack-Protect-Attack (APA) comparison scheme to explore the performance and biases of various attack algorithms and protective strategies for online social networks. The resulting comparisons are valuable and meaningful for the further protection of private information. We apply several attack and protective approaches to a real-world dataset from Facebook and evaluate them by the accuracy of the attack algorithms. Following the comparison scheme, the experiments demonstrate that the performance of protective strategies is not satisfactory in a complex and practical setting.

Keywords:

privacy, online social network, attack-protect-attack comparison scheme, semi-supervised learning

1. Introduction

Currently, online social networks such as Facebook and Twitter are so popular that they have become one of the main ways for hundreds of millions of people to make friends and interact on the Internet. According to official statistics, Facebook is used by more than 500 million users, who spend more than 700 billion minutes on it per month [1]. Twitter is another example: it recently reported reaching approximately 50 million tweets per day, an average of 600 tweets per second [2]. Online social networks attract more and more people because they offer an effective way to interact with friends and share information through text and photos.

Because these sites hold a large amount of users' personal information, privacy has become a sensitive and vital topic. Most of the time, online social networks by default allow people to publish their entire profiles, while also letting them place privacy restrictions on their personal information. Taking Facebook as an example, users can hide some or all of their profile attributes, such as age or university, from strangers. However, adversaries can still extract sensitive information, or even identify users, with the help of social links. For instance, an adversary can discover someone's friendships by directly checking his or her friend list, which is available on Facebook, or by querying a user's lists of followers and followees on Twitter. [3, 4, 5] demonstrate that such information can leak a large quantity of sensitive data.

1877-0509 © 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of Prof. Elhadi Shakshuki and Prof. Muhammad Younas. doi:10.1016/j.procs.2011.07.092

For certain commercial or political purposes, some people and groups are eager to obtain valuable personal information, which motivates attacks on online social networks. In the current business world, users' personal information is valuable to some companies. Once private information is leaked, it will probably be used unlawfully for profit. For example, after obtaining private information, a company can advertise itself or other commercial groups through users' personal contact channels (e.g., cell phone, email), targeted according to users' backgrounds and preferences. Such attacks bring great fortune and are not easy to detect, which motivates people and groups to steal the valuable private information of hundreds of millions of online users.

On the other side, for the sake of information security, the protection of private information is gradually attracting more attention. Online users press the operators of online social networks to protect their private and sensitive information. Moreover, the operators gradually realize that successful attacks disturb users' normal lives and may bring lawsuits against them. Generally, online social networks provide a setting that allows people to place privacy restrictions on their profiles. For instance, users can hide some or all of their profile attributes, such as age or university, from strangers or even friends. However, this general setting is not stringent enough. Therefore, more and more protective strategies, e.g., access control [6, 7] and anonymization techniques [8], are being proposed, and protection for online social networks has become a hot research topic.

In applications, one basic and important concern is how protective strategies perform when they face attacks. It is not convincing to claim that a protective strategy is effective merely because it reduces the probability of re-identification. The real world is complex, and it is difficult to evaluate the practical effect of a protection in a real case by reviewing this probability alone. It is far more convincing if a protective strategy still performs well when facing actual attacks. In other words, if the accuracy of attacks drops greatly after the protective operations, we can confidently claim that the protection is effective. Thus, to evaluate the security level of online social networks, there is a great need to pair attack and protection algorithms in a comparison sequence.

In this paper, we explore a range of attack approaches and protective strategies in order to better understand the performance and biases of diverse attacks and protections on online social networks. We crawl data from Facebook and preprocess it manually. We then launch five attacks, including machine learning methods and a graph theory method, and five protective algorithms, including clustering-based and modification-based approaches. To evaluate the practical effect of the protections, we design the Attack-Protect-Attack (APA) comparison scheme: the attack approaches are applied to the original data and to the protected data, and the results are compared to evaluate their performance. Conversely, the protective approaches must rise to the challenges posed by the attacks. Through this APA scheme, we can intuitively determine the practical effect of various protective approaches for online social networks when they face real attacks.

Our contributions include:

• In order to better protect private information on social networks, we are the first to explore the performance and biases of diverse attacks and protections in an attack-protect-attack scheme. On one side, in the APA comparison scheme, we learn the protective performance by comparing the accuracy of attacks before and after the protections. On the other side, we try to understand attack bias by comparing the different effects that various protective strategies have on the protected data. Through these comparisons, we can better understand both attacks and protections; in addition, the results can guide researchers toward stronger protective strategies for private or sensitive information on online social networks.

• We observe that current anonymization strategies for privacy protection cannot achieve satisfactory results in a complex and practical case. The performance of the protections is unsatisfactory in a real-world scenario, in which more profile and relationship information is accessible.

This paper is organized as follows. We briefly describe various attack algorithms and protective approaches in Section 2. We then give details on the dataset and the attack-protect-attack scheme used in the experiments in Section 3. Finally, we review related work in Section 4 and conclude the paper in Section 5.

2. Algorithm

Similar to [9], we define an online social network as an undirected graph G(V, E). In G(V, E), every vertex (user) i has a feature vector p_i, and every edge (relationship) between vertices i and j has a weight w_ij, which may reflect friendship, group membership and network relationship, where 0 ≤ p_i, w_ij ≤ 1. In the whole graph, there are l labeled vertices, and the labels of the remaining u vertices need to be predicted. For attacks, the objective is to make the predicted labels of the u vertices agree with the true labels. Conversely, protections aim to reduce the prediction accuracy of attacks.
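The graph model above can be sketched in code. The following is a hypothetical minimal representation (class and field names are ours, not from the paper): each vertex carries a feature vector p_i, each edge a weight w_ij, and only l vertices carry labels.

```python
import numpy as np

class SocialGraph:
    """Minimal sketch of G(V, E): per-vertex features p_i, per-edge weights w_ij,
    and a labeled subset of l vertices; the remaining u vertices are unlabeled."""
    def __init__(self, features, edges, labels):
        self.features = features    # dict: vertex -> np.ndarray p_i, entries in [0, 1]
        self.edges = edges          # dict: (i, j) -> weight w_ij in [0, 1]
        self.labels = labels        # dict: vertex -> class label (l labeled vertices)

    def unlabeled(self):
        # Vertices whose labels an attack tries to predict (the u vertices).
        return [v for v in self.features if v not in self.labels]

# toy instance: three vertices, vertex 2 unlabeled
g = SocialGraph(
    features={0: np.array([0.5, 1.0]), 1: np.array([0.4, 0.0]), 2: np.array([0.6, 1.0])},
    edges={(0, 1): 1.0, (1, 2): 0.5},
    labels={0: "CUHK", 1: "HKUST"},
)
print(g.unlabeled())  # [2]
```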

In the APA comparison scheme, attacks and protections are executed alternately. We first apply the attack methods to the online social network and record their accuracy. We then use the protective approaches to hide or modify the vital information in the original network and generate a new, safer network to be published. Finally, we run all the attacks again on the new network to find out how large the impact is and how effective the protections are in practice. We thus evaluate the protective approaches indirectly, through their impact on attacks, rather than by criteria that consider protection only, e.g., the percentage of hidden information or the loss of information. At the same time, the attacks are evaluated by the criterion of accuracy. The comparison scheme is shown in Algorithm 1.

Algorithm 1 Attack-Protect-Attack Comparison Scheme
Input: Social graph G(V, E) and label information.
1: for every attack approach Ai do
2:   for every protective approach Pj do
3:     Attack on G and obtain the prediction accuracy R0(Ai).
4:     Apply the protection approach Pj on G and obtain a protected social graph Gp.
5:     Apply the same attack approach Ai on Gp and obtain the new prediction accuracy R1(Ai, Pj).
6:   end for
7: end for
8: Compare R0 & R1 and evaluate the performances of attacks and protections.
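Algorithm 1 can be sketched as a short executable loop. This is an illustration, not the paper's implementation: `attacks` maps names to hypothetical functions from a graph to predicted labels, `protections` maps names to functions from a graph to a protected graph, and `accuracy` is the evaluation criterion.

```python
def apa_compare(G, true_labels, attacks, protections, accuracy):
    """Run every (attack, protection) pair and return {(Ai, Pj): (R0, R1)}."""
    results = {}
    for a_name, attack in attacks.items():
        R0 = accuracy(attack(G), true_labels)          # step 3: attack the original G
        for p_name, protect in protections.items():
            Gp = protect(G)                            # step 4: publish a protected graph
            R1 = accuracy(attack(Gp), true_labels)     # step 5: re-attack on Gp
            results[(a_name, p_name)] = (R0, R1)       # step 8: compare R0 and R1
    return results

# toy illustration: the "graph" is a dict vertex -> visible label
acc = lambda pred, true: sum(pred[v] == true[v] for v in true) / len(true)
true = {0: "A", 1: "B"}
attacks = {"copy": lambda G: dict(G)}                  # predicts the visible label
protections = {"hide": lambda G: {v: "?" for v in G}}  # hides every label
results = apa_compare(true, true, attacks, protections, acc)
print(results)  # {('copy', 'hide'): (1.0, 0.0)}
```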

We compare various attack approaches and protective strategies to better understand their performance and effects. The relationship between these approaches is shown in Fig. 1. We mainly focus on two aspects:

• Effect of Attacks. We apply five attack approaches to online social networks and evaluate their performance: the Supervised Learning (SL) approach [5], the Local and Global Consistency (LGC) Semi-Supervised Learning (SSL) approach [9], the Co-Training SSL approach [9], the Community-based Graph (CG) SSL approach [10] and the graph-theoretic Maximum Flow & Minimum Cut (MFMC) approach [11].

• Effect of Protections. Building on the first point, we evaluate the attacks' performance before and after each protection, in order to further analyze the protective strategies. We compare the effects of five protective strategies: the Vertex Clustering (VC) method [12], the Edge Clustering (EC) method [13], the Vertex and Edge Clustering (VEC) method [14], the Randomize Graph Modification (RGM) method [15] and the Greedy Graph Modification (GGM) method [16].
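To make the modification-based family concrete, the snippet below is a hedged sketch of the random add/delete idea behind Randomize Graph Modification [15]: delete k random edges, then insert random non-existing edges until the original edge count is restored. This is an illustration of the general idea, not the exact published algorithm.

```python
import random

def randomize_graph(edges, n, k, seed=0):
    """Delete k random edges of an n-vertex graph, then add random new edges
    so the published graph keeps the same number of edges |E|."""
    rng = random.Random(seed)
    edges = {tuple(sorted(e)) for e in edges}
    m = len(edges)
    for e in rng.sample(sorted(edges), k):   # delete k random existing edges
        edges.remove(e)
    while len(edges) < m:                    # insert random non-edges until |E| = m
        i, j = rng.randrange(n), rng.randrange(n)
        if i != j:
            edges.add(tuple(sorted((i, j))))
    return edges

out = randomize_graph({(0, 1), (1, 2), (2, 3)}, n=10, k=1)
```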

Figure 1: The Relationship of Various Attack and Protective Approaches. (a) Attack Approaches; (b) Protective Approaches.

3. Experiments

In the experiments, we apply both the attack and the protection approaches within the attack-protect-attack scheme. In this scheme, attacks try to classify an unknown feature on the Facebook dataset and are evaluated by accuracy. The results show that the protective approaches do reduce the strength of attacks and protect the information in most cases, but the effect of protection diminishes in the more complex case.

Table 1: Statistics of Facebook Dataset

Dataset    Vertices  Edges   Groups  Networks  Classes
Facebook   10,410    45,842  61      78        3

Dataset. To evaluate the effectiveness of the various attack and protection algorithms, we crawl a real-world dataset from the online social network Facebook. The Facebook dataset is a complex network with a large amount of personal attribute information and community information, which helps considerably in attacks. Details of the dataset are given in Section 3.1.

Objective. In this dataset, attacks try to expose some unknown attribute of users, while protections hide or modify some personal or relational information. Specifically, the attack algorithms try to predict the university each user attends. The higher the accuracy an attack achieves, the stronger the attacking method; conversely, worse attack predictions indicate better protection.

Method. Five attack and five protective algorithms are applied within the APA comparison scheme. We first apply the five attack algorithms to the original, unprotected dataset. The five protective approaches are then applied separately to the original data. Finally, the five attacks are launched again on the result of every protective approach.

3.1. Data Description & Preprocessing

The real-world dataset is crawled from the Internet and preprocessed manually. It contains more than ten thousand users (vertices). The Facebook dataset is complex and practical, with abundant link and profile information, which may increase the difficulty of protection and affect its performance.

The Facebook dataset has rich profile information for its users, along with several kinds of relational information, so it is close to the situation in the real world. Three university names are used as class (label) names. Table 2 gives the number of users in each class.

Table 2: Statistics of Data Distribution on Facebook Dataset

University     CUHK  HKUST  (Others)
Size of Class  68    1,583  8,759

Feature Selection. In the Facebook dataset there are 26 features per user, but not all of them are needed. Some features, such as nickname, provide little information for classification. Besides, most people fill in only a few of these features; for instance, very few people provide their work phone or current location. Thus, based on statistics over the 26 features, we select the three features that most people provide. After excluding nickname, we finally choose gender, birthday and home town as the basic profile information of each user (vertex) for classification.

The relational information also needs filtering. The original Facebook data contains 371 groups, most of which consist of only a small number of people. A number of small groups are therefore removed, leaving 61 groups. Networks are processed similarly. In addition, networks whose names explicitly reveal university names, such as "CUHK" and "HKUST", are removed manually.

Data Translation. Since some real-world data are not directly computable, we translate them into suitable forms. For example, home town is just a string, and comparing two strings is a poor way to calculate the similarity of two users' home towns. We therefore translate each home town into its longitude and latitude via the Google Maps API.

Although we select the three features that most users fill in, the number of missing values is still very large, and noisy values, such as a birthday of "1/1/0001", are widespread in the dataset. For age, missing values are filled with the mean of the existing values, and noisy values are treated as missing. For gender, 0.5 represents a missing value (1 represents male and 0 female). For home town, missing values are filled with the mean longitude and latitude of the user's friends. A user's basic information can thus be expressed as a vector containing age, gender, and the home town's longitude and latitude.
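The imputation rules above can be sketched as follows (a hypothetical illustration; the column layout `[age, gender, lon, lat]` and function names are ours):

```python
import numpy as np

def impute_profiles(profiles, friends):
    """profiles: dict user -> [age, gender, lon, lat], None marks a missing value.
    Imputation: age -> mean of observed ages; gender -> 0.5;
    lon/lat -> mean over the user's friends."""
    ages = [p[0] for p in profiles.values() if p[0] is not None]
    mean_age = float(np.mean(ages)) if ages else 0.0
    for u, p in profiles.items():
        if p[0] is None:                       # missing (or noisy-then-missing) age
            p[0] = mean_age
        if p[1] is None:                       # missing gender -> 0.5
            p[1] = 0.5
        for k in (2, 3):                       # missing lon/lat -> friends' mean
            if p[k] is None:
                vals = [profiles[f][k] for f in friends.get(u, [])
                        if profiles[f][k] is not None]
                p[k] = float(np.mean(vals)) if vals else 0.0
    return profiles

profiles = {1: [20.0, 1.0, 114.2, 22.4], 2: [None, None, None, None]}
out = impute_profiles(profiles, friends={2: [1]})
```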

The value of every attribute in a user's profile is scaled to [0, 1], and the cosine similarity between any two profile vectors is calculated. If both users fail to provide at least 50% of the information, we set their cosine similarity to the mean value.
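A sketch of this profile similarity, with the 50% fallback rule, is given below. We use NaN to mark missing attributes; `mean_sim` is a stand-in for the dataset-wide mean similarity used in the paper, which the text does not specify.

```python
import numpy as np

def profile_similarity(p, q, mean_sim=0.5):
    """Cosine similarity of two [0,1]-scaled profile vectors (np.nan = missing).
    If both users fill in less than 50% of the attributes, return mean_sim."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    filled = lambda v: np.count_nonzero(~np.isnan(v)) / v.size
    if filled(p) < 0.5 and filled(q) < 0.5:
        return mean_sim                       # fallback: too little information
    p, q = np.nan_to_num(p), np.nan_to_num(q) # treat missing entries as 0
    denom = np.linalg.norm(p) * np.linalg.norm(q)
    return float(p @ q / denom) if denom else 0.0
```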

Another kind of similarity is obtained from relational information, i.e., friendship. Two users' friendship similarity is computed as 1 divided by the number of hops on the shortest path between them. For example, if two users are friends (linked directly), the hop count between them is 1 and the similarity is also 1; if two users are not directly linked but both link to a third user, the shortest hop count between them is 2 and we set their friendship similarity to 1/2.
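The 1/(shortest hops) friendship similarity can be computed with a breadth-first search; the sketch below assumes an adjacency-list graph and returns 0 for disconnected pairs (a choice of ours, since the text does not state the disconnected case):

```python
from collections import deque

def friendship_similarity(adj, u, v):
    """1 / (hops on the shortest path between u and v), via BFS;
    0.0 when u and v are not connected at all."""
    if u == v:
        return 1.0
    seen, frontier, hops = {u}, deque([u]), 0
    while frontier:
        hops += 1
        for _ in range(len(frontier)):        # expand one BFS level per hop
            for w in adj.get(frontier.popleft(), ()):
                if w == v:
                    return 1.0 / hops
                if w not in seen:
                    seen.add(w)
                    frontier.append(w)
    return 0.0

adj = {1: [2], 2: [1, 3], 3: [2]}             # path graph: 1 -- 2 -- 3
```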

3.2. Experiment Process

Labeled Data Selection. Labeled data are selected randomly under the two constraints below:

• Each class must have labeled data;

• The numbers of labeled data in all classes are similar.

The second constraint reflects the assumption that we do not know the class distribution when labeling data.
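The two constraints above amount to balanced random sampling across classes; a sketch (function and parameter names are ours):

```python
import random

def select_labeled(labels_by_class, total, seed=0):
    """Randomly pick `total` labeled vertices so that every class is
    represented and class counts are as equal as possible."""
    rng = random.Random(seed)
    classes = list(labels_by_class)
    per_class, extra = divmod(total, len(classes))
    chosen = []
    for i, c in enumerate(classes):
        k = per_class + (1 if i < extra else 0)   # spread any remainder
        chosen += rng.sample(labels_by_class[c], min(k, len(labels_by_class[c])))
    return chosen

pools = {"A": list(range(10)), "B": list(range(10, 20)), "C": list(range(20, 30))}
chosen = select_labeled(pools, total=6)
```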

Evaluation Criterion. We mainly use accuracy to measure the results of the predictions (attacks). The classic accuracy is obtained by

Accuracy = |V_correct| / (|V_correct| + |V_incorrect|) = |V_correct| / (l + u),   (1)

where V_correct is the set of vertices whose predictions are correct and V_incorrect is the set of vertices whose predictions are incorrect.
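Eq. (1) is straightforward to compute; a small sketch over label dictionaries (names are ours):

```python
def attack_accuracy(predicted, true_labels):
    """Eq. (1): |V_correct| / (l + u) -- the fraction of all vertices whose
    predicted label matches the true label."""
    correct = sum(1 for v, y in true_labels.items() if predicted.get(v) == y)
    return correct / len(true_labels)
```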

3.3. Attack Before Protection

We first examine the situation without protections. In this scenario, we launch the five attack algorithms on the original dataset after preprocessing. Table 3 and Figure 2(a) show the learning accuracy of all attacks.

Table 3: Attack Accuracy of Five Algorithms on Facebook Dataset

Labeled Data #  Labeled Data %  SL      LGC     Co-Training  CG      MFMC
25              0.24%           36.74%  60.70%  55.26%       63.09%  56.79%
250             2.40%           47.65%  66.86%  62.16%       74.94%  61.27%
1000            9.61%           51.56%  68.05%  64.71%       75.43%  63.04%
2500            24.02%          51.92%  69.21%  68.84%       79.81%  66.98%

Figure 2: The Experimental Results on the Facebook Dataset. (a) Attack Accuracy of Five Algorithms; (b) Accuracy of the Supervised Learning Method After Various Protections; (c) Accuracy of the LGC Learning Method After Various Protections; (d) Accuracy of the Co-Training Learning Method After Various Protections; (e) Accuracy of the CG Learning Method After Various Protections; (f) Accuracy of the MFMC Method After Various Protections.

3.4. Attack After Protection

Based on the original dataset, the five protective strategies are applied to hide or modify useful profile or relationship information. On each protected result, the five attack algorithms are launched again and evaluated with the same criterion. Table 4 shows the accuracy of all attacks after the protections.

3.5. Comparisons of Attack and Protection Algorithms

3.5.1. Attack

We try to understand the attack bias by comparing the different effects on the protected data with various protective strategies.

Figure 2(b) shows the learning accuracy of supervised learning before and after the protections. Apart from the Vertex Clustering method, the protective strategies clearly reduce the learning accuracy to a low level and protect the information in the network. More specifically, the Vertex & Edge Clustering, Randomize Graph Modification and Greedy Graph Modification strategies can effectively reduce accuracy to 30% or lower, and the protection is satisfactory. Conversely, this shows that supervised learning is a weak attack method that can be resisted by some protective strategies.

Table 4: Attack Accuracy of Five Algorithms after Protections on Facebook Dataset

Protected by     Labeled Data #  Labeled Data %  SL      LGC     Co-Training  CG      MFMC
Vertex           25              0.24%           28.51%  53.62%  50.28%       60.53%  49.76%
Clustering       250             2.40%           41.32%  61.75%  52.71%       71.67%  52.31%
                 1000            9.61%           45.14%  64.31%  57.83%       71.34%  58.81%
                 2500            24.02%          47.93%  65.59%  63.32%       77.25%  62.12%
Edge             25              0.24%           28.22%  50.32%  50.00%       58.82%  45.73%
Clustering       250             2.40%           29.32%  58.09%  54.24%       70.13%  46.69%
                 1000            9.61%           29.90%  61.22%  56.41%       71.63%  48.68%
                 2500            24.02%          31.19%  64.47%  62.98%       76.52%  55.11%
Vertex & Edge    25              0.24%           26.24%  50.84%  43.16%       55.44%  37.31%
Clustering       250             2.40%           25.71%  57.52%  50.57%       67.70%  45.97%
                 1000            9.61%           27.82%  59.37%  52.19%       69.20%  48.41%
                 2500            24.02%          29.27%  60.18%  58.39%       76.48%  53.29%
Randomize Graph  25              0.24%           26.25%  51.93%  48.74%       59.76%  48.88%
Modification     250             2.40%           27.55%  60.69%  51.64%       66.38%  50.01%
                 1000            9.61%           27.99%  63.27%  53.04%       71.98%  53.66%
                 2500            24.02%          28.38%  65.40%  60.56%       75.07%  60.48%
Greedy Graph     25              0.24%           22.12%  45.41%  43.33%       55.12%  40.32%
Modification     250             2.40%           24.86%  53.26%  48.18%       65.74%  44.94%
                 1000            9.61%           25.23%  59.07%  52.69%       69.07%  49.72%
                 2500            24.02%          27.59%  62.38%  57.82%       73.93%  52.67%

Figure 2(c) shows the learning accuracy of local and global consistency learning before and after the protections. Although the accuracy falls to different levels under the various protective strategies, the results keep improving as the number of labeled data increases. Moreover, the gap between the new and original accuracies becomes smaller and smaller as the percentage of labeled data grows. This suggests that the LGC learning method can effectively exploit both known and unknown information and that this kind of attack is stronger than supervised learning. Finally, LGC learning is more sensitive to the Greedy Graph Modification strategy than to the others, since the GGM strategy reduces LGC's accuracy the most in most test cases.

Figure 2(d) shows the learning accuracy of co-training before and after the protections. In general, all five protective strategies can prevent leakage of hidden information, and the performance of Co-Training is not strong enough. As the number of labeled data increases, its accuracy stays 5% to 10% below the original accuracy. Figure 2(d) also shows that Co-Training is not sensitive to any specific protective strategy, because it is a wrapper containing different classifiers, and each inner classifier is sensitive to different protective strategies.

Figure 2(e) shows the learning accuracy of community-based graph learning before and after the protections. The figure indicates that Community-based Graph learning is a strong attack model, because no protective strategy can strongly reduce its accuracy. Comparatively, Greedy Graph Modification is the most effective method for protecting information against the CG learning model.

Figure 2(f) shows the learning accuracy of the maximum flow & minimum cut method before and after the protections. The MFMC attack model is sensitive to the protective strategies, especially the two Graph Modification strategies. MFMC is a good attack model, and its prediction accuracy can reach approximately 70%; however, when facing the various protections it seems powerless, with accuracy sometimes falling below 40%.

3.5.2. Protection

In the APA comparison scheme, we learn the protective performance by comparing the accuracy of attacks before and after the protections. From Table 4 we can easily obtain the reduction (percentage) of attack accuracy for all prediction models. The reduction is computed as

Reduction = (1 − Accuracy_protection / Accuracy_original) × 100%,   (2)

where Accuracy_original is the learning accuracy before protection and Accuracy_protection is the accuracy after protection. The results are shown in Table 5.
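Eq. (2) in code, checked against one entry of the tables:

```python
def reduction(acc_original, acc_protection):
    """Eq. (2): percentage reduction of attack accuracy caused by a protection."""
    return (1 - acc_protection / acc_original) * 100

# e.g., SL with 25 labels: 36.74% before protection (Table 3) vs 28.51% after
# Vertex Clustering (Table 4) gives the 22.40% reduction reported in Table 5.
print(round(reduction(36.74, 28.51), 2))  # 22.4
```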

From this table, we make the following observations.

Table 5: Reduction of Attack Accuracy of Five Algorithms after Protections on Facebook Dataset

Protected by     Labeled Data #  Labeled Data %  SL      LGC     Co-Training  CG      MFMC
Vertex           25              0.24%           22.40%  11.66%  9.01%        4.06%   12.38%
Clustering       250             2.40%           13.28%  7.64%   15.20%       4.36%   14.62%
                 1000            9.61%           12.45%  5.50%   10.63%       5.42%   6.71%
                 2500            24.02%          7.68%   5.23%   8.02%        3.21%   7.26%
Edge             25              0.24%           23.19%  17.10%  9.52%        6.77%   19.48%
Clustering       250             2.40%           38.47%  13.12%  12.74%       6.42%   23.80%
                 1000            9.61%           42.01%  10.04%  12.83%       5.04%   22.78%
                 2500            24.02%          39.93%  6.85%   8.51%        4.12%   17.72%
Vertex & Edge    25              0.24%           28.58%  16.24%  21.90%       12.13%  34.30%
Clustering       250             2.40%           46.04%  13.97%  18.65%       9.66%   24.97%
                 1000            9.61%           46.04%  12.76%  19.35%       8.26%   23.21%
                 2500            24.02%          43.62%  13.05%  15.18%       4.17%   20.44%
Randomize Graph  25              0.24%           28.55%  14.45%  11.80%       5.28%   13.97%
Modification     250             2.40%           42.18%  9.23%   16.92%       11.42%  18.38%
                 1000            9.61%           45.71%  7.02%   18.03%       4.57%   14.88%
                 2500            24.02%          45.34%  5.50%   12.03%       5.94%   9.70%
Greedy Graph     25              0.24%           39.79%  25.19%  21.59%       12.63%  29.00%
Modification     250             2.40%           47.83%  20.34%  22.49%       12.26%  26.65%
                 1000            9.61%           51.07%  13.20%  18.58%       8.43%   21.13%
                 2500            24.02%          46.86%  9.87%   16.01%       7.37%   21.36%

• Vertex Clustering is not a strong protective strategy; it is most effective against the supervised learning and co-training attacks.

• Edge Clustering is stronger: most reductions of attack accuracy exceed 10%, and some reach 40%. It is particularly effective against the SL and MFMC attacks, because these attacks depend heavily on edge information, and even small transformations of the edges visibly affect their prediction accuracy.

• Vertex & Edge Clustering is stronger than the VC, EC and RGM protective approaches. It is most effective against the SL and MFMC attacks, as it is an advanced version of Edge Clustering.

• The Randomize Graph Modification strategy is effective against the supervised learning attack, but not against the others.

• The Greedy Graph Modification strategy may be the strongest of the five protection strategies. The accuracy in most test cases is reduced by at least 10%, half of the test cases reach or exceed a 20% reduction, and one reduction even reaches about 50%, which is satisfactory.

4. Related Work

Protection. When researchers discovered that it is possible to re-identify personal information by combining different public social data sources, privacy became an important research issue. The surge of research on privacy preservation in data publishing has achieved a great deal on tabular data, which contain only individual attributes (sensitive or non-sensitive). The non-sensitive attributes, also called "quasi-identifiers", can be used to re-identify an individual, while sensitive attributes should not be recoverable even when a group of possible rows is recognized. Motivated by individual re-identification through linking US voter data with medical records, Sweeney proposed the k-anonymity privacy-preserving data publishing technique for microdata [17]. k-anonymity requires that every tuple in the published microdata table be indistinguishable from at least k candidates [17][18]. However, all k candidates might share the same sensitive attribute value; as a result, an attacker could obtain the desired contents without re-identifying individuals. An extension of k-anonymity is l-diversity, which demands that these k candidates have l different sensitive attribute values [19]. Various other privacy protection concepts and their implementation algorithms have also been developed, such as t-closeness [20], m-invariance [21], k-similarity [22] and δ-presence [23].

Attack. Since online social networks began to thrive, there has been growing interest in the security of users' privacy under current privacy protections. Among previous work, attacks that use machine learning with public profile and relational information have attracted much attention and are of great significance to the security of online social networks. The attacks employing machine learning include supervised, unsupervised and semi-supervised learning attacks. For supervised learning, [3] shows that supervised classification can rely not only on an object's attributes but also on the attributes of the users it is linked to, because link-based classification breaks the assumption that the data comprise i.i.d. instances and lets the classes of linked objects correlate with each other. Besides, He et al. [4] predict private attributes using a Bayesian network over friendship links. A more comprehensive review of collective classification can be found in Sen's work [24]. Recently, Zheleva and Getoor [5] proposed a novel attack using group-based supervised classification, exploiting group membership information in addition to friend links.

5. Conclusion

We apply five attack algorithms and five protective approaches within the attack-protect-attack scheme to understand the performance and biases of these attacks and protections on online social networks. They are evaluated by the criterion of attack accuracy on a real-world dataset from Facebook. The experiments illustrate that the performance of the protective strategies is not satisfactory in complex and practical cases. In the future, we will apply the attack-protect-attack scheme to more datasets and evaluate the attack and protective strategies from different aspects, which will help us better understand their practical performance in different scenarios.

Acknowledgments

The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No.: CUHK 413210).

References

[1] http://www.facebook.com/press/info.php?statistics.

[2] http://blog.twitter.com/2010/02/measuring-tweets.html.

[3] L. Getoor, B. Taskar, Introduction to statistical relational learning, The MIT Press, 2007.

[4] J. He, W. Chu, Z. Liu, Inferring privacy information from social networks, Lecture Notes in Computer Science 3975 (2006) 154.

[5] E. Zheleva, L. Getoor, To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles, in: Proc. of WWW, ACM, NY, USA, 2009, pp. 531-540.

[6] L. Fang, K. LeFevre, Privacy wizards for social networking sites, in: Proc. of WWW, ACM, 2010, pp. 351-360.

[7] P. Fong, M. Anwar, Z. Zhao, A privacy preservation model for Facebook-style social network systems, Computer Security-ESORICS 2009 (2010) 303-320.

[8] B. Zhou, J. Pei, W. Luk, A brief survey on anonymization techniques for privacy preserving publishing of social network data, ACM SIGKDD Explorations Newsletter 10 (2) (2008) 12-22.

[9] M. Mo, D. Wang, B. Li, D. Hong, I. King, Exploit of Online Social Networks with Semi-Supervised Learning, in: Proc. of IJCNN, IEEE, 2010, pp. 1893-1900.

[10] M. Mo, I. King, Exploit of online social networks with community-based graph semi-supervised learning, in: Proc. of ICONIP, 2010, pp. 669-678.

[11] G. Flake, S. Lawrence, C. Giles, Efficient identification of web communities, in: Proc. of ACM SIGKDD, ACM, 2000, pp. 150-160.

[12] M. Hay, G. Miklau, D. Jensen, D. Towsley, P. Weis, Resisting structural re-identification in anonymized social networks, Proc. VLDB Endow. 1 (1) (2008) 102-114.

[13] E. Zheleva, L. Getoor, Preserving the privacy of sensitive relationships in graph data, in: First ACM SIGKDD Workshop on Privacy, Security, and Trust in KDD (PinKDD 2007), Springer-Verlag, 2007.

[14] A. Campan, T. Truta, A clustering approach for data and structural anonymity in social networks, in: Proceedings of the 2nd ACM SIGKDD International Workshop on Privacy, Security, and Trust in KDD (PinKDD 2008), in Conjunction with KDD08, Citeseer, 2008.

[15] M. Hay, G. Miklau, D. Jensen, P. Weis, S. Srivastava, Anonymizing social networks, University of Massachusetts Technical Report 07-19, 2007.

[16] B. Zhou, J. Pei, Preserving privacy in social networks against neighborhood attacks, in: Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on, 2008, pp. 506-515.

[17] L. Sweeney, K-anonymity: a model for protecting privacy, Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10 (5) (2002) 557-570.

[18] P. Samarati, Protecting respondents' identities in microdata release, IEEE Transactions on Knowledge and Data Engineering 13 (2001) 1010-1027.

[19] A. Machanavajjhala, J. Gehrke, D. Kifer, M. Venkitasubramaniam, L-diversity: privacy beyond k-anonymity, in: 22nd IEEE International Conference on Data Engineering, 2006.

[20] N. Li, T. Li, t-closeness: privacy beyond k-anonymity and l-diversity, in: Proc. of IEEE International Conference on Data Engineering, 2007.

[21] X. Xiao, Y. Tao, M-invariance: towards privacy preserving re-publication of dynamic datasets, in: Proc. of SIGMOD, ACM, New York, NY, USA, 2007, pp. 689-700.

[22] R. C.-W. Wong, A. W.-C. Fu, K. Wang, J. Pei, Minimality attack in privacy preserving data publishing, in: Proc. of VLDB, VLDB Endowment, 2007, pp. 543-554.

[23] M. E. Nergiz, M. Atzori, C. Clifton, Hiding the presence of individuals from shared databases, in: Proc. of SIGMOD, ACM, New York, NY, USA, 2007, pp. 665-676.

[24] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, T. Eliassi-Rad, Collective classification in network data, AI Magazine 29 (3) (2008) 93-106.