Procedia Computer Science 31 (2014) 747-753

2nd International Conference on Information Technology and Quantitative Management, ITQM 2014

On the Frequency Distribution of Retweets

Yao Lu^a, Peng Zhang^a, Yanan Cao^a,*, Yue Hu^a, and Li Guo^a

^a Institute of Information Engineering, Chinese Academy of Sciences, 91 Minzhuang Road, Haidian District, Beijing 100193, China

Abstract

Social media platforms allow rapid information diffusion and serve as a source of information for many users. In Twitter in particular, the information carried by tweets diffuses among users through retweets, so studying the characteristics of retweets is of great significance for marketing and outbreak detection. In this paper, by analyzing retweet data, we put forward the hypothesis that the frequency distribution of retweets asymptotically follows a power law distribution. We then propose a model of the retweeting mechanism that takes into account preferential attachment and the transmissibility of tweets. After quantifying the model parameters, we obtain a power law distribution from simulation. The simulation results are consistent with the results of the data analysis, so the model can explain retweeting behavior.

© 2014 Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).

Selection and peer-review under responsibility of the Organizing Committee of ITQM 2014.

Keywords: Twitter; Retweet; Power Law Distribution; Preferential Attachment; Transmissibility

1. Introduction

Twitter is an online social platform that makes it very easy to share, disseminate and access information. It allows people to communicate and share content with each other, and it plays a fundamental role in the spread of information, ideas, and influence. In recent years, studies of information diffusion in Twitter have attracted more and more attention. Some studies indicate that attention, rather than information itself, has become the scarce resource [1]. Twitter in particular attracts users' attention through retweeting. For this reason, we analyse the frequency distribution of retweets and propose a model that simulates users' retweeting behaviour, so that the impact of the various parameters on information diffusion can be studied.

To this end, we obtain tweet data from Twitter and analyse the number of retweets; we find that the frequency distribution of retweets follows a power law distribution. We then introduce preferential attachment and the transmissibility of tweets, and on this basis propose a method to model retweeting. After quantifying the parameters, the model yields a power law distribution, and the simulation results are consistent with the results of the data analysis.

* Yanan Cao. E-mail address: caoyanan@iie.ac.cn

ISSN 1877-0509. doi: 10.1016/j.procs.2014.05.323

This paper is organized as follows: Section 2 introduces related work on retweeting. Section 3 introduces our datasets and the results of the statistical analysis. Section 4 proposes the retweeting model. Section 5 presents experimental results on parameter analysis and model validation. We conclude in Section 6.

2. Relevant work

Twitter plays an important role in many fields, such as marketing and emergency detection. By analysing microblogging posts, Bollen et al. found that events in the social, political, cultural and economic spheres have a significant, immediate and highly specific effect on the public mood state [2]. Sakaki et al. investigated the real-time interaction of events such as earthquakes in Twitter and proposed an algorithm to monitor tweets and detect a target event [3].

Many current studies focus on information diffusion and retweeting in Twitter. Kwak et al. studied the topological characteristics of Twitter and its power as a new medium of information sharing [4]. Boyd et al. examined the practice of retweeting as a way by which participants can be "in a conversation" and highlighted how authorship, attribution, and communicative fidelity are negotiated in diverse ways [5]. Suh et al. found that, amongst content features, URLs and hashtags have strong relationships with retweetability. Amongst contextual features, the number of followers and followees as well as the age of the account seem to affect retweetability, while the number of past tweets does not predict the retweetability of a user's tweet [6].

3. Data collection and analysis

In this section we analyse the properties of the information diffusion by studying the frequency distribution of retweeting using a dataset we have collected. This analysis indicates the frequency distribution obeys a power law distribution, and it is thus the starting point for the model we propose in Sect. 4.

3.1. Data collection

For our analysis we collected 50000 tweets posted from April 30 to May 30, 2013, involving a total of 42180 participants. In addition, we collected tweets whose content contains specific keywords, such as education, earthquake and concert, with 10000 tweets for each keyword. For each tweet, we retain its ID and its number of retweets.

3.2. Data analysis

The frequency distribution can help us understand users' retweeting behaviour. To analyse it, we extracted the number of retweets of each of the 50000 tweets and plotted the distribution in Fig. 1(a), where the line shows the distribution of the number of retweets and the y-axis represents the complementary cumulative distribution function (CCDF). About half of the tweets were forwarded fewer than 10 times, 20% were forwarded more than 100 times, and only about 10% were forwarded more than 1000 times. This suggests that user retweeting behaviour exhibits preferential attachment: users tend to forward tweets that have already been retweeted many times.
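As an illustration of how a CCDF curve such as the one in Fig. 1(a) can be computed from per-tweet retweet counts, the following minimal Python sketch may help. The function name `ccdf` and the toy sample are assumptions made for this example; they are not the paper's code or data.

```python
from bisect import bisect_right


def ccdf(retweet_counts, thresholds):
    """Return the empirical P(X > x) for each threshold x."""
    ordered = sorted(retweet_counts)
    total = len(ordered)
    # bisect_right counts the values <= x, so total minus it counts values > x.
    return [(total - bisect_right(ordered, x)) / total for x in thresholds]


# Hypothetical toy sample of retweet counts (NOT the paper's dataset).
sample = [0, 1, 2, 3, 5, 8, 40, 150, 900, 2500]
print(ccdf(sample, [10, 100, 1000]))  # -> [0.4, 0.3, 0.1]
```

Reading the output: 40% of the toy tweets were retweeted more than 10 times, 30% more than 100 times, and 10% more than 1000 times.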

Fig. 1(b) shows the frequency distribution of retweets in log-log coordinates. The x-axis represents the number of retweets and the y-axis the frequency. The distribution fits a power law with an exponent of 0.62.

In addition, we analyse the frequency distribution of tweets containing specific keywords such as education (Fig. 1(c)) and earthquake (Fig. 1(d)). We find that they also obey a power law distribution, with an exponent between 0.6 and 0.7.
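The text does not state how the exponents were fitted. One simple possibility, sketched here under that assumption, is an ordinary least-squares line through the log-log frequency plot, with the negated slope taken as the exponent estimate. The function name `loglog_exponent` and the synthetic sample are hypothetical.

```python
import math
from collections import Counter


def loglog_exponent(retweet_counts):
    """Estimate a power-law exponent as the negated least-squares slope
    of log10(frequency) against log10(number of retweets)."""
    # Zero counts cannot be shown on a log axis, so they are skipped.
    freq = Counter(c for c in retweet_counts if c > 0)
    if len(freq) < 2:
        raise ValueError("need at least two distinct positive retweet counts")
    xs = [math.log10(k) for k in freq]
    ys = [math.log10(v) for v in freq.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -(sxy / sxx)


# Synthetic sample whose frequency is exactly 64 / k, i.e. exponent 1.
synthetic = [1] * 64 + [2] * 32 + [4] * 16 + [8] * 8
print(round(loglog_exponent(synthetic), 6))
```

Least-squares fits on log-log plots are sensitive to noise in the sparse tail of the distribution; maximum-likelihood estimators are a commonly used alternative when a more robust estimate is needed.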

We now know that the frequency distribution of retweets obeys a power law distribution and that only a few tweets are retweeted many times. But why do some tweets attract users to repost them? To answer this question, we introduce preferential attachment and the transmissibility of tweets.

4. Build retweeting model

4.1. Modeling rules

When a Twitter user accesses his tweet feed, several factors usually influence which messages he selects to retweet. In this paper, we assume that two main factors determine users' retweeting behaviour: (1) preferential attachment, which means that a user is more likely to retweet a tweet that has already been retweeted many times; and (2) a property of the tweet itself that affects its probability of being retweeted, which we call the transmissibility of the tweet.

Based on the above assumptions, we present the retweeting rules as follows:

• The number of users and the number of tweets remain unchanged. Each user forwards some tweets, and each tweet can be retweeted by any user.

• Users tend to retweet tweets that have already been retweeted many times, and the transmissibility also affects the probability that a tweet is retweeted. Here we suppose that the transmissibility follows a Gaussian distribution and that the probability of being forwarded is proportional to the product of preferential attachment and transmissibility.

4.2. Simulation algorithms

Here we introduce our simulation algorithms based on the above rules:

Let the initial number of tweets be m and the initial number of users be n. Each user traverses all the tweets and decides which ones to forward; each user forwards t tweets in total. If a user receives multiple copies of the same message, he will probably forward the first copy received and ignore the others. For each tweet, the probability of being forwarded is proportional to the product of preferential attachment and transmissibility. If a tweet $T_i$ has been retweeted $r_i$ times, then its preferential attachment value is:

$$H_i = \beta \, \frac{r_i}{\sum_{j=1}^{m} r_j}$$

Here $\sum_{j=1}^{m} r_j$ is the total number of retweets over all tweets, and $\beta$ is a constant. The transmissibility value is:

$$K_i = \alpha \cdot \rho_i$$

where $\alpha$ is a constant and $\rho_i$ is a random variable obeying a Gaussian distribution that describes the transmissibility. The probability that a message is selected for forwarding is therefore:

$$\Pi_i = K_i \cdot H_i$$
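To make the product rule concrete, the following Python sketch computes, for each tweet, the preferential attachment value, the transmissibility value and their product, and then normalises the products so that they sum to one. It is only an illustration: the function names, the default constants of 1.0 and the explicit normalisation step are assumptions of this example, and the case where no tweet has been retweeted yet (a zero denominator) is simply rejected here because the text does not say how it is handled.

```python
def selection_weights(retweets, transmissibility, alpha=1.0, beta=1.0):
    """Return K_i * H_i for every tweet i.

    retweets[i] is the current retweet count r_i and transmissibility[i]
    is the Gaussian draw for tweet i. alpha and beta are the two model
    constants; their default values here are assumptions.
    """
    total = sum(retweets)
    if total == 0:
        raise ValueError("no retweets yet: preferential attachment is undefined")
    weights = []
    for r_i, rho_i in zip(retweets, transmissibility):
        h_i = beta * r_i / total   # preferential attachment value
        k_i = alpha * rho_i        # transmissibility value
        weights.append(k_i * h_i)
    return weights


def normalise(weights):
    """Scale non-negative weights so that they sum to 1 (assumed step)."""
    total = sum(weights)
    return [w / total for w in weights]


# Two tweets with equal transmissibility: the one with 3 retweets
# is three times as likely to be selected as the one with 1 retweet.
weights = selection_weights([1, 3], [0.5, 0.5])
print(weights)             # -> [0.125, 0.375]
print(normalise(weights))  # -> [0.25, 0.75]
```

After normalisation the two constants cancel, so in a simulation only the relative sizes of the retweet counts and the transmissibility draws matter.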

We then establish a directed network that contains two kinds of nodes: user nodes and tweet nodes. An edge from user node $U_i$ to tweet node $T_j$ indicates that user $U_i$ has retweeted tweet $T_j$. There are no edges between two user nodes or between two tweet nodes.

Fig. 2(a): The directed network of users and tweets

Fig. 2(b): The matrix generated from the directed network
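A minimal sketch of the user-to-tweet network and the matrix derived from it is given below, assuming that users and tweets are identified by zero-based integer indices. The function names are hypothetical. The column sums of the matrix give the in-degree of each tweet node, that is, its number of retweets.

```python
def adjacency_matrix(edges, num_users, num_tweets):
    """Build a num_users x num_tweets 0/1 matrix from (user, tweet) edges."""
    matrix = [[0] * num_tweets for _ in range(num_users)]
    for user, tweet in edges:
        matrix[user][tweet] = 1  # user has retweeted tweet
    return matrix


def tweet_degrees(matrix):
    """Column sums: how many users retweeted each tweet."""
    return [sum(column) for column in zip(*matrix)]


# Toy network: user 0 retweets tweets 0 and 2; users 1 and 2 retweet tweet 2.
edges = [(0, 0), (0, 2), (1, 2), (2, 2)]
matrix = adjacency_matrix(edges, num_users=3, num_tweets=3)
print(matrix)                 # -> [[1, 0, 1], [0, 0, 1], [0, 0, 1]]
print(tweet_degrees(matrix))  # -> [1, 0, 3]
```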

5. Parametric analysis and model validation

5.1. Simulation

We established a network with two types of nodes, user nodes and tweet nodes. The numerical simulation results suggest that the degree distribution of the tweet nodes (that is, the frequency distribution of the number of retweets) obeys a power law distribution with an exponent of 0.628. In this simulation, the model parameter values are m = 30000, n = 450000 and t = 3.

Fig. 3: Degree distribution of the directed network (x-axis: # of retweets; y-axis: frequency)
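The simulation procedure can be sketched at small scale as follows. This is not the authors' implementation, and several details are assumptions made so that the sketch runs: each tweet's weight uses its retweet count plus one so that tweets with no retweets can still be chosen, the Gaussian transmissibility has mean 1.0 and standard deviation 0.25 and is clipped to stay positive, and each user forwards t distinct tweets. The two model constants are omitted because they cancel when the weights are normalised.

```python
import random


def simulate(m, n, t, seed=0, mu=1.0, sigma=0.25):
    """Return the retweet count of each of m tweets after n users
    have each forwarded t distinct tweets (requires t <= m)."""
    rng = random.Random(seed)
    # Assumed Gaussian transmissibility, clipped so weights stay positive.
    rho = [max(rng.gauss(mu, sigma), 1e-6) for _ in range(m)]
    retweets = [0] * m
    for _ in range(n):
        # Weight = transmissibility * (retweets so far + 1); the +1 is an
        # assumption that avoids an all-zero start.
        weights = [rho[i] * (retweets[i] + 1) for i in range(m)]
        chosen = set()
        while len(chosen) < t:
            chosen.add(rng.choices(range(m), weights=weights)[0])
        for i in chosen:
            retweets[i] += 1
    return retweets


# Small hypothetical run; the parameter values reported in the paper
# are far larger and would be slow with this unoptimised sketch.
counts = simulate(m=200, n=300, t=3, seed=1)
print(sum(counts), max(counts))
```

The returned list plays the role of the tweet-node degree sequence, so its frequency distribution can be inspected in the same way as the empirical retweet counts.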

5.2. Parametric analysis

We run 10 independent simulations for each test in order to determine the impact of each parameter on the power exponent.

In each test, we vary one parameter, keep the others unchanged, and study the resulting change in the exponent.

The results of the parameter analysis are as follows:

• The number of tweets: m is the number of tweets. Keeping the other parameters fixed, we vary m and observe the change in the power exponent. In Fig. 4, the x-axis represents m and the y-axis represents the power exponent. The results show that the number of tweets m is positively correlated with the power exponent. Other things being equal, increasing the number of tweets means that a smaller number of tweets receive a large number of retweets while most tweets receive only a tiny number.

Fig. 4: Relation between # of tweets and exponent

Fig. 5: Relation between # of users and exponent

• The number of users: n is the number of users. Keeping the other parameters fixed, we vary n and observe the change in the power exponent. In Fig. 5, the x-axis represents n and the y-axis represents the power exponent. As shown in Fig. 5, there is a negative correlation between the power exponent and the number of users when n is between 100000 and 400000, but when n is greater than 400000 there is no significant correlation between the number of users and the power exponent.

• The average number of retweets: t is the average number of retweets per user. As shown in Fig. 6, as t grows the power exponent fluctuates irregularly around 0.6. There is no significant correlation between t and the power exponent.

From the above analysis, it can be seen that the number of tweets is positively correlated with the power exponent, while the number of users and the average number of retweets per user have no significant correlation with the power exponent in our model. This indicates that the number of tweets in the simulation should be set equal to that of the dataset being modelled in order to generate a similar frequency distribution of retweets.

Fig. 6: Relation between # of retweets and exponent

Fig. 7: Frequency distribution of simulation data
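The protocol used in this section (vary one parameter, hold the others fixed, and repeat 10 independent runs per setting) can be sketched as a small helper. Everything here is an assumption for illustration: `sweep` and `toy_run` are hypothetical names, and `toy_run` is only a stand-in that returns made-up numbers so that the sketch executes. In practice it would be replaced by a function that runs the simulation and returns the fitted exponent.

```python
from statistics import mean, stdev


def sweep(run_once, base_params, name, values, runs=10):
    """For each value of one parameter, run `runs` seeded simulations
    and return {value: (mean exponent, sample standard deviation)}."""
    summary = {}
    for value in values:
        params = dict(base_params)
        params[name] = value  # change only the parameter under study
        exponents = [run_once(seed=seed, **params) for seed in range(runs)]
        summary[value] = (mean(exponents), stdev(exponents))
    return summary


def toy_run(m, n, t, seed):
    # Hypothetical stand-in, NOT the retweeting model: it returns an
    # arbitrary number so that the helper above can be exercised.
    return m / 1000 + seed


result = sweep(toy_run, {"m": 1000, "n": 500, "t": 3}, "m", [1000, 2000])
for value, (avg, spread) in result.items():
    print(value, avg, round(spread, 3))
```

Reporting the spread alongside the mean makes it easier to judge whether an apparent trend in the exponent is larger than the run-to-run fluctuation.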

5.3. Model validation

In our simulation we implement the model described in the previous section in order to simulate the retweeting behavior of users. We use the dataset described in Sect. 3 to infer the parameter values, and then assign these values to the model parameters to examine the validity of our model.

The dataset contains 50000 tweets, which were retweeted a total of 2578426 times by 42180 participating users, so the average number of retweets per user is 61. We therefore set m = 50000, n = 42180 and t = 61.
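The value of t follows directly from the figures quoted above, assuming (as the text implies) that the average is taken over the participating users:

```latex
t = \frac{\text{total number of retweets}}{\text{number of participating users}}
  = \frac{2578426}{42180} \approx 61.1 \approx 61
```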

The simulation results show that the frequency distribution follows a power law distribution with an exponent of 0.620, which is close to the real exponent (0.624). The range of the exponent is also the same. These results indicate the validity of the model.

6. Conclusions

In this paper, we analyse the properties of information diffusion in Twitter, in particular users' retweeting behaviour. Using a Twitter dataset, we study the frequency distribution of retweets and conclude that this distribution is described by a power-law function with an exponent from 0.6 to 0.7.

Based on these observations, we propose an information propagation model that generates cascades whose properties match the empirical observations. Preferential attachment and the transmissibility of tweets jointly influence the probability that a tweet will be forwarded. We introduce these two factors into our model and establish a directed network with user nodes and tweet nodes, in which the degree distribution of the tweet nodes corresponds to the frequency distribution of retweets.

Through simulations, we show that our model is able to reproduce information cascades statistically similar to those measured in the dataset. These results demonstrate that our model can be used to study how preferential attachment and the transmissibility of tweets influence the forwarding mechanism.

Acknowledgements

This work was supported by the NSFC (No. 61370025), 863 projects (No. 2011AA010703) and the Strategic Leading Science and Technology Projects of Chinese Academy of Sciences (No. XDA06030200).

References

1. R. Lanham. The Economics of Attention. University of Chicago Press, 2006.

2. J. Bollen, H. Mao, and A. Pepe. Determining the public mood state by analysis of microblogging posts. In Proceedings of the Alife XII Conf. MIT Press, 2010.

3. T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors. In WWW10, 2010.

4. H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a Social Network or a News Media? In WWW10, 2010.

5. D. Boyd, S. Golder, and G. Lotan. Tweet, tweet, retweet: Conversational aspects of retweeting on Twitter. In 43rd Hawaii International Conf. on System Sciences, 2010.

6. B. Suh, L. Hong, P. Pirolli, and H. Chi. Want to be Retweeted? Large Scale Analytics on Factors Impacting Retweet in Twitter Network. In IEEE Second International Conference on Social Computing (SocialCom), pages 177-184. IEEE, 2010.

7. T. R. Zaman, R. Herbrich, J. V. Gael, and D. Stern. Predicting information spreading in Twitter. In Workshop on Computational Social Science and the Wisdom of Crowds, NIPS 2010, 2010.

8. S. Petrovic, M. Osborne, and V. Lavrenko. RT to Win! Predicting Message Propagation in Twitter. In AAAI2011, 2011.

9. W. J. Reed. The Pareto, Zipf and other power laws. Economics Letters, 74(1):15-19, 2001.

10. R. Bakhshandeh, M. Samadi, Z. Azimifar, and J. Schaeffer. Degrees of Separation in Social Networks. In Fourth Annual Symposium on Combinatorial Search, 2011.

11. C. Heath and D. Heath. Made to Stick: Why Some Ideas Survive and Others Die. Random House, 1st edition, 2007.

12. A. Hernando, D. Villuendas, C. Vesperinas, M. Abad, and A. Plastino. Unravelling the size distribution of social groups with information theory on complex networks. http://arxiv.org/abs/0905.3704v3

13. V. Arnaboldi, M. Conti, A. Passarella, and F. Pezzoni. Ego networks in Twitter: an experimental analysis. In The Fifth IEEE International Workshop on Network Science for Communication Networks (NetSciCom 2013), 2013.

14. M. E. Newman. The structure and function of complex networks. SIAM Review, 45(2):167-256, 2003.

15. C. Cooper and A. Frieze. A general model of web graphs. Random Structures & Algorithms, 22(3):311-335, 2003.

16. M. Deijfen. Random networks with preferential growth and vertex death. Journal of Applied Probability, 47(4):1150-1163, 2010.

17. A. Susarla, J. H. Oh, and Y. Tan. Social networks and the diffusion of user-generated content: Evidence from YouTube. Information Systems Research, 23(1):23-41, 2012.