Scholarly article on topic 'Discovering Most Significant News Using Network Science Approach'

Discovering Most Significant News Using Network Science Approach Academic research paper on "Computer and information sciences"

CC BY-NC-ND
0
0
Share paper
Academic journal
Procedia Computer Science
OECD Field of science
Keywords
{}

Abstract of research paper on Computer and information sciences, author of scientific article — Ilya Blokh, Vassil Alexandrov

Abstract The role of social network mass media has increased greatly in the recent years. In this paper we investigate news publication in Twitter based on Network Science approach. We analyzed news data posted using the most popular media sources to discover the most significant news over a given period of time. Significance is a qualitative property that reflects the degree of news impact on society and public opinion. We have attempted to define the threshold of significance and discover a number of news which had some significance for society in period of time from July 2014 to January 2015.

Academic research paper on topic "Discovering Most Significant News Using Network Science Approach"

Discovering most significant news

2ICREA-Barcelona Supercomputing Centre, Spain. lyablokh@gmail.com,—vassil.alexandrov@bsc.es-

Abstract

The role of social network mass media has increased greatly in the recent years. In this paper we investigate news publication in Twitter based on Network Science approach. We analyzed news data posted using the most popular media sources to discover the most significant news over a given period of time. Significance is a qualitative property that reflects the degree of news impact on society and public opinion. We have attempted to define the threshold of significance and discover a number of news which had some significance for society in period of time from July 2014 to January 2015.

The role of Internet mass media increased greatly in the recent years Randle (2001). Internet media get more and more audience coverage and thereby its role in forming public opinion increases. Public organizations, political parties, companies as well as public figures, politicians and entrepreneurs are interested in any mentioning of their name or of some specific facts concerning them in media including Internet media. They often trace a quantity of web-sites citing them, and different sources where some significant news could appear. Significant news finding could be also useful for common Internet use since the amount of daily media publications is large and some news which are significant for someone could be missed. Thus significant (in some context) news identification is essential and actual problem.

Nowadays a lot of research exist in the area of search and analysis of news published by mass media on the Internet. In general, there are several types of news that become a subject of research:

1. World news. It is a wide category that usually includes all information about some event or incident and don't relate to some specific branch or domain.

2. Branch news. Financial news, politics news etc.. In other words it's news related to some specific domain.

3. Emergency news. News about some extreme and extraordinary incidents, these usually happen suddenly.

4. Future news. News prediction is a special research direction that became popular last years.

Also methods of news personalization and development of personalized search engines is a separate

Approaches for analysing various types of news are widely different. They range, for example, from traditional search engines based on indexing to semantic methods based on idea of Semantic Web (Guha et al 2003, Giannopoulos et al. 2010, Halpin and Lavrenko 2009), keywords and key phrases processing (Murata 2006) , NLP methods (Yilmazel 2004). Ontology based approaches have a remarkable prevalence too (Wimalasuriya and Dou 2006, Alani et al 2009, Domingue and Motta 2000). In addition, some methods based on Data Mining technologies (Kleinberg 2006) and machine learning (Han et al 2003) used to resolve a problem of news search, interpretation and analysis.

This paper is devoted to the use of Network Science approach for news analysis.

Selection and peer-review under responsibility of the Scientific Programme Committee of ICCS 2015 1811 © The Authors. Published by Elsevier B.V.

1 Introduction

research topic.

doi: 10.1016/j .procs.2015.05.400

2 TwDiiicteejjnaet'wviiadkRclaapieiosgsyfisaiieWdresseisHcelhpiSrfsopeiiya Blokh and Vassil Alexandrov

Twitter is the social network which use simultaneously several ways of information spreading: following mechanism, retweets and mentions. General properties of this user influence ways were investigated in Cha et al (2010). In Kwak et al (2010) authors examined topological characteristics of the entire Twitter network. In particular, scale-free property of Twitter network revealed. In that case nodes of the network represent users and links between nodes represent following relation.

Figure 1: Structure of Twitter network

Network built on the mentioned rules is scale-free network.

The other approach is to point at retweet data. In this case nodes still represent the entire twitter account. These networks will have weighed links and weight represents the quantity of retweets made by one user during some time period. The investigated networks exhibit a scale-free property.

In our research we decided to take Twitter as a typical representative of the social networking phenomenon as it is one of most subscribed social networks and base our approach on smaller entities - such as tweets. Each network we build is limited by tweets from one user. Tweets act as nodes of network. Each tweet could be retweeted many times by some external twitter users. We define each retweet being one outgoing tie from a node. The total link number for each node is its degree. Thus for each tweet feed we have some degree distribution. Next we examine this distribution for scale-free property (Barabasi 2014).

Figure 2:. Retweet scheme at one twiiter feed scale Additionally the scope of our research data consists of tweets which contains news information.

3 Determining the news significance

We aim to form an integral measure of the degree of the news impact on the society and public opinion. This measure could be named as news significance. Thus our goal is to reveal significant news among the whole news data. We interpret the significance as a qualitative property of some piece of news and in the same way try to define some qualitative threshold to measure it. News significance research could help to reveal a manifold of interconnections between events and examine information spreading. The news significance could vary in different regions, between different social groups giving a possibility to describe and compare these regions or social groups. The changing of significance in some time period is also an interesting subject of analysis. In general significance analysis could help to characterize how different events impact on society or on separate social groups, to make research about events interconnection and to analyse news information spreading features.

Further, we define a quantitative metric of news significance as its retweet count for some time period. It is natural way to determine the significance because the retweet count directly shows how many people were interested by the news. In the subsequent research we expect that this metrics will become more complex by aggregating information not only about retweet count but also include the information about number of user replies on the given tweet, number of users who put the tweet in favourites and some additional data. Also news significance could vary in different periods of time and in different regions. Hence significance will be presented as multi-dimensional value. However, before we include more components to tune the significance in more precise way, we consider it currently to be a simple value of the retweet count.

Our next step after analysing the network for scale-free property is to define how could we differentiate significant news from other ones. For that the significance threshold should be specified.

4 Crawkidg data ifeväetwews using Network Science approach Ilya Blokh and Vassil Alexandrov

Thus we analysed twitter feed of some of noticeable news agencies which have from 5 up to 20 millions of followers each: BBC, CNN, New York Times, Mashable and TechCrunch. Collected data covers period from July 2014 up to January 2015 and includes information about 16500 news tweets.

Degree distributions were composed and plotted. At each chart the vertical axis shows the number of | retweets and the number of news tweet is on horizontal axis.

BBC Breaking

Figure 3: Degree distribution for BBC Breaking News account

Figure 4: Degree distribution for New York Times account

CNN Breaking

I ......

Figure 5: Degree distribution for CNN Breaking News account

Figure 6: Degree distribution for Mashable account

Figure 7: Degree distribution for Techcrunch account

As we see at charts degree distributions in all cases follow power law. Further we estimate the degree exponent (Barabasi 2014).

5 Experàngentalifiesulttn ews using Network Science approach Ilya Blokh and Vassil Alexandrov

We follow the algorithm described by Clauset et al (2009) to estimate the degree exponent for each news source. Estimated results are given in Table 1:

News Source (from Twitter) Y

BBC Breaking 2.31

NYTimes 2.05

CNN Breaking 2.11

Mashable 2.23

TechCrunch 2.45

Table 1: Estimated degree exponents

Estimated degree exponent values (2 < y < 3) show that distributions for all news sources have scale-free property (Barabasi 2014). It means that each network contains few nodes with a big amount of links - hubs. At the same time the majority of nodes have less links by several orders. So we could define the threshold for news significance being able to be forming hubs (being able to be concentrated and propagated through/by the hubs). Let's suppose that the news is significant thus meeting the condition that the corresponding node in the network is a hub. On the other part, the news is not significant if corresponding node has a low degree and thus is not forming a hub (there is not much interest in this news). Of course this is a very crude measure and is a very rough initial estimate and as explained above needs to be refined further taking into account a variety of factors identified above.

We use the fact that hubs grow polynomialy with the growth of the network size. In that way we could take a fixed portion of the most connected nodes from each network and assume that these nodes are hubs. According to (Barabasi 2014, Vallabhajosyula et al 2009) hub nodes portion could vary between 1% and 2.5% as usual in scale-free networks. In our case we consider that only 1% of highly connected nodes belong to the hub set.

As a result we could discover several significant news from each source:

Retweet count News Source (from Twitter) News text

20134 BBC Breaking Full statement from family of #MichaelBrown after #Ferguson ruling

15075 BBC Breaking US actor Robin Williams found dead, aged 63, in apparent suicide, California police say

13995 BBC Breaking Scotland has rejected independence, #indyref results confirm

10640 BBC Breaking Air Asia flight QZ 8501 travelling from Indonesia to Singapore has gone missing - reports

15104 New York Times 1989 was the year Taylor Swift was born. 2014 was The Year of Taylor Swift

5439 New York Times It should come as no surprise that music sales in 2014 1815

Discovering mos t significant news using Network S ciwnre dptiaafed ByalByibh SwdfVassil Alexandrov

3400 New York Times How exercise changes our DNA

16137 CNN Breaking Comedic actor Robin Williams, 63, died at his Northern California home Monday, law enforcement officials say.

15166 CNN Breaking Cleveland police's fatal shooting of 12-year-old Tamir Rice ruled a homicide.

13943 CNN Breaking Apple CEO Tim Cook announces he's gay.

11379 CNN Breaking Grand jury has decided not to indict Ferguson police Officer Darren Wilson. #FergusonDecision

47549 Mashable The people have spoken! Freedom has prevailed! Sony didn't give up! The Interview will be shown at theaters willing to play ...

8674 Mashable Street artist Banksy's powerful message of perseverance after Paris attack

1345 TechCrunch Start-up Opportunities Abound In The Age Of Infinite, Resilient, Immutable Infrastructure

810 TechCrunch My Kim Kardashian: Hollywood game has been nominated for a @TechCrunch Crunchies Award!

Table 2: Significant news. Time period: 15.07.2014 - 10.01.2015

6 Conclusion

We investigated variety of news (data) from Twitter and structured it into a set of networks where each network corresponds to one source of news. We chose five noticeable news agencies as sources for our research: BBC, CNN, New York Times, Mashable and TechCrunch. The data collection covers period from July 2014 to January 2015 and includes information about 16500 news tweets. We defined the network following the principle when tweets act as nodes of the network and the links symbolize retweet actions from external accounts. Thus each node has a degree value which reflects its retweet count. The network degree distribution analysis showed that networks are scale-free.

Further the significance of news as its possible impact on the degree at public opinion and social reaction was investigated. Assuming that the retweet count could be a quantitative measure of significance, we investigated if the hubs in the social networks can be used as a measure to determine if a set of news is significant.

We found that this is a very crude measure. Therefore in our future research we plan to extend the volume of analysed data from Twitter and include additional data from other social networks. Moreover one of the main directions of future research is to extend our approach and base our analysis of different points of views concerning each news. Globally, we name it information war analysis. Also we are working on introducing more complex metrics by aggregating information not only about retweet count but also aiming to include the information about number^ol gser replies on the given tweet, number of users who put the tweet in favourites and some additional

data. Inaddition we are , considering to apply an .optimization approach to. tackle .the. problem, of search and analysis

Discovering most significant news ^smg Network Science approach Hya Blokn and Vassil Alexandrov of significant news as well as compare the efficiency of the two approaches.

References

Alani H., Kim S., Millard D.E., Weal M.J., Hall W., Lewis P.H., Shadbolt N.R. (2009). Automatic ontology-based knowledge extraction from web documents, IEEE Intelligent Systems 18 (1) pp.14-21.

Barabasi A.L. (2014). Network Science.

Cha M., H. Haddadi, F. Benevenuto and K. P. Gummadi (2010) Measuring user influence in Twitter:The million follower fallacy, Proceedings of international AAAI Conference on Weblogs and Social.

Clauset A., Shalizi C.R., Newman M.E.J. (2009) Power-law distributions in empirical data. SIAM Review S1: 661703. ArXiv e-prints.

Domingue J., Motta E. (2000) PlanetOnto: from news publishing to integrated knowledge management support. IEEE Intelligent Systems 15 (3) pp.26-32.

Ghoshal G., Chi L., Barabasi A.-L.(2013) Uncovering the role of elementary processes in network evolution. Scientifc Reports 3, pp.1-8.

Giannopoulos G., Bikakis N., Dalamagas T., Sellis T. (2010) A Tool for semantic annotation and search. Proceedings of 7th Extended Semantic Web Conference, ESWC 2010, Heraklion, Crete, Greece, 2010.

Guha R., McCool R., Miller E. (2003) Semantic search. Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, pp. 700-709.

Halpin H., Lavrenko V. (2009) Relevance feedback between hypertext and semantic search. Proceedings of the Workshop on Semantic.

Han H., Lee Giles C., Manavoglu E., Zha H., Zhang Z., Fox E.A. (2003) Automatic document metadata extraction using support vector machines. Proceedings of 3rd ACM/IEEE-CS Joint Conference on Digital Libraries,, Houston, Texas.

Jia T., Posfai M. (2014) Connecting core percolation and controllability of complex networks. Scientific Reports 4:5379, pp.1-8.

Kleinberg J. (2006) Temporal Dynamics of On-Line Information Streams. Garofalakis, Springer.

Kwak, H., C. Lee, H. Park, S. Moon. (2010) What is Twitter, a social network or a news media?, Proceedings of the 19th international conference on World wide web, 591-600.

Murata T. (2006) Towards the Detection of Breaking News from Online Web Search Keywords. Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2006 Workshops)(WI-IATW'06).

Randle, Q. (2001) A historical overview of the effects of new mass media: Introductions in magazine publishing during the twentieth century. Volume 6, Number 9 - 3 September 2001.

Vallabhajosyula R.R., Chakravarti D., Lutfeali S., Ray A., Raval A. (2009) Identifying Hubs in Protein Interaction Networks. PLoS ONE; 4(4): e5344.

Wang D., Wen Z., Tong H., Lin C.Y., Song C., Barabasi A.L. (2011) Information Spreding in Context.

Wimalasuriya D.C., Dou D. (2006) Ontology-based information extraction: an introduction and a survey of current approaches. Journal of Information Science 36, pp.306-323.

Yilmazel O., Finneran C.M., Liddy E.D.(2004) An NLP system to automatically assign metadata. Proceedings of the Fourth ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL'04)