• Aucun résultat trouvé

The pattern of information diffusion in microblog

N/A
N/A
Protected

Academic year: 2021

Partager "The pattern of information diffusion in microblog"

Copied!
3
0
0

Texte intégral

(1)

HAL Id: hal-00737836

https://hal.archives-ouvertes.fr/hal-00737836

Submitted on 2 Oct 2012

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

The pattern of information diffusion in microblog

Dong Wang, Zhenyu Li, Kave Salamatian, Gaogang Xie

To cite this version:

Dong Wang, Zhenyu Li, Kave Salamatian, Gaogang Xie. The pattern of information diffusion in microblog. Proceedings of The ACM CoNEXT Student Workshop, 2011, Tokyo, Japan. pp.3:1–3:2,

�10.1145/2079327.2079330�. �hal-00737836�

(2)

The Pattern of Information Diffusion in Microblog

Dong Wang †¶ , Zhenyu Li , Kave Salamatian § , Gaogang Xie

Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Graduate School of Chinese Academy of Sciences, Beijing, China

§ Universite de Savoie, France

{ wangdong01, zyli, xie } @ict.ac.cn, kave.salamatian@univ-savoie.fr

ABSTRACT

Microblog (e.g. Twitter) is fast emerging social medium for information diffusion. This paper presents an analysis of the pattern of information diffusion in microblog. We have collected the data of 1,798,901 tweets from Sina microblog service- the Chinese equivalent of Twitter. Our analysis un- veils that the distribution of tweets’ popularity follow the stretched exponential (SE) model. Based on this observa- tion we analyse a multiplicative cascade model for describ- ing tweets popularity.

1. INTRODUCTION

Microblog services, such as Twitter and Sina microblog, have greatly changed the way of information dissemi- nation. The speed and convenience of microblogs make them competitive services with classical media. With the increase of importance of microblog as a social medi- um for information sharing, understanding mechanism- s describing how information diffuses over microblogs, and explaining how some tweets become popular, are meaningful in order to predict the evolution of this new social medium in the future.

In this paper, we make an analysis of tweet’s popular- ity. We investigate a cascade model [1] of information propagation in microblog services. In this model that is particularly suited to describing web 2.0 services, we assume that information diffusion proceeds in an even- tually random number of successive stages. Our aim is to validate the usage of such a model for describing information diffusion in microblog services.

We collected the data of 1,798,901 tweets of Sina mi- croblog from Dec. 6th, 2010 to Jan. 1st, 2011, contain- ing all the publicly available details such as the re-tweet number and the information of the users’ participating in the re-tweeting process. Our analysis unveils that the distribution of tweets’ popularity follows the stretched

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

ACM CoNEXT Student Workshop, December 6, 2011, Tokyo, Japan.

Copyright 2011 ACM 978-1-4503-1042-0/11/0012 ... $ 10.00.

exponential (SE) model, instead of the expected power- law model. We show that the social cascade processes of tweets naturally converge to the SE model and the pa- rameters can be used to estimated crucial properties of cascade. Moreover, the re-tweet number decreases ex- ponentially with the growth of re-tweeting hop giving preliminary evidence for a simple multiplicative model.

2. THE MULTIPLICATIVE CASCADE MOD- EL FOR TWEET POPULARITY

A microblog user might follow another user, i.e., he will receive all messages (called tweets) sent by the fol- lowed person. Followers might re-tweet some of the messages they receive to their own followers. The dis- tance between re-tweeters and the tweet’s publisher is called hops. The re-tweeting mechanism enables users to spread information to users that could not normally access it. Through re-tweeting hot messages can be re- ceived by tens of thousands of users. We measure pop- ularity of a tweet by the number of people that have re-tweeted it.

The re-tweeting pattern in microblog can be described as a random multiplicative cascade process. Formally, a random multiplicative cascade process X (.) can be de- scribed at each point k as a multiplication of n random variables m 1 , . . . , m n , i.e., X (k) = m 1 × m 2 × . . . × m n . To relate this model to the propagation of information, we define that the i-th stage begins at the time for gen- eration of the first re-tweet at hop i and ends until the first re-tweet at hop i + 1 appears,that is, the number of cascade stages is equivalent to the maximum re-tweeting hop. The number of new users that will re-tweet a tweet at i-th (1 < i n) stage is a coefficient α i of the overall number of users that have re-tweeted the tweet up to the (i 1)-th stage. Then the overall number of users re-tweet a particular tweet t after n stages, denoted as N n (t), is given by N n (t) = (1 + α 1 )(1 +α 2 ) . . . (1 +α n ), where the expansion coefficient α i is in fact related to two main factors: the proportion of followers that will re-tweet at i-th stage, and the outdegree (i.e. the num- ber of followers) of the re-tweeters.

Similarly to Central limit theorem that applies to sum

of random variables, we can derive an asymptotic lim-

it theorem for multiplicative processes where all mul-

tiplied random variables are i.i.d [2]. Such processes

(3)

100 101 102 103 104 105 106 1

2076 22712 98172 282721 647531 1279962 2283020

Rank (log scale)

# of retweets (yc scale)

popularity

c = 0.26, a= 1.693, b = 24.957 R2 = 0.997476

100 102 104 106100

101 102 103 104 105 106 107

# of retweets (log scale)

data in log−yc scale SE model fit

data in log−log scale

Figure 1: Stretched ex- ponential distr. Fitting

0 10 20 30 40 50 60

10−6 10−4 10−2 100 102

hops

average retweet number(logy scale)

0~500 501~1000 1001~1500 1501~2000 2001~2500 2501~3000 3001~3500 3501~4000 4001~4500 4501~5000

>5000

Figure 2: Dis- tribution of avg. re-tweet

number in

each hop

converge to a stretched exponential (SE) distribution defined as:

P(X x) = e (

xx0

)

c

(1) where the stretched factor is related to the number of multiplied random variables m, i.e. the number of cas- cade stages, through a simple relation c = m 1 , and x 0

is a constant parameter that is related to ranking scale.

Because of its particular shape a stretched exponential distribution can be easily mistaken with a power law [3].

However processes following a stretched exponential dis- tribution will have a particular rank ordering statistic that will be different from the one of a power law . Let i be the rank of an observation from a stretched expo- nentially distributed process and y i its observed value.

It can be shown theoretically that this relation is valid y c i = a log i + b (2) where a = x c 0 and b = y 1 c , meaning that the modified ranking diagram, showing y i c the observed values with exponent c vs. the log of its rank, follows in a straight line with slope a = x c 0 . This analysis suggest that if an empirical distribution follows a streched exponential it can be meaningful to search for a multiplicative cascade that could explain the emergence of this global distri- bution. In order to check this, we fitted SE models to the observed tweet popularity rank-ordering distribu- tion using the matlab toolbox provided by authors of [4]. We show in Figure 1 the popularity distribution for all collected tweets in both log-log scale and log- y c scale. The parameters of the SE model along with the R 2 statistic of the fitting are marked in the figure.

The figure shows that the SE model fits the distribution very well, except the first several points that are due to the “King effect” [3] (this resulting from the fact that popular topics reduce the attractiveness of other topic because of their high popularity). We have fitted SE for different days and list in Table 1 the obtained parame- ters showing the relative consistency of the c parameter in close dates.

In particular the SE model predict that we can expect a number of maximum re-tweeting hop (a number of cascade stage) around m = 1 c . We also show in Table 1, the number of maximum re-tweeting hop h e derived

Table 1: Parameters of different days of tweets

Time c a R

2

h

e

h

r

06/12/10 0.38 8.617 0.9980 2.63 2.69 07/12/10 0.36 6.523 0.9987 2.78 2.73 08/12/10 0.37 7.979 0.9987 2.70 2.72

using SE model: h e = 1 c and the empirically observed value over the dataset h a . As can be observed these values are very close. These results give more rationals for a multiplicative cascade model of tweet popularity.

To model the multiplicative cascade, we have ana- lyzed how the number of re-tweets relates to the num- ber of re-tweeting hops from the tweet generator. For this purpose we have stratified the data into 11 strates according to the popularity of the tweets inside it, i.e.

for 1 i 10, the i-th strate contains the tweets with re-tweet numbers between 500i to 500(i + 1); the last set contains all remaining tweets. We show in Figure 2 in a semi-log scale the evolution of the average num- ber of re-tweets as a function of its hop distance from tweet source. Interestingly all curves seems to be al- most parallel and to be finely fitted to a straight lines.

This indicates that the average re-tweet number de- creases exponentially with the hop distance. This is compatible with a cascade model with a constant value of E { α i } = ˆ α for all stages. Interestingly these figures mean that the tweet’s popularity mainly depends on the re-tweet number at the first hop (or the two first hop- s), i.e., the number of followers of the originator that forward the tweet. However this conclusion needs more analysis and larger space to be developed.

3. CONCLUSIONS AND FUTURE WORKS

In this paper, we gave evidence that diffusion in a microblog can be explained with a very simple mul- tiplicative cascade model.We showed that the overall distribution of tweet popularity is compatible with this model.We even give some indication that cascade coeffi- cient E { α } might be constant. However, we need more precise information about the diffusion process over the microblog to validate this multiplicative cascade model both microscopically (at the scale of a single tweet) and macroscopically (at the scale of global statistics). This is the aim of our forthcoming research.

4. REFERENCES

[1] M. Cha, A. Mislove, B. Adams and K. P.

Gummadi. Characterizing Social Cascades in Flickr. In Proceedings of the WOSN’08,2008.

[2] U. Frisch and D. Sornette. Extreme deviations and applications. Journal of Physics I France, 1997.

[3] J. Laherrere, D. Sornette. Stretched exponential distributions in Nature and Economy: Fat tails with characteristic scales. In European Physical Journal B 2, 2008.

[4] A. Clauset, C. R. Shalizi, M. E.J. Newman.

Power-law distributions in empirical data. In SIAM

Review 51(4), 2009.

Références

Documents relatifs

In this Entity Parsing module, the Named Entities (NE) are identified and tagged in the given entity using the Stanford NER engine. The remaining words after stop words

The SEPLN 2013 Tweet- Norm shared task focuses on the evalua- tion of approaches useful for normalizing out-of-vocabulary words occurring in Span- ish tweets, similar to the

Display Dimensions (HxWxD, inches) Screen Size (Inches, diagonal) Display Pedestal.. Intarnal

We have found field equations governing the evolution of bound states of two Dirac fields restricted to a world line. We

In this literature, we investigate the distribution of the first digit in the sequence of the perfect square numbers and the distribution of the arbitrary positive integer power of

Wong and Stephen Yau, I gave a lower estimate of the gap of the first 2 eigenvalues of the Schr¨odinger operator in case the potential is convex.. In this note we note that the

But for finite depth foliations (which have two sided branching), there are many examples where it is impossible to make the pseudo-Anosov flow transverse to the foliation [Mo5] and

To improve the temporal accuracy, the semi-implicit spectral deferred correction (SDC) method and a high order semi-implicit Runge–Kutta method combining with the first order