The pattern of information diffusion in microblog

(1)

HAL Id: hal-00737836

https://hal.archives-ouvertes.fr/hal-00737836

Submitted on 2 Oct 2012

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

The pattern of information diffusion in microblog

Dong Wang, Zhenyu Li, Kave Salamatian, Gaogang Xie

To cite this version:

Dong Wang, Zhenyu Li, Kave Salamatian, Gaogang Xie. The pattern of information diffusion in microblog. Proceedings of The ACM CoNEXT Student Workshop, 2011, Tokyo, Japan. pp.3:1–3:2,

�10.1145/2079327.2079330�. �hal-00737836�

(2)

The Pattern of Information Diffusion in Microblog

Dong Wang ^†¶ , Zhenyu Li ^† , Kave Salamatian ^§ , Gaogang Xie ^†

† Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

¶ Graduate School of Chinese Academy of Sciences, Beijing, China

§ Universite de Savoie, France

{ wangdong01, zyli, xie } @ict.ac.cn, kave.salamatian@univ-savoie.fr

ABSTRACT

Microblog (e.g. Twitter) is fast emerging social medium for information diffusion. This paper presents an analysis of the pattern of information diffusion in microblog. We have collected the data of 1,798,901 tweets from Sina microblog service- the Chinese equivalent of Twitter. Our analysis un- veils that the distribution of tweets’ popularity follow the stretched exponential (SE) model. Based on this observa- tion we analyse a multiplicative cascade model for describ- ing tweets popularity.

1. INTRODUCTION

Microblog services, such as Twitter and Sina microblog, have greatly changed the way of information dissemi- nation. The speed and convenience of microblogs make them competitive services with classical media. With the increase of importance of microblog as a social medi- um for information sharing, understanding mechanism- s describing how information diﬀuses over microblogs, and explaining how some tweets become popular, are meaningful in order to predict the evolution of this new social medium in the future.

In this paper, we make an analysis of tweet’s popular- ity. We investigate a cascade model [1] of information propagation in microblog services. In this model that is particularly suited to describing web 2.0 services, we assume that information diﬀusion proceeds in an even- tually random number of successive stages. Our aim is to validate the usage of such a model for describing information diﬀusion in microblog services.

We collected the data of 1,798,901 tweets of Sina mi- croblog from Dec. 6th, 2010 to Jan. 1st, 2011, contain- ing all the publicly available details such as the re-tweet number and the information of the users’ participating in the re-tweeting process. Our analysis unveils that the distribution of tweets’ popularity follows the stretched

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

ACM CoNEXT Student Workshop, December 6, 2011, Tokyo, Japan.

Copyright 2011 ACM 978-1-4503-1042-0/11/0012 ... $ 10.00.

exponential (SE) model, instead of the expected power- law model. We show that the social cascade processes of tweets naturally converge to the SE model and the pa- rameters can be used to estimated crucial properties of cascade. Moreover, the re-tweet number decreases ex- ponentially with the growth of re-tweeting hop giving preliminary evidence for a simple multiplicative model.

2. THE MULTIPLICATIVE CASCADE MOD- EL FOR TWEET POPULARITY

A microblog user might follow another user, i.e., he will receive all messages (called tweets) sent by the fol- lowed person. Followers might re-tweet some of the messages they receive to their own followers. The dis- tance between re-tweeters and the tweet’s publisher is called hops. The re-tweeting mechanism enables users to spread information to users that could not normally access it. Through re-tweeting hot messages can be re- ceived by tens of thousands of users. We measure pop- ularity of a tweet by the number of people that have re-tweeted it.

The re-tweeting pattern in microblog can be described as a random multiplicative cascade process. Formally, a random multiplicative cascade process X (.) can be de- scribed at each point k as a multiplication of n random variables m 1 , . . . , m n , i.e., X (k) = m 1 × m 2 × . . . × m n . To relate this model to the propagation of information, we define that the i-th stage begins at the time for gen- eration of the first re-tweet at hop i and ends until the first re-tweet at hop i + 1 appears,that is, the number of cascade stages is equivalent to the maximum re-tweeting hop. The number of new users that will re-tweet a tweet at i-th (1 < i ≤ n) stage is a coefficient α i of the overall number of users that have re-tweeted the tweet up to the (i − 1)-th stage. Then the overall number of users re-tweet a particular tweet t after n stages, denoted as N ⁿ (t), is given by N ⁿ (t) = (1 + α ₁ )(1 +α ₂ ) . . . (1 +α _n ), where the expansion coefficient α i is in fact related to two main factors: the proportion of followers that will re-tweet at i-th stage, and the outdegree (i.e. the num- ber of followers) of the re-tweeters.

Similarly to Central limit theorem that applies to sum

of random variables, we can derive an asymptotic lim-

it theorem for multiplicative processes where all mul-

tiplied random variables are i.i.d [2]. Such processes

(3)

10⁰ 10¹ 10² 10³ 10⁴ 10⁵ 10⁶ 1

2076 22712 98172 282721 647531 1279962 2283020

Rank (log scale)

# of retweets (yc scale)

popularity

c = 0.26, a= 1.693, b = 24.957 R² = 0.997476

10⁰ 10² 10⁴ 10⁶10⁰

10¹ 10² 10³ 10⁴ 10⁵ 10⁶ 10⁷

# of retweets (log scale)

data in log−y^c scale SE model fit

data in log−log scale

Figure 1: Stretched ex- ponential distr. Fitting

0 10 20 30 40 50 60

10⁻⁶ 10⁻⁴ 10⁻² 10⁰ 10²

hops

average retweet number(logy scale)

0~500 501~1000 1001~1500 1501~2000 2001~2500 2501~3000 3001~3500 3501~4000 4001~4500 4501~5000

>5000

Figure 2: Dis- tribution of avg. re-tweet

number in

each hop

converge to a stretched exponential (SE) distribution deﬁned as:

P(X ≥ x) = e ⁻ ⁽

^x^x⁰

⁾

^c

(1) where the stretched factor is related to the number of multiplied random variables m, i.e. the number of cas- cade stages, through a simple relation c = _m ¹ , and x 0

is a constant parameter that is related to ranking scale.

Because of its particular shape a stretched exponential distribution can be easily mistaken with a power law [3].

However processes following a stretched exponential dis- tribution will have a particular rank ordering statistic that will be diﬀerent from the one of a power law . Let i be the rank of an observation from a stretched expo- nentially distributed process and y i its observed value.

It can be shown theoretically that this relation is valid y ^c _i = − a log i + b (2) where a = x ^c ₀ and b = y ₁ ^c , meaning that the modified ranking diagram, showing y _i ^c the observed values with exponent c vs. the log of its rank, follows in a straight line with slope a = x ^c ₀ . This analysis suggest that if an empirical distribution follows a streched exponential it can be meaningful to search for a multiplicative cascade that could explain the emergence of this global distri- bution. In order to check this, we fitted SE models to the observed tweet popularity rank-ordering distribu- tion using the matlab toolbox provided by authors of [4]. We show in Figure 1 the popularity distribution for all collected tweets in both log-log scale and log- y ^c scale. The parameters of the SE model along with the R ² statistic of the fitting are marked in the figure.

The figure shows that the SE model fits the distribution very well, except the first several points that are due to the “King effect” [3] (this resulting from the fact that popular topics reduce the attractiveness of other topic because of their high popularity). We have fitted SE for different days and list in Table 1 the obtained parame- ters showing the relative consistency of the c parameter in close dates.

In particular the SE model predict that we can expect a number of maximum re-tweeting hop (a number of cascade stage) around m = ¹ _c . We also show in Table 1, the number of maximum re-tweeting hop h _e derived

Table 1: Parameters of diﬀerent days of tweets

Time c a R

²

h

e

h

r

06/12/10 0.38 8.617 0.9980 2.63 2.69 07/12/10 0.36 6.523 0.9987 2.78 2.73 08/12/10 0.37 7.979 0.9987 2.70 2.72

using SE model: h e = ¹ _c and the empirically observed value over the dataset h _a . As can be observed these values are very close. These results give more rationals for a multiplicative cascade model of tweet popularity.

To model the multiplicative cascade, we have ana- lyzed how the number of re-tweets relates to the num- ber of re-tweeting hops from the tweet generator. For this purpose we have stratiﬁed the data into 11 strates according to the popularity of the tweets inside it, i.e.

for 1 ≤ i ≤ 10, the i-th strate contains the tweets with re-tweet numbers between 500i to 500(i + 1); the last set contains all remaining tweets. We show in Figure 2 in a semi-log scale the evolution of the average num- ber of re-tweets as a function of its hop distance from tweet source. Interestingly all curves seems to be al- most parallel and to be ﬁnely ﬁtted to a straight lines.

This indicates that the average re-tweet number de- creases exponentially with the hop distance. This is compatible with a cascade model with a constant value of E { α i } = ˆ α for all stages. Interestingly these figures mean that the tweet’s popularity mainly depends on the re-tweet number at the first hop (or the two first hop- s), i.e., the number of followers of the originator that forward the tweet. However this conclusion needs more analysis and larger space to be developed.

3. CONCLUSIONS AND FUTURE WORKS

In this paper, we gave evidence that diffusion in a microblog can be explained with a very simple mul- tiplicative cascade model.We showed that the overall distribution of tweet popularity is compatible with this model.We even give some indication that cascade coeffi- cient E { α } might be constant. However, we need more precise information about the diffusion process over the microblog to validate this multiplicative cascade model both microscopically (at the scale of a single tweet) and macroscopically (at the scale of global statistics). This is the aim of our forthcoming research.

4. REFERENCES

[1] M. Cha, A. Mislove, B. Adams and K. P.

Gummadi. Characterizing Social Cascades in Flickr. In Proceedings of the WOSN’08,2008.

[2] U. Frisch and D. Sornette. Extreme deviations and applications. Journal of Physics I France, 1997.

[3] J. Laherrere, D. Sornette. Stretched exponential distributions in Nature and Economy: Fat tails with characteristic scales. In European Physical Journal B 2, 2008.

The pattern of information diffusion in microblog

HAL Id: hal-00737836

https://hal.archives-ouvertes.fr/hal-00737836

Submitted on 2 Oct 2012

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

The pattern of information diffusion in microblog

Dong Wang, Zhenyu Li, Kave Salamatian, Gaogang Xie

To cite this version:

Dong Wang, Zhenyu Li, Kave Salamatian, Gaogang Xie. The pattern of information diffusion in microblog. Proceedings of The ACM CoNEXT Student Workshop, 2011, Tokyo, Japan. pp.3:1–3:2,

�10.1145/2079327.2079330�. �hal-00737836�

The Pattern of Information Diffusion in Microblog

Dong Wang †¶ , Zhenyu Li † , Kave Salamatian § , Gaogang Xie †

† Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

¶ Graduate School of Chinese Academy of Sciences, Beijing, China

§ Universite de Savoie, France

{ wangdong01, zyli, xie } @ict.ac.cn, kave.salamatian@univ-savoie.fr

ABSTRACT

1. INTRODUCTION

ACM CoNEXT Student Workshop, December 6, 2011, Tokyo, Japan.

Copyright 2011 ACM 978-1-4503-1042-0/11/0012 ... $ 10.00.

2. THE MULTIPLICATIVE CASCADE MOD- EL FOR TWEET POPULARITY

Similarly to Central limit theorem that applies to sum

of random variables, we can derive an asymptotic lim-

it theorem for multiplicative processes where all mul-

tiplied random variables are i.i.d [2]. Such processes

Figure 1: Stretched ex- ponential distr. Fitting

Figure 2: Dis- tribution of avg. re-tweet

number in

each hop

converge to a stretched exponential (SE) distribution deﬁned as:

P(X ≥ x) = e − (

)

(1) where the stretched factor is related to the number of multiplied random variables m, i.e. the number of cas- cade stages, through a simple relation c = m 1 , and x 0

is a constant parameter that is related to ranking scale.

Because of its particular shape a stretched exponential distribution can be easily mistaken with a power law [3].

However processes following a stretched exponential dis- tribution will have a particular rank ordering statistic that will be diﬀerent from the one of a power law . Let i be the rank of an observation from a stretched expo- nentially distributed process and y i its observed value.

In particular the SE model predict that we can expect a number of maximum re-tweeting hop (a number of cascade stage) around m = 1 c . We also show in Table 1, the number of maximum re-tweeting hop h e derived

Table 1: Parameters of diﬀerent days of tweets

Time c a R

h

h

06/12/10 0.38 8.617 0.9980 2.63 2.69 07/12/10 0.36 6.523 0.9987 2.78 2.73 08/12/10 0.37 7.979 0.9987 2.70 2.72

using SE model: h e = 1 c and the empirically observed value over the dataset h a . As can be observed these values are very close. These results give more rationals for a multiplicative cascade model of tweet popularity.

To model the multiplicative cascade, we have ana- lyzed how the number of re-tweets relates to the num- ber of re-tweeting hops from the tweet generator. For this purpose we have stratiﬁed the data into 11 strates according to the popularity of the tweets inside it, i.e.

3. CONCLUSIONS AND FUTURE WORKS

4. REFERENCES

[1] M. Cha, A. Mislove, B. Adams and K. P.

Gummadi. Characterizing Social Cascades in Flickr. In Proceedings of the WOSN’08,2008.

[2] U. Frisch and D. Sornette. Extreme deviations and applications. Journal of Physics I France, 1997.

[3] J. Laherrere, D. Sornette. Stretched exponential distributions in Nature and Economy: Fat tails with characteristic scales. In European Physical Journal B 2, 2008.

[4] A. Clauset, C. R. Shalizi, M. E.J. Newman.

Power-law distributions in empirical data. In SIAM

Review 51(4), 2009.

Dong Wang ^†¶ , Zhenyu Li ^† , Kave Salamatian ^§ , Gaogang Xie ^†

P(X ≥ x) = e ⁻ ⁽

⁾

(1) where the stretched factor is related to the number of multiplied random variables m, i.e. the number of cas- cade stages, through a simple relation c = _m ¹ , and x 0

In particular the SE model predict that we can expect a number of maximum re-tweeting hop (a number of cascade stage) around m = ¹ _c . We also show in Table 1, the number of maximum re-tweeting hop h _e derived

using SE model: h e = ¹ _c and the empirically observed value over the dataset h _a . As can be observed these values are very close. These results give more rationals for a multiplicative cascade model of tweet popularity.