
4.4 Inferring The Optimal Aggregation Density Online

Assuming a node has collected enough information about past contacts³, it could use density-based aggregation to maintain a contact graph around the optimal density point, and apply its preferred forwarding metrics on this graph. However, the above analysis gives rise to an interesting question: How can each node decide on an optimal aggregation density in real time, without any prior knowledge (e.g., mobility context), in order to optimize its performance?
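To make the mechanism concrete, density-based aggregation can be pictured as keeping only the strongest observed links until a target graph density is reached. The following is a minimal sketch, assuming contacts are logged as (a, b, time) tuples and link strength is measured by contact frequency; the function name and these choices are illustrative, not the exact mechanism evaluated here.

```python
from collections import Counter

def aggregate_to_density(contacts, nodes, d):
    """Keep the most frequent contact pairs as links, until the graph
    reaches the target density d = m / (n*(n-1)/2)."""
    counts = Counter(frozenset((a, b)) for a, b, _t in contacts)
    max_edges = len(nodes) * (len(nodes) - 1) // 2
    m = min(len(counts), round(d * max_edges))
    # Rank candidate links by contact frequency and keep the top m.
    ranked = sorted(counts, key=counts.get, reverse=True)
    return [tuple(e) for e in ranked[:m]]
```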

We conjecture that this optimal density lies at the point where the underlying (social) structure governing mobility best correlates with the social structure that can be observed on the aggregated contact graph. Consider for example a simple CAVE model: each node belongs to exactly one community, and nodes have links to most other nodes in their community (with equal or different weights); we refer to these links/neighbors as regular. In terms of mobility, each community is assigned to a separate physical location (“cave”), and nodes either move towards a node in their community (with higher probability) or perform a random “roaming” trip in the network (with lower probability). In this mobility model, as nodes move around, they mostly come in contact with regular neighbors, but every now and then they might encounter a node from another community (e.g., during a roaming trip). A link to such a node could be included in the contact graph (e.g., if the contact was recent enough, or the aggregation density is high enough). We will call such a node a random neighbor (on the

³ Given that there are no delay constraints on this collection process, it can be performed with relatively low overhead as nodes meet and exchange information about their past contacts.

Figure 4.4: Aggregated contact graph for a scenario with 3 social communities driving mobility (denoted with different colors), and three different aggregation densities.

contact graph). Hence, regular links imply future contacts, while random links might be incidental and do not necessarily imply future contacts.
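For readers who want to experiment, the contact process of this model can be mimicked with a few lines of code. The sketch below is an assumption-laden toy: the probability p_regular and the uniform choice of roaming partners are our simplifications, not the model's exact parameters.

```python
import random

def cave_contacts(communities, num_contacts, p_regular=0.9, seed=0):
    """Generate (a, b) contact pairs under a CAVE-like process: with
    probability p_regular a node meets a member of its own community
    (a "regular" contact), otherwise it meets a node from another
    community (a "random" contact, e.g. during a roaming trip)."""
    rng = random.Random(seed)
    home = {n: c for c, members in communities.items() for n in members}
    all_nodes = list(home)
    contacts = []
    for _ in range(num_contacts):
        a = rng.choice(all_nodes)
        own = [n for n in communities[home[a]] if n != a]
        others = [n for n in all_nodes if home[n] != home[a]]
        if own and (rng.random() < p_regular or not others):
            contacts.append((a, rng.choice(own)))
        else:
            contacts.append((a, rng.choice(others)))
    return contacts
```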

For CNA-based forwarding to work, we argue that a strong “relationship” in the contact process (e.g., a community/regular link in the above model) should also be an observable relationship (edge) in the aggregated graph. At the same time, the number of additional graph edges (e.g., random links) should be as few as possible, as these will most probably be the result of incidental past meetings. There is solid evidence that real-life mobility is predominantly small-world, with salient features such as communities [157] and a high clustering coefficient [158]. Our detailed study of real traces in [40, 39] confirms this.

The goal of this section is thus to devise an algorithm that observes the aggregated social graph online (i.e., as new contacts arrive), and tries to assess the density at which this graph has a structure that best reflects the above intuition about connectivity/contact patterns. This is a difficult unsupervised learning problem. We first give hints on how to tackle this problem on a simple example, like the above. Then, we propose two clustering-related methods for identifying distinguishable similarity patterns:

one based on spectral graph theory [156] and the other based on established methods of evaluating the quality of graph partitioning [159]. Both can be used at the core of our algorithm.
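Although both methods are developed in detail below, the quantities they rest on can be sketched compactly. The helper names and the choice of probing a 2-cluster structure are our assumptions for illustration; [156] and [159] define the actual statistics used.

```python
import networkx as nx
import numpy as np

def spectral_gap(G):
    """Gap after the two smallest eigenvalues of the normalized
    Laplacian; a large gap suggests two well-separated clusters
    (requires at least 3 nodes)."""
    L = nx.normalized_laplacian_matrix(G).toarray()
    eig = np.sort(np.linalg.eigvalsh(L))
    return eig[2] - eig[1]

def partition_quality(G, partition):
    """Modularity of a candidate partition (a list of node sets):
    high values mean many intra-cluster and few inter-cluster edges."""
    return nx.algorithms.community.modularity(G, partition)
```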

4.4.1 Optimal community structure of the social graph

One way to distinguish regular neighbors from random neighbors is by their similarity values. Each node will see a set of nodes to which it is highly similar (i.e., many shared regular neighbors in the graph) and another set of nodes to which it is less similar (i.e., random neighbors). Fig. 4.4 shows such an example, for “low”, “good”, and “high” aggregation density. As can be seen there, if the density is too low, very few of the regular (intra-community) links appear, and the similarity of most node pairs is 0. On the other hand, if the density is too high, the graph is too dense, giving the (wrong) impression that random neighbors, belonging to different communities, have high similarity as well (thus leading to the wrong conclusion that these nodes belong to the same community).
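The similarity used throughout counts shared neighbors on the aggregated graph. A minimal sketch, assuming plain common-neighbor counting (actual metrics, e.g. SimBet's, may normalize differently):

```python
def similarity(edges, a, b):
    """Similarity of a and b on the aggregated contact graph, taken
    here as their number of common neighbors."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return len(nbrs.get(a, set()) & nbrs.get(b, set()))
```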

Figure 4.5: (a) Number of links and (b) similarity values, as a function of the number of contacts seen (an implicit measure of time).

Based on the above observations, our problem could thus be cast as maximizing the similarity to regular neighbors, while minimizing the similarity to random ones. To motivate this approach, we use the CAVE model without rewiring to analyze the expected number of regular neighbors and the expected number of random neighbors for a node in the aggregated graph. Due to the simplicity of this model, these quantities can also be derived in closed form. From these, we get the expected values of similarity that would be observed online, as a function of time.

We omit the detailed analysis here, which can be found in [38], and just plot the expected number of regular, random, and total links in Fig. 4.5(a), as a function of the contacts seen by a specific node (an implicit measure of time as well). This figure shows that the point where all (or most) regular links are in the graph is reached quickly (after a few tens of contacts). In this scenario, this is the right point to stop the aggregation and fix the density. After that point, only random links are added to the graph. These have little predictive value and erroneously increase nodes’ similarity.⁴

Ideally, assume that a node could measure the average similarity with its regular neighbors (e.g., nodes in the same community) at different aggregation densities $d$, denoted as $sim_{reg}(d)$, and its average similarity with random neighbors, denoted as $sim_{rnd}(d)$. Then, in order to maximize the predictive capacity of the aggregated graph, it could try to maximize $sim_{reg}(d) - sim_{rnd}(d)$. As nodes do not have any a priori knowledge about community membership and size, or the total number of nodes, it is more sensible to use normalized similarities and divide similarities by the expected node degree $n_{reg}(d) + n_{rnd}(d)$, where $n_{reg}(d)$ and $n_{rnd}(d)$ are the numbers of observed regular and random neighbors, respectively. Consequently, in order for our algorithm to automatically adjust to the optimal aggregation density, it seems reasonable to solve the following maximization problem:

⁴ We stress here that, in some cases, nodes with little similarity can still share a strong social link with predictive value, as e.g. in the case of two friends who each belong to different communities. This initial discussion serves only as a motivation, based on a simple scenario. The algorithms we propose later, while still partly based on this idea, can be applied to complex setups where such clear distinctions between “random” and “regular” neighbors start getting blurred.

Figure 4.6: Histogram of encountered similarity values at three different aggregation levels, with 2 clusters shown for each.

$$ \max_{d} \;\; \frac{sim_{reg}(d) - sim_{rnd}(d)}{n_{reg}(d) + n_{rnd}(d)} $$
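With oracle community labels, which exist only in a synthetic setting such as the CAVE model, this objective can be evaluated directly for each candidate density. Below is a sketch reusing the hypothetical similarity and aggregate_to_density helpers from above; the per-node averaging is our simplification.

```python
def objective(edges, node, community_of):
    """Evaluate (sim_reg(d) - sim_rnd(d)) / (n_reg(d) + n_rnd(d)) for
    one node, given oracle community labels."""
    nbrs = ({v for u, v in edges if u == node}
            | {u for u, v in edges if v == node})
    if not nbrs:
        return 0.0
    reg = [similarity(edges, node, b) for b in nbrs
           if community_of[b] == community_of[node]]
    rnd = [similarity(edges, node, b) for b in nbrs
           if community_of[b] != community_of[node]]
    sim_reg = sum(reg) / len(reg) if reg else 0.0
    sim_rnd = sum(rnd) / len(rnd) if rnd else 0.0
    return (sim_reg - sim_rnd) / len(nbrs)

# A node would then sweep candidate densities and keep the maximizer:
# best_d = max(candidate_densities, key=lambda d: objective(
#     aggregate_to_density(contacts, nodes, d), node, community_of))
```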

Figure 4.5(b) depicts the normalized similarities and their difference. It is clear from this plot that the difference is maximized around the point where enough contacts per node have occurred to fill in most regular links, but only a few random links have been instantiated. We will show later that this maximum correlates well with the aggregation density at which the performance of SimBet and Bubble Rap is optimal.

Nevertheless, all the above discussion hinges on the assumption that a node knows in advance which neighbors are regular and which are random. In practice, however, there are no such pre-assigned labels on nodes. When a node encounters another, it only knows and logs its similarity value to this node; without such labels it cannot tell whether this value should be counted towards a regular or a random neighbor. Consequently, in order to apply the above maximization, nodes first need to be able to distinguish between these two classes of links.

To solve this classification problem, a node can create a histogram of the similarity values observed out of the contacts seen over time. One way to assign “labels” to each past contact (regular vs. random) is to perform clustering on the set of normalized similarity values. If the aggregation density is appropriate, a 2-means algorithm should produce two clusters of similarity values, one for regular nodes (with high similarity) and another for random nodes (with low similarity). Note that at low densities (close to 0) and high densities (close to 1) only one cluster appears, since all nodes have similar similarities, close to 0 (no similarity) and 1 (all nodes similar), respectively. Hence, the unsupervised problem of finding the right aggregation density becomes equivalent to maximizing the distance between the two cluster centroids of low and high similarities. Fig. 4.6 shows an example that captures this intuition, for the CAVE scenario.

Two clusters are clearly separable at a density of 0.1 (which is close to the “optimal” densities of Table 4.1).
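A minimal sketch of this unsupervised criterion, assuming scalar similarity values and a plain Lloyd-style 2-means (a library routine such as scikit-learn's KMeans could be used instead); the density maximizing the centroid gap would then be selected:

```python
def two_means_gap(values, iters=50):
    """Cluster scalar similarity values with 2-means and return the
    distance between the two centroids; a larger gap means regular
    and random neighbors are better separated."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return 0.0  # one degenerate cluster: density too low or too high
    c = [lo, hi]
    for _ in range(iters):
        groups = ([], [])
        for v in values:
            # Assign v to the closer centroid (False -> 0, True -> 1).
            groups[abs(v - c[0]) > abs(v - c[1])].append(v)
        c = [sum(g) / len(g) if g else c[i] for i, g in enumerate(groups)]
    return abs(c[1] - c[0])
```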