
We process each packet capture with tshark to extract the complete set of HTTP URLs in the trace. Each of these URLs is classified along three different dimensions, following the previous section (§2). Note that we consider only HTTP traffic and filter out HTTPS traffic, which has been shown to be small in Android apps [21] and does not expose the headers that we analyze. For each extracted URL, we carry out three checks:

1. We check the URL against the set of descriptors in EasyList and, if a match is found, we classify the URL as ad-related, since the connection was most likely made to retrieve an ad element to display on the smartphone screen.

2. We check the URL against the set of filters contained in EasyPrivacy and, if a match is found, we mark the URL as tracking-related.

3. Finally, we issue a query to the VirusTotal service with the URL as a parameter to obtain a reply that aggregates the findings of all of the backend engines supported by VirusTotal. In addition, we extract fully qualified domain names from the URL and query VirusTotal for information about these. The results relevant to domain names include, among other things, the Webutation safety score for a domain.
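The URL extraction and the first two checks can be sketched as follows. This is a minimal illustration under stated assumptions, not the matching engine used in the study: it handles only two rule shapes from the Adblock Plus filter syntax (domain-anchored rules of the form ||example.com^ and plain substring rules) and ignores exception rules, wildcards and options. The function names and sample inputs are hypothetical, and the tshark invocation in the docstring is one plausible way to obtain the host and URI fields.

```python
from urllib.parse import urlsplit


def urls_from_tshark_fields(lines):
    """Build URLs from tab-separated (host, uri) pairs, e.g. the output of
    tshark -r trace.pcap -Y http.request -T fields -e http.host -e http.request.uri
    """
    urls = set()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        # Skip malformed lines and requests without a Host header.
        if len(parts) == 2 and parts[0]:
            host, uri = parts
            urls.add("http://" + host + uri)
    return urls


def matches(rule, url):
    """Simplified matcher: '||domain^' anchors on the host name,
    any other rule is treated as a plain substring of the URL."""
    if rule.startswith("||"):
        domain = rule[2:].rstrip("^")
        host = urlsplit(url).hostname or ""
        return host == domain or host.endswith("." + domain)
    return rule in url


def classify(url, ad_rules, tracker_rules):
    """Return the set of labels ('ad', 'tracker') whose rule list matches."""
    labels = set()
    if any(matches(rule, url) for rule in ad_rules):
        labels.add("ad")
    if any(matches(rule, url) for rule in tracker_rules):
        labels.add("tracker")
    return labels
```

A URL that matches neither list yields an empty set, which corresponds to the remaining URLs that are handled by the third check and by the domain categorization.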

In total, our dataset consists of 2146 processed applications (1710 with traffic activity) spanning 25 distinct application categories. In aggregate, these applications connect to almost 250k unique URLs across 1985 top level domains.

5 Application Destination Characterization

While there is a considerable body of work in the area of profiling mobile apps, the focus has been on detecting data leakage, or on developing behavioral fingerprints of the applications. There has been relatively less work on characterizing the applications in terms of the network destinations they visit, and the nature of these destinations. We focus on analyzing the network end-points in depth and on understanding similarities (or differences) in certain app categories in terms of this behavior. We start by presenting some high level statistics of network end-points, across the set of applications analyzed.

Apps URLs and domains: We see a tremendous range in application behavior: a large number of applications generate no traffic at all, while some applications generate well in excess of 1000 HTTP requests. We find that the app Music Volume EQ connects to almost 2000 distinct URLs. Interestingly, Music Volume EQ is a volume slider app, and not an app that would really require access to the network.

By all accounts, these numbers are large, especially considering that our methodology does not support authenticating against user accounts (such applications will not progress beyond the login screen). Fig. 1(a) shows the distribution of the number of URLs visited by each executed application. From the figure, about 10% of the apps tested connect to more than 500 distinct URLs (recall that the execution of each application only lasts a few minutes). This level of “chattiness” significantly impacts resource usage on the mobile device. Interestingly, we still identify apps which do not engage in network activity, although they declare (in the manifest file) that they require network access.

Table 2: Top 10 Applications, by URLs (left) and by Top Level Domains (right)

In Table 2, we enumerate the top 10 applications seen in our dataset, ranked by the number of URLs connected to (at least 25 applications connect to more than 1000 URLs during execution). These applications are very diverse, from weather to music and budget. This confirms the need to consider a broad and varied dataset rather than focusing on specific categories.

Multiple URLs can correspond to the same domain. The number of distinct domains an app connects to captures the different activities carried out inside the application. Across the applications in our dataset, the median number of domains connected to is 4, while some apps connect to more than 100. For example, Morandini Blog, which is a blog reader application, communicates with 113 distinct domains. Interestingly, it connects with 6 different ad networks, along with a number of analytics and tracking websites. In Fig. 1(b), we look across applications and plot the distribution of domains communicated with by each application.

About half of the apps connect to 4 or fewer domains, and we also see significant variability across applications. Roughly 10% of the apps connect to 20 or more domains over the execution window. Table 2 enumerates the top 10 apps ranked by the number of distinct domains connected to. Rows marked in italics denote apps that also happen to fall in the top 10 when ranked by the number of URLs communicated with (cf. Table 2).
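The per-app counts behind these statistics can be sketched as below. This is an illustrative sketch, not the authors' tooling: the input mapping from app name to observed URLs is hypothetical, and the domain is approximated by the last two labels of the host name, which is an assumption that mishandles multi-label suffixes such as co.uk.

```python
from statistics import median
from urllib.parse import urlsplit


def naive_domain(url):
    """Last two labels of the host name (assumption: ignores suffixes like co.uk)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def per_app_counts(urls_by_app):
    """Map each app to (number of distinct URLs, number of distinct domains)."""
    counts = {}
    for app, urls in urls_by_app.items():
        distinct = set(urls)
        counts[app] = (len(distinct), len({naive_domain(u) for u in distinct}))
    return counts


def median_domains(counts):
    """Median number of distinct domains per app."""
    return median(domains for _, domains in counts.values())


def share_at_least(values, threshold):
    """Fraction of values that are >= threshold."""
    values = list(values)
    return sum(v >= threshold for v in values) / len(values)
```

With such counts, a statement like "roughly 10% of the apps connect to 20 or more domains" corresponds to share_at_least applied to the per-app domain counts with a threshold of 20.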

Looking across applications, Table 3 enumerates the 20 most frequently contacted domains, which provides some insights about the nature of communication between the application and website. Unsurprisingly, 9 of the top 10 in this set correspond to various web services run by Google. The most popular domain in the list, doubleclick.net, is an advertising platform that tracks end-users, and also serves up advertisements. While Google.com is generally considered as the search engine portal, in our traces we found two predominant patterns associated with this particular domain: (i) www.google.com/images/cleardot.gif?zx=<str>, which corresponds to 1x1 tracking pixels, and (ii) www.google.com/ads/user-lists/<id>/?script=<num>&random=<num>, which seems to indicate some form of user tracking.

Figure 1: URL and domain counts by application. (a) Number of URLs per app. (b) Number of domains per app.

doubleclick.net        0.415
ajax.googleapis.com    0.058

Table 3: Top 20 popular domains (with fraction of applications connecting to them)

Figure 2: Ad URLs and tracker distributions

Top Level Domain     Popularity
doubleclick.net      37,153 (50.6%)
gstatic.com          16,532 (22.5%)
admob.com            3,603 (4.9%)
smartadserver.com    3,411 (4.6%)
inmobi.com           1,399 (1.9%)

Table 4: Top 5 ad related Top Level Domains

While enumerating the communicating domains can be quite instructive, it does not reveal much about the nature of the communication between app and domain. Understanding the type (or category) of the domain can yield a better sense of this communication. Typically, web domains are set up for well defined functions (e.g., doubleclick.net as an ad platform, google-analytics as a tracking and analytics service, etc.) and communication between the app and domain is generally consistent with the service offered by the domain. We use the following methodology to identify domain categories. First, we classify each URL as a tracker URL, ad related or other. In the first two cases, the services are obvious, and we consider them independent categories. In the latter case, other, we extract the fully qualified domain name (FQDN) embedded in the URL and rely on the service provided by Websense.com to obtain a characterization (i.e., category) for the domain in question. Then we gather the Websense categories of all the FQDNs that belong to the same top level domain, and assign the majority class as the category for the top level domain. Going back to the domains listed in Table 3, we find that most of the 20 domains correspond to the category advertisements. In the rest of this section, we examine various types of destinations based on their domain categorization.
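The majority-class assignment for top level domains can be sketched as follows. This is a minimal sketch under stated assumptions: the per-FQDN category labels are taken as given (how they are fetched from the categorization service is not shown), the function and parameter names are hypothetical, and ties are broken arbitrarily in favor of the label seen first, which the text does not specify.

```python
from collections import Counter, defaultdict


def majority_category(fqdn_categories, domain_of):
    """Assign each top level domain the most common category among its FQDNs.

    fqdn_categories: dict mapping FQDN -> category label (hypothetical input,
                     e.g. labels previously obtained from a categorization service).
    domain_of:       function mapping an FQDN to its top level domain.
    """
    votes = defaultdict(Counter)
    for fqdn, category in fqdn_categories.items():
        votes[domain_of(fqdn)][category] += 1
    # most_common(1) returns the label with the highest count; ties fall to
    # the label seen first, which is an arbitrary choice in this sketch.
    return {domain: counter.most_common(1)[0][0] for domain, counter in votes.items()}
```

For instance, if two FQDNs under example.com are labeled advertisements and one is labeled information technology, the domain example.com is assigned advertisements.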

Ad related sites: Recall that we use AdBlock’s EasyList subscription to identify advertising related destinations. Figure 2 shows the distribution of the number of ad URLs visited per app. We observe that 33% of the apps do not communicate with any ad destinations. On the other hand, we also see apps that connect to a very large number (>1000) of ad URLs. Overall, the average number of ad URLs associated with an application is about 40. Examining the domains of ad URLs, we find that the three most prominent ad related domains are all part of Google, as listed in Table 4. Thus, while Google does not directly make any revenue from Android itself (which is openly licensed to manufacturers), it is able to extract revenue from the ads business around the ecosystem. We further investigate ad URLs for particular apps in Section 6.

User tracking related sites: We now look closely at the URLs in our dataset that correspond to destinations that track end-users and devices, as encoded in AdBlock’s EasyPrivacy subscription lists. Previous studies have reported that such practices have largely negative connotations with users [13, 19]. Given this, it is rather surprising that tracking is so widespread and, more importantly, completely opaque to end-users. While the Do Not Track [7] policy has been proposed by consumer advocates and has gained some acceptance, the mechanism is restricted to web browsers, and does not extend to mobile apps in general.

Table 5: Top 5 tracking related Top Level Domains

Domain category                          Popularity
search engines and portals               33 (1.66%)
reference materials                      23 (1.16%)
internet radio and tv                    23 (1.16%)
application and software download        20 (1.01%)
blogs and personal sites                 17 (0.86%)
vehicles                                 17 (0.86%)
social networking                        16 (0.81%)
personal network storage and backup      15 (0.76%)

Table 6: Popularity of Domain Categories

In Figure 2, we plot the distribution of tracking URLs associated with each application. We observe that the vast majority (73.2%) of apps do not communicate with trackers at all, while the remaining apps do, and for some of them the number of tracker URLs exceeds 800. In Table 5, we enumerate the most popular domains associated with trackers, where popularity is defined as the number of tracker URLs, seen across all the apps, associated with a specific domain. In contrast to the results about ad-related destinations, we find the mobile tracking ecosystem to be significantly more fragmented, with many more players, even if the dominant player is associated with Google. We further investigate tracker URLs for specific apps in Section 6.

Other web categories: We now examine the aggregate set of URLs after having removed those that correspond to the previous two categories. Table 6 enumerates the 20 domain categories with the highest number of domains associated with them. The most popular category, which covers about 22% of the total domains observed, is denoted Information Technology and this appears to cover a number of miscellaneous web services. The next two identifiable categories correspond to dynamic content and advertisements, and both of these are very likely related to the online advertising ecosystem. Note that we see a large count for these even though we filter out those described in EasyList previously; the remaining URLs not filtered are likely due to new patterns not in EasyList or perhaps connections to ad related websites that do not involve ad placement inside the mobile application.

Table 7: Malicious domains based on Webutation engine (columns: domain category, fraction of malicious domains).

Apart from these, we see small domain counts across a varied set of categories. In the next section, we examine these domain categories in detail and relate them to the category of the app itself.

URL badness: Finally, we explore an additional characteristic of the domains being connected to: “badness”. Recall that VirusTotal aggregates results from a number of engines; these relate to the “suspiciousness” of a URL. While this term is somewhat ambiguous, the qualitative results can be explained thus: the engines used by VirusTotal independently crawl the URLs and catalog the various objects on them. URLs that host executable content that is deemed malware-like are deemed suspicious. Note that reliably determining malicious intent is extremely challenging and quite outside the scope of our work. For our purpose, we simply quantify whether any engine marked the URL as such, and analyze this across the set of domain categories. By suspicion score for a URL, we denote the fraction of antivirus engines (VirusTotal uses 52 in all) that deem the URL suspicious (or malicious). Our results show that 94.4% of the URLs have a (suspicion) score of 0. In the worst case, a URL was deemed suspicious by 3 (of 52) engines.
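The suspicion score defined above is a simple ratio, sketched below for concreteness. The input structure (a mapping from engine name to a boolean verdict) is a hypothetical simplification and not the VirusTotal response format.

```python
def suspicion_score(verdicts):
    """Fraction of engines that flagged the URL as suspicious or malicious.

    verdicts: dict mapping engine name -> True if that engine flagged the URL
              (hypothetical structure, not the VirusTotal response format).
    """
    if not verdicts:
        return 0.0
    return sum(verdicts.values()) / len(verdicts)


def share_clean(scores):
    """Fraction of URLs whose suspicion score is exactly 0."""
    scores = list(scores)
    return sum(score == 0 for score in scores) / len(scores)
```

Under this definition, the worst case reported above (3 of 52 engines) corresponds to a score of 3/52, or about 0.058.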

Suspicious domains: For classification of the suspicious domains, we use the Webutation engine. Webutation is an open community about website reputation. It tests websites against spyware, spam and scams. Apart from collecting user feedback, Webutation queries various trusted engines like Google Safe Browsing or Norton Antivirus to check for malicious software and other dangerous elements.

Overall, our analysis shows that a small portion of the domains have been classified by Webutation as suspicious or malicious. Specifically, we observe 2.5% suspicious, 2.9% malicious, 61% unsure (not a clear verdict) and 33.6% safe domains. Table 7 further shows the domain categories with the highest fraction of malicious domains. We observe that the top-3 malicious domain categories are “sex”, “personals and dating” and “ads”. In the following section, we devise a suspicion metric for mobile apps, and we investigate the most suspicious apps.

Application                            Ads     Rating   Downloads
cart.tabs.sw                           1174    -        -
VidTrim - Video Trimmer                1065    4.2      10,000,000
Simulateur laser                       1019    2.1      5,000,000
Music Volume EQ                        999     4.2      10,000,000
signal.booster.conchi.amplificador     940     -        -
com.HillieMelani.VideoEditor           720     -        -
Football365                            700     3.9      10,000
Decibel (Sonometre reactif)            671     4.8      1,000
Nail Art Tutorials 2014                630     3.7      100,000
Veilleuse en Couleurs                  538     3.8      10,000

Table 8: Top 10 apps connecting to ad URLs

6 Detailed Apps Characterization

In this section, we focus on individual applications and obtain an understanding of their behavior along the three axes discussed previously. Users value (or are annoyed by) different things: some users value privacy (and tend to avoid applications with significant tracking), while others value security (and wish to avoid applications with suspicious or unreasonable behavior). To this end, we study the most prolific applications along these axes and gain some insight into their behavior.

6.1 Advertising Intensity

The Internet ecosystem, along with the mobile app marketplace, is largely driven by advertising revenue. The vast majority of mobile apps offer their services to the user for free, and are directly monetized by selling “real estate” (smartphone screen or website) on which ads are inserted. All of the applications in our dataset are “free” and we expect that the majority of them will connect to ad sites. This is confirmed in Fig. 2, where more than 66% of the applications contact ad URLs.

Some of the advertising APIs and engines are very aggressive in downloading ads into the mobile app screen. For example, AirPush is one such infamous service and is so aggressive that the PlayStore lists a number of applications whose sole function is to detect this API and notify the user. Several mobile ad APIs collect detailed device information (OS version, IMEI, location, IP address, etc.), sometimes without the users’ knowledge. In general, end users find that ads (esp. display ads which are not targeted based on user interests) degrade the user experience of mobile apps and services.

Table 8 lists the top 10 apps ordered by the number of ad related URLs connected to. All the apps were executed for just a few minutes, and even in this brief interval, we see some apps with a very large number of connections to ad sites. With the exception of two applications (Music Volume EQ and VidTrim - Video Trimmer), none of the others is both frequently downloaded (popular) and rated positively by users. Note that this information is missing for some of the apps that were removed from the PlayStore soon after we downloaded the APK for testing (and no further information is available).

Application                    Trackers   Rating   Downloads
Eurosport Player               810        3.2      500,000
RunKeeper                      804        4.4      10,000,000
Gestion du budget              725        4.3      500,000
Logo Quiz                      301        4.5      10,000,000
Expedia Hotels et Vols         266        4.0      5,000,000
Vos Droits Quotidien           264        4.3      100
France TV Replay               261        2.8      10,000
Iron Man 3 Live Wallpaper      250        4.0      5,000,000
beIN SPORTS                    236        3.6      500,000
NipCast                        235        4.7      100

Table 9: Top 10 apps connecting to tracker URLs (italics indicate Top Developer status)
