Spam campaign detection, analysis, and formalization

Thesis by Mina Sheikhalishahi, Doctorate in Computer Science, Philosophiæ doctor (Ph.D.), Québec, Canada. © Mina Sheikhalishahi, 2016

Supervised by:

Research supervisor: Mohamed Mejri
Research co-supervisor: Nadia Tawbi

Summary

Spam emails (unwanted or junk emails) impose extremely heavy annual costs in terms of time, storage space, and money on private users and companies. To fight the spam problem effectively, it is not enough to stop the spam messages delivered to the user's inbox. It is necessary either to try to find and prosecute the spammers, who generally hide behind complex networks of infected devices, or to analyze spammer behavior in order to find appropriate defense strategies.

However, such a task is difficult because of camouflage techniques, which make a manual analysis of correlated spam emails necessary to find the spammers.

To facilitate such an analysis, which must be carried out on large quantities of unclassified emails, we propose a categorical clustering methodology, named CCTree, that divides a large volume of spam into campaigns based on their structural similarity. We show the effectiveness and efficiency of the proposed clustering algorithm through several experiments. Next, a self-learning approach is proposed to label spam campaigns according to the goal of the spammer, for example phishing. The labeled spam campaigns are used to train a classifier, which can be applied to classify new spam emails. In addition, the labeled campaigns, together with a set of four other ranking criteria, are ordered according to investigators' priorities.

Finally, a semiring-based structure is proposed for the abstract representation of CCTree. The abstract schema of CCTree, named CCTree term, is applied to formalize the parallelization of CCTree. Through a number of mathematical analyses and experimental results, we show the efficiency and effectiveness of the proposed framework.

Abstract

Spam emails impose extremely heavy yearly costs in terms of time, storage space, and money on both private users and companies. To fight the problem of spam emails effectively, it is not enough to stop spam messages from being delivered to the end user's inbox or to collect them in a spam box. It is necessary either to try to find and prosecute the spammers, who generally hide behind complex networks of infected devices that send spam emails against their users' will, i.e. botnets, or to analyze spammer behavior to find appropriate strategies against it. However, such a task is difficult because of camouflage techniques, which make a manual analysis of correlated spam emails necessary to find the spammers.

To facilitate such an analysis, which must be performed on large amounts of unclassified raw emails, we propose a categorical clustering methodology, named CCTree, that divides a large amount of spam emails into spam campaigns by structural similarity. We show the effectiveness and efficiency of the proposed clustering algorithm through several experiments.

Afterwards, a self-learning approach is proposed to label spam campaigns based on the goal of the spammer, e.g. phishing. The labeled spam campaigns are used to train a classifier, which can be applied to classify new spam emails. Furthermore, the labeled campaigns, together with a set of four additional ranking features, are ordered according to investigators' priorities.

A semiring-based structure is proposed to abstract the CCTree representation. Through several theorems, we show that under some conditions the proposed approach fully abstracts the tree representation. The abstract schema of CCTree, named CCTree term, is applied to formalize CCTree parallelism.

Through a number of mathematical analyses and experimental results, we show the efficiency and effectiveness of the proposed framework as an automatic tool for spam campaign detection, labeling, ranking, and formalization.

Table of Contents

Summary
Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgments

1 Introduction
1.1 Motivation
1.2 Main Contributions
1.3 Thesis Outline

2 State of the Art
2.1 Spam Emails Issues
2.2 Clustering Spam Emails into Campaigns
2.3 Labeling and Ranking Spam Campaigns
2.4 On the Formalization of Clustering and its Applications

3 Spam Campaign Detection
3.1 Introduction
3.2 Preliminary Notions
3.3 Related Works
3.4 Categorical Clustering Tree (CCTree)
3.5 Time Complexity
3.6 Conclusion

4 Effectiveness and Efficiency of CCTree in Spam Campaign Detection
4.1 Introduction
4.2 Framework
4.3 Evaluation and Results
4.4 Discussion and Comparisons
4.5 Related Work

5 Labeling and Ranking Spam Campaigns
5.1 Introduction
5.2 Related Work
5.3 Digital Waste Sorting
5.4 Results
5.5 Ranking Spam Campaigns

6 Algebraic Formalization of CCTree
6.1 Introduction
6.2 Related Work
6.3 Feature-Cluster Algebra
6.4 Feature-Cluster (Family) Term Abstraction
6.5 Relations on Feature-Cluster Algebra
6.6 CCTrees Parallelism

7 Conclusions and Future Work
7.1 Thesis Summary
7.2 Future Work

A Appendix
A.1 Source Codes of Proposed Approach
A.2 Tables of Attributes

Bibliography

List of Tables

4.1 Features extracted from each email
4.2 CCTree internal evaluation with fixed number of elements
4.3 Internal evaluation results of CCTree, COBWEB and CLOPE
4.4 Silhouette values and number of clusters as a function of µ for four email datasets
4.5 Silhouette result, Hamming distance, = 0.001, and µ changes
4.6 Number of clusters, = 0.001, and µ changes
4.7 External evaluation results of CCTree, COBWEB and CLOPE
4.8 Campaigns on the February 2015 dataset from five clustering methodologies
5.1 Features extracted from each email
5.2 Feature vectors of a spam email for each class
5.3 Classification results evaluated with K-fold validation on training set
5.4 Classification results evaluated on test set
5.5 Training set generated from small knowledge
5.6 DWS classification results for the labeled spam campaigns
5.7 Set of ranking features
5.8 Normalized score of spam campaigns label
5.9 Three first ranked campaigns
6.1 CCTree Rewriting System
6.2 Composition Rewriting System
7.1 Table of Notations
A.1 Language of spam message and subject
A.2 Type of Attachment
A.3 Attachment Size
A.4 Number of attachments
A.5 Average size of attachments
A.6 Type of Message
A.7 Length of Message
A.8 IP-based links verification
A.9 Mismatch links
A.10 Number of links
A.11 Number of Domains
A.12 Average number of dots in links
A.13 Hex character in links
A.15 Characters in subject
A.16 Non ASCII characters in subject
A.17 Recipients of spam email

List of Figures

1.1 Steady volume of spam
1.2 McAfee Report 2015
1.3 The framework of thesis
3.1 Dataset 1
3.2 Dataset 2
3.3 Spam 1
3.4 Spam 2
3.5 A Small CCTree
4.1 CCTree(0.001,1)
4.2 CCTree(0.01,1)
4.3 CCTree(0.1,1)
4.4 CCTree(0.5,1)
4.5 Internal evaluation at the variation of the parameter
4.6 COBWEB
4.7 CCTree(0.001,1)
4.8 CCTree(0.001,10)
4.9 CCTree(0.001,100)
4.10 CCTree(0.001,1000)
4.11 CLOPE
4.12 Silhouette as a function of the number of clusters for different values of µ
4.13 Silhouette (Hamming)
4.14 Generated Clusters
4.15 Silhouette (Hamming)
4.16 Generated Clusters
5.1 Advertisement
5.2 Portal
5.3 Fraud
5.4 Malware
5.5 Crypto Ransomware volume
5.6 Phishing
5.7 DWS Workflow
5.8 Insert new instance X in a CCTree
5.9 ROC curve / Advertisement
5.10 ROC curve / Portal Redirection
5.12 ROC curve / Malware
5.13 ROC curve / Phishing
6.1 A Small CCTree

To my love, my family, and to anyone who looks for worldwide peace and happiness

Acknowledgments

Though only my name appears on the cover of this dissertation, a great many people have contributed to its production. I owe my gratitude to all those people who have made this dissertation possible.

First and foremost, I want to thank my supervisor, Professor Mohamed Mejri, for accepting me into his research group, which improved my view of life. I appreciate all his contributions of time, ideas, patience, and funding, which made my Ph.D. experience productive and stimulating. I thank him for allowing me to grow as a research scientist, and for all his patience and support. I would also like to express my deep thanks to my co-advisor, Professor Nadia Tawbi, who has always been there to listen and give advice. I thank her for all her kind moral and financial support and for helpful discussions at different stages of my Ph.D. studies. I gratefully acknowledge her support for my cooperation with the IIT-CNR research group, which changed my life.

I really appreciate the insightful comments and constructive criticism of my advisor and co-advisor at different stages of my research, as well as their encouragement to use correct grammar and consistent notation in my writing.

Besides my advisors, I would like to thank the rest of my thesis committee: Prof. Fabio Martinelli, Prof. Raphael Khoury, and Dr. Ilaria Matteucci, for their insightful comments and encouragement. Special thanks to Professor Fabio Martinelli for welcoming me into his research group at IIT-CNR, Italy, which enriched my research experience.

My time in Quebec was made enjoyable in large part by the many friends who became part of my life. I am grateful to my dearest Shadi, who supported me continuously during my three years in Quebec. With her presence in Quebec, I always felt I had a family member taking care of me. I am grateful to my kind friend Bahareh, whom I bothered several times from Italy to do something in Quebec on my behalf. Thanks to my other kind friends in Quebec: Elaheh, Afrooz, Sheyda, Soamyeh.

I am especially grateful to my best friend Sara, who, in the most difficult moments of my Ph.D., was always available from Iran to send me messages and to support, encourage, and motivate me. She was always there to listen to me, despite the different time zones of Iran and Canada. I will always appreciate all her kind and continuous support.

Many thanks to my other friends from Iran: Mahboobeh, for continually remembering me and praying for me, and Mahmoud, for always following my weblog and motivating me.

I would like to deeply thank my family for all their love and encouragement. To my father, who always motivated us to read, to know, and to follow our dreams, and who always loves us as we are. To my mother, who finally accepted my travel to Canada although she was never convinced, for all the worries she went through during my Ph.D. and for all her patience while I was following my dreams, even against her own.

Thanks to my dearest sister, Mojgan, who was my link to Iran. She always took care of whatever I needed done in Iran and always motivated me with her typical sweet words. Many thanks to my brother, Mohammad, who always supported me in all my pursuits and of whom we are always proud. I am also grateful to Hamed, my brother-in-law, who called me many times from Iran to tell me that they all love me and miss me. To my kindest aunt, Azra, who always teaches us that you can still smile when life is passing through its most difficult and challenging stage.

Most of all, I would like to express my deep gratitude to my colleague, my friend, my love, Andrea, who cleared many of the obstacles I faced along my Ph.D. path, and who, from the first moments of my arrival in Italy, generously shared his research experience with me. Many thanks for all his faithful support, patience, and encouragement during the difficult stages of my Ph.D. thesis. Thanks for his presence in my life, for all the happiness he brought with him, and for making me feel that I am able to make all my dreams come true.

Mina Sheikhalishahi
Laval University
Quebec, Canada

Chapter 1

Introduction

The term spam became well known through a sketch in the comedy program “Monty Python’s Flying Circus”, in which a waitress offers dishes containing an unknown ingredient called spam, a brand of canned meat produced by the American company Hormel Foods Corp. In the sketch, every dish in the restaurant is served with lots of spam, and the waitress repeats the word spam many times while describing how much spam is on each plate. A group of Vikings in the corner then starts a song: “Spam, spam, spam, spam, spam, spam, spam, spam, lovely spam! Wonderful spam!” Hence, the term came to refer to something that keeps repeating, to great annoyance.¹ Owing to the success of this program, and probably also because the canned meat was the only nutritious food available in England during the Second World War, the term “SPAM” came to indicate something inevitably omnipresent.

The name was later carried over to unwanted electronic messages. The first spam email is believed to have been sent on 1 May 1978 by Digital Equipment Corporation to advertise a new product; it went to all ARPAnet users on the West Coast of the United States, a few hundred people.²

Only many years later, in January 1994,³ was the first large-scale unwanted commercial message distributed across USENET, titled “Global Alert for All: Jesus is Coming Soon”. It was posted to every newsgroup, and the term thereafter denoted unwanted messages sent massively to unwilling recipients.

More precise definitions of spam email were introduced later in the literature. [8] define spam email, also known as junk email or unsolicited bulk email, as an electronic message sent in bulk against the will of the receiver. [83] define spam email as an unwanted email sent indiscriminately by a sender who has no current relationship with the receiver.

Nowadays, spam emails are not just undesired advertisements. The problem of unsolicited emails imposes enormous costs on companies and private users [113], [83], [84]. Currently proposed approaches [30], [46], [123], though quite effective in stopping spam emails from being delivered to end users' inboxes [21], [89], do not offer a methodology for organizing huge amounts of messages so as to fight the root of the problem, i.e. the spammer.

1. http://www.internetsociety.org/

Any effort in this regard requires a first analysis of a large amount of spam emails, mostly collected in honeypots. This first analysis demands grouping a huge amount of data into smaller groups, named spam campaigns, which are assumed to originate from the same source (spammer). Then, a classifier must be trained to label and group new spam emails. Furthermore, the large set of detected spam campaigns should be ordered automatically according to the investigators’ priorities.

Figure 1.1 – Steady volume of spam.

To this end, in the present thesis, we first propose a fast and effective categorical clustering algorithm, named CCTree, to detect spam campaigns on the basis of the structural similarity of messages. Afterwards, we propose a self-learning methodology to automatically label the detected spam campaigns based on the goal of the spammer. The labeled campaigns are ranked automatically according to a set of ranking priorities. A semiring-based formal method is proposed to abstract the CCTree representation. The abstract form is used to formalize the process of clustering spam emails on parallel computers, which may help to speed up spam campaign detection.

1.1 Motivation

Being incredibly cheap to send, spam messages are widely used by adversaries to steal money, distribute malware, advertise goods and services, etc.

The 2015 Cisco report [36] shows that although adversaries develop more sophisticated techniques to breach network defenses, spam emails continue to play a major role in these attacks, and the worldwide volume of spam has remained relatively consistent (Figure 1.1). Furthermore, it has been shown [36] that 4.5 billion emails are blocked every day. The Internet Threats Trend Report [114] estimates that 54 billion spam emails were sent per day in 2014. According to the McAfee 2015 report [100], unsolicited emails made up more than 70 percent of all email messages in 2014 (Figure 1.2).

Figure 1.2 – McAfee Report 2015.

Microsoft and Google [113] estimate that spam emails cost American firms and consumers up to 20 billion dollars per year. Ferris Research estimated the worldwide cost of spam in 2005 at $50 billion, and raised its estimate to $100 billion in 2007 and $130 billion in 2009⁴ [112]. [83] report that 382 million mailing attempts resulted in 28 sales. Yahoo! data on similar “high ticket” items, which were sold with a marginal profit of more than $50, show conversion rates of about 1 in 25,000 [112].
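To put the two reported rates side by side, the short calculation below works out the rate implied by 28 sales out of 382 million mailing attempts and compares it with the figure of about 1 in 25,000. It uses only the numbers quoted above; the variable names are illustrative.

```python
# Figures quoted in the text above.
mailing_attempts = 382_000_000    # spam mailing attempts reported by [83]
sales = 28                        # resulting sales reported by [83]
yahoo_attempts_per_sale = 25_000  # "about 1 in 25,000" for the Yahoo! data [112]

# Rate of sales per mailing attempt, and its inverse.
spam_rate = sales / mailing_attempts
attempts_per_sale = mailing_attempts / sales

# How many times lower the spam rate is than the 1-in-25,000 rate.
ratio = attempts_per_sale / yahoo_attempts_per_sale

print(f"sales per mailing attempt: {spam_rate:.2e}")
print(f"about 1 sale per {attempts_per_sale:,.0f} mailing attempts")
print(f"roughly {ratio:.0f} times lower than 1 in 25,000")
```

In other words, the spam campaign needed on the order of 13.6 million mailing attempts per sale, which is why only an extremely low sending cost can make such campaigns worthwhile.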

The problem of undesired electronic messages has become a serious issue because of the many troubles that spam causes to the Internet community. [5] categorize spam losses into three groups, named direct losses, indirect losses, and defense costs, and call the sum of these losses the society losses of spam. The sets of society losses proposed in [5] are listed below.

Direct losses by spam:

• “Money withdrawn from victim accounts
• Time and effort to reset account credentials (for banks and consumers)
• Distress suffered by victims
• Secondary costs of overdrawn accounts: deferred purchases, inconvenience of not having access to money when needed
• Lost attention and bandwidth caused by spam messages, even if they are not reacted to.”

Indirect losses by spam:

• “Loss of trust in online banking, leading to reduced revenues from electronic transaction fees, and higher costs for maintaining branch staff and cheque clearing facilities
• Missed business opportunity for banks to communicate with their customers by email
• Reduced uptake by citizens of electronic services as a result of lessened trust in online transactions
• Efforts to clean-up PCs infected with malware for a spam sending botnet”

Defense costs of spam:

• “Security products such as spam filters, antivirus, and browser extensions to protect users
• Security services provided to individuals, such as training and awareness measures
• Security services provided to industry, such as website take-down services
• Fraud detection, tracking, and recuperation efforts
• Law enforcement
• The inconvenience of missing messages falsely classified as spam”

The large amount of spam traffic between servers delays the delivery of legitimate emails; sorting out unsolicited messages takes time; and in the process of classifying messages as spam or legitimate, there is a risk of deleting an important email by mistake. Together, these problems create an unbearable situation for everyone who uses the Internet.

To give better insight into the direct and indirect losses of spam, we briefly present some reports here.

Microsoft and Google [113] estimate that spam emails cost American firms and consumers up to 20 billion dollars per year, whilst [83], [84] show that a successful spam campaign can earn revenues between $400k and $1000k. [133] estimated that the Cutwail botnet earns around $1.7 million to $4.2 million in one year for providing spam services. It has been calculated that a company with 1000 employees loses $500,000 per year in productivity costs resulting from spam messages.⁵

The most popular solution to the problem of spam is filtering [21]. Spam filtering can be defined as a methodology for dividing messages into spam and legitimate [21]. Currently, the most widely used approach for fighting spam emails consists in identifying and blocking them on the recipient's machine through filters [30], [46], [123], which are generally based on machine learning techniques or content features [22], [138], [139].

Although existing filtering algorithms often show an accuracy of more than 90% in experimental evaluations [21], [89], this does not stop spammers from imposing considerable costs on users and companies [113]. We believe the reason could be that the spammer, the root of the problem, perceives a minimal risk of being caught or traced.

To fight the problem of spam emails effectively, it is necessary to find and prosecute the spammers, who generally hide behind complex networks of infected devices that send spam emails against their users' will, i.e. botnets. Because of botnets, identifying the spammer is a difficult, though possible, task [142], [149], [45].

To simplify this analysis, huge amounts of spam emails first need to be divided into spam campaigns. A spam campaign is the set of messages spread by a spammer with a specific purpose [27], such as advertising a specific product, spreading ideas, or criminal intents, e.g. phishing. Grouping spam messages into spam campaigns reveals behaviors that may be difficult to infer when a large collection of spam emails is viewed as a whole [132]. It is worth noting that grouping a large amount of spam emails into smaller groups is an unsupervised learning task, because no labeled data is available at the outset for training a classifier. The proposed approach for clustering spam messages should be based on the premise that the general appearance of messages belonging to the same spam campaign remains mainly unchanged, although spammers usually insert random text or links [27]. The rationale is that two messages in the same format, i.e. similar language and size, the same number of attachments, the same number of links, etc., are more likely to originate from the same source (spammer) and thus to belong to the same campaign. Hence, the discriminative structural features of messages need to be selected correctly. Furthermore, the clustering algorithm should be fast and effective in grouping junk emails into spam campaigns.
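The idea of comparing messages by their format can be illustrated with a minimal sketch in which each email is reduced to a few categorical attributes and two emails are compared by counting the attributes on which they differ (a Hamming distance). The attribute names, the values, and the helper functions are hypothetical and chosen only for illustration; they are not the feature set or the similarity measure defined later in the thesis.

```python
from typing import Dict

Email = Dict[str, str]

# Hypothetical categorical descriptions of three emails. The attribute names
# and values are illustrative assumptions, not the thesis's feature set.
email_a: Email = {"language": "en", "size": "small", "attachments": "0", "links": "1-3"}
email_b: Email = {"language": "en", "size": "small", "attachments": "0", "links": "4-10"}
email_c: Email = {"language": "ru", "size": "large", "attachments": "1", "links": "0"}


def hamming_distance(x: Email, y: Email) -> int:
    """Count the attributes on which two emails take different values."""
    if x.keys() != y.keys():
        raise ValueError("emails must be described by the same attributes")
    return sum(1 for key in x if x[key] != y[key])


def structural_similarity(x: Email, y: Email) -> float:
    """Fraction of attributes on which the two emails agree (1.0 = identical)."""
    return 1.0 - hamming_distance(x, y) / len(x)


print(hamming_distance(email_a, email_b), structural_similarity(email_a, email_b))
print(hamming_distance(email_a, email_c), structural_similarity(email_a, email_c))
```

Under this toy encoding, `email_a` and `email_b` differ only in the number of links and would plausibly fall into the same campaign, whereas `email_c` differs on every attribute.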

Afterwards, each campaign should be assigned a label describing the purpose of the spammer. This goal-based labeling makes it easier for investigators to analyze spam campaigns, possibly focusing on a specific cybercrime. Moreover, labeling spam campaigns by the goal of the spammer can help to rank them.

Ranking spam campaigns according to the investigator’s priorities provides an ordered set of campaigns, on the basis of which the investigator decides which campaigns must be analyzed first. This decision is difficult to make when a large number of detected spam campaigns is viewed as a whole.

It is not uncommon for a data mining process to require several days or weeks to complete. Parallel computing systems bring a significant benefit, namely high performance, to the processing of massive databases [33]. Parallel clustering is a methodology proposed to alleviate the time and memory costs of clustering large amounts of data [94], [18].

Because of the huge amount of spam emails received, which grows rapidly every hour (8 billion per hour) [110], [101], and because of the high variance that related emails may show owing to obfuscation techniques [108], it would be helpful to parallelize the clustering process across several computers. Parallel clustering speeds up the process of grouping unwanted messages into spam campaigns.

In the present thesis, we address all the aforementioned issues related to spam campaign detection, analysis, and labeling, and to speeding up the process through parallelism with the use of formal methods. In what follows, the contributions of the thesis are explained in detail.

1.2 Main Contributions

The main contributions of this thesis can be summarized as follows:

• We propose a categorical clustering algorithm, named CCTree, designed to divide spam emails into smaller groups, named spam campaigns, based on structural similarity. The main hypothesis is that some parts of spam emails belonging to the same spam campaign remain unchanged. The CCTree has a tree-like structure in which the leaves represent the desired spam campaigns [126].

• A set of 21 categorical features is presented, which characterizes the structure of spam emails. An extensible and portable framework is provided to extract the proposed features automatically from raw emails. These features represent the structure of an email well, and some of them hardly change when a spammer creates a spam campaign [129].

• We propose, and validate through the analysis of 200k spam emails, a methodology for choosing the optimal CCTree configuration parameters. The proposed technique shows that once the input parameters of CCTree are chosen for a dataset, they can be used for similar datasets of comparable size [129].

• We show the effectiveness and efficiency of CCTree in clustering emails into campaigns through two well-known kinds of evaluation: internal evaluation, i.e. the ability of CCTree to obtain homogeneous clusters, and external evaluation, i.e. the ability to classify similar elements (emails) effectively when classes are known beforehand [129].

• We propose a framework, named Digital Waste Sorter (DWS), which exploits a self-learning approach based on the goal of the spammer for spam email classification. The approach aims at automatically classifying large amounts of raw unclassified spam emails by dividing them into campaigns and labeling each campaign with its spammer's goal. To this end, we propose five class labels that group spammer goals into five macro-groups, namely Advertisement, Portal Redirection, Advanced Fee Fraud, Malware Distribution, and Phishing [128].

• A ranking methodology is proposed to order sets of spam campaigns on the basis of investigator priorities. The proposed approach extracts five ranking features from each discovered spam campaign, according to investigator priorities. These features, which include the spammer-goal label of the campaign, are used to automatically assign a grade to each spam campaign. The spam campaigns are then ordered by their grades.

• A semiring-based formal method, named Feature-Cluster Algebra, is proposed to abstract the representation of CCTree. The resulting term equivalent to a CCTree is called a CCTree term. Through several theorems, we prove that the proposed algebraic structure, under some conditions, fully abstracts the tree representation. A rewriting system is proposed to verify automatically whether a term is a CCTree term or not [127].

• The abstract schema of CCTree is applied to formalize CCTree parallelism. The parallelism approach can be applied to speed up the clustering process on parallel computers. To formalize CCTree parallelism, a set of rewriting rules is provided to obtain a final CCTree from the CCTrees produced by the parallel computers. Through a set of examples and theorems, we show how the proposed approach works.
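The first contribution above describes a tree whose leaves are the spam campaigns. The following is a minimal sketch of such a tree-structured categorical clustering, not the CCTree algorithm itself, which is defined in Chapter 3. The splitting rule (split on the attribute with the highest Shannon entropy), the stopping rule (stop when a node is small or nearly homogeneous), and the names `cluster_tree`, `max_entropy`, and `min_size` are assumptions made for illustration. A non-empty attribute list is also assumed.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List

Record = Dict[str, str]


def attribute_entropy(records: List[Record], attribute: str) -> float:
    """Shannon entropy (in bits) of one attribute's values over the records."""
    counts = Counter(r[attribute] for r in records)
    total = len(records)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cluster_tree(records: List[Record], attributes: List[str],
                 max_entropy: float = 0.1, min_size: int = 2) -> List[List[Record]]:
    """Recursively split the records and return the leaves (candidate campaigns).

    Assumed criteria: stop when the node is small or its mean attribute
    entropy is low; otherwise split on the attribute with the highest entropy.
    """
    entropies = {a: attribute_entropy(records, a) for a in attributes}
    mean_entropy = sum(entropies.values()) / len(attributes)
    if (len(records) <= min_size or mean_entropy <= max_entropy
            or max(entropies.values()) == 0.0):
        return [records]
    split_on = max(attributes, key=lambda a: entropies[a])
    children = defaultdict(list)
    for r in records:
        children[r[split_on]].append(r)  # one child node per attribute value
    leaves: List[List[Record]] = []
    for child in children.values():
        leaves.extend(cluster_tree(child, attributes, max_entropy, min_size))
    return leaves


# Hypothetical emails described by three illustrative attributes.
emails = (
    [{"language": "en", "attachments": "0", "links": "many"}] * 3
    + [{"language": "ru", "attachments": "1", "links": "none"}] * 3
)
leaves = cluster_tree(emails, ["language", "attachments", "links"])
for leaf in leaves:
    print(len(leaf), leaf[0])
```

With this toy input, the root is split once and each resulting node is homogeneous, so it becomes a leaf. Splitting on one attribute at a time keeps each step cheap, which is in the spirit of the fast clustering the contributions call for, though the real criteria may differ from the ones assumed here.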

1.3 Thesis Outline

The present thesis is structured as follows. First, Chapter 2 provides a synthesis of related work on spam campaign detection, labeling, and formalization.

In Chapter 3, we propose a categorical clustering algorithm, named CCTree, to cluster spam emails based on structural similarity (step 1 in Figure 1.3). The result of this step is a set of spam campaigns, which are the leaves of the CCTree (step 2 of Figure 1.3).

The effectiveness and efficiency of CCTree in spam campaign detection are presented in Chapter 4.

In Chapter 5, we propose a self-learning approach to label spam campaigns on the basis of the goal of the spammers (steps 3 and 4 of Figure 1.3), and we rank the labeled spam campaigns (step 5 of Figure 1.3).

The aforementioned steps suffice to divide a large amount of spam emails into spam campaigns. On the other hand, a well-known technique for speeding up clustering algorithms is parallel clustering. In the rest of the thesis, we formalize CCTree parallelism, so that the whole data set can be divided among parallel computers (steps 6 and 7 of Figure 1.3). In Chapter 6, we abstract the CCTree representation with the use of a well-known algebraic structure, named semiring. We prove that the proposed algebraic technique abstracts the tree representation. The formal representation of a CCTree is named CCTree term. We propose a rewriting system to verify whether a term is a CCTree term or not. The CCTree term is used to formalize CCTree parallelism with the use of a rewriting system (step 8 of Figure 1.3). The result of the final CCTree is the set of spam campaigns (step 10 of Figure 1.3), which can be passed to the previously explained parts of the framework to be labeled and ranked. We conclude with future directions of the present thesis in Chapter 7.

Figure 1.3 – The framework of thesis.

Chapter 2

State of the Art

In line with the growing concerns regarding spam messages, an increasing number of works have been dedicated to the problem, studying it from different aspects. In this chapter, we present a comprehensive review of the literature on spam emails that is directly or indirectly related to our work. At the end of the chapter, we present studies on formal methods applied to the representation of feature models. We discuss how these formal approaches are similar to, and different from, our proposed semiring-based formalization technique for abstracting a feature-based categorical clustering algorithm and, ultimately, for speeding up the clustering process through parallelism.

2.1 Spam Emails Issues

In this section, we explain the different problems of spam emails discussed in the literature. Botnets are one of the main topics related to spam emails and have come under wide consideration in recent years. [76] report that more than 85% of worldwide spam is sent by botnets.¹ The term botnet refers to a group of compromised host computers controlled by a small number of commander hosts, referred to as command and control (C&C) servers. Compromised machines on the Internet are generally referred to as bots, and the set of bots controlled by a single entity is called a botnet [153]. In other words, a botnet is a network of “zombie” computers infected by malicious software (or “malware”) designed to enslave them to a master computer. The malware is installed in a variety of ways, such as downloading an attachment received in a spam email [25], [78], [35].

[146] perform a large-scale analysis of spamming botnet characteristics and identify trends that can benefit future botnet detection and defense mechanisms. The proposed framework is based on the premise that botnet spam emails are mostly sent in an aggregate fashion, resulting in content prevalence similar to worm propagation. The focus of the research is on URLs embedded in email content. Using three months of spam emails collected from Hotmail, the proposed framework, named AutoRE, yields several interesting results regarding the degree of email obfuscation, properties of botnet IP addresses, sending patterns, and their correlation with network scanning traffic [146].

[79] present a platform, named Botlab, which continually monitors and analyzes the behavior of spam botnets. The result of this study shows that six botnets are responsible for 79% of spam messages arriving at the University of Washington campus.

[96] first discuss the fundamental concepts of botnets, including formation and exploitation, lifecycle, and two major kinds of topologies. Several related attacks, as well as detection, tracing, and countermeasures, are then introduced.

[47] propose a spam zombie detection system, named SPOT, which is based on the Sequential Probability Ratio Test and monitors the outgoing messages of a network. Using a two-month e-mail trace collected in a large US campus network, they show that SPOT is an effective and efficient technique for automatically detecting compromised machines in a network.

[52] apply a PageRank approach, with an additional clustering algorithm, to efficiently detect stealthy botnets through peer-to-peer communication.

[133] provide interesting statistics about botnets: about 29.6% of bots are blacklisted within two hours, and 46.4% within three hours. By six hours, roughly 75.3% are blacklisted. The rate reaches 90% after a period of about 18 hours.

[142], [149], [45] propose several approaches to find the botmaster through stepping stones. [13], [122], [116], [107] provide a brief look at existing botnet research and at how the evolution and future of botnets, as well as the goals and visibility of today's networks, intersect to inform the field of botnet technology and defense.

Another topic related to the problem of spam emails is the cost of spam messages and the revenue of spammers.

[119] argue that marketing based on spam emails has the advantage of costing the sender very little. Hence, the sender sends a large number of messages to maximize the return. Several studies focus on what spammers earn from spam campaigns. The conversion rate of spam marketing is discussed in [83], while the underground economy of spam is analyzed in [133], [112], and [134]. [133] show that spam-as-a-service can be purchased for approximately $100–$500 per million emails sent. Botnets capable of sending 100 million emails per day can also be rented, for $10,000 per month, by groups interested in sending larger amounts of custom spam emails. In the same study, the authors consider that the Cutwail operators may have paid between $1,500 and $15,000 on a recurring basis to grow and maintain their botnet, and they estimate from advertised prices that the largest email address list (containing more than 1,596,093,833 unique addresses) is worth approximately $10,000–$20,000. Finally, the Cutwail gang's profit from providing spam services is estimated at around $1.7 million to $4.2 million since June 2009. They also observed that several individuals offer 10,000 malware installations for approximately $300–$800, and that rates for one million email addresses range from $25 to $50, with discounted prices for bulk purchases.

[84] show that a successful spam campaign can earn revenues between $400k and $1000k. The other side of the cost of spam has been evaluated as productivity cost. To measure the cost of spam emails in terms of productivity, suppose that an employee earns $80k per year on average and works 220 days per year. Suppose further that he receives 100 messages per day, 40 of which are spam, and that reading and deleting a message takes 5 seconds on average. He then earns about $45 per hour and needs about 3 minutes per day just to delete spam emails, so he loses $2.25 per day checking spam messages. This means that a company with 1000 employees loses about $500,000 per year in productivity because of spam messages.
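The arithmetic above can be reproduced with a short sketch. The 8-hour working day is an assumption (the text only gives the resulting hourly rate of about $45), and the variable names are illustrative.

```python
# Productivity cost of spam, following the figures given in the text.
# Assumption: an 8-hour working day (not stated in the text).
annual_salary = 80_000      # dollars per year
working_days = 220          # days per year
hours_per_day = 8           # assumed
spam_per_day = 40           # spam messages among the 100 received per day
seconds_per_spam = 5        # time to read and delete one message
employees = 1_000

hourly_rate = annual_salary / (working_days * hours_per_day)
minutes_lost = spam_per_day * seconds_per_spam / 60
daily_cost = hourly_rate * minutes_lost / 60          # per employee, per day
yearly_cost = daily_cost * working_days * employees   # whole company, per year

print(f"hourly rate: ${hourly_rate:.2f}")
print(f"minutes lost per day: {minutes_lost:.2f}")
print(f"cost per employee per day: ${daily_cost:.2f}")
print(f"cost per year for {employees} employees: ${yearly_cost:,.0f}")
```

Without the intermediate rounding used in the text ($45 per hour, 3 minutes per day), the yearly figure comes out somewhat higher, but it stays in the same range of roughly half a million dollars.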

Another main focus of research related to the problem of spam emails is spam filtering methods.

Spam filtering is based on the analysis of message contents and additional information, with the aim of distinguishing spam messages from legitimate ones [143], [21]. Generally, a spam filter is an application that implements a function of the following form:

\[
f(m, \theta) =
\begin{cases}
C(\text{spam}) & \text{if the message } m \text{ is spam} \\
C(\text{leg}) & \text{if the message } m \text{ is legitimate}
\end{cases}
\]

where m is a message to be classified, and θ is a vector of parameters, and C(spam) and C(leg) are labels assigned to the message.
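As a minimal illustration of this function, the sketch below treats θ as a list of suspicious keywords and labels a message as spam when enough of them occur. The keyword rule, the threshold, and all names are assumptions made for illustration only; the filters cited in this section learn θ from data.

```python
from typing import Sequence

C_SPAM, C_LEG = "spam", "legitimate"


def spam_filter(message: str, theta: Sequence[str], threshold: int = 2) -> str:
    """Toy instance of f(m, theta): theta is a list of suspicious keywords."""
    words = message.lower().split()
    hits = sum(1 for keyword in theta if keyword in words)
    return C_SPAM if hits >= threshold else C_LEG


theta = ["free", "winner", "viagra"]
label_a = spam_filter("You are a WINNER claim your free prize", theta)
label_b = spam_filter("Meeting moved to 3 pm tomorrow", theta)
print(label_a)  # spam
print(label_b)  # legitimate
```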

Spam filtering is mostly performed with machine learning algorithms, e.g. naive Bayesian approaches [9], [8], and other classifiers [75], [151], [90], [22], [138], [139]. The approaches proposed in the literature for filtering spam emails cover a variety of topics. [29] presents an overview of approaches aimed at spam filtering. Text analysis, which characterizes spam emails by the use of special words, is another applicable approach in the field of spam filtering. To this end, [48] apply lazy learning algorithms to tackle concept drift in spam filtering, while [80] use n-grams in a word-based anti-spam approach. Spammers started to obfuscate the text in spam messages, or to embed the text in images, to avoid being identified through text filtering techniques. Image spam filtering methodologies [10], [20] were then considered to block these kinds of spam messages.

Nevertheless, despite the growing research on spam filtering, which often reports accuracy above 90% [21], the growth of spam messages is still considerable. A filter prevents end-users from wasting their time on junk messages, but it does not stop the misuse of resources, since the messages are delivered anyway [21].

We believe the reason could be that the spammer, the root of the problem, perceives only a minimal risk of being caught.

To effectively fight the problem of spam emails, it is mandatory to find and prosecute the spammers, who generally hide behind complex networks of infected devices that send spam emails against their users' will, i.e. botnets. Because of botnets, identifying the spammer is a difficult task, although a possible one [142], [149], [45]. To this end, it is first required to divide the huge amount of spam emails efficiently and effectively, in a way that helps to catch the spammer.

2.2 Clustering Spam emails into Campaigns

Detecting a spammer, analyzing his behavior, and deciding which spammers should be pursued first constitute an extremely challenging task, due to the huge amount of spam emails, which increases vastly every hour (8 billion per hour) [110], [101], and to the high variance that related emails may show because of obfuscation techniques [108]. To this end:

• First of all, a fast and effective clustering algorithm is required to divide the huge amount of spam messages into smaller groups, each representing a spam campaign originating from the same source (spammer).

In the research field of spam emails, several works exist which cluster spam emails into spam campaigns.

The basic idea in [87] for identifying spam campaigns is based on keywords or strings that stand for specific types of campaigns. For example, all templates containing the string linksh are defined as a type of self-propagation campaign. Several campaign types related to the same spammer purpose constitute a campaign class. The purpose of a spam campaign is identified on the basis of keywords in the text or subject. The messages that contain no text, only an image, belong to the image campaign. In total, 10 spam campaign classes are presented: 1) Image spam; 2) Job ads; 3) Other ads; 4) Personal ads, containing fake dating/matchmaking advance-money scams; 5) Pharma, containing pointers to web sites selling Viagra, Cialis, etc.; 6) Phishing, which lures victims into entering sensitive information; 7) Political campaigning; 8) Self-prop, i.e. spam messages that trick victims into executing Storm binaries; 9) Stock scam, which tricks victims into buying a particular penny stock; 10) Other. Manual selection of keywords requires considerable repeated effort, while spammers frequently change the keywords that they use. Moreover, spammers continuously fight keyword-based approaches by means of obfuscation techniques.

[87] infer that 65 percent of campaign instances last less than 2 hours. The longest-lived ones are pharmaceutical campaigns, which were available for months, and a crucial self-propagation campaign, which ran for 12 days.

number of unique headers in template, but Pharma and self propagation have actually few different bodies. The authors suggest that it may be better to focus the clustering on headers to identify these three campaigns and then try to identify other campaigns using other techniques.

In [54], although the authors focus on the analysis of spam URLs in Facebook, the study of URLs and the clustering of spam messages are similar to our goal concerning spam emails. First, all wall posts that share the same URLs are clustered together; then the descriptions of the wall posts are analyzed, and if two wall posts have the same description their clusters are merged. In this study, factors such as bursty activity and distributed communication are also taken into consideration.

The distributed property in sending spam emails refers to the number of users who send spam messages in the cluster; in this case it is usually computed from the IP addresses of the senders, while for Facebook spam messages it refers to the users' unique IDs.

The bursty property comes from the rationale that most spam campaigns are active within a short period of time.

The threshold values for the distributed and bursty properties in this study are set to 5 and 1.5 hours, respectively. This means that if a spammer sends spam messages to fewer than 5 different accounts, or if the interval between messages is greater than 1.5 hours, he is considered to have no important effect on the system.
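The two thresholds can be read as a simple filter over clusters, sketched below. The function name, the input format, and the use of the median gap between consecutive messages as the interval are assumptions made for this illustration; the text does not specify how the interval is measured.

```python
from statistics import median

DISTRIBUTED_MIN_ACCOUNTS = 5    # threshold reported in the text
BURSTY_MAX_INTERVAL_H = 1.5     # threshold reported in the text, in hours


def is_distributed_and_bursty(account_ids, timestamps_h):
    """Return True if a cluster passes both thresholds.

    account_ids: one account identifier per message in the cluster.
    timestamps_h: time of each message, in hours.
    Assumption: burstiness is the median gap between consecutive messages.
    """
    distributed = len(set(account_ids)) >= DISTRIBUTED_MIN_ACCOUNTS
    times = sorted(timestamps_h)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    bursty = bool(gaps) and median(gaps) <= BURSTY_MAX_INTERVAL_H
    return distributed and bursty


# Hypothetical clusters: the first is spread over 6 accounts within 1.4 hours,
# the second involves only 2 accounts and spans 12 hours.
cluster_a = is_distributed_and_bursty(list("abcdef"), [0.0, 0.2, 0.5, 0.9, 1.0, 1.4])
cluster_b = is_distributed_and_bursty(["a", "a", "b"], [0.0, 5.0, 12.0])
print(cluster_a, cluster_b)  # True False
```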

Furthermore, the authors found that, to attract people's attention, the spammers' techniques can mostly (88.2%) be classified into three types: 1) they promise free gifts; 2) they use phrases that trigger curiosity, such as claiming that someone likes the recipient; 3) they describe a product for sale. It has been discovered that approximately 80 percent of malicious accounts are active for less than one hour and about 10 percent are active for longer than one day. In each time zone, most malicious wall posts were sent around 3 am to avoid detection, and among 187 million wall posts of 3.5 million Facebook users, 200,000 malicious wall posts were attributed to 57,000 malicious accounts.

[92] argue that spam emails with identical URLs are highly clusterable and mostly sent in bursts. In their method, if the same URL exists in spam emails from source A and source B, each of which has a unique IP address, the two sources are connected by an edge, and the connected components of the resulting graph are the desired clusters. It is also observed that if a spammer is associated with multiple groups, it has a higher probability of sending more spam mails in the near future. Furthermore, the authors found that a very small fraction of the active spammers accounted for a large portion of the total spam mails. They also inferred that spam emails from the same group of spammers are sent in bursts.
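The graph construction described above can be sketched as follows. This is not the implementation of [92]; it is a minimal illustration that uses a union-find structure to obtain the connected components, with a hypothetical input format of (source IP, set of URLs) pairs.

```python
from collections import defaultdict


def cluster_by_shared_url(emails):
    """emails: list of (source_ip, set_of_urls). Returns a list of sets of source IPs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_source_for_url = {}
    for source, urls in emails:
        find(source)                        # register the source as a vertex
        for url in urls:
            if url in first_source_for_url:
                # Same URL seen from another source: connect the two sources.
                union(source, first_source_for_url[url])
            else:
                first_source_for_url[url] = source

    groups = defaultdict(set)
    for source in parent:
        groups[find(source)].add(source)
    return list(groups.values())


emails = [
    ("10.0.0.1", {"http://a.example/x"}),
    ("10.0.0.2", {"http://a.example/x", "http://b.example/y"}),
    ("10.0.0.3", {"http://b.example/y"}),
    ("10.0.0.4", {"http://c.example/z"}),
]
clusters = cluster_by_shared_url(emails)
print(sorted(sorted(c) for c in clusters))
```

In this example the first three sources end up in one cluster because they are linked through shared URLs, while the fourth source forms a cluster of its own.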

Spamscatter [4] is a method that automatically clusters the destination web sites extracted from URLs in spam emails with the use of image shingling. In image shingling, images are divided into blocks and the blocks are hashed; two images are then considered similar if 70 percent of the hashed blocks are the same. The lifetime of each detected spam campaign is computed by finding the first and the last spam message, in terms of time, in the campaign. The results show that over 40% of the malicious scams persist for less than 120 hours, whereas the lifetime for the same percentage of shopping scams is 180 hours and the median for all scams is 155 hours.
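The idea of image shingling can be illustrated with the sketch below. It is not the Spamscatter implementation: the block size, the hash function, the flat grayscale image format, and the position-by-position comparison of blocks are all assumptions made for this example. Only the 70 percent threshold comes from the text.

```python
import hashlib


def block_hashes(pixels, width, block=4):
    """Split a row-major grayscale image (bytes) into block x block tiles and hash each tile."""
    height = len(pixels) // width
    hashes = []
    for top in range(0, height - block + 1, block):
        for left in range(0, width - block + 1, block):
            tile = b"".join(
                pixels[(top + r) * width + left:(top + r) * width + left + block]
                for r in range(block)
            )
            hashes.append(hashlib.sha256(tile).hexdigest())
    return hashes


def similar(img_a, img_b, width, threshold=0.70):
    """Two equally sized images are similar if enough aligned blocks hash equally."""
    a, b = block_hashes(img_a, width), block_hashes(img_b, width)
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / max(len(a), 1) >= threshold


# Hypothetical 8x8 images, i.e. four 4x4 blocks each.
base = bytes(range(64))
one_block_changed = bytearray(base)
one_block_changed[0] = 255                 # alters the top-left block only
two_blocks_changed = bytearray(base)
two_blocks_changed[0] = 255                # top-left block
two_blocks_changed[63] = 0                 # bottom-right block

print(similar(base, bytes(one_block_changed), width=8))   # 3 of 4 blocks match
print(similar(base, bytes(two_blocks_changed), width=8))  # 2 of 4 blocks match
```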

[150] cluster spam messages based on spam images in order to trace the origins of spam emails. To this end, spam images are divided into two parts: the foreground, which comprises the text and/or illustrations, and the background, which consists of the colors and/or textures. Spam emails are considered visually similar if their illustrations, text, layouts, and/or background textures are similar. The clustering has two stages: first, Optical Character Recognition is used to recognize the texts, whose bounding boxes represent the text layout; afterwards, the illustrations are separated from the background by detecting the background. The authors mention that the proposed approach needs to be combined with other methods to obtain better results.

[130] focus on clustering spam emails based on the IP addresses resolved from the URLs inside the body of these emails. The rationale is that, according to the authors, in many cases it is not easy to change the IP addresses, since doing so requires compromising a lot of computers. In this study, two emails belong to the same cluster if the IP addresses resolved from their URLs are exactly the same. Afterwards, the relationship between the spam sending systems and the malicious Web servers connected to the URLs is examined, and information such as the number of unique URLs and unique domain names is provided.

By examining three weeks of spam messages gathered on an SMTP server, the authors conclude that the proposed methodology outperforms clustering techniques based on domain names and URLs. They justify the claim by the fact that the domain names associated with a scam change frequently, that the period during which a URL is active is too short for performing an investigation, and that most of the time the URLs used in spam emails are unique.

In all the aforementioned works for clustering spam emails into campaigns, the pairwise comparison of emails is required, so the time complexity is quadratic. Furthermore, spam campaign detection is limited to one or two features of spam emails; if the spam messages do not contain the related feature, the methodology fails to cluster them. For example, for emails without URLs or without images, the approaches of [130] and [150] fail, respectively. Other limitations of the former approaches have been identified in [132], which shows that considering only the IP addresses resolved from URLs is insufficient for dividing emails into spam campaigns. More precisely, since web servers host many domains with the same IP address, every spam campaign identified by such a means (as in [130]) is in fact made of a large amount of spam emails sent by different controlling entities.

Thus, [130] propose a new technique for spam campaign detection, named O-means clustering, which is based on the K-means clustering algorithm. The distance between two spam messages is calculated from 12 features extracted from the emails, which are expressed as numbers, and the distance is computed with the Euclidean measure. The 12 features are: 1) size of email, 2) number of lines, 3) number of unique URLs, 4) average length of unique URLs, 5) average length of domain names, 6) average length of query, 7) average number of key-value pairs, 8) average length of path, 9) average length of keys, 10) average length of values, 11) average number of dots in domains, 12) number of global top 100 URLs.

The limitation of O-means is that it requires the number of clusters to be known from the beginning, which is generally not a realistic hypothesis. In addition, the applied features are treated as numerical, which does not represent reality well, especially when the distance between two emails is computed from the number of links: for example, an email with one link is considered closer to an email with 10 links than to one with 11 links.
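The effect of treating the number of links as a numerical feature can be seen in a small computation. The three feature vectors below are hypothetical and contain only three of the twelve features (size in kilobytes, number of lines, number of unique URLs); the distance is the plain Euclidean distance mentioned above.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical vectors: (size in KB, number of lines, number of unique URLs)
one_link = (2.0, 30, 1)
ten_links = (2.0, 30, 10)
eleven_links = (2.0, 30, 11)

d_10 = euclidean(one_link, ten_links)
d_11 = euclidean(one_link, eleven_links)
print(d_10)  # 9.0
print(d_11)  # 10.0
```

The email with one link is ranked as closer to the email with 10 links than to the one with 11 links, although this difference says little about whether the emails belong to the same campaign.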

After clustering spam emails with the O-means method, [131] found that the 10 largest clusters had sent about 90 percent of all spam emails. Hence, the authors investigate these 10 clusters and apply a heuristic analysis to select the significant features among the 12 features used in the previous work. As a result, they select the four most important features, which can effectively separate these 10 clusters from each other: “Size of emails”, “Number of lines”, “Length of URLs” and “Number of dots”. However, the authors mention that this is not the best method for selecting the most significant features, since it is based on the analysis of the top 10 clusters only. Nevertheless, it yields almost the same clustering accuracy as the previous method, which used 12 features: the accuracy changes from 86.63 percent to 86.33 percent, a negligible difference, while the execution time decreases from 28,772 seconds to 6,124 seconds.

[144] first extract eleven features from each spam email: “Message Id”, “Sender IP address”, “Sender Email”, “Subject”, “Body Length”, “Word Count”, “Attachment File Name”, “Attachment MD5”, “Attachment Size”, “Body URL”, and “Body URL Domain”. Some attributes are broken down into two sub-attributes, for example “Body URL” into “Machine Name” and “Path”.

Afterwards, two clustering algorithms are applied to divide the spam messages. First, an agglomerative hierarchical algorithm [66] is used to cluster the whole data set based on a comparison of the messages' subjects. This means that at the beginning each email is a cluster by itself, and then clusters sharing a common subject are merged. The distance D(i, j) between two clusters i and j is equal to 0 if they share a common value of an attribute and equal to 1 otherwise. Thus, when the distance between two clusters is 0, the two clusters are merged. The authors find that with this first merge based on the subject, 67% of the messages are attributed to one cluster. To solve the problem of the false positive rate for big clusters, a connected components algorithm with weighted edges is applied. A connected component [12] of an undirected graph is a set A of vertices such that, for each vertex v ∈ A, the set of vertices reachable by a path from v is exactly A. The weight on an edge represents the strength of the connection between two vertices. In this approach, edges connect two spam emails based on the eleven attributes. The desired clusters are the connected components of this graph whose weights are above a specified threshold.

The main drawback of this methodology is that it cannot be applied to large datasets, since pairwise comparisons are performed several times for the pairs of emails in the dataset.

The basic hypothesis in [27] for clustering spam emails is that some parts of spam messages are static with respect to recognizing a spam campaign. In this work, as an improvement over [92], URLs are not the only feature considered for clustering. To identify spam campaigns, several features are extracted from spam emails, namely “language of email”, “message layout”, “type of message”, “URLs” and “subject”. Afterwards, the frequencies of these features in a large dataset are computed in order to cluster the spam messages with the use of an FP-Tree. The Frequent Pattern Tree (FP-Tree), proposed by [67], is a signature-based method in which each node after the root depicts a feature extracted from the spam messages that is shared by the sub-trees beneath it. Thus, each path in this tree shows a set of features that co-occur in messages, in non-increasing order of frequency of occurrence.
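The structure of such a tree can be sketched as follows. This is a simplified illustration rather than the method of [27] or [67]: it omits the minimum-support filtering and the header table of a full FP-Tree, and the feature strings and the alphabetical tie-breaking are hypothetical choices.

```python
from collections import Counter


class Node:
    def __init__(self, feature=None):
        self.feature = feature
        self.count = 0        # number of messages passing through this node
        self.children = {}


def build_fp_tree(messages):
    """messages: list of sets of categorical features, e.g. {"lang=en", "type=html"}."""
    freq = Counter(f for msg in messages for f in msg)
    root = Node()
    for msg in messages:
        # Order features by non-increasing global frequency (ties broken alphabetically).
        ordered = sorted(msg, key=lambda f: (-freq[f], f))
        node = root
        for feature in ordered:
            if feature not in node.children:
                node.children[feature] = Node(feature)
            node = node.children[feature]
            node.count += 1
    return root


messages = [
    {"lang=en", "type=html", "layout=UT"},
    {"lang=en", "type=html", "layout=TB"},
    {"lang=en", "type=text", "layout=UT"},
]
root = build_fp_tree(messages)
top = root.children["lang=en"]
print(top.count, sorted(top.children))
```

The most frequent feature, shared by all three messages, sits directly under the root, and less frequent features appear deeper in the tree.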

Applying FP-Tree to spam campaign detection, as in [27] and [44], has several limitations. First of all, regarding URL similarity, since each token of a URL is considered as a feature, the method fails to distinguish dynamic URLs in emails belonging to the same campaign [27]. Moreover, considering the tokens of URLs as features causes a spam email containing several URLs to be assigned to several campaigns.

Moreover, regarding layout detection, FP-Tree is too sensitive to very small changes in the layout. More precisely, FP-Tree reads each message line by line, and the layout is then given as a string of letters, e.g. UTBUUB, where the i-th letter in the string represents the i-th line of the spam message; e.g. if U is the first letter of the layout string, the first line of the message contains a URL. Considering that spammers use several techniques for random text and URL obfuscation, two very similar emails belonging to the same spam campaign may be considered as having two different layouts in FP-Tree, just because the random text runs over to the next line in one email but not in the other.

In summary, the previous works on clustering spam emails can mainly be divided into two categories: the first group focuses on the pairwise comparison of emails, for example URL comparison, while the second group uses a clustering algorithm, for example O-means clustering. In general, the aforementioned works suffer from at least one of the following problems: 1) they consider only one or two features for grouping spam messages, which decreases the accuracy; 2) pairwise comparison is used, with quadratic time complexity; 3) the number of clusters is required as prior knowledge; 4) no attention is paid to the features that create pure clusters. In our proposed methodology for clustering spam emails into campaigns, we try to address these problems.
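The sensitivity of line-based layout strings discussed in this section can be illustrated with a small sketch. The letter U for a line containing a URL follows the text; the letters T (other text) and B (blank line), the regular expression, and the two sample emails are assumptions made for this example.

```python
import re

URL_RE = re.compile(r"https?://\S+")


def layout(message):
    """Encode each line as U (contains a URL), B (blank) or T (other text)."""
    codes = []
    for line in message.split("\n"):
        if URL_RE.search(line):
            codes.append("U")
        elif not line.strip():
            codes.append("B")
        else:
            codes.append("T")
    return "".join(codes)


# Two hypothetical emails from the same template; in the second one the
# random text wraps onto an extra line.
email_a = "Cheap meds qzx\nhttp://shop.example/a1\n\nUnsubscribe here"
email_b = "Cheap meds qzxw kjh\nplmo\nhttp://shop.example/b7\n\nUnsubscribe here"

print(layout(email_a))  # TUBT
print(layout(email_b))  # TTUBT
```

Although the two emails differ only in the amount of random text, their layout strings differ, so an exact comparison of layouts would treat them as structurally different.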

2.3 Labeling and Ranking Spam Campaigns

• In the next step, to address the spam message problem, an approach is required to label the detected spam campaigns in order to train a classifier with the labeled set of messages, and then to establish an order among the detected spam campaigns according to investigator priorities.

In the literature, spam campaigns are usually labeled based on characteristic strings (keywords) representing individual campaign types, as in [44], [88] and [55]. As explained, in these works the occurrence of a specific string in a spam message means that the spam is labeled as a pre-identified type of spam campaign. For example, all templates containing the string linksh are defined as a type of self-propagation campaign. First of all, manual string selection requires a lot of time, while spammers frequently change the set of words in the body of messages by applying obfuscation techniques. Moreover, it is worth noticing that many spammers use the same words, like “viagra”, to deceive the victims. Hence, training a classifier based on word labels is not helpful for spam campaign detection when a spam campaign is defined according to our need, i.e. as a set of messages originating from the same source.

[106] label spam campaigns on the basis of contact information in the body of messages. To this end, URLs, phone numbers, Skype IDs, and mail IDs used as contact information are considered for clustering spam emails into similar groups, while the contact information is used as the label of the detected spam campaign. This methodology is effective only against emails that include contact information, which are only a subset of all the spam emails found in the wild.

There are several approaches in the literature in which the spammer's goal is considered. However, these approaches mainly focus on detecting phishing emails and do not consider other spammer purposes. The phishing email [3], a special type of spam message, has become an enormous threat to all Internet-based commercial operations and causes non-negligible financial losses to organizations and individual users. A phisher attempts to redirect users to fake websites, which are designed to illegally obtain a person's financial data, such as usernames, passwords, and credit card details, in an electronic communication [3].

In this regard, a set of features that represents the structure of a phishing email is usually proposed, and a machine learning algorithm is then used to classify a set of emails as phishing or legitimate.

features include: 1) IP-based URLs, 2) age of linked-to domain names, 3) nonmatching URLs, 4) “Here” links to non-modal domain, 5) HTML emails, 6) number of links, 7) number of domains, 8) number of dots, 9) containing JavaScript, 10) spam-filter output.

[17] propose a similar methodology with additional features to train a classifier in order to filter phishing emails. Advanced email features are generated by adaptively trained Dynamic Markov Chains and latent Class-Topic Models. The features are divided into three main groups: basic features, dynamic Markov chain features, and latent topic model features. The basic features themselves comprise several kinds of features, e.g. structural features and link features.

[34] propose a methodology to detect phishing emails based on both machine learning and heuristics. The proposed novel heuristic anti-phishing system employs Gestalt and decision theory concepts in modeling similarity. [3] provide a survey of different techniques for filtering phishing emails, while Gansterer et al. [53] compare different machine learning algorithms for phishing detection. Furthermore, the authors propose a technique that refines the previous phishing filtering approaches. In this work, three types of messages, named ham, spam and phishing, are distinguished automatically. Nevertheless, the category of spam emails is not precisely characterized.

A number of works discuss different aspects of spam email attacks, ranging from the network of malware distribution [104] and PageRank spam analysis [1] to the total revenues of a range of spam-advertised campaigns [84], [83]. However, these works also analyze specific aspects of one type of spam attack, and the detection of different types of spam attacks is not discussed.

Regarding the ranking of spam campaigns, [44] consider elements relevant to Canadian law enforcement, e.g. Canadian IP addresses, “.ca” top-level domain names, and IP ranges of Canadian IP addresses. To the best of our knowledge, the present work is the first effort to label spam campaigns according to the different goals of spammers on the basis of the structural features of messages, where the goal-based label of each campaign is then used to order the set of detected, labeled spam campaigns.

2.4 On the Formalization of Clustering and its Applications

• As the next step, we formalize CCTree, an effective and efficient categorical clustering algorithm. The formal schema is used to formalize CCTree parallelism with the use of a rewriting system.

It is hard to find studies in the literature on the formalization of different concepts related to clustering algorithms.

natural objective function. The dendrogram properties of hierarchical clustering are enforced as linear constraints. The proposed formalization technique has the benefit that relaxing the constraints may provide novel problem variations, like overlapping clusterings.

[103] formally define the problem of clustering in Multi-Criteria Decision Aid (MCDA) systems. As in most MCDA approaches, the preferences of a decision maker are modeled based on a set of decision alternatives. To find the optimal solution, the authors propose a heuristic approach, which is validated through tests on a large set of artificially generated benchmarks.

[2] propose an approach, based on set theory, to formalize the problem of data streams in clustering algorithms. A data stream refers to an infinite sequence of data. The formalization scheme makes it possible to identify and propose basic properties for the design and comparison of data stream clustering algorithms. To this end, they extend Kleinberg's properties [86] to represent clustering partitions evolving according to the data stream behavior. They find that it is difficult for an algorithm to comply with the expressiveness property in a data stream context.

[41] apply predicate logic, in terms of sets of if-then rules, to formalize heuristic rules in clustering algorithms. With this approach, it is possible to describe traditional clustering algorithms such as k-means. However, none of the few works on formalizing clustering algorithms uses an algebraic methodology to abstract the representation of a clustering algorithm. In what follows, we present several techniques and methodologies used to formalize feature models.

Feature models are information models in which a set of products, e.g. software products or DVD player products, is represented as a hierarchical arrangement of features, with different relationships among features [15]. Feature models are used in many applications because they are able to model complex systems, are interpretable, and can handle both ordered and unordered features [105]. Benavides et al. [15] argue that designing a family of software systems in terms of features makes it easier for all stakeholders to understand than when it is expressed in terms of objects or classes. The representation of feature models as a tree of features was first introduced by Kang et al. in [82], for use in software product lines. Some studies [31], [32] show that tree models combined with ensemble techniques lead to accurate performance on a variety of domains. In a feature model tree, differently from CCTree, the root is the desired product, the nodes are the features, and different representations of edges indicate the mandatory or optional presence of features.

Höfner et al. [73], [74] were the first to apply idempotent semirings as the basis for the formalization of tree models of products, which they called feature algebra. The concept of a semiring is used to meet the needs of product families for an abstract form of expression, refinements, multi-view reconciliation, and product development and classification. The elements of the semiring in the proposed methodology are sets of products, or product families.

To give better insight into how feature algebra works, we present a brief history of product families, from definition to formalization. Furthermore, we explain that, although we were inspired by the concept of feature algebra in formalizing a tree model system, our proposed approach differs in several aspects.

Feature-Oriented Domain Analysis (FODA) uses feature models as the means to express the mandatory, optional and alternative concepts within a domain [81], [115]. For example, in a car, the transmission system is a mandatory feature and air conditioning is an optional feature, while the transmission system can be either manual or automatic. The part of the FODA feature model most related to formalization work is the proposed feature diagram, which builds a tree of features and captures the mandatory, optional, and alternative relationships among features.

[82] perform an analysis of the commonalities among applications in a particular domain in terms of services, operating environments, domain technologies and implementation techniques. Afterwards, they construct a model, named feature model, to capture the commonalities as an AND/OR graph. The AND nodes in this graph indicate mandatory features, and the OR nodes show alternative features chosen from different applications.

[39] propose a feature model represented by a hierarchically arranged diagram in which a parent feature is composed of a combination of some or all of its children. A parent feature and its children in this diagram can have one of the following relationships:

– And relationship, which indicates that all children must be considered in the composition of the parent feature.

– Alternative relationship, which indicates that only one child forms the parent feature.

– Or relationship, which shows that one or more children features can be involved in the composition of the parent feature.

– Mandatory relationship, which indicates that children features are required.

– Optional relationship, which shows that children features are optional.
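The relationships listed above can be sketched as a small tree data structure with a validity check for a chosen set of features. This is a minimal illustration, not the formalization of [39]: the class `Feature`, the function `is_valid`, and the rule that the mandatory/optional flag is consulted only under an And parent are assumptions, and the car example reuses the features mentioned earlier in this section.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    name: str
    group: str = "and"       # how children combine: "and", "alternative", or "or"
    optional: bool = False   # assumption: only consulted when the parent group is "and"
    children: list = field(default_factory=list)


def subtree_names(node):
    """Collect the names of node and all of its descendants."""
    names = {node.name}
    for child in node.children:
        names |= subtree_names(child)
    return names


def is_valid(node, selected):
    """Check the set `selected` against the subtree of node, assuming node is selected."""
    chosen = [c for c in node.children if c.name in selected]
    skipped = [c for c in node.children if c.name not in selected]
    if node.group == "and" and any(not c.optional for c in skipped):
        return False  # a mandatory child is missing
    if node.group == "alternative" and node.children and len(chosen) != 1:
        return False  # exactly one child must be chosen
    if node.group == "or" and node.children and not chosen:
        return False  # at least one child must be chosen
    if any(subtree_names(c) & selected for c in skipped):
        return False  # a descendant was selected without its parent
    return all(is_valid(c, selected) for c in chosen)


# Hypothetical car model: mandatory transmission (manual or automatic),
# optional air conditioning.
car = Feature("car", children=[
    Feature("transmission", group="alternative",
            children=[Feature("manual"), Feature("automatic")]),
    Feature("air_conditioning", optional=True),
])
```

For instance, `is_valid(car, {"car", "transmission", "manual"})` holds under these assumptions, whereas selecting both `manual` and `automatic` violates the Alternative relationship.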

Lopez-Herrejon, Batory, and Lengauer model features as functions and feature composition as function composition [97], [95].

To illustrate how feature algebra works, we refer to an example of a product line provided in [24]. Suppose that an electronics company has a family of three product lines: MP3 Players, DVD Players, and Hard Disk Recorders. All members share the set of features given in the Commonalities. A member can contain some mandatory features and might contain some optional features that another member of the same product line does not have. For instance, one product could be a DVD player that is able to play music CDs, whilst another does not have this feature. However, all the DVD players of the DVD Player product line must contain the Play DVD feature. Furthermore, it is possible to have a DVD player that is able to play several DVDs at the same time.


Different researchers have proposed different views of what a feature is or should be. A definition that is common to most (if not all) of them in Feature-Oriented Software Development (FOSD) is that “a feature is a structure that extends and modifies the structure of a given program in order to satisfy a stakeholder’s requirement, to implement a design decision, and to offer a configuration option” [72].

Typically, a set of features is composed to create a final program, which is itself considered a feature. Under this assumption, a feature is either a complete program that can be executed or a program increment that requires further features to yield a complete program. The structure of a basic feature is modeled as a tree, called a feature structure tree (FST), which organizes the feature’s structural elements, e.g., classes, fields, or methods, hierarchically. A name and type information are assigned to each node of an FST, which helps to prevent the composition of incompatible nodes during feature composition [72].

The concept of product families moved from the hardware industry into the software development process [72]. The reason is that software developers also prefer to build not just a single product but a family of similar products that share some functionality whilst having some well-identified variabilities. These elements, known as features, can be characterized in a software family as requirements, architectural properties, components, middleware, or code. Because systems are characterized by their features, the authors of [72] call their proposed methodology feature algebra. Idempotent semirings are the basis of feature algebra, which allows a formal treatment of the aforementioned elements as well as calculations with them. Sets of products are particular models of the proposed feature algebra, which in its extended form covers product lines, refinement, product development, and product classification. The tree-like structure formalized in product family problems differs from CCTree. In a product family structure, unlike in CCTree, the edges of the tree have no labels; only the nodes have them. Furthermore, different representations of edges convey different concepts, whilst CCTree has only one kind of edge representation.

To the best of our knowledge, we are the first to apply an algebraic structure to abstract the representation of a categorical clustering algorithm and to formalize the interesting concepts related to it, i.e. clustering parallelism. To this end, we give an algebraic representation of the tree structure, and then, through several theorems and examples, we show that the proposed algebraic term fully abstracts the tree representation. Calling the term derived from a CCTree a CCTree term, we propose a rewriting system to automatically verify whether a term represents a CCTree structure or not. Furthermore, a set of rewriting rules is provided to parallelize the result of parallel clustering.


Chapter 3

Spam Campaign Detection

Spam emails constitute a fast-growing and costly problem associated with the Internet today. To fight effectively against spammers, it is not enough to block spam messages. Instead, it is necessary to analyze the behavior of spammers and to catch them. This analysis is extremely difficult if the huge amount of spam messages is considered as a whole. Clustering spam emails into smaller groups according to their inherent similarity facilitates discovering the spam campaigns sent by a spammer, which in turn makes it possible to analyze the spammer's behavior. In this chapter, we propose a methodology to group a large amount of spam emails into spam campaigns on the basis of categorical attributes of spam messages. A new informative clustering algorithm, named Categorical Clustering Tree (CCTree), is introduced to cluster and characterize spam campaigns. The complexity of the algorithm is also analyzed and its efficiency is shown [126].

3.1 Introduction

Nowadays, the problem of receiving spam messages leaves no one untouched. According to a McAfee report [100], more than 70% of the 191.4 billion emails sent worldwide each day on average [110] are spam. Microsoft and Google [113] estimate that spam emails cost American firms and consumers up to 20 billion dollars per year. Moreover, a Cisco report [136] shows that spam volume increased by 250 percent from January 2014 to November 2014. Spam emails cause problems ranging from direct financial losses to the misuse of traffic, storage space, and computational power.

Given the relevance of the problem, several approaches have already been proposed to tackle this issue. Currently, the most used approach for fighting spam emails consists in identifying and blocking them [30], [46], [123] on the recipient machine through filters, which are generally based on machine learning techniques or content features [22], [138], [139]. Alternative approaches are based on the analysis of spam botnets [79], [91], [146], [152].


negligible cost to users and companies [113]. Thus, the analysis of spammers' behavior and the identification of spam-sending infrastructures are of capital importance in the effort to define a definitive solution to the problem of spam emails.

Such an analysis, which is based on structural dissection of raw emails, constitutes an extremely challenging task, due to the following factors :

— The amount of data to be analyzed is huge and grows every single hour.

— New attack strategies are constantly designed, and the immediate understanding of such strategies is paramount in fighting criminal attacks carried out through spam emails (e.g. phishing).

To simplify this analysis, the huge amount of spam emails should be divided into spam campaigns. A spam campaign is the set of messages spread by a spammer with a specific purpose [27], such as advertising a specific product, spreading ideas, or pursuing criminal intents, e.g. phishing. Grouping spam messages into spam campaigns reveals behaviors that may be difficult to infer when looking at a large collection of spam emails as a whole [132]. According to [27], in order to characterize the strategies and traffic generated by different spammers, it is necessary to identify groups of messages that are generated following the same procedure and that are part of the same campaign.

It is worth noting that grouping a large amount of spam emails into smaller groups is an unsupervised learning task, because no labeled data is available at the outset for training a classifier. More specifically, supervised learning requires classes to be defined in advance and a training set with elements for each class. In several classification problems, this knowledge is not available and unsupervised learning is used instead. Unsupervised learning refers to the problem of finding hidden structure in unlabeled data [57]. The best-known unsupervised learning methodology is clustering, which divides data into groups (clusters) of objects such that objects in the same group are more similar to each other than to those in other groups [77].

However, dividing spam messages into spam campaigns is not a trivial task due to the following reasons :

— Spam campaign classes are not known beforehand, which means we need an unsupervised machine learning technique.

— Feature extraction is difficult. Finding the elements that best characterize an email is an open problem addressed differently in various research works [50], [17], [150], [132].

For these reasons, the most used approach to classifying spam emails is to cluster them on the basis of their similarities [4], [111], [132].

However, the accuracy of current solutions is still somewhat limited and further improvements are needed. While some categorical attributes, for example the language of a spam message, are primary, discriminative, and outstanding characteristics for specifying a spam campaign,