Proactive Discovery of Phishing Related Domain Names


samuel.marchal@uni.lu 13/09/12


Outline

1 Motivation
2 Phishing domains modelling
3 Experiments and Results
4 Conclusion


Early blacklisting solution

[Figure: DNS proactive blacklisting. Diagram labels: domain names generator; checking + ALERT; example domains ebay-securelogin.com, paypalprotect.com, hsbcbankinglogon.com, ...; register; control; spoofed e-mail; visit; fake websites; download malware; ...]

State of the Art:


Natural language model

Key idea: generate domain names similar to those registered by phishers
⇒ relying on natural language

- focus on main domain + TLD
  ex: www.login.myphishingdomain.com/index.php ⇒ myphishingdomain.com
- extract features from blacklisted phishing domain names
- deduce a generation model for domain names
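A minimal sketch of isolating the main domain + TLD from a URL, as in the www.login.myphishingdomain.com/index.php example. The helper name `main_domain_and_tld` and the tiny `MULTI_LABEL_TLDS` set are assumptions made for illustration; they are not taken from the slides, and a real system would need a complete public-suffix list.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny set of multi-label TLDs; a real system
# would rely on a complete public-suffix list instead.
MULTI_LABEL_TLDS = {"co.uk", "com.au", "co.jp"}


def main_domain_and_tld(url):
    """Return (main_domain, tld) for a URL or a bare host name."""
    # urlsplit only fills in the host when a "//" separator is present.
    host = urlsplit(url if "//" in url else "//" + url).hostname or ""
    labels = host.lower().split(".")
    # Use a two-label TLD when the last two labels form a known one.
    tld_len = 2 if ".".join(labels[-2:]) in MULTI_LABEL_TLDS else 1
    if len(labels) <= tld_len:
        return "", ".".join(labels)
    return labels[-tld_len - 1], ".".join(labels[-tld_len:])


print(main_domain_and_tld("www.login.myphishingdomain.com/index.php"))
# prints ('myphishingdomain', 'com')
```

Subdomains (www, login) and the path (/index.php) are discarded, so only the part a phisher has to register is kept.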


Features extraction

Example: securelogin34ebaymy-securephishing-domain.co.uk

TLD splitting: securelogin34ebaymy-securephishing-domain + TLD
"-" splitting: securelogin34ebaymy, securephishing, domain
number extraction: 34
word segmentation: secure login 34 ebay my, secure phishing, domain

- distlen = {(8, 1)}
- distword = {(secure, 0.25), (login, 0.125), (34, 0.125), (ebay, 0.125), ...}
- distfirstword = {(secure, 1)}
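The decomposition of the example name and the three distributions can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the vocabulary `VOCAB`, the fewest-words segmentation strategy, and the function names `segment`, `decompose` and `distributions` are all hypothetical, since the slides do not describe the word splitter.

```python
import re
from collections import Counter

# Hypothetical toy vocabulary; the slides do not specify the word list
# used by the word splitter.
VOCAB = {"secure", "login", "ebay", "my", "phishing", "domain"}


def segment(text, vocab=VOCAB):
    """Split text into vocabulary words (dynamic programming, fewest words)."""
    best = [None] * (len(text) + 1)  # best[i] = segmentation of text[:i]
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocab:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    # Fall back to the unsplit text when no full segmentation exists.
    return best[-1] if best[-1] is not None else [text]


def decompose(main_domain):
    """Apply "-" splitting, number extraction, then word segmentation."""
    words = []
    for part in main_domain.split("-"):
        # Separate digit runs from letter runs; other characters are dropped.
        for chunk in re.findall(r"\d+|[a-z]+", part.lower()):
            words.extend([chunk] if chunk.isdigit() else segment(chunk))
    return words


def distributions(domains):
    """Return (distlen, distword, distfirstword) as relative frequencies."""
    decomposed = [decompose(d) for d in domains]

    def normalise(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    distlen = normalise(Counter(len(ws) for ws in decomposed))
    distword = normalise(Counter(w for ws in decomposed for w in ws))
    distfirstword = normalise(Counter(ws[0] for ws in decomposed if ws))
    return distlen, distword, distfirstword


print(decompose("securelogin34ebaymy-securephishing-domain"))
print(distributions(["securelogin34ebaymy-securephishing-domain"]))
```

On the single example name this yields 8 words, so distlen gives length 8 a probability of 1, secure gets 2/8 = 0.25 in distword, and secure is the only first word.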


Model generation

[Figure: Markov chain built from the example, with states secure, login, 34, ebay, my, phishing, domain. Learned transition probabilities: 0.5 on each of the two transitions leaving secure, 1 on the other observed transitions. In the final version of the chain these values become 0.475 and 0.95, and additional transitions carry small probabilities (0.008, 0.016).]

- distlen = {(8, 1)}
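A minimal sketch of how such a chain could be estimated and sampled, using the example decomposition secure login 34 ebay my secure phishing domain. The function names `build_chain` and `generate` are hypothetical, the first word and the length are passed in explicitly rather than drawn from distfirstword and distlen, and the small probabilities given to additional transitions in the final chain are not reproduced here.

```python
import random
from collections import Counter, defaultdict


def build_chain(word_lists):
    """Estimate transition probabilities between consecutive words."""
    counts = defaultdict(Counter)
    for words in word_lists:
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return {
        word: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for word, followers in counts.items()
    }


def generate(chain, first_word, length, rng=random):
    """Random walk of at most `length` words starting at `first_word`."""
    words = [first_word]
    # Stop early when the current word has no known successor.
    while len(words) < length and words[-1] in chain:
        successors = chain[words[-1]]
        words.append(
            rng.choices(list(successors), weights=list(successors.values()))[0]
        )
    return words


example = ["secure", "login", "34", "ebay", "my", "secure", "phishing", "domain"]
chain = build_chain([example])
print(chain["secure"])  # prints {'login': 0.5, 'phishing': 0.5}
print("".join(generate(chain, "secure", 8, random.Random(0))))
```

Because secure is followed once by login and once by phishing, each of these transitions gets 0.5, while every other observed transition gets 1, which matches the values shown on the slide.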


Semantic extension

DISCO:
- calculates a similarity score (semantic relatedness) between 2 words
- gives the n most related words to a word w
- is based on a dictionary (Wikipedia, BNC, PubMed, etc.)
- is applied to each state of the Markov chain ⇒ expands the discovery
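The idea of expanding each state with its n most related words can be sketched as below. The sketch does not call DISCO: the table `RELATED`, its words and its scores are invented for illustration, and `most_related` and `extend_states` are hypothetical helper names.

```python
# Hypothetical relatedness scores standing in for a DISCO lookup;
# the words and values are illustrative only.
RELATED = {
    "secure": {"safe": 0.8, "protected": 0.7, "trusted": 0.4},
    "login": {"signin": 0.9, "logon": 0.85},
}


def most_related(word, n, table=RELATED):
    """Return the n words with the highest relatedness score to `word`."""
    scores = table.get(word, {})
    return sorted(scores, key=scores.get, reverse=True)[:n]


def extend_states(states, n, table=RELATED):
    """Map each Markov chain state to itself plus its n most related words."""
    return {state: [state] + most_related(state, n, table) for state in states}


print(extend_states(["secure", "login", "ebay"], n=2))
# prints {'secure': ['secure', 'safe', 'protected'],
#         'login': ['login', 'signin', 'logon'], 'ebay': ['ebay']}
```

A generator could then substitute any of the listed alternatives when it visits a state, which widens the set of names it can produce beyond the words seen in the blacklist.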


Generator global overview

- extract features from known phishing domains
- generate domain names ⇒ potentially phishing
- domain names automatically checked further ⇒ Blacklist

[Figure: generator architecture in five numbered steps. Components: malicious domains (blacklists, honeypots, malware analysis...); name decomposition; word splitter; TLD list (com, lu, fr, de, org...); name statistics; feature extraction; model (Markov chains, DISCO); domain checker; potential malicious domain list; blacklist. Example: macromediasetup.com/dl.exe ⇒ macromediasetup, com ⇒ |macro|media|set|up|, |com|]


Offline testing

Phishing domain set from blacklists (∼ 50,000):
- Malware Domain List (01/2009 → 03/2012)
- DNS-Black-Hole (01/2009 → 03/2012)
- PhishTank (07/2007 → 03/2012)

5 tests of 1 million domain generations
- learning set 30% (15,000 domains)
- testing set 70% (35,000 domains)

[Figure: number of malicious domain names (0 to 600) vs. number of generated domain names (0 to 1e+06)]

Predictability
- learning: the 10% oldest
- testing: 90% remaining

[Figure: number of malicious domain names (0 to 25) vs. time (in months) after the generation (m+2 to m+34)]

Strategy
- learning set 30%
- testing set 70%

[Figure: number of malicious domain names (0 to 400) vs. number of generated domain names (0 to 1e+06)]
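The two evaluation set-ups (a 30% / 70% split, and learning on the 10% oldest entries) can be sketched as follows. The function names, the use of a seeded shuffle for the 30% / 70% split, and the (date, domain) input format are assumptions; the slides do not say how the learning set was drawn.

```python
import random


def random_split(domains, learning_ratio=0.3, seed=0):
    """Shuffle, then split into (learning set, testing set)."""
    shuffled = list(domains)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * learning_ratio)
    return shuffled[:cut], shuffled[cut:]


def chronological_split(dated_domains, learning_ratio=0.1):
    """Learn on the oldest entries, test on the remaining ones.

    `dated_domains` is an iterable of (date, domain) pairs.
    """
    ordered = [domain for _, domain in sorted(dated_domains)]
    cut = round(len(ordered) * learning_ratio)
    return ordered[:cut], ordered[cut:]


def count_hits(generated, testing_set):
    """Number of distinct generated names found in the testing set."""
    return len(set(generated) & set(testing_set))


learning, testing = random_split([f"d{i}.com" for i in range(10)])
print(len(learning), len(testing))  # prints 3 7
```

With the ∼ 50,000 blacklisted domains, the same ratios give the 15,000 / 35,000 split of the slide; `count_hits` is the quantity plotted on the vertical axis of the figures.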


Online testing

DNS request for 1 million generated domains
∼ 100,000 domains match an IP address:
- ∼ 80,000 wildcarding domains
- ∼ 5,000 domains for sale
- ∼ 15,000 remaining domains:
  - ∼ 500 actually malicious and blacklisted
  - ∼ 200 legitimate domains

Discriminate phishing from legitimate generated domains: MCscore
⇒ Eliminate 93 % of legitimate domains...
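The first step, keeping only the generated names that match an IP address, can be sketched with the standard `socket` module. The function names and the injectable `resolver` parameter are assumptions for illustration; the filtering of wildcarding and for-sale domains and the MCscore itself are not shown, since the slides do not detail them.

```python
import socket


def resolve(name):
    """Return the IPv4 address of `name`, or None if it does not resolve."""
    try:
        return socket.gethostbyname(name)
    except (OSError, UnicodeError):
        return None


def resolving_domains(names, resolver=resolve):
    """Keep only the generated names that match an IP address."""
    resolved = {}
    for name in names:
        address = resolver(name)
        if address is not None:
            resolved[name] = address
    return resolved


# Offline demonstration with a fake resolver (192.0.2.1 is a
# documentation-only address); pass no resolver to query real DNS.
fake_resolver = {"ebay-securelogin.com": "192.0.2.1"}.get
print(resolving_domains(
    ["ebay-securelogin.com", "unregistered-example.com"],
    resolver=fake_resolver,
))
# prints {'ebay-securelogin.com': '192.0.2.1'}
```

Making the resolver a parameter keeps the sketch testable without network access and leaves room for a batched or asynchronous DNS client when checking a million names.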


Conclusion

Generation of domain names likely to be malicious
- features extracted from existing domain names
- Markov chain model
- semantic relatedness techniques
⇒ Proactively build a phishing blacklist

Results:
- able to generate phishing domains...
- ... still with false positives
⇒ Domain scoring based on Markov chain model

Future work:
