samuel.marchal@uni.lu 13/09/12
Proactive Discovery of Phishing
Related Domain Names
Motivation Phishing domains modelling Experiments and Results Conclusion
Outline
1 Motivation
2 Phishing domains modelling
3 Experiments and Results
Early blacklisting solution
[Figure: proactive DNS blacklisting. Labels: phishing domains (ebay-securelogin.com, paypalprotect.com, hsbcbankinglogon.com, ...) hosting fake websites, malware downloads, ...; spoofed e-mail; register; control; visit; domain names generator; checking + ALERT.]
State of the Art:
Natural language model
Key idea: generate domain names similar to those registered by phishers
⇒ relying on natural language
- focus on main domain + TLD
  ex: www.login.myphishingdomain.com/index.php → myphishingdomain.com
- extract features from blacklisted phishing domain names
- deduce a generation model for domain names
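To illustrate the "main domain + TLD" focus, here is a minimal Python sketch. It assumes a small hand-written suffix list (`KNOWN_TLDS`, hypothetical) rather than a complete public suffix list, and the helper name `main_domain_and_tld` is not taken from the original work.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny suffix list; a real system would need a complete one.
KNOWN_TLDS = ("co.uk", "com", "org", "lu", "fr", "de")

def main_domain_and_tld(url):
    """Return (main_domain, tld) for a URL, dropping subdomains and the path."""
    # urlsplit only fills in the host when a scheme or a leading '//' is present.
    if "//" not in url:
        url = "//" + url
    host = urlsplit(url).hostname or ""
    # Try longer suffixes first so that multi-label TLDs such as 'co.uk' match.
    for tld in sorted(KNOWN_TLDS, key=len, reverse=True):
        suffix = "." + tld
        if host.endswith(suffix):
            labels = host[: -len(suffix)].split(".")
            return labels[-1], tld
    raise ValueError(f"unknown TLD in {host!r}")

print(main_domain_and_tld("www.login.myphishingdomain.com/index.php"))
# ('myphishingdomain', 'com')
```

Subdomains such as www.login and the path /index.php are discarded; only the registered name and its TLD are kept for modelling.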
Features extraction
Example: securelogin34ebaymy-securephishing-domain.co.uk
- TLD splitting: securelogin34ebaymy-securephishing-domain | TLD
- "-" splitting: securelogin34ebaymy | securephishing | domain
- number extraction: 34
- word segmentation: secure login ebay my | secure phishing | domain
Resulting features:
- distlen = {(8, 1)}
- distword = {(secure, 0.25), (login, 0.125), (34, 0.125), (ebay, 0.125), ...}
- distfirstword = {(secure, 1)}
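The decomposition steps and the three distributions can be sketched in Python as follows. This is an illustration under assumptions: the input is the main domain with the TLD already removed, the toy vocabulary `VOCAB` and the longest-match segmenter are hypothetical (the slides do not say which dictionary or algorithm the word splitter uses), and the function names are invented for the example.

```python
import re
from collections import Counter

# Hypothetical toy vocabulary standing in for the word splitter's dictionary.
VOCAB = {"secure", "login", "ebay", "my", "phishing", "domain"}

def segment(text):
    """Split text into vocabulary words (longest match first, with backtracking)."""
    if not text:
        return []
    for end in range(len(text), 0, -1):
        if text[:end] in VOCAB:
            rest = segment(text[end:])
            if rest is not None:
                return [text[:end]] + rest
    return None  # no full segmentation exists

def decompose(main_domain):
    """'-' splitting, number extraction, then word segmentation."""
    words = []
    for label in main_domain.split("-"):
        # Separate digit runs from letter runs, keeping their order.
        for chunk in re.findall(r"\d+|[a-z]+", label.lower()):
            if chunk.isdigit():
                words.append(chunk)
            else:
                # Fall back to the raw chunk when it cannot be segmented.
                words.extend(segment(chunk) or [chunk])
    return words

def features(domains):
    """Empirical distributions of name length, words, and first words."""
    seqs = [decompose(d) for d in domains]
    total_words = sum(len(s) for s in seqs)
    dist_len = {n: c / len(seqs) for n, c in Counter(len(s) for s in seqs).items()}
    dist_word = {w: c / total_words for w, c in Counter(w for s in seqs for w in s).items()}
    dist_first = {w: c / len(seqs) for w, c in Counter(s[0] for s in seqs).items()}
    return dist_len, dist_word, dist_first

dist_len, dist_word, dist_first = features(["securelogin34ebaymy-securephishing-domain"])
print(dist_len, dist_word["secure"], dist_first)
```

On the single example name this yields a length of 8 words with probability 1, a frequency of 0.25 for "secure" (2 occurrences out of 8), and "secure" as the only first word, in line with the values on the slide.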
Model generation
[Figure: Markov chain with states secure, login, 34, ebay, my, phishing, domain and transition probabilities 0.5, 1, 1, 1, 1, 0.5, 1]
- distlen = {(8, 1)}
- distword = {(secure, 0.25), (login, 0.125), (34, 0.125), (ebay, 0.125), ...}
- distfirstword = {(secure, 1)}
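A first-order Markov chain of this kind can be estimated from word sequences as sketched below. The sketch assumes that transitions are counted between consecutive words of a decomposed name, which reproduces the 0.5 and 1 values of the example; `build_chain` and `generate` are hypothetical names, and the simple random walk is only one possible way to draw names from the model.

```python
import random
from collections import Counter, defaultdict

def build_chain(sequences):
    """Estimate P(next word | current word) from consecutive word pairs."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for cur, nxts in counts.items()}

def generate(chain, first_word, length, rng):
    """Random walk of at most `length` words, stopping early at a dead end."""
    words = [first_word]
    while len(words) < length and words[-1] in chain:
        nxts = chain[words[-1]]
        words.append(rng.choices(list(nxts), weights=list(nxts.values()))[0])
    return words

example = ["secure", "login", "34", "ebay", "my", "secure", "phishing", "domain"]
chain = build_chain([example])
print(chain["secure"])  # {'login': 0.5, 'phishing': 0.5}
print("".join(generate(chain, "secure", 8, random.Random(0))))
```

In a full generator the first word would be drawn from distfirstword and the target length from distlen; both are fixed by hand here to keep the example short.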
Model generation
[Figure: the same Markov chain with adjusted probabilities: the previous transitions become 0.475, 0.95, 0.95, 0.95, 0.95, 0.475, 0.95, and additional low-probability transitions are labelled 0.008, 0.008, 0.016, 0.008, 0.008, with a further label 1]
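The adjusted values are consistent with one simple smoothing rule, sketched here as an assumption rather than as the author's stated method: observed transitions are scaled by 0.95, and the remaining 0.05 is shared among the words not yet reachable from the state (excluding the state itself) in proportion to distword. The function name `smooth` and the `mass` parameter are hypothetical.

```python
def smooth(row, state, dist_word, mass=0.05):
    """Scale observed transitions by (1 - mass) and spread `mass` over unseen successors."""
    # Candidate new successors: every known word except the state and its observed successors.
    unseen = {w: p for w, p in dist_word.items() if w != state and w not in row}
    if not unseen:
        return dict(row)
    total = sum(unseen.values())
    smoothed = {w: p * (1 - mass) for w, p in row.items()}
    for w, p in unseen.items():
        smoothed[w] = mass * p / total
    return smoothed

dist_word = {"secure": 0.25, "login": 0.125, "34": 0.125, "ebay": 0.125,
             "my": 0.125, "phishing": 0.125, "domain": 0.125}
login_row = smooth({"34": 1.0}, "login", dist_word)
secure_row = smooth({"login": 0.5, "phishing": 0.5}, "secure", dist_word)
print(login_row)
print(secure_row)
```

For the state "login" this gives 0.95 for the observed transition to "34", about 0.0167 for "secure" and about 0.0083 for each of the four other words, which the figure appears to show truncated as 0.016 and 0.008. For "secure" the two observed transitions become 0.475 each.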
Semantic extension
Disco:
- calculates a similarity score (semantic relatedness) between 2 words
- gives the n most related words to a word w
- based on a dictionary (Wikipedia, BNC, PubMed, etc.)
- applied to each state of the Markov chain ⇒ expands the discovery
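The effect of the semantic extension can be sketched in Python as follows. The `RELATED` table is a hypothetical stand-in for Disco output, and substituting related words after generation is an assumption made for brevity: the slides only state that the extension is applied to each state of the Markov chain.

```python
from itertools import product

# Hypothetical stand-in for Disco output: word -> related words, most related first.
RELATED = {"secure": ["safe", "protected"], "login": ["logon", "signin"]}

def variants(words, related, n):
    """Yield every name obtained by keeping each word or swapping it for one of its n most related words."""
    options = [[w] + related.get(w, [])[:n] for w in words]
    for combo in product(*options):
        yield "".join(combo)

print(list(variants(["secure", "login"], RELATED, 1)))
# ['securelogin', 'securelogon', 'safelogin', 'safelogon']
```

Each word seen in the blacklists thus opens the way to names built from words that never appeared in the learning data.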
Generator global overview
I extract features from known phishing domains I generate domain names ⇒ potentially phishing I domain names automatically checked further
=⇒ Blacklist Name Statistics set up xp Markov Chains + (1) (4) (5) Name Decompositon TLD list: com, lu, fr, de, org...
Malicous domains (blacklists, honeypots, malware analysis...) Word Splitter DISCO Domain checker Potential Malicious Domain List macromediasetup.com/dl.exe
macromediasetup,com |macro|media|set|up|, |com|
Feature extraction Model
(2) (3)
Blacklist
Offline testing
Phishing domain set from blacklists (∼ 50,000):
- Malware Domain List (01/2009 → 03/2012)
- DNS-Black-Hole (01/2009 → 03/2012)
- PhishTank (07/2007 → 03/2012)
5 tests of 1 million domain generations
- learning set 30% (15,000 domains)
- testing set 70% (35,000 domains)
[Figure: # of malicious domain names (0 to 600) vs. # of generated domain names (0 to 1e+06)]
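The offline protocol can be sketched as a random split followed by a set intersection. This is a minimal illustration that assumes the split is drawn uniformly at random; the function names and the toy domain list are hypothetical.

```python
import random

def split_learning_testing(domains, learning_fraction, rng):
    """Shuffle the blacklist and cut it into a learning set and a testing set."""
    shuffled = list(domains)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * learning_fraction)
    return shuffled[:cut], shuffled[cut:]

def count_hits(generated, testing):
    """Number of distinct generated names that appear in the testing set."""
    return len(set(generated) & set(testing))

blacklist = [f"d{i}.com" for i in range(10)]  # toy stand-in for the ~50,000 domains
learning, testing = split_learning_testing(blacklist, 0.3, random.Random(0))
generated = testing[:2] + ["notlisted.com"]   # stand-in for the generator output
print(len(learning), len(testing), count_hits(generated, testing))
# 3 7 2
```

The model is built from the learning set only, so every hit in the testing set is a malicious name the generator had not seen.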
Offline testing
Predictability
- learning: the 10% oldest
- testing: 90% remaining
[Figure: # of malicious domain names (0 to 25) vs. time (in months) after the generation (m+2 to m+34)]
Strategy
- learning set 30%
- testing set 70%
[Figure: # of malicious domain names (0 to 400) vs. # of generated domain names (0 to 1e+06)]
Online testing
DNS requests for 1 million generated domains
∼ 100,000 domains resolve to an IP address:
- ∼ 80,000 wildcarding domains
- ∼ 5,000 domains for sale
- ∼ 15,000 remaining domains:
  - ∼ 500 actually malicious and blacklisted
  - ∼ 200 legitimate domains
Discriminate phishing from legitimate generated domains: MCscore
⇒ Eliminate 93 % of legitimate domains...
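The slides do not define MCscore. As one plausible illustration of scoring a name with a Markov chain model, the sketch below computes the log-probability of a word sequence; this definition, the function name `chain_log_score`, and the toy probabilities are assumptions, not the author's formula.

```python
import math

def chain_log_score(words, first_word_prob, transitions):
    """Log-probability of a word sequence under a first-order Markov chain (-inf if impossible)."""
    p = first_word_prob.get(words[0], 0.0)
    if p == 0.0:
        return -math.inf
    score = math.log(p)
    for cur, nxt in zip(words, words[1:]):
        p = transitions.get(cur, {}).get(nxt, 0.0)
        if p == 0.0:
            return -math.inf
        score += math.log(p)
    return score

# Toy model, assumed for the example only.
first_word_prob = {"secure": 1.0}
transitions = {"secure": {"login": 0.5, "phishing": 0.5}, "phishing": {"domain": 1.0}}

likely = chain_log_score(["secure", "phishing", "domain"], first_word_prob, transitions)
unlikely = chain_log_score(["secure", "domain"], first_word_prob, transitions)
print(likely, unlikely)
```

Names whose score falls below a chosen threshold would then be treated as less likely to be phishing; any such threshold would have to be tuned on labelled data.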
Conclusion
Generation of domain names likely to be malicious
- features extracted from existing domain names
- Markov chain model
- semantic relatedness techniques
⇒ Proactively build a phishing blacklist
Results:
- able to generate phishing domains...
- ... still with false positives
⇒ Domain scoring based on Markov chain model
Future work: