PhishScore: Hacking Phishers’ Minds
CNSM 2014 – Fault Tolerance and Security Track November 18, 2014
Samuel Marchal, Jérôme François, Radu State and Thomas Engel
{samuel.marchal,radu.state,thomas.engel}@uni.lu [email protected]
PhishScore at a glance
[Figure: processing pipeline (URL, word extraction, features computation, prediction) and deployment elements (email server, PhishScore, Internet)]
What is Phishing?
• Use of technical subterfuge and social engineering to steal any kind of valuable consumer data:
• Identity information
• Website credentials: login, password, etc.
• Credit card information
• Etc.
• Causes billions of dollars in losses every year
Phishing techniques and statistics
• Web based delivery
• Trojan hosts
• Content Injection (website)
• Phishing emails
• Instant messaging
• Fake websites
• etc.
Phishing website example
Phishing URLs characteristics
URL characteristics:
• Long URLs (many domain levels, long path, etc.)
• Composed of many labels
• Embed the targeted brand at different URL levels, e.g. Yahoo, Wells Fargo
• Embed specific keywords
Examples:
www.paypal.creasconsultores.com/www.paypal.com/Resolutioncenter.php
shevkun.org/css/paypal.com/cgi-bin/cmd%3D_login-submit/css/websc.php
us-mg6.mail.yahoo.com.dwarkamaigroup.com/Yahoo.html
emailoans.hostingventure.com.au/bankofamerica.com
nitkowski.pl/components/wellsfargo/questions.php
Prior Work
URL lexical analysis
• Garera et al. [WORM '07]
Logistic regression with word-based features
• Ma et al. [SIGKDD '09]
Batch classification method with lexical and host-based features
• Blum et al. [AISec '10]
Refined technique with a binary feature for each word/level
• Le et al. [Infocom '11]
Batch and online learning with lexical features and URL features
Phishing URLs characteristics
The registered domain has no relationship with the rest of the URL
• Most parts of URLs can be freely defined
• Except the registered domain: main level domain (mld) + public suffix (ps)
http://4ld.3ld.mld.ps/path1/path2?key1=value1&key2=value2
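To make the URL structure concrete, here is a minimal Python sketch that splits a URL into its free parts and its registered domain. The `PUBLIC_SUFFIXES` set is a tiny hypothetical stand-in for the real Public Suffix List, and `split_url` is an illustrative name, not part of PhishScore.

```python
from urllib.parse import urlsplit

# Hypothetical, tiny stand-in for the Public Suffix List (assumption).
PUBLIC_SUFFIXES = {"com", "org", "pl", "com.au"}

def split_url(url):
    """Return (subdomains, mld, ps, path, query) for a URL string."""
    if "://" not in url:
        url = "http://" + url  # urlsplit only finds the host after a scheme
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    # Scan from the left so the longest matching public suffix wins.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in PUBLIC_SUFFIXES:
            break
    else:
        raise ValueError("unknown public suffix: " + parts.hostname)
    if i == 0:
        raise ValueError("hostname is a bare public suffix")
    return labels[:i - 1], labels[i - 1], candidate, parts.path, parts.query

subdomains, mld, ps, path, query = split_url(
    "emailoans.hostingventure.com.au/bankofamerica.com")
print(mld + "." + ps)  # the registered domain, the only part not freely defined
```

In this example the brand name sits in the path, while the registered domain is unrelated to it.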
Proposition for Phishing URL Detection
Hypothesis:
• Components of legitimate URLs are all related
• Registered domains (mld.ps) of phishing URLs are not related to the remainder of the URL
Analyse relatedness between mld.ps and the remaining part of a URL: intra-URL relatedness
Intra-URL relatedness
URL label extraction:
http://4ld.3ld.mld.ps/path1/path2?key1=value1&key2=value2
• Registered domain: “mld” & “mld.ps”
• Rest of the URL: basic splitting
Example: login.paypal.com/securepayment
• RD_url = {paypal; paypal.com}
• REM_url = {login; secure; payment}
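The label extraction step can be sketched as follows. The slides do not say how concatenated words such as "securepayment" are separated, so this sketch assumes a dictionary-based segmentation over a small hypothetical `VOCABULARY`; `segment` and `extract_labels` are illustrative names.

```python
import re

# Hypothetical vocabulary used to segment concatenated words (assumption).
VOCABULARY = {"login", "secure", "payment", "paypal"}

def segment(token, vocabulary=VOCABULARY):
    """Split a token into vocabulary words; return [token] if impossible."""
    # best[i] holds one segmentation of token[:i], or None if none exists.
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            if best[start] is not None and token[start:end] in vocabulary:
                best[end] = best[start] + [token[start:end]]
                break
    return best[-1] if best[-1] is not None else [token]

def extract_labels(mld, ps, remainder):
    """Build the two word sets: registered domain and rest of the URL."""
    rd = {mld, mld + "." + ps}
    rem = set()
    # Basic splitting on any non-alphanumeric character, then segmentation.
    for token in re.split(r"[^a-z0-9]+", remainder.lower()):
        if token:
            rem.update(segment(token))
    return rd, rem

# login.paypal.com/securepayment: subdomain and path form the remainder.
rd_url, rem_url = extract_labels("paypal", "com", "login/securepayment")
print(sorted(rd_url), sorted(rem_url))
```

Tokens that cannot be segmented with the vocabulary are kept whole, so unknown words still end up in the set.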
How to evaluate intra-URL relatedness?
• Compare the two sets RD_url and REM_url
• Existing word relatedness techniques: WordNet [Miller90], NGD [Cilibrasi07], DISCO [Kolb08], etc.
Problem: all are dictionary based, and “Internet” vocabulary is not necessarily contained in a dictionary
• Idea: use search engine query data
• Web searches reflect the cognitive behaviour of users looking for services on the Internet (what phishers try to identify and mimic)
• Query well-known services: Google Trends & Yahoo Clues
• See which words are requested together in search engines to infer word relatedness
Intra-URL relatedness evaluation
Example: sezopostos.com/paypalitlogin/us/websrc.html?cmd=_login-run
URL label extraction:
• RD_url = {sezopostos, sezopostos.com}
• REM_url = {paypal, it, login, us, web, src, html, cmd}
Search engine query data, e.g. for the word “paypal”:
• Term = {{amazon,paypal},{paypal,fees},{ebay,uk},{paypal,login}}
Term computation:
• Related words: REL_rem = {amazon, paypal, fees, ebay, uk, login}
• Associated words: AS_rem = {amazon, fees, login}
• Likewise for the registered domain: REL_rd and AS_rd
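The term computation in this example can be reproduced with a few lines of Python. This is one reading of the slide, not a confirmed definition: related words are taken as every word found in the returned search terms, and associated words as those sharing a term with the queried word. The function names are illustrative.

```python
def related_words(terms):
    """All words that appear in any of the returned search terms."""
    return set().union(*terms)

def associated_words(terms, word):
    """Words that share at least one search term with `word`."""
    associated = set()
    for term in terms:
        if word in term:
            associated |= term - {word}
    return associated

# Search terms returned for "paypal" in the slide's example.
terms = [{"amazon", "paypal"}, {"paypal", "fees"},
         {"ebay", "uk"}, {"paypal", "login"}]

rel_rem = related_words(terms)
as_rem = associated_words(terms, "paypal")
print(sorted(rel_rem))
print(sorted(as_rem))
```

Under this reading the two functions return the REL and AS sets listed for "paypal" in the example.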
Features set
12 features representing intra-URL relatedness:
J_RR, J_RA, J_AA, J_AR, J_ARrd, J_ARrem, card_rem, ratio_Arem, ratio_Rrem, mld_res, mld.ps_res, ranking
Feature categories:
• Word set relatedness (Jaccard index)
• Words embedded in URL
• Popularity of words in URL
• Popularity of registered domain
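The Jaccard index behind the word set relatedness features is the size of the intersection of two sets divided by the size of their union. The sketch below shows the computation only. Which pairs of related and associated word sets feed each J feature is not spelled out on the slide, so the pairs used here are illustrative.

```python
def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b|, defined as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Sets taken from the "paypal" example; the pairing is an assumption.
as_rem = {"amazon", "fees", "login"}
rel_rem = {"amazon", "paypal", "fees", "ebay", "uk", "login"}
rd_url = {"sezopostos", "sezopostos.com"}

print(jaccard(as_rem, rel_rem))  # overlap between two sets of the URL remainder
print(jaccard(rd_url, rel_rem))  # registered domain vs. remainder: no overlap
```

A value near 0 between the registered domain sets and the remainder sets is what the hypothesis expects for phishing URLs.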
Feature analysis
• Datasets:
• 48,009 phishing URLs (source: PhishTank)
• 48,009 legitimate URLs (source: DMOZ)
• Feature extraction for the whole dataset
URL classification
• Machine learning approach:
• Determine the best classifier to identify phishing URLs
• 7 classifiers tested: Random Forest, C4.5, JRip, SVM, etc.
• 10-fold cross-validation on the presented feature set (96,018 URLs)
• Random Forest: 94.91% accuracy, 1.44% FP rate
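For reference, the two reported metrics can be computed from a confusion matrix as sketched below, assuming phishing is the positive class. The counts in the usage example are made up for illustration and are not the paper's results.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all URLs that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def false_positive_rate(fp, tn):
    """Share of legitimate URLs wrongly flagged as phishing."""
    return fp / (fp + tn)

# Hypothetical counts (assumption): 100 phishing and 100 legitimate URLs.
tp, fn = 90, 10   # phishing URLs detected / missed
tn, fp = 95, 5    # legitimate URLs accepted / wrongly flagged

print(f"accuracy = {accuracy(tp, tn, fp, fn):.2%}")
print(f"FP rate  = {false_positive_rate(fp, tn):.2%}")
```

A low FP rate matters here because each false positive is a legitimate URL wrongly flagged as phishing.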
URL rating
• Random Forest based rating system:
• Use soft prediction score [0;1] as URL score:
• 1: phishing URL
• 0: legitimate URL
• Score 0: 22,863 legitimate // 40 phishing
• Score 1: 26 legitimate // 34,790 phishing
• Scores 0 and 1: 99.89% correctness on 60.11% of the dataset
• Scores in [0;0.1] and [0.9;1]: 99.22% correctness on > 80% of the dataset
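The correctness and coverage figures for the extreme scores follow directly from the counts on this slide, as the sketch below shows. The `decide` helper and its "uncertain" label are assumptions added for illustration: the slide only gives the score ranges [0;0.1] and [0.9;1].

```python
TOTAL_URLS = 2 * 48_009  # phishing + legitimate URLs in the dataset

def correctness_and_coverage(legit_at_0, phish_at_0, legit_at_1, phish_at_1,
                             total=TOTAL_URLS):
    """Return (share of rated URLs rated correctly, share of dataset rated)."""
    rated = legit_at_0 + phish_at_0 + legit_at_1 + phish_at_1
    correct = legit_at_0 + phish_at_1  # score 0 means legitimate, 1 phishing
    return correct / rated, rated / total

def decide(score, low=0.1, high=0.9):
    """Map a soft prediction score to a label (hypothetical decision rule)."""
    if score <= low:
        return "legitimate"
    if score >= high:
        return "phishing"
    return "uncertain"

correctness, coverage = correctness_and_coverage(22_863, 40, 26, 34_790)
print(f"{correctness:.2%} correctness on {coverage:.2%} of the dataset")
print(decide(0.03), decide(0.5), decide(0.97))
```

Widening the accepted score ranges rates more URLs at the cost of a slightly lower correctness, which is the trade-off the two result lines illustrate.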
Conclusion
Lexical analysis to detect phishing URLs:
• Intra-URL relatedness
• Word relatedness inferred with search engine query data
• Phishing URL detection: 95% accuracy (FP rate = 1.44%)
• URL rating system: > 99% correctness for > 80% of URLs
Future Work:
• Use distributed on-line processing (Big Data) to reduce delay
• Implementation as phishing email filtering and browser add-on
Phishing summary
• Phishing:
• seeks to steal different kinds of data
• targets several industry sectors
• uses various techniques