PhishScore: Hacking Phishers’ Minds

(1)

CNSM 2014 – Fault Tolerance and Security Track, November 18, 2014

Samuel Marchal, Jérôme François, Radu State and Thomas Engel

{samuel.marchal,radu.state,thomas.engel}@uni.lu

(2)

PhishScore at a glance

[Diagram: Internet, PhishScore, Email Server. PhishScore pipeline: URL → Word extraction → Features computation → Prediction]

(3)

What is Phishing?

• Use of technical subterfuge and social engineering to steal valuable consumer data:
  • Identity information
  • Website credentials: login, password, etc.
  • Credit card information
  • Etc.
• Causes billions of dollars in losses every year

(4)

Phishing techniques and statistics

• Web based delivery

• Trojan hosts

• Content Injection (website)

• Phishing emails

• Instant messaging

• Fake websites

• etc.

(5)

Phishing website example

(6)

Phishing URLs characteristics

URL characteristics:

• Long URLs (many domain levels, long path, etc.)
• Composed of many labels
• Embed the targeted brand at different URL levels, e.g. Yahoo, Wells Fargo
• Embed specific keywords

Examples:

www.paypal.creasconsultores.com/www.paypal.com/Resolutioncenter.php
shevkun.org/css/paypal.com/cgi-bin/cmd%3D_login-submit/css/websc.php
us-mg6.mail.yahoo.com.dwarkamaigroup.com/Yahoo.html
emailoans.hostingventure.com.au/bankofamerica.com
nitkowski.pl/components/wellsfargo/questions.php

(7)

Prior Work

URL lexical analysis

• Garera et al. [WORM '07]: logistic regression with word-based features
• Ma et al. [SIGKDD '09]: batch classification method with lexical and host-based features
• Blum et al. [AISec '10]: refined technique with a binary feature for each word/level
• Le et al. [Infocom '11]: batch and online learning with lexical features and URL features

(8)

Phishing URLs characteristics

www.paypal.creasconsultores.com/www.paypal.com/Resolutioncenter.php
shevkun.org/css/paypal.com/cgi-bin/cmd%3D_login-submit/css/websc.php
us-mg6.mail.yahoo.com.dwarkamaigroup.com/Yahoo.html
emailoans.hostingventure.com.au/bankofamerica.com
nitkowski.pl/components/wellsfargo/questions.php

The registered domain has no relationship with the rest of the URL

• Most parts of URLs can be freely defined
• Except the registered domain: main level domain + public suffix (mld.ps)

http://4ld.3ld.mld.ps/path1/path2?key1=value1&key2=value2

(9)

Proposition for Phishing URL Detection

Hypothesis:

• Components of legitimate URLs are all related

• Registered domains (mld.ps) of phishing URLs are not related to the remainder of the URL

Analyse the relatedness between mld.ps and the remaining part of a URL: intra-URL relatedness

(10)

Intra-URL relatedness

URL label extraction:

http://4ld.3ld.mld.ps/path1/path2?key1=value1&key2=value2

• Registered domain (mld.ps): “mld” & “mld.ps”
• Remaining part of the URL: basic splitting

Example: login.paypal.com/securepayment

• RD_url = {paypal; paypal.com}
• REM_url = {login; secure; payment}
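The label extraction step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `PUBLIC_SUFFIXES` is a tiny hypothetical stand-in for the real Public Suffix List, `VOCABULARY` is a toy word list, and the greedy longest-match segmentation is an assumed way to split concatenated words such as "securepayment".

```python
import re
from urllib.parse import urlsplit

# Assumption: tiny stand-in for the Public Suffix List, for illustration only.
PUBLIC_SUFFIXES = {"com", "org", "pl", "com.au"}

# Assumption: toy vocabulary used to segment concatenated words.
VOCABULARY = {"login", "secure", "payment", "paypal", "web", "src"}


def split_host(host):
    """Return (subdomain labels, mld, public suffix) for a host name."""
    labels = host.lower().split(".")
    if len(labels) < 2:
        raise ValueError(f"cannot find a registered domain in {host!r}")
    # Try the longest candidate suffix first, keeping one label for the mld.
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in PUBLIC_SUFFIXES:
            return labels[:i - 1], labels[i - 1], suffix
    # Fallback: treat the last label as the public suffix.
    return labels[:-2], labels[-2], labels[-1]


def segment(token):
    """Greedily split a token into vocabulary words; keep it whole on failure."""
    words, start = [], 0
    while start < len(token):
        for end in range(len(token), start, -1):
            if token[start:end] in VOCABULARY:
                words.append(token[start:end])
                start = end
                break
        else:
            return [token]
    return words


def extract_labels(url):
    """Return (RD_url as a set, REM_url as an ordered list without duplicates)."""
    parts = urlsplit(url if "//" in url else "//" + url)
    subdomains, mld, suffix = split_host(parts.hostname)
    rd = {mld, f"{mld}.{suffix}"}
    rest = " ".join(subdomains + [parts.path, parts.query])
    rem = []
    for token in re.findall(r"[a-z0-9]+", rest.lower()):
        for word in segment(token):
            if word not in rem:
                rem.append(word)
    return rd, rem


rd_url, rem_url = extract_labels("login.paypal.com/securepayment")
```

With the toy vocabulary above, `rd_url` and `rem_url` reproduce the RD_url and REM_url sets of the slide example. A real deployment would need the full Public Suffix List and a much larger vocabulary.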

(11)

Intra-URL relatedness evaluation

How to evaluate intra-URL relatedness?

• Compare the two sets RD_url and REM_url
• Existing word relatedness techniques: Wordnet [Miller90], NGD [Cilibrasi07], Disco [Kolb08], etc.
  Problem: all are dictionary-based, and “Internet” vocabulary is not necessarily contained in a dictionary
• Idea: use search engine query data
  • Web searches reflect the cognitive behaviour of users looking for services on the Internet (what phishers try to identify and to mimic)
  • Query well-known services: Google Trends & Yahoo Clues
  • See which words are requested together in search engines to infer word relatedness

(12)

Intra-URL relatedness evaluation

sezopostos.com/paypalitlogin/us/websrc.html?cmd=_login-run

URL label extraction:

RD_url = {sezopostos, sezopostos.com}
REM_url = {paypal, it, login, us, web, src, html, cmd}

Search engine query data (term computation), e.g. for “paypal”:

Term_paypal = {{amazon, paypal}, {paypal, fees}, {ebay, uk}, {paypal, login}}

Related words: REL_rem = {amazon, paypal, fees, ebay, uk, login}
Associated words: AS_rem = {amazon, fees, login}

Likewise for the registered domain: REL_rd and AS_rd
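The example above suggests how the two word sets could be derived from the returned search terms. The sketch below is an interpretation inferred from that example, not a definition taken from the authors: related words are assumed to be every word appearing in any returned term, and associated words are assumed to be the words that share a term with the queried word. The function name `related_and_associated` is hypothetical.

```python
def related_and_associated(word, terms):
    """Return (REL, AS) for a queried word, given the search terms returned for it."""
    related = set()      # every word appearing in any returned term
    associated = set()   # words that share a term with the queried word
    for term in terms:
        related |= set(term)
        if word in term:
            associated |= set(term) - {word}
    return related, associated


# Terms from the slide example for the word "paypal".
term_paypal = [
    {"amazon", "paypal"},
    {"paypal", "fees"},
    {"ebay", "uk"},
    {"paypal", "login"},
]
rel_rem, as_rem = related_and_associated("paypal", term_paypal)
```

Under these assumptions the two results match the REL_rem and AS_rem sets shown in the example. Extending this to a whole label set would presumably mean taking the union of the per-word results.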

(13)

Features set

12 features representing intra-URL relatedness:

• Word set relatedness (Jaccard index): J_RR, J_RA, J_AA, J_AR, J_ARrd, J_ARrem
• Words embedded in URL: card_rem
• Popularity of words in URL: ratio_Arem, ratio_Rrem
• Popularity of registered domain: mld_res, mld.ps_res, ranking
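As a minimal sketch of the word set relatedness features, the Jaccard index of two sets is the size of their intersection divided by the size of their union. Reading J_ARrem as the Jaccard index between AS_rem and REL_rem is an assumption based on the feature name only. Returning 0.0 for two empty sets is also a convention chosen here.

```python
def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b|; returns 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Sets taken from the sezopostos.com example.
rel_rem = {"amazon", "paypal", "fees", "ebay", "uk", "login"}
as_rem = {"amazon", "fees", "login"}

# Assumed reading of the feature name: J_ARrem = J(AS_rem, REL_rem).
j_ar_rem = jaccard(as_rem, rel_rem)
```

The hypothesis would predict values near 0 for sets drawn from the registered domain and from the rest of a phishing URL, since the two share few or no words.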

(14)

Feature analysis

• Datasets:

• 48,009 phishing URLs (source: PhishTank)

• 48,009 legitimate URLs (source: DMOZ)

• Feature extraction for the whole dataset

(15)

URL classification

• Machine learning approach:

• Determine the best classifier to identify phishing URLs

• 7 classifiers tested: Random Forest, C4.5, JRip, SVM, etc.

• 10-fold cross-validation on the presented feature set (96,018 URLs)

• Random Forest: 94.91% accuracy, 1.44% false positive rate

(16)

URL rating

• Random Forest based rating system:
  • Use the soft prediction score in [0;1] as the URL score:
    • 1: phishing URL
    • 0: legitimate URL
• Scores of exactly 0 or 1: 99.89% correctness on 60.11% of the dataset
  • 0: 22,863 legitimate // 40 phishing
  • 1: 26 legitimate // 34,790 phishing
• Scores in [0;0.1] and [0.9;1]: 99.22% correctness on more than 80% of the dataset
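The correctness and coverage figures for scores of exactly 0 or 1 can be rechecked from the counts on this slide and the dataset sizes given earlier. The `rate` helper that follows is a hypothetical illustration of how the confident score bands might be turned into a verdict. Its "uncertain" label and its parameter names are assumptions, not part of the presented system.

```python
# Counts reported on the slide for URLs scored exactly 0 or exactly 1.
score_0 = {"legitimate": 22_863, "phishing": 40}
score_1 = {"legitimate": 26, "phishing": 34_790}
dataset_size = 2 * 48_009  # phishing + legitimate URLs from the dataset slide

correct = score_0["legitimate"] + score_1["phishing"]
rated = sum(score_0.values()) + sum(score_1.values())

correctness = 100 * correct / rated
coverage = 100 * rated / dataset_size
print(f"{correctness:.2f}% correctness on {coverage:.2f}% of the dataset")


def rate(score, low=0.1, high=0.9):
    """Map a soft score in [0, 1] to a verdict, abstaining in the middle band."""
    if score <= low:
        return "legitimate"
    if score >= high:
        return "phishing"
    return "uncertain"  # assumption: scores between the bands get no verdict
```

Widening the bands from the exact scores 0 and 1 to [0;0.1] and [0.9;1] trades a little correctness for a larger share of URLs that receive a confident verdict.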

(17)

Conclusion

Lexical analysis to detect phishing URLs:

• Intra-URL relatedness

• Word relatedness inferred with search engine query data

• Phishing URL detection: 95% accuracy (FP rate = 1.44%)

• URL rating system: >99% correctness for >80% of URLs

Future Work:

• Use distributed online processing (Big Data) to reduce delay

• Implement PhishScore as a phishing email filter and a browser add-on

PhishScore

(18)