
PhishScore: Hacking Phishers’ Minds

CNSM 2014 – Fault Tolerance and Security Track November 18, 2014

Samuel Marchal, Jérôme François, Radu State and Thomas Engel

{samuel.marchal,radu.state,thomas.engel}@uni.lu [email protected]

(2)

PhishScore at a glance

[Figure: architecture diagram. Labels: Internet, Email Server, PhishScore. Pipeline stages: URL, Word extraction, Features computation, Prediction.]

(3)

What is Phishing?

• Use of technical subterfuges and social engineering to steal any kind of valuable consumer data:

• Identity information

• Website credentials: login, password, etc.

• Credit card information

• Etc.

• Causes billions of dollars in losses every year

(4)

Phishing techniques and statistics

• Web based delivery

• Trojan hosts

• Content Injection (website)

• Phishing emails

• Instant messaging

• Fake websites

• etc.

(5)

Phishing website example

(6)

Phishing URLs characteristics

URL characteristics:

• Long URLs (many domain levels, long path, etc.)

• Composed of many labels

• Embed the targeted brand at different URL levels, e.g. Yahoo, Wells Fargo

• Embed specific keywords

www.paypal.creasconsultores.com/www.paypal.com/Resolutioncenter.php

shevkun.org/css/paypal.com/cgi-bin/cmd%3D_login-submit/css/websc.php

us-mg6.mail.yahoo.com.dwarkamaigroup.com/Yahoo.html

emailoans.hostingventure.com.au/bankofamerica.com

nitkowski.pl/components/wellsfargo/questions.php

(7)

Prior Work

URL lexical analysis

• Garera et al. [WORM '07]

Logistic regression with word-based features

• Ma et al. [SIGKDD '09]

Batch classification method with lexical and host-based features

• Blum et al. [AISec '10]

Refined technique with a binary feature for each word/level

• Le et al. [Infocom '11]

Batch and online learning with lexical features and URL features

(8)

Phishing URLs characteristics

www.paypal.creasconsultores.com/www.paypal.com/Resolutioncenter.php

shevkun.org/css/paypal.com/cgi-bin/cmd%3D_login-submit/css/websc.php

us-mg6.mail.yahoo.com.dwarkamaigroup.com/Yahoo.html

emailoans.hostingventure.com.au/bankofamerica.com

nitkowski.pl/components/wellsfargo/questions.php

The registered domain has no relationship with the rest of the URL

• Most parts of URLs can be freely defined

• Except the registered domain: main level domain + public suffix (mld.ps)

http://4ld.3ld.mld.ps/path1/path2?key1=value1&key2=value2

(9)

Proposition for Phishing URL Detection

Hypothesis:

• Components of legitimate URLs are all related

• Registered domains (mld.ps) of phishing URLs are not related to the rest of the URL

Analyse relatedness between mld.ps and the remaining part of a URL: intra-URL relatedness

(10)

Intra-URL relatedness

URL label extraction:

http://4ld.3ld.mld.ps/path1/path2?key1=value1&key2=value2

• Registered domain: “mld” & “mld.ps”

• Remaining part: basic splitting

login.paypal.com/securepayment

• RD_url = {paypal; paypal.com}

• REM_url = {login; secure; payment}
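The label extraction step can be sketched as follows. This is a minimal illustration, not the authors' implementation: `PUBLIC_SUFFIXES` is a tiny hypothetical stand-in for a real public suffix list, `split_url` is a name chosen here, and the splitting is purely delimiter-based, so a concatenated label such as "securepayment" stays a single token instead of becoming "secure" and "payment" as on the slide.

```python
import re
from urllib.parse import urlsplit

# Hypothetical, tiny stand-in for a real public suffix list (assumption).
PUBLIC_SUFFIXES = {"com", "org", "pl", "com.au", "lu"}


def split_url(url):
    """Return (RD_url, REM_url) using delimiter-based splitting only."""
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to locate the host
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    if len(labels) < 2:
        raise ValueError("host name has no registered domain")

    # Longest known public suffix at the end of the host name (default: last label).
    ps_len = 1
    for n in range(len(labels) - 1, 0, -1):
        if ".".join(labels[-n:]) in PUBLIC_SUFFIXES:
            ps_len = n
            break

    mld = labels[-(ps_len + 1)]
    ps = ".".join(labels[-ps_len:])
    rd = {mld, mld + "." + ps}

    # Everything except the registered domain: lower-level labels, path, query.
    rest = " ".join(labels[:-(ps_len + 1)]) + " " + parts.path + " " + parts.query
    rem = {tok for tok in re.split(r"[^a-z0-9]+", rest.lower()) if tok}
    return rd, rem


rd, rem = split_url("login.paypal.com/securepayment")
print(rd, rem)
```

Recovering "secure" and "payment" from "securepayment" would need an additional word segmentation step, which this sketch leaves out.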

(11)

Intra-URL relatedness evaluation

How to evaluate intra-URL relatedness?

• Compare the two sets RD_url and REM_url

• Existing word relatedness techniques: WordNet [Miller90], NGD [Cilibrasi07], DISCO [Kolb08], etc.

Problem: all are dictionary-based, and “Internet” vocabulary is not necessarily contained in a dictionary

Idea: use search engine query data

• Web searches reflect the cognitive behaviour of users looking for services on the Internet (what phishers try to identify and mimic)

• Query well-known services: Google Trends & Yahoo Clues

• See which words are requested together in search engines to [...]

(12)

Intra-URL relatedness evaluation

sezopostos.com/paypalitlogin/us/websrc.html?cmd=_login-run

URL label extraction:

RD_url = {sezopostos, sezopostos.com}

REM_url = {paypal, it, login, us, web, src, html, cmd}

Search engine query data, Term computation (for the word “paypal”):

Term_paypal = {{amazon, paypal}, {paypal, fees}, {ebay, uk}, {paypal, login}}

Related words:

REL_rem = {amazon, paypal, fees, ebay, uk, login}

Associated words:

AS_rem = {amazon, fees, login}

Corresponding sets for the registered domain: AS_rd, REL_rd
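The two word sets in this example can be reproduced with a short sketch. The definitions are inferred from the slide's numbers rather than stated on it: related words are taken as every word appearing in any returned term, and associated words as those sharing a term with the queried word. The function names are chosen here for illustration.

```python
def related_words(terms):
    """REL: every word that appears in any returned term."""
    return set().union(*terms) if terms else set()


def associated_words(word, terms):
    """AS: words that share a term with `word`, excluding `word` itself."""
    assoc = set()
    for term in terms:
        if word in term:
            assoc |= set(term) - {word}
    return assoc


# Terms returned for the word "paypal" in the slide's example.
term_paypal = [
    {"amazon", "paypal"},
    {"paypal", "fees"},
    {"ebay", "uk"},
    {"paypal", "login"},
]

rel = related_words(term_paypal)
assoc = associated_words("paypal", term_paypal)
print(sorted(rel))    # ['amazon', 'ebay', 'fees', 'login', 'paypal', 'uk']
print(sorted(assoc))  # ['amazon', 'fees', 'login']
```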

(13)

Features set

12 features representing intra-URL relatedness:

J_RR, J_RA, J_AA, J_AR, J_ARrd, J_ARrem, card_rem, ratio_Arem, ratio_Rrem, mld_res, mld.ps_res, ranking

Feature categories:

• Word set relatedness (Jaccard index)

• Words embedded in URL

• Popularity of words in URL

• Popularity of registered domain
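The Jaccard index used for the word set relatedness features is the size of the intersection of two sets divided by the size of their union. A minimal sketch follows; the slide does not spell out which pair of sets each J feature compares, so the two sets below are hypothetical, and returning 0.0 for two empty sets is a convention chosen here.

```python
def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b| of two word sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets count as unrelated
    return len(a & b) / len(union)


# Hypothetical word sets for a registered domain and the rest of a URL.
rel_rd = {"paypal", "login", "fees"}
rel_rem = {"login", "fees", "amazon", "ebay"}

print(jaccard(rel_rd, rel_rem))  # 2 shared words out of 5 distinct words -> 0.4
```

Under the hypothesis above, a legitimate URL would tend toward higher values than a phishing URL whose registered domain shares no vocabulary with the rest of the URL.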

(14)

Feature analysis

• Datasets:

• 48,009 phishing URLs (source: PhishTank)

• 48,009 legitimate URLs (source: DMOZ)

• Feature extraction for the whole dataset

(15)

URL classification

• Machine learning approach:

• Determine the best classifier to identify phishing URLs

• 7 classifiers tested: Random Forest, C4.5, JRip, SVM, etc.

• 10-fold cross-validation on the presented feature set (96,016 URLs)

• Random Forest:

94.91% accuracy, 1.44% false positive rate
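For reference, the two reported metrics are computed from confusion-matrix counts as sketched below, assuming "positive" means phishing. The counts are hypothetical and only illustrate the formulas; they are not the paper's results.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all URLs that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def false_positive_rate(fp, tn):
    """Share of legitimate URLs wrongly flagged as phishing."""
    return fp / (fp + tn)


# Hypothetical counts for one evaluation run (assumption, not from the slides).
tp, tn, fp, fn = 90, 95, 5, 10

print(accuracy(tp, tn, fp, fn))     # 185 / 200 = 0.925
print(false_positive_rate(fp, tn))  # 5 / 100 = 0.05
```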

(16)

URL rating

• Random Forest based rating system:

• Use the soft prediction score in [0;1] as URL score:

• 1: phishing URL

• 0: legitimate URL

• Score 0: 22,863 legitimate // 40 phishing

• Score 1: 26 legitimate // 34,790 phishing

• Scores 0 and 1 together: 99.89% correctness on 60.11% of the dataset

• Scores in [0;0.1] and [0.9;1]: 99.22% correctness on [...]
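The correctness and coverage figures for the extreme scores follow directly from the counts on this slide. In the sketch below, the dataset size is taken as the sum of the two 48,009-URL sets from the feature analysis slide; the variable names are chosen here.

```python
# Counts of URLs rated exactly 0 or exactly 1, as given on the slide.
legit_at_0, phish_at_0 = 22_863, 40
legit_at_1, phish_at_1 = 26, 34_790

# Assumption: dataset size is the sum of the two 48,009-URL sets.
dataset_size = 2 * 48_009

rated = legit_at_0 + phish_at_0 + legit_at_1 + phish_at_1
correct = legit_at_0 + phish_at_1  # legitimate rated 0, phishing rated 1

correctness = correct / rated
coverage = rated / dataset_size

print(f"{correctness:.2%} correctness on {coverage:.2%} of the dataset")
# prints: 99.89% correctness on 60.11% of the dataset
```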

(17)

Conclusion

Lexical analysis to detect phishing URLs:

• Intra-URL relatedness

• Word relatedness inferred with search engine query data

• Phishing URL detection: 95% accuracy (FP rate = 1.44%)

• URL rating system: >99% correctness for >80% of URLs

Future Work:

• Use distributed on-line processing (Big Data) to reduce delay

• Implementation as phishing email filtering and browser add-on

PhishScore

(19)

Phishing summary

• Phishing:

• seeks to steal different kinds of data

• targets several industry sectors

• uses various techniques

Is there a global characteristic of phishing? No, but most phishing attacks rely on fake websites using redirecting links

Phishing detection technique with wide scope:

Phishing URL identification
