Cantina: A content-based approach to detecting phishing websites
Background Knowledge and Insights
- TF-IDF
- Measure importance of word in document
- TF, frequency of word in document
- IDF, measure popularity of word in corpus
- Log(N/#{documents having the term})
- tf-idf(t,d,D)=tf(t,d)×idf(t,D)
- Robust Hyperlinks
- Lexical signatures for identifying URLs
- Signature words chosen using TF-IDF
- Experiments: 5 terms enough for unique page identification
- Observations
- Minimal changes to original page
- Detectable via robust hyperlinks
- Phishing sites often include brand names
- Common on brand's webpages
- Rare on the web
Goals and Contributions
- CANTINA makes use of TF-IDF for detecting phishing sites
- Apply Robust Hyperlinks to anti-phishing
- Not rely on blacklists
Data Source
- English language sites
- Phishing URLs
- Legitimate URLs
- 3Sharp's study of anti-phishing toolbars
- Select the login pages of 35 sites that are often attacked by phishers
- Select the 35 top pages from Alexa Web Search
- Select 30 random pages from random.yahoo.com/fast/ryl
- URLs gathered from email
CANTINA
Words
- List of word frequencies
- British National Corpus
- 67,962,112 total words
- 9,022 unique words
- Analyze page
- Downloaded web page
- Document Object Model (DOM)
Feature
- Assumption
- Phishing pages have low pagerank
- Lack of links pointing to the phishing pages
- Average time that a phishing site stays online is 4.5 days
- Workflow
- Compute term TF-IDFs
- Find top 5 terms
- Submit terms as query to Google
- Check if domain is among top-N results
- Decrease False Positives
- Include domain name in lexical signature
- Zero results Means Phishing (ZMP)
Other Features
- Age of domain
- Known images
- Presence of inconsistent well-known logos
- Top-10 identified targets: eBay, PayPal, Citibank, Bank of America, Fifth Third Bank, Barclays Bank, ANZ Bank, Chase Bank, and Wells Fargo Bank
- Suspicious URL
- Contains
@
or –
in domain name
- Suspicious links
- IP address as domain
- Dots in URL
- Forms
- HTML
<input>
tag, with text such as "credit card", "password"
Limitation
- Not include JaveSvript
- Non-English web sites
- Rely on Google query
- Timeout
- Denial service by Google
- Google's PageRank, SEO
- TF-IDF
- Images instead of words
- Invisible text
References
- Cantina: A content-based approach to detecting phishing websites, Zhang et al, 2007
- Kim Giglia, CSC 682 CANTINA.ppt
- CS 259D Lecture 16