EXPOSURE: Finding Malicious Domains Using Passive DNS Analysis
Background Knowledge and Insights
- Typical Internet attacks
- Attackers face the same engineering challenges as global enterprises
- Implement a reliable and flexible server infrastructure
- Advantage of DNS
- Change IP
- Hide critical servers
- Migrate servers flexibly
- Servers reached by name are more "fault-tolerant" than servers reached by fixed IP
Goals and Contributions
- Detect malicious domains
- Passive
- Without prior knowledge
- 15 behavioral features
Overview of EXPOSURE
- Five main components
- Data Collector
- Feature Attribution
- Malicious and Benign Domains Collector (Label data)
- Learning Module
- Classifier
- Classify unlabeled domains
Collect Training Data
- DNS traffic from the Security Information Exchange (SIE)
- Response data from authoritative name servers in North America & Europe
- 2.5 months
- More than 100 billion DNS queries (1 million queries/minute on average)
- 4.8 million distinct domain names
- Filtering
- Alexa top 1000 (20% reduction)
- Domains older than 1 year (50% more reduction)
- Malicious domains
- 3500 domains
- Types
- Botnet C&C
- drive-by-download sites
- phishing/scam pages
- Example Sources
- Benign domains
- 3000 domains
- Example Source
- F1: Time-based features
- Short life
- Daily similarity
- Repeating patterns
- Access ratio
- F2: DNS answer-based features
- # distinct IP addresses
- # distinct countries
- # domains sharing the IP
- Reverse DNS query results
- F3: TTL value-based features
- Avg TTL
- # distinct TTL values
- # TTL changes
- % usage of specific TTL ranges
- F4: Domain name-based features
- % numerical characters
- Length of the LMS (Longest Meaningful Substring) as % of name length
Time-Based Features
- Insight
- Malicious domains often increase suddenly and decrease suddenly in the number of requests
- Analysis method
- Divide period into fixed length intervals
- Count queries
- Global scope
- Short life
- Domains generated by a DGA (domain generation algorithm)
- Change behavior
- Local scope
- Main idea
- Zoom into the life time of a domain
- Study behavioral characteristics
- Daily similarity
- An increase or decrease of request count at the same intervals every day
- Regularly repeating patterns
- Instance of change point detection (CPD)
- # changes
- SD of the durations of detected changes
- Access ratio
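The "divide period into fixed length intervals, count queries" step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `hourly_series` and its parameters are hypothetical, and timestamps are assumed to be Unix seconds. The output matches the normalized per-hour series P(t) that change point detection works on.

```python
from collections import Counter

def hourly_series(timestamps, start, end, interval=3600):
    """Count queries per fixed-length interval, then normalize by the max count.

    timestamps: query times in seconds (assumed input format).
    start, end: bounds of the analysis period, in the same unit.
    """
    n_bins = (end - start + interval - 1) // interval
    counts = Counter(
        (ts - start) // interval for ts in timestamps if start <= ts < end
    )
    raw = [counts.get(i, 0) for i in range(n_bins)]
    peak = max(raw, default=0)
    # A domain with no queries in the period yields an all-zero series.
    return [c / peak if peak else 0.0 for c in raw]

# Three one-hour bins holding 2, 3 and 1 queries respectively.
series = hourly_series([0, 10, 3700, 3800, 3900, 7300], start=0, end=3 * 3600)
```

Normalizing by the maximum keeps domains with very different traffic volumes comparable.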
Change Point Detection (CPD)
- Operate on time series
- Goal: find points at which data values change abruptly
- Detecting abrupt changes
- Time series for each domain
- P(t), Request count at hour t, normalized by max count
- Iterate over time intervals of length 3600 s (one hour)
- P−(t), Average of past 8 time intervals
- P+(t), Average of next 8 time intervals
- d(t)=∣P+(t)−P−(t)∣
- Apply Cumulative Sum (CUSUM) algorithm to d(t)
- Detect times t at which d(t) is large and is a local maximum
- CUSUM(t)=Max{0,CUSUM(t−1)+d(t)−local_max}
- Report t as change point if: CUSUM(t)>cusum_max
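The steps above can be sketched in a few lines. This is a hedged illustration under stated assumptions: the threshold values for `local_max` and `cusum_max` are arbitrary placeholders (the slides give none), "next 8 intervals" is taken to start at t+1, and resetting the sum after a report is a choice made here to avoid flagging every following interval.

```python
def change_points(p, window=8, local_max=0.1, cusum_max=0.4):
    """Return indices t reported as change points.

    p: request counts per hour, already normalized by the max count.
    local_max, cusum_max: thresholds; the defaults are illustrative only.
    """
    found = []
    cusum = 0.0
    for t in range(window, len(p) - window):
        past = sum(p[t - window:t]) / window            # P-(t)
        future = sum(p[t + 1:t + 1 + window]) / window  # P+(t)
        d = abs(future - past)                          # d(t)
        cusum = max(0.0, cusum + d - local_max)         # CUSUM(t)
        if cusum > cusum_max:
            found.append(t)
            cusum = 0.0  # assumption: restart accumulation after a report
    return found

# A step from no traffic to peak traffic at hour 24.
step = [0.0] * 24 + [1.0] * 24
flat = [0.5] * 48
step_changes = change_points(step)
flat_changes = change_points(flat)
```

The number of reported change points and the standard deviation of the durations between them would then feed the "repeating patterns" features.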
Detecting Similar Daily Behavior
- Compute distances of all pairs of daily time series
- Normalize each time series by its Avg and SD
- Use Euclidean distance
- dij, Euclidean distance between ith & jth days
- D, Avg of all dij values
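A minimal sketch of the daily similarity computation, assuming each day is given as a list of request counts of equal length (for example 24 hourly values). The function names are hypothetical, and the population standard deviation is used as one reasonable reading of "SD".

```python
import math
from itertools import combinations
from statistics import mean, pstdev

def normalize(day):
    """Shift a daily series by its average and scale by its SD."""
    mu, sd = mean(day), pstdev(day)
    # A constant day has SD 0; map it to all zeros to avoid division by zero.
    return [(x - mu) / sd if sd else 0.0 for x in day]

def daily_similarity(days):
    """D: average Euclidean distance d_ij over all pairs of normalized days."""
    norm = [normalize(d) for d in days]
    dists = [math.dist(a, b) for a, b in combinations(norm, 2)]
    return mean(dists) if dists else 0.0

# Same daily shape at different volumes versus mirrored shapes.
same_shape = daily_similarity([[1, 2, 3, 4], [10, 20, 30, 40]])
mirrored = daily_similarity([[1, 2, 3, 4], [4, 3, 2, 1]])
```

A small D means the domain shows nearly the same request pattern every day, regardless of absolute volume.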
DNS Answer-Based Features
- Domains can map to multiple IPs, and IPs can be shared across different domains
- # distinct IP addresses
- Resolved for a domain during the experiment
- # distinct countries in which the IP addresses are located
- # domains that share the IP
- Can be large for web hosting providers as well
- Reverse DNS query results
- Reduce false positives by checking whether the reverse DNS query results appear in the top 3 Google search results
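The three counting features above can be sketched as below. Everything here is illustrative: the record format (pairs of domain and resolved IP), the `ip_country` lookup table, and the function name are assumptions, and the addresses come from documentation ranges. The reverse DNS check is not covered.

```python
from collections import defaultdict

def answer_features(records, ip_country, domain):
    """Compute the counting features for one domain.

    records: iterable of (domain, ip) pairs seen in DNS answers (assumed format).
    ip_country: hypothetical mapping from IP address to country code.
    """
    ips_of = defaultdict(set)
    domains_of = defaultdict(set)
    for name, ip in records:
        ips_of[name].add(ip)
        domains_of[ip].add(name)

    ips = ips_of.get(domain, set())
    countries = {ip_country[ip] for ip in ips if ip in ip_country}
    # Other domains that resolved to any of this domain's IPs.
    sharing = set().union(*(domains_of[ip] for ip in ips)) - {domain}
    return {
        "distinct_ips": len(ips),
        "distinct_countries": len(countries),
        "domains_sharing_ip": len(sharing),
    }

records = [
    ("a.example", "192.0.2.1"),
    ("a.example", "198.51.100.7"),
    ("b.example", "192.0.2.1"),
    ("c.example", "203.0.113.5"),
]
ip_country = {"192.0.2.1": "US", "198.51.100.7": "DE", "203.0.113.5": "US"}
features = answer_features(records, ip_country, "a.example")
```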
TTL Value-Based Features
- TTL(Time To Live): Length of time to cache a DNS response
- Recommended between 1-5 days
- Insight
- Sophisticated infrastructure of malicious networks causes frequent TTL changes
- Lower TTL values are set for the less reliable hosts
- Avg TTL
- High availability systems
- Low TTL values
- Round Robin DNS
- Example: CDNs, Fast Flux botnets
- Compromised home computers (dynamic IP) assigned much shorter TTL than compromised servers (static IP)
- # distinct TTL values
- # TTL changes
- Higher in malicious domains
- Percentage usage of specific TTL ranges
- Considered ranges: [0,1),[1,10),[10,100),[100,300),[300,900),[900,inf)
- Malicious domains peak at [0,100) ranges
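A minimal sketch of the TTL features, assuming the TTL values observed for one domain are available in the order they were seen and that the ranges are expressed in seconds (the slides do not state the unit). The function and key names are hypothetical.

```python
import bisect
from statistics import mean

# Lower edges of the ranges [0,1), [1,10), [10,100), [100,300), [300,900), [900,inf)
TTL_EDGES = [0, 1, 10, 100, 300, 900]

def ttl_features(ttls):
    """ttls: non-negative TTL values for one domain, in observation order."""
    if not ttls:
        raise ValueError("need at least one TTL observation")
    # A change is counted whenever two consecutive observations differ.
    changes = sum(1 for a, b in zip(ttls, ttls[1:]) if a != b)
    buckets = [0] * len(TTL_EDGES)
    for ttl in ttls:
        buckets[bisect.bisect_right(TTL_EDGES, ttl) - 1] += 1
    return {
        "avg_ttl": mean(ttls),
        "distinct_ttls": len(set(ttls)),
        "ttl_changes": changes,
        "range_share": [b / len(ttls) for b in buckets],
    }

ttl_stats = ttl_features([60, 60, 300, 5, 5, 3600])
```

In this example four of the six observations fall below 100, the region where the slides say malicious domains peak.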
Domain Name-Based Features
- Insight
- Easy-to-remember names
- Important for benign services
- Main purpose of DNS
- Unimportant for attackers (e.g., DGA-generated)
- Ratio of numerical characters to name length
- Ratio of length of the longest meaningful substring (LMS) to length of domain name
- Query name on Google & check the number of hits vs a threshold
- Features applied only to second-level domains (SLD)
→ server.com
- Other possible feature: entropy of the domain name
- DGA-generated names more random than human-generated
- Training period
- Initial period of 7 days (for time-based features)
- Retraining every day
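The domain name-based features can be sketched as follows. Several parts are assumptions made for illustration: the slides decide whether a substring is "meaningful" by querying Google, which is replaced here by a tiny hypothetical word list `WORDS`; the second-level label is extracted naively (multi-part suffixes such as `co.uk` are not handled); and the entropy function covers the "other possible feature" rather than one of the stated features. Labels are assumed to be non-empty.

```python
import math
from collections import Counter

WORDS = {"server", "mail", "bank", "shop"}  # hypothetical stand-in for a search lookup

def sld_label(domain):
    """Naive second-level label: 'x.y.server.com' -> 'server'."""
    parts = domain.lower().rstrip(".").split(".")
    return parts[-2] if len(parts) >= 2 else parts[0]

def numeric_ratio(label):
    """Ratio of numerical characters to name length."""
    return sum(ch.isdigit() for ch in label) / len(label)

def lms_ratio(label, words=WORDS):
    """Ratio of the longest meaningful substring's length to name length."""
    best = 0
    for i in range(len(label)):
        for j in range(i + 1, len(label) + 1):
            if label[i:j] in words:
                best = max(best, j - i)
    return best / len(label)

def entropy(label):
    """Shannon entropy of the character distribution, in bits."""
    n = len(label)
    return -sum(c / n * math.log2(c / n) for c in Counter(label).values())

label = sld_label("x.y.mailserver42.com")
name_features = (numeric_ratio(label), lms_ratio(label), entropy(label))
```

A DGA-style name would typically score a low LMS ratio and a high entropy, while a human-chosen name tends to show the opposite.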
The Classifier
- J48 decision tree
- Feature selection
- Percentage of misclassified instances
C4.5 decision tree algorithm
- Check for base cases
- For each attribute
- Compute attribute's normalized information gain
- Split over attribute with highest gain
- Recurse
- Normalized information gain (gain ratio) = reduction in entropy of class values after the split, divided by the entropy of the split itself
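The split criterion can be illustrated with a short sketch. It treats the attribute as categorical for simplicity; C4.5 (and J48) also handle numeric attributes by searching for a threshold, which is omitted here. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(items):
    """Shannon entropy (bits) of the value distribution in items."""
    n = len(items)
    return -sum(c / n * math.log2(c / n) for c in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of splitting on an attribute, normalized by split entropy.

    values: attribute value per instance (categorical in this sketch).
    labels: class label per instance.
    """
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    # Expected class entropy remaining after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info else 0.0

labels = ["malicious", "malicious", "benign", "benign"]
perfect = gain_ratio(["low", "low", "high", "high"], labels)
useless = gain_ratio(["low", "high", "low", "high"], labels)
```

The tree builder would compute this value for every attribute at a node, split on the highest one, and recurse into each resulting subset.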
Evaluation of the Detection Rate
- 569 domains reported by
- 219 domains queried in the monitored network
- 5 had less than 20 queries
- 211 detected malicious
- Detection rate: 98%
Evaluation of the False Positives
- Filter out domains with < 20 requests in 2.5 months (300,000 domains remaining)
- 17,686 detected as malicious (5.9%)
- Unlabeled domains
- Verification
- Google searches
- Well-known spam lists
- Norton Safe Web
- McAfee Site Advisor
- False positive rate: 7.9%
Limitation (Evasion)
- Attackers have prior information about EXPOSURE
- Assign uniform TTL values across all compromised machines
- Reduces attacker's infrastructure reliability
- Reduce number of DNS lookups of malicious domain
- Not trivial to implement
- Reduces attacker's impact
- Requires high degree of coordination
- EXPOSURE: Finding Malicious Domains Using Passive DNS Analysis (2011)
- CS 259D Lecture 3