We applied Centurion to one month (February 2016) of security logs from a Fortune 500 enterprise. This dataset is extremely noisy and heterogeneous, unlike the homogeneous dataset in our paper "Early and Robust Malware Detection in Enterprise Networks", which models deployments like osquery at Facebook. It includes an average of almost 200 million security-related log entries per day from almost 20 different local detectors (Blue Coat, Symantec, McAfee, F5, Cisco, etc.), generated by a system with over 150K devices (including personal computers, mobile devices, servers, firewalls, routers, and other network elements), all recorded in a ‘market leading’ SIEM tool for global malware analysis. Devices come and go, log entries can be delayed by up to two months, and logs with confirmed signature-based malware infections are very rare (0.29%). Hence, ground-truth infection data is hard to obtain. The goal here is to filter the 6 billion log entries from February 2016, which include 700,000 different server IPs, down to a few tens of incidents that can be manually analyzed.
First, the data is high-dimensional and sparse. Each log entry contains 466 fields holding mostly categorical values, and most of these are empty or filled with uninformative default values. Directly one-hot encoding the categorical data would produce a 330,000-dimensional vector, so we keep the top 20 security-critical dimensions and encode their values manually before applying one-hot encoding, which yields 364-dimensional vectors. Determining unique identities for devices, and identifying the usernames and domain names associated with the IP addresses in each row, takes several analysis passes over the sparse dataset.
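The encoding step can be illustrated with a minimal sketch. It assumes log entries are available as dictionaries and that the security-critical fields have already been chosen; the field names `category` and `action` are hypothetical placeholders, not the fields used in the study, and the manual value encoding that precedes one-hot encoding is not shown.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]
Vocab = Dict[Tuple[str, str], int]


def build_vocabulary(rows: Sequence[Row], fields: Sequence[str]) -> Vocab:
    """Assign one column index to every (field, value) pair seen in the data."""
    vocab: Vocab = {}
    for row in rows:
        for field in fields:
            value = row.get(field)
            if value is None or value == "":
                continue  # empty fields contribute no dimension
            key = (field, value)
            if key not in vocab:
                vocab[key] = len(vocab)
    return vocab


def one_hot(row: Row, fields: Sequence[str], vocab: Vocab) -> List[int]:
    """Encode one log entry as a 0/1 vector over the vocabulary columns."""
    vector = [0] * len(vocab)
    for field in fields:
        key = (field, row.get(field, ""))
        if key in vocab:
            vector[vocab[key]] = 1
    return vector


# Hypothetical rows: only the selected fields are encoded, "unused" is ignored.
rows = [
    {"category": "Spam", "action": "blocked", "unused": "x"},
    {"category": "News", "action": "allowed"},
    {"category": "Spam", "action": ""},
]
selected_fields = ["category", "action"]
vocab = build_vocabulary(rows, selected_fields)
vectors = [one_hot(r, selected_fields, vocab) for r in rows]
```

Restricting the vocabulary to a short list of fields is what keeps the vector width small: the width equals the number of distinct (field, value) pairs among the selected fields, not among all fields in the log.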
We form neighborhoods based on the accessed domains; a neighborhood thus includes devices accessing different URLs hosted in the same domain. This yields 400,000 unique neighborhoods. We apply two heuristics to rank these neighborhoods by maliciousness. First, we run the SecureRank algorithm, an adaptation of the PageRank algorithm to the problem of ranking domains by their potential maliciousness. We initialize SecureRank by labeling domains that belong to certain Blue Coat categories (e.g., Suspicious, Spam, Scam/Questionable/Illegal) as malicious. SecureRank outputs a list of domain names sorted by malicious score.
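As a rough illustration of these two steps, the sketch below groups devices by the host they accessed and then spreads a maliciousness score from seed domains over the device-domain graph. The propagation is a generic personalized-PageRank-style iteration written for this example; it is not the SecureRank algorithm itself, whose details are not given here. Using the URL hostname as the "domain", and the function names, damping factor, and iteration count, are all assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def form_neighborhoods(accesses):
    """Group devices by accessed host. accesses: iterable of (device, url)."""
    hoods = defaultdict(set)
    for device, url in accesses:
        host = urlsplit(url).hostname  # None if the URL has no network location
        if host:
            hoods[host.lower()].add(device)
    return dict(hoods)


def propagate_scores(hoods, seeds, damping=0.85, iterations=50):
    """PageRank-style propagation with restarts at the seed (labeled) domains."""
    device_domains = defaultdict(set)
    for domain, devices in hoods.items():
        for device in devices:
            device_domains[device].add(domain)

    labeled = [s for s in seeds if s in hoods]
    if not labeled:
        return {domain: 0.0 for domain in hoods}
    restart = {d: (1.0 / len(labeled) if d in labeled else 0.0) for d in hoods}

    score = dict(restart)
    for _ in range(iterations):
        # Step 1: each domain splits its score evenly among its devices.
        device_score = defaultdict(float)
        for domain, devices in hoods.items():
            share = score[domain] / len(devices)
            for device in devices:
                device_score[device] += share
        # Step 2: each device splits its score evenly among its domains,
        # and a (1 - damping) fraction of the mass restarts at the seeds.
        updated = {d: (1.0 - damping) * restart[d] for d in hoods}
        for device, domains in device_domains.items():
            share = damping * device_score[device] / len(domains)
            for domain in domains:
                updated[domain] += share
        score = updated
    return score


# Hypothetical access log and seed label.
accesses = [
    ("pc1", "http://bad.example/a"),
    ("pc1", "http://shared.example/x"),
    ("pc2", "http://shared.example/y"),
    ("pc3", "http://clean.example/"),
]
hoods = form_neighborhoods(accesses)
scores = propagate_scores(hoods, seeds={"bad.example"})
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy input, `clean.example` shares no device with the seed domain and therefore receives no score, while `shared.example` inherits score through `pc1`.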
Our second heuristic is to query the VirusTotal scanner for information about security incidents related to a particular domain. VirusTotal’s reports include the number of malicious URLs, the number of malware samples communicating with the domain, and the number of samples embedding URL strings that lead to the domain. If VirusTotal has no information about incidents related to a domain, we discard that domain from the list of suspicious domains (i.e., the candidates for the neighborhood formation process).
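The filtering rule can be sketched as follows, assuming the reports have already been retrieved and reduced to the three counts mentioned above. The `DomainReport` class and its field names are hypothetical and do not mirror VirusTotal's actual response format; treating an all-zero report the same as a missing one is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DomainReport:
    """Hypothetical summary of a per-domain report (not VirusTotal's schema)."""
    malicious_urls: int = 0
    communicating_samples: int = 0
    embedding_samples: int = 0

    def has_incidents(self) -> bool:
        # Assumption: a report with all counts at zero carries no incident info.
        total = (self.malicious_urls
                 + self.communicating_samples
                 + self.embedding_samples)
        return total > 0


def filter_candidates(ranked_domains: List[str],
                      reports: Dict[str, DomainReport]) -> List[str]:
    """Keep only domains with incident information, preserving rank order."""
    return [
        domain for domain in ranked_domains
        if domain in reports and reports[domain].has_incidents()
    ]


# Hypothetical ranked list and reports.
ranked_domains = ["a.example", "b.example", "c.example"]
reports = {
    "a.example": DomainReport(malicious_urls=3),
    "b.example": DomainReport(),  # known to the scanner, but no incidents
}
candidates = filter_candidates(ranked_domains, reports)
```

Here `b.example` is dropped because its report is empty, and `c.example` is dropped because no report exists for it.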
Results. We experimented with three values of the expected global false positive rate: 3%, 2%, and 1% (Table below). For each false positive rate, we computed the number of security incidents in February that could have been prevented once Centurion declared the domain (i.e., its neighborhood) infected. The chosen false positive rates result in labeling 25, 15, and 5 domains/neighborhoods (respectively) as infected, and yet preempt 15, 11, and 7 (respectively) of the 30 exploits. Varying the false positive rate between 1% and 3% lets an analyst trade off the number of domains that must be manually analyzed against the number of exploits that are preempted.
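To make the trade-off concrete, the short sketch below derives two ratios from the figures reported above: the fraction of the 30 exploits preempted, and the number of exploits preempted per domain an analyst must review. The variable and key names are illustrative only.

```python
TOTAL_EXPLOITS = 30

# (false positive rate, domains labeled infected, exploits preempted)
reported = [
    (0.03, 25, 15),
    (0.02, 15, 11),
    (0.01, 5, 7),
]


def summarize(rows, total):
    """Compute coverage and per-domain yield for each operating point."""
    summary = []
    for fpr, domains, preempted in rows:
        summary.append({
            "fpr": fpr,
            "domains": domains,
            "preempted": preempted,
            "coverage": preempted / total,         # share of all exploits caught
            "per_domain": preempted / domains,     # exploits caught per review
        })
    return summary


summary = summarize(reported, TOTAL_EXPLOITS)
for point in summary:
    print(f"FPR {point['fpr']:.0%}: {point['domains']} domains, "
          f"coverage {point['coverage']:.1%}, "
          f"{point['per_domain']:.2f} exploits per domain")
```

Coverage grows with the false positive rate, while the yield per reviewed domain shrinks, which is the trade-off an analyst tunes.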