The Symantec Data Analysis project applies Shape Global Detector to the Symantec Wine dataset. Although it shares a theoretical foundation with Shape Global Detector, it deviates significantly from the original implementation, which successfully detects waterhole and phishing attacks. We had to revisit the main concepts of Shape Global Detector because, compared with waterhole and phishing attacks, malware propagation in the Symantec Wine dataset occurs over a much longer time scale and is masked by benign files.
The Wine dataset contains telemetry collected by Symantec’s intrusion prevention system and Symantec’s antivirus product over a 5-year period, from 2008 until 2013. The dataset summarizes file downloader activity across 5 million Windows hosts around the world. File downloads are represented as downloader graphs, one per end host. A graph node represents a downloaded file, and a directed edge from node a to node b indicates that file a downloaded file b from a domain D on the corresponding host machine, where D is the edge’s label. The Wine dataset covers 20.7 million unique files, which were downloaded 67 million times from 353 thousand domains.
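The downloader graph described above can be sketched as a small data structure. This is only an illustration: the class name DownloaderGraph, its methods, and the example file and domain names are hypothetical and are not taken from the Wine dataset or from the Shape Global Detector code.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DownloaderGraph:
    """Downloader graph of one end host (hypothetical representation)."""

    host_id: str
    # Maps a parent file to the list of (child file, domain label) edges.
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_download(self, parent, child, domain):
        """Record the directed edge parent -> child, labeled with `domain`."""
        self.edges[parent].append((child, domain))

    def files(self):
        """Return every file that appears as a node in the graph."""
        nodes = set(self.edges)
        for children in self.edges.values():
            nodes.update(child for child, _ in children)
        return nodes

    def domains(self):
        """Return every domain that labels at least one edge."""
        return {
            domain
            for children in self.edges.values()
            for _, domain in children
        }


# Made-up example: file names and domains are placeholders.
graph = DownloaderGraph(host_id="host-1")
graph.add_download("browser.exe", "setup.exe", "downloads.example.com")
graph.add_download("setup.exe", "payload.exe", "cdn.example.net")
```

Here browser.exe downloads setup.exe, which in turn downloads payload.exe, so the graph has three nodes and two labeled edges.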
The current implementation of Shape Global Detector relies on two types of fuzzy local detectors: one performs lightweight static analysis of binary files to find malicious ones, and the other identifies malicious domains. Neither detector is practical to deploy on its own because of its high false positive rate.
We developed a custom neighborhood template to match the following intuition: if a domain is malicious, then the files transitively downloaded from that domain are likely to be malicious. Thus, in the context of the Symantec Wine dataset, a neighborhood aggregates all files transitively downloaded from a suspicious domain within a 150-day neighborhood time window. The Global Detector then extracts shape-based feature vectors from each neighborhood and feeds them into a classifier that detects malicious neighborhoods.
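A minimal sketch of how such a neighborhood could be collected is given below. It makes several assumptions that the text does not confirm: edges are given as (parent, child, domain, timestamp) tuples, the 150-day window starts at the first download from the suspicious domain, and ordering of downloads inside the window is ignored. The function name and the example data are hypothetical, and feature extraction and classification are not shown.

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(days=150)  # neighborhood time window from the text


def neighborhood(edges, suspicious_domain, window=WINDOW):
    """Collect files transitively downloaded from `suspicious_domain`.

    `edges` is a list of (parent, child, domain, timestamp) tuples.
    The tuple layout and the window anchoring are assumptions.
    """
    seeds = [e for e in edges if e[2] == suspicious_domain]
    if not seeds:
        return set()

    # Assumed anchor: the first download served by the suspicious domain.
    start = min(e[3] for e in seeds)
    end = start + window

    # Index the in-window edges by parent file for the traversal.
    children = {}
    for parent, child, _domain, ts in edges:
        if start <= ts <= end:
            children.setdefault(parent, []).append(child)

    # Breadth-first traversal from files served by the suspicious domain.
    found = set()
    queue = deque()
    for _parent, child, _domain, ts in seeds:
        if ts <= end and child not in found:
            found.add(child)
            queue.append(child)
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


# Made-up example data.
example_edges = [
    ("browser.exe", "dropper.exe", "bad.example", datetime(2010, 1, 1)),
    ("dropper.exe", "payload.exe", "cdn.example", datetime(2010, 2, 1)),
    ("payload.exe", "late.exe", "cdn.example", datetime(2010, 12, 1)),
    ("browser.exe", "benign.exe", "good.example", datetime(2010, 1, 5)),
]
result = neighborhood(example_edges, "bad.example")
```

In this example the neighborhood contains dropper.exe and payload.exe. The file late.exe is excluded because it was downloaded after the 150-day window closed, and benign.exe is excluded because it does not descend from the suspicious domain.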
Compared with prior work, Shape Global Detector reduces the false positive rate by ~10x at the cost of a 13% decrease in the true positive rate, and it achieves a 1.5x-4x higher F1 score than the malware detectors in prior work. Notably, Shape Global Detector detects 43% of malware files 345 days (on average) ahead of commercial antivirus products.