Learning Framework for Detecting New Malware
This article was written by the following Avast researchers:
Viliam Lisý, Avast Senior AI Scientist
Branislav Bošanský, Senior AI Scientist at Avast
Karel Horak, Avast Senior AI researcher
Matej Racinsky, Avast AI researcher
Petr Somol, AI Research Director at Avast
Every day, anti-virus systems around the world inspect billions of files for potential threats. For most files, they can easily decide whether the file is malware or clean based on the reputation of the specific file or on common patterns identified in known malware families. However, a considerable share of files cannot be classified easily with the known models. These files are usually uploaded to the massive cloud backends of anti-virus systems, where they are analyzed in depth using a wide variety of methods, such as static analysis, dynamic analysis, behavioral analysis, or queries to third-party knowledge bases. Each of these scans produces a rich, diverse, and often changing feature set that indicates whether the file is malware or not.
The amount of relevant data is substantial, and the decision must be fast. For example, the WannaCry ransomware outbreak went from one to over 200,000 computers in just over seven hours. Every minute of delay in detecting such threats can mean thousands of newly infected computers. Therefore, the detection of new malware should be automated, usually with a machine learning (ML) model that takes into account the features extracted from the binary or produced by other preprocessing tools. Standard ML approaches require ML engineers to understand the information contained in the reports, determine which parts indicate whether the scanned file is malware, and implement routines that encode the most important information into the fixed-size vector representations required by most machine learning algorithms. If the format or information in the reports is updated or extended, the engineer must understand the differences and adapt these routines. If a new data source is added, the engineer must go through the whole process from scratch: understanding the data, implementing feature extraction routines, and training the classifier. Note that such changes in reports occur very often in the area of malware detection, as all preprocessing tools are actively developed to discover important features of new binaries.
In our previous blog post, we introduced a generic framework that automates these tasks, which are traditionally performed by machine learning engineers. With our implementation of the framework, which we call ReportGrinder, adding a new data source simply means adding a pointer to the new set of analysis reports to learn from. If the reports change arbitrarily but the problem of distinguishing malware from clean files persists, no human intervention is required: the system can simply be retrained automatically on the new reports.
In this article, we’ll show how we deployed the ReportGrinder framework for rapid malware detection in new, previously unseen files based on various data sources. Each new file is analyzed by multiple backend systems that extract static features, provide behavioral analysis, and query third-party information. The raw output of these systems, in the form of JSON reports, is used as input for a machine learning model trained on hundreds of millions of files that we have classified in the past. We use an ensemble model to assess the confidence of the classification. On its own, this new model makes a confident decision on 85% of the most difficult files that reach our backend from customers, within a minute of receiving them. Extending Avast’s backend decision systems with this new model has reduced processing time for new files by 50%. In addition, any new features in the analysis systems’ reports will be integrated into the model automatically, without additional human intervention.
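The article does not describe how the ensemble turns its members' outputs into a confidence, so the following is only a minimal sketch of one common approach: average the members' malware probabilities and abstain when the average is not decisive. The function name `ensemble_decision` and the `threshold` value are hypothetical and not taken from ReportGrinder.

```python
from statistics import mean


def ensemble_decision(member_probs, threshold=0.9):
    """Average per-member P(malware) values and abstain when unsure.

    member_probs: one malware probability per ensemble member.
    threshold: hypothetical confidence cut-off (an assumption, not from the article).
    """
    p_malware = mean(member_probs)
    if p_malware >= threshold:
        return "malware", p_malware
    if p_malware <= 1.0 - threshold:
        return "clean", p_malware
    # Members disagree or are unsure: leave the file to other backend systems.
    return "undecided", p_malware


print(ensemble_decision([0.97, 0.99, 0.95]))  # members agree: confident verdict
print(ensemble_decision([0.60, 0.20, 0.85]))  # members disagree: no verdict
```

Averaging is only one way to combine members; the point of the sketch is that a file receives a verdict only when the combined score clears the confidence threshold.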
Rapid classification of new malware
When an antivirus system encounters a file, its hash is usually checked against a reputation database to determine whether or not it is clean. A small fraction of files will never have been seen before because they contain, for example, polymorphic malware or a custom installer. These files are then scanned using client-side detection methods that look for known patterns in the file binary and may even run short emulations. For a small portion of these files, even this check fails and the file is sent to the cloud for scanning by the anti-virus backends. At this point, the user is already starting to experience delays and may be waiting for a new app to start for the first time. Therefore, the speed of the following steps is very important.
Relevant data sources
When the most difficult files arrive at the backend, a plethora of computationally expensive systems can be run in parallel to provide additional information about the suspect sample:
- Tools can extract static characteristics from the binary (such as Decree or PLACE, sample report)
- Separate tools can run the sample in a safe and controlled environment to provide behavioral analysis (such as Hello or Cap, sample report)
- The file can be unzipped
- The validity, reputation and other properties of digital signatures can be obtained
- External data sources can be queried for additional data
- The similarity of the file to existing file clusters can be reported
It’s important to consider a wide variety of data sources because malware can successfully evade one type of scan, but the evasion itself often makes the sample easier to detect by a complementary method. Each data source produces a structured report in JSON format, which is ideal for processing by our ReportGrinder.
Using HMIL for Malware Classification
Using hierarchical multi-instance learning (HMIL) through ReportGrinder for malware classification is quite straightforward. We collect all reports for a large dataset of hundreds of millions of files. The general sequence of steps that we introduced before is then carried out automatically:
- ReportGrinder automatically derives the schema for each data source.
- Based on the schema, all basic data types, such as strings and numbers, are encoded in a vector representation.
- A neural network following the structure of the schema is automatically derived so that it aggregates an arbitrarily large and variable report into a fixed-size vector representation.
- The vector representations of all relevant data sources can then be concatenated and followed by several feed-forward layers and an appropriate output layer with a corresponding loss function. In the case of malware classification, this may simply be a softmax output layer trained by minimizing cross-entropy.
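The first three steps above can be illustrated with a small, untrained sketch. It is not ReportGrinder's implementation: the helper names (`merge`, `derive_schema`, `vector_size`, `embed`), the hashed-trigram string encoding, and mean pooling are all assumptions chosen to keep the example short. A real HMIL model would use learned layers at every level, and the feed-forward layers and softmax output are omitted here. What the sketch does show is the key property: the output size depends only on the schema, not on how many items a report contains.

```python
import zlib

DIM = 8  # hypothetical number of hash buckets per string feature


def merge(a, b):
    """Unify two schemas; None means 'nothing observed yet'."""
    if a is None or a == b:
        return b
    if b is None:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        if "dict" in a and "dict" in b:
            keys = set(a["dict"]) | set(b["dict"])
            return {"dict": {k: merge(a["dict"].get(k), b["dict"].get(k)) for k in keys}}
        if "list" in a and "list" in b:
            return {"list": merge(a["list"], b["list"])}
    return "string"  # conflicting types: fall back to treating values as text


def derive_schema(value):
    """Infer the nesting structure and leaf types of one JSON value."""
    if isinstance(value, dict):
        return {"dict": {k: derive_schema(v) for k, v in value.items()}}
    if isinstance(value, list):
        item = None
        for v in value:
            item = merge(item, derive_schema(v))
        return {"list": item}
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "number"
    return "string"


def vector_size(schema):
    """Length of the vector produced for a schema node."""
    if schema is None:
        return 0
    if schema == "number":
        return 1
    if schema == "string":
        return DIM
    if "list" in schema:
        return vector_size(schema["list"])
    return sum(vector_size(s) for s in schema["dict"].values())


def embed(value, schema):
    """Map a JSON value to a vector whose length is fixed by the schema."""
    if schema is None:
        return []
    if schema == "number":
        return [float(value)] if isinstance(value, (int, float)) else [0.0]
    if schema == "string":
        vec = [0.0] * DIM
        text = f"^{value}$"
        for i in range(len(text) - 2):  # hashed bag of character trigrams
            vec[zlib.crc32(text[i:i + 3].encode()) % DIM] += 1.0
        return vec
    if "list" in schema:
        # Multi-instance step: pool any number of items by element-wise mean.
        items = [embed(v, schema["list"]) for v in value] if isinstance(value, list) else []
        if not items:
            return [0.0] * vector_size(schema)
        return [sum(col) / len(items) for col in zip(*items)]
    fields = value if isinstance(value, dict) else {}
    out = []
    for key in sorted(schema["dict"]):  # fixed key order; missing keys become zeros
        sub = schema["dict"][key]
        out += embed(fields[key], sub) if key in fields else [0.0] * vector_size(sub)
    return out


# Two made-up reports with different numbers of imports and sections.
reports = [
    {"imports": ["kernel32.dll", "user32.dll"], "size": 1024,
     "sections": [{"name": ".text", "entropy": 6.1}]},
    {"imports": ["kernel32.dll"], "size": 2048,
     "sections": [{"name": ".text", "entropy": 7.9}, {"name": ".rsrc", "entropy": 3.2}]},
]
schema = None
for report in reports:
    schema = merge(schema, derive_schema(report))
vectors = [embed(report, schema) for report in reports]
print(vector_size(schema), [len(v) for v in vectors])
```

Both reports map to vectors of the same length even though they contain different numbers of imports and sections. A new field in future reports would simply extend the derived schema, with no hand-written extraction routine to update.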
We deployed ReportGrinder for classification of Windows executable files based on static and dynamic scan reports in Avast CyberCapture. Every day, this feature receives tens of thousands of suspicious executables that have not been seen before among Avast users. Even before ReportGrinder was deployed, these files were classified as malware, potentially unwanted programs (PUPs), or clean, based on a diverse combination of classifiers using machine learning, the reputation of individual file components, hand-written rules, external intelligence, and so on.
If none of these systems could make a conclusive decision, the file was rescanned after some time, because many classifiers continually adapt to each new file that Avast scans. Prior to deploying ReportGrinder, approximately 20% of the files coming into CyberCapture did not receive a final decision within hours. We refer to these files as “expired”.
The initial deployment of ReportGrinder to process the files of Avast’s 435 million users was cautious, but it still led to substantial improvements. ReportGrinder’s decision is used only after the well-established pre-existing classifiers have failed to classify the file.
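The cautious ordering described above can be sketched as a simple fallback chain. This is only an illustration under stated assumptions: the function `decide`, the convention that a classifier returns `None` when it cannot decide, and the three stand-in classifiers are all hypothetical and do not reflect CyberCapture's actual interfaces.

```python
def decide(report, established_classifiers, report_grinder):
    """Return (verdict, source). A classifier returns None when it cannot decide."""
    for name, classify in established_classifiers:
        verdict = classify(report)
        if verdict is not None:
            return verdict, name
    # Consulted only after every established classifier has abstained.
    verdict = report_grinder(report)
    if verdict is not None:
        return verdict, "report_grinder"
    return None, None  # still undecided; such a file would be rescanned later


# Hypothetical stand-ins for real classifiers.
def reputation(report):
    return "clean" if report.get("known_good") else None


def rules(report):
    return "malware" if report.get("matches_rule") else None


def grinder(report):
    return "malware" if report.get("score", 0.0) > 0.9 else None


established = [("reputation", reputation), ("rules", rules)]
print(decide({"known_good": True, "score": 0.99}, established, grinder))  # ('clean', 'reputation')
print(decide({"score": 0.99}, established, grinder))  # ('malware', 'report_grinder')
print(decide({"score": 0.5}, established, grinder))   # (None, None)
```

In the first call the established reputation check wins even though the fallback model would also have produced a verdict, which mirrors the priority given to the pre-existing classifiers.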
Figure 1 shows the breakdown of CyberCapture decisions by the classifier that made the final decision. In the two weeks leading up to the deployment, 24% of the files expired. In the two weeks after the ReportGrinder classifiers were deployed, only 6% of the files expired, and a large portion of the files that would otherwise have expired were classified by the HMIL classifiers built into the ReportGrinder framework.
Figure 1: Distribution of CyberCapture decisions by different internal systems two weeks before and after the deployment of ReportGrinder.
Reducing the number of expired files is very important for the user experience: instead of waiting a few hours for a decision, users can continue their work within the one minute that the ReportGrinder classifiers need. Even files that would ultimately be decided by pre-existing systems can be decided by ReportGrinder within a minute. The deployment has therefore led to a substantial reduction in the time files spend in CyberCapture. Figure 2 shows the average (relative) processing time before and after deploying ReportGrinder.
Figure 2: Average CyberCapture processing time before and after deploying ReportGrinder.
Avast researchers turned their theoretical framework for processing complex security data without feature engineering into a practical application. They built a system that takes static and dynamic analysis reports of executable files in their raw form and decides whether the corresponding files are malware or clean. The system is regularly trained on over 100 million files and has halved the average processing time of the most difficult, never-before-seen files arriving at Avast backends.