Judging the similarity of terminal binary files is a fundamental technology of endpoint security systems and has a wide range of application scenarios in the office security business. For example, process whitelisting (judging whether a new file is a new version of known software) and malicious code detection (identifying whether a suspicious sample is a new variant of a known malware family) both rely on binary file similarity judgment.
The similarity judgment of terminal binary files can be realized through either dynamic or static analysis, each with its own advantages and disadvantages. Static analysis does not need to run samples; it analyzes the binary file content directly, offering high efficiency and comprehensive coverage of code branches, but it is weak against obfuscation and cannot cope with scenarios such as packing. Dynamic analysis observes the sample's behavior after running it in a sandbox, giving high accuracy and strong resistance to obfuscation, but its low execution efficiency and incomplete branch coverage can also lead to missed detections. At present, static and dynamic analysis are generally combined, with static analysis as the mainstream detection mode.
When judging binary file similarity based on static analysis, the traditional method directly compares the similarity of two binary files, and the accuracy of the result is low. Take the well-known fuzzy hashing project ssdeep as an example: it treats a binary file as a raw byte stream for fuzzy hash calculation and does not distinguish whether a byte belongs to an opcode (such as a JMP instruction) or an immediate value (such as the target address of the JMP). In fact, the executable code sections of a binary file consist of opcodes and operands; as long as the opcode sequences are similar, the two files should be regarded as similar, and immediate values should not affect the judgment. Even when the source code is exactly the same, the binaries generated by the compiler will differ greatly across processor architectures, compiler versions, and compilation options, especially in their immediate values. These characteristics have a negative impact on traditional fuzzy-hash-based file similarity judgment.
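The effect of immediate values can be illustrated with a minimal sketch that masks immediates and addresses in disassembly text before comparison. The listings and regex below are illustrative assumptions; a production pipeline would use a real disassembler (e.g. capstone) rather than pattern matching on text.

```python
import re

# Match hexadecimal addresses/immediates (0x...) or bare decimal literals.
IMM_RE = re.compile(r"0x[0-9a-fA-F]+|\b\d+\b")

def normalize(disasm_lines):
    """Replace every immediate value with the placeholder IMM,
    keeping only the opcode/operand structure."""
    return [IMM_RE.sub("IMM", line) for line in disasm_lines]

# Two hypothetical listings that differ only in immediates and addresses,
# e.g. the same code compiled with different options:
a = ["mov eax, 0x10", "cmp eax, 5", "jne 0x401020"]
b = ["mov eax, 0x20", "cmp eax, 7", "jne 0x402040"]

# A raw byte-stream comparison (as in ssdeep) sees them as different,
# but after masking immediates the opcode skeletons are identical.
assert normalize(a) == normalize(b)
```

This normalization step is exactly what byte-level fuzzy hashing omits: it lets the opcode sequence dominate the similarity result while immediate values are ignored.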
To this end, we need to study a deep-learning-based binary file feature extraction method that represents binary files more efficiently and accurately. In a data-driven manner, the model is trained on a large number of binary file samples so that it learns to extract the characteristics of opcode sequences while ignoring interference from irrelevant content such as immediate values. The resulting representations will be widely used in downstream services such as file whitelisting, malicious code detection, and virus family classification.
1. A prototype system for terminal binary file representation, with the following specific indicators:
1) Support at least two binary file formats (PE and Mach-O) on two terminal operating systems (Windows and macOS);
2) Strong binary file feature extraction ability: the representations of similar binary files should exhibit correspondingly high similarity. For different versions of the same application software, or binaries generated from the same source code by different compilers, the similarity of the representations, measured by the common Euclidean or cosine distance, shall be no less than 90%;
3) High execution efficiency: from the preprocessing of binary files to the output of the final representation vector, throughput on an ordinary server shall be no less than 50,000 samples per day.
2. Submit one high-quality paper to a Class A conference or journal certified by Alibaba Group.
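The acceptance check in indicator 2) amounts to computing the cosine similarity between two representation vectors and verifying it meets the 90% threshold. A minimal sketch, using toy vectors that stand in for model outputs (the actual vectors and dimensionality would come from the trained system):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two dense representation vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for the representations of two versions of the same
# application software produced by the prototype system.
v1 = [0.9, 0.1, 0.4, 0.2]
v2 = [0.8, 0.15, 0.45, 0.25]

sim = cosine_similarity(v1, v2)
assert sim >= 0.90  # indicator 2): similarity no less than 90%
```

Euclidean distance can be substituted for cosine similarity in the same harness; the indicator only fixes the threshold, not the metric.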
Related Research Topics
This project intends to study a deep-learning-based binary file representation technology that judges binary file similarity by representing binary files as dense vectors reflecting their characteristics. Representation can be performed at different levels, such as raw binary instructions, conversion to an intermediate representation, or higher-level control-flow analysis. These levels differ in preprocessing efficiency, the size of the labeled data set required, the model training algorithm, and their fit to the constraints of downstream tasks, so we need to explore a binary file representation scheme suited to our own business scenarios according to the actual situation. The technical problems to be solved in this project include:
Binary file representation algorithm framework: targeting the two downstream tasks of process whitelisting and static malicious code detection, with an average of fifty thousand new samples processed per day as the benchmark, explore among the candidate levels (binary layer, intermediate-representation layer, disassembly layer, and control-flow layer) a binary file representation algorithm framework suitable for our business scenarios.
Binary file preprocessing technology: at the selected preprocessing level, implement a binary file preprocessing pipeline that can handle an average of fifty thousand samples per day, with output suitable for the subsequent deep-learning-based feature extraction framework.
Binary file representation algorithm: given a large number of unlabeled samples and a small number of labeled samples (including benign software families and black/white sample files), establish an algorithm framework combining self-supervised learning with supervised fine-tuning that represents binary files as dense vectors and performs well on the two downstream tasks of process whitelist detection and malicious code detection.
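As a purely illustrative baseline for what "representing binary files as dense vectors" means in practice, the sketch below hashes opcode n-grams into a fixed-size normalized vector (feature hashing). The dimensionality, n-gram size, and opcode sequences are assumptions for illustration; the project itself would replace this hand-crafted embedding with the model trained by self-supervised pretraining plus supervised fine-tuning.

```python
import hashlib
import math

DIM = 64  # dimensionality of the dense vector (assumed for illustration)

def embed(opcodes, n=2, dim=DIM):
    """Hash opcode n-grams into a fixed-size, L2-normalized dense vector.
    A hand-crafted stand-in for the learned representation model."""
    vec = [0.0] * dim
    for i in range(len(opcodes) - n + 1):
        gram = " ".join(opcodes[i:i + n])
        bucket = int(hashlib.md5(gram.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# Two opcode sequences differing by one instruction, as might come from
# two versions of the same software; their vectors should remain close.
seq_a = ["push", "mov", "cmp", "jne", "call", "ret"]
seq_b = ["push", "mov", "cmp", "jne", "call", "leave", "ret"]

va, vb = embed(seq_a), embed(seq_b)
cos = sum(x * y for x, y in zip(va, vb))  # vectors are unit-length
```

Downstream tasks then operate only on these vectors: the whitelist task thresholds similarity against known-good software, and malicious code detection feeds the vectors to a classifier.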