This is a 39-page master's thesis by Paul van de Luitgaarden of Radboud University in the Netherlands.
In order to achieve situational awareness in a computer network, information about the hosts that are present has to be gathered and processed. One of the processes that have to be executed is role discovery [7]. The goal of role discovery is assigning the correct role, e.g. FTP server, to a host. Assigning a role to a host is also referred to as labeling a host.
Because this thesis focuses on industrial control networks, information has to be gathered passively: a sensor captures all the traffic that passes by. From this traffic, which consists of network packets, information is extracted. Information that helps establish knowledge about a host's role is referred to as a feature and is placed in a feature vector for each host.
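As a rough illustration of how passively captured traffic could be turned into per-host feature vectors, consider the minimal sketch below. The packet summary format and the three features (`served_ports`, `n_clients`, `n_servers`) are assumptions made for this example and are not the feature set used in the thesis. It also assumes each record describes a client-to-server packet, so the destination port identifies a service offered by the destination host.

```python
from collections import defaultdict

# Hypothetical packet summaries: (src_ip, dst_ip, dst_port, protocol).
# A real sensor would derive these from captured frames; the fields are an assumption.
packets = [
    ("10.0.0.5", "10.0.0.1", 21, "tcp"),
    ("10.0.0.6", "10.0.0.1", 21, "tcp"),
    ("10.0.0.5", "10.0.0.2", 502, "tcp"),
    ("10.0.0.1", "10.0.0.2", 502, "tcp"),
]

def build_feature_vectors(packets):
    """Aggregate per-host features from passively observed packets."""
    raw = defaultdict(lambda: {"served_ports": set(), "clients": set(), "servers": set()})
    for src, dst, dst_port, _proto in packets:
        raw[dst]["served_ports"].add(dst_port)  # destination offers this port
        raw[dst]["clients"].add(src)            # source acts as its client
        raw[src]["servers"].add(dst)            # source talks to this server
    return {
        host: {
            "served_ports": frozenset(f["served_ports"]),
            "n_clients": len(f["clients"]),
            "n_servers": len(f["servers"]),
        }
        for host, f in raw.items()
    }

vectors = build_feature_vectors(packets)
```

Nothing is ever sent to the hosts: every feature is derived from traffic that the sensor merely observes.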
This thesis introduces a role discovery framework in which an attempt is made to correctly label each host using three techniques:
- Expert knowledge. A manual technique in which a network operator assigns the correct role to each host.
- Fingerprinting. A semi-automatic approach in which fingerprints that represent a certain role are manually constructed and matched against each feature vector. When a match occurs, that role is applied to the host.
- Self-training. An automatic semi-supervised machine learning technique.
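The fingerprinting step can be pictured as a rule lookup over feature vectors. The sketch below assumes a fingerprint is simply a set of ports that a host must serve. The `FINGERPRINTS` table, the role names, and the matching rule are hypothetical and stand in for the manually constructed fingerprints of the thesis.

```python
# Hypothetical fingerprints: role -> ports the host must be seen serving.
FINGERPRINTS = {
    "FTP server": {"served_ports": {21}},
    "Modbus device": {"served_ports": {502}},
}

def match_fingerprint(vector, fingerprints=FINGERPRINTS):
    """Return the first role whose required ports are all served by the host, else None."""
    for role, required in fingerprints.items():
        if required["served_ports"] <= vector["served_ports"]:  # subset test
            return role
    return None

# Toy feature vectors keyed by host address (illustrative values only).
hosts = {
    "10.0.0.1": {"served_ports": {21, 22}},
    "10.0.0.2": {"served_ports": {502}},
    "10.0.0.9": {"served_ports": {8080}},
}
labels = {host: match_fingerprint(vec) for host, vec in hosts.items()}
```

Hosts that match no fingerprint stay unlabeled (`None` here) and are the candidates for self-training.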
The focus lies on role discovery in industrial control networks. Because these networks typically contain a large number of hosts, the manual expert-knowledge technique is very time-consuming. This thesis therefore focuses mainly on the latter two techniques, and on self-training in particular.
The feature vectors are constructed using network traffic from various industrial control networks. Using fingerprinting with the fingerprints constructed in this thesis, almost 25% of the vectors get labeled. Self-training is then used to attempt to label the remaining vectors.
To evaluate this, two experiments are conducted. The first experiment measures the performance of each classifier on every dataset. A classifier is constructed using a machine learning algorithm. The algorithms considered are Decision Tree (C4.5), K-nearest neighbor and Naive Bayes. For datasets with a small number of initially labeled vectors, K-nearest neighbor performs best.
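To make the idea of measuring classifier performance on a labeled dataset concrete, here is a minimal pure-Python K-nearest neighbor classifier scored with leave-one-out evaluation. The evaluation protocol, the Euclidean distance, the value `k=3`, and the toy two-dimensional vectors are all assumptions for illustration. The abstract does not state how the thesis scored its classifiers.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k nearest labeled vectors (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _vec, label in nearest)
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(labeled, k=3):
    """Classify each labeled vector using all the others and report the hit rate."""
    hits = 0
    for i, (vec, label) in enumerate(labeled):
        rest = labeled[:i] + labeled[i + 1:]
        hits += knn_predict(rest, vec, k) == label
    return hits / len(labeled)

# Toy numeric feature vectors (n_clients, n_servers); values are illustrative only.
labeled = [
    ((9.0, 0.0), "server"), ((8.0, 1.0), "server"), ((10.0, 0.0), "server"),
    ((0.0, 3.0), "client"), ((1.0, 2.0), "client"), ((0.0, 4.0), "client"),
]
accuracy = leave_one_out_accuracy(labeled, k=3)
```

The same scoring function could be run with other classifiers in place of `knn_predict` to compare them on one dataset.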
In the second experiment, the number of classification decisions whose confidence reaches a certain threshold is reviewed. The higher the threshold, the more similar two vectors must be for the classifier. When an unlabeled vector is classified such that the confidence in the decision reaches the threshold, the vector is assigned that role. When choosing the best-performing classifier from the first experiment, the percentage of correct classifications lies between 56.5% and 100%, while the percentage of misclassified hosts lies between 0% and 8.7%.
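The thresholded labeling described above can be sketched as a self-training loop. In this sketch the confidence of a decision is assumed to be the share of the k nearest neighbors that vote for the winning role. That confidence measure, the function names, and the toy data are assumptions for the example, not the thesis's implementation.

```python
import math
from collections import Counter

def knn_vote(labeled, query, k=3):
    """Return (role, confidence); confidence is the winning share of the k nearest votes."""
    nearest = sorted(labeled, key=lambda item: math.dist(item[0], query))[:k]
    role, count = Counter(label for _vec, label in nearest).most_common(1)[0]
    return role, count / len(nearest)

def self_train(labeled, unlabeled, threshold=1.0, k=3):
    """Repeatedly move confidently classified vectors into the labeled set."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    while True:
        decisions = [(vec, *knn_vote(labeled, vec, k)) for vec in unlabeled]
        accepted = [(vec, role) for vec, role, conf in decisions if conf >= threshold]
        if not accepted:  # no decision reached the threshold: stop
            return labeled, unlabeled
        labeled.extend(accepted)
        accepted_vecs = {vec for vec, _role in accepted}
        unlabeled = [vec for vec in unlabeled if vec not in accepted_vecs]

# Seed labels, e.g. from fingerprinting (toy values), and the unlabeled pool.
seed = [
    ((9.0, 0.0), "server"), ((8.0, 1.0), "server"), ((10.0, 0.0), "server"),
    ((0.0, 3.0), "client"), ((1.0, 2.0), "client"), ((0.0, 4.0), "client"),
]
pool = [(7.0, 1.0), (1.0, 3.0), (5.0, 2.0)]
final_labeled, remaining = self_train(seed, pool, threshold=1.0)
```

With a strict threshold the ambiguous vector `(5.0, 2.0)` stays unlabeled, while a lower threshold would label it at a higher risk of misclassification. This is the trade-off between correct and incorrect classifications that the second experiment examines.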
- Introduction
- Related Work
- Design
- Implementation
- Experiments and Discussion
- Conclusion and Future Work