CN114861780B

CN114861780B - Data label marking method, device and processor

Info

Publication number: CN114861780B
Application number: CN202210434729.4A
Authority: CN
Inventors: 陈敏; 陈震宇; 刘国华; 李少波
Original assignee: Postal Savings Bank of China Ltd
Current assignee: Postal Savings Bank of China Ltd
Priority date: 2022-04-24
Filing date: 2022-04-24
Publication date: 2024-11-08
Anticipated expiration: 2042-04-24
Also published as: CN114861780A

Abstract

The application provides a data label marking method, device and processor. The method includes: acquiring first type data and second type data, marking the first type data to obtain a first label value, and marking the second type data to obtain a second label value; clustering the first type data and the second type data according to a predetermined clustering algorithm to obtain a plurality of data clusters; determining the proportion of the first type data in each data cluster; and performing secondary marking on the first type data and the second type data according to the proportion, so that at least some of the first label values in each data cluster are updated to the second label value and at least some of the second label values are updated to the first label value. Because the secondary marking is driven by the proportion of the first type data in each data cluster, the resulting label data is more accurate than after a single marking pass, which in turn improves the accuracy of predictions on unknown data.

Description

Data label marking method, device and processor

Technical Field

The present application relates to the field of data processing, and in particular to a data label marking method, device, computer-readable storage medium and processor.

Background Art

When data is clustered, the data must first be marked to obtain label data, which carries the label information. A model trained by a machine learning algorithm on label data with existing label information can then predict unknown data. However, the data is marked only once during this process, and label data obtained from a single marking pass is not accurate, so predictions on unknown data have low accuracy.

Summary of the Invention

The main purpose of the present application is to provide a data label marking method, device, computer-readable storage medium and processor, so as to solve the problem in the prior art that data marked only once has low accuracy.

According to one aspect of an embodiment of the present invention, a data label marking method is provided, comprising: acquiring first type data and second type data, marking the first type data to obtain a first label value, and marking the second type data to obtain a second label value; clustering the first type data and the second type data according to a predetermined clustering algorithm to obtain a plurality of data clusters, each data cluster including at least one item of the first type data and at least one item of the second type data; determining the proportion of the first type data in each data cluster; and performing secondary marking on the first type data and the second type data according to the proportion, so that at least some of the first label values in each data cluster are updated to the second label value and at least some of the second label values are updated to the first label value.

Optionally, performing secondary marking on the first type data and the second type data according to the proportion includes: when the proportion is greater than a proportion threshold, updating the label value of the second type data in the current data cluster to the first label value; and when the proportion is less than or equal to the proportion threshold, updating the label value of the first type data in the current data cluster to the second label value.

Optionally, updating the label value of the second type data in the current data cluster to the first label value when the proportion is greater than the proportion threshold includes: determining the current data cluster whose proportion is greater than the proportion threshold as a target data cluster; and updating the label values of a predetermined number of items of the second type data in the target data cluster to the first label value.

Optionally, updating the label value of the first type data in the current data cluster to the second label value when the proportion is less than or equal to the proportion threshold includes: determining the current data cluster whose proportion is less than or equal to the proportion threshold as a non-target data cluster; and updating the label values of all the first type data in the non-target data cluster to the second label value.

Optionally, updating the label value of the second type data in the current data cluster to the first label value when the proportion is greater than the proportion threshold includes: when the proportions of the first type data in several data clusters are all greater than the proportion threshold, comparing those proportions; and updating the label values of the second type data in those data clusters to the first label value cluster by cluster, in descending order of proportion.

Optionally, the method further includes: training a decision tree model with the first type data, the first label value, the second type data and the second label value to obtain a tree diagram; determining the rules of the decision tree according to the tree diagram; determining the similarity between the rules of the decision tree and standard rules; and determining, according to the similarity, the accuracy with which the rules of the decision tree classify and predict unknown data.

Optionally, the tree diagram includes a plurality of nodes, each node includes relevant information, and the relevant information includes at least one of the following: node name, data length, data width, Gini coefficient, data volume, function value corresponding to the data, and data name. Determining the rules of the decision tree according to the tree diagram includes: obtaining position information of a target node in the tree diagram; and determining the rule of the decision tree corresponding to the target node according to the position information and the relevant information of the target node, the target node being one of the plurality of nodes.

According to another aspect of an embodiment of the present invention, a data label marking device is also provided, comprising: a first processing unit, configured to acquire first type data and second type data, mark the first type data to obtain a first label value, and mark the second type data to obtain a second label value; a clustering unit, configured to cluster the first type data and the second type data according to a predetermined clustering algorithm to obtain a plurality of data clusters, each data cluster including at least one item of the first type data and at least one item of the second type data; a first determination unit, configured to determine the proportion of the first type data in each data cluster; and a second processing unit, configured to perform secondary marking on the first type data and the second type data according to the proportion, so that at least some of the first label values in each data cluster are updated to the second label value and at least some of the second label values are updated to the first label value.

According to yet another aspect of an embodiment of the present invention, a computer-readable storage medium is also provided. The computer-readable storage medium includes a stored program, and the program executes any one of the methods described above.

According to a further aspect of an embodiment of the present invention, a processor is also provided. The processor is configured to run a program, and the program, when run, executes any one of the methods described above.

In an embodiment of the present invention, first type data and second type data are first acquired, the first type data is marked to obtain a first label value, and the second type data is marked to obtain a second label value. The first type data and the second type data are then clustered according to a predetermined clustering algorithm to obtain a plurality of data clusters, each including at least one item of the first type data and at least one item of the second type data. Next, the proportion of the first type data in each data cluster is determined. Finally, secondary marking is performed on the first type data and the second type data according to the proportion, so that at least some of the first label values in each data cluster are updated to the second label value and at least some of the second label values are updated to the first label value. Because the secondary marking is driven by the proportion of the first type data in each data cluster, the resulting label data is more accurate, which in turn improves the accuracy of predictions on unknown data.

Brief Description of the Drawings

The accompanying drawings, which form part of the present application, are provided for a further understanding of the present application. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not unduly limit it. In the drawings:

FIG. 1 is a schematic flow chart of a data label marking method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of the proportion of the first type data in each data cluster;

FIG. 3 is a schematic diagram of the number of items of first type data of each data cluster at the target nodes;

FIG. 4 is a schematic diagram of a tree diagram;

FIG. 5 is a schematic structural diagram of a data label marking device according to an embodiment of the present application.

Detailed Description

It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments can be combined with each other. The present application is described in detail below with reference to the accompanying drawings and in combination with the embodiments.

To enable those skilled in the art to better understand the solution of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings of those embodiments. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.

It should be noted that the terms "first", "second" and the like in the specification, the claims and the above drawings of the present application are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that data so designated can be interchanged where appropriate, so that the embodiments of the present application described here can be carried out. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.

It should be understood that when an element (such as a layer, film, region or substrate) is described as being "on" another element, the element may be directly on the other element, or intermediate elements may be present. Likewise, in the specification and the claims, when an element is described as being "connected" to another element, the element may be "directly connected" to the other element, or "connected" to it through a third element.

For ease of description, some of the nouns and terms used in the embodiments of the present application are explained below:

Clustering: clustering is a machine learning technique that groups data points. Given a set of data points, clustering assigns each data point to a specific group. In theory, data points in the same group should have similar attributes and/or features, while data points in different groups should have different attributes and/or features of lower similarity. Clustering is an unsupervised learning method and a statistical data analysis technique commonly used in many fields. A common family of clustering algorithms is prototype clustering.

Prototype clustering: prototype clustering is also called "prototype-based clustering". A prototype clustering algorithm assumes that the cluster structure can be characterized by a set of prototypes. The algorithm first initializes the prototypes and then iteratively updates them. Different prototype representations and different solving methods yield different algorithms; a typical one is KMeans.

KMeans algorithm: K objects are first selected at random as the initial cluster centers. The distance between each object and each cluster center is then calculated, and each object is assigned to its nearest cluster center; a cluster center together with the objects assigned to it forms a cluster. Once all objects have been assigned, the center of each cluster is recalculated from the objects currently in that cluster. This process is repeated until a termination condition is met. The termination condition includes at least one of the following: no objects (or only a minimal number of objects) are reassigned to a different cluster center; no cluster centers (or only a minimal number of cluster centers) change; the sum of squared errors reaches a local minimum. The sum of squared errors (SSE) is the sum of the squared differences between every sample and its nearest cluster center:

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - u_i \rVert^2$$

where C_i denotes the i-th cluster, x denotes a point (an object) in that cluster, and u_i is the center of the i-th cluster.

MinibatchKMeans algorithm: this algorithm evolved from the KMeans algorithm and mainly addresses the excessive computation time of KMeans on large data volumes. It calculates the distances between data points using a method called Mini Batch (batch processing): a subset of samples is drawn from the samples of the different categories to represent their respective categories in the calculation, so that not all data samples have to be used, which reduces the running time accordingly.
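As an illustration of the KMeans procedure described above, the following is a minimal pure-Python sketch, not the implementation of the present application. The function name `kmeans` and its parameters are hypothetical. For reproducibility, the initial cluster centers are assumed to be passed in explicitly rather than selected at random, and the only termination condition used is that no object is reassigned.

```python
from math import dist


def kmeans(points, centers, max_iter=100):
    """Plain KMeans on tuples of floats (illustrative sketch).

    `centers` holds the initial cluster centers; passing them in explicitly
    (instead of picking K objects at random) is an assumption made so that
    the result is reproducible.
    """
    centers = [tuple(c) for c in centers]
    assign = None
    for _ in range(max_iter):
        # Assign every object to its nearest cluster center.
        new_assign = [
            min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            for p in points
        ]
        if new_assign == assign:
            break  # termination: no object was reassigned
        assign = new_assign
        # Recalculate each center from the objects currently in its cluster.
        for i in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == i]
            if members:
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    # Sum of squared errors between each object and its cluster center.
    sse = sum(dist(p, centers[a]) ** 2 for p, a in zip(points, assign))
    return assign, centers, sse


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assign, centers, sse = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

A mini-batch variant would run the same assignment and update steps on a sampled subset of `points` in each iteration instead of on all of them.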

As stated in the Background Art, data marked only once has low accuracy in the prior art. To solve this problem, a typical implementation of the present application provides a data label marking method, device, computer-readable storage medium and processor.

According to an embodiment of the present application, a data label marking method is provided.

FIG. 1 is a flow chart of a data label marking method according to an embodiment of the present application. As shown in FIG. 1, the method includes the following steps:

Step S101: acquiring first type data and second type data, marking the first type data to obtain a first label value, and marking the second type data to obtain a second label value;

Step S102: clustering the first type data and the second type data according to a predetermined clustering algorithm to obtain a plurality of data clusters, each data cluster including at least one item of the first type data and at least one item of the second type data;

Step S103: determining the proportion of the first type data in each data cluster;

Step S104: performing secondary marking on the first type data and the second type data according to the proportion, so that at least some of the first label values in each data cluster are updated to the second label value and at least some of the second label values are updated to the first label value.

In the above method, first type data and second type data are first acquired, the first type data is marked to obtain a first label value, and the second type data is marked to obtain a second label value. The first type data and the second type data are then clustered according to a predetermined clustering algorithm to obtain a plurality of data clusters, each including at least one item of the first type data and at least one item of the second type data. Next, the proportion of the first type data in each data cluster is determined. Finally, secondary marking is performed on the first type data and the second type data according to the proportion, so that at least some of the first label values in each data cluster are updated to the second label value and at least some of the second label values are updated to the first label value. Because the secondary marking is driven by the proportion of the first type data in each data cluster, the resulting label data is more accurate, which in turn improves the accuracy of predictions on unknown data.

It should be noted that the steps shown in the flow chart of the drawings can be executed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flow chart, in some cases the steps shown or described can be executed in a different order.

Specifically, the first marking of the first type data and the second type data can be carried out according to pre-established rules. The first label value can be set to 1 and the second label value to 0. The method is of course not limited to these values, and those skilled in the art can select a suitable first label value and second label value according to the actual situation.

In one embodiment, the proportion of the first type data in each data cluster is determined as shown in FIG. 2. There are four data clusters in total. The first data cluster contains 600 items of data, of which the first type data accounts for 20%; the second data cluster contains 300 items, of which the first type data accounts for 70%; the third data cluster contains 400 items, of which the first type data accounts for 90%; and the fourth data cluster contains 700 items, of which the first type data accounts for 10%.
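The proportion calculation of step S103 can be sketched as follows. This is an illustrative example only: the function name `first_type_proportions` is hypothetical, and the data is synthetic, built so that the cluster sizes and proportions match the figures quoted for FIG. 2 (for instance 600 × 20% = 120 items of first type data in the first data cluster).

```python
from collections import Counter


def first_type_proportions(cluster_ids, labels, first_label=1):
    """Return {cluster id: share of items whose label equals first_label}."""
    totals = Counter(cluster_ids)
    firsts = Counter(c for c, y in zip(cluster_ids, labels) if y == first_label)
    return {c: firsts[c] / totals[c] for c in totals}


# Synthetic stand-in for FIG. 2: cluster id -> (cluster size, first type count).
fig2 = {1: (600, 120), 2: (300, 210), 3: (400, 360), 4: (700, 70)}

cluster_ids, labels = [], []
for cid, (size, n_first) in fig2.items():
    cluster_ids += [cid] * size
    labels += [1] * n_first + [0] * (size - n_first)

proportions = first_type_proportions(cluster_ids, labels)
for cid, share in proportions.items():
    print(f"data cluster {cid}: {share:.0%} first type data")
```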

It should be noted that the predetermined clustering algorithm may be the MinibatchKMeans algorithm. It is of course not limited to the MinibatchKMeans algorithm, and those skilled in the art may select any other feasible clustering algorithm.

In one embodiment of the present application, performing secondary marking on the first type data and the second type data according to the proportion, so that at least some of the first label values in each data cluster are updated to the second label value and at least some of the second label values are updated to the first label value, includes: when the proportion is greater than a proportion threshold, updating the label value of the second type data in the current data cluster to the first label value; and when the proportion is less than or equal to the proportion threshold, updating the label value of the first type data in the current data cluster to the second label value. In this embodiment, comparing the proportion with the proportion threshold allows the first type data and the second type data to be secondarily marked more accurately, which further ensures that the resulting label data is accurate.

In a specific embodiment, the proportion threshold is 80%. Of the four data clusters in FIG. 2, only the third data cluster has a proportion of first type data greater than 80%, so the label value of the second type data in the third data cluster is updated to the first label value.
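The threshold rule of this embodiment can be sketched as below, assuming label values 1 and 0 and a proportion threshold of 80%. The function name `secondary_mark` and the small sample data are hypothetical. Note that a data cluster whose proportion equals the threshold exactly falls under the "less than or equal to" branch.

```python
from collections import Counter


def secondary_mark(cluster_ids, labels, threshold=0.8,
                   first_label=1, second_label=0):
    """Re-mark labels cluster by cluster according to the first type proportion."""
    totals = Counter(cluster_ids)
    firsts = Counter(c for c, y in zip(cluster_ids, labels) if y == first_label)
    new_labels = []
    for c, y in zip(cluster_ids, labels):
        is_target = firsts[c] / totals[c] > threshold
        if is_target and y == second_label:
            y = first_label    # proportion > threshold: second -> first
        elif not is_target and y == first_label:
            y = second_label   # proportion <= threshold: first -> second
        new_labels.append(y)
    return new_labels


# Cluster "a": 90% first type, "b": 20%, "c": exactly 80%.
cluster_ids = ["a"] * 10 + ["b"] * 5 + ["c"] * 5
labels = [1] * 9 + [0] + [1, 0, 0, 0, 0] + [1, 1, 1, 1, 0]
remarked = secondary_mark(cluster_ids, labels)
```

In this sketch every item of a data cluster ends up with the same label value; the refinements in the following embodiments (a predetermined number of updates, an update order) restrict which items are actually changed.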

In yet another embodiment of the present application, updating the label value of the second type data in the current data cluster to the first label value when the proportion is greater than the proportion threshold includes: determining the current data cluster whose proportion is greater than the proportion threshold as a target data cluster; and updating the label values of a predetermined number of items of the second type data in the target data cluster to the first label value. In this embodiment, the label values of a predetermined number of items of second type data can be updated according to business requirements, so that the resulting data clusters better meet those requirements.

In another specific embodiment, part of the first type data and part of the second type data can also be extracted from the data clusters. For example, 2000 items of data are extracted, of which the first type data accounts for 1/50, i.e. 2000×(1/50)＝40 items. If the amount of first type data needs to be doubled, the label values of 40 items of second type data need to be updated to the first label value. The third data cluster in FIG. 2, for example, contains 400×(1-0.9)＝40 items of second type data and can therefore satisfy this predetermined number of updates.
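The arithmetic of this example, together with the update of a predetermined number of label values, can be sketched as follows. The function names `update_count` and `flip_predetermined` are hypothetical, and flipping the items in list order is an assumption: the description does not state which items of second type data in the target data cluster are selected.

```python
from fractions import Fraction


def update_count(sample_size, first_share, growth_factor=2):
    """Number of second type labels to update so that the amount of first
    type data becomes growth_factor times the original amount."""
    current_first = round(sample_size * first_share)
    return current_first * (growth_factor - 1)


def flip_predetermined(labels, count, first_label=1, second_label=0):
    """Update at most `count` second type labels to first_label.

    Items are taken in list order, which is an assumption of this sketch.
    """
    out, remaining = [], count
    for y in labels:
        if y == second_label and remaining > 0:
            out.append(first_label)
            remaining -= 1
        else:
            out.append(y)
    return out


needed = update_count(2000, Fraction(1, 50))   # 2000 x (1/50) = 40
third_cluster = [1] * 360 + [0] * 40           # 400 items, 90% first type
updated = flip_predetermined(third_cluster, needed)
```

Here the third data cluster holds exactly as many items of second type data as are needed, so all of them are updated.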

In a further embodiment of the present application, updating the label value of the first type data in the current data cluster to the second label value when the proportion is less than or equal to the proportion threshold includes: determining the current data cluster whose proportion is less than or equal to the proportion threshold as a non-target data cluster; and updating the label values of all the first type data in the non-target data cluster to the second label value. In this embodiment, when the proportion is less than or equal to the proportion threshold, the label values of all the first type data in the data cluster can be updated to the second label value, so that the label values of all data in that data cluster are the second label value, which further ensures that the resulting label data is accurate.

In another embodiment of the present application, updating the label value of the second type data in the current data cluster to the first label value when the proportion is greater than the proportion threshold includes: when the proportions of the first type data in several data clusters are all greater than the proportion threshold, comparing those proportions; and updating the label values of the second type data in those data clusters to the first label value cluster by cluster, in descending order of proportion. In this embodiment, the label values are updated in a predetermined order. The larger the proportion of first type data in a data cluster, the better the controllability of the data cluster and the more accurate the count of first type data in it. The data clusters with a large proportion of first type data are therefore updated first, which allows the data to be secondarily marked more accurately.
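The update order of this embodiment can be sketched as a filter followed by a descending sort. The function name `update_order` and the example proportions are hypothetical and are not taken from the drawings.

```python
def update_order(proportions, threshold=0.8):
    """Return the ids of the data clusters whose first type proportion exceeds
    the threshold, ordered from the largest proportion to the smallest."""
    targets = [cid for cid, share in proportions.items() if share > threshold]
    return sorted(targets, key=lambda cid: proportions[cid], reverse=True)


# Hypothetical proportions for four data clusters.
example = {"A": 0.85, "B": 0.95, "C": 0.60, "D": 0.90}
order = update_order(example)
print(order)
```

The second type data would then be re-marked cluster by cluster in this order, for example until a predetermined number of updates has been reached.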

In a specific embodiment of the present application, the method further includes: training a decision tree model with the first type data, the first label value, the second type data and the second label value to obtain a tree diagram; determining the rules of the decision tree according to the tree diagram; determining the similarity between the rules of the decision tree and standard rules; and determining, according to the similarity, the accuracy with which the rules of the decision tree classify and predict unknown data. This embodiment makes it possible to further verify whether the rules of the decision tree obtained after secondary marking are reasonable: the higher the similarity, the higher the accuracy with which the rules of the decision tree classify and predict unknown data.
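The description does not state how the similarity between the rules of the decision tree and the standard rules is calculated. Purely as an illustration, the sketch below assumes that each rule is represented as a string and uses the Jaccard index (shared rules divided by all distinct rules) as the similarity measure. Both the representation and the measure are assumptions, and the function name `rule_similarity` and the example rules are hypothetical.

```python
def rule_similarity(tree_rules, standard_rules):
    """Jaccard similarity between two collections of rule strings.

    The Jaccard index is an assumed measure chosen for illustration; it is
    not specified by the description.
    """
    a, b = set(tree_rules), set(standard_rules)
    if not a and not b:
        return 1.0  # two empty rule sets are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical rules written as "condition -> label value".
tree_rules = ["x1 <= 3 -> 1", "x1 > 3 and x2 <= 5 -> 0"]
standard_rules = ["x1 <= 3 -> 1", "x1 > 3 and x2 <= 4 -> 0"]
similarity = rule_similarity(tree_rules, standard_rules)
```

Under this assumed measure, identical rule sets give a similarity of 1 and disjoint rule sets give 0.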

In yet another specific embodiment of the present application, the tree diagram includes a plurality of nodes, each node includes relevant information, and the relevant information includes at least one of the following: node name, data length, data width, Gini coefficient, data volume, function value corresponding to the data, and data name. Determining the rules of the decision tree according to the tree diagram includes: obtaining position information of a target node in the tree diagram; and determining the rule of the decision tree corresponding to the target node according to the position information and the relevant information of the target node, the target node being one of the plurality of nodes. In this embodiment, the relevant information allows the rules of the decision tree to be determined more accurately, which further improves the accuracy with which the rules of the decision tree classify and predict unknown data.
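Of the relevant information listed above, the Gini coefficient can be derived from the function value of a node. The sketch below assumes that the Gini coefficient denotes the Gini impurity commonly used in decision trees, 1 - Σ p_k², where p_k is the share of class k among the samples at the node, and that the function value is the list of per-class sample counts. The function name `gini_impurity` and the example counts are hypothetical.

```python
def gini_impurity(value):
    """Gini impurity 1 - sum(p_k ** 2) for a node's per-class sample counts."""
    total = sum(value)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in value)


# A node holding 50 samples of each of three classes is maximally mixed
# for three classes; a node holding a single class is pure.
mixed = gini_impurity([50, 50, 50])
pure = gini_impurity([50, 0, 0])
print(round(mixed, 3), pure)
```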

In another specific embodiment, for the four data clusters in FIG. 2, the label value of the first type of data is 1, the label value of the second type of data is 0, and the proportion threshold is 80%. Only in the third data cluster is the proportion of the first type of data greater than 80%. The label values of the second type of data in the third data cluster are therefore updated to 1, and the label values of the first type of data in the first, second and fourth data clusters are updated to 0. The decision tree model is then trained. The resulting tree diagram is shown on the left of FIG. 3, and the middle of FIG. 3 is a pie chart of the proportions of the first label value of the first type of data and the second label value of the second type of data. The rules of the decision tree are obtained from the tree diagram. The data in the third data cluster that can be relabeled fall into different target nodes, for example node#2, node#5, node#7 and node#10: node#2 contains 34 first type data, node#5 contains 3, node#7 contains 2, and node#10 contains 1.

When the decision tree model is trained on the iris data set (a data set of three species of iris with 50 samples each), the resulting decision tree rules are as shown in FIG. 4, which includes 17 target nodes in total. In each node, the node name is node, the data length is petal length (cm), the data width is petal width (cm), the Gini coefficient is gini, the data volume is samples, the function value corresponding to the data is value, and the data name is class. The relevant information of each node is as follows:

node#0: petal length (cm) ≤ 2.45, gini = 0.667, samples = 150, value = [50, 50, 50], class = setosa;

node#1: gini = 0.0, samples = 50, value = [50, 0, 0], class = setosa;

node#2: petal width (cm) ≤ 1.75, gini = 0.5, samples = 100, value = [0, 50, 50], class = versicolor;

node#3: petal length (cm) ≤ 4.95, gini = 0.168, samples = 54, value = [0, 49, 50], class = versicolor;

node#4: petal width (cm) ≤ 1.65, gini = 0.041, samples = 48, value = [0, 47, 1], class = versicolor;

node#5: gini = 0.0, samples = 47, value = [0, 47, 0], class = versicolor;

node#6: gini = 0.0, samples = 1, value = [0, 0, 1], class = virginica;

node#7: petal width (cm) ≤ 1.55, gini = 0.444, samples = 6, value = [0, 2, 4], class = virginica;

node#8: gini = 0.0, samples = 3, value = [0, 0, 3], class = virginica;

node#9: petal length (cm) ≤ 6.95, gini = 0.444, samples = 3, value = [0, 2, 1], class = versicolor;

node#10: gini = 0.0, samples = 2, value = [0, 2, 0], class = versicolor;

node#11: gini = 0.0, samples = 1, value = [0, 0, 1], class = virginica;

node#12: petal length (cm) ≤ 4.85, gini = 0.043, samples = 46, value = [0, 1, 45], class = virginica;

node#13: petal length (cm) ≤ 5.95, gini = 0.444, samples = 3, value = [0, 1, 2], class = virginica;

node#14: gini = 0.0, samples = 43, value = [0, 1, 0], class = versicolor;

node#15: gini = 0.0, samples = 2, value = [0, 0, 2], class = virginica;

node#16: gini = 0.0, samples = 43, value = [0, 0, 43], class = virginica.
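Several of the gini values listed above agree with the standard Gini impurity, 1 - sum(p_i^2), computed from the node's value vector, where p_i is the share of class i in the node. The following minimal sketch assumes that definition (the function name `gini` is illustrative) and reproduces the figures for node#0, node#2, node#4 and node#7:

```python
def gini(value):
    """Gini impurity of a node from its per-class sample counts."""
    total = sum(value)
    if total == 0:
        return 0.0  # an empty node is treated as pure
    return 1.0 - sum((count / total) ** 2 for count in value)


# Per-class counts copied from the node list above.
node_values = {
    "node#0": [50, 50, 50],
    "node#2": [0, 50, 50],
    "node#4": [0, 47, 1],
    "node#7": [0, 2, 4],
}
rounded = {name: round(gini(counts), 3) for name, counts in node_values.items()}
```

A value of 0.0 corresponds to a pure node in which all samples belong to one class, which is why the leaf nodes in the list report gini = 0.0.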

An embodiment of the present application further provides a data label marking device. It should be noted that the data label marking device of the embodiment of the present application can be used to execute the data label marking method provided by the embodiment of the present application. The data label marking device provided by the embodiment of the present application is described below.

FIG. 5 is a schematic diagram of a data label marking device according to an embodiment of the present application. As shown in FIG. 5, the device includes:

a first processing unit 10, configured to obtain first type data and second type data, mark the first type data to obtain a first label value, and mark the second type data to obtain a second label value;

a clustering unit 20, configured to cluster the first type data and the second type data according to a predetermined clustering algorithm to obtain a plurality of data clusters, each of which includes at least one first type data and at least one second type data;

a first determining unit 30, configured to determine the proportion of the first type data in each of the data clusters;

a second processing unit 40, configured to relabel the first type data and the second type data according to the proportion, so that at least part of the first label values in the data clusters are updated to the second label value, and at least part of the second label values are updated to the first label value.

In the above device, the first processing unit obtains the first type data and the second type data, marks the first type data to obtain a first label value, and marks the second type data to obtain a second label value. The clustering unit clusters the first type data and the second type data according to a predetermined clustering algorithm to obtain a plurality of data clusters, each of which includes at least one first type data and at least one second type data. The first determining unit determines the proportion of the first type data in each of the data clusters. The second processing unit relabels the first type data and the second type data according to the proportion, so that at least part of the first label values in the data clusters are updated to the second label value, and at least part of the second label values are updated to the first label value. In this scheme, the first type data and the second type data are relabeled according to the proportion of the first type data in each data cluster. The label data obtained after relabeling is more accurate, which improves the accuracy of predicting unknown data.
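The step performed by the first determining unit can be sketched as follows. This is a minimal illustration, assuming the clustering result is available as one cluster identifier per item and that the first label value is 1; the function name `first_type_proportions` and the sample data are hypothetical.

```python
from collections import Counter


def first_type_proportions(cluster_ids, labels, first_label=1):
    """Share of first-type items in each cluster, keyed by cluster id."""
    totals = Counter(cluster_ids)
    firsts = Counter(
        cid for cid, label in zip(cluster_ids, labels) if label == first_label
    )
    # Counter returns 0 for clusters that contain no first-type items.
    return {cid: firsts[cid] / totals[cid] for cid in totals}


# Hypothetical data: cluster 0 holds 4 first-type items out of 5,
# cluster 1 holds 1 first-type item out of 4.
cluster_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1]
labels = [1, 1, 1, 1, 0, 1, 0, 0, 0]
proportions = first_type_proportions(cluster_ids, labels)
```

The resulting dictionary is the input that the second processing unit would compare against the proportion threshold.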

In one embodiment of the present application, the second processing unit includes a first processing module and a second processing module. The first processing module is configured to update the label value of the second type data in the current data cluster to the first label value when the proportion is greater than a proportion threshold. The second processing module is configured to update the label value of the first type data in the current data cluster to the second label value when the proportion is less than or equal to the proportion threshold. In this embodiment, the first type data and the second type data can be relabeled more accurately according to the relationship between the proportion and the proportion threshold, which further ensures that the obtained label data is accurate.
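The two branches of this embodiment can be sketched in a few lines. The sketch below assumes binary label values (1 for the first label value, 0 for the second), a threshold of 0.8 as in the 80% example, and a hypothetical function name `relabel_by_threshold`; it is an illustration rather than the claimed implementation.

```python
from collections import Counter


def relabel_by_threshold(cluster_ids, labels, threshold=0.8,
                         first_label=1, second_label=0):
    """Relabel each item according to the first-type share of its cluster."""
    totals = Counter(cluster_ids)
    firsts = Counter(
        cid for cid, label in zip(cluster_ids, labels) if label == first_label
    )
    relabeled = []
    for cid in cluster_ids:
        proportion = firsts[cid] / totals[cid]
        # Strictly greater than the threshold: second-type items in the
        # cluster take the first label. Otherwise (less than or equal):
        # first-type items in the cluster take the second label.
        relabeled.append(first_label if proportion > threshold else second_label)
    return relabeled


# Hypothetical data: cluster "c3" is 5/6 first-type, cluster "c1" is 1/4.
cluster_ids = ["c3"] * 6 + ["c1"] * 4
labels = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]
relabeled = relabel_by_threshold(cluster_ids, labels)
```

Note that a cluster whose proportion equals the threshold exactly falls into the second branch, matching the "less than or equal to" condition in the text.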

In another embodiment of the present application, the first processing module includes a first processing submodule and a second processing submodule. The first processing submodule is configured to determine the current data cluster whose proportion is greater than the proportion threshold as a target data cluster. The second processing submodule is configured to update the label values of a predetermined number of the second type data in the target data cluster to the first label value. In this embodiment, the label values of a predetermined number of the second type data can be updated according to business needs, so that the resulting data cluster better meets those needs.
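The embodiment does not say which second type data are chosen when only a predetermined number are updated. The sketch below assumes the simplest policy, taking them in input order; the policy, the function name `relabel_limited` and the sample data are assumptions made for illustration.

```python
def relabel_limited(cluster_ids, labels, target_cluster, max_updates,
                    first_label=1, second_label=0):
    """Update at most max_updates second-type items in the target cluster."""
    updated = list(labels)  # leave the caller's list untouched
    remaining = max_updates
    for index, (cid, label) in enumerate(zip(cluster_ids, labels)):
        if remaining <= 0:
            break
        if cid == target_cluster and label == second_label:
            updated[index] = first_label
            remaining -= 1
    return updated


# Hypothetical data: cluster 2 is the target cluster with two second-type items.
cluster_ids = [2, 2, 2, 2, 2, 2, 3]
labels = [1, 1, 1, 1, 0, 0, 0]
updated = relabel_limited(cluster_ids, labels, target_cluster=2, max_updates=1)
```

Other selection policies, such as choosing the items closest to the cluster centre, would fit the same interface.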

In yet another embodiment of the present application, the second processing module includes a third processing submodule and a fourth processing submodule. The third processing submodule is configured to determine the current data cluster whose proportion is less than or equal to the proportion threshold as a non-target data cluster. The fourth processing submodule is configured to update the label values of all the first type data in the non-target data cluster to the second label value. In this embodiment, when the proportion is less than or equal to the proportion threshold, the label values of all the first type data in the data cluster can be updated to the second label value. This ensures that all data in the data cluster carry the second label value, which further ensures that the obtained label data is accurate.

In another embodiment of the present application, the first processing module includes a comparison submodule and a fifth processing submodule. The comparison submodule is configured to compare the proportions when the proportions of the first type data in multiple data clusters are all greater than the proportion threshold. The fifth processing submodule is configured to update the label values of the second type data in the data clusters to the first label value, cluster by cluster, in descending order of proportion. In this embodiment, the label values are updated in a predetermined order. The larger the proportion of the first type data in a data cluster, the better the controllability of the data cluster and the more reliable the count of the first type data in it. Updating the data clusters with the largest proportion of first type data first therefore allows the data to be relabeled more accurately.
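The ordering step can be sketched as a filter followed by a descending sort. The function name `update_order`, the cluster names and the proportions below are hypothetical; the 0.8 threshold follows the 80% example used elsewhere in the description.

```python
def update_order(proportions, threshold=0.8):
    """Cluster ids above the threshold, largest first-type proportion first."""
    qualifying = [
        (cid, share) for cid, share in proportions.items() if share > threshold
    ]
    qualifying.sort(key=lambda item: item[1], reverse=True)
    return [cid for cid, _ in qualifying]


# Hypothetical first-type proportions for four clusters.
proportions = {"A": 0.85, "B": 0.95, "C": 0.60, "D": 0.90}
order = update_order(proportions)
```

With these numbers, cluster B would be updated first, then D, then A, and cluster C would be left to the "less than or equal to" branch.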

In a specific embodiment of the present application, the device further includes a training unit, a second determining unit, a third determining unit and a fourth determining unit. The training unit is configured to train a decision tree model using the first type data, the first label value, the second type data and the second label value to obtain a tree diagram. The second determining unit is configured to determine the rules of the decision tree according to the tree diagram. The third determining unit is configured to determine the similarity between the rules of the decision tree and standard rules. The fourth determining unit is configured to determine, according to the similarity, the accuracy with which the rules of the decision tree classify and predict unknown data. In this embodiment, it is possible to further verify whether the rules of the decision tree obtained after relabeling are reasonable: the higher the similarity, the higher the accuracy with which the rules of the decision tree classify and predict unknown data.

In another specific embodiment of the present application, the tree diagram includes multiple nodes, and each node includes relevant information. The relevant information includes at least one of the following: node name, data length, data width, Gini coefficient, data volume, function value corresponding to the data, and data name. The second determining unit includes an acquisition module and a determination module. The acquisition module is configured to obtain the position information of a target node in the tree diagram. The determination module is configured to determine the rule of the decision tree corresponding to the target node according to the position information and the relevant information of the target node, where the target node is one of the multiple nodes. In this embodiment, the rules of the decision tree can be determined more accurately according to the relevant information, which further improves the accuracy with which the rules of the decision tree classify and predict unknown data.

The data label marking device includes a processor and a memory. The first processing unit, the clustering unit, the first determining unit, the second processing unit and so on are all stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.

The processor includes a kernel, which retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the accuracy of predicting unknown data is improved by adjusting kernel parameters.

The memory may include non-permanent memory in computer-readable media, in the form of random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.

An embodiment of the present invention provides a computer-readable storage medium on which a program is stored. When the program is executed by a processor, the above data label marking method is implemented.

An embodiment of the present invention provides a processor configured to run a program, where the above data label marking method is executed when the program runs.

An embodiment of the present invention provides a device. The device includes a processor, a memory, and a program stored in the memory and executable on the processor. When the processor executes the program, at least the following steps are implemented:

Step S104: relabeling the first type data and the second type data according to the proportion, so that at least part of the first label values in the data clusters are updated to the second label value, and at least part of the second label values are updated to the first label value.

The device herein may be a server, a PC, a PAD, a mobile phone, or the like.

The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with at least the following method steps:

In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the relevant descriptions of other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, units or modules, and may be electrical or take other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.

From the above description, it can be seen that the above embodiments of the present application achieve the following technical effects:

1) In the data label marking method of the present application, first type data and second type data are first obtained, the first type data are marked to obtain a first label value, and the second type data are marked to obtain a second label value. The first type data and the second type data are then clustered according to a predetermined clustering algorithm to obtain a plurality of data clusters, each of which includes at least one first type data and at least one second type data. Next, the proportion of the first type data in each data cluster is determined. Finally, the first type data and the second type data are relabeled according to the proportion, so that at least part of the first label values in the data clusters are updated to the second label value, and at least part of the second label values are updated to the first label value. Because the data are relabeled according to the proportion of the first type data in each data cluster, the resulting label data is more accurate, which improves the accuracy of predicting unknown data.

2) In the data label marking device of the present application, the first processing unit obtains first type data and second type data, marks the first type data to obtain a first label value, and marks the second type data to obtain a second label value. The clustering unit clusters the first type data and the second type data according to a predetermined clustering algorithm to obtain a plurality of data clusters, each of which includes at least one first type data and at least one second type data. The first determining unit determines the proportion of the first type data in each data cluster. The second processing unit relabels the first type data and the second type data according to the proportion, so that at least part of the first label values in the data clusters are updated to the second label value, and at least part of the second label values are updated to the first label value. Because the data are relabeled according to the proportion of the first type data in each data cluster, the resulting label data is more accurate, which improves the accuracy of predicting unknown data.

The above are merely preferred embodiments of the present application and are not intended to limit the present application. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall fall within the protection scope of the present application.

Claims

1. A method of marking a data tag, comprising:

acquiring first type data and second type data, marking the first type data to obtain a first tag value, and marking the second type data to obtain a second tag value;

clustering the first type data and the second type data according to a preset clustering algorithm to obtain a plurality of data clusters, wherein each data cluster comprises at least one first type data and at least one second type data;

determining the proportion of the first type data in each data cluster;

performing secondary marking on the first type data and the second type data according to the occupied proportion, so that at least part of the first tag values in each data cluster are updated to the second tag values, and at least part of the second tag values are updated to the first tag values;

wherein performing secondary marking on the first type data and the second type data according to the occupied proportion, so that at least part of the first tag values in the data clusters are updated to the second tag values, and at least part of the second tag values are updated to the first tag values, comprises:

updating the tag value of the second type data in the current data cluster to the first tag value under the condition that the occupied proportion is larger than a proportion threshold value;

and updating the tag value of the first type data in the current data cluster to the second tag value under the condition that the occupied proportion is smaller than or equal to the proportion threshold value.

2. The method of claim 1, wherein updating the tag value of the second type data in the current data cluster to the first tag value under the condition that the occupied proportion is larger than a proportion threshold value comprises:

determining the current data cluster with the occupied proportion larger than the proportion threshold value as a target data cluster;

and updating the tag values of a preset number of the second type data in the target data cluster to the first tag values.

3. The method of claim 1, wherein updating the tag value of the first type data in the current data cluster to the second tag value under the condition that the occupied proportion is smaller than or equal to the proportion threshold value comprises:

determining the current data cluster with the occupied proportion smaller than or equal to the proportion threshold value as a non-target data cluster;

and updating the tag values of all the first type data in the non-target data cluster to the second tag values.

4. The method of claim 1, wherein updating the tag value of the second type data in the current data cluster to the first tag value under the condition that the occupied proportion is larger than a proportion threshold value comprises:

comparing the magnitudes of the plurality of occupied proportions in the case where the occupied proportions of the first type data in a plurality of the data clusters are all larger than the proportion threshold value;

and sequentially updating the tag values of the second type data in the data clusters to the first tag values in descending order of the occupied proportions.

5. The method according to any one of claims 1 to 4, further comprising:

training a decision tree model by adopting the first type data, the first tag value, the second type data and the second tag value to obtain a tree diagram;

determining rules of a decision tree according to the tree diagram;

determining the similarity between the rules of the decision tree and standard rules;

and determining, according to the similarity, the accuracy with which the rules of the decision tree classify and predict unknown data.

6. The method of claim 5, wherein the tree diagram comprises a plurality of nodes, each of the nodes comprising related information, the related information comprising at least one of: node name, data length, data width, Gini coefficient, data quantity, function value corresponding to data, and data name, and wherein determining rules of the decision tree according to the tree diagram comprises:

acquiring the position information of a target node in the tree diagram;

and determining the rule of the decision tree corresponding to the target node according to the position information and the related information of the target node, wherein the target node is one of the plurality of nodes.

7. A data tag marking apparatus, comprising:

a first processing unit, configured to acquire first type data and second type data, mark the first type data to obtain a first tag value, and mark the second type data to obtain a second tag value;

a clustering unit, configured to cluster the first type data and the second type data according to a predetermined clustering algorithm, so as to obtain a plurality of data clusters, where each data cluster includes at least one first type data and at least one second type data;

a first determining unit, configured to determine a proportion of the first type data in each of the data clusters;

a second processing unit, configured to perform secondary marking on the first type data and the second type data according to the occupied proportion, so that at least part of the first tag values in the data clusters are updated to the second tag values, and at least part of the second tag values are updated to the first tag values;

wherein the second processing unit includes:

a first processing module, configured to update the tag value of the second type data in the current data cluster to the first tag value under the condition that the occupied proportion is larger than a proportion threshold value;

and a second processing module, configured to update the tag value of the first type data in the current data cluster to the second tag value under the condition that the occupied proportion is smaller than or equal to the proportion threshold value.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the program performs the method of any one of claims 1 to 6.