CN110321552A

CN110321552A - Term vector construction method, device, medium and electronic equipment

Info

Publication number: CN110321552A
Application number: CN201910462774.9A
Authority: CN
Inventors: 崔勇; 杨光; 杨雪松
Original assignee: Taikang Asset Management Co Ltd; Taikang Insurance Group Co Ltd
Current assignee: Taikang Asset Management Co Ltd; Taikang Insurance Group Co Ltd
Priority date: 2019-05-30
Filing date: 2019-05-30
Publication date: 2019-10-11

Abstract

The embodiments of the present invention disclose a word vector construction method, device, medium and electronic equipment. The word vector construction method is used to construct the word vector of a new word not included in a trained Word2Vec model dictionary, and includes: obtaining, in related corpus text containing the new word, related words that appear in the context of the new word; and constructing the word vector of the new word from the word vectors of the related words. The invention can compute the word vector of a new word efficiently and economically, and the resulting word vector is close to the semantic information of the new word. In addition, using the word vector of the new word as an initialization value for retraining or incremental training can make training faster and more efficient.

Description

Word vector construction method, device, medium and electronic equipment

Technical field

The present invention relates to the technical field of language modeling and, in particular, to a word vector construction method, device, medium and electronic equipment for constructing word vectors of new words that are not included in a trained Word2Vec model dictionary.

Background

In recent years, in order to apply natural language to fields such as semantic analysis, sentiment trend analysis and information retrieval, words are usually first expressed as one-dimensional or multi-dimensional word vectors, which are then processed further by computing devices. In 2013, Google released the Word2Vec technique for training word vectors from a given corpus. After training, the resulting Word2Vec model dictionary can be used to map a word (contained in the corpus) to a word vector, enabling fast word vector modeling.

However, as new online vocabulary and new topics keep emerging and large amounts of new corpus text appear, a trained Word2Vec model dictionary may not contain newly emerged words. The current solution to this problem is to retrain the model on the new corpus text, or to train the model incrementally on the basis of the existing model. Because training requires multiple passes over the corpus text and a large amount of numerical computation, these approaches are inefficient, time-consuming and costly.

Summary of the invention

To solve the above problems in the prior art, according to one embodiment of the present invention, a word vector construction method is provided for constructing the word vector of a new word not included in a trained Word2Vec model dictionary. The method includes: obtaining, in related corpus text containing the new word, related words that appear in the context of the new word; and constructing the word vector of the new word from the word vectors of the related words.

In the above method, obtaining the related words that appear in the context of the new word in the related corpus text containing the new word includes: performing word segmentation on the related corpus text to obtain a word sequence; and, in the word sequence, taking a predetermined number of words appearing before and after the new word as the related words.

In the above method, constructing the word vector of the new word from the word vectors of the related words includes: looking up the word vector corresponding to each related word in the trained Word2Vec model dictionary, and adding each related word for which a word vector is found to the context-related word list of the new word as an actual associated word; and obtaining the word vectors corresponding to the actual associated words in the context-related word list, and calculating the word vector of the new word from the obtained word vectors.

In the above method, calculating the word vector of the new word from the obtained word vectors of the actual associated words may include: taking the average of the obtained word vectors as the word vector of the new word.

In the above method, calculating the word vector of the new word from the obtained word vectors of the actual associated words may include: recording the number of times each actual associated word appears in the context of the new word; and calculating the word vector of the new word according to the following formula:

$$WV_i = \sum_{j \in List_i} w_j \cdot WV_j$$

where WV_i denotes the word vector of new word i, List_i denotes the context-related word list of new word i, WV_j denotes the word vector corresponding to actual associated word j of new word i in the trained Word2Vec model dictionary, and w_j denotes the ratio of the number of times actual associated word j appears in the context of new word i to the total number of times all actual associated words in List_i appear in the context of new word i.

In the above method, calculating the word vector of the new word from the obtained word vectors of the actual associated words may include: recording the average distance between each actual associated word and the new word; and calculating the word vector of the new word according to the following formula:

$$WV_i = \sum_{j \in List_i} v_j \cdot WV_j$$

where WV_i denotes the word vector of new word i, List_i denotes the context-related word list of new word i, WV_j denotes the word vector corresponding to actual associated word j of new word i in the trained Word2Vec model dictionary, and v_j denotes the ratio of the reciprocal of the average distance between actual associated word j and new word i to the sum of the reciprocals of the average distances between all actual associated words in List_i and new word i.

According to one embodiment of the present invention, a method for updating a Word2Vec model dictionary is also provided, including: obtaining a new word for which a word vector is to be generated; constructing the word vector of the new word according to the above word vector construction method; and adding the new word and its word vector to the Word2Vec model dictionary.

According to one embodiment of the present invention, a word vector construction device is also provided for constructing the word vector of a new word not included in a trained Word2Vec model. The device includes:

a related word acquisition module, configured to obtain, in related corpus text containing the new word, related words that appear in the context of the new word; and

a word vector construction module, configured to construct the word vector of the new word from the word vectors of the related words.

According to one embodiment of the present invention, a computer-readable medium is also provided, on which a computer program is stored. When executed by a processor, the computer program implements the above word vector construction method or the above method for updating a Word2Vec model dictionary.

According to one embodiment of the present invention, an electronic device is also provided, including: one or more processors; and a storage device for storing one or more computer programs which, when executed by the one or more processors, cause the electronic device to implement the above word vector construction method or the above method for updating a Word2Vec model dictionary.

The technical solutions provided by the embodiments of the present invention have the following beneficial effects:

Word vectors for new words are constructed from the context-related words of the new words in the corpus text and from the trained Word2Vec model dictionary. This process requires only a single pass over the corpus text containing the new words, so the word vectors of new words are computed efficiently and economically. In addition, because the context-related words are semantically related to the new word, the resulting word vector is close to the semantic information of the new word. When the model is later retrained or incrementally trained, the resulting word vector can serve as the initialization value for the retraining or incremental training, so that training proceeds faster and more effectively.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the invention.

Brief description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention. The drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:

Fig. 1 schematically shows a flowchart of a word vector construction method according to one embodiment of the present invention;

Fig. 2 schematically shows a flowchart of a method for obtaining, in related corpus text containing a new word, related words that appear in the context of the new word, according to one embodiment of the present invention;

Fig. 3 schematically shows a flowchart of a method for constructing the word vector of a new word from the word vectors of related words according to one embodiment of the present invention;

Fig. 4 schematically shows a flowchart of a method for updating a Word2Vec model dictionary according to one embodiment of the present invention;

Fig. 5 schematically shows a block diagram of a word vector construction device according to one embodiment of the present invention;

Fig. 6 schematically shows a block diagram of a device for updating a Word2Vec model dictionary according to one embodiment of the present invention;

Fig. 7 schematically shows the structure of a computer system suitable for implementing the electronic device of an embodiment of the present invention.

Detailed description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concept of the example embodiments to those skilled in the art.

Furthermore, the described features, structures or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the invention. However, those skilled in the art will appreciate that the technical solutions of the present invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps and so on. In other instances, well-known methods, devices, implementations or operations are not shown or described in detail, to avoid obscuring aspects of the invention.

The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The flowcharts shown in the drawings are only exemplary illustrations; they need not include all contents and operations/steps, nor must the steps be performed in the order described. For example, some operations/steps can be decomposed and others can be combined or partly combined, so the actual order of execution may change according to the actual situation.

For new words not included in a trained Word2Vec model dictionary, retraining or incrementally training the Word2Vec model is inefficient and costly, because these new words occur relatively rarely in the corpus text. Moreover, the core idea of the Word2Vec algorithm is that the semantics of a word are jointly determined by its context-related words. Therefore, the word vector of a new word can be constructed from the word vectors of its context-related words, yielding a word vector that matches the semantics of the new word.

In view of this, according to one embodiment of the present invention, a word vector construction method is provided for constructing the word vector of a new word not included in a trained Word2Vec model dictionary. The method can be executed on an electronic device with computing and storage functions, and the electronic device may be located at a client or a server.

Fig. 1 schematically shows a flowchart of a word vector construction method according to one embodiment of the present invention. In summary, the method includes: obtaining, in related corpus text containing the new word whose word vector is to be constructed, related words that appear in the context of the new word; and constructing the word vector of the new word from the word vectors of the related words. The steps of the method are described in detail below with reference to Fig. 1:

Step S101. In related corpus text containing the new word, obtain related words that appear in the context of the new word. Referring to Fig. 2, step S101 includes the following sub-steps:

Step S1011. Obtain the new word (also called the target new word) and related corpus text containing the new word.

Here, a new word is a word not included in the trained Word2Vec model dictionary. The trained Word2Vec model dictionary contains multiple words obtained through training and the word vector corresponding to each word. The related corpus text may be previously trained corpus text (that is, the corpus text used to train the trained Word2Vec model dictionary) or newly added corpus text, and may come from different fields.

Step S1012. Perform word segmentation on the related corpus text to obtain a word sequence.

Step S1013. In the word sequence, obtain the n words (n being a positive integer) appearing before and after the new word as the related words of the new word (that is, its context-related words). Some related words may appear in the context of the new word more than once, so the number of times each related word appears in the context of the new word in the related corpus text is also recorded.

Specifically, the word sequence is traversed. Whenever the new word is matched, the words within one window before and one window after the new word in the word sequence are taken as related words of the new word, according to a preset window size. Preferably, the preset window size is 7, in which case the seven words before and the seven words after each matched occurrence of the new word in the related corpus text are taken as its related words. If a related word appears in a window for the first time, its count of occurrences in the context of the new word is set to 1; otherwise, its count is incremented by 1.
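The traversal in step S1013 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `collect_context_counts` is hypothetical, and the word sequence is assumed to be already segmented into a Python list (the text does not name a segmentation tool).

```python
from collections import Counter

def collect_context_counts(tokens, new_word, window=7):
    """Count how often each word falls inside a window around `new_word`.

    `tokens` is the segmented word sequence; `window` is the preset
    window size (7 in the preferred embodiment).
    """
    counts = Counter()
    for pos, token in enumerate(tokens):
        if token != new_word:
            continue
        before = tokens[max(0, pos - window):pos]
        after = tokens[pos + 1:pos + 1 + window]
        # Counter starts a word at 1 on first sight and adds 1 afterwards,
        # which matches the counting rule described in the text.
        counts.update(before + after)
    return counts

# Toy sequence, with a window of 1 to keep the example small.
tokens = ["mild", "covid", "symptoms", "and", "mild", "covid", "cases"]
counts = collect_context_counts(tokens, "covid", window=1)
```

A single pass over the word sequence is enough, which is the property the text relies on when it compares this method with retraining.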

Step S102. Construct the word vector of the new word from the word vectors of its related words. Referring to Fig. 3, step S102 includes the following sub-steps:

Step S1021. Load the trained Word2Vec model dictionary.

Step S1022. Construct the context-related word list of the new word.

For each related word of the new word, the related word and its corresponding word vector are looked up in the trained Word2Vec model dictionary. If a corresponding word vector is found, the related word is added to the context-related word list of the new word as an actual associated word, and the list records both the number of times the related word appears in the context of the new word and the word vector corresponding to the related word in the Word2Vec model dictionary. If no word vector is found, the related word is discarded and not recorded.
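The filtering in step S1022 can be sketched as below. The sketch assumes, for illustration only, that the trained dictionary is available as a plain Python `dict` mapping each word to its vector; `build_context_list` and the toy vectors are hypothetical.

```python
def build_context_list(context_counts, dictionary):
    """Keep only related words that have a vector in the trained dictionary."""
    context_list = {}
    for word, count in context_counts.items():
        vector = dictionary.get(word)
        if vector is None:
            continue  # no vector found: discard the related word
        # Record the occurrence count and the vector for the actual associated word.
        context_list[word] = {"count": count, "vector": vector}
    return context_list

# Toy 2-dimensional dictionary; "symptoms" is deliberately missing.
dictionary = {"mild": [0.2, 0.4], "cases": [0.6, 0.0]}
context_counts = {"mild": 2, "symptoms": 1, "cases": 1}
context_list = build_context_list(context_counts, dictionary)
```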

Step S1023. Calculate the word vector of the new word from the word vectors corresponding to the actual associated words in the context-related word list.

Specifically, the word vectors corresponding to the actual associated words in the context-related word list are obtained and combined by weighted averaging to give the word vector of the new word. The weight of each word vector is the ratio of the number of times the corresponding actual associated word appears in the context of the new word to the total number of times all actual associated words in the context-related word list appear in the context of the new word, as given by the following formula:

$$WV_i = \sum_{j \in List_i} w_j \cdot WV_j$$

where WV_i denotes the word vector of new word i, List_i denotes the context-related word list of new word i, WV_j denotes the word vector corresponding to actual associated word j of new word i in the Word2Vec model dictionary, and w_j denotes the ratio of the number of times actual associated word j appears in the context of new word i to the total number of times all actual associated words in List_i appear in the context of new word i.
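The count-weighted average of step S1023 can be illustrated with a short sketch. The function name and the `(count, vector)` layout of the list are assumptions made for this example; raising an error on an empty list is also an assumption, since the text does not say what happens when no actual associated word exists.

```python
def count_weighted_vector(context_list):
    """Weighted average of vectors, with w_j = count_j / total count."""
    if not context_list:
        # Assumption: the text does not cover an empty list.
        raise ValueError("no actual associated words")
    total = sum(count for count, _ in context_list.values())
    dim = len(next(iter(context_list.values()))[1])
    result = [0.0] * dim
    for count, vector in context_list.values():
        weight = count / total
        for k in range(dim):
            result[k] += weight * vector[k]
    return result

# "mild" appears 3 times and "cases" once, so the weights are 0.75 and 0.25.
context_list = {"mild": (3, [1.0, 0.0]), "cases": (1, [0.0, 1.0])}
new_vector = count_weighted_vector(context_list)
```

Because the weights sum to 1, the result stays on the same scale as the vectors already in the dictionary.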

In the above embodiment, only a single pass over the corpus text containing the new word is required, so the word vector construction method is more efficient and economical than retraining or incrementally training the model. Because the calculation takes the context information of the new word into account, the resulting word vector is close to the semantic information of the new word. In addition, when the model is later retrained or incrementally trained, the resulting word vector can serve as the initialization value for the retraining or incremental training, so that training proceeds faster and more effectively.

In the above embodiment, the number of times each related word appears in the context of the new word must be recorded in order to calculate the word vector of the new word. In another embodiment, to further improve efficiency and save computing resources, these counts need not be recorded; instead, the average of the word vectors corresponding to all actual associated words in the context-related word list of the new word is used as the word vector of the new word.
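The unweighted variant reduces to a component-wise mean. A minimal sketch, with a hypothetical function name and toy vectors:

```python
def mean_vector(vectors):
    """Component-wise average of equally weighted word vectors."""
    if not vectors:
        # Assumption: the text does not cover an empty list.
        raise ValueError("no actual associated words")
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

new_vector = mean_vector([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

Here every actual associated word contributes equally, however often it appeared near the new word, so no per-word counters need to be kept.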

In another embodiment, the average distance between each related word and the new word can be recorded and used to calculate the word vector of the new word. For example, when obtaining the related words that appear in the context of the new word, the distance d between each related word and the new word is recorded instead of the number of occurrences (a distance of d means that d-1 words lie between the new word and the related word). If a word vector corresponding to a related word is found in the trained Word2Vec model dictionary, then in addition to recording the related word (that is, the actual associated word) and its word vector in the context-related word list, the average distance between the related word and the new word is also recorded. When calculating the word vector of the new word, the word vectors corresponding to the actual associated words in the context-related word list are combined by weighted averaging, where the weight of each word vector is the ratio of the reciprocal of the average distance between the corresponding actual associated word and the new word to the sum of the reciprocals of the average distances between all actual associated words in the list and the new word, as given by the following formula:

$$WV_i = \sum_{j \in List_i} v_j \cdot WV_j$$

where WV_i denotes the word vector of new word i, List_i denotes the context-related word list of new word i, WV_j denotes the word vector corresponding to actual associated word j of new word i in the Word2Vec model dictionary, and v_j denotes the ratio of the reciprocal of the average distance between actual associated word j and new word i to the sum of the reciprocals of the average distances between all actual associated words in List_i and new word i.
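The distance-based weighting can be sketched as follows. Both function names are hypothetical, the dictionary is again assumed to be a plain `dict`, and the word sequence is assumed to be segmented already. A distance of 1 means the two words are adjacent.

```python
from collections import defaultdict

def collect_average_distances(tokens, new_word, window=7):
    """Average distance d between each context word and the new word."""
    distances = defaultdict(list)
    for pos, token in enumerate(tokens):
        if token != new_word:
            continue
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for other in range(lo, hi):
            if other != pos:
                # d = 1 for an adjacent word, so d - 1 words lie in between.
                distances[tokens[other]].append(abs(other - pos))
    return {word: sum(ds) / len(ds) for word, ds in distances.items()}

def distance_weighted_vector(avg_distances, dictionary):
    """Weighted average with v_j = (1 / d_j) / sum over k of (1 / d_k)."""
    kept = {w: d for w, d in avg_distances.items() if w in dictionary}
    if not kept:
        # Assumption: the text does not cover an empty list.
        raise ValueError("no related word has a vector in the dictionary")
    norm = sum(1.0 / d for d in kept.values())
    dim = len(dictionary[next(iter(kept))])
    result = [0.0] * dim
    for word, d in kept.items():
        weight = (1.0 / d) / norm
        for k in range(dim):
            result[k] += weight * dictionary[word][k]
    return result

tokens = ["severe", "mild", "covid", "cases"]
dictionary = {"mild": [1.0, 0.0], "severe": [0.0, 1.0]}  # "cases" is missing
avg_distances = collect_average_distances(tokens, "covid", window=2)
new_vector = distance_weighted_vector(avg_distances, dictionary)
```

In this toy run "mild" is at distance 1 and "severe" at distance 2, so their weights are 2/3 and 1/3: closer words contribute more.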

According to one embodiment of the present invention, a method for updating a Word2Vec model dictionary is also provided. The method can be executed on an electronic device with computing and storage functions, and the electronic device may be located at a client or a server.

Fig. 4 schematically shows a flowchart of a method for updating a Word2Vec model dictionary according to one embodiment of the present invention. The steps of the method are described in detail below with reference to Fig. 4:

Step S201. Obtain a new word for which a word vector is to be generated.

Step S202. Construct the word vector of the new word, including the following sub-steps:

Step S2021. In related corpus text containing the new word, obtain related words that appear in the context of the new word.

Specifically, the new word and related corpus text containing the new word are obtained; word segmentation is performed on the related corpus text to obtain a word sequence; and, in the word sequence, the n words (n being a positive integer) appearing before and after the new word are obtained as the related words of the new word.

Step S2022. Construct the word vector of the new word from the word vectors of its related words.

Specifically, the trained Word2Vec model dictionary is loaded; the context-related word list of the new word is constructed; and the word vector of the new word is calculated from the word vectors corresponding to the actual associated words in the context-related word list of the new word.

Step S203. Add the new word and its word vector to the Word2Vec model dictionary, thereby updating the Word2Vec model dictionary.
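Steps S201 to S203 can be put together in one compact sketch. It uses the count-weighted variant and models the dictionary as a plain Python `dict`; both function names are hypothetical, and skipping the update when no related word has a vector is an assumption, since the text does not cover that case. A real Word2Vec library would need its own call for adding a vector.

```python
from collections import Counter

def build_new_word_vector(tokens, new_word, dictionary, window=7):
    """Count-weighted average of the vectors of in-dictionary context words."""
    counts = Counter()
    for pos, token in enumerate(tokens):
        if token == new_word:
            context = tokens[max(0, pos - window):pos] + tokens[pos + 1:pos + 1 + window]
            counts.update(w for w in context if w in dictionary)
    total = sum(counts.values())
    if total == 0:
        return None  # assumption: nothing to build a vector from
    dim = len(next(iter(dictionary.values())))
    vector = [0.0] * dim
    for word, count in counts.items():
        for k in range(dim):
            vector[k] += (count / total) * dictionary[word][k]
    return vector

def update_dictionary(tokens, new_word, dictionary, window=7):
    """Add `new_word` and its constructed vector; return True if added."""
    if new_word in dictionary:
        return False  # not a new word
    vector = build_new_word_vector(tokens, new_word, dictionary, window)
    if vector is None:
        return False
    dictionary[new_word] = vector
    return True

dictionary = {"mild": [1.0, 0.0], "cases": [0.0, 1.0]}
tokens = ["mild", "covid", "cases", "mild", "covid"]
added = update_dictionary(tokens, "covid", dictionary, window=1)
```

After the call, `dictionary` holds an entry for "covid" built from "mild" (seen twice) and "cases" (seen once), without any retraining.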

Device embodiments of the present invention are described below with reference to the accompanying drawings.

According to one embodiment of the present invention, a word vector construction device is also provided. Fig. 5 shows a block diagram of the word vector construction device 500, which includes a related word acquisition module 501 and a word vector construction module 502. The related word acquisition module 501 is configured to obtain, in related corpus text containing the new word, related words that appear in the context of the new word; the word vector construction module 502 is configured to construct the word vector of the new word from the word vectors of its related words.

Because the functional modules of the word vector construction device of this embodiment correspond to the steps of the word vector construction method described above with reference to Figs. 1-3, refer to that method embodiment for details not disclosed in this device embodiment.

According to one embodiment of the present invention, a device for updating a Word2Vec model dictionary is also provided. Fig. 6 shows a block diagram of the device 600 for updating a Word2Vec model dictionary. As shown in Fig. 6, the device includes a new word acquisition module 601, a new word vector construction module 602 and an update module 603. The new word acquisition module 601 is configured to obtain a new word for which a word vector is to be generated; the new word vector construction module 602 is configured to construct the word vector of the new word; and the update module 603 is configured to add the new word and its word vector to the Word2Vec model dictionary, thereby updating the Word2Vec model dictionary.

Because the functional modules of the device for updating a Word2Vec model dictionary of this embodiment correspond to the steps of the method for updating a Word2Vec model dictionary described above with reference to Fig. 4, refer to that method embodiment for details not disclosed in this device embodiment.

According to one embodiment of the present invention, a computer system suitable for implementing the electronic device of an embodiment of the present invention is also provided. Referring to Fig. 7, the computer system 700 includes a bus 705, through which the devices coupled to it can transfer information quickly. A processor 701 is coupled to the bus 705 and executes a set of actions or operations specified by computer program code. The processor 701 may be implemented, alone or in combination with other devices, as mechanical, electrical, magnetic, optical, quantum or chemical components.

The computer system 700 also includes a memory 703 coupled to the bus 705. The memory 703 (for example, RAM or another dynamic storage device) stores data that can be changed by the computer system 700, including instructions or computer programs that implement the word vector construction method and the Word2Vec model dictionary updating method described in the above embodiments. When the processor 701 executes the instructions or computer programs, the computer system 700 implements those methods, for example, the steps shown in FIGS. 1 to 4. The memory 703 can also store temporary data generated while the processor 701 executes the instructions or computer programs, as well as various programs and data required for system operation. The computer system 700 also includes a read-only memory 702 coupled to the bus 705 and a non-volatile storage device 708, such as a magnetic disk or optical disk, for storing data that persists even when the computer system 700 is shut down or loses power.

The computer system 700 also includes an input device 706, such as a keyboard or a sensor, and an output device 707, such as a cathode ray tube (CRT), a liquid crystal display (LCD), or a printer. The computer system 700 also includes a communication interface 704 coupled to the bus 705, which can provide one-way or two-way communication coupling to external devices. For example, the communication interface 704 may be a parallel port, a serial port, a telephone modem, or a local area network (LAN) card. The computer system 700 also includes a drive device 709 coupled to the bus 705 and a removable device 710, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, which is mounted on the drive device 709 as needed so that a computer program read from it can be installed into the storage device 708 as needed.

According to another embodiment of the present invention, a computer-readable medium is also provided. The computer-readable medium may be included in the computer system 700 described above, or it may exist separately without being assembled into the computer system 700. The computer-readable medium carries one or more computer programs or instructions which, when executed by a processor, cause the computer system 700 to implement the word vector construction method and the Word2Vec model dictionary updating method described in the above embodiments. It should be noted that a computer-readable medium refers to any medium that provides data to the processor 701. Such a medium may take any form, including but not limited to computer-readable storage media (for example, non-volatile media and volatile media) and transmission media. Non-volatile media include, for example, optical disks or magnetic disks, such as the storage device 708; volatile media include, for example, the memory 703. Transmission media include, for example, coaxial cables, copper wire, fiber optic cables, and carrier waves that travel through space without wires or cables, such as acoustic waves and electromagnetic waves, including radio, optical, and infrared waves. Common forms of computer-readable media include: a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, CD-ROM, CDRW, DVD, any other optical medium, punched cards, paper tape, optical mark sheets, any other physical medium with patterns of holes or other optically recognizable marks, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, a carrier wave, or any other medium that a computer can read.

Other embodiments of the present invention will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present invention that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed herein. The specification and embodiments are to be regarded as exemplary only; the true scope and spirit of the present invention are indicated by the following claims.

It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims

1. A word vector construction method for constructing the word vector of a new word that is not included in a trained Word2Vec model dictionary, the method comprising:

obtaining, from relevant corpus text containing the new word, related words that appear in the context of the new word; and

constructing the word vector of the new word according to the word vectors of the related words.

2. The method according to claim 1, characterized in that obtaining, from the relevant corpus text containing the new word, the related words that appear in the context of the new word comprises:

performing a word segmentation operation on the relevant corpus text to obtain a word sequence; and

acquiring, in the word sequence, a predetermined number of words appearing before and after the new word as the related words.

3. The method according to claim 1 or 2, characterized in that constructing the word vector of the new word according to the word vectors of the related words comprises:

looking up the word vector corresponding to each related word in the trained Word2Vec model dictionary, and adding each related word for which a corresponding word vector is found to a context-related word list of the new word as an actual associated word; and

obtaining the word vectors corresponding to the actual associated words in the context-related word list, and calculating the word vector of the new word according to the obtained word vectors.

4. The method according to claim 3, wherein calculating the word vector of the new word according to the obtained word vectors corresponding to the actual associated words comprises:

using the average of the obtained word vectors corresponding to the actual associated words as the word vector of the new word.

5. The method according to claim 3, wherein calculating the word vector of the new word according to the obtained word vectors corresponding to the actual associated words comprises:

recording the number of times each actual associated word appears in the context of the new word; and

calculating the word vector of the new word according to the following formula:

$$WV_i = \sum_{j \in List_i} w_j \cdot WV_j$$

where $WV_i$ represents the word vector of new word $i$, $List_i$ represents the context-related word list of new word $i$, $WV_j$ represents the word vector corresponding to the actual associated word $j$ of new word $i$ in the trained Word2Vec model dictionary, and $w_j$ represents the ratio of the number of times the actual associated word $j$ appears in the context of new word $i$ to the total number of times all actual associated words in $List_i$ appear in the context of new word $i$.

6. The method according to claim 3, wherein calculating the word vector of the new word according to the obtained word vectors corresponding to the actual associated words comprises:

recording the average distance between each actual associated word and the new word; and

calculating the word vector of the new word according to the following formula:

$$WV_i = \sum_{j \in List_i} v_j \cdot WV_j$$

where $WV_i$ represents the word vector of new word $i$, $List_i$ represents the context-related word list of new word $i$, $WV_j$ represents the word vector corresponding to the actual associated word $j$ of new word $i$ in the trained Word2Vec model dictionary, and $v_j$ represents the ratio of the reciprocal of the average distance between the actual associated word $j$ and new word $i$ to the sum of the reciprocals of the average distances between all actual associated words in $List_i$ and new word $i$.

7. A method for updating a Word2Vec model dictionary, comprising:

obtaining a new word for which a word vector is to be generated;

constructing the word vector of the new word according to the word vector construction method of any one of claims 1 to 6; and

adding the new word and the word vector of the new word to the Word2Vec model dictionary.

8. A word vector construction device for constructing the word vector of a new word that is not included in a trained Word2Vec model, the device comprising:

a related word acquisition module, configured to obtain, from relevant corpus text containing the new word, related words that appear in the context of the new word; and

a word vector construction module, configured to construct the word vector of the new word according to the word vectors of the related words.

9. A computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.

10. An electronic device, characterized in that it comprises:

one or more processors; and

a storage device for storing one or more computer programs which, when executed by the one or more processors, cause the electronic device to implement the method according to any one of claims 1 to 7.
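
As an illustration of the steps recited in claims 2 to 6, the following Python sketch collects the actual associated words of a new word and applies the three weighting rules. It rests on several assumptions that the claims do not fix: the corpus is already segmented into a token list, distance is measured in token positions, each actual associated word is counted once in the plain average of claim 4, and the formula in claims 5 and 6 is taken to be the weighted sum of the associated word vectors, as the symbol definitions suggest. All function names are hypothetical, and a plain `dict` stands in for the trained Word2Vec model dictionary.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = List[float]


def collect_context(tokens: List[str], new_word: str, window: int,
                    dictionary: Dict[str, Vector]
                    ) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Return occurrence counts and average distances of actual associated words."""
    counts: Dict[str, int] = defaultdict(int)
    distance_sums: Dict[str, int] = defaultdict(int)
    for pos, token in enumerate(tokens):
        if token != new_word:
            continue
        start = max(0, pos - window)
        stop = min(len(tokens), pos + window + 1)
        for other in range(start, stop):
            word = tokens[other]
            # Skip the new word itself and words the dictionary cannot map.
            if word == new_word or word not in dictionary:
                continue
            counts[word] += 1
            distance_sums[word] += abs(other - pos)
    mean_distance = {w: distance_sums[w] / counts[w] for w in counts}
    return dict(counts), mean_distance


def weighted_vector(weights: Dict[str, float],
                    dictionary: Dict[str, Vector]) -> Vector:
    """Sum the vectors of the weighted words after normalizing weights to 1."""
    if not weights:
        raise ValueError("no actual associated words were found")
    total = sum(weights.values())
    first_word = next(iter(weights))
    result = [0.0] * len(dictionary[first_word])
    for word, weight in weights.items():
        for k, component in enumerate(dictionary[word]):
            result[k] += (weight / total) * component
    return result


def mean_vector(counts: Dict[str, int],
                dictionary: Dict[str, Vector]) -> Vector:
    # Claim 4: every actual associated word gets the same weight.
    return weighted_vector({w: 1.0 for w in counts}, dictionary)


def frequency_weighted_vector(counts: Dict[str, int],
                              dictionary: Dict[str, Vector]) -> Vector:
    # Claim 5: weight proportional to the number of occurrences in the context.
    return weighted_vector({w: float(c) for w, c in counts.items()}, dictionary)


def distance_weighted_vector(mean_distance: Dict[str, float],
                             dictionary: Dict[str, Vector]) -> Vector:
    # Claim 6: weight proportional to the reciprocal of the average distance.
    return weighted_vector({w: 1.0 / d for w, d in mean_distance.items()},
                           dictionary)
```

Because the new word itself is skipped, every recorded distance is at least one token, so the reciprocal in `distance_weighted_vector` is always defined in this sketch.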