CN108960317B - Cross-language text classification method based on word vector representation and classifier combined training - Google Patents

Cross-language text classification method based on word vector representation and classifier combined training

Info

Publication number
CN108960317B
Authority
CN
China
Prior art keywords
word
loss
language
text
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810680474.3A
Other languages
Chinese (zh)
Other versions
CN108960317A (en)
Inventor
曹海龙
杨沐昀
赵铁军
高国骥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology Shenzhen
Original Assignee
Harbin Institute of Technology Shenzhen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology Shenzhen filed Critical Harbin Institute of Technology Shenzhen
Priority to CN201810680474.3A priority Critical patent/CN108960317B/en
Publication of CN108960317A publication Critical patent/CN108960317A/en
Application granted granted Critical
Publication of CN108960317B publication Critical patent/CN108960317B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

A cross-language text classification method based on cross-language word vector representation and joint training of a classifier; the invention relates to cross-language text classification methods. The purpose of the invention is to solve the problem that existing synonym-replacement methods have low classification accuracy, while existing translation-based methods are more accurate but require a large corpus to train the translator and take a long time to train, so that the complexity and time consumption of that task far exceed the comparatively simple task of text classification, making them impractical. The process is: one, corpus preprocessing; two, optimize the total loss function by a gradient optimization method so that it reaches its minimum value, which corresponds to a set of word vectors and a classifier; three, take the label with the highest probability as the classification result of the test text in the target language T, and compare it with the gold-standard results of the test set to obtain test accuracy and recall metrics. The invention is used in the field of cross-language text classification.

Figure 201810680474

Description

Cross-language text classification method based on word vector representation and classifier combined training
Technical Field
The invention relates to a cross-language text classification method.
Background
Text classification is one of the most important basic technologies in the fields of natural language processing, machine learning, and information retrieval. Its task is to classify a piece of text into a particular category, or to label a piece of text with one or more labels. It is also an important research field in its own right.
The background of the cross-language text classification task is: there are texts in two languages, defined respectively as source-language text and target-language text, where the corpus in the target language is insufficient to train a text classifier with acceptable performance, so help from the source language is needed. The task aims to train a text classifier on the source language such that the classifier can be tested on target-language text and achieve good classification performance.
The main background that gives rise to the cross-language text classification problem is that a large number of languages lack enough training corpora to train a text classifier with acceptable performance, so languages with abundant corpus resources (such as English) are needed to construct a machine learning system (such as a classifier) and train the task.
Traditional methods realize cross-language text classification mainly in the following two ways:
1. Methods based on synonym substitution. When translation dictionary resources are relatively rich, words in the target language can be directly and simply replaced by the corresponding words in the source language, so that the two texts share the same feature space at the word level. The method is simple, direct, and fast, but its classification accuracy is low.
2. Translation-based methods. A trained translation model, either a statistics-based model or a neural encoder-decoder model, can be introduced directly; the target-language text is then translated into the source language and classified there. This method has high accuracy, but it is not practical: training the translator requires a large corpus and takes a long time, and the complexity and time consumption of that task far exceed the comparatively simple task of text classification.
Disclosure of Invention
The invention aims to solve the problems that existing synonym-replacement methods have low classification accuracy and that existing translation-based methods, though accurate, need a large amount of corpus data to train a translator, take a long time to train, and involve complexity and time consumption far exceeding the comparatively simple task of text classification, and are therefore impractical; to this end it provides a cross-language text classification method based on cross-language word vector representation and joint training of a classifier.
The cross-language text classification method based on the cross-language word vector representation and the classifier joint training is characterized in that:
the method comprises the following steps: preprocessing the corpus:
extracting a word list from the parallel corpus, initializing the word vector matrix of the parallel corpus with random numbers between -0.1 and 0.1, and performing stemming on the classification corpus and removing low-frequency words;
the parallel corpora are N pairs of English and corresponding Chinese translation;
the word list is all words in the parallel corpus, and each word has an index;
the word vector matrix is a word vector matrix formed by all word vectors in the parallel corpus;
English is used as the source language and is denoted S; the language of the text to be classified is the target language and is denoted T;
define C_S to denote the source-language part of the parallel corpus and C_T the target-language part of the parallel corpus;
define the source language S to have |S| words and the target language T to have |T| words; s and t denote words of the source language and the target language, respectively;
step two: optimize the total loss function loss (computed as given by formula (7)) by a gradient optimization method (such as SGD, Adam, or AdaGrad) so that it reaches its minimum value; the minimum corresponds to a set of word vectors and a classifier, the classifier being a logistic regression classifier whose parameters are a weight vector W and a bias b;
step three: obtain a text vector for the test text in the target language T as the weighted sum of the set of word vectors corresponding to the minimum of the total loss function loss, input the text vector into the classifier corresponding to that minimum for testing to obtain the probability distribution over the labels, take the label with the highest probability as the classification result of the test text in the target language T, and compare it with the gold-standard results of the test set to obtain the test accuracy and recall metrics.
The invention has the beneficial effects that:
1. the method adopts the cross-language word vector as the representation of the text, obtains the cross-language word vector fused with the multilingual characteristics through the cross-language task training, and applies the cross-language word vector to the classification task, so that the text classification accuracy is high.
2. The invention breaks through the limitation of existing methods that train word vectors in isolation: training the word vectors and optimizing the classifier are unified in a single process, and the word vector representation and the classifier are trained jointly, so that the trained word vectors contain not only cross-language information, including source-language and target-language information, but also text category information. No large corpus is needed to train a translator, training time is short, the method is practical, and its performance on the text classification task is better than that of existing methods.
The method benefits the fields of cross-language text processing, information retrieval, low-resource languages, and the like. The innovation of the invention is that it breaks through the limitation of the original methods by unifying the optimization of the word vectors and the classifier in a single process and training the word vector representation and the classifier jointly, so that the resulting word vectors perform better on the text classification task. The accuracy on the Reuters RCV1 news classification task reaches over 90 percent, exceeding existing methods by about 2 percent. Good performance is also obtained on the TED multilingual text classification task, where the method performs well on 12 source-target language pairs.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The first embodiment is as follows: the present embodiment is described with reference to fig. 1, and the specific process of the cross-language text classification method based on cross-language word vector representation and classifier joint training in the present embodiment is as follows:
The traditional text classification task usually represents each word as a one-hot vector and represents a text as a high-dimensional text vector through a bag-of-words model; the dimension of the vector equals the size of the word list, and the component in each dimension represents the weight of a certain word in the text, commonly the word frequency, or 0 and 1 indicating the absence or presence of the word. The bag-of-words representation causes serious sparsity and dimensionality problems, and larger-scale text classification therefore consumes more computing resources. In addition, the bag-of-words representation ignores context and word-order information and cannot adequately express semantics.
Word vectors solve this problem. Word vectors (also translated as word embeddings; collectively referred to herein as word vectors) represent words as dense vectors of lower dimension and are typically obtained by training neural network language models. For example, word2vec is a popular implementation of monolingual word vectors.
A cross-language word vector is a word vector that is capable of representing multi-lingual information. In the present invention, cross-language word vectors are employed as representations of words and thus constitute representations of text.
In order to establish a cross-language text classifier, a joint training method is provided for training a cross-language word vector fused with text category information, and then a text classifier is established in the vector space, wherein the text vector used by the text classifier is obtained by averaging the word vectors obtained by training.
English is used as the source language and is denoted S, and the language of the text to be classified is the target language, denoted T. In the whole training process, the corpus resources used include: source-language text with class labels (the source-language text used for training), a parallel corpus of the S and T languages without class labels, and a translation dictionary between S and T, i.e., a bilingual word alignment table. The method does not use any labeled target-language text in the training process; such text is used only in the test stage when computing metrics such as accuracy.
In the whole training process, obtaining cross-language word vectors carrying text category information through joint training is the most critical step. Formally, we define the source language S to have |S| words and the target language T to have |T| words; s and t denote words of the source and target languages, respectively. In the parallel corpus, C_S denotes the source-language part and C_T the target-language part. Word alignment information is also needed in our model; it can be obtained automatically from the parallel corpus (via the IBM models or other word alignment tools such as GIZA++). Cross-language word vectors are trained by building a bilingual model. In the bilingual model, each source-language word s must predict the probability of its neighboring words in the corpus (formula 1) and the probability of the neighboring words of its aligned word t in T (formula 2), and symmetrically for target-language words (formulas 3 and 4).
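For illustration only, the following sketch shows one way the (center word, context word) training pairs behind these predictions could be assembled from a tokenized sentence pair and a word alignment table (such as one produced by GIZA++); the function and variable names are assumptions, not identifiers from the patent.

```python
def context(words, i, window=5):
    # adj(.): the words within a fixed window around position i, excluding the word itself
    lo, hi = max(0, i - window), min(len(words), i + window + 1)
    return [w for j, w in enumerate(words[lo:hi], start=lo) if j != i]

def make_pairs(src_sent, tgt_sent, alignment, window=5):
    """alignment: list of (i, j) meaning src_sent[i] is aligned to tgt_sent[j]."""
    mono_pairs, bi_pairs = [], []
    for i, s in enumerate(src_sent):
        mono_pairs += [(s, w) for w in context(src_sent, i, window)]   # s predicts adj(s), formula (1)
    for i, j in alignment:
        s = src_sent[i]
        bi_pairs += [(s, w) for w in context(tgt_sent, j, window)]     # s predicts adj(t), formula (2)
    return mono_pairs, bi_pairs   # target-side pairs for formulas (3) and (4) are built symmetrically
```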
The method comprises the following steps: preprocessing the corpus:
extracting a word list from the parallel corpus (the word list has many words; the parallel corpus includes both S and T), initializing the word vector matrix of the parallel corpus with random numbers between -0.1 and 0.1, performing stemming on the classification corpus (an existing corpus with category labels, e.g., each paragraph or text marked as negative or positive, i.e., 2 category labels), removing low-frequency words, and so on;
the parallel corpora are N pairs of English and corresponding Chinese translation;
the word list consists of all words in the parallel corpus, and each word has an index (a sequence number, i.e., which row and column of the matrix it occupies);
the word vector matrix is a word vector matrix formed by all word vectors (each word is a word vector) in the parallel corpus;
english is used as a source language and is set as S, the language of a text to be classified (a text without a class label) is used as a target language and is set as T;
define C_S to denote the source-language part of the parallel corpus and C_T the target-language part of the parallel corpus; the source language refers to a language, while the source-language part of the parallel corpus refers to the portion of the corpus in that language, which is why it is denoted by a separate symbol: the subscript indicates the language, and C indicates that it is corpus text.
Define the source language S to have |S| words and the target language T to have |T| words; s and t denote words of the source language and the target language, respectively;
step two: optimize the total loss function loss by a gradient optimization method (such as SGD, Adam, or AdaGrad) so that it reaches its minimum value; the minimum corresponds to a set of word vectors and a classifier, the classifier being a logistic regression classifier whose parameters are a weight vector W and a bias b;
the training process of the second step is simply equivalent to:
1. initializing word vectors and classifier parameters w, b
2. Calculating loss using the initialized word vectors and w, b
3. Updating the word vector, w, b such that loss is reduced
4. Repeating step 3 (recomputing loss after each update) to obtain the final word vectors and w, b
Therefore, although what is optimized in step two is loss, what is finally obtained are the updated word vectors and w, b; a minimal sketch of this loop is given below;
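The following is a minimal sketch of such a joint training loop, written in PyTorch purely for illustration; all names (SRC_VOCAB, proj_src, the batch keys, the learning rate, etc.) are assumptions, a single projection layer stands in for the two-layer fully connected network of the description, and nothing here should be read as the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 512                          # word-vector dimension used in the description
SRC_VOCAB, TGT_VOCAB = 10000, 12000    # |S| and |T| (assumed sizes)
NUM_CLASSES = 2                        # binary classification, as in the TED example

# 1. Initialize word vectors (uniform in [-0.1, 0.1]) and classifier parameters W, b
emb = nn.Embedding(SRC_VOCAB + TGT_VOCAB, EMB_DIM)   # one shared table; target ids offset by SRC_VOCAB
nn.init.uniform_(emb.weight, -0.1, 0.1)
proj_src = nn.Linear(EMB_DIM, SRC_VOCAB)   # maps a center-word vector to logits over S
proj_tgt = nn.Linear(EMB_DIM, TGT_VOCAB)   # maps a center-word vector to logits over T
clf = nn.Linear(EMB_DIM, NUM_CLASSES)      # logistic-regression classifier (weights W, bias b)

params = (list(emb.parameters()) + list(proj_src.parameters())
          + list(proj_tgt.parameters()) + list(clf.parameters()))
opt = torch.optim.SGD(params, lr=0.1)

def skipgram_nll(centers, contexts, proj):
    # negative of sum_w log p(w | center); p given by a projection layer plus softmax
    return F.cross_entropy(proj(emb(centers)), contexts, reduction="sum")

def train_step(batch, alpha=(1.0, 1.0, 1.0, 1.0)):
    # batch holds (center, context) index tensors for objectives (1)-(4)
    # and tokenized, labeled texts for the classifier loss (6)
    neg_obj = (alpha[0] * skipgram_nll(*batch["s_adj_s"], proj_src)
               + alpha[1] * skipgram_nll(*batch["s_adj_t"], proj_tgt)
               + alpha[2] * skipgram_nll(*batch["t_adj_t"], proj_tgt)
               + alpha[3] * skipgram_nll(*batch["t_adj_s"], proj_src))
    text_vecs = torch.stack([emb(ids).mean(dim=0) for ids in batch["texts"]])
    neg_clf = F.cross_entropy(clf(text_vecs), batch["labels"], reduction="sum")
    loss = neg_obj + neg_clf        # loss = -Obj(C) - L(C_L), formula (7)
    opt.zero_grad()                 # 2-3. compute loss, update word vectors and w, b
    loss.backward()
    opt.step()
    return loss.item()              # 4. call repeatedly until loss stops decreasing
```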
step three: with the total loss function loss at its minimum (the loss function is computed as given by formula (7)), the corresponding classifier is tested on the test corpus (an existing target-side test corpus whose texts belong to categories but are given to the system without labels); a test text in the target language T is weighted and summed with the set of word vectors corresponding to the minimum of loss to obtain a text vector (the text vector carries no label, though the text belongs to one of several categories),
and the text vector is input into the classifier corresponding to the minimum of the total loss function loss for testing, which yields the probability distribution over the labels; the label with the highest probability is taken as the classification result of the test text in the target language T, and comparison with the gold-standard results (labeled, with categories) of the test set yields metrics such as test accuracy and recall.
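A minimal sketch of this test procedure follows, assuming the emb and clf modules from the training sketch above and a test set of (word-index tensor, gold label) pairs; uniform weights are used for the word-vector sum, and the per-label recall computation is an illustrative assumption.

```python
import torch

def classify(word_ids, emb, clf):
    text_vec = emb(word_ids).mean(dim=0)            # weighted (here: uniform) sum of word vectors
    probs = torch.softmax(clf(text_vec), dim=-1)    # probability distribution over the labels
    return int(probs.argmax())                      # label with the highest probability

def evaluate(test_set, emb, clf, positive_label=1):
    tp = fp = fn = correct = 0
    for word_ids, gold in test_set:
        pred = classify(word_ids, emb, clf)
        correct += int(pred == gold)
        tp += int(pred == positive_label and gold == positive_label)
        fp += int(pred == positive_label and gold != positive_label)
        fn += int(pred != positive_label and gold == positive_label)
    accuracy = correct / len(test_set)
    recall = tp / (tp + fn) if (tp + fn) else 0.0   # recall for the positive label
    return accuracy, recall
```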
The second embodiment is as follows: the first difference between the present embodiment and the specific embodiment is: the concrete solving process of the total loss function loss in the step two is as follows:
the overall loss function includes three terms:
one is the loss of the source language, namely the loss of the source language S, which is obtained from the source language part in the parallel corpus;
secondly, the loss of the target end language, namely the loss on the target end language T is obtained from the target end language part in the parallel linguistic data;
thirdly, classifier loss;
and constructing a total loss function loss according to the language loss of the source end, the language loss of the target end and the classifier loss.
Other steps and parameters are the same as those in the first embodiment.
The third concrete implementation mode: the present embodiment differs from the first or second embodiment in that: the source language loss, namely the loss on the source language S, is obtained from a source language part in the parallel corpus; the specific process is as follows:
In C_S, the monolingual loss (using only C_S) is:

Obj(C_S|C_S) = Σ_{s∈C_S} Σ_{w∈adj(s)} log p(w|s)   (1)

where C_S denotes the source-language part; Obj(C_S|C_S) denotes the monolingual loss of the source language in the parallel corpus; w denotes one of the words in the context of the source-language word s; p(w|s) denotes the probability of predicting w within the window of s given that the center word is s; adj(s) denotes the words in the context of the source-language word s;

the probability value p in the formula is obtained by a two-layer fully connected feedforward neural network; the process is as follows:

the word vectors of all words in C_S are input into the neural network as center-word vectors; the word-vector dimension 512 becomes |S| dimensions after a fully connected layer, and in the subsequent softmax operation the probability of each word is computed as

p_i = exp(e_i) / Σ_{j=1}^{|S|} exp(e_j)

where p_i denotes the probability of the i-th word, e_i denotes the i-th dimension of the vector produced by the fully connected layer, e_j denotes its j-th dimension, and 1 ≤ i ≤ |S|, 1 ≤ j ≤ |S|; the softmax operation thus yields the probability of every word in S; the probabilities of the words in adj(s) are picked out of S and their logarithms summed to obtain Σ_{w∈adj(s)} log p(w|s); the sums obtained for all center words are added to give Obj(C_S|C_S), which is output.
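As an informal illustration of this computation (not taken from the patent), the sketch below uses an embedding table and a single projection layer in place of the fully connected network, with log-softmax supplying the log p(w|s) terms; all names are assumptions.

```python
import torch
import torch.nn.functional as F

def obj_mono_src(center_ids, adj_lists, emb, proj_src):
    """center_ids[k]: index of a center word s; adj_lists[k]: indices of the words in adj(s)."""
    total = torch.tensor(0.0)
    for s_id, adj_ids in zip(center_ids, adj_lists):
        logits = proj_src(emb(torch.tensor([s_id]))).squeeze(0)   # |S|-dimensional scores e
        log_p = F.log_softmax(logits, dim=-1)                     # log p(. | s) via softmax
        total = total + log_p[torch.tensor(adj_ids)].sum()        # Σ_{w∈adj(s)} log p(w|s)
    return total                                                  # Obj(C_S|C_S) over these center words
```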
In C_S, the bilingual loss is:

Obj(C_T|C_S) = Σ_{(s,t) aligned} Σ_{w∈adj(t)} log p(w|s)   (2)

where C_T denotes the target-language part; Obj(C_T|C_S) denotes the bilingual loss between the source language and the target language in the parallel corpus; adj(t) denotes the words in the context of the target-language word t; (s,t) denotes an aligned word pair (a source-language word corresponding to a target-language word), the word alignment information being obtained automatically from the parallel corpus (by the IBM models or other word alignment tools such as GIZA++); adj(·) denotes the words adjacent to a given word;

the probability value p in the formula is obtained by a two-layer fully connected feedforward neural network; the process is as follows:

the word vectors of all words in C_S are input into the neural network as center-word vectors; the word-vector dimension 512 becomes |T| dimensions after a fully connected layer, and the softmax operation yields the probability of every word in T; the probabilities p(w|s) of the words w ∈ adj(t) are picked out of T and their logarithms summed to obtain Σ_{w∈adj(t)} log p(w|s); the sums obtained for all center words are added to give Obj(C_T|C_S).
Other steps and parameters are the same as those in the first or second embodiment.
The fourth concrete implementation mode: the difference between this embodiment mode and one of the first to third embodiment modes is: the target end language loss, namely the loss on the target end language T, is obtained from the target end part in the parallel linguistic data; the specific process is as follows:
In C_T, the monolingual loss is:

Obj(C_T|C_T) = Σ_{t∈C_T} Σ_{w∈adj(t)} log p(w|t)   (3)

Obj(C_T|C_T) denotes the monolingual loss of the target language in the parallel corpus;

the probability value p in the formula is obtained by a two-layer fully connected feedforward neural network; the process is as follows:

the word vectors of all words in C_T are input into the neural network as center-word vectors; the word-vector dimension 512 becomes |T| dimensions after a fully connected layer, and the softmax operation yields the probability of every word in T; the probabilities of the words in adj(t) are picked out of T and their logarithms summed to obtain Σ_{w∈adj(t)} log p(w|t); the sums obtained for all center words are added to give Obj(C_T|C_T).
In C_T, the bilingual loss is:

Obj(C_S|C_T) = Σ_{(t,s) aligned} Σ_{w∈adj(s)} log p(w|t)   (4)

Obj(C_S|C_T) denotes the bilingual loss between the source language and the target language in the parallel corpus; (t,s) denotes an aligned word pair (a target-language word corresponding to a source-language word), the word alignment information being obtained automatically from the parallel corpus (by the IBM models or other word alignment tools such as GIZA++); adj(·) denotes the words adjacent to a given word;

the probability value p in the formula is obtained by a two-layer fully connected feedforward neural network; the process is as follows:

the word vectors of all words in C_T are input into the neural network as center-word vectors; the word-vector dimension 512 becomes |S| dimensions after a fully connected layer, and the softmax operation yields the probability of every word in S; the probabilities p(w|t) of the words w ∈ adj(s) are picked out of S and their logarithms summed to obtain Σ_{w∈adj(s)} log p(w|t); the sums obtained for all center words are added to give Obj(C_S|C_T).
Combining (1), (2), (3) and (4) to obtain an objective function on the parallel corpus:
Obj(C) = α_1 Obj(C_S|C_S) + α_2 Obj(C_T|C_S) + α_3 Obj(C_T|C_T) + α_4 Obj(C_S|C_T)   (5)

where α_1, α_2, α_3, α_4 are hyperparameters, all scalars.
Other steps and parameters are the same as those in one of the first to third embodiments.
The fifth concrete implementation mode: the difference between this embodiment and one of the first to fourth embodiments is: the classifier penalty is:
since the task is to train the text classifier, the ideal word vector needs to carry text category information. Therefore, text category information needs to be fused into the word vectors, the way is that linguistic data of text classification is used as supervision information in the training process, the loss of a text classifier is added into a loss function, and a bilingual model and the text classifier are subjected to combined training to obtain the word vectors which are fused with text label information and cross-language information.
A logistic regression classifier is adopted as a text classifier, and the loss of the text classifier adopts a cross entropy loss function and is recorded as L; the text classifier penalty function is:
L(C_L) = Σ_{S_d∈C_L} log p(tag(S_d) | X_{S_d}),  with  p(tag(S_d) | X_{S_d}) = exp(W_{tag(S_d)} · X_{S_d} + b) / Σ_c exp(W_c · X_{S_d} + b)   (6)

where C_L denotes the text classification corpus (labeled); S_d denotes any text in the text classification corpus; X denotes a text vector, obtained as the weighted sum of the word vectors of the words in the text; X_{S_d} is the text vector representing text S_d; b is the bias; W is the weight vector corresponding to each text category (2 weight vectors W for binary classification, 4 for four-class classification); tag(S_d) is the label of text S_d (positive or negative); W_{tag(S_d)} is the weight vector corresponding to the label of text S_d; and the index c runs over all categories.
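Sketched below, for illustration only, is a direct reading of formula (6) under the assumption that W stacks one weight vector per category and that the text vectors are pre-computed weighted sums of word vectors; the names are not from the patent.

```python
import torch

def classifier_loglik(text_vectors, labels, W, b):
    """text_vectors: (N, d); labels: (N,) long; W: (num_classes, d); b: (num_classes,)."""
    logits = text_vectors @ W.t() + b                         # W_c · X_{S_d} + b for every class c
    log_p = torch.log_softmax(logits, dim=-1)                 # log of the softmax probabilities
    return log_p[torch.arange(len(labels)), labels].sum()     # Σ_d log p(tag(S_d) | X_{S_d}) = L(C_L)
```

Note that L(C_L), like Obj(C), is a log-likelihood to be maximized, which is why formula (7) enters the minimized loss with negative signs.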
Other steps and parameters are the same as in one of the first to fourth embodiments.
The sixth specific implementation mode: the difference between this embodiment and one of the first to fifth embodiments is: obtaining a total loss function according to the source end language loss, the target end language loss and the classifier loss; the concrete formula is as follows:
loss = -Obj(C) - L(C_L)   (7)
where Obj(C) denotes the objective function on the parallel corpus and L(C_L) denotes the text classifier loss function;
after the classifier loss function is added, the word vectors obtained by training fuse monolingual information, cross-language information, and text category information, and can meet the requirements of the task.
Other steps and parameters are the same as those in one of the first to fifth embodiments.
The seventh embodiment: the difference between this embodiment and one of the first to sixth embodiments is: in the second step, the total loss function loss is optimized by a gradient optimization method (such as SGD, Adam, AdaGrad and other methods) to make the total loss function loss reach a minimum value, and the specific process is as follows:
1) compute the partial derivatives of the total loss function loss with respect to the word vector matrix (the vectors of the words extracted from the parallel corpus in step one), and the partial derivatives of loss with respect to the weight vector W and the bias b (of formula (6));
2) subtract from the current word vector matrix the partial derivative of loss with respect to it, subtract from the current weight vector W the partial derivative of loss with respect to W, subtract from the current bias b the partial derivative of loss with respect to b, and then recompute the total loss function loss;
3) repeat steps 1) and 2) until the partial derivatives of step 1) are zero or loss no longer decreases; the set of word vectors and the classifier obtained at that point are the result, the classifier being a logistic regression classifier whose parameters are the weight vector W and the bias b.
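A bare-bones sketch of this update rule follows, written without a deep-learning framework to mirror steps 1)-3); a learning rate lr is added here (the description literally subtracts the raw partial derivatives), and compute_loss and grad_fn are assumed callables supplied by the caller, with params and gradients held as arrays.

```python
def gradient_descent(params, compute_loss, grad_fn, lr=0.1, max_iter=10000):
    # params would hold the word vector matrix, the weight vector W and the bias b
    prev_loss = float("inf")
    for _ in range(max_iter):
        grads = grad_fn(params)                                   # 1) partial derivatives of loss
        params = [p - lr * g for p, g in zip(params, grads)]      # 2) subtract the derivatives
        loss = compute_loss(params)                               #    and recompute loss
        if all((g == 0).all() for g in grads) or loss >= prev_loss:
            break                                                 # 3) stop: zero gradient or no decrease
        prev_loss = loss
    return params
```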
Other steps and parameters are the same as those in one of the first to sixth embodiments.
The following examples were used to demonstrate the beneficial effects of the present invention:
the first embodiment is as follows:
the preparation method comprises the following steps:
the method comprises the following steps: preprocessing the corpus: including extracting a vocabulary and initializing a word vector matrix. Parallel corpora of the European parliament (100 ten thousand sentences per language pair) are adopted as the parallel corpora required by training word vectors, text classification training is carried out by adopting TED corpora, and the data set is a binary classification task. And performing word stem reduction on the classified linguistic data, removing low-frequency words and the like. The scheme also needs bilingual word alignment resources, if the bilingual word alignment resources are lacked, a GIZA + + tool is needed, and a bilingual word alignment table is obtained by training bilingual parallel linguistic data.
Step two: a loss function is constructed. The loss function includes three items, one is the loss of the source language, i.e. the loss in the source language S, which is obtained from the source part of the parallel corpus. The calculation method is obtained from the target end part in the parallel corpus according to the formula (1) and the formula (2). the second formula is the target end loss, and the calculation method is shown in the formula (3) and the formula (4). The probability p in each formula is calculated by a two-layer feed neural network. And thirdly, classifier loss, which is obtained by the formula (6). The total loss function is calculated by equation (7).
Step three: and (5) training and testing. The loss function is constructed in a specific corpus, and training is performed by using a gradient-based optimization method (such as SGD, Adam, AdaGrad and other methods) and using a word vector matrix and classifier parameters on the whole word list as trainable parameters of the whole objective function until convergence. And then testing on the test corpus. And obtaining a test result. This example uses SGD (random gradient descent method) as the optimization method.
The test result shows that: the classification accuracy obtained on multiple language pairs on a TED dataset exceeds that of the existing methods, F on Ender language pairs1The value reached 0.413.
Example two:
the preparation method comprises the following steps:
the method comprises the following steps: preprocessing the corpus: including extracting a vocabulary and initializing a word vector matrix. The European parliament parallel corpus (100 ten thousand sentences per language pair) is used as the parallel corpus required by training word vectors, and RCV1 corpus is used for text classification training, and the data set is a four-classification task. And performing word stem reduction on the classified linguistic data, removing low-frequency words and the like. And a bilingual word alignment table, namely a translation dictionary, is obtained by utilizing parallel corpus training through a GIZA + + tool.
Step two: a loss function is constructed. A loss function is constructed. The loss function includes three items, one is the loss of the source language, i.e. the loss of the source language S, which is obtained from the source part of the parallel corpus. The calculation method is obtained from the target end part in the parallel corpus according to the formula (1) and the formula (2). the second formula is the target end loss, and the calculation method is shown in the formula (3) and the formula (4). The probability p in each formula is calculated by a two-layer feed neural network. And thirdly, classifier loss is obtained by a multi-classification logistic regression loss function improved by the formula (6), namely a cross entropy loss function of softmax regression. The expression of the loss function is:
L(C_L) = Σ_{S_d∈C_L} log [ exp(W_{tag(S_d)} · X_{S_d} + b) / Σ_c exp(W_c · X_{S_d} + b) ]   (8)

where the index c runs over all four categories.
the total loss function is obtained by the formula (7), wherein the loss part of the multi-classification classifier needs to be improved from the formula (6) to the formula (8).
Step three: and (5) training and testing. The loss function is constructed in a specific corpus, and training is performed by using a gradient-based optimization method (such as SGD, Adam, AdaGrad and other methods) and using a word vector matrix and classifier parameters on the whole word list as trainable parameters of the whole objective function until convergence. And then testing on the test corpus. And obtaining a test result. This example uses the Adam method as the optimization method.
The test results show that: the classification accuracy obtained by the method on the RCV corpus exceeds existing schemes, and the classification accuracy obtained on the English language pair is 90.2%.
The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.

Claims (4)

1. A cross-language text classification method based on cross-language word vector representation and joint training of a classifier, characterized in that:
Step one: corpus preprocessing:
extract a word list from the parallel corpus, initialize the word vector matrix of the parallel corpus with random numbers between -0.1 and 0.1, and perform stemming on the classification corpus and remove low-frequency words;
the parallel corpus is N pairs of English sentences and their corresponding Chinese translations;
the word list consists of all words in the parallel corpus, each word having an index;
the word vector matrix is composed of the word vectors of all words in the parallel corpus;
English is the source language, denoted S; the language of the text to be classified is the target language, denoted T;
define C_S to denote the source-language part of the parallel corpus and C_T the target-language part of the parallel corpus;
define the source language S to have |S| words and the target language T to have |T| words; s and t denote words of the source language and the target language, respectively;
Step two: optimize the total loss function loss by a gradient optimization method so that it reaches its minimum value; the minimum corresponds to a set of word vectors and a classifier, the classifier being a logistic regression classifier whose parameters are a weight vector W and a bias b;
Step three: obtain a text vector for the test text in the target language T as the weighted sum of the set of word vectors corresponding to the minimum of the total loss function loss, input the text vector into the classifier corresponding to that minimum to obtain the probability distribution over the labels, take the label with the highest probability as the classification result of the test text in the target language T, and compare it with the gold-standard results of the test set to obtain the test accuracy and recall metrics;
the specific solution process of the total loss function loss in step two is:
the total loss function includes three terms:
the first is the source-language loss, i.e., the loss on the source language S, obtained from the source-language part of the parallel corpus;
the second is the target-language loss, i.e., the loss on the target language T, obtained from the target-language part of the parallel corpus;
the third is the classifier loss;
the total loss function loss is constructed from the source-language loss, the target-language loss, and the classifier loss;
the total loss function loss constructed from the source-language loss, the target-language loss, and the classifier loss is specifically:
loss = -Obj(C) - L(C_L)   (7)
where Obj(C) denotes the objective function on the parallel corpus and L(C_L) denotes the text classifier loss function;
in step two, the total loss function loss is optimized by a gradient optimization method so that it reaches its minimum value; the specific process is:
1) compute the partial derivatives of the total loss function loss with respect to the word vector matrix, and the partial derivatives of loss with respect to the weight vector W and the bias b;
2) subtract from the current word vector matrix the partial derivative of loss with respect to it, subtract from the current weight vector W the partial derivative of loss with respect to W, and subtract from the current bias b the partial derivative of loss with respect to b;
3) repeat 1) and 2) until the partial derivatives of 1) are zero or loss no longer decreases; the set of word vectors and the classifier obtained at that point are the result, the classifier being a logistic regression classifier whose parameters are the weight vector W and the bias b.
2. The cross-language text classification method based on cross-language word vector representation and classifier joint training according to claim 1, characterized in that the source-language loss, i.e., the loss on the source language S, is obtained from the source-language part of the parallel corpus, and the specific process is:
in C_S, the monolingual loss is:
Obj(C_S|C_S) = Σ_{s∈C_S} Σ_{w∈adj(s)} log p(w|s)   (1)
where C_S denotes the source-language part; Obj(C_S|C_S) denotes the monolingual loss of the source language in the parallel corpus; w denotes one of the words in the context of the source-language word s; p(w|s) denotes the probability of predicting w within the window of s given that the center word is s; adj(s) denotes the words in the context of the source-language word s;
the probability value p in the formula is obtained by a two-layer fully connected feedforward neural network; the process is: the word vectors of all words in C_S are input into the neural network as center-word vectors; the word-vector dimension 512 becomes |S| dimensions after a fully connected layer; the softmax operation yields the probability of every word in S; the probabilities of the words in adj(s) are picked out of S and their logarithms summed to obtain Σ_{w∈adj(s)} log p(w|s); the sums obtained for all center words are added to give Obj(C_S|C_S), which is output;
in C_S, the bilingual loss is:
Obj(C_T|C_S) = Σ_{(s,t) aligned} Σ_{w∈adj(t)} log p(w|s)   (2)
where C_T denotes the target-language part; Obj(C_T|C_S) denotes the bilingual loss between the source language and the target language in the parallel corpus; adj(t) denotes the words in the context of the target-language word t; (s,t) denotes an aligned word pair;
the probability value p in the formula is obtained by a two-layer fully connected feedforward neural network; the process is: the word vectors of all words in C_S are input into the neural network as center-word vectors; the word-vector dimension 512 becomes |T| dimensions after a fully connected layer; the softmax operation yields the probability of every word in T; the probabilities p(w|s) of the words w ∈ adj(t) are picked out of T and their logarithms summed to obtain Σ_{w∈adj(t)} log p(w|s); the sums obtained for all center words are added to give Obj(C_T|C_S).
3. The cross-language text classification method based on cross-language word vector representation and classifier joint training according to claim 2, characterized in that the target-language loss, i.e., the loss on the target language T, is obtained from the target-language part of the parallel corpus, and the specific process is:
in C_T, the monolingual loss is:
Obj(C_T|C_T) = Σ_{t∈C_T} Σ_{w∈adj(t)} log p(w|t)   (3)
Obj(C_T|C_T) denotes the monolingual loss of the target language in the parallel corpus;
the probability value p in the formula is obtained by a two-layer fully connected feedforward neural network; the process is: the word vectors of all words in C_T are input into the neural network as center-word vectors; the word-vector dimension 512 becomes |T| dimensions after a fully connected layer; in the softmax operation the probability of each word is computed as
p_i = exp(e_i) / Σ_{j=1}^{|T|} exp(e_j)
where p_i denotes the probability of the i-th word, e_i denotes the i-th dimension of the vector produced by the fully connected layer, e_j denotes its j-th dimension, and 1 ≤ i ≤ |T|, 1 ≤ j ≤ |T|; the softmax operation thus yields the probability of every word in T; the probabilities of the words in adj(t) are picked out of T and their logarithms summed to obtain Σ_{w∈adj(t)} log p(w|t); the sums obtained for all center words are added to give Obj(C_T|C_T);
in C_T, the bilingual loss is:
Obj(C_S|C_T) = Σ_{(t,s) aligned} Σ_{w∈adj(s)} log p(w|t)   (4)
Obj(C_S|C_T) denotes the bilingual loss between the source language and the target language in the parallel corpus; (t,s) denotes an aligned word pair;
the probability value p in the formula is obtained by a two-layer fully connected feedforward neural network; the process is: the word vectors of all words in C_T are input into the neural network as center-word vectors; the word-vector dimension 512 becomes |S| dimensions after a fully connected layer; the softmax operation yields the probability of every word in S; the probabilities p(w|t) of the words w ∈ adj(s) are picked out of S and their logarithms summed to obtain Σ_{w∈adj(s)} log p(w|t); the sums obtained for all center words are added to give Obj(C_S|C_T);
combining (1), (2), (3), and (4) gives the objective function on the parallel corpus:
Obj(C) = α_1 Obj(C_S|C_S) + α_2 Obj(C_T|C_S) + α_3 Obj(C_T|C_T) + α_4 Obj(C_S|C_T)   (5)
where α_1, α_2, α_3, α_4 are hyperparameters, all scalars.
4. The cross-language text classification method based on cross-language word vector representation and classifier joint training according to claim 3, characterized in that the classifier loss is:
a logistic regression classifier is adopted as the text classifier, and the text classifier loss adopts the cross-entropy loss function, denoted L; the text classifier loss function is:
L(C_L) = Σ_{S_d∈C_L} log p(tag(S_d) | X_{S_d}),  with  p(tag(S_d) | X_{S_d}) = exp(W_{tag(S_d)} · X_{S_d} + b) / Σ_c exp(W_c · X_{S_d} + b)   (6)
where C_L denotes the text classification corpus and S_d denotes any text in the text classification corpus; X denotes a text vector, obtained as the weighted sum of the word vectors of the words in the text; X_{S_d} is the text vector representing text S_d; b is the bias; W is the weight vector corresponding to each text category; tag(S_d) is the label of text S_d; and W_{tag(S_d)} is the weight vector corresponding to the label of text S_d.
CN201810680474.3A 2018-06-27 2018-06-27 Cross-language text classification method based on word vector representation and classifier combined training Active CN108960317B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810680474.3A CN108960317B (en) 2018-06-27 2018-06-27 Cross-language text classification method based on word vector representation and classifier combined training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810680474.3A CN108960317B (en) 2018-06-27 2018-06-27 Cross-language text classification method based on word vector representation and classifier combined training

Publications (2)

Publication Number Publication Date
CN108960317A CN108960317A (en) 2018-12-07
CN108960317B true CN108960317B (en) 2021-09-28

Family

ID=64487284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810680474.3A Active CN108960317B (en) 2018-06-27 2018-06-27 Cross-language text classification method based on word vector representation and classifier combined training

Country Status (1)

Country Link
CN (1) CN108960317B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918499A (en) * 2019-01-14 2019-06-21 平安科技(深圳)有限公司 A text classification method, device, computer equipment and storage medium
CN110297903B (en) * 2019-06-11 2021-04-30 昆明理工大学 A Cross-Language Word Embedding Method Based on Asymmetric Corpus
US11126797B2 (en) * 2019-07-02 2021-09-21 Spectrum Labs, Inc. Toxic vector mapping across languages
CN110413736B (en) * 2019-07-25 2022-02-25 百度在线网络技术(北京)有限公司 Cross-language text representation method and device
CN112446462B (en) * 2019-08-30 2024-06-18 华为技术有限公司 Method and device for generating target neural network model
CN112329481B (en) * 2020-10-27 2022-07-19 厦门大学 Training method of multi-language machine translation model for relieving language-to-difference conflict
CN113392179A (en) * 2020-12-21 2021-09-14 腾讯科技(深圳)有限公司 Text labeling method and device, electronic equipment and storage medium
CN113032559B (en) * 2021-03-15 2023-04-28 新疆大学 Language model fine tuning method for low-resource adhesive language text classification
CN113312453B (en) * 2021-06-16 2022-09-23 哈尔滨工业大学 A model pre-training system for cross-language dialogue understanding
CN113343672B (en) * 2021-06-21 2022-12-16 哈尔滨工业大学 An Unsupervised Bilingual Dictionary Construction Method Based on Corpus Merging

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105446958A (en) * 2014-07-18 2016-03-30 富士通株式会社 Word aligning method and device
CN108197109A (en) * 2017-12-29 2018-06-22 北京百分点信息科技有限公司 A kind of multilingual analysis method and device based on natural language processing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9779085B2 (en) * 2015-05-29 2017-10-03 Oracle International Corporation Multilingual embeddings for natural language processing

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105446958A (en) * 2014-07-18 2016-03-30 富士通株式会社 Word aligning method and device
CN108197109A (en) * 2017-12-29 2018-06-22 北京百分点信息科技有限公司 A kind of multilingual analysis method and device based on natural language processing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Cross-lingual Text Classification via Model Translation with Limited Dictionaries; Xu, Ruochen et al.; Proceedings of the 2016 ACM Conference on Information and Knowledge Management; 2016-10-31; full text *
Research on Vietnamese-Chinese Cross-Language Event Retrieval Based on Word Vectors; Tang Liang et al.; Journal of Chinese Information Processing; 2018-03-31; full text *

Also Published As

Publication number Publication date
CN108960317A (en) 2018-12-07

Similar Documents

Publication Publication Date Title
CN108960317B (en) Cross-language text classification method based on word vector representation and classifier combined training
Farahani et al. Parsbert: Transformer-based model for persian language understanding
Zhu et al. CAN-NER: convolutional attention network for Chinese named entity recognition
CN113360673B (en) Entity alignment method, device and storage medium for multimodal knowledge graph
CN109635124B (en) A Remote Supervision Relation Extraction Method Combined with Background Knowledge
CN109670039B (en) Semi-supervised e-commerce comment emotion analysis method based on three-part graph and cluster analysis
Gouws et al. Bilbowa: Fast bilingual distributed representations without word alignments
Xue et al. Neural collective entity linking based on recurrent random walk network learning
CN110378409A (en) It is a kind of based on element association attention mechanism the Chinese get over news documents abstraction generating method
CN106980608A (en) A kind of Chinese electronic health record participle and name entity recognition method and system
CN110134954B (en) Named entity recognition method based on Attention mechanism
CN110263165A (en) A kind of user comment sentiment analysis method based on semi-supervised learning
CN112559741A (en) Nuclear power equipment defect recording text classification method, system, medium and electronic equipment
Andrabi et al. A review of machine translation for south asian low resource languages
CN110457715A (en) Chinese-Vietnamese neural machine translation out-of-collection word processing method integrated into lexicon
CN104657351A (en) Method and device for processing bilingual alignment corpora
Elsherif et al. Perspectives of arabic machine translation
Stoeckel et al. Voting for POS tagging of Latin texts: Using the flair of FLAIR to better ensemble classifiers by example of Latin
Seeha et al. ThaiLMCut: Unsupervised pretraining for Thai word segmentation
CN118228734A (en) Medical terminology normalization method based on large language model for data enhancement
CN110516230B (en) Method and Device for Extracting Chinese-Burmese Bilingual Parallel Sentence Pairs Based on Pivot Language
Sun [Retracted] Analysis of Chinese Machine Translation Training Based on Deep Learning Technology
Balouchzahi et al. LA-SACo: A study of learning approaches for sentiments analysis inCode-mixing texts
Shahade et al. Deep learning approach-based hybrid fine-tuned Smith algorithm with Adam optimiser for multilingual opinion mining
CN114611487B (en) Unsupervised Thai dependency syntax analysis method based on dynamic word embedding alignment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant