Disclosure of Invention
The invention aims to solve the following problems: the existing method based on synonym replacement has low classification accuracy, while the existing method based on translation achieves high accuracy but requires a large amount of corpus to train a translator, the training time is long, and the complexity and time consumption of the translation task far exceed those of the comparatively simple task of text classification, so that the method is impractical. To this end, the invention provides a cross-language text classification method based on joint training of cross-language word vector representations and a classifier.
The cross-language text classification method based on the cross-language word vector representation and the classifier joint training is characterized in that:
step one: preprocess the corpus:
extract a word list from the parallel corpus, initialize the word vector matrix of the parallel corpus with random numbers between -0.1 and 0.1, and perform stemming and low-frequency-word removal on the classification corpus;
the parallel corpus consists of N English sentences and their corresponding Chinese translations;
the word list is all words in the parallel corpus, and each word has an index;
the word vector matrix is the matrix formed by the word vectors of all words in the parallel corpus;
English is used as the source language, denoted S; the language of the text to be classified is the target language, denoted T;
C_S is defined to represent the source-language part of the parallel corpus and C_T to represent the target-language part of the parallel corpus;
the source language S is defined to have |S| words and the target language T to have |T| words; s and t denote individual words of the source language and the target language respectively;
step two: optimize the total loss function loss (computed as in formula (7)) with a gradient-based optimization method (e.g. SGD, Adam or AdaGrad) until loss reaches its minimum; the group of word vectors and the classifier corresponding to this minimum are retained, the classifier being a logistic regression classifier whose parameters are a weight vector W and a bias b;
step three: weight and sum the word vectors corresponding to the minimum of the total loss function loss over a test text in the target language T to obtain a text vector; input the text vector into the classifier corresponding to the minimum of loss to obtain a probability distribution over the labels; take the label with the largest probability as the classification result of the test text in the target language T; and compare the classification result with the gold-standard result of the test set to obtain the test accuracy and recall indexes.
The invention has the beneficial effects that:
1. The method adopts cross-language word vectors as the representation of the text; the cross-language word vectors, which fuse multilingual characteristics, are obtained through training on a cross-language task and applied to the classification task, so the text classification accuracy is high.
2. The invention breaks through the limitation of existing methods that train word vectors in isolation: training the word vectors and optimizing the classifier are unified in the same process, and the word vector representation and the classifier are trained jointly. The word vectors obtained by training therefore contain not only cross-language information, including source-language and target-language information, but also text category information. No large corpus is needed to train a translator, the training time is short, and the practicability is strong; the performance on the text classification task is better than that of existing methods.
The method has a promoting effect on fields such as cross-language text processing, information retrieval and rare languages. The innovation of the invention is that it breaks through the limitation of the original methods by unifying the optimization of the word vectors and of the classifier in the same process and jointly training the word vector representation and the classifier, so that the resulting word vectors perform better on the text classification task. The accuracy in the Reuters RCV news classification task reaches over 90 percent, exceeding existing methods by about 2 percent. Good performance is also obtained in the TED multilingual text classification task, where the method performs well on 12 source-target language pairs.
Detailed Description
The first embodiment: this embodiment is described with reference to fig. 1; the specific process of the cross-language text classification method based on joint training of cross-language word vector representation and a classifier in this embodiment is as follows:
The traditional text classification task usually represents each word as a one-hot vector and represents a text as a high-dimensional text vector through a bag-of-words model. The dimension of the vector equals the size of the word list, and the component in each dimension represents the weight of a certain word in the text; commonly, the word frequency is used as the weight, or 0 and 1 indicate the absence or presence of the word. The bag-of-words representation causes serious sparseness and dimensionality problems, and larger-scale text classification requires more computing resources. In addition, the bag-of-words representation ignores context information and word order information of the words and cannot express semantics sufficiently.
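For illustration only, a minimal sketch of this bag-of-words representation with a toy five-word vocabulary (the vocabulary and values are purely illustrative, not data from the invention):

```python
# Toy bag-of-words representation: one dimension per vocabulary word,
# weighted by term frequency or by 0/1 presence.
vocab = ["the", "market", "rose", "sharply", "today"]          # |V| = 5
index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(tokens, binary=False):
    """Return a |V|-dimensional text vector for a token list."""
    vec = [0.0] * len(vocab)
    for tok in tokens:
        if tok in index:
            vec[index[tok]] += 1.0
    return [float(v > 0) for v in vec] if binary else vec

doc = ["the", "market", "rose", "the", "market"]
print(bag_of_words(doc))          # [2.0, 2.0, 1.0, 0.0, 0.0]  frequency weights
print(bag_of_words(doc, True))    # [1.0, 1.0, 1.0, 0.0, 0.0]  presence / absence
```

With a realistic vocabulary the vector has tens of thousands of dimensions, almost all of them zero, which is the sparseness and dimensionality problem described above.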
Word vectors solve this problem. A word vector (also translated as word embedding; collectively referred to herein as word vector) represents a word as a dense vector of lower dimension, typically obtained by training a neural network language model. For example, word2vec is a popular implementation of monolingual word vectors.
A cross-language word vector is a word vector that is capable of representing multi-lingual information. In the present invention, cross-language word vectors are employed as representations of words and thus constitute representations of text.
In order to build a cross-language text classifier, a joint training method is provided for training cross-language word vectors fused with text category information, and a text classifier is then built in this vector space; the text vectors used by the text classifier are obtained by averaging the word vectors obtained by training.
English is used as the source language, denoted S, and the language of the text to be classified is the target language, denoted T. The corpus resources used in the whole training process comprise: source-language text with class labels (the source-language text used for training), a parallel corpus of the S and T languages without class labels, and a translation dictionary between the S and T languages, i.e. a bilingual word alignment table. The method does not use any labelled target-language text in the training process; such text is used only in the test stage when indexes such as accuracy are calculated.
In the whole training process, obtaining cross-language word vectors carrying text category information through joint training is the key step. Formally, we define the source language S to have |S| words and the target language T to have |T| words; s and t denote words of the source and target languages respectively. In the parallel corpus, C_S represents the source-language part and C_T represents the target-language part. Word alignment information is also needed in our model; it can be obtained automatically from the parallel corpus (via an IBM model or other word alignment tools such as GIZA++). The cross-language word vectors are trained by building a bilingual model. In the bilingual model, each word s needs to predict the probability of its neighbouring words in the corpus (formulas (1) and (2)) and the probability of the neighbouring words of its aligned word t in T (formulas (3) and (4)).
The method comprises the following steps: preprocessing the corpus:
extract a word list from the parallel corpus (the word list contains many words; the parallel corpus comprises both S and T), initialize the word vector matrix of the parallel corpus with random numbers between -0.1 and 0.1, and perform stemming and low-frequency-word removal on the classification corpus (an existing corpus with category labels, e.g. each paragraph or each text labelled as negative or positive, i.e. 2 category labels);
the parallel corpus consists of N English sentences and their corresponding Chinese translations;
the word list is all words in the parallel corpus, and each word has an index (a sequence number, i.e. its position, by row and column, in the matrix);
the word vector matrix is the matrix formed by the word vectors of all words in the parallel corpus (each word corresponds to one word vector);
English is the source language, denoted S; the language of the text to be classified (text without class labels) is the target language, denoted T;
C_S is defined to represent the source-language part of the parallel corpus and C_T to represent the target-language part of the parallel corpus. The source language refers to a language, whereas the source-language part of the parallel corpus refers to the portion of the corpus belonging to that language, so it is denoted by an additional letter: the subscript indicates the language, and C indicates that it is part of the corpus.
The source language S is defined to have |S| words and the target language T to have |T| words; s and t denote individual words of the source language and the target language respectively;
Step two: optimize the total loss function loss with a gradient-based optimization method (e.g. SGD, Adam or AdaGrad) until loss reaches its minimum; the group of word vectors and the classifier corresponding to this minimum are retained, the classifier being a logistic regression classifier whose parameters are a weight vector W and a bias b;
The training process of step two can be summarized as follows (a code sketch is given after this list):
1. Initialize the word vectors and the classifier parameters W, b.
2. Compute loss using the current word vectors and W, b.
3. Update the word vectors and W, b so that loss decreases.
4. Repeat step 3 to obtain the final word vectors and W, b.
Therefore, although what is optimized in step two is loss, what is finally obtained are the updated word vectors and W, b;
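A minimal sketch of this training loop, assuming PyTorch and a hypothetical function total_loss() that evaluates formula (7) on the corpora; the sizes, step count and learning rate are illustrative, not values prescribed by the method:

```python
import torch

vocab_size, dim, num_classes = 10000, 512, 2

# 1. initialise the word vectors (random values in [-0.1, 0.1]) and W, b
emb = torch.empty(vocab_size, dim).uniform_(-0.1, 0.1).requires_grad_(True)
W = torch.zeros(num_classes, dim, requires_grad=True)
b = torch.zeros(num_classes, requires_grad=True)
opt = torch.optim.SGD([emb, W, b], lr=0.05)   # Adam or AdaGrad work equally well

for step in range(1000):
    loss = total_loss(emb, W, b)   # 2. hypothetical: formula (7) on the corpora
    opt.zero_grad()
    loss.backward()
    opt.step()                     # 3. update emb, W, b so that loss decreases
# 4. emb now holds the cross-language word vectors; (W, b) is the classifier
```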
Step three: using the total loss function loss (computed as in formula (7)), the corresponding classifier is tested on the test corpus (an existing target-language corpus; it is not labelled during classification, but its categories are known for evaluation). The test text in the target language T is weighted and summed using the group of word vectors corresponding to the minimum of loss to obtain a text vector; the text vector is input into the classifier corresponding to the minimum of loss, and testing yields the probability distribution over the labels. The label with the largest probability is taken as the classification result of the test text in the target language T, and the classification result is compared with the gold-standard result of the test set (which carries the labels and categories) to obtain indexes such as test accuracy and recall.
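A minimal sketch of this test procedure, assuming `emb`, `W`, `b` are the parameters obtained in step two, `word_index` maps target-language words to rows of `emb`, and `test_set` is a list of (token list, gold label) pairs; all names are illustrative:

```python
import torch
import torch.nn.functional as F

def classify(tokens, emb, W, b, word_index):
    idx = [word_index[t] for t in tokens if t in word_index]
    x = emb[idx].mean(dim=0)               # (uniformly) weighted sum of word vectors -> text vector
    probs = F.softmax(W @ x + b, dim=-1)   # probability distribution over the labels
    return int(probs.argmax())             # label with the largest probability

# accuracy against the gold labels of the test set
correct = sum(classify(doc, emb, W, b, word_index) == gold for doc, gold in test_set)
accuracy = correct / len(test_set)
```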
The second embodiment: the difference between this embodiment and the first embodiment is that the total loss function loss in step two is obtained as follows:
the overall loss function includes three terms:
one is the loss of the source language, namely the loss of the source language S, which is obtained from the source language part in the parallel corpus;
secondly, the loss of the target end language, namely the loss on the target end language T is obtained from the target end language part in the parallel linguistic data;
thirdly, classifier loss;
and constructing a total loss function loss according to the language loss of the source end, the language loss of the target end and the classifier loss.
Other steps and parameters are the same as those in the first embodiment.
The third embodiment: this embodiment differs from the first or second embodiment in that the source-language loss, i.e. the loss on the source language S, is obtained from the source-language part of the parallel corpus; the specific process is as follows:
In C_S, the monolingual loss (using only C_S) is:

Obj(C_S|C_S) = Σ_{s∈C_S} Σ_{w∈adj(s)} log p(w|s)   (1)

where C_S represents the source-language part; Obj(C_S|C_S) represents the monolingual loss on the source language in the parallel corpus; w represents one of the words in the context of the source-language word s; p(w|s) represents the probability of predicting a word in the window of s given the centre word s; adj(s) represents the words in the context of the source-language word s.

The probability values p in the formula are obtained by a two-layer fully-connected feedforward neural network, as follows: the word vectors of the words in C_S are used as centre-word vectors and input into the network; after a fully-connected layer, the 512-dimensional word vector is mapped to an |S|-dimensional vector, and after a softmax operation the probability of each word is

p_i = exp(e_i) / Σ_{j=1}^{|S|} exp(e_j)

where p_i represents the probability of the i-th word, e_i represents the i-th dimension of the vector produced by the fully-connected layer, e_j represents the j-th dimension of that vector, 1 ≤ i ≤ |S| and 1 ≤ j ≤ |S|. The softmax operation yields the probability of every word in S; the probabilities of the words in adj(s) are picked out, their logarithms are taken and summed to obtain Σ_{w∈adj(s)} log p(w|s), and the values obtained for all centre words are added to obtain the output Obj(C_S|C_S).
In C_S, the bilingual loss is:

Obj(C_T|C_S) = Σ_{(s,t)} Σ_{w∈adj(t)} log p(w|s)   (2)

where C_T represents the target-language part; Obj(C_T|C_S) represents the bilingual loss between the source language and the target language in the parallel corpus; adj(t) represents the words in the context of the target-language word t; (s, t) represents an aligned word pair (one source-language word corresponding to one target-language word), the word alignment information being obtained automatically from the parallel corpus (via an IBM model or other word alignment tools such as GIZA++); adj(·) represents the words adjacent to a given word.

The probability values p in the formula are obtained by a two-layer fully-connected feedforward neural network, as follows: the word vectors of the words in C_S are used as centre-word vectors and input into the network; after a fully-connected layer, the 512-dimensional word vector is mapped to a |T|-dimensional vector, and the softmax operation yields the probability of every word in T; the probabilities of the words w ∈ adj(t) are picked out from T, their logarithms are taken and summed to obtain Σ_{w∈adj(t)} log p(w|s), and the values obtained for all centre words are added to obtain Obj(C_T|C_S).
Other steps and parameters are the same as those in the first or second embodiment.
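A minimal sketch of how the terms (1) and (2) can be computed, assuming PyTorch; `emb_S` is the 512-dimensional source word-vector matrix, `fc_S`/`fc_T` are fully-connected layers mapping a centre-word vector to |S|- or |T|-dimensional scores, `adj_S`/`adj_T` give neighbour indices, and `pairs` is the word-alignment table. All names are illustrative, and the two-layer network is abbreviated here to an embedding lookup plus one fully-connected layer:

```python
import torch
import torch.nn.functional as F

def neighbour_log_prob(emb, fc, centre_idx, neighbour_idxs):
    """Sum of log p(w | centre word) over the given neighbour indices."""
    logits = fc(emb[centre_idx])              # 512 dims -> vocabulary-size scores
    log_p = F.log_softmax(logits, dim=-1)     # softmax: probability of every word
    return log_p[neighbour_idxs].sum()

def obj_mono_S(emb_S, fc_S, C_S, adj_S):
    """Formula (1): each source word s predicts its own neighbours adj(s)."""
    return sum(neighbour_log_prob(emb_S, fc_S, s, adj_S[s]) for s in C_S)

def obj_bi_T_given_S(emb_S, fc_T, pairs, adj_T):
    """Formula (2): each source word s predicts adj(t) of its aligned word t."""
    return sum(neighbour_log_prob(emb_S, fc_T, s, adj_T[t]) for s, t in pairs)
```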
The fourth embodiment: the difference between this embodiment and one of the first to third embodiments is that the target-language loss, i.e. the loss on the target language T, is obtained from the target-language part of the parallel corpus; the specific process is as follows:
In C_T, the monolingual loss is:

Obj(C_T|C_T) = Σ_{t∈C_T} Σ_{w∈adj(t)} log p(w|t)   (3)

where Obj(C_T|C_T) represents the monolingual loss on the target language in the parallel corpus.

The probability values p in the formula are obtained by a two-layer fully-connected feedforward neural network, as follows: the word vectors of the words in C_T are used as centre-word vectors and input into the network; after a fully-connected layer, the 512-dimensional word vector is mapped to a |T|-dimensional vector, and the softmax operation yields the probability of every word in T; the probabilities of the words in adj(t) are picked out, their logarithms are taken and summed to obtain Σ_{w∈adj(t)} log p(w|t), and the values obtained for all centre words are added to obtain Obj(C_T|C_T).
In C_T, the bilingual loss is:

Obj(C_S|C_T) = Σ_{(s,t)} Σ_{w∈adj(s)} log p(w|t)   (4)

where Obj(C_S|C_T) represents the bilingual loss between the target language and the source language in the parallel corpus; (s, t) represents an aligned word pair (one target-language word corresponding to one source-language word), the word alignment information being obtained automatically from the parallel corpus (via an IBM model or other word alignment tools such as GIZA++); adj(·) represents the words adjacent to a given word.

The probability values p in the formula are obtained by a two-layer fully-connected feedforward neural network, as follows: the word vectors of the words in C_T are used as centre-word vectors and input into the network; after a fully-connected layer, the 512-dimensional word vector is mapped to an |S|-dimensional vector, and the softmax operation yields the probability of every word in S; the probabilities of the words w ∈ adj(s) are picked out from S, their logarithms are taken and summed to obtain Σ_{w∈adj(s)} log p(w|t), and the values obtained for all centre words are added to obtain Obj(C_S|C_T).
Combining (1), (2), (3) and (4) to obtain an objective function on the parallel corpus:
Obj(C) = α_1·Obj(C_S|C_S) + α_2·Obj(C_T|C_S) + α_3·Obj(C_T|C_T) + α_4·Obj(C_S|C_T),   (5)
where α_1, α_2, α_3 and α_4 are scalar hyper-parameters.
Other steps and parameters are the same as those in one of the first to third embodiments.
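Formulas (3) and (4) mirror (1) and (2) with the roles of S and T exchanged, so formula (5) can be assembled from four calls to the `neighbour_log_prob` sketch given after the third embodiment; the α values below are illustrative defaults, not values prescribed by the method:

```python
def obj_parallel(emb_S, emb_T, fc_S, fc_T, C_S, C_T, adj_S, adj_T, pairs,
                 a1=1.0, a2=1.0, a3=1.0, a4=1.0):
    """Formula (5): weighted sum of the four objectives on the parallel corpus."""
    mono_S = sum(neighbour_log_prob(emb_S, fc_S, s, adj_S[s]) for s in C_S)       # (1)
    bi_TS  = sum(neighbour_log_prob(emb_S, fc_T, s, adj_T[t]) for s, t in pairs)  # (2)
    mono_T = sum(neighbour_log_prob(emb_T, fc_T, t, adj_T[t]) for t in C_T)       # (3)
    bi_ST  = sum(neighbour_log_prob(emb_T, fc_S, t, adj_S[s]) for s, t in pairs)  # (4)
    return a1 * mono_S + a2 * bi_TS + a3 * mono_T + a4 * bi_ST
```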
The fifth embodiment: the difference between this embodiment and one of the first to fourth embodiments is that the classifier loss is obtained as follows:
Since the task is to train a text classifier, the ideal word vectors need to carry text category information. Therefore, text category information must be fused into the word vectors. The way to do this is to use the text classification corpus as supervision information during training: the loss of the text classifier is added into the loss function, and the bilingual model and the text classifier are trained jointly, yielding word vectors that fuse text label information and cross-language information.
A logistic regression classifier is adopted as the text classifier, and the loss of the text classifier adopts a cross-entropy loss function, denoted L; the text classifier loss function is:

L(C_L) = Σ_{S_d∈C_L} log ( exp(W_{tag(S_d)}·x_{S_d} + b) / Σ_c exp(W_c·x_{S_d} + b) )   (6)

where C_L represents the text classification corpus (labelled); S_d represents any text in the text classification corpus; x represents a text vector, obtained by the weighted sum of the word vectors of the words in the text; x_{S_d} is the text vector representing the text S_d; b is the bias; W is the weight vector corresponding to each text category (2 vectors W for binary classification, 4 for four-class classification); tag(S_d) is the label of the text S_d (positive or negative); and W_{tag(S_d)} is the weight vector corresponding to the label of S_d.
Other steps and parameters are the same as in one of the first to fourth embodiments.
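A minimal sketch of the classifier loss L(C_L) of formula (6), assuming `C_L` is a list of (token list, label index) pairs, `emb` the word-vector matrix, `word_index` the vocabulary index, and (W, b) the classifier parameters; names are illustrative:

```python
import torch
import torch.nn.functional as F

def classifier_loss(C_L, emb, W, b, word_index):
    total = 0.0
    for tokens, label in C_L:                        # each labelled text S_d
        idx = [word_index[t] for t in tokens if t in word_index]
        x = emb[idx].mean(dim=0)                     # weighted sum of word vectors -> x_{S_d}
        log_p = F.log_softmax(W @ x + b, dim=-1)     # one weight vector W_c per category
        total = total + log_p[label]                 # log p(tag(S_d) | x_{S_d})
    return total                                     # the total loss (7) uses -L(C_L)
```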
The sixth embodiment: the difference between this embodiment and one of the first to fifth embodiments is that the total loss function is obtained from the source-language loss, the target-language loss and the classifier loss; the concrete formula is:
loss = -Obj(C) - L(C_L)   (7)
where Obj(C) represents the objective function on the parallel corpus and L(C_L) represents the text classifier loss function;
After the classifier loss function is added, the word vectors obtained by training fuse monolingual information, cross-language information and text category information, and can meet the requirements of the task.
Other steps and parameters are the same as those in one of the first to fifth embodiments.
The seventh embodiment: the difference between this embodiment and one of the first to sixth embodiments is that, in step two, the total loss function loss is optimized by a gradient optimization method (e.g. SGD, Adam or AdaGrad) until it reaches its minimum; the specific process is as follows:
1) compute the partial derivatives of the total loss function loss with respect to the word vector matrix (the representation of every word of the parallel corpus obtained from step one), and the partial derivatives of loss with respect to the weight vector W and the bias b (of formula (6));
2) subtract from the current word vector matrix the partial derivative of loss with respect to it, subtract from the current weight vector W the partial derivative of loss with respect to W, subtract from the current bias b the partial derivative of loss with respect to b, and recompute the total loss function loss;
3) repeat steps 1) and 2) until the partial derivatives in step 1) are zero or loss no longer decreases; the group of word vectors and the classifier obtained at that point are retained, the classifier being a logistic regression classifier whose parameters are the weight vector W and the bias b.
Other steps and parameters are the same as those in one of the first to sixth embodiments.
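A minimal sketch of this explicit update, reusing the `emb`, `W`, `b` tensors and the hypothetical `total_loss()` from the earlier training-loop sketch; it is equivalent to the optimiser step shown there, with the learning rate as an illustrative value:

```python
import torch

lr = 0.05
for _ in range(1000):
    loss = total_loss(emb, W, b)        # 1) loss and its partial derivatives
    loss.backward()
    with torch.no_grad():
        for p in (emb, W, b):
            p -= lr * p.grad            # 2) subtract the partial derivative of loss
            p.grad.zero_()
    # 3) repeat until the gradients vanish or loss no longer decreases
```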
The following examples were used to demonstrate the beneficial effects of the present invention:
Example one:
The method comprises the following steps:
Step one: preprocess the corpus, including extracting a word list and initializing the word vector matrix. The Europarl (European Parliament) parallel corpus (1,000,000 sentences per language pair) is adopted as the parallel corpus required for training the word vectors, and the TED corpus is adopted for text classification training; this data set is a binary classification task. Stemming and low-frequency-word removal are performed on the classification corpus. The scheme also needs bilingual word alignment resources; if these are lacking, the GIZA++ tool is used to obtain a bilingual word alignment table by training on the bilingual parallel corpus.
Step two: construct the loss function. The loss function includes three terms. The first is the source-language loss, i.e. the loss on the source language S, obtained from the source-language part of the parallel corpus and computed according to formulas (1) and (2). The second is the target-language loss, obtained from the target-language part of the parallel corpus and computed according to formulas (3) and (4). The probabilities p in these formulas are computed by a two-layer feedforward neural network. The third is the classifier loss, obtained by formula (6). The total loss function is computed by formula (7).
Step three: training and testing. The loss function is constructed on the specific corpus, and training is performed with a gradient-based optimization method (e.g. SGD, Adam or AdaGrad), using the word vector matrix over the whole word list and the classifier parameters as the trainable parameters of the whole objective function, until convergence. Testing is then performed on the test corpus to obtain the test result. This example uses SGD (stochastic gradient descent) as the optimization method.
The test result shows that the classification accuracy obtained on multiple language pairs of the TED data set exceeds that of existing methods; the F1 value on the English-German language pair reaches 0.413.
Example two:
The method comprises the following steps:
Step one: preprocess the corpus, including extracting a word list and initializing the word vector matrix. The Europarl parallel corpus (1,000,000 sentences per language pair) is used as the parallel corpus required for training the word vectors, and the RCV1 corpus is used for text classification training; this data set is a four-class classification task. Stemming and low-frequency-word removal are performed on the classification corpus. A bilingual word alignment table, i.e. a translation dictionary, is obtained by training on the parallel corpus with the GIZA++ tool.
Step two: construct the loss function. The loss function includes three terms. The first is the source-language loss, i.e. the loss on the source language S, obtained from the source-language part of the parallel corpus and computed according to formulas (1) and (2). The second is the target-language loss, obtained from the target-language part of the parallel corpus and computed according to formulas (3) and (4). The probabilities p in these formulas are computed by a two-layer feedforward neural network. The third is the classifier loss, obtained from the multi-class logistic regression loss derived from formula (6), i.e. the cross-entropy loss function of softmax regression, whose expression is given by formula (8). The total loss function is obtained by formula (7), where the loss part of the multi-class classifier is changed from formula (6) to formula (8).
Step three: training and testing. The loss function is constructed on the specific corpus, and training is performed with a gradient-based optimization method (e.g. SGD, Adam or AdaGrad), using the word vector matrix over the whole word list and the classifier parameters as the trainable parameters of the whole objective function, until convergence. Testing is then performed on the test corpus to obtain the test result. This example uses Adam as the optimization method.
The test result shows that the classification accuracy obtained by the method on the RCV corpus exceeds that of existing schemes; the classification accuracy obtained on the English language pair is 90.2%.
The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.