CN111708863A - Method and device for text matching based on doc2vec and electronic equipment - Google Patents
- Publication number
- CN111708863A CN202010492263.4A
- Authority
- CN
- China
- Prior art keywords
- text
- target
- matching
- vector
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses a doc2vec-based text matching method, a device, and electronic equipment. The method comprises the following steps: according to the doc2vec model, performing vector conversion n times on each target text in the target text set to obtain a target vector set for each target text, where each target vector set comprises n target vectors; performing similarity matching between the target vector set of each target text and the sentence vector generated from the input text; and extracting the target text whose target vector set has the highest matching degree with the sentence vector as the matched text. Compared with the prior art, because each target text is converted into vectors n times and each resulting target vector set is matched against the sentence vector of the input text, the influence of non-deterministically output target vectors is reduced, so that when doc2vec is used to select the optimal text among a plurality of target texts, mismatches caused by the randomness of doc2vec seeds are avoided.
Description
Technical Field
The application relates to the technical field of text matching, and in particular to a doc2vec-based text matching method, device, and electronic equipment.
Background
Conventional text matching includes approaches based on word2vec. However, such approaches represent each word as a vector; word order is not considered and semantic information is ignored, so the matching accuracy of word2vec is low when whole sentences must be matched. To solve this problem, the prior art uses doc2vec for text matching. Because doc2vec creates a vectorized representation of a document, it can represent a whole sentence well and is better suited than word2vec to matching whole sentences. A dialogue system generally has a knowledge base in which question sentences and their corresponding answer sentences are recorded in advance. During text matching, the question sentences in the knowledge base serve as the target texts to be matched against the user's input text: the input vector generated from the input text by doc2vec is matched against the target vectors generated from the target texts by doc2vec, and the answer sentence corresponding to the target text with the highest matching degree is automatically fed back to the user.
However, when doc2vec is used for feature representation, the randomness of doc2vec seeds can cause non-deterministic output. A database generally contains many target texts, and when each of them is converted into a target vector by doc2vec, this randomness can make the calculation results inaccurate. As a result, the matching degree between the input vector and the target vector of a poor target text (one whose actual matching degree with the input text is low) may be higher than that between the input vector and the target vector of a good target text (one whose actual matching degree with the input text is high), which leads to mismatches. Existing doc2vec-based matching therefore cannot reliably select the optimal text from a plurality of target texts.
Disclosure of Invention
The present application aims to solve at least one of the technical problems in the prior art by providing a doc2vec-based text matching method, apparatus, computer-readable storage medium, and electronic device, so as to improve accuracy when matching the optimal text among a plurality of target texts through doc2vec.
The embodiment of the application provides a method for matching texts based on doc2vec, which comprises the following steps:
according to the doc2vec model, performing vector conversion on each target text in the target text set for n times to obtain each target vector set; wherein the set of target vectors comprises n target vectors;
similarity matching is carried out on each target vector set and sentence vectors generated according to input texts, and the target text corresponding to the target vector set with the highest sentence vector matching degree is extracted as a matched text.
Further, the performing similarity matching between each target vector set and a sentence vector generated according to an input text, and extracting a target text corresponding to the target vector set with the highest sentence vector matching degree as a matching text includes:
carrying out weighted average on n target vectors of the target vector set to generate a characteristic vector;
and performing similarity matching on each feature vector corresponding to each target text and the sentence vector, and extracting the target text corresponding to the feature vector with the highest matching degree with the sentence vector as the matched text.
Further, the performing similarity matching between each target vector set and a sentence vector generated according to an input text, and extracting a target text corresponding to the target vector set with the highest sentence vector matching degree as a matching text includes:
acquiring n scores of n target vectors of the target vector set after cosine similarity operation with the sentence vectors respectively, and performing weighted average on the n scores to generate matching scores;
and acquiring a maximum matching score from the matching scores corresponding to the target texts, and extracting the target text corresponding to the maximum matching score as the matching text.
Further, the weighted average of the n scores includes:
and extracting k scores which are greater than a preset threshold value from the n scores for weighted average to generate the matching score.
Further, before performing vector conversion on any target text in the target text set of the database for n times according to the doc2vec model, the method further includes:
performing text classification on the input text, and determining a text category corresponding to the input text in a database;
extracting the target text set under the text category.
Further, the text classification of the input text and the determination of the text category corresponding to the input text in the database include:
performing text matching on the input text and a pre-stored historical text set to obtain a historical text with the highest similarity with the input text in the historical text set; the history text is generated by acquiring a history input record of the terminal;
and according to the corresponding text type of the historical text in the database, determining the text type of the input text.
Further, the text classification of the input text and the determination of the text category corresponding to the input text in the database include:
and performing KNN operation on the input text and each historical text in a pre-stored historical text set to determine the text type of the input text.
Further, an embodiment of the present application provides a doc2vec-based text matching apparatus, including:
the vector acquisition module is used for performing n times of vector conversion on each target text in the target text set according to the doc2vec model to acquire each target vector set; wherein the set of target vectors comprises n target vectors;
and the vector matching module is used for matching the similarity of each target vector set with sentence vectors generated according to input texts and extracting the target text corresponding to the target vector set with the highest sentence vector matching degree as a matched text.
Further, the apparatus also includes:
the data classification module, which is used for performing text classification on the input text and determining the text category corresponding to the input text in the database,
and for extracting any text under the text category as the target text.
Further, an embodiment of the present application provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the doc2vec based text matching method as described in the above embodiments when executing the program.
Further, an embodiment of the present application provides a computer-readable storage medium, where computer-executable instructions are stored, and the computer-executable instructions are configured to enable a computer to execute the doc2vec based text matching method according to the embodiment.
Compared with the prior art, this embodiment generates vectors n times for each target text and performs similarity matching between the target vector set of each target text and the sentence vector generated from the input text. This reduces the influence of non-deterministically output target vectors and prevents mismatches caused by the randomness of doc2vec seeds when doc2vec is used to match the optimal text among a plurality of target texts. In addition, by exploiting the randomness of doc2vec seeds to generate a target vector set from each target text through the doc2vec model, the semantics of the text can be described accurately as a whole, which improves the accuracy of subsequent matching results.
Drawings
The present application is further described with reference to the following figures and examples;
FIG. 1 is a diagram of an embodiment of an application environment for a doc2vec based text matching method;
FIG. 2 is a flow chart illustrating a doc2vec based text matching method in one embodiment;
FIG. 3 is a flowchart illustrating a doc2vec based text matching method in another embodiment;
FIG. 4 is a diagram of an interaction interface of an intelligent customer service system in one embodiment;
FIG. 5 is a block diagram of an embodiment of a doc2vec based text matching apparatus;
FIG. 6 is a block diagram of another embodiment of a doc2vec based text matching apparatus;
FIG. 7 is a block diagram of a computer device in one embodiment.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, preferred embodiments of which are illustrated in the accompanying drawings. The drawings supplement the written description so that each feature and technical solution described herein can be understood visually, but they should not be construed as limiting the scope of the application.
In the existing doc2vec-based text matching method, doc2vec is used to generate an input vector for the input text and a target vector for the target text. A distance calculation is then performed on the two vectors to obtain their matching score, which determines whether the input text matches the target text. When doc2vec is used for feature representation, part of the algorithm is non-deterministic: the initialization of word vectors is deterministic, but negative sampling, which samples words randomly, may produce non-deterministic output. Sentence vectors generated from the same text therefore differ from run to run, which increases the probability of a mismatch when the optimal text must be selected from a plurality of target texts.
To solve the above problem, the present application provides a doc2vec-based text matching method. FIG. 1 is a diagram of an application environment of the method in one embodiment. Referring to FIG. 1, the method is applied to a doc2vec-based text matching system, which includes a terminal 110 and a server 120 connected through a network. The terminal 110 may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or as a cluster of multiple servers.
FIG. 4 is a diagram of an interaction interface of an intelligent customer service system in one embodiment. Referring to FIG. 4, the doc2vec-based text matching method provided in this embodiment may be used in intelligent customer service systems in multiple fields. In a typical application scenario, a user provides an input text to the intelligent customer service system through the terminal 110, and the server 120 obtains that input text. According to the doc2vec model, the server 120 performs vector conversion n times on each target text in the target text set of the database to obtain a target vector set for each target text, and performs similarity matching between each target vector set and the sentence vector generated from the input text. It then extracts the target text whose target vector set has the highest matching degree with the sentence vector as the matching text and feeds the matching text back to the terminal 110, or feeds back related information corresponding to the matching text to the user. For example, if the matching text is a standard question in the database and the pre-stored related information is the answer text of that standard question, the answer text is fed back to the user.
In this way, when doc2vec is used to match the optimal text among the target texts, mismatches caused by the randomness of doc2vec seeds are avoided. Moreover, the randomness of doc2vec seeds is exploited to generate a target vector set from each target text through the doc2vec model, so that the semantics of the text can be described accurately as a whole and the accuracy of subsequent matching results is improved.
It can be understood that the doc2vec-based text matching method provided by the embodiments of the present application is not limited to intelligent customer service systems in the shopping and game fields. It may also be applied in other fields such as weather query, medical consultation, government affairs consultation, and insurance consultation. Within the knowledge of a person skilled in the art, the method may be applied to intelligent customer service systems in different fields according to specific business requirements.
Hereinafter, the doc2vec-based text matching method provided in the embodiments of the present application will be described and explained in detail through several specific embodiments.
As shown in FIG. 2, in one embodiment, a doc2vec based text matching method is provided. The embodiment is mainly illustrated by applying the method to computer equipment. The computer device may specifically be the server 120 in fig. 1 described above.
Referring to FIG. 2, the doc2vec-based text matching method specifically includes the following steps:
Step S11: performing vector conversion on each target text in the target text set n times according to the doc2vec model to obtain a target vector set for each target text.
Wherein the set of target vectors comprises n target vectors.
Because the randomness of doc2vec seeds causes the sentence vector generated each time to differ, in one embodiment the target text is converted into a vector by doc2vec multiple times. Each conversion generates one target vector, and the target vectors together form the target vector set. To ensure the accuracy of subsequent matching, the target vector set must not contain too few target vectors; however, each conversion takes a certain amount of time, so requiring too many target vectors makes the process too slow. To balance matching accuracy against calculation time, the value of n is set in the range of 10 to 20. Preferably, n is 11, as determined by measurement.
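The following is a minimal sketch of step S11 under stated assumptions. The function `infer_vector_stub` is hypothetical: it only imitates non-deterministic doc2vec inference by adding random noise to a fixed per-text base vector, and it is not a real doc2vec call. The name `build_target_vector_sets` and the 4-dimensional vectors are likewise illustrative; the default n = 11 follows the preferred value stated above.

```python
import random

def infer_vector_stub(text, dim=4, noise=0.05, rng=random):
    """Hypothetical stand-in for doc2vec inference: a fixed base vector
    derived from the text, plus fresh random noise on every call."""
    base_rng = random.Random(sum(ord(c) for c in text))  # same base per text
    base = [base_rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return [b + rng.gauss(0.0, noise) for b in base]

def build_target_vector_sets(target_texts, n=11, infer=infer_vector_stub):
    """Convert each target text n times, giving one set of n vectors per text."""
    return {text: [infer(text) for _ in range(n)] for text in target_texts}

target_sets = build_target_vector_sets(
    ["how do I reset my password", "when will my order ship"]
)
```

In a real system the `infer` argument would be replaced by the inference call of the trained doc2vec model; the rest of the structure stays the same.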
Step S12: performing similarity matching between each target vector set and the sentence vector generated according to the input text, and extracting the target text corresponding to the target vector set with the highest matching degree with the sentence vector as the matching text.
In one embodiment, the input text may be obtained through a terminal device such as a mobile phone, a notebook computer, or a tablet computer. One way to match the target vector sets against the sentence vector is to perform a KNN operation between the input vector and all target vectors drawn from the target vector sets (n target vectors per set), so as to determine the target vector set with the highest matching degree with the input vector; the target text corresponding to that target vector set is then used as the matching text. Because of the randomness of doc2vec seeds, the vectors may overlap considerably, so determining the matching text with the KNN algorithm is a good option. However, the KNN algorithm is computationally expensive and is not suitable when there are too many samples, so it should not be used when the number of target vectors is too large. In addition, the KNN algorithm requires a suitable value of k; if k is chosen poorly, a mismatch may occur.
Therefore, as an improvement of the above embodiment, in one embodiment each target vector set is converted into a feature vector, and similarity matching is then performed between the feature vector and the sentence vector to obtain the similarity between the target text and the input text. For example, the n target vectors in the target vector set are weighted and averaged to generate the feature vector. Because the influence of each individual target vector on the matching accuracy cannot be determined when feature vectors are generated in this way, every target vector is given the same weight. A cosine similarity operation is then performed between each feature vector and the sentence vector to obtain a similarity value for each target text, and the target text corresponding to the feature vector with the highest similarity value is extracted as the matched text. This reduces the computational complexity of matching target vector sets against the sentence vector and saves computation cost.
In addition, an input vector must also be generated from the input text. To ensure the accuracy of the input vector, in one embodiment the same processing used to generate the feature vector may be applied: the input text is converted into a vector n times through the doc2vec model to obtain an initial vector set, and the n initial vectors in that set are weighted and averaged to generate the input vector.
The arithmetic mean is relatively insensitive to sampling variation, and generating a target vector set through doc2vec can be regarded as a form of sampling. By using the arithmetic mean, the score obtained from the cosine similarity between the input vector and the feature vector derived from the target vector set is therefore reliable in most cases, and the input text can be matched to the target text more effectively. However, the arithmetic mean is easily affected by extreme values in a set of data: when one target vector in the target vector set differs too much from the others, the final matching score may be inaccurate. Therefore, as another improvement of the foregoing embodiment, in one embodiment similarity matching is performed between each of the n target vectors of a target vector set and the sentence vector to obtain n scores, and the n scores are weighted and averaged to generate a matching score. The maximum matching score is then selected from the matching scores of all target texts, and the target text corresponding to the maximum matching score is extracted as the matching text. Specifically, different weights are assigned to the n scores according to their values: the higher the score, the higher its weight. The specific weights may be allocated according to a preset proportion, which is not described in detail here.
To further increase the accuracy of subsequent matching, the matching score is preferably generated by extracting the k scores among the n scores that are greater than a preset threshold and computing their weighted average, where k is less than or equal to n. In other words, the weight of every score below the preset threshold is reduced to 0, which increases the weight of the target vectors with higher similarity and thus improves the matching accuracy.
In one embodiment, when the number of current input texts is detected to exceed a preset value, each input text is matched using only the score-based approach: n scores are obtained by similarity matching between the n target vectors of each target vector set and the sentence vector, the n scores are weighted and averaged to generate a matching score, the maximum matching score is selected from the matching scores of all target texts, and the corresponding target text is extracted as the matching text. This saves computation cost and allows a timely response when many input texts arrive. When the number of current input texts is detected to be smaller than the preset value, each input text is matched in three ways (the KNN operation, the weighted feature vector, and the weighted average of scores), producing three matching texts, and the matching text that is the same among the three results is selected as the final matching text. This further improves the matching accuracy without consuming excessive computation cost.
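The KNN-based matching described above can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the pooled vectors are ranked by cosine similarity, the k nearest vote for their owning target text, and the names `cosine`, `knn_match`, and the toy 2-dimensional vectors are all hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_match(target_sets, input_vector, k=3):
    """Pool every target vector, keep the k most similar to the input
    vector, and return the target text that owns most of them."""
    pooled = [(cosine(vec, input_vector), text)
              for text, vectors in target_sets.items()
              for vec in vectors]
    nearest = sorted(pooled, key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter(text for _, text in nearest)
    return votes.most_common(1)[0][0]

# Toy data: three noisy vectors per target text.
target_sets = {
    "question A": [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]],
    "question B": [[0.1, 1.0], [0.0, 0.9], [0.2, 1.0]],
}
best = knn_match(target_sets, [1.0, 0.05], k=3)
```

Sorting the full pool costs time proportional to the total number of target vectors, which is why this approach becomes impractical as the pool grows.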
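A minimal sketch of the feature-vector approach, assuming equal weights as the text describes: each target vector set is collapsed to its component-wise mean, and the target text whose mean vector has the highest cosine similarity with the sentence vector is returned. The function names and toy vectors are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Equal-weight average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def match_by_feature_vector(target_sets, sentence_vector):
    """Return the target text whose feature vector is closest to the sentence vector."""
    features = {text: mean_vector(vs) for text, vs in target_sets.items()}
    return max(features, key=lambda text: cosine(features[text], sentence_vector))

target_sets = {
    "question A": [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]],
    "question B": [[0.1, 1.0], [0.0, 0.9], [0.2, 1.0]],
}
best = match_by_feature_vector(target_sets, [1.0, 0.05])
```

Only one cosine computation per target text is needed at query time, compared with n per target text when every vector in the set is compared individually.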
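The score-level weighting can be sketched as below. The text does not specify the "preset proportion", so this sketch assumes rank-based weights (the lowest score gets weight 1 and the highest gets weight n); that choice, and the names `rank_weighted_score` and `match_by_weighted_scores`, are assumptions made purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_weighted_score(scores):
    """Weighted average of a non-empty score list in which higher scores
    get higher weights (assumed scheme: weight = rank, 1 for the lowest)."""
    ordered = sorted(scores)                  # lowest score first
    weights = range(1, len(ordered) + 1)      # 1 .. n
    return sum(w * s for w, s in zip(weights, ordered)) / sum(weights)

def match_by_weighted_scores(target_sets, sentence_vector):
    """Score every target vector, aggregate per target text, return the best text."""
    matching = {
        text: rank_weighted_score([cosine(v, sentence_vector) for v in vectors])
        for text, vectors in target_sets.items()
    }
    return max(matching, key=matching.get)

# "A" contains one outlier vector pointing the wrong way.
target_sets = {
    "A": [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]],
    "B": [[0.0, 1.0], [0.1, 1.0], [0.0, 1.0]],
}
best = match_by_weighted_scores(target_sets, [1.0, 0.0])
```

Because the outlier receives the smallest weight, it pulls the aggregate down less than it would in a plain arithmetic mean.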
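A minimal sketch of the thresholded variant. Two details are assumptions not stated in the text: the k surviving scores are given equal weights, and a target text with no score above the threshold receives a matching score of 0. The threshold value 0.5 is also only an example.

```python
def thresholded_score(scores, threshold=0.5):
    """Average only the scores above the threshold; scores at or below it
    effectively get weight 0."""
    kept = [s for s in scores if s > threshold]   # the k surviving scores, k <= n
    if not kept:
        return 0.0   # assumption: nothing passes, so the matching score is 0
    return sum(kept) / len(kept)

score = thresholded_score([0.9, 0.7, 0.2], threshold=0.5)
```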
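The load-dependent selection of matching strategies might be organized as in the sketch below. The matchers are passed in as zero-argument callables so the sketch stays independent of how they are implemented. Several points are assumptions: agreement is interpreted as at least two of the three results being identical, `None` is returned when all three differ, and a load exactly equal to the preset value is treated as low load.

```python
from collections import Counter

def consensus(candidates):
    """Return the text proposed by at least two matchers, else None
    (assumed tie-handling; the description does not specify it)."""
    text, count = Counter(candidates).most_common(1)[0]
    return text if count >= 2 else None

def match_input(num_pending_inputs, preset_value, score_matcher, all_matchers):
    """Under high load use only the score-based matcher; otherwise run
    all matchers and keep the result they agree on."""
    if num_pending_inputs > preset_value:
        return score_matcher()
    return consensus([matcher() for matcher in all_matchers])

# Hypothetical matchers standing in for KNN, feature-vector and score averaging.
result = match_input(
    num_pending_inputs=1,
    preset_value=3,
    score_matcher=lambda: "question A",
    all_matchers=[lambda: "question A", lambda: "question A", lambda: "question B"],
)
```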
In another embodiment, as shown in FIG. 3, a doc2vec based text matching method is provided. The embodiment is mainly illustrated by applying the method to computer equipment. The computer device may specifically be the server 120 in fig. 1 described above.
Referring to FIG. 3, the doc2vec-based text matching method further includes, in addition to the steps described in the above embodiment:
Step S10: performing text classification on the input text, determining the corresponding text category of the input text in the database, and extracting the target text set under the text category.
A single intelligent customer service system may involve multiple fields at the same time. For example, a game platform may also sell figurines of its game characters, so its intelligent customer service system involves both the shopping field and the game field. When the doc2vec-based text matching method provided by the embodiment of the present application is applied to such a system, an overly long input text may contain keywords from two fields. For example, for the input text "when the purchased game character is shipped", the user may be asking when a game character figurine purchased on the shopping platform will be shipped, or when a game character purchased in the game will be delivered. The overly long input text thus causes ambiguity. If the intelligent customer service system cannot handle this ambiguity and only answers according to one of the meanings, the user may not be matched with a suitable target text and must repeatedly adjust the input text, which increases the user interaction cost and the energy consumption of the server 120.
The texts in the database are divided into a plurality of text categories according to field, such as the shopping field and the game field. In the embodiment of the application, the input text is classified, and the multiple meanings that the input text may have are identified through this classification. This ensures that the target texts corresponding to the different meanings are not omitted, so that the target text for each meaning can be fed back to the user and the user interaction cost is reduced. In addition, because the input text is classified in advance, it does not need to be matched against the target texts of the whole database; only local text matching is required, which reduces the calculation cost.
In one embodiment, each text in the database is assigned to a text category according to its field. To determine the text category of the input text in the database, each text category is provided with a category set whose elements are labeled documents representing that category. Determining the text category of the input text amounts to performing text classification on it. In this embodiment, a KNN algorithm searches the category sets of the database for the K labeled documents nearest to (that is, most similar or identical to) the input text, and the text category of the input text is then determined from the classification labels of these K neighboring documents.
It can be understood that, to keep text classification accurate, the category sets of the database should not contain too few labeled documents. However, if a KNN operation over the labeled documents in the category sets is performed every time an input text is classified, the energy consumption of the server increases. Therefore, in this embodiment, the text category of the input text is determined as follows: doc2vec-based text matching is performed between the input text and a pre-stored historical text set to obtain the historical text in the set that is most similar to the input text, where the historical texts are generated from the historical input records of the terminal; the text category of the input text is then determined according to the text category of that historical text in the database. This relies on the observation that the fields and question content that the user of a given terminal deals with usually vary within a limited range and do not change much, so the server generally records the input texts provided by each terminal together with their text categories. When the user of the terminal converses with the intelligent customer service system again, the server first compares the input text with the historical text set of that terminal, finds the historical text most similar to the input text by calculating the Jaccard similarity coefficient between the input text and each historical text, and takes the text category of that historical text in the database as the category of the input text.
Because the goal is only to obtain the text category of the input text, the semantic requirements are not high. To determine the category quickly, text matching can therefore be performed by calculating the Jaccard similarity coefficient. Moreover, because the historical text set contains fewer texts than the category sets contain labeled documents, the efficiency of text classification is greatly improved and the energy consumption of the server is reduced.
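The KNN-based text classification can be sketched as follows. The description does not say how documents are represented or which distance is used, so this sketch assumes each labeled document is already a vector and uses Euclidean distance; `classify_knn` and the toy data are hypothetical.

```python
import math
from collections import Counter

def classify_knn(input_vector, labeled_vectors, k=3):
    """labeled_vectors is a list of (vector, category) pairs.
    The category held by most of the k nearest documents wins."""
    nearest = sorted(
        labeled_vectors,
        key=lambda item: math.dist(item[0], input_vector),
    )[:k]
    labels = Counter(category for _, category in nearest)
    return labels.most_common(1)[0][0]

labeled = [
    ([0.9, 0.1], "shopping"), ([1.0, 0.0], "shopping"), ([0.8, 0.2], "shopping"),
    ([0.1, 0.9], "game"), ([0.0, 1.0], "game"), ([0.2, 0.8], "game"),
]
category = classify_knn([0.85, 0.15], labeled, k=3)
```

Once the category is known, only the target texts stored under that category need to be passed to the matching step.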
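A minimal sketch of category lookup through the historical text set using the Jaccard similarity coefficient, |A ∩ B| / |A ∪ B|. Splitting on whitespace is an assumption made for illustration (Chinese text would need a word segmenter first), and `category_from_history` with its sample history is hypothetical.

```python
def jaccard(text_a, text_b):
    """Jaccard similarity coefficient of the two texts' token sets."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def category_from_history(input_text, history):
    """history is a non-empty list of (historical_text, category) pairs.
    Return the category of the historical text most similar to the input."""
    _, best_category = max(history, key=lambda item: jaccard(input_text, item[0]))
    return best_category

history = [
    ("when will my figurine order ship", "shopping"),
    ("how do I unlock a new game character", "game"),
]
category = category_from_history("when will my order ship", history)
```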
In an embodiment, since the text type of each historical text in the historical text set is labeled, KNN operation may be performed on the input text and each historical text in the pre-stored historical text set to determine the text type of the input text. Because the number of texts in the historical text set is less than that of the labeled documents in the category set, the efficiency of text classification is greatly improved, and the calculation cost of the server is reduced.
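A KNN vote over the labeled historical texts might look like the sketch below. The text does not say which similarity measure or which value of k the KNN operation uses, so the token-level Jaccard coefficient, `k=3`, and all names here are assumptions chosen for illustration.

```python
from collections import Counter


def token_jaccard(a: str, b: str) -> float:
    """Jaccard coefficient on whitespace tokens (assumed similarity measure)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def knn_category(input_text, history, k=3):
    """Majority category among the k historical texts most similar to input_text.

    history: list of (historical_text, category) pairs (hypothetical layout).
    """
    ranked = sorted(
        history,
        key=lambda item: token_jaccard(input_text, item[0]),
        reverse=True,
    )
    votes = Counter(category for _, category in ranked[:k])
    return votes.most_common(1)[0][0] if votes else None


history = [
    ("reset password help", "account"),
    ("change password now", "account"),
    ("refund my order", "billing"),
    ("order refund status", "billing"),
    ("password reset link", "account"),
]
predicted = knn_category("reset my password", history, k=3)
```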
As shown in fig. 5, in one embodiment, there is provided a doc2vec-based text matching apparatus, including:
a vector obtaining module 101, configured to perform vector conversion n times on each target text in the target text set according to the doc2vec model, to obtain a target vector set for each target text,
wherein each target vector set comprises n target vectors; and
a vector matching module 102, configured to perform similarity matching between each target vector set and a sentence vector generated from the input text, and to extract, as the matching text, the target text corresponding to the target vector set that has the highest matching degree with the sentence vector.
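The work of the vector obtaining module can be sketched as below. Because a trained doc2vec model is not available in a short example, `make_toy_infer` is a hypothetical stand-in that returns a slightly different vector on every call, mimicking inference that is not deterministic; in practice the `infer` argument would presumably wrap the inference call of a trained model (for example gensim's `Doc2Vec.infer_vector`). All names are assumptions.

```python
import random


def build_target_vector_sets(target_texts, infer, n):
    """Convert every target text to a vector n times.

    Returns a dict mapping each target text to its set of n target vectors.
    """
    return {text: [infer(text) for _ in range(n)] for text in target_texts}


def make_toy_infer(dim=4, seed=0, noise=0.05):
    """Hypothetical stand-in for doc2vec inference (NOT a real model).

    The base vector depends only on the text; Gaussian noise makes repeated
    conversions of the same text differ slightly.
    """
    rng = random.Random(seed)

    def infer(text):
        base = [sum(ord(c) for c in text[i::dim]) % 7 for i in range(dim)]
        return [b + rng.gauss(0.0, noise) for b in base]

    return infer


vector_sets = build_target_vector_sets(
    ["how do I reset my password", "where is my refund"],
    make_toy_infer(),
    n=5,
)
```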
In an embodiment, the vector matching module 102 is specifically configured to compute a weighted average of the n target vectors in each target vector set to generate a feature vector. Similarity matching is then performed between the sentence vector and the feature vector corresponding to each target text, and the target text corresponding to the feature vector with the highest matching degree with the sentence vector is extracted as the matching text.
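A minimal sketch of this feature-vector embodiment follows. The text does not fix the similarity measure or the weights, so cosine similarity and uniform default weights are assumptions, as are all function names.

```python
import math


def weighted_average_vector(vectors, weights=None):
    """Weighted average of n equal-length vectors (uniform weights by default)."""
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[i] for w, v in zip(weights, vectors)) / total
        for i in range(dim)
    ]


def cosine(u, v):
    """Cosine similarity (assumed measure); 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_by_feature_vector(target_vector_sets, sentence_vector, weights=None):
    """Return the target text whose feature vector best matches the sentence vector."""
    features = {
        text: weighted_average_vector(vectors, weights)
        for text, vectors in target_vector_sets.items()
    }
    return max(features, key=lambda text: cosine(features[text], sentence_vector))


sets = {
    "A": [[1.0, 0.0], [0.8, 0.2]],
    "B": [[0.0, 1.0], [0.1, 0.9]],
}
best = match_by_feature_vector(sets, [1.0, 0.1])
```

Averaging first means only one similarity computation per target text at query time, at the cost of discarding the spread among the n vectors.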
In another embodiment, the vector matching module 102 is specifically configured to obtain the n scores produced by similarity matching between each of the n target vectors of a target vector set and the sentence vector, and to compute a weighted average of the n scores to generate a matching score. The maximum matching score is then obtained from the matching scores corresponding to the target texts, and the target text corresponding to the maximum matching score is extracted as the matching text. Here, the weighted averaging of the n scores comprises: extracting, from the n scores, the k scores that are greater than a preset threshold, and computing their weighted average to generate the matching score.
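The score-averaging embodiment, including the optional threshold, can be sketched as follows. Cosine similarity, uniform default weights, and returning 0.0 when no score exceeds the threshold are assumptions of this sketch; the text does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity (assumed measure); 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def matching_score(target_vectors, sentence_vector, threshold=None, weights=None):
    """Weighted average of the per-vector similarity scores.

    If threshold is given, only the k scores strictly greater than it are
    averaged. Returning 0.0 when none qualifies is an assumption.
    """
    scores = [cosine(v, sentence_vector) for v in target_vectors]
    if weights is None:
        weights = [1.0] * len(scores)
    pairs = list(zip(scores, weights))
    if threshold is not None:
        pairs = [(s, w) for s, w in pairs if s > threshold]
    if not pairs:
        return 0.0
    return sum(s * w for s, w in pairs) / sum(w for _, w in pairs)


def match_by_score(target_vector_sets, sentence_vector, threshold=None):
    """Return the target text with the maximum matching score."""
    return max(
        target_vector_sets,
        key=lambda text: matching_score(
            target_vector_sets[text], sentence_vector, threshold
        ),
    )


sets = {
    "A": [[1.0, 0.0], [0.0, 1.0]],
    "B": [[0.6, 0.8], [0.6, 0.8]],
}
plain = matching_score(sets["A"], [1.0, 0.0])
filtered = matching_score(sets["A"], [1.0, 0.0], threshold=0.5)
```

In this toy data the threshold changes the outcome: without it, the outlier vector in set "A" drags its average below that of "B"; with it, the outlier is ignored.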
In another embodiment, the vector matching module 102 is specifically configured to select the matching mode according to the number of current input texts. When it is detected that the number of current input texts exceeds a preset value, each input text is matched by score weighted averaging alone: n scores are obtained by similarity matching between the n target vectors of a target vector set and the sentence vector, a weighted average of the n scores is computed to generate a matching score, the maximum matching score is obtained from the matching scores corresponding to the target texts, and the target text corresponding to the maximum matching score is extracted as the matching text. When it is detected that the number of current input texts is smaller than the preset value, each input text is matched in three ways, namely KNN calculation, feature vector generation by weighted averaging, and score weighted averaging; three matching texts are output, and the matching text that is the same among the three matching texts is selected as the final matching text.
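The load-dependent selection can be sketched as below. Several details are not specified in the text and are assumptions here: the three matchers are passed in as zero-argument callables, a count equal to the preset value is handled like the low-load branch, agreement of at least two of the three results is accepted, and the score-based result is used as a fallback when all three differ.

```python
from collections import Counter


def select_final_match(num_inputs, preset, by_score, by_knn, by_feature):
    """Pick the matching text according to the current number of input texts.

    by_score, by_knn, by_feature: zero-argument callables that each return a
    matching text (hypothetical interface for this sketch).
    """
    if num_inputs > preset:
        # High load: run only the score weighted-average matcher.
        return by_score()
    # Low load: run all three matchers and keep the result they agree on.
    candidates = [by_knn(), by_feature(), by_score()]
    text, count = Counter(candidates).most_common(1)[0]
    # Fallback to the score-based result when no two matchers agree (assumption).
    return text if count >= 2 else candidates[2]


busy = select_final_match(10, 5, lambda: "S", lambda: "K", lambda: "F")
idle = select_final_match(2, 5, lambda: "A", lambda: "A", lambda: "B")
```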
In another embodiment, as shown in fig. 6, the doc2vec-based text matching apparatus further includes:
a data classification module 100, configured to perform text classification on the input text and determine the text category corresponding to the input text in the database. Any text under the text category is extracted as a target text.
In an embodiment, the data classification module 100 is specifically configured to perform text matching on the input text and a pre-stored historical text set, and obtain the historical text with the highest similarity to the input text in the historical text set. The historical text is generated by acquiring a historical input record of the terminal. The text type of the input text is then determined according to the text type corresponding to that historical text in the database.
In another embodiment, the data classification module 100 is specifically configured to, after obtaining the text type of each historical text in the pre-stored historical text set, perform KNN operation on the input text and each historical text, and determine the text type of the input text.
FIG. 7 is a diagram illustrating an internal structure of a computer device in one embodiment. As shown in fig. 7, the computer device includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the doc2vec-based text matching method. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the doc2vec-based text matching method. Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, the doc2vec-based text matching apparatus provided by the present application may be implemented in the form of a computer program, and the computer program may be run on a computer device as shown in fig. 7. The memory of the computer device may store the program modules constituting the doc2vec-based text matching apparatus. The computer program consisting of these program modules causes the processor to execute the steps of the doc2vec-based text matching method of the embodiments of the present application described in the present specification.
In one embodiment, there is provided an electronic device including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program performing the steps of the doc2vec based text matching method described above. Here, the steps of the doc2vec based text matching method may be the steps of the doc2vec based text matching method in the above embodiments.
In one embodiment, a computer-readable storage medium is provided, having stored thereon computer-executable instructions for causing a computer to perform the steps of the doc2vec based text matching method described above. Here, the steps of the doc2vec based text matching method may be the steps of the doc2vec based text matching method in the above embodiments.
The foregoing is a preferred embodiment of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and such improvements and modifications are also considered to fall within the protection scope of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Claims (10)
1. A doc2vec-based text matching method is characterized by comprising the following steps:
according to the doc2vec model, performing vector conversion on each target text in the target text set for n times to obtain each target vector set; wherein the set of target vectors comprises n target vectors;
similarity matching is carried out on each target vector set and sentence vectors generated according to input texts, and the target text corresponding to the target vector set with the highest sentence vector matching degree is extracted as a matched text.
2. The doc2vec-based text matching method according to claim 1, wherein the performing similarity matching between each target vector set and a sentence vector generated according to an input text, and extracting a target text corresponding to the target vector set with the highest sentence vector matching degree as a matching text comprises:
carrying out weighted average on n target vectors of the target vector set to generate a characteristic vector;
and performing similarity matching on each feature vector corresponding to each target text and the sentence vector, and extracting the target text corresponding to the feature vector with the highest matching degree with the sentence vector as the matched text.
3. The doc2vec-based text matching method according to claim 1, wherein the performing similarity matching between each target vector set and a sentence vector generated according to an input text, and extracting a target text corresponding to the target vector set with the highest sentence vector matching degree as a matching text comprises:
acquiring n scores of the n target vectors of the target vector set after similarity matching with the sentence vectors respectively, and performing weighted average on the n scores to generate matching scores;
and acquiring a maximum matching score from the matching scores corresponding to the target texts, and extracting the target text corresponding to the maximum matching score as the matching text.
4. The doc2vec-based text matching method of claim 3, wherein the weighted average of the n scores comprises:
and extracting k scores which are greater than a preset threshold value from the n scores for weighted average to generate the matching score.
5. The doc2vec-based text matching method according to claim 1, wherein before performing vector conversion on any target text in the target text set of the database n times according to the doc2vec model, the method further comprises:
and performing text classification on the input text, and extracting the target text set under the text category after determining the text category corresponding to the input text in a database.
6. The doc2vec-based text matching method according to claim 5, wherein the text classification of the input text and the determination of the text category corresponding to the input text in the database comprises:
performing text matching on the input text and a pre-stored historical text set to obtain a historical text with the highest similarity with the input text in the historical text set; the historical text is generated by acquiring a historical input record of the terminal;
and according to the corresponding text type of the historical text in the database, determining the text type of the input text.
7. The doc2vec-based text matching method according to claim 5, wherein the text classification of the input text and the determination of the text category corresponding to the input text in the database comprises:
acquiring the text type of each historical text in a pre-stored historical text set;
and performing KNN operation on the input text and the historical texts to determine the text type of the input text.
8. A doc2vec-based text matching apparatus, comprising:
the vector acquisition module is used for performing n times of vector conversion on each target text in the target text set according to the doc2vec model to acquire each target vector set; wherein the set of target vectors comprises n target vectors;
and the vector matching module is used for matching the similarity of each target vector set with a sentence vector generated according to an input text and extracting a target text corresponding to the target vector set with the highest sentence vector matching degree as a matched text.
9. The doc2vec-based text matching device of claim 7, further comprising:
the data classification module is used for performing text classification on the input text and determining a text category corresponding to the input text in a database;
any text under the text category is extracted as the target text.
10. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the doc2vec-based text matching method according to any of claims 1 to 7 when executing the program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010492263.4A CN111708863B (en) | 2020-06-02 | 2020-06-02 | Text matching method and device based on doc2vec and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010492263.4A CN111708863B (en) | 2020-06-02 | 2020-06-02 | Text matching method and device based on doc2vec and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111708863A (en) | 2020-09-25 |
CN111708863B (en) | 2024-03-15 |
Family
ID=72538562
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010492263.4A Active CN111708863B (en) | 2020-06-02 | 2020-06-02 | Text matching method and device based on doc2vec and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111708863B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114020878A (en) * | 2021-11-29 | 2022-02-08 | 清华大学 | Feature text matching method and device, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109947909A (en) * | 2018-06-19 | 2019-06-28 | 平安科技(深圳)有限公司 | Intelligent customer service answer method, equipment, storage medium and device |
CN110008396A (en) * | 2018-11-28 | 2019-07-12 | 阿里巴巴集团控股有限公司 | Object information method for pushing, device, equipment and computer readable storage medium |
CN110362651A (en) * | 2019-06-11 | 2019-10-22 | 华南师范大学 | Dialogue method, system, device and the storage medium that retrieval and generation combine |
US10467261B1 (en) * | 2017-04-27 | 2019-11-05 | Intuit Inc. | Methods, systems, and computer program product for implementing real-time classification and recommendations |
CN111027316A (en) * | 2019-11-18 | 2020-04-17 | 大连云知惠科技有限公司 | Text processing method, apparatus, electronic device, and computer-readable storage medium |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10467261B1 (en) * | 2017-04-27 | 2019-11-05 | Intuit Inc. | Methods, systems, and computer program product for implementing real-time classification and recommendations |
CN109947909A (en) * | 2018-06-19 | 2019-06-28 | 平安科技(深圳)有限公司 | Intelligent customer service answer method, equipment, storage medium and device |
CN110008396A (en) * | 2018-11-28 | 2019-07-12 | 阿里巴巴集团控股有限公司 | Object information method for pushing, device, equipment and computer readable storage medium |
CN110362651A (en) * | 2019-06-11 | 2019-10-22 | 华南师范大学 | Dialogue method, system, device and the storage medium that retrieval and generation combine |
CN111027316A (en) * | 2019-11-18 | 2020-04-17 | 大连云知惠科技有限公司 | Text processing method, apparatus, electronic device, and computer-readable storage medium |
Non-Patent Citations (1)
Title |
---|
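张彪; 戴兴国: "基于指标距离与不确定度量的岩爆云模型预测研究" [Zhang Biao; Dai Xingguo: "Research on rockburst prediction with a cloud model based on index distance and uncertainty measure"] * |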
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114020878A (en) * | 2021-11-29 | 2022-02-08 | 清华大学 | Feature text matching method and device, electronic equipment and storage medium |
CN114020878B (en) * | 2021-11-29 | 2024-08-02 | 清华大学 | Feature text matching method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111708863B (en) | 2024-03-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7302022B2 (en) | A text classification method, apparatus, computer readable storage medium and text classification program. | |
CN104137102B (en) | Non- true type inquiry response system and method | |
US11790894B2 (en) | Machine learning based models for automatic conversations in online systems | |
US20210042391A1 (en) | Generating summary content using supervised sentential extractive summarization | |
CN111639162A (en) | Information interaction method and device, electronic equipment and storage medium | |
CN112926308B (en) | Method, device, equipment, storage medium and program product for matching text | |
CN108038105B (en) | Method and device for generating simulated word vector for unknown words | |
CN114218945A (en) | Entity identification method, device, server and storage medium | |
CN111859940A (en) | Keyword extraction method and device, electronic equipment and storage medium | |
US20210319481A1 (en) | System and method for summerization of customer interaction | |
CN116483979A (en) | Dialog model training method, device, equipment and medium based on artificial intelligence | |
CN110717021A (en) | Input text and related device for obtaining artificial intelligence interview | |
CN114328894A (en) | Document processing method, document processing device, electronic equipment and medium | |
CN113449094A (en) | Corpus obtaining method and device, electronic equipment and storage medium | |
CN115730590A (en) | Intention recognition method and related equipment | |
CN112199958A (en) | Concept word sequence generation method and device, computer equipment and storage medium | |
CN114255067A (en) | Data pricing method and device, electronic equipment and storage medium | |
CN111708863B (en) | Text matching method and device based on doc2vec and electronic equipment | |
CN111737607B (en) | Data processing method, device, electronic equipment and storage medium | |
CN111143515B (en) | Text matching method and device | |
CN110851560A (en) | Information retrieval method, device and equipment | |
JP2009053743A (en) | Document similarity deriving apparatus, document similarity deriving method, and document similarity deriving program | |
CN111708872B (en) | Dialogue method and device and electronic equipment | |
CN114925185B (en) | Interaction method, model training method, device, equipment and medium | |
CN116483971A (en) | User question intelligent answer method, financial system and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||