WO2021236027A1 - Parameter optimization in unsupervised text mining - Google Patents
Parameter optimization in unsupervised text mining
- Publication number
- WO2021236027A1 (PCT Application No. PCT/TR2020/050440)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- scores
- parameter
- models
- clusters
- model
- Prior art date
- 2020-05-22
Links
- 238000005065 mining Methods 0.000 title claims abstract description 16
- 238000005457 optimization Methods 0.000 title abstract description 8
- 238000000034 method Methods 0.000 claims abstract description 30
- 239000013598 vector Substances 0.000 claims abstract description 20
- 238000012935 Averaging Methods 0.000 claims abstract description 9
- 238000009826 distribution Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000011524 similarity measure Methods 0.000 description 2
- 238000004138 cluster model Methods 0.000 description 1
- 238000003064 k means clustering Methods 0.000 description 1
- 230000035772 mutation Effects 0.000 description 1
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present disclosure provides a method for parameter optimization in unsupervised text mining techniques. The method comprises: a) generating a parameter pool composed of a plurality of parameter vectors; b) generating a model for each parameter vector in the parameter pool; c) calculating pairwise semantic relatedness scores between representative texts in clusters of the models; d) calculating scores of the clusters by averaging the scores of the representative texts; e) calculating scores of the models by averaging the scores of the clusters; f) comparing the scores of the parameter vectors, which are the scores of the corresponding models; g) updating the parameter pool; h) repeating steps b through g until a termination condition is met. The method increases the accuracy of unsupervised text mining techniques by effectively and efficiently optimizing their parameters.
Description
PARAMETER OPTIMIZATION IN UNSUPERVISED TEXT MINING
TECHNICAL FIELD
The present disclosure relates to the field of text mining, and more particularly to a method for parameter optimization in unsupervised text mining techniques.
BACKGROUND ART
Text mining is about discovering patterns in textual data. The techniques used in this field can be grouped into two main categories: supervised and unsupervised. While supervised text mining uses labelled text for training, unsupervised text mining uses unlabelled text.
The performance of a model in an unsupervised text mining technique depends on its parameter settings; models generated with different parameter values vary greatly in performance. Despite their broad use in many different fields, unsupervised text mining techniques have an unresolved problem: how to optimize parameters. Examples of such parameters include, but are not limited to, the number of topics, the Dirichlet prior on document-topic distributions and the Dirichlet prior on topic-word distributions in the Latent Dirichlet Allocation topic model, and the number of clusters in K-means clustering.
The parameter optimization problem prevents unsupervised text mining techniques from obtaining accurate results. If the parameters are not optimized appropriately, the results become meaningless and are effective in neither intrinsic nor extrinsic tasks. Thus, there is a need for an effective and efficient method for parameter optimization.
DETAILED DESCRIPTION
As used herein, the singular forms "a," "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Likewise, a plural form is intended to indicate that there may be one or more of the item, covering both the singular and plural senses of the term it modifies.
The terms “comprises”, “comprising”, or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such a process or method.
References throughout this specification to “one embodiment”, “an embodiment”, “another embodiment”, “such embodiment”, “some embodiments”, “an example”, “another example”, “a specific example”, “an example embodiment”, etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but not every embodiment necessarily includes that particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments or examples.
Embodiments described and descriptions made in this specification are explanatory, illustrative, and used to make the present disclosure understandable; they shall not be construed to limit the present disclosure. Other embodiments are possible, and modifications and variations can be made to the embodiments without departing from the spirit, principles, and scope of the present disclosure.
It would also be apparent to one of skill in the relevant art that the method described in this specification can be implemented with many different unsupervised text mining techniques, optimization techniques, and semantic relatedness measures. Various working modifications can be made to the method in order to implement the inventive concept taught in this specification.
Unless otherwise defined, all technical and scientific terms used in this specification have the same meaning as commonly understood by those skilled in the relevant art to which this disclosure belongs. The system, methods, and examples provided herein are only illustrative and not intended to be limiting.
Embodiments of the present disclosure relate to a method for optimizing parameters in unsupervised text mining techniques. The method includes the following steps:
At step a, a parameter pool composed of a plurality of parameter vectors is generated. A parameter vector is a collection of parameter values whose size equals the number of parameters being optimized; it may be any kind of collection that has a value for each of the parameters. In some embodiments, the parameter vectors may be initialized randomly within a range between the parameters’ predefined minimum and maximum values, while in other embodiments, they may be initialized using a braced initializer list.
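By way of illustration only, the following is a minimal sketch of step a in Python, assuming random initialization within predefined bounds; the bounds, pool size, and parameter choices are hypothetical examples, not values prescribed by this disclosure:

```python
import random

# Illustrative (min, max) bounds for three hypothetical parameters,
# e.g. an LDA topic count and its two Dirichlet priors.
BOUNDS = [(2.0, 100.0), (0.01, 1.0), (0.01, 1.0)]
POOL_SIZE = 20  # assumed pool size

def init_pool(bounds, pool_size):
    """Generate a pool of parameter vectors, one random value per parameter.

    Integer-valued parameters (e.g. the number of topics) would be
    rounded before model generation.
    """
    return [[random.uniform(lo, hi) for lo, hi in bounds]
            for _ in range(pool_size)]

pool = init_pool(BOUNDS, POOL_SIZE)
```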
At step b, a model is generated with each parameter vector in the pool by using the selected unsupervised text mining technique.
In one embodiment, the technique and the model may be the topic modeling and a topic model respectively, while in another embodiment, they may be the clustering and a cluster model.
Moreover, in one embodiment, the model may be a single model, while in another embodiment, it may be a plurality of replicated models generated with the same parameter vector. The average score of the replicated models may be used as the score of the parameter vector with which they were generated, to alleviate the effects of model instability.
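A minimal sketch of this replication scheme follows, where train_model and score_model are hypothetical stand-ins for the chosen unsupervised technique and for steps c through e respectively:

```python
from statistics import mean

N_REPLICAS = 3  # assumed number of replicated models

def score_vector(vector, corpus, n_replicas=N_REPLICAS):
    """Score a parameter vector as the mean score of replicated models.

    train_model and score_model are hypothetical helpers: the former
    trains one model of the selected unsupervised technique with the
    given parameter vector, the latter applies steps c through e.
    """
    scores = []
    for seed in range(n_replicas):
        model = train_model(corpus, vector, seed=seed)  # hypothetical helper
        scores.append(score_model(model))               # hypothetical helper
    return mean(scores)
```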
At step c, the pairwise semantic relatedness scores are calculated between the representative texts in the clusters of the models.
In one embodiment, the cluster may be a topic of a topic model, while in another embodiment, it may be a cluster of a clustering model.
Moreover, in one embodiment, the representative texts may be top words of a topic, while in another embodiment, they may be top n-grams.
Furthermore, in one embodiment, the semantic relatedness score may be calculated by a distributional semantic similarity measure, while in another embodiment, it may be calculated by a knowledge-based semantic similarity measure.
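As one concrete reading of the distributional alternative, the sketch below scores every unordered pair of a cluster's representative texts by cosine similarity of pretrained word embeddings; the embeddings mapping is assumed to be supplied (e.g. loaded from word2vec or GloVe vectors):

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pairwise_relatedness(rep_texts, embeddings):
    """Relatedness score for every unordered pair of representative texts."""
    return [cosine(embeddings[w1], embeddings[w2])
            for w1, w2 in combinations(rep_texts, 2)]
```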
At step d, the scores of the clusters are calculated: for each cluster, the score is obtained by averaging the pairwise scores of its representative texts. In one embodiment, the measure used to average the scores may be the mean, while in another embodiment, it may be the median.
At step e, the scores of the models are calculated: for each model, the score is obtained by averaging the scores of its clusters.
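Steps d and e reduce to nested averaging. A minimal sketch using the mean (the median alternative would substitute statistics.median), reusing pairwise_relatedness from the sketch above:

```python
from statistics import mean

def cluster_score(pair_scores):
    """Average the pairwise relatedness scores of one cluster."""
    return mean(pair_scores)

def model_score(clusters, embeddings):
    """Average the cluster scores of one model.

    clusters is a list of representative-text lists, one per cluster.
    """
    return mean(cluster_score(pairwise_relatedness(texts, embeddings))
                for texts in clusters)
```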
At step f, the scores of the parameter vectors are compared to choose the next candidates. The score of a parameter vector is the score of the model generated with this parameter vector.
In one embodiment, the aim of the comparison may be to select the parameter vectors with higher scores, while in another embodiment, there may also be situations where the parameter vectors with lower scores are selected.

At step g, the parameter pool is updated based on the rules determined by the selected optimization technique. In one embodiment, the rules may be determined by the mutation and crossover strategies of the Differential Evolution algorithm.
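The following is a sketch of steps f and g under the Differential Evolution embodiment, using the classic DE/rand/1/bin mutation and crossover strategy with a higher-is-better selection rule; the differential weight F and crossover rate CR are assumed settings:

```python
import random

F, CR = 0.8, 0.9  # assumed DE control settings

def de_update(pool, scores, bounds, score_fn):
    """One DE/rand/1/bin generation: each trial vector replaces its
    target only if its score is higher (maximization)."""
    new_pool = []
    for i, target in enumerate(pool):
        # Mutation: combine three distinct vectors other than the target.
        a, b, c = random.sample([v for j, v in enumerate(pool) if j != i], 3)
        j_rand = random.randrange(len(target))  # guarantees one mutated gene
        # Crossover: mix mutant and target genes, clipped to the bounds.
        trial = [min(max(a[k] + F * (b[k] - c[k]), bounds[k][0]), bounds[k][1])
                 if (random.random() < CR or k == j_rand) else target[k]
                 for k in range(len(target))]
        # Selection: compare the scores of the trial and target vectors.
        new_pool.append(trial if score_fn(trial) > scores[i] else target)
    return new_pool
```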
At step h, steps b through g are repeated until the termination condition is met. In one embodiment, the termination condition may be a maximum number of iterations, while in another embodiment, it may be a pre-specified threshold on the difference between the best and the worst scores of the parameter vectors.
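Tying the steps together, a minimal sketch of the outer loop with both termination alternatives (an iteration cap and a best-worst score gap threshold, with assumed values) might read:

```python
MAX_ITERS = 50        # assumed iteration cap
GAP_THRESHOLD = 1e-3  # assumed best-worst score gap

def optimize(pool, bounds, score_fn):
    """Repeat steps b through g until a termination condition is met."""
    for _ in range(MAX_ITERS):
        scores = [score_fn(v) for v in pool]
        if max(scores) - min(scores) < GAP_THRESHOLD:
            break
        pool = de_update(pool, scores, bounds, score_fn)
    scores = [score_fn(v) for v in pool]
    return pool[scores.index(max(scores))]  # best parameter vector found
```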
Additionally, in one embodiment, the method given in this specification may be implemented as a distributed system.
Claims
1. A method for optimizing parameters in unsupervised text mining techniques, the method comprising: a) generating a parameter pool composed of a plurality of parameter vectors; b) generating a model for each parameter vector in the parameter pool; c) calculating pairwise semantic relatedness scores between representative texts in clusters of the models; d) calculating scores of the clusters by averaging the scores of the representative texts; e) calculating scores of the models by averaging the scores of the clusters; f) comparing the scores of the parameter vectors, which are the scores of the corresponding models; g) updating the parameter pool; and h) repeating steps b through g until a termination condition is met.
2. The method of Claim 1, wherein the model is a topic model, the cluster is a topic and the representative text is a top word.
3. The method of Claim 1, wherein the model comprises a single model or a plurality of replicated models generated with the same parameter vector, the score of which is calculated by averaging the scores of the replicated models.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/998,810 US20230205799A1 (en) | 2020-05-22 | 2020-05-22 | Parameter optimization in unsupervised text mining |
PCT/TR2020/050440 WO2021236027A1 (en) | 2020-05-22 | 2020-05-22 | Parameter optimization in unsupervised text mining |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/TR2020/050440 WO2021236027A1 (en) | 2020-05-22 | 2020-05-22 | Parameter optimization in unsupervised text mining |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021236027A1 (en) | 2021-11-25 |
Family
ID=78708757
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/TR2020/050440 WO2021236027A1 (en) | 2020-05-22 | 2020-05-22 | Parameter optimization in unsupervised text mining |
Country Status (2)
Country | Link |
---|---|
US (1) | US20230205799A1 (en) |
WO (1) | WO2021236027A1 (en) |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016057984A1 (en) * | 2014-10-10 | 2016-04-14 | San Diego State University Research Foundation | Methods and systems for base map and inference mapping |
US10565444B2 (en) * | 2017-09-07 | 2020-02-18 | International Business Machines Corporation | Using visual features to identify document sections |
US20210150412A1 (en) * | 2019-11-20 | 2021-05-20 | The Regents Of The University Of California | Systems and methods for automated machine learning |
US11526814B2 (en) * | 2020-02-12 | 2022-12-13 | Wipro Limited | System and method for building ensemble models using competitive reinforcement learning |
US20230267283A1 (en) * | 2022-02-24 | 2023-08-24 | Contilt Ltd. | System and method for automatic text anomaly detection |
2020
- 2020-05-22 WO PCT/TR2020/050440 patent/WO2021236027A1/en active Application Filing
- 2020-05-22 US US17/998,810 patent/US20230205799A1/en not_active Abandoned
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040117336A1 (en) * | 2002-12-17 | 2004-06-17 | Jayanta Basak | Interpretable unsupervised decision trees |
US20110208709A1 (en) * | 2007-11-30 | 2011-08-25 | Kinkadee Systems Gmbh | Scalable associative text mining network and method |
US20160299955A1 (en) * | 2015-04-10 | 2016-10-13 | Musigma Business Solutions Pvt. Ltd. | Text mining system and tool |
Also Published As
Publication number | Publication date |
---|---|
US20230205799A1 (en) | 2023-06-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Aghdam et al. | Feature selection using particle swarm optimization in text categorization | |
Radu et al. | Clustering documents using the document to vector model for dimensionality reduction | |
CN111260064A (en) | Knowledge inference method, system and medium based on knowledge graph of meta knowledge | |
CN104991974A (en) | Particle swarm algorithm-based multi-label classification method | |
CN106503731A (en) | A kind of based on conditional mutual information and the unsupervised feature selection approach of K means | |
Chen et al. | Progressive EM for latent tree models and hierarchical topic detection | |
Forsati et al. | Web page clustering using harmony search optimization | |
Murty et al. | Automatic clustering using teaching learning based optimization | |
CN115098690A (en) | Multi-data document classification method and system based on cluster analysis | |
CN110597986A (en) | Text clustering system and method based on fine tuning characteristics | |
CN117973381A (en) | Method for automatically extracting text keywords | |
CN104714977A (en) | Correlating method and device for entities and knowledge base items | |
Yanyun et al. | Advances in research of Fuzzy c-means clustering algorithm | |
He et al. | Improving naive bayes text classifier using smoothing methods | |
WO2021236027A1 (en) | Parameter optimization in unsupervised text mining | |
JP5184464B2 (en) | Word clustering apparatus and method, program, and recording medium storing program | |
Zhu et al. | Swarm clustering algorithm: Let the particles fly for a while | |
Rani et al. | Clustering analysis by Improved Particle Swarm Optimization and K-means algorithm | |
Premalatha et al. | Genetic algorithm for document clustering with simultaneous and ranked mutation | |
Butka et al. | One approach to combination of FCA-based local conceptual models for text analysis—grid-based approach | |
Pun et al. | Unique distance measure approach for K-means (UDMA-Km) clustering algorithm | |
Mirhosseini et al. | Improving n-Similarity problem by genetic algorithm and its application in text document resemblance | |
Chang et al. | Enhancing an evolving tree-based text document visualization model with fuzzy c-means clustering | |
Liu et al. | A genetic semi-supervised fuzzy clustering approach to text classification | |
CN110516068A (en) | A Multidimensional Text Clustering Method Based on Metric Learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20937089; Country of ref document: EP; Kind code of ref document: A1 |
| ENP | Entry into the national phase | Ref document number: 2020937089; Country of ref document: EP; Effective date: 20221222 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 20937089; Country of ref document: EP; Kind code of ref document: A1 |