CN120146046A - A low-resource topic key topic extraction method for online forums - Google Patents
- Publication number
- CN120146046A (application number CN202510615488.7A)
- Authority
- CN
- China
- Prior art keywords
- document
- topic
- resource
- low
- enhanced
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The application belongs to the technical field of natural language processing and text mining, and discloses a key topic extraction method for low-resource topics in online forums. The method performs semantics-preserving data enhancement on the original text through a large language model to generate an enhanced document set; extracts context-aware semantic representations of documents with a pre-trained language model; constructs a learnable topic embedding matrix and computes topic distributions from it; and designs a semantics-aware contrastive learning framework that adopts a dynamic negative-sample screening strategy to optimize topic diversity while ensuring topic consistency with a prior alignment loss. The application innovatively fuses an LLM-based data expansion mechanism with a lightweight topic encoding architecture; through the dual optimization of contrastive learning regularization and prior distribution matching, it effectively addresses the three technical problems of data sparsity, model overfitting and noise sensitivity in low-resource scenarios, and provides an efficient and reliable topic modeling solution for social media public opinion analysis.
Description
Technical Field
The application belongs to the technical field of natural language processing and text mining, and particularly relates to a key topic extraction method for low-resource topics in online forums.
Background
With the popularity of the internet, online forums have become popular platforms where users share daily life and exchange opinions. These user-generated texts provide valuable resources for analyzing public opinion about various social phenomena. However, most sub-forums show low-resource attributes: few active members and limited posts. Despite the small corpus size, the articles users submit touch on important aspects such as mental health and family problems, autism and romantic relationships, and health and medical struggles, all of which involve distress and have potential analytical value.
Traditional topic modeling techniques face three core challenges: a) data sparsity, as low-activity communities often contain fewer than 3,000 documents, leaving neural networks insufficiently trained; b) noise sensitivity, as user-generated text suffers from misspellings and irregular grammar that degrade bag-of-words models; and c) model complexity, as neural topic models built on the VAE architecture carry excessive parameters and easily overfit on small datasets.
Existing solutions have significant drawbacks. Traditional Bayesian approaches such as LDA and BOW-based neural variants such as ECRTM fail to address the challenges of data scarcity and noisy information, while recently developed VAE-based contextualized topic models, such as the contextualized topic model with negative sampling (CTMNeg) and the contextualized word topic model (CWTM), handle data scarcity and model complexity poorly.
Disclosure of Invention
To remedy these technical defects, the application provides a key topic extraction method for low-resource topics in online forums. Through an LLM-enhanced low-resource topic modeling framework, it addresses the three technical problems of data sparsity, noise interference and model complexity; improves topic consistency, topic diversity and computational efficiency; discovers higher-quality topics from low-resource online forums; and is suitable for semantic analysis and public-opinion insight into user-generated content (UGC) on social media platforms.
In order to achieve the above purpose, the application is realized by the following technical scheme:
The application discloses a key topic extraction method for low-resource topics in online forums, which specifically comprises the following steps:
Step 1, acquiring low-resource documents of an online forum, and performing semantics-preserving data enhancement on the acquired low-resource documents through a large language model to generate an enhanced document set;
Step 2, extracting document-level representations of the documents in the enhanced document set using a pre-trained language model;
Step 3, constructing a learnable topic embedding matrix, and calculating document topic distributions through document-topic similarity;
Step 4, designing a semantics-aware contrastive learning framework, screening dynamic negative samples within the same batch of enhanced documents of the contrastive learning framework, calculating the contrastive learning loss, optimizing the topic embedding matrix, and ensuring topic consistency using a prior alignment loss to obtain topic words;
Step 5, expanding the topic words obtained in step 4 using a large language model, and inducing topic insights, so that the low-resource corpus can be better understood.
In a further improvement, step 1 acquires the low-resource documents of the online forum, performs semantics-preserving data enhancement on them through a large language model, and generates the enhanced document set, specifically comprising the following steps:
Step 1.1, constructing a document enhancement prompt template for the large language model to generate results, wherein the template includes semantics-preserving constraint conditions:
a) the minimal semantic variation principle: each text in the generated enhanced document set is required to be semantically as similar as possible to the low-resource document;
b) sentence fluency optimization: spelling errors and irregular grammatical expressions are eliminated;
Step 1.2, iterative generation: calculating, with a pre-trained language model, the embedding similarity between a low-resource document and the corresponding result generated by the large language model;
Step 1.3, when the embedding similarity between the $i$-th low-resource document $d_i$ and its corresponding generated result $\tilde{d}_i$ is below a threshold $\tau$, a screening mechanism is triggered: steps 1.1-1.2 are repeated to regenerate the result $\tilde{d}_i$ for the $i$-th low-resource document $d_i \in D$, where $D$ is the low-resource corpus, and the generated result with the highest embedding similarity in each iteration is retained. After two rounds of data enhancement, the low-resource corpus $D$ yields a first enhanced document set $D^{(1)}=\{d_i^{(1)}\}_{i=1}^{N}$ and a second enhanced document set $D^{(2)}=\{d_i^{(2)}\}_{i=1}^{N}$, where $N$ is the number of documents in the low-resource corpus.
In a further improvement, step 2 extracts the document-level representations of the documents in the enhanced document set, specifically comprising the following steps:
Step 2.1, encoding each enhanced document $x$ with a pre-trained language model to obtain the word embedding set $H_x$ corresponding to the enhanced document, i.e. the embeddings of the words the enhanced document contains:
$H_x = f(x) = \{h_1, h_2, \dots, h_{L_x}\}, \quad h_j \in \mathbb{R}^{d_h}$
where $h_j$ is the word embedding of the $j$-th word in the enhanced document $x$, $f(\cdot)$ is a Transformer encoder, $L_x$ is the number of words in the word set of the enhanced document $x$, and $d_h$ is the dimension size of the hidden variable;
Step 2.2, generating the document-level embedded representation $e_x$ corresponding to the enhanced document by averaging its word embeddings:
$e_x = \frac{1}{L_x} \sum_{j=1}^{L_x} h_j$.
In a further improvement, step 3 constructs a learnable topic embedding matrix and calculates the document topic distribution through document-topic similarity, specifically comprising the following steps:
Step 3.1, constructing a learnable topic embedding matrix $T \in \mathbb{R}^{K \times d_h}$, where $K$ is the number of topics;
Step 3.2, obtaining the topic distribution by a dot product of the document-level embedded representation and the topic embedding matrix; the topic distribution $\theta_x$ corresponding to each enhanced document $x$ is calculated as:
$\theta_x = \mathrm{softmax}(\mathrm{Norm}(e_x T^{\top}))$
where $e_x$ is the document-level representation corresponding to the enhanced document $x$, $\mathrm{softmax}(\cdot)$ yields the topic distribution, and $\mathrm{Norm}(\cdot)$ is an instance normalization operation that normalizes each topic dimension independently.
In a further improvement, step 4 designs a semantics-aware contrastive learning framework, screens dynamic negative samples within the same batch of enhanced documents of the contrastive learning framework, calculates the contrastive learning loss, optimizes the topic embedding matrix, and ensures topic consistency using a prior alignment loss to obtain topic words, specifically comprising the following steps:
Step 4.1, taking the topic distributions of the enhanced documents as a basis, calculating the relative semantic relevance score $r_{ij}$ between the $i$-th enhanced document embedding $e_i$ and the $j$-th enhanced document embedding $e_j$:
$r_{ij} = \cos(e_i, e_j)$
where $\cos(\cdot,\cdot)$ denotes the cosine similarity of the two vectors.
Step 4.2, screening the dynamic negative samples within the same batch of enhanced documents of the contrastive learning framework by the following formula:
$m_{ij} = \mathbb{1}[r_{ij} < \delta]$
where $\delta$ is a threshold hyper-parameter;
Step 4.3, for each low-resource document $d_i$ there is a corresponding first enhanced document $d_i^{(1)}$ and a second enhanced document $d_i^{(2)}$. When training the contrastive learning framework, $B$ documents are randomly sampled from the low-resource corpus $D$, where $B$ is the batch size, yielding a first batch of enhanced documents $\{d_i^{(1)}\}_{i=1}^{B}$ and a second batch of enhanced documents $\{d_i^{(2)}\}_{i=1}^{B}$. The first-batch enhanced document topic distributions $\{\theta_i^{(1)}\}$ and the second-batch enhanced document topic distributions $\{\theta_i^{(2)}\}$ are computed separately; positive sample pairs $(\theta_i^{(1)}, \theta_i^{(2)})$ are constructed, and negative sample pairs within the same batch are screened by the mask $m_{ij}$. The contrastive learning loss $\mathcal{L}_c^{(1)}$ of the first-batch positive and negative sample pairs and the contrastive learning loss $\mathcal{L}_c^{(2)}$ of the second-batch pairs are calculated as:
$\mathcal{L}_c^{(1)} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(\cos(\theta_i^{(1)}, \theta_i^{(2)})/t)}{\exp(\cos(\theta_i^{(1)}, \theta_i^{(2)})/t) + \lambda \sum_{j \neq i} m_{ij} \exp(\cos(\theta_i^{(1)}, \theta_j^{(1)})/t)}$
with $\mathcal{L}_c^{(2)}$ defined symmetrically over the second batch, where $\lambda$ is a hyper-parameter acting as a trade-off factor between positive and negative sample pairs and $t$ is the temperature parameter of the document-level embedded representation. The total contrastive learning loss for the same batch is:
$\mathcal{L}_c = \mathcal{L}_c^{(1)} + \mathcal{L}_c^{(2)}$
Step 4.4, randomly sampling two batches of $K$-dimensional prior topic distributions $\hat{\Theta} = \{\hat{\theta}_i\}_{i=1}^{2B}$, and calculating the prior alignment loss:
$\mathcal{L}_p = \sum_{o=1}^{O} \sum_{d=1}^{K} \left( \mu_d^{(o)}(\Theta) - \mu_d^{(o)}(\hat{\Theta}) \right)^2$
where $\Theta$ denotes the topic distributions inferred from the two batches of enhanced documents, $o$ is the moment order, $K$ is the number of topics currently being calculated, $\theta_i$ is the $i$-th inferred topic distribution, $\hat{\theta}_i$ is the $i$-th prior topic distribution, $\mu_d^{(o)}(\Theta)$ denotes the $o$-th order moment (mean) of the $d$-th topic over the inferred topic distribution domain $\Theta$, and $\mu_d^{(o)}(\hat{\Theta})$ denotes that of the $d$-th topic over the prior distribution domain $\hat{\Theta}$;
Step 4.5, the total loss function $\mathcal{L}$:
$\mathcal{L} = \mathcal{L}_c + \gamma \mathcal{L}_p$
where $\gamma$ is a hyper-parameter.
In a further improvement, the training process of the semantics-aware contrastive learning framework designed in step 4 comprises the following steps:
Step T1, initializing the topic embedding matrix $T$, configuring the hidden-variable dimension size $d_h$ together with the pre-trained language model parameters and the non-zero hyper-parameters, and inputting the low-resource documents into the large language model to generate the enhanced document set;
Step T2, randomly drawing batches from the enhanced document set, inputting them into the contrastive learning framework, and sampling priors in the Dirichlet parameter space to obtain the document topic distributions of the enhanced document set;
Step T3, comprehensively evaluating the total contrastive learning loss $\mathcal{L}_c$ of the same batch and the prior alignment loss $\mathcal{L}_p$, cyclically executing the topic embedding matrix parameter update, and stopping training when the semantics-aware contrastive learning framework converges.
In a further improvement, the large language model expansion and topic-insight induction method of step 5 specifically comprises: giving, in the prompt, the name of the low-resource document collection and its corresponding $K$ topics together with the 10 topic words corresponding to each topic, and requiring the large language model to summarize each topic to generate human-readable topic insights.
The beneficial effects of the application are as follows:
The application designs a semantics-aware contrastive learning framework, which for the first time establishes a neural topic model for low-resource corpora.
The application designs a semantics-aware text expansion and corpus expansion scheme based on a large language model, which solves the data scarcity problem of low-resource corpora.
The application incorporates the semantic knowledge in Transformer language models into the modeling process to address the challenges of model complexity and noisy information.
In summary, the method has been successfully applied to low-resource text analysis of online forums, realizing a complete technical path from data enhancement to multi-level topic analysis. Compared with traditional topic models, it shows clear advantages in topic semantic consistency and fine-grained knowledge discovery, providing a new solution for text analysis in low-resource environments.
Drawings
Fig. 1 is a flow chart of the present application.
FIG. 2 is a diagram of a model architecture for implementing the present application.
FIG. 3 is a line graph comparing the four indexes of the different comparison models and the present application.
Detailed Description
Embodiments of the application are disclosed in the drawings, and for purposes of explanation, numerous practical details are set forth in the following description. However, it should be understood that these practical details are not to be taken as limiting the application. That is, in some embodiments of the application, these practical details are unnecessary. Moreover, for the purpose of simplifying the drawings, some conventional structures and components are shown in the drawings in a simplified schematic manner.
As shown in fig. 1, the key topic extraction method for low-resource topics in online forums of the application specifically comprises the following steps:
Step 1, acquiring low-resource documents of an online forum, and performing semantics-preserving data enhancement on them through a large language model to generate an enhanced document set, as shown in part (a) of fig. 2, specifically:
Step 1.1, constructing a document enhancement prompt template for the large language model to generate results, wherein the template includes semantics-preserving constraint conditions:
a) the minimal semantic variation principle: each text in the generated enhanced document set is required to be semantically as similar as possible to the low-resource document;
b) sentence fluency optimization: spelling errors and irregular grammatical expressions are eliminated;
The prompt instantiates the two constraints above: it instructs the large language model to rewrite the input document with minimal semantic change while correcting spelling errors and irregular grammar; an illustrative form of the template appears in the sketch after step 1.3.
Step 1.2, iterative generation: using a pre-trained language model (Sentence-BERT in this embodiment), calculating the embedding similarity between a low-resource document and the corresponding result generated by the large language model; cosine similarity is specifically used as the measure;
Step 1.3, when the embedding similarity between the $i$-th low-resource document $d_i$ and its corresponding generated result $\tilde{d}_i$ is below a threshold $\tau$, a screening mechanism is triggered: steps 1.1-1.2 are repeated to regenerate the result $\tilde{d}_i$ for the $i$-th low-resource document $d_i \in D$, where $D$ is the low-resource corpus and $\tau$ is a hyper-parameter; the generated result with the highest embedding similarity in each iteration is retained. After two rounds of data enhancement, the low-resource corpus $D$ yields a first enhanced document set $D^{(1)}=\{d_i^{(1)}\}_{i=1}^{N}$ and a second enhanced document set $D^{(2)}=\{d_i^{(2)}\}_{i=1}^{N}$, where $N$ is the number of documents in the low-resource corpus.
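A minimal sketch of this enhancement loop, assuming an LLM wrapped as a prompt-to-text callable and the sentence-transformers library; the prompt wording, the model name and the threshold value are illustrative assumptions, not the verbatim template of the filing:

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the Sentence-BERT encoder

# Illustrative prompt enforcing the two constraints of step 1.1:
# minimal semantic variation and fluency/spelling correction.
PROMPT = ("Paraphrase the following forum post. Keep its meaning as close to the "
          "original as possible, and fix spelling errors and ungrammatical phrasing.\n\n{doc}")

def enhance(doc, llm_generate, tau=0.8, max_iters=5):
    """Regenerate until the paraphrase's embedding similarity to `doc` reaches tau,
    keeping the best candidate seen so far (steps 1.2-1.3). `llm_generate` is any
    prompt -> text callable wrapping the large language model."""
    e_doc = encoder.encode(doc, convert_to_tensor=True)
    best, best_sim = None, -1.0
    for _ in range(max_iters):
        cand = llm_generate(PROMPT.format(doc=doc))
        sim = util.cos_sim(e_doc, encoder.encode(cand, convert_to_tensor=True)).item()
        if sim > best_sim:
            best, best_sim = cand, sim
        if sim >= tau:               # screening mechanism passes
            break
    return best

# Two independent passes give the first and second enhanced document sets:
# D1 = [enhance(d, llm_generate) for d in corpus]
# D2 = [enhance(d, llm_generate) for d in corpus]
```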
Step 2, extracting the document-level representations of the documents in the enhanced document set using a pre-trained language model, specifically comprising the following steps:
Step 2.1, encoding each enhanced document $x$ with the pre-trained language model Sentence-BERT to obtain the word embedding set $H_x$ corresponding to the enhanced document, i.e. the embeddings of the words the enhanced document contains:
$H_x = f(x) = \{h_1, h_2, \dots, h_{L_x}\}, \quad h_j \in \mathbb{R}^{d_h}$
where $h_j$ is the word embedding of the $j$-th word in the enhanced document $x$, $f(\cdot)$ is a Transformer encoder, $L_x$ is the number of words in the word set of the enhanced document $x$, and $d_h$ is the hidden-variable dimension size, set to 768 in Sentence-BERT;
Step 2.2, generating the document-level embedded representation $e_x$ corresponding to the enhanced document by averaging its word embeddings:
$e_x = \frac{1}{L_x} \sum_{j=1}^{L_x} h_j$.
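A sketch of step 2 with sentence-transformers; mean pooling over the word embeddings is assumed here, which matches Sentence-BERT's default pooling, and the checkpoint name is illustrative:

```python
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer("all-mpnet-base-v2")  # a 768-dim Sentence-BERT checkpoint

def doc_embeddings(docs):
    """Return a (len(docs), 768) tensor of document-level representations e_x.
    encode() already mean-pools the token (word) embeddings h_1..h_L."""
    return sbert.encode(docs, convert_to_tensor=True)
```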
Step 3, constructing a learnable topic embedding matrix and calculating the document topic distribution through document-topic similarity, as shown in part (b) of fig. 2, specifically comprising the following steps:
Step 3.1, constructing a learnable topic embedding matrix $T \in \mathbb{R}^{K \times d_h}$; this matrix is randomly initialized and remains a learnable parameter throughout the training process, where $K$ is the number of topics;
Step 3.2, obtaining the topic distribution by a dot product of the document-level embedded representation and the topic embedding matrix; the topic distribution $\theta_x$ corresponding to each enhanced document $x$ is calculated as:
$\theta_x = \mathrm{softmax}(\mathrm{Norm}(e_x T^{\top}))$
where $e_x$ is the document-level representation corresponding to the enhanced document $x$, $\mathrm{softmax}(\cdot)$ yields the topic distribution, and $\mathrm{Norm}(\cdot)$ is an instance normalization operation that normalizes each topic dimension independently.
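A sketch of step 3 as a PyTorch module; implementing the per-topic normalization with `BatchNorm1d` (which normalizes each topic dimension independently across the batch) is one reading of the formula and is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicHead(nn.Module):
    """Learnable topic embedding matrix T (K x d_h); theta = softmax(Norm(e @ T^T))."""
    def __init__(self, num_topics: int, hidden_dim: int = 768):
        super().__init__()
        self.T = nn.Parameter(torch.randn(num_topics, hidden_dim) * 0.02)  # random init
        self.norm = nn.BatchNorm1d(num_topics, affine=False)  # per-topic normalization

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        logits = e @ self.T.t()                    # (B, K) document-topic similarities
        return F.softmax(self.norm(logits), dim=-1)  # (B, K) topic distributions
```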
Step 4, designing a semantics-aware contrastive learning framework, implementing a dynamic negative-sample strategy within the same batch of enhanced documents of the contrastive learning framework, calculating the contrastive learning loss, optimizing the topic embedding matrix, and ensuring topic consistency using a prior alignment loss to obtain topic words; and Step 5, expanding the topic words obtained in step 4 using a large language model and inducing topic insights to help better understand the low-resource corpus. Step 4 specifically comprises the following steps:
Step 4.1, taking the topic distributions of the enhanced documents as a basis, calculating the relative semantic relevance score $r_{ij}$ between the $i$-th enhanced document embedding $e_i$ and the $j$-th enhanced document embedding $e_j$:
$r_{ij} = \cos(e_i, e_j)$
where $\cos(\cdot,\cdot)$ denotes the cosine similarity of the two vectors.
Step 4.2, screening the dynamic negative samples within the same batch of enhanced documents of the contrastive learning framework by the following formula:
$m_{ij} = \mathbb{1}[r_{ij} < \delta]$
where $\delta$ is a threshold hyper-parameter, so that negative samples with high semantic similarity to the original sentence are masked out as false negatives.
Step 4.3, for each low-resource document $d_i$ there is a corresponding first enhanced document $d_i^{(1)}$ and a second enhanced document $d_i^{(2)}$. When training the contrastive learning framework, $B$ documents are randomly sampled from the low-resource corpus $D$, where $B$ is the batch size, yielding a first batch of enhanced documents $\{d_i^{(1)}\}_{i=1}^{B}$ and a second batch of enhanced documents $\{d_i^{(2)}\}_{i=1}^{B}$. The first-batch enhanced document topic distributions $\{\theta_i^{(1)}\}$ and the second-batch enhanced document topic distributions $\{\theta_i^{(2)}\}$ are computed separately; positive sample pairs $(\theta_i^{(1)}, \theta_i^{(2)})$ are constructed, and negative sample pairs within the same batch are screened by the mask $m_{ij}$. The contrastive learning loss $\mathcal{L}_c^{(1)}$ of the first-batch positive and negative sample pairs and the contrastive learning loss $\mathcal{L}_c^{(2)}$ of the second-batch pairs are calculated as:
$\mathcal{L}_c^{(1)} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(\cos(\theta_i^{(1)}, \theta_i^{(2)})/t)}{\exp(\cos(\theta_i^{(1)}, \theta_i^{(2)})/t) + \lambda \sum_{j \neq i} m_{ij} \exp(\cos(\theta_i^{(1)}, \theta_j^{(1)})/t)}$
with $\mathcal{L}_c^{(2)}$ defined symmetrically over the second batch, where $\lambda$ is a hyper-parameter acting as a trade-off factor between positive and negative sample pairs and $t$ is the temperature parameter of the document-level embedded representation. The total contrastive learning loss for the same batch is:
$\mathcal{L}_c = \mathcal{L}_c^{(1)} + \mathcal{L}_c^{(2)}$
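A sketch of steps 4.1-4.3 in PyTorch; the InfoNCE-style form, the weighting of masked negatives by `lam`, and the default hyper-parameter values are assumptions consistent with the description (positive pairs from the two enhanced views, in-batch negatives dropped when the relevance score reaches the threshold delta):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(theta1, theta2, e1, e2, delta=0.9, t=0.07, lam=1.0):
    """theta1/theta2: (B, K) topic distributions of the two enhanced batches;
    e1/e2: (B, d_h) document embeddings used for the relevance scores r_ij."""
    def one_side(ta, tb, e):
        sim = F.cosine_similarity(ta.unsqueeze(1), ta.unsqueeze(0), dim=-1) / t  # in-batch
        pos = F.cosine_similarity(ta, tb, dim=-1) / t                            # positives
        r = F.cosine_similarity(e.unsqueeze(1), e.unsqueeze(0), dim=-1)          # r_ij
        mask = (r < delta).float()           # dynamic negatives: mask near-duplicates
        mask.fill_diagonal_(0.0)             # a document is never its own negative
        neg = (mask * torch.exp(sim)).sum(dim=1)
        return -(pos - torch.log(torch.exp(pos) + lam * neg)).mean()
    return one_side(theta1, theta2, e1) + one_side(theta2, theta1, e2)
```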
Step 4.4, randomly sampling two batches of $K$-dimensional prior topic distributions $\hat{\Theta} = \{\hat{\theta}_i\}_{i=1}^{2B}$ from a Dirichlet($\alpha$) distribution, and calculating the prior alignment loss:
$\mathcal{L}_p = \sum_{o=1}^{O} \sum_{d=1}^{K} \left( \mu_d^{(o)}(\Theta) - \mu_d^{(o)}(\hat{\Theta}) \right)^2$
where $\Theta$ denotes the topic distributions inferred from the two batches of enhanced documents, $o$ is the moment order, $K$ is the number of topics currently being calculated, $\theta_i$ is the $i$-th inferred topic distribution, $\hat{\theta}_i$ is the $i$-th prior topic distribution, $\mu_d^{(o)}(\Theta)$ denotes the $o$-th order moment (mean) of the $d$-th topic over the inferred topic distribution domain $\Theta$, and $\mu_d^{(o)}(\hat{\Theta})$ denotes that of the $d$-th topic over the prior distribution domain $\hat{\Theta}$;
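A sketch of the prior alignment loss as moment matching between inferred topic distributions and Dirichlet samples; matching the first two moments per topic dimension and the concentration value 0.02 are assumed readings, not values stated in the filing:

```python
import torch

def prior_alignment_loss(theta, alpha=0.02, num_moments=2):
    """theta: (2B, K) topic distributions inferred from the two enhanced batches.
    Samples an equally sized batch from Dirichlet(alpha) and matches, for each
    topic dimension d and moment order o, the batch mean of theta[:, d] ** o."""
    n, k = theta.shape
    prior = torch.distributions.Dirichlet(torch.full((k,), alpha)).sample((n,))
    loss = theta.new_zeros(())
    for o in range(1, num_moments + 1):
        mu_inf = (theta ** o).mean(dim=0)   # o-th moment of each topic, inferred
        mu_pri = (prior ** o).mean(dim=0)   # o-th moment of each topic, prior
        loss = loss + ((mu_inf - mu_pri) ** 2).sum()
    return loss
```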
Step 4.5, the total loss function $\mathcal{L}$:
$\mathcal{L} = \mathcal{L}_c + \gamma \mathcal{L}_p$
where $\gamma$ is a hyper-parameter.
The large language model expansion and topic-insight induction method of step 5 is specifically as follows: the name of the post collection and its corresponding $K$ topics are given in the prompt, together with the 10 topic words corresponding to each topic, and the large language model is required to summarize each topic to generate a human-readable topic insight; an illustrative form of the prompt is sketched below.
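A sketch of the step-5 insight prompt; the wording is an illustrative assumption, not the verbatim template of the filing:

```python
def insight_prompt(forum_name, topics):
    """topics: list of 10-word lists, one per topic (K topics in total)."""
    lines = [f"Topic {i + 1}: {', '.join(words)}" for i, words in enumerate(topics)]
    return (f"The following topics were extracted from the online forum "
            f"'{forum_name}'. For each topic, summarize its 10 topic words into "
            f"one human-readable insight sentence.\n" + "\n".join(lines))
```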
The semantics-aware contrastive learning framework is trained as follows:
Step T1, initializing the topic embedding matrix $T$, configuring the hidden-variable dimension size $d_h$ together with the pre-trained language model parameters and the non-zero hyper-parameters, and inputting the low-resource documents into the large language model to generate the enhanced document set;
Step T2, randomly drawing batches from the enhanced document set, inputting them into the contrastive learning framework, and sampling priors in the Dirichlet parameter space to obtain the document topic distributions of the enhanced document set;
Step T3, comprehensively evaluating the total contrastive learning loss $\mathcal{L}_c$ of the same batch and the prior alignment loss $\mathcal{L}_p$, cyclically executing the topic embedding matrix parameter update, and stopping training when the semantics-aware contrastive learning framework converges.
The training algorithm is as follows:
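A minimal sketch of steps T1-T3, reusing `TopicHead`, `contrastive_loss` and `prior_alignment_loss` from the sketches above; the optimizer choice, learning rate, epoch count and loss weight are illustrative assumptions:

```python
import torch

def train(doc_embs1, doc_embs2, num_topics=50, gamma=1.0, epochs=200, batch_size=64):
    """doc_embs1/doc_embs2: (N, 768) embeddings of the two enhanced document sets."""
    head = TopicHead(num_topics)                       # step T1: initialize T
    opt = torch.optim.Adam(head.parameters(), lr=2e-3)
    n = doc_embs1.size(0)
    for _ in range(epochs):
        perm = torch.randperm(n)
        for s in range(0, n, batch_size):              # step T2: random batches
            idx = perm[s:s + batch_size]
            e1, e2 = doc_embs1[idx], doc_embs2[idx]
            th1, th2 = head(e1), head(e2)
            loss_c = contrastive_loss(th1, th2, e1, e2)
            loss_p = prior_alignment_loss(torch.cat([th1, th2], dim=0))
            loss = loss_c + gamma * loss_p             # step T3: total loss, update T
            opt.zero_grad(); loss.backward(); opt.step()
    return head
```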
To verify the efficacy of the application, a sub-forum named "what puzzles you" is selected to construct the experimental dataset, and a systematic evaluation is carried out on four topic coherence indexes. The means of the core evaluation indexes are shown in Table 1.
TABLE 1
Model | CP | CA | NPMI | UCI
---|---|---|---|---
The application | 0.2301 | 0.1651 | 0.02136 | 0.2558
Best comparison model | 0.0163 | 0.1413 | 0.0046 | 0.3225
Three representative methods are selected as comparison references: comparison model 1 is the LDA method, comparison model 2 is the vONT method, and comparison model 3 is the LLMTopic method. As shown in FIG. 3, the application achieves CP: 0.2301, CA: 0.1651, NPMI: 0.02136 and UCI: 0.2558, all higher than the comparison models, whose best respective scores are CP: 0.0163, CA: 0.1413, NPMI: 0.0046 and UCI: 0.3225.
CP, CA, NPMI (Normalized Pointwise Mutual Information) and UCI are four public evaluation indexes that quantify topic modeling quality along dimensions such as semantic relevance and vocabulary distribution.
The present exemplary embodiment only explains a feasible implementation path of the technical principle and does not substantially limit the claims. On the premise of strictly following the innovative boundary defined by the claims, parameter adaptation, architecture-equivalent transformation or application-scenario extension performed by technical researchers on the technical features disclosed in the specification shall be considered within the scope of the creative contribution of the original patent and shall legally enjoy patent protection.
Claims (7)
1. A key topic extraction method for low-resource topics in an online forum, characterized by comprising the following steps:
Step 1, acquiring low-resource documents of an online forum, and performing semantics-preserving data enhancement on the acquired low-resource documents through a large language model to generate an enhanced document set;
Step 2, extracting document-level representations of the documents in the enhanced document set using a pre-trained language model;
Step 3, constructing a learnable topic embedding matrix, and calculating document topic distributions through document-topic similarity;
Step 4, designing a semantics-aware contrastive learning framework, implementing a dynamic negative-sample strategy within the same batch of enhanced documents of the contrastive learning framework, calculating the contrastive learning loss, optimizing the topic embedding matrix, and ensuring topic consistency using a prior alignment loss to obtain topic words;
Step 5, expanding the topic words obtained in step 4 using a large language model, and inducing topic insights, so that the low-resource corpus can be better understood.
2. The key topic extraction method for low-resource topics in an online forum according to claim 1, characterized in that step 1 acquires the low-resource documents of the online forum, performs semantics-preserving data enhancement on them through a large language model, and generates the enhanced document set, specifically comprising the following steps:
Step 1.1, constructing a document enhancement prompt template for the large language model to generate results, wherein the template includes semantics-preserving constraint conditions:
a) the minimal semantic variation principle: each text in the generated enhanced document set is required to be semantically similar to the low-resource document;
b) sentence fluency optimization: spelling errors and irregular grammatical expressions are eliminated;
Step 1.2, iterative generation: calculating, with a pre-trained language model, the embedding similarity between a low-resource document and the corresponding result generated by the large language model;
Step 1.3, when the embedding similarity between the $i$-th low-resource document $d_i$ and its corresponding generated result $\tilde{d}_i$ is below a threshold $\tau$, a screening mechanism is triggered: steps 1.1-1.2 are repeated to regenerate the result $\tilde{d}_i$ for the $i$-th low-resource document $d_i \in D$, where $D$ is the low-resource corpus, and the generated result with the highest embedding similarity in each iteration is retained; after two rounds of data enhancement, the low-resource corpus $D$ yields a first enhanced document set $D^{(1)}=\{d_i^{(1)}\}_{i=1}^{N}$ and a second enhanced document set $D^{(2)}=\{d_i^{(2)}\}_{i=1}^{N}$, where $N$ is the number of documents in the low-resource corpus.
3. The key topic extraction method for low-resource topics in an online forum according to claim 2, characterized in that step 2 extracts the document-level representations of the documents in the enhanced document set, specifically comprising the following steps:
Step 2.1, encoding each enhanced document $x$ with a pre-trained language model to obtain the word embedding set $H_x$ corresponding to the enhanced document, i.e. the embeddings of the words the enhanced document contains:
$H_x = f(x) = \{h_1, h_2, \dots, h_{L_x}\}, \quad h_j \in \mathbb{R}^{d_h}$;
where $h_j$ is the word embedding of the $j$-th word in the enhanced document $x$, $f(\cdot)$ is a Transformer encoder, $L_x$ is the number of words in the word set of the enhanced document $x$, and $d_h$ is the dimension size of the hidden variable;
Step 2.2, generating the document-level embedded representation $e_x$ corresponding to the enhanced document by averaging its word embeddings:
$e_x = \frac{1}{L_x} \sum_{j=1}^{L_x} h_j$.
4. The key topic extraction method for low-resource topics in an online forum according to claim 3, characterized in that step 3 constructs a learnable topic embedding matrix and calculates the document topic distribution through document-topic similarity, specifically comprising the following steps:
Step 3.1, constructing a learnable topic embedding matrix $T \in \mathbb{R}^{K \times d_h}$, where $K$ is the number of topics;
Step 3.2, obtaining the topic distribution by a dot product of the document-level embedded representation and the topic embedding matrix, the topic distribution $\theta_x$ corresponding to each enhanced document $x$ being calculated as:
$\theta_x = \mathrm{softmax}(\mathrm{Norm}(e_x T^{\top}))$;
where $e_x$ is the document-level representation corresponding to the enhanced document $x$, $\mathrm{softmax}(\cdot)$ yields the topic distribution, and $\mathrm{Norm}(\cdot)$ is an instance normalization operation.
5. The key topic extraction method for low-resource topics in an online forum according to claim 4, characterized in that step 4 designs a semantics-aware contrastive learning framework, screens dynamic negative samples within the same batch of enhanced documents of the contrastive learning framework, calculates the contrastive learning loss, optimizes the topic embedding matrix, and ensures topic consistency using a prior alignment loss to obtain topic words, specifically comprising the following steps:
Step 4.1, taking the topic distributions of the enhanced documents as a basis, calculating the relative semantic relevance score $r_{ij}$ between the document-level embedded representation $e_i$ corresponding to the $i$-th enhanced document and the document-level embedded representation $e_j$ corresponding to the $j$-th enhanced document:
$r_{ij} = \cos(e_i, e_j)$;
where $\cos(\cdot,\cdot)$ denotes the cosine similarity of the two vectors;
Step 4.2, screening the dynamic negative samples within the same batch of enhanced documents of the contrastive learning framework by the following formula:
$m_{ij} = \mathbb{1}[r_{ij} < \delta]$;
where $\delta$ is a threshold hyper-parameter;
Step 4.3, for each low-resource document $d_i$ there is a corresponding first enhanced document $d_i^{(1)}$ and a second enhanced document $d_i^{(2)}$; when training the contrastive learning framework, $B$ documents are randomly sampled from the low-resource corpus $D$, $B$ being the batch size, yielding a first batch of enhanced documents $\{d_i^{(1)}\}_{i=1}^{B}$ and a second batch of enhanced documents $\{d_i^{(2)}\}_{i=1}^{B}$; the first-batch enhanced document topic distributions $\{\theta_i^{(1)}\}$ and the second-batch enhanced document topic distributions $\{\theta_i^{(2)}\}$ are computed separately, positive sample pairs $(\theta_i^{(1)}, \theta_i^{(2)})$ are constructed, and negative sample pairs within the same batch are screened by the mask $m_{ij}$; the contrastive learning loss $\mathcal{L}_c^{(1)}$ of the first-batch positive and negative sample pairs and the contrastive learning loss $\mathcal{L}_c^{(2)}$ of the second-batch pairs are calculated as:
$\mathcal{L}_c^{(1)} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(\cos(\theta_i^{(1)}, \theta_i^{(2)})/t)}{\exp(\cos(\theta_i^{(1)}, \theta_i^{(2)})/t) + \lambda \sum_{j \neq i} m_{ij} \exp(\cos(\theta_i^{(1)}, \theta_j^{(1)})/t)}$;
with $\mathcal{L}_c^{(2)}$ defined symmetrically over the second batch, where $\lambda$ is a hyper-parameter acting as a trade-off factor between positive and negative sample pairs and $t$ is the temperature parameter of the document-level embedded representation; the total contrastive learning loss for the same batch is:
$\mathcal{L}_c = \mathcal{L}_c^{(1)} + \mathcal{L}_c^{(2)}$;
Step 4.4, randomly sampling two batches of $K$-dimensional prior topic distributions $\hat{\Theta} = \{\hat{\theta}_i\}_{i=1}^{2B}$, and calculating the prior alignment loss:
$\mathcal{L}_p = \sum_{o=1}^{O} \sum_{d=1}^{K} \left( \mu_d^{(o)}(\Theta) - \mu_d^{(o)}(\hat{\Theta}) \right)^2$;
where $\Theta$ denotes the topic distributions inferred from the two batches of enhanced documents, $o$ is the moment order, $K$ is the number of topics currently being calculated, $\theta_i$ is the $i$-th inferred topic distribution, $\hat{\theta}_i$ is the $i$-th prior topic distribution, $\mu_d^{(o)}(\Theta)$ denotes the $o$-th order moment (mean) of the $d$-th topic over the inferred topic distribution domain $\Theta$, and $\mu_d^{(o)}(\hat{\Theta})$ denotes that of the $d$-th topic over the prior distribution domain $\hat{\Theta}$;
Step 4.5, the total loss function $\mathcal{L}$:
$\mathcal{L} = \mathcal{L}_c + \gamma \mathcal{L}_p$;
where $\gamma$ is a hyper-parameter.
6. The key topic extraction method for low-resource topics in an online forum according to claim 5, characterized in that the training process of the semantics-aware contrastive learning framework designed in step 4 comprises the following steps:
Step T1, initializing the topic embedding matrix $T$, configuring the hidden-variable dimension size $d_h$ together with the pre-trained language model parameters and the non-zero hyper-parameters, and inputting the low-resource documents into the large language model to generate the enhanced document set;
Step T2, randomly drawing batches from the enhanced document set, inputting them into the contrastive learning framework, and sampling priors in the Dirichlet parameter space to obtain the document topic distributions of the enhanced document set;
Step T3, comprehensively evaluating the total contrastive learning loss $\mathcal{L}_c$ of the same batch and the prior alignment loss $\mathcal{L}_p$, cyclically executing the topic embedding matrix parameter update, and stopping training when the semantics-aware contrastive learning framework converges.
7. The key topic extraction method for low-resource topics in an online forum according to claim 6, characterized in that the large language model expansion and topic-insight induction method of step 5 specifically comprises: giving, in the prompt, the name of the low-resource document collection and its corresponding $K$ topics together with the 10 topic words corresponding to each topic, and requiring the large language model to summarize each topic to generate human-readable topic insights.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202510615488.7A CN120146046A (en) | 2025-05-14 | 2025-05-14 | A low-resource topic key topic extraction method for online forums |
Publications (1)
Publication Number | Publication Date |
---|---|
CN120146046A (en) | 2025-06-13
Family
ID=95952626
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202510615488.7A Pending CN120146046A (en) | 2025-05-14 | 2025-05-14 | A low-resource topic key topic extraction method for online forums |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN120146046A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220309254A1 (en) * | 2021-03-25 | 2022-09-29 | NEC Laboratories Europe GmbH | Open information extraction from low resource languages |
US20230169271A1 (en) * | 2021-11-30 | 2023-06-01 | Adobe Inc. | System and methods for neural topic modeling using topic attention networks |
CN117271761A (en) * | 2023-09-01 | 2023-12-22 | 东北电力大学 | Cross-language abstract summarization method based on robust self-learning strategy in low resource scenarios |
CN118332091A (en) * | 2024-06-06 | 2024-07-12 | 中电信数智科技有限公司 | Ancient book knowledge base intelligent question-answering method, device and equipment based on large model technology |
CN118897984A (en) * | 2024-07-18 | 2024-11-05 | 浙江财经大学 | Online human activity recognition method based on robust reinforcement learning |
CN119360278A (en) * | 2024-10-23 | 2025-01-24 | 上海歆仁信息科技有限公司 | A dangerous behavior identification and early warning method based on multimodal analysis |
Non-Patent Citations (1)
Title |
---|
He Tao; Wang Guifang; Yang Meini; Guo Kaimo: "A precise retrieval expression construction method based on word-embedding semantics", Modern Information (现代情报), no. 11, 12 November 2018 (2018-11-12) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination |