CN113519001A

CN113519001A - Generating common sense interpretations using language models

Info

Publication number: CN113519001A
Application number: CN202080018455.XA
Authority: CN
Inventors: N. Rajani; B. McCann
Original assignee: Salesforce com Inc
Current assignee: Salesforce Inc
Priority date: 2019-03-04
Filing date: 2020-02-24
Publication date: 2021-10-19
Anticipated expiration: 2040-02-24
Also published as: JP2022522712A; US11366969B2; CN113519001B; JP7158598B2; US20200285704A1; WO2020180518A1; EP3935573A1

Abstract

According to some embodiments, systems and methods are provided for developing or providing common sense auto-generated explanations (CAGE), that is, interpretations of the reasoning used by an artificial intelligence, neural network, or deep learning model to make a prediction. In some embodiments, the systems and methods generate such interpretations using supervised fine-tuning of a language model (LM). These interpretations can then be used for downstream classification.

Description

Generating common sense interpretations using language models

Inventors: N. Rajani and B. McCann

RELATED APPLICATIONS

This application claims priority from U.S. provisional patent application No. 62/813,697, filed on March 4, 2019, and U.S. non-provisional patent application No. 16/393,801, filed on April 24, 2019, the entire contents of which are incorporated herein by reference.

Copyright notice

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the patent and trademark office patent file or records, but otherwise reserves all copyright rights whatsoever.

Technical Field

The disclosure relates generally to natural language processing, and more particularly to leveraging language models to generate common sense interpretations of reasoning or rationalization.

Background

Artificial intelligence implemented using neural networks and deep learning models has shown great promise as a technique for automatically analyzing real-world information with human-like accuracy. However, artificial intelligence or deep learning models are generally unable to explain the reasoning behind, or the rationalization of, their predictions, or to explain the extent to which that reasoning or rationalization is based on common sense knowledge. This makes it difficult for humans to understand and trust such models.

Accordingly, it would be advantageous to have a system and method that provides, implements, or improves on common sense reasoning or rationalization in artificial intelligence or deep learning models, and additionally generates or provides an explanation of the reasoning or rationalization.

Drawings

Fig. 1 is a simplified diagram of a computing device, according to some embodiments.

Fig. 2 illustrates an example of questions, answers, and human generated interpretations that may be included in a common sense interpretation (CoS-E) dataset, according to some embodiments.

Fig. 3 illustrates an example distribution of interpretations collected in a CoS-E dataset, according to some embodiments.

Fig. 4 illustrates an example time step for training a common sense auto-generated explanations (CAGE) language model to generate an interpretation from a CoS-E dataset, according to some embodiments.

Fig. 5 is a simplified diagram of a language module or model according to some embodiments.

Fig. 6 illustrates example time steps for generating a predictive classification model or module according to some embodiments.

Fig. 7 is a simplified diagram of a classification model or module according to some embodiments.

Fig. 8 is a simplified diagram illustrating a system for generating a common sense interpretation of inferences through artificial intelligence or deep learning models, according to some embodiments.

Fig. 9 is a simplified diagram of a method of generating a common sense interpretation of inference through artificial intelligence or deep learning models, according to some embodiments.

Fig. 10 is a simplified diagram illustrating a system for generating a rationalized common sense interpretation through artificial intelligence or deep learning models, according to some embodiments.

Fig. 11 is a simplified diagram of a method of generating a rationalized common sense interpretation through artificial intelligence or deep learning models, according to some embodiments.

Fig. 12 illustrates a table showing an example set of reasoning and rationalization samples from common sense QA (CommonsenseQA), CoS-E, and CAGE, according to some embodiments.

Fig. 13 illustrates a table comparing results.

In the drawings, elements having the same reference number have the same or similar function.

Detailed Description

The description and drawings illustrating various aspects, embodiments, implementations, or applications should not be taken as limiting; the claims define the claimed invention. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of the description and claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail, as they would be known to one skilled in the art. The same numbers in two or more drawings identify the same or similar elements.

In the description, specific details are set forth describing some embodiments according to the disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art, that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are intended to be illustrative rather than restrictive. Those skilled in the art will recognize that, although not specifically described herein, other elements are within the scope and spirit of the disclosure. Furthermore, to avoid unnecessary repetition, one or more features shown and described in connection with one embodiment may be incorporated into other embodiments unless specifically described otherwise or the one or more features would render the embodiments inoperative.

SUMMARY

Artificial intelligence implemented using neural networks and deep learning models has shown great promise as a technique for automatically analyzing real-world information with human-like accuracy. Typically, such neural networks and deep learning models receive input information and make predictions based on the input information. However, these models may face the challenge of applying common sense reasoning or rationalization to develop or explain their predictions. Common sense reasoning or rationalization is a challenging task for modern machine learning methods. Artificial intelligence or deep learning models often fail to explain the reasoning or rationalization (common sense or otherwise) behind their predictions, making it difficult for humans to understand and trust such models.

Applying common sense reasoning or rationalization, and explaining it, helps make deep neural networks more transparent to humans and builds trust.

According to some embodiments, the disclosure provides systems and methods that use a pre-trained language model to generate interpretations useful for common sense reasoning or rationalization. In some embodiments, common sense auto-generated explanations (CAGE) is provided as a framework for generating interpretations for common sense question answering (common sense QA, or CommonsenseQA). Common sense QA is a multiple-choice question-answering dataset proposed for developing natural language processing (NLP) models with common sense reasoning capabilities, as described in detail in Talmor et al., "CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge," arXiv:1811.00937v2 (November 2, 2018), which is incorporated herein by reference. There are multiple versions of common sense QA (e.g., v1.0, v1.1), any of which may be used in one or more embodiments. NLP is one class of problems to which neural networks may be applied. NLP can be used to instill in new neural networks an understanding of individual words and phrases.

In some embodiments, human interpretations of common sense reasoning are generated and built on top of, or added to, the common sense QA corpus as common sense explanations (CoS-E). In some embodiments, CoS-E contains human interpretations in the form of both open-ended natural language interpretations and highlighted span annotations that represent words selected by humans as important for predicting the correct answer.

According to some embodiments, the task of common sense reasoning is broken down into two phases. In the first phase, the systems and methods of the disclosure provide a common sense QA example, together with the corresponding CoS-E interpretation, to a language model. The language model conditions on the question and answer choices from the example and is trained to generate the CoS-E interpretation. In the second phase, the systems and methods of the disclosure use the language model to generate an interpretation for each example in the training and validation sets of common sense QA. These common sense auto-generated explanations (CAGE) are provided to a second common sense reasoning model by concatenating them to the end of the original question, answer choices, and output of the language model. The two-phase CAGE framework achieves state-of-the-art results, exceeding the best reported baseline by 10%, and also produces explanations that justify its predictions, namely the common sense auto-generated explanations (CAGE).

In summary, the disclosure presents a new common sense explanations (CoS-E) dataset for studying neural common sense reasoning. The disclosure also provides a new method (CAGE) for automatically generating interpretations, which achieves a state-of-the-art accuracy of about 65% on common sense QA.

Computing device

Fig. 1 is a simplified diagram of a computing device, according to some embodiments. As shown in fig. 1, computing device 100 includes a processor 110 coupled to a memory 120. The operation of computing device 100 is controlled by processor 110. Although computing device 100 is shown with only one processor 110, it is understood that processor 110 may represent one or more central processing units, multi-core processors, microprocessors, microcontrollers, digital signal processors, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), graphics processing units (GPUs), etc. in computing device 100. Computing device 100 may be implemented as a stand-alone subsystem, a board added to a computing device, and/or a virtual machine.

Memory 120 may be used to store software executed by computing device 100 and/or one or more data structures used during operation of computing device 100. Memory 120 may include one or more types of machine-readable media. Some common forms of machine-readable media may include a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer can read.

The processor 110 and/or the memory 120 may be arranged in any suitable physical arrangement. In some embodiments, processor 110 and/or memory 120 may be implemented on the same board, in the same package (e.g., a system-in-package), on the same chip (e.g., a system-on-chip), and so on. In some embodiments, processor 110 and/or memory 120 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 110 and/or memory 120 may be located in one or more data centers and/or cloud computing facilities.

As shown, the memory 120 includes a common sense interpretation module 130 that can be used to implement and/or emulate the systems and models, and/or to implement any of the methods, described further herein. In some examples, the common sense interpretation module 130 can be used to develop, derive, or generate predictions, apply common sense reasoning or rationalization, and generate or provide interpretations of the same, as further described herein. In some examples, the common sense interpretation module 130 can also handle the iterative training and/or evaluation of a system or model used to generate predictions, apply common sense reasoning or rationalization, and generate or provide interpretations. In some examples, memory 120 may include a non-transitory, tangible, machine-readable medium comprising executable code that, when executed by one or more processors (e.g., processor 110), may cause the one or more processors to perform the methods described in further detail herein. In some examples, the common sense interpretation module 130 may be implemented using hardware, software, and/or a combination of hardware and software.

As shown, the computing device 100 receives input data 140 and natural language interpretation text 145, which are provided to the common sense interpretation module 130. The input data 140 may relate to any situation, scenario, question, etc. for which it is desirable to apply artificial intelligence, a neural network, or a deep learning model to analyze and make a prediction, e.g., for question answering (QA) or some other NLP task. In some embodiments, the natural language interpretation text 145 may include human interpretations of common sense reasoning, which may be common sense explanations (CoS-E). The human interpretations may be in the form of open-ended natural language interpretations as well as highlighted annotations in the original input instances. In some embodiments, the natural language interpretation text 145 may include automatically generated interpretations. The natural language interpretation text 145 may be used for fine-tuning or training of the common sense interpretation module 130. In some embodiments, this training may occur over one or more iterations performed by the common sense interpretation module 130.

The common sense interpretation module 130 operates on the input data 140 to develop, derive, or generate predictions or results, using the natural language interpretation text 145 to support or apply common sense reasoning in doing so. The module 130 may also generate or provide interpretations of its reasoning or rationalization. In some embodiments, the common sense interpretation module 130 implements or incorporates a language model (LM) that can generate the interpretations. In some embodiments, the common sense interpretation module 130 implements or incorporates a common sense reasoning model (CSRM) or classification model that develops or generates predictions or results based at least in part on the interpretations from the language model (LM). In some embodiments, the common sense interpretation module 130 uses or incorporates a Generative Pre-Trained Transformer (GPT) language model (e.g., as further described in Radford et al., "Improving Language Understanding by Generative Pre-Training," https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018, which is incorporated herein by reference) and fine-tunes it on the common sense QA training data by conditioning on the questions, answer choices, and human-generated interpretations. The results and interpretations are provided as output 150 from the computing device 100.

In some examples, the common sense interpretation module 130 may include a single-layer or multi-layer neural network with suitable preprocessing, encoding, decoding, and output layers. Neural networks have shown great promise as a technique for automatically analyzing real-world information with human-like accuracy. Typically, neural network models receive input information and make predictions based on the input information. Whereas other approaches to analyzing real-world information may involve hard-coded processes, statistical analysis, and the like, neural networks learn to make predictions gradually, by a process of trial and error, using a machine learning process. A given neural network model may be trained using a large number of training examples, proceeding iteratively until the neural network model begins to consistently make inferences from the training examples similar to those a human might make. Although the common sense interpretation module 130 is depicted as a software module, it may be implemented using hardware, software, and/or a combination of hardware and software.

Common sense interpretation (CoS-E)

According to some embodiments, the language model systems and methods of the disclosure may use or leverage human interpretations of common sense reasoning, which may be collected in a common sense explanations (CoS-E) dataset. In some embodiments, the CoS-E dataset is added to, or built on top of, the existing common sense QA dataset for use in the language model systems and methods of the disclosure. The common sense QA dataset consists of two splits, as described in Talmor et al., "CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge," arXiv:1811.00937v2 (November 2, 2018), which is incorporated herein by reference. In some embodiments, the CoS-E dataset and the language model of the disclosure use the more difficult random split, which is the main evaluation split. Each example in common sense QA consists of a question q, three answer choices c0, c1, c2, and a labeled answer a. The CoS-E dataset adds a human interpretation e_h explaining why a is the most appropriate choice.

In some embodiments, human interpretations of common sense reasoning for the CoS-E dataset can be collected, for example, using Amazon Mechanical Turk (MTurk). As shown in the example of fig. 2, the system presents human participants with one or more questions 210 (e.g., "While eating a hamburger with friends, what are people trying to do?"), together with answer choices 220 and the true answer choice 230. The system prompts the human participants with the following question: "Why is the predicted output the most appropriate answer?" The system instructs the human participants to highlight relevant words 240 in the question 210 that justify the true answer choice 230 (e.g., "hamburger with friends"), and to provide a short open-ended interpretation 250 based on the highlighted justification (e.g., "Usually a hamburger with friends indicates a good time"), which can serve as the common sense reasoning behind the question. The system collects these interpretations to add to, or build on top of, the common sense QA train-random-split and dev-random-split, which may have sizes of 7610 and 950 examples, respectively. The resulting CoS-E dataset includes questions, answer choices, and both free-form interpretations and highlighted text for the true answer choice. The highlighted text or words 240 in the dataset may be referred to as "CoS-E-selected," and the free-form interpretations 250 may be referred to as "CoS-E-open."

With respect to collecting human-generated interpretations of common sense reasoning, it may be difficult to control the quality of the open-ended annotations (e.g., interpretations 250) provided by participants interacting with the system. Thus, in some embodiments, the system may perform in-browser checks to avoid or reject obviously bad interpretations. In some embodiments, a human annotator is not allowed to move forward in the system if he or she does not highlight any relevant words 240 in the question 210, or if the length of the interpretation 250 is less than four words. The system may also check that the interpretation 250 is not a substring of the question 210 or of the answer choices 220 without any other additional words. In some embodiments, the system collects these interpretations 250 from one annotator per example. The system may also perform one or more post-collection checks to catch examples that are not caught or identified by the other filters. The system may filter out interpretations 250 that can be classified as templates. For example, interpretations of the form "<answer> is the only option that is [correct/obvious]" can be deleted by the system and then presented again for annotation by the same or a different human participant.
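By way of non-limiting illustration, the quality checks described above might be sketched as follows. This is a minimal sketch, not the disclosed implementation: the function names, the whitespace-based word count, and the regular expression used to detect template-like interpretations are assumptions introduced for illustration only.

```python
import re

MIN_WORDS = 4  # interpretations shorter than four words are rejected

# Hypothetical pattern for template-like interpretations such as
# "<answer> is the only option that is correct".
TEMPLATE_RE = re.compile(
    r"\bis the only option that is (correct|obvious)\b", re.IGNORECASE
)


def passes_browser_checks(question, choices, highlighted, interpretation):
    """Return True if an annotation passes the in-browser style checks."""
    text = interpretation.strip().lower()
    if not highlighted:                    # no words highlighted in the question
        return False
    if len(text.split()) < MIN_WORDS:      # fewer than four words
        return False
    # Reject interpretations that merely copy part of the question or a choice.
    sources = [question] + list(choices)
    if any(text in source.lower() for source in sources):
        return False
    return True


def is_template(interpretation):
    """Post-collection check: flag interpretations that match the template."""
    return TEMPLATE_RE.search(interpretation) is not None
```

In such a sketch, an annotation flagged by `is_template` would be removed and the example queued again for annotation, while an annotation failing `passes_browser_checks` would never be accepted in the first place.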

Fig. 3 illustrates an example distribution 300 of interpretations (e.g., the open-ended interpretations 250 of fig. 2) collected in the CoS-E dataset, in some embodiments. As shown in fig. 3, 58% of the interpretations from the CoS-E dataset contain the true answer choice (e.g., true answer choice 230), which is case "A". 7% of the interpretations include a distractor (an incorrect choice for the question), which is case "B". 12% of the interpretations include both the true answer and a distractor (A and B), while 23% include neither the true answer nor a distractor (neither A nor B). 42% of the interpretations have a bigram overlap with the question (e.g., question 210), while 22% have a trigram overlap with the question.
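As one possible illustration of how the bigram and trigram overlap statistics above could be computed, the following sketch checks whether an interpretation shares at least one n-gram with its question. The lowercasing, whitespace tokenization, and function names are assumptions made for the example; the disclosure does not specify how overlap was measured.

```python
def ngrams(text, n):
    """Return the set of word n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def has_ngram_overlap(interpretation, question, n):
    """True if the interpretation shares at least one n-gram with the question."""
    return bool(ngrams(interpretation, n) & ngrams(question, n))


def overlap_rate(pairs, n):
    """Fraction of (interpretation, question) pairs with an n-gram overlap."""
    if not pairs:
        return 0.0
    hits = sum(has_ngram_overlap(e, q, n) for e, q in pairs)
    return hits / len(pairs)
```

Calling `overlap_rate(pairs, 2)` and `overlap_rate(pairs, 3)` over a collection of pairs would yield figures comparable in kind to the bigram and trigram percentages reported for fig. 3.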

In some embodiments, the human-generated interpretations of the CoS-E dataset (e.g., the interpretations 250 of fig. 2) may be provided, for example, as the natural language interpretation text 145 that is input to the computing device 100 (fig. 1) for use by the common sense interpretation module 130. According to some embodiments, the CoS-E dataset is added to the existing common sense QA dataset for use in the language model systems and methods, e.g., as implemented or incorporated in module 130. The effectiveness of using the CoS-E dataset for the language model (LM) is not limited to those specific examples of the dataset. In some embodiments, the language model obtains state-of-the-art results by using the CoS-E dataset only during training. Empirical results show that even when using only those interpretations that have no word overlap with any of the answer choices, performance exceeds that of baselines that do not use the CoS-E dataset at all. It was also observed that a significant proportion of the interpretations in the CoS-E dataset contain distractor choices; further analysis found that, for those examples, the annotators explained by eliminating the wrong choices. This indicates that it is difficult even for humans to reason about many of the examples in common sense QA. CoS-E also adds diversity of perspective to the common sense QA dataset, and in particular diverse reasoning about world knowledge. Even though many of the interpretations remain noisy after the quality control checks, the interpretations of the CoS-E dataset are of sufficient quality to train a language model that generates common sense reasoning.

Common sense automatic generation interpretation (CAGE)

Language model systems and methods may develop, derive, or generate predictions or results for an NLP task (e.g., question answering). According to some embodiments, the language model systems and methods of the disclosure generate or output interpretations of the reasoning and rationales for their predictions or results, namely common sense auto-generated explanations (CAGE). In some embodiments, for example, a language model or module, as implemented or incorporated in the common sense interpretation module 130, generates these interpretations in response to, or using, the input data 140 and the natural language interpretation text 145. The interpretations are generated by the language model and used as supplemental input to the classification model or module.

In some embodiments, CAGE is provided and applied to the common sense QA task. As previously described, each example in common sense QA consists of a question q, three answer choices c0, c1, c2, and a labeled answer a; and the CoS-E dataset adds a human interpretation e_h of why a is the most appropriate choice. The output of CAGE is an interpretation e generated by the language model, which is trained to be close to e_h.

According to some embodiments, to provide CAGE to the classification model, a language model (LM) is fine-tuned or adapted to generate interpretations from the CoS-E dataset. In some embodiments, the language model of the disclosure may be implemented with or incorporate the pre-trained OpenAI Generative Pre-Trained Transformer (GPT), as further detailed in Radford et al., "Improving Language Understanding by Generative Pre-Training," https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018, which is incorporated herein by reference. GPT is a multi-layer transformer decoder (see Vaswani et al., 2017, incorporated herein by reference).

In some embodiments, a language model (LM) (e.g., that of GPT) is fine-tuned or trained on a combination of the common sense QA dataset and the CoS-E dataset. This is shown, for example, in figs. 4 and 5. Fig. 4 shows one time step in training the CAGE language model (LM) or module 405 to generate interpretations from the CoS-E dataset. In some embodiments, the language model may be implemented in or be part of the common sense interpretation module 130 (fig. 1). As illustrated, the language model 405 is conditioned on the question tokens Q 410 concatenated with the answer choice tokens A₁, A₂, A₃ 420 and the previously human-generated interpretation tokens E₁, ..., E_{i-1} 430. The language model (LM) or module 405 is trained to generate the interpretation token E_i 440.
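The time step of fig. 4 can be illustrated with a short sketch that enumerates, for one example, the context and target token at each step i. This is a simplified illustration only: the function name is hypothetical, tokens are represented as plain strings, and no actual model or loss computation is shown.

```python
def training_steps(question_tokens, choice_tokens, interpretation_tokens):
    """Yield one (context, target) pair per time step i.

    At step i the context is the question tokens, the answer choice tokens,
    and the human interpretation tokens E_1 .. E_{i-1}; the target is E_i.
    """
    prefix = list(question_tokens) + list(choice_tokens)
    for i, target in enumerate(interpretation_tokens):
        yield prefix + list(interpretation_tokens[:i]), target
```

For an interpretation of three tokens, the sketch yields three pairs, the last of which has the context Q, A₁, A₂, A₃, E₁, E₂ and the target E₃.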

Fig. 5 is a simplified diagram of a language module or model 505 according to some embodiments. In some embodiments, the language model 505 may be consistent with the common sense interpretation module 130 and/or the language model 405. In some examples, language model 505 is a multi-layer neural network. As shown in fig. 5, in some embodiments, the multi-layer neural network may be a multi-layer transformer encoder that includes an embedding module 510 and a transformer module 512. In some embodiments, the embedding module 510 may include embedding layers (E₁, E₂, ..., E_N), and the transformer module 512 may include one or more layers of transformers (Trm). In some embodiments, each transformer (Trm) may be implemented using long short-term memory (LSTM). The language model or module 505 receives structured source text x, such as input data 140, in the form of a question (Q) and answer choices. In some embodiments, the structured source text x is in natural language form. The structured source text x is passed to the embedding layers (E₁, E₂, ..., E_N), which decompose the structured source text into tokens x_i, where each token x_i may correspond to a word, number, label, etc. In some embodiments, as shown, the language model or module 505 uses constrained self-attention at the transformer (Trm) layers, where each token can only attend to the context to its left. These left-context-only transformer (Trm) layers collectively act as a transformer decoder for text generation. The generated text (T₁, T₂, ..., T_N) is the common sense interpretation E_i. In some embodiments, such an interpretation may be used to infer which answer choice is correct for the question.
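The constrained, left-context-only self-attention described above is commonly realized with a lower-triangular attention mask. The following sketch, offered only as an illustration, builds such a mask and applies it in a row-wise softmax. It assumes that each token may also attend to its own position, as is typical of transformer decoder implementations; the function names are hypothetical.

```python
import math


def left_context_mask(n):
    """mask[i][j] is True when token i may attend to token j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax over `scores`; masked-out positions get zero weight."""
    weights = []
    for row, allowed in zip(scores, mask):
        exps = [math.exp(s) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)  # never zero here: the diagonal is always allowed
        weights.append([e / total for e in exps])
    return weights
```

With this mask, the first token places all of its attention weight on itself, and no token assigns any weight to a position to its right.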

Given either a human interpretation from CoS-E or the reasoning/interpretation from a language model or module (e.g., 405 or 505), the systems and methods of the disclosure can learn to perform predictions on the common sense QA task. In some embodiments, for example as shown in figs. 6 and 7, a classification model or module generates or derives a prediction for the input question-answer set. Fig. 6 shows one time step of a classification model (CSRM) 615 generating a prediction. In some embodiments, the classification model may be implemented in or be part of the common sense interpretation module 130 (fig. 1). As shown, the classification model or module 615 receives input tokens that include the answer choice tokens A₁, A₂, A₃ 620, and generates or derives a prediction token a₁ 650.

In some embodiments, the classification model or module 615 may be implemented with or employ a language representation model, such as a Bidirectional Encoder Representations from Transformers (BERT) model, as described in Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv preprint arXiv:1810.04805 (October 11, 2018), which is incorporated herein by reference. In some embodiments, the classification model 615 may be implemented with or employ the BERT_LARGE model, which can be fine-tuned for multiple-choice question answering by adding a simple binary classifier. The classifier takes as input the final state corresponding to the special [CLS] token placed at the beginning of all inputs to the BERT model. For each example in the dataset, the classification model 615 constructs three input sequences for fine-tuning the BERT_LARGE model. The interpretations share the same input representation as the questions.

Fig. 7 is a simplified diagram of a classification model or module 715 according to some embodiments. In some embodiments, the classification model 715 may be consistent with the common sense interpretation module 130 and/or the classification model 615. In some examples, classification model 715 is a multi-layer neural network. As shown in fig. 7, in some embodiments, the multi-layer neural network may be a multi-layer transformer encoder including an embedding module 710 and a transformer module 712. In some embodiments, the embedding module 710 may include embedding layers (E₁, E₂, ..., E_N), and the transformer module 712 may include one or more layers of transformers (Trm). In some embodiments, long short-term memory (LSTM) layers may be used instead of transformer layers. The classification model or module 715 receives structured source text x, such as the input data 140, in the form of a question (Q) and answer choices. In some embodiments, the structured text may also include an interpretation generated, for example, by a trained language model (e.g., 405 or 505). The question, answer choices, and interpretation are separated by separators [SEP] in the input data. In some embodiments, each sequence is the concatenation of the question, a separator token [SEP], and one of the answer choices. If the method requires an interpretation, either from CoS-E or automatically generated as in CAGE, the classification model or module 715 concatenates the question, [SEP], the interpretation, [SEP], and an answer choice. The structured source text x is passed to the embedding layers (E₁, E₂, ..., E_N), which decompose the structured source text into tokens x_i, where each token x_i may correspond to a word, number, label, etc. In some embodiments, as shown, the classification model 715 uses bidirectional self-attention at the transformer (Trm) layers, where each token can attend to context to both its left and its right. These transformer (Trm) layers collectively function as a transformer encoder. The classification model or module 715 generates or derives a prediction of the answer choice for the input question.
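A minimal sketch of how the per-choice input sequences described above might be assembled is given below. It works at the level of plain strings rather than model-specific token identifiers, and the function name, the space-joined layout, and the absence of a trailing separator are assumptions made for illustration rather than details taken from the disclosure.

```python
CLS, SEP = "[CLS]", "[SEP]"


def build_input_sequences(question, choices, interpretation=None):
    """Build one input sequence per answer choice.

    Without an interpretation: [CLS] question [SEP] choice
    With an interpretation:    [CLS] question [SEP] interpretation [SEP] choice
    """
    sequences = []
    for choice in choices:
        parts = [CLS, question, SEP]
        if interpretation is not None:
            parts += [interpretation, SEP]
        parts.append(choice)
        sequences.append(" ".join(parts))
    return sequences
```

For an example with three answer choices, the sketch returns three sequences; a classifier operating on the final state of the [CLS] token of each sequence could then score the choices, with the highest-scoring choice taken as the prediction.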

Two settings or possibilities for generating the interpretation and the prediction may be: (1) explain-and-then-predict ("reasoning"); and (2) predict-and-then-explain ("rationalization").

Reasoning: reasoning is explained with reference to figs. 8 and 9. Fig. 8 is a simplified diagram illustrating a system 800 for generating common sense interpretations for reasoning by an artificial intelligence or deep learning model, according to some embodiments. Fig. 9 is a simplified diagram of a corresponding method 900 for the system 800. One or more of the processes 910-940 of the method 900 may be implemented, at least in part, in the form of executable code stored on a non-transitory, tangible, machine-readable medium that, when executed by one or more processors, may cause the one or more processors to perform one or more of the processes 910-940. In some embodiments, the system 800 may be implemented in the computing device 100 of fig. 1 (e.g., the common sense interpretation module 130), and the method 900 may be performed by the computing device 100 of fig. 1 (e.g., the common sense interpretation module 130).

With reasoning, as shown in figs. 8 and 9, a trained CAGE language model 805 (which may be consistent with language models or modules 405 and 505) is used to generate interpretations 840 for a downstream classification or common sense reasoning model (CSRM) 815.

For training, at process 910, language model 805 receives natural language interpretation text. In some examples, the natural language interpretation text (e.g., text 145) may include a question q, answer choices c0, c1, c2, and an interpretation e_h collected from or developed by humans.

In some embodiments, the task of collecting or developing interpretations from humans consists of two parts. In the first part, the human annotator is instructed to highlight the relevant words in the question that justify the output as correct. In the second part, the annotator is asked to provide a brief open-ended interpretation of why the predicted output is correct and the other choices are not. These instructions encourage the annotator to provide interpretations that actually convey the common sense reasoning behind the question. In some embodiments, the natural language interpretation text is used to train, test, and run language model 805.

With reasoning, the language model (LM) 805 is fine-tuned conditioned on the question q, the answer choices c0, c1, c2, and the human-generated interpretation e_h, rather than on the actual predicted label or answer a. Thus, the input context C_RE during training is defined as follows:

C_RE = "q, c0, c1, or c2? commonsense says "

The language model 805 is trained to generate an interpretation e according to a conditional language modeling objective.
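As an illustration of how the training context for reasoning might be assembled, the sketch below builds a context string from a question and its answer choices and pairs it with the human interpretation e_h as the generation target. The helper names are hypothetical, and the trailing cue phrase is passed in as a parameter so that the sketch does not depend on its exact wording.

```python
from typing import Sequence, Tuple


def build_reasoning_context(question: str,
                            choices: Sequence[str],
                            *,
                            cue: str) -> str:
    """Assemble '<q>, <c0>, <c1>, or <c2>? <cue> ' (hypothetical helper)."""
    if len(choices) < 2:
        raise ValueError("expected at least two answer choices")
    listed = ", ".join(choices[:-1]) + f", or {choices[-1]}"
    return f"{question}, {listed}? {cue} "


def build_training_pair(question: str,
                        choices: Sequence[str],
                        human_interpretation: str,
                        *,
                        cue: str) -> Tuple[str, str]:
    """Return (context, target); the answer label is deliberately not used."""
    context = build_reasoning_context(question, choices, cue=cue)
    return context, human_interpretation


if __name__ == "__main__":
    ctx, target = build_training_pair(
        "Where do you keep milk?",
        ["fridge", "oven", "desk"],
        "Milk must stay cold.",
        cue="commonsense says")
    print(repr(ctx))
    print(repr(target))
```

Note that the context omits the answer label, which is the property that distinguishes the reasoning setting from rationalization.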

After system 800 (e.g., language model 805) is trained, language model 805 and classification model or module 815 receive input data (e.g., input data 140) at process 920. The input data may relate to any situation, scenario, problem, etc. for which it is desirable to apply artificial intelligence, neural networks, or deep learning models for analysis and prediction. In some embodiments, as shown, the input data may include a question Q 810 and answer choices A₁, A₂, A₃ 820.

At process 930, the language model 805 generates or develops an interpretation E 840 of the common sense reasoning for the potential prediction or outcome for the input data. This may be accomplished, for example, as described with respect to language models 405 and 505 of figs. 4 and 5. The machine-generated common sense interpretation 840 is provided to the classification model 815.

At process 940, classification model or module 815 (which may be consistent with classification models or modules 615 and 715) operates on the input data (e.g., question 810 and answer choices 820) to develop, derive, or generate predictions or results 850. In some examples, the classification model 815 uses the machine-generated interpretations 840 to support or apply common sense reasoning in its analysis. This may be accomplished, for example, as described with respect to classification models 615 and 715 of figs. 6 and 7.

In some embodiments, the objective is to maximize:

Σ_i log P(e_i | e_{i-k}, ..., e_{i-1}, C_RE; Θ)

where k is the size of the context window (in this case k is always larger than the length of e, so that the whole interpretation is within the context). The conditional probability P is modeled by a neural network with parameters Θ, conditioned on C_RE and the previous interpretation tokens. This kind of interpretation may be referred to as "reasoning" because it can be automatically generated during inference to provide additional context for common sense question answering. It is shown below that this method outperforms the reported state of the art on common sense QA by 10%.
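The objective described above sums the log-probability of each interpretation token given the input context and up to k preceding interpretation tokens. The following sketch computes that sum for an arbitrary conditional probability function. The names `interpretation_log_likelihood` and `uniform_prob` are hypothetical, and the uniform toy distribution merely stands in for a neural network with parameters Θ.

```python
import math
from typing import Callable, Sequence

# Signature assumed for illustration: (context, previous_tokens, token) -> probability
NextTokenProb = Callable[[str, Sequence[str], str], float]


def interpretation_log_likelihood(context: str,
                                  tokens: Sequence[str],
                                  next_token_prob: NextTokenProb,
                                  k: int) -> float:
    """Sum of log P(e_i | previous k tokens, context) over all tokens e_i."""
    total = 0.0
    for i, token in enumerate(tokens):
        window = tokens[max(0, i - k):i]  # at most k preceding tokens
        total += math.log(next_token_prob(context, window, token))
    return total


def uniform_prob(context: str, previous: Sequence[str], token: str,
                 vocab_size: int = 4) -> float:
    """Toy stand-in for the model: every token is equally likely."""
    return 1.0 / vocab_size


if __name__ == "__main__":
    e = ["milk", "stays", "cold"]
    print(interpretation_log_likelihood("context", e, uniform_prob, k=10))
```

When k exceeds the length of the interpretation, as the text states, the window always contains every preceding interpretation token.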

The results and interpretation of the common sense inference are provided as an output (e.g., output 150 from common sense interpretation module 130).
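The explain-then-predict flow of processes 920-940 can be sketched as two callables chained in order. This is a minimal illustration only: `explain_then_predict`, `toy_explainer`, and `toy_classifier` are hypothetical names, and the toy stand-ins are simple heuristics rather than the trained language model 805 or classification model 815.

```python
from typing import Callable, List, Tuple

Explainer = Callable[[str, List[str]], str]        # (question, choices) -> interpretation
Classifier = Callable[[str, List[str], str], int]  # (question, choices, interpretation) -> index


def explain_then_predict(question: str,
                         choices: List[str],
                         explainer: Explainer,
                         classifier: Classifier) -> Tuple[str, str]:
    """Generate an interpretation first, then classify using it."""
    interpretation = explainer(question, choices)
    index = classifier(question, choices, interpretation)
    return choices[index], interpretation


def toy_explainer(question: str, choices: List[str]) -> str:
    # Stand-in for a trained language model; returns a fixed sentence.
    return "milk stays cold in the fridge"


def toy_classifier(question: str, choices: List[str], interpretation: str) -> int:
    # Stand-in for a trained classifier; picks the choice sharing most words.
    words = set(interpretation.lower().split())
    overlaps = [len(words & set(c.lower().split())) for c in choices]
    return overlaps.index(max(overlaps))


if __name__ == "__main__":
    answer, why = explain_then_predict(
        "Where do you keep milk?", ["oven", "fridge", "desk"],
        toy_explainer, toy_classifier)
    print(answer, "|", why)
```

Both the prediction and the interpretation are returned, mirroring the output that contains the result together with its interpretation.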

Rationalization: The reverse approach to reasoning is rationalization. Rationalization is described with reference to figs. 10 and 11. Fig. 10 is a simplified diagram illustrating a system 1000 for generating common sense interpretations for rationalization by artificial intelligence or deep learning models, according to some embodiments. Fig. 11 is a simplified diagram of a corresponding method 1100 for the system 1000. One or more of the processes 1110-1140 of the method 1100 may be implemented, at least in part, in executable code stored on a non-transitory, tangible, machine-readable medium, which when executed by one or more processors, causes the one or more processors to perform one or more of the processes 1110-1140. In some embodiments, system 1000 may be implemented in computing device 100 of fig. 1 (e.g., common sense interpretation module 130), and method 1100 may be performed by computing device 100 of fig. 1 (e.g., common sense interpretation module 130).

With rationalization, as shown in figs. 10 and 11, a classification model or module 1015 (which may be consistent with classification models or modules 615 and 715) first makes predictions a, and then a language model or module 1005 (which may be consistent with language models or modules 405 and 505) generates interpretations based on those labels.

For training, at process 1110, classification model 1015 operates on input data (e.g., question 1010 and answer choices 1020) to develop, derive, or generate predictions or results 1050. A language model or module 1005 receives natural language interpretation text. In some examples, the natural language interpretation text (e.g., text 145) may include a question q, answer choices c0, c1, c2, and an interpretation e_h collected from or developed by humans, as previously described.

At process 1120, language model 1005 and classification model 1015 receive input data (e.g., input data 140). The input data may relate to any situation, scenario, problem, etc. for which it is desirable to apply artificial intelligence, neural networks, or deep learning models for analysis and prediction. In some embodiments, as shown, the input data may include a question Q 1010 and answer choices A₁, A₂, A₃ 1020.

At process 1130, the classification model or module 1015 operates on the input data to develop, derive, or generate predictions or results 1050. This may be accomplished, for example, consistent with the description of the classification models or modules 615 and 715 of figs. 6 and 7. The results 1050 are provided to the language model 1005.

In rationalization, at process 1140, the language model 1005 conditions on the predicted label a together with the input to generate post-hoc rationalizations, or in other words, to generate an interpretation of the reasoning used to make the prediction. During the fine-tuning step of the language model 1005, the input context C_RA contains the output label a and is constructed as follows:

C_RA = "q, c0, c1, or c2? a because "
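A sketch of how the rationalization context might be assembled follows. In contrast to the reasoning setting, the context here includes the predicted label a followed by the word "because". The function name `build_rationalization_context` is hypothetical, and the exact punctuation of the template is an assumption.

```python
from typing import Sequence


def build_rationalization_context(question: str,
                                  choices: Sequence[str],
                                  predicted_answer: str) -> str:
    """Assemble '<q>, <c0>, <c1>, or <c2>? <a> because ' (hypothetical helper)."""
    if len(choices) < 2:
        raise ValueError("expected at least two answer choices")
    if predicted_answer not in choices:
        raise ValueError("predicted answer must be one of the choices")
    listed = ", ".join(choices[:-1]) + f", or {choices[-1]}"
    return f"{question}, {listed}? {predicted_answer} because "


if __name__ == "__main__":
    print(repr(build_rationalization_context(
        "Where do you keep milk?", ["fridge", "oven", "desk"], "fridge")))
```

The language model would then continue this context with the rationalization text.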

The training objective of the language model 1005 in rationalization is similar to that in reasoning, except that in this case the model 1005 has access to the ground-truth label for the input question during training.

Because the language model or module 1005 is conditioned on the predicted label, the interpretations are not considered common sense reasoning. Instead, they provide a "rationalization" that makes the model more accessible and interpretable. It has been found that this rationalization approach outperforms the prior state-of-the-art model by 6%, as described below.
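The predict-then-explain ordering of rationalization can be sketched as follows: the classifier runs first, and its predicted label is handed to the interpretation generator. All names are hypothetical, and the toy stand-ins are placeholders rather than the classification model 1015 or the language model 1005.

```python
from typing import Callable, List, Tuple

Classifier = Callable[[str, List[str]], int]         # (question, choices) -> index
Rationalizer = Callable[[str, List[str], str], str]  # (question, choices, label) -> text


def predict_then_explain(question: str,
                         choices: List[str],
                         classifier: Classifier,
                         rationalizer: Rationalizer) -> Tuple[str, str]:
    """Predict a label first, then generate an interpretation conditioned on it."""
    predicted = choices[classifier(question, choices)]
    interpretation = rationalizer(question, choices, predicted)
    return predicted, interpretation


def toy_classifier(question: str, choices: List[str]) -> int:
    # Placeholder heuristic: pick the longest choice.
    return max(range(len(choices)), key=lambda i: len(choices[i]))


def toy_rationalizer(question: str, choices: List[str], predicted: str) -> str:
    # Placeholder: a real language model would generate free text here.
    return f"{predicted} is the most plausible of the listed choices"


if __name__ == "__main__":
    print(predict_then_explain("Where do you keep milk?",
                               ["oven", "fridge", "desk"],
                               toy_classifier, toy_rationalizer))
```

Because the interpretation is produced after the prediction, it cannot influence the classifier's output; it only accounts for it.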

With respect to the systems and methods of figs. 8-11, some examples of a computing device, such as computing device 100, may include a non-transitory, tangible, machine-readable medium that includes executable code, which when executed by one or more processors (e.g., processor 110), may cause the one or more processors to perform the processes of methods 900 and 1100. Some common forms of machine-readable media that may include the processes of methods 900 and 1100 are, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.

Results

Results on the common sense QA dataset using the proposed variants of common sense automated generation interpretation (CAGE) are presented. The BERT_LARGE model was used as a baseline without any CoS-E or CAGE.

FIG. 12 shows a table 1200 with a set of examples from common sense QA, CoS-E, and CAGE (for reasoning and rationalization). It can be observed that, in some embodiments, CAGE-reasoning generally takes a simpler form than CoS-E-open. Nevertheless, this simple declarative form can sometimes be more informative than CoS-E-open. Systems and methods implementing CAGE according to the disclosure achieve this by providing more explicit guidance (as in the last example 1202 of table 1200) or by adding meaningful context (as in the third example 1204, by introducing the word 'friend'). It is observed from table 1200 that, in some embodiments, CAGE-reasoning contains at least one of the answer choices 43% of the time, and of those cases it contains the model's actual predicted answer choice 21% of the time. This suggests that there is more to the effectiveness of CAGE-reasoning than directly pointing to the answer.

As can be seen from table 1200, CAGE-rationalization and CAGE-reasoning are often identical, or differ only in word order or by replacing one answer choice with another. Humans could predict the answer based on CAGE-rationalization alone 42% of the time, the same as for CAGE-reasoning. Although CAGE-rationalization appears to be better than CAGE-reasoning, we find that it does not drastically improve the model's language generation behavior, which is what humans judge when trying to guess the correct answer without the actual question.

An additional experimental setup uses only open-ended interpretations that do not contain any words from any of the answer choices. These interpretations may be referred to as "CoS-E-limited-open" interpretations because they are limited in the choice of allowed words. We observed that even these limited interpretations improved on the BERT baseline, suggesting that the interpretations provide useful information beyond merely mentioning the correct or incorrect answers.

Fig. 13 shows a table 1300 of results achieved with a BERT baseline that uses only common sense QA inputs, compared with systems and methods according to embodiments of the disclosure that are trained on inputs containing interpretations from CoS-E. As seen in table 1300, the BERT baseline model achieved 64% accuracy. Adding open-ended human interpretations (CoS-E-open) alongside the question during training results in a 2% improvement in the accuracy of the question-answering model. The accuracy of the model increased to 72% when the model was further provided with interpretations generated by CAGE-reasoning (not conditioned on the ground truth) during both training and validation.

This description and drawings, illustrating inventive aspects, embodiments, implementations or applications, should not be taken in a limiting sense. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of the description and claims. In some instances, well-known circuits, structures or techniques have not been shown or described in detail to avoid obscuring implementations of the invention. The same numbers in two or more drawings identify the same or similar elements.

In the present specification, specific details are set forth describing some embodiments according to the disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art, that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are intended to be illustrative rather than restrictive. Those skilled in the art will recognize other elements that, although not specifically described herein, are within the scope and spirit of the disclosure. Furthermore, to avoid unnecessary repetition, one or more features shown and described in connection with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if one or more features would render the embodiments inoperative.

While exemplary embodiments have been shown and described, a wide range of modifications, changes, and substitutions is contemplated in the foregoing disclosure and in some instances, some features of the embodiments may be employed without a corresponding use of the other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Accordingly, the scope of the invention should be limited only by the attached claims, and, as appropriate, the claims should be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.

Claims

1. A method, comprising:

encoding and embedding, by an embedding module, structured source text for a question and answer set, the question and answer set comprising a question and a plurality of answer choices;

iteratively decoding, by a multi-layer transformer module, an output of the embedding module based on the generated tokens related to the structured interpretation text from the previous iteration to generate an interpretation text for inferring which answer selection is correct for the question, wherein the structured interpretation text from the previous iteration comprises interpretation text generated by a human annotator;

providing the generated interpretation text to a classification module; and

generating, at the classification module and using the generated interpretation text, a prediction of which answer selection is correct for the question.

2. The method of claim 1, wherein the structured source text of the question-and-answer set comprises text in natural language form.

3. The method of claim 1 or 2, comprising collecting the structured interpretation text from the human annotator.

4. The method of claim 3, wherein collecting the structured interpretation text comprises:

providing the human annotator with a set of training questions and answers; and

receiving the structured interpretation text from the human annotator in response to the training set of questions and answers.

5. The method of any of claims 1-4, comprising providing the set of questions and answers to the classification module, wherein the questions, the plurality of answer choices, and the generated interpretation text are separated by delimiters when provided to the classification module.

6. The method of any of claims 1-5, wherein the embedding module and the multilayer transformer module comprise at least a portion of a natural language model.

7. The method of any of claims 1-6, wherein the classification module comprises a multilayer transformer encoder.

8. A system, comprising:

an embedding module to encode and embed structured source text for a question-and-answer set, the question-and-answer set comprising a question and a plurality of answer choices;

a multi-layer transformer module for iteratively decoding the output of the embedding module based on the generated tokens related to the structured interpretation text from the previous iteration to generate an interpretation text for inferring which answer selection is correct for the question, wherein the structured interpretation text from the previous iteration comprises the interpretation text generated by the human annotator; and

a classification module for generating a prediction of which answer selection is correct for the question using the generated interpretation text.

9. The system of claim 8, wherein the structured source text of the question-and-answer set comprises text in natural language form.

10. The system of claim 8 or 9, wherein the embedding module and the multilayer transformer module comprise at least a portion of a neural network.

11. The system of any of claims 8-10, wherein the question, the plurality of answer selections, and the generated interpretation text separated by separators are provided to the classification module.

12. The system of any of claims 8-11, wherein the embedding module and the multilayer transformer module comprise at least a portion of a natural language model.

13. The system of any of claims 8-12, wherein the classification module comprises a multilayer transformer encoder.

14. A non-transitory machine-readable medium comprising executable code that, when executed by one or more processors associated with a computer, is adapted to cause the one or more processors to perform a method comprising:

iteratively decoding, by the multi-layer transformer module, an output of the embedding module based on the generated tokens related to the structured interpretation text from the previous iteration, including the interpretation text generated by the human annotator, to generate interpretation text for inferring which answer selection is correct for the question;

providing the generated interpretation text to a classification module; and

15. The non-transitory machine-readable medium of claim 14, wherein the structured source text of the question-and-answer set comprises text in natural language.

16. The non-transitory machine-readable medium of claim 14 or 15, comprising executable code that when executed by the one or more processors is adapted to cause the one or more processors to collect the structured interpretation text from the human annotator.

17. The non-transitory machine-readable medium of any of claims 14-16, comprising executable code that when executed by the one or more processors is adapted to cause the one or more processors to:

providing the human annotator with a set of training questions and answers; and

18. The non-transitory machine-readable medium of any of claims 14-17, comprising executable code that when executed by the one or more processors is adapted to cause the one or more processors to provide the set of questions and answers to the classification module, wherein the questions, the plurality of answer selections, and the generated interpretation text are separated by separators when provided to the classification module.

19. The non-transitory machine-readable medium of any of claims 14-18, wherein the embedding module and the multilayer transformer module comprise at least a portion of a natural language model.

20. The non-transitory machine-readable medium of any of claims 14-19, wherein the classification module comprises a multi-layer transformer encoder.