DocuBrain For DS 4440

An Analysis of
Downstream Transformer Generation of Question-Answer Pairs with Preprocessing and Postprocessing Pipelines


This paper explores automatically generating questions from text and evaluates the effectiveness of the generated questions. What caught our attention was the prospect of leveraging neural networks to automate a task we have been performing manually throughout the semester: crafting questions for assigned articles. This sparked our interest in developing a system to streamline and automate this process using automatic question generation (AQG).

Section 1: Introduction

DocuBrain3 is an Automatic Question Generation (AQG) project designed to automatically generate questions from provided passages of text. By analyzing and understanding the content, our project aims to generate relevant and meaningful questions without human intervention. It takes passages of text as input and produces corresponding questions, facilitating enhanced comprehension and utilization of textual content. Inspired by the research paper "Downstream Transformer Generation of Question-Answer Pairs with Preprocessing and Postprocessing Pipelines," our approach leverages the insights from state-of-the-art methodologies outlined therein. Furthermore, we utilize the SQuAD dataset as a foundational resource, enriching our system's ability to generate questions with accuracy and relevance.

Section 2: Paper Review

"Downstream Transformer Generation of Question-Answer Pairs with Preprocessing and Postprocessing Pipelines" introduces TP3, a system designed for generating question-answer pairs (QAPs) from given articles. The TP3 system utilizes a pretrained T5 transformer model to generate questions, employing preprocessing and postprocessing pipelines with various NLP tools and algorithms. This paper's methodology is very closely related to our project, DocuBrain3, as we also utilize pretrained transformer models and the SQuAD dataset to automate question generation from textual content. For fine-tuning, each QAP and its context from SQuAD are concatenated, with the question as the target. Learning rates are determined using the Cyclical Learning Rates method and adjusted based on evaluation metrics like BLEU, ROUGE, METEOR, and BERTScore, leading to the identification of optimal rates. T5-Large consistently outperforms T5-Base, with specific learning rates yielding the best results. The paper provides detailed measurement results for both models, leading to the development of T5-Base-SQuAD1 and T5-Large-SQuAD1, which demonstrate superior performance.

The preprocessing pipeline in TP3 is crucial for selecting appropriate answers and ultimately generating high-quality questions. It involves several steps:

  • Removing Unsuitable Sentences: Interrogative and imperative sentences are removed, and semantic-role labeling is used to analyze and retain sentences with relevant semantic roles.
  • Filtering Candidate Answers: Nouns and phrasal nouns are considered candidate answers, with specific criteria for selection. Named entities and nouns with certain semantic-role labels are retained, while others are removed.
  • Pruning Inadequate POS Tags: Nouns with inappropriate part-of-speech tags are removed or pruned to improve question adequacy.
  • Removing Common Answers: Words with high probabilities in the Google Books Ngram Dataset, such as "anyone" and "stuff," are removed from candidacy.
  • Filtering Answers in Clauses: Candidate answers found within clauses are assessed to ensure their relevance. Instead of simply removing answers appearing within clauses, our method scrutinizes whether the candidate answers appear later in the same clause. This approach aims to exclude answers that may not contribute effectively to the question generation process, leading to more precise filtering.
  • Removing Redundant Answers: Redundant candidate answers are removed, prioritizing syntactically important phrases to avoid inadequate question generation.
These preprocessing steps ensure that only suitable answers are considered for question generation. For each candidate answer identified in the preprocessing stage, three consecutive sentences from the text are selected as context. The structured context and corresponding answer are formatted and provided as input to a fine-tuned T5 (Text-to-Text Transfer Transformer) model, which then generates three candidate questions for each answer. The question most semantically similar to the context is then chosen as the final question, ensuring relevance and coherence.
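The paper does not specify which similarity model performs this selection step. As a hedged illustration, the sketch below picks the candidate question closest to the context using the sentence-transformers library as a stand-in; the candidate questions are made up.

    # Illustrative sketch: choose the candidate question most semantically
    # similar to the context. sentence-transformers is our stand-in here;
    # the paper's actual similarity method may differ.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    context = "The Norman conquest of England took place in 1066."
    candidates = [
        "When did the Norman conquest of England take place?",
        "What took place?",
        "Who was king in 1066?",
    ]

    ctx_emb = model.encode(context, convert_to_tensor=True)
    cand_embs = model.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(ctx_emb, cand_embs)[0]  # cosine similarity of each candidate to the context
    print(candidates[int(scores.argmax())])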

Following question generation, postprocessing steps are employed to refine the quality of generated questions. Some examples include removing redundant information and single-word questions. The quality of QAPs generated by TP3-Base and TP3-Large is evaluated using standard automatic metrics and human judgments. T5-SQuAD1 demonstrates superior performance across ROUGE and METEOR metrics compared to other models, highlighting its effectiveness in generating high-quality QAPs. In conclusion, TP3 leverages pretrained T5 models and robust preprocessing/postprocessing pipelines to generate high-quality QAPs. Future enhancements could focus on refining transformer models and exploring alternative methods for context selection and question generation, advancing the capabilities of QAP generation. We aim to use these preprocessing and postprocessing steps and ideas as inspiration for our AQG project.
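As a concrete, hypothetical illustration of the kind of postprocessing described above, the sketch below drops single-word questions and near-duplicates; it is not the paper's actual pipeline.

    # Hypothetical postprocessing filters: discard single-word "questions"
    # and near-duplicates that differ only in casing or punctuation.
    def postprocess(questions):
        kept, seen = [], set()
        for q in questions:
            q = q.strip()
            if len(q.split()) <= 1:  # single-word questions are removed
                continue
            key = " ".join(q.lower().rstrip("?").split())  # normalized form
            if key in seen:  # duplicate after normalization
                continue
            seen.add(key)
            kept.append(q)
        return kept

    print(postprocess(["Why?", "What is SQuAD?", "what is SQuAD ?"]))
    # -> ['What is SQuAD?']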

Section 3: Method & Implementation

Preprocessing Steps:

  1. Data Collection: We collected the Stanford Question Answering Dataset (SQuAD) for our project, which consists of over 87,000 questions sourced from 500+ Wikipedia articles. Each question in SQuAD is accompanied by a meticulously curated answer, facilitating in-depth exploration and analysis of language understanding tasks.
  SQUAD Dataset.png

  2. Data Cleaning (a spaCy-based sketch of some of these filters appears after this list):
    • Remove Unsuitable Sentences from Context: We filtered out sentences from the context that were interrogative (e.g., "Would you like to have tea or coffee?") or imperative (e.g., "Remember to pick up the dry cleaning today"), as well as those that did not contain necessary information (e.g., "Please wait").
    • Remove Improper Answers (Semantic Role Labels): We removed improper answers based on Semantic Role Labels (SRL). For example, if the question was "How far did he travel?" and the answer was "Further," where "Further" is labeled as ARGM-EXT (extent), it would be removed.
    • Remove Improper Answers (Part of Speech): We filtered out improper answers based on Part of Speech (POS) tags. For instance, if the question was "What time is it?" and the answer was "Now," where "Now" is labeled as a temporal adverb (TMP), it would be removed.
    • Remove Common Answers: We excluded common answers such as "stuff," "thing," "anyone," etc.
    • Filtering Answers in Clauses: Candidate answers found within clauses are evaluated for relevance by considering their placement within the clause. Instead of simply discarding answers appearing within clauses, our method examines whether the candidate answers are located later in the same clause.
    • Removing Redundant Answers: Redundant candidate answers are eliminated, giving priority to syntactically significant phrases.
    Dependency Tree.png
  3. Tokenization: The text data was tokenized using the T5 tokenizer, breaking it down into individual tokens or words to facilitate further processing.
  4. Input-Output Pair Generation: We formatted the tokenized data into input-output pairs suitable for training the T5 model. This involved concatenating the context and answer texts as the input and using the question texts as the target output. Specifically, we included markers such as "<context>" before the actual context and "<answer>" before the actual answer so that the model is made aware of the different components and is able to distinguish them (see the second sketch after this list).
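A minimal sketch of the candidate-answer side of the cleaning steps above, assuming spaCy with the en_core_web_sm model: it keeps named entities and phrasal nouns and drops common answers. This is illustrative rather than our exact pipeline.

    # Illustrative candidate-answer filtering with spaCy (en_core_web_sm).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    COMMON_ANSWERS = {"stuff", "thing", "anyone"}  # abbreviated stop list

    def candidate_answers(sentence):
        if sentence.rstrip().endswith("?"):  # skip interrogative sentences
            return []
        doc = nlp(sentence)
        candidates = [ent.text for ent in doc.ents]              # named entities
        candidates += [chunk.text for chunk in doc.noun_chunks]  # phrasal nouns
        return [c for c in candidates if c.lower() not in COMMON_ANSWERS]

    print(candidate_answers("Beyonce was born in Houston, Texas in 1981."))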
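And a minimal sketch of steps 3 and 4, assuming the Hugging Face transformers tokenizer; the marker strings and maximum lengths shown are illustrative.

    # Illustrative input-output pair construction and tokenization for T5.
    from transformers import T5TokenizerFast

    tokenizer = T5TokenizerFast.from_pretrained("t5-small")

    example = {
        "context": "The Norman conquest of England took place in 1066.",
        "question": "When did the Norman conquest of England take place?",
        "answer": "1066",
    }

    # Input: marked context + answer; target: the question.
    source = f"<context> {example['context']} <answer> {example['answer']}"
    model_inputs = tokenizer(source, max_length=512, truncation=True, return_tensors="pt")
    labels = tokenizer(example["question"], max_length=64, truncation=True, return_tensors="pt").input_ids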

Experimental Setup:

    • Preprocessed Data vs. Raw Data: We conducted experiments using both preprocessed and raw SQuAD data to compare the performance of the T5 model under the two conditions. The preprocessed data went through the cleaning steps described above before tokenization and input-output pair generation; the raw data skipped the cleaning steps but was otherwise handled identically.
    • Subset Selection: Due to time constraints, we opted to run our experiments on a subset of the SQuAD dataset rather than the entire corpus. This subset was selected to ensure a representative sample of the data while minimizing computational resources.
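For illustration, loading such a subset with the Hugging Face datasets library might look like the sketch below; the subset size shown is arbitrary, not our exact split.

    # Illustrative subset selection from SQuAD via the datasets library.
    from datasets import load_dataset

    subset = load_dataset("squad", split="train[:5000]")  # arbitrary subset size
    print(subset)
    print(subset[0]["question"], "->", subset[0]["answers"]["text"][0])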

Notebooks: our implementation and experiments are contained in the DB3_Metrics and DB3_Interesting_NN_Visualizations.ipynb notebooks, referenced throughout the next section.

Section 4: Experimental Findings

Model Performance Evaluation

After conducting extensive experiments with our DocuBrain3 model, we evaluated its performance across various metrics. Despite promising initial results, our model encountered challenges in consistently generating well-formed questions from given contexts. Figure 6 showcases an example of a properly generated question, highlighting instances where the model successfully generated coherent questions.

properly_generated_q.png

However, our analysis revealed that such occurrences were infrequent, with many generated questions lacking well-developed answers. Additionally, there were instances where the model produced text in languages other than English, as depicted in Figure 7.

incorrect_generated_q.png

Even though we could not train the model on the full dataset due to limited computing power, we wanted to test the preprocessing methods outlined in the paper, so we compared a model trained on the preprocessed dataset against a model trained on the raw dataset using the same amount of data, the same hyperparameters, and the same training time.

As noted in the research paper, there are currently no good automatic tests of how 'well formed' a question is. Due to the lack of training data, no question-answer pairs met the paper's criteria for an 'adequate QAP,' so we designed our own manual metric: each generated 'question' is assigned to one of the groups below. Note that we do not judge the answers to these questions, because they are frequently nonsensical. The groups are as follows:

  • Valid English Question (VEQ): a question that is grammatically correct English.
  • Valid English Sentence (VES): a sentence that is grammatically correct English.
  • Taken From Context (TFC): a generated 'question' copied directly from the provided context.
  • Taken From Context Non-English (TFCNE): a translation of the context.
  • Valid Non-English Question (VNEQ): a question generated in another language.
  • Valid Non-English Sentence (VNES): a sentence in another language that is not taken from the context.
  • Mixed English Question (MEQ): a grammatically correct question containing both English and non-English words.

Any generated question that does not fall into one of these categories is labelled an 'Invalid Sentence'. When comparing the two models:

num_gen_examples_per_sentence.png

we can see that the unpreprocessed model mostly generated 'Invalid Sentences', while the preprocessed model generated significantly fewer invalid sentences and significantly more TFCNE and VEQ outputs. To better understand this disparity, we looked more closely at the model weights.

The weight analysis behind this comparison, and the machine-translation hypothesis it suggested, are presented in the Visualization and Analysis subsection below.

Metrics Evaluation

To understand the model's performance further, we employed a comprehensive metrics evaluation. The DB3_Metrics notebook facilitated the comparison between preprocessed and unpreprocessed data, providing insight into the impact of preprocessing on model performance. Line graphs of validation loss and training loss for both the preprocessed and unpreprocessed models are presented below.
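The curves were drawn from the losses logged during training; the sketch below shows the general plotting approach, with made-up values standing in for our logs (which come from trainer.state.log_history in our notebook).

    # Sketch of the loss-curve plots; the loss values here are made up and
    # stand in for the values logged during training.
    import matplotlib.pyplot as plt

    steps = [100, 200, 300, 400, 500]
    train_loss = [3.2, 2.4, 1.9, 1.6, 1.5]
    val_loss = [3.0, 2.5, 2.1, 1.9, 1.8]

    plt.plot(steps, train_loss, label="training loss")
    plt.plot(steps, val_loss, label="validation loss")
    plt.xlabel("step")
    plt.ylabel("loss")
    plt.legend()
    plt.savefig("preprocessed_model_loss.png")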

preprocessed_model_loss.png
unpreprocessed_model_loss.png

Visualization and Analysis

Our notebook DB3_Interesting_NN_Visualizations.ipynb generates various visualizations to shed light on the inner workings of our model. Weight visualizations, layer comparisons, and histograms provide valuable insights into the model's behavior and training dynamics. Our goal in creating these visualizations was to uncover any significant differences in weight distribution between our fine-tuned models and the base models. More visualizations can be seen within the notebook itself; some examples are provided below.
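As one example, the t5-small encoder histogram below can be reproduced with a sketch along these lines (the bin count is arbitrary):

    # Sketch: histogram of all t5-small encoder weights.
    import torch
    import matplotlib.pyplot as plt
    from transformers import T5ForConditionalGeneration

    model = T5ForConditionalGeneration.from_pretrained("t5-small")
    weights = torch.cat([p.detach().flatten()
                         for name, p in model.named_parameters()
                         if name.startswith("encoder.")])

    plt.hist(weights.numpy(), bins=100)
    plt.title("t5-small encoder weight distribution")
    plt.savefig("t5_small_encoder_weights.png")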

Fine Tuned Encoder Weights

fine_tuned_encoder_weights.png

Fine Tuned Decoder Weights

fine_tuned_decoder_weights.png

t5 Small Encoder Weights

t5_small_encoder_weights.png

t5 Small Decoder Weights

t5_small_decoder_weights.png

As the weight visualizations show, there were no significant changes between our fine-tuned model and the t5-small model. This may be due to our limited training data and lack of access to resources that would allow us to train our models faster and at greater scale. With more training data, we believe we would be able to show a more significant change in the weight distribution.

Examining the weight distribution alone may not yield substantial insights due to the vast number of weights; minor changes may not be discernible in the distribution graph unless they are significant. To address this limitation, we opted to visualize the disparities in values between the base model and the fine-tuned model to uncover more nuanced observations. The visualizations included in the graphs below are as follows, in sequential order:

  1. Mean absolute value change in weights for each layer across encoder and decoder blocks.
  2. The three layers exhibiting the most considerable differences in weights for both encoder and decoder.
  3. The three layers demonstrating the smallest differences in weights for both encoder and decoder.
mean_diff_per_layer.png

largest_diff_encoder_decoder.png
smallest_diff_encoder_decoder.png
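A minimal sketch of how these per-layer differences can be computed with transformers and PyTorch; the fine-tuned checkpoint path is hypothetical.

    # Sketch: mean absolute weight difference per parameter tensor between a
    # base T5 model and a fine-tuned checkpoint (path is hypothetical).
    import torch
    from transformers import T5ForConditionalGeneration

    base = T5ForConditionalGeneration.from_pretrained("t5-small")
    tuned = T5ForConditionalGeneration.from_pretrained("./docubrain3-finetuned")  # hypothetical

    base_params = dict(base.named_parameters())
    diffs = {}
    for name, param in tuned.named_parameters():
        # Compare only matching tensors; skip anything resized during fine-tuning.
        if name in base_params and base_params[name].shape == param.shape:
            diffs[name] = (param.detach() - base_params[name].detach()).abs().mean().item()

    ranked = sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)
    print("Largest changes:", ranked[:3])
    print("Smallest changes:", ranked[-3:])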

These analyses were conducted on both t5-base and Cheng Zhang's fine-tuned model, which was derived from t5-base. The "Mean Differences Per Layer For Encoder/Decoder" histograms provide insight into the overall transformation of the model during training. By comparing these histograms to an "ideal" state, we can approximate the progress made towards that state. They also show how the change is distributed across layers, which helps in assessing each layer's individual contribution.

The "Largest Difference for Encoder and Decoder" histogram sheds light on the significance of each layer. Significant variations in specific layers may suggest the possibility of freezing less crucial layers without substantially impacting overall model accuracy or enhancement. Similarly, the "Smallest Difference for Encoder and Decoder" histogram serves a comparable purpose to the largest difference histogram but from the opposite perspective.

We hypothesize that the preprocessed model often outputs translations because its weights remain very close to those of the t5-small model, and the T5 series of models is known to perform well at machine translation. This may also be why it sometimes outputs the context directly, as this resembles the summarization capability of the T5 model. More detailed notes can be found in our visualization notebook DB3_Interesting_NN_Visualizations.ipynb.

We conducted a direct comparison between the weight changes observed in the preprocessed model and the unpreprocessed model. Through this analysis, we observed that the preprocessed model's weights remain closer to those of the base model, which led us to hypothesize that the preprocessed model retains some of the base model's machine-translation tendencies.

OG_model_weights_encoder.png
OG_model_weights_decoder.png

This finding suggests that the preprocessing steps may influence the model's behavior, potentially steering it towards behaviors akin to machine translation. Further investigation into the impact of preprocessing on model performance could yield valuable insights into improving question generation capabilities.

Overall, our experimental findings underscore the challenges and opportunities inherent in automatic question generation. While our model demonstrates potential, further refinement and optimization are required to achieve consistent and reliable question generation. By integrating insights from our experimental findings, we aim to drive future improvements and advancements in the field. Through further iterative experimentation and analysis, we believe the capabilities and performance of our DocuBrain3 model could be greatly improved for broader applicability and impact.

Section 5: Conclusion

Through our analysis of DocuBrain3's performance in automatic question generation, it's evident that while the model has shown promising capabilities, further enhancements could be achieved with additional data and computational resources. Access to a larger and more diverse dataset, coupled with extended training duration, holds the potential to significantly improve the quality and coherence of the questions generated. Moreover, increased computing power would facilitate deeper model exploration and refinement, enabling a more nuanced understanding of language patterns and contexts. This, in turn, could lead to improvements in the relevance, complexity, and intelligibility of the questions generated, thereby enhancing the overall utility and effectiveness of DocuBrain3 in educational and interactive learning settings.

Our analysis also sheds light on the impact of preprocessing methods on model performance. Despite not achieving the exact results outlined in the research paper, our comparison of a model using the paper's preprocessing method with an identical model trained under the same conditions revealed notable benefits. Specifically, the preprocessing method significantly reduced the occurrence of invalid sentences while increasing the proportion of valid English questions. This underscores the importance of preprocessing techniques in optimizing model performance and highlights the potential for further refinement in this area.

By automating the process of question generation, we seek to provide a valuable tool for enhancing learning, testing comprehension, and promoting critical thinking. The potential applications of this technology range from educational settings, where it can aid students in studying and educators in assessing comprehension, to broader societal impacts, such as fostering a culture of lifelong learning and inquiry. The Socratic method, renowned for its emphasis on questioning as a pathway to deeper comprehension and critical thinking, serves as our inspiration for this project. Moving forward, future work could explore further refinements in question generation algorithms, integration with existing educational platforms, and empirical studies to evaluate the effectiveness of our approach in enhancing interactive learning outcomes.

References

[1] Cheng Zhang, Hao Zhang, and Jie Wang. Downstream Transformer Generation of Question-Answer Pairs with Preprocessing and Postprocessing Pipelines.

[2] Manisha Divate and Ambuja Salgaonkar. Automatic Question Generation Approaches and Evaluation Techniques.

[3] Siti Fairuz Dalim, Aina Sakinah Ishak, and Lina Mursyidah Hamzah. Promoting Students' Critical Thinking Through Socratic Method: The Views and Challenges.

[4] Fabio Chiusano. A Full Guide to Finetuning T5 for Text2Text and Building a Demo with Streamlit: Hugging Face Hub, Tensorboard, Streamlit, and Hugging Face Spaces.

Team Members