
An Analysis of BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

1. Introduction

BERT, which stands for Bidirectional Encoder Representations from Transformers, is an NLP model developed by Google that marked a significant shift in how machines capture context and nuance in language understanding tasks. Unlike recent language representation models [2] [3], BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

2. Historical background, related work

BERT has its roots in pre-training contextual representations, drawing inspiration from techniques such as semi-supervised sequence learning, generative pre-training, ELMo, and ULMFiT. In contrast to earlier models, BERT is a deeply bidirectional, unsupervised language representation, pre-trained only on a plain text corpus. Unlike context-free models such as word2vec or GloVe, which generate a single embedding for each vocabulary word, BERT produces an embedding for each occurrence of a word based on its context.

The paper groups prior strategies for applying pre-trained language representations to downstream tasks into two families:

Feature-based: the pre-trained representations are supplied as additional features to a task-specific architecture (e.g., ELMo).

Fine-tuning: all pre-trained parameters are fine-tuned on the downstream task, with minimal task-specific parameters added (e.g., OpenAI GPT).

Introduced in 2018, the paper has already been cited over 85,000 times!

3. Biography

Jacob Devlin

Citations: ?, h-index: ?

Software Engineer @ Google

MS CS @ University of Maryland

Ming-Wei Chang

Citations: 98,000+, h-index: 47

Research scientist @ Google

PhD CS @ UIUC

Kenton Lee

Citations: 109,000+, h-index: 32

Research scientist @ Google

PhD CS @ University of Washington

Kristina Toutanova

Citations: 103,000+, h-index: 49

Research scientist @ Google

PhD CS @ Stanford University

4. Diagrams

Transformer Encoder

  • Bi-directional (processes text left-to-right and right-to-left)
  • Autoencoding
  • Prevents words from “seeing themselves” by masking a subset of the input tokens.
  • Cannot trivially be used for text generation.
  • For example, BERT is encoder-only.

Transformer Decoder

  • Unidirectional (processes text in only one direction)
  • Autoregressive
  • Constrains self-attention by masking the tokens to the right of each position.
  • Primarily used for text generation.
  • For example, OpenAI GPT is decoder-only.
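
To make the contrast concrete, here is a small illustrative sketch (not code from the paper) of the two attention masks: a full bidirectional mask, as used by an encoder such as BERT, and a causal mask that hides tokens to the right, as used by a decoder such as GPT.

```python
import torch

seq_len = 5

# Encoder (BERT-style): every position may attend to every other position.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# Decoder (GPT-style): position i may attend only to positions <= i,
# so all tokens to its right are hidden.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(bidirectional_mask.int())
print(causal_mask.int())
```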

Pretraining Dataset

  • BooksCorpus (800M words)
  • English Wikipedia (2,500M words)

4.1 Pre-Training Objectives

Masked LM (MLM)

  • In each sequence, 15% of the WordPiece tokens are chosen at random for masking.

    Only these chosen tokens are predicted, rather than reconstructing the entire input.

  • If the ith token is chosen, then it is replaced with

    (1) the [MASK] token 80% of the time

    (2) a random token 10% of the time

    (3) the unchanged i-th token 10% of the time.

The final hidden vector T_i corresponding to each masked position is then used to predict the original token with a cross-entropy loss.
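
As a hedged illustration (not the authors' exact implementation), the sketch below picks roughly 15% of the tokens in a sequence and applies the three replacement cases:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Apply BERT-style MLM masking; returns masked tokens and prediction targets."""
    masked = list(tokens)
    targets = [None] * len(tokens)          # None = position not selected for prediction
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:     # choose roughly 15% of positions
            targets[i] = tok                # the original token is the prediction target
            r = random.random()
            if r < 0.8:                     # 80% of the time: replace with [MASK]
                masked[i] = "[MASK]"
            elif r < 0.9:                   # 10% of the time: replace with a random token
                masked[i] = random.choice(vocab)
            # remaining 10% of the time: keep the token unchanged
    return masked, targets

tokens = ["my", "dog", "is", "hairy"]
print(mask_tokens(tokens, vocab=["cat", "apple", "runs", "blue"]))
```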

Architecture details

BERT-Base, Uncased: 12-layers, 768-hidden, 12-attention-heads, 110M parameters

BERT-Large, Uncased: 24-layers, 1024-hidden, 16-attention-heads, 340M parameters
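
Assuming the Hugging Face transformers library, these two configurations correspond roughly to the following BertConfig settings (a sketch for reference; the released pre-trained checkpoints already ship with these values):

```python
from transformers import BertConfig

# BERT-Base, Uncased: 12 layers, hidden size 768, 12 attention heads
base_config = BertConfig(hidden_size=768, num_hidden_layers=12,
                         num_attention_heads=12, intermediate_size=3072)

# BERT-Large, Uncased: 24 layers, hidden size 1024, 16 attention heads
large_config = BertConfig(hidden_size=1024, num_hidden_layers=24,
                          num_attention_heads=16, intermediate_size=4096)
```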

Next Sentence Prediction (NSP)

  • In each pretraining example, sentences A and B are provided. In 50% of cases, B is the actual next sentence following A (labeled as IsNext), while in the other 50%, it is a randomly selected sentence from the corpus (labeled as NotNext).
  • Tasks like Question Answering (QA) and Natural Language Inference (NLI) rely on understanding the relationship between two sentences, an aspect not directly captured by language modeling. Pre-training therefore includes a binarized next sentence prediction task that can be generated trivially from any monolingual corpus.
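
A minimal sketch of how such binarized sentence pairs could be drawn from a monolingual corpus (illustrative only; the corpus handling and sampling details are simplified):

```python
import random

def make_nsp_example(documents):
    """Build one (sentence_a, sentence_b, label) pair for next sentence prediction."""
    doc = random.choice(documents)
    idx = random.randrange(len(doc) - 1)
    sentence_a = doc[idx]
    if random.random() < 0.5:                      # 50%: the actual next sentence
        return sentence_a, doc[idx + 1], "IsNext"
    random_doc = random.choice(documents)          # 50%: a random sentence from the corpus
    return sentence_a, random.choice(random_doc), "NotNext"

docs = [["He went to the store.", "He bought a gallon of milk.", "Then he walked home."],
        ["Penguins are flightless birds.", "They live mostly in the Southern Hemisphere."]]
print(make_nsp_example(docs))
```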

4.2 Input Representation for BERT

Before feeding the input to BERT, we convert it into embeddings using three embedding layers: WordPiece token embeddings, segment embeddings (sentence A or B), and position embeddings, which are summed element-wise.
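
A minimal PyTorch sketch of these three layers being summed (sizes shown are those of BERT-Base; the LayerNorm and dropout that follow in the real model are omitted):

```python
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    """Sum of token, segment (sentence A/B), and position embeddings."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(2, hidden)        # sentence A = 0, sentence B = 1
        self.position = nn.Embedding(max_len, hidden)

    def forward(self, input_ids, segment_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        return (self.token(input_ids)
                + self.segment(segment_ids)
                + self.position(positions))

emb = BertEmbeddings()
ids = torch.tensor([[101, 2023, 2003, 102]])          # [CLS] this is [SEP]
print(emb(ids, torch.zeros_like(ids)).shape)          # torch.Size([1, 4, 768])
```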

4.3 Fine-tuning BERT

We use the pre-trained BERT model, append an untrained layer to the end, and train the modified model for our classification task. This approach offers several advantages over training a task-specific deep learning model.

Advantages of Fine-tuning

  1. Rapid Development:
    • Pre-trained BERT model weights already capture extensive language information.
    • Fine-tuning is quick because the bottom layers are already well trained; we mostly adjust the top layers and the new output layer.
    • Authors recommend only 2-4 epochs for fine-tuning, saving substantial time compared to training from scratch.
  2. Data Efficiency:
    • Fine-tuning on BERT's pre-trained weights requires a smaller dataset than training a model from scratch.
    • Overcomes the need for a large dataset, a common challenge in training NLP models from the ground up.
  3. Superior Results:
    • Simple fine-tuning, involving adding one fully-connected layer to BERT and training briefly, achieves state-of-the-art results.
    • Outperforms or matches custom architectures designed for specific tasks without the need for intricate adjustments.
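
As a sketch of what appending an untrained layer to BERT looks like in practice (assuming the Hugging Face transformers BertModel; this mirrors, in simplified form, what the BertForSequenceClassification class used later in Section 9 does internally):

```python
import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    """Pre-trained BERT with one new, untrained classification layer on top."""
    def __init__(self, num_labels=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.pooler_output        # representation of the [CLS] token
        return self.classifier(pooled)        # logits over the task labels
```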

5. Why is BERT important?

6. Societal Impact

Positive Impact

Negative Impact

7. Industry Applications

8. Follow-on Research

Note: The following papers were released within one year of BERT (May 2019 - Oct 2019)

9. Implementations/Results

9.1 Fine-tuning BERT for a single-sentence classification task using the CoLA dataset:

We used The Corpus of Linguistic Acceptability (CoLA) dataset for the classification of single sentences. This dataset consists of sentences annotated as either grammatically correct or incorrect. Initially released in May 2018, it serves as one of the assessments within the "GLUE Benchmark," where models such as BERT participate in competition.

For fine tuning, our initial step involves adapting the pre-trained BERT model to generate classification outputs. Subsequently, we aim to further train the model on our dataset until the entire model, from start to finish, is tailored to our specific task.

The PyTorch implementation by Hugging Face provides a range of interfaces tailored for diverse NLP tasks. While these interfaces are all built on top of a pre-trained BERT model, each has distinct top layers and output formats customized to its respective NLP task. We use the BertForSequenceClassification class in our case.

We chose a batch size of 32, a learning rate of 2e-5, and 4 epochs, as suggested in the paper, and evaluated predictions using the Matthews correlation coefficient (MCC), the standard metric for CoLA in the NLP community. We obtained an MCC of 0.514, compared to 0.521 reported in the paper.
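
The core of our fine-tuning setup looks roughly like the sketch below (data loading, batching, and the learning-rate schedule are omitted; matthews_corrcoef from scikit-learn provides the evaluation metric):

```python
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.metrics import matthews_corrcoef

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)

def train_step(sentences, labels):
    """One gradient step on a batch of CoLA sentences (0 = unacceptable, 1 = acceptable)."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch, labels=torch.tensor(labels))
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

def evaluate(sentences, labels):
    """Score predictions with the Matthews correlation coefficient."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        preds = model(**batch).logits.argmax(dim=-1)
    return matthews_corrcoef(labels, preds.numpy())
```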

9.2 Fine-tuning BERT for a multiple-choice question answering task using the SWAG dataset:

We used the Situations With Adversarial Generations (SWAG) dataset, in which the model selects the correct answer from several candidates in a multiple-choice question answering setting. Each example consists of a context (start) sentence followed by four candidate endings, from which the model must choose the most plausible.

The dataset was initially released in August 2018. Here we use the BertForMultipleChoice class from the Hugging Face PyTorch implementation. We chose a batch size of 16, a learning rate of 2e-5, and 3 epochs, as suggested in the paper, and evaluated predictions using accuracy. We obtained an accuracy of 0.791, compared to 0.816 reported in the paper.
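
The key difference from the single-sentence case is the input shape: BertForMultipleChoice expects one encoded (context, ending) pair per candidate, i.e. tensors of shape (batch, num_choices, seq_len). A hedged sketch with a made-up example:

```python
import torch
from transformers import BertTokenizer, BertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMultipleChoice.from_pretrained("bert-base-uncased")

context = "The man opened the fridge."
endings = ["He took out a carton of milk.", "He started the car.",
           "He dove into the pool.", "He answered the phone."]

# Encode the same context against each of the four candidate endings.
enc = tokenizer([context] * len(endings), endings,
                padding=True, truncation=True, return_tensors="pt")
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}     # shape (1, 4, seq_len)

with torch.no_grad():
    logits = model(**inputs).logits                      # shape (1, 4): one score per choice
print(logits.argmax(dim=-1))
```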

10. Comparison/Ablation Studies

The ablation study compares pre-training tasks using the BERT-Base architecture:

  • “No NSP” is trained without the next sentence prediction task.
  • “LTR & No NSP” is trained as a left-to-right LM without the next sentence prediction, like OpenAI GPT.
  • “LTR & No NSP + BiLSTM” adds a randomly initialized BiLSTM on top of the “LTR & No NSP” model during fine-tuning.

We can see that the left-to-right model performs poorly on SQuAD, since token-level predictions there lack right-side context; adding the BiLSTM partially mitigates this by reintroducing bidirectional context at fine-tuning time. Removing NSP hurts performance on QNLI, MNLI, and SQuAD, and the bidirectional MLM objective accounts for most of the remaining gains over the LTR model.

BERT-Base achieves almost 1.0% additional accuracy on MNLI when trained on 1M steps compared to 500k steps. The MLM model begins to outperform the LTR model almost immediately.

The HellaSwag paper also reports BERT validation accuracy when trained and evaluated on several ablated versions of SWAG, with the new HellaSwag dataset as a comparison, where:

  • Ending Only: No context is provided; just the endings.
  • Shuffled: Endings that are individually tokenized, shuffled, and then detokenized.
  • Shuffled + Ending Only: No context is provided and each ending is shuffled.

BERT’s performance falls by only 11.9% when context is omitted (Ending Only), suggesting that a bias exists in the endings themselves: if a model can tell which ending is real without any context, then the space of machine-generated endings must be remarkably different from that of human-written ones.

11. Paper insights

SWAG is a commonsense NLI dataset. A model is given a context from a video caption and four ending options for what might happen next for each question. Only one option is correct: the video's actual next caption. Previous research (e.g., Gururangan et al., 2018; Poliak et al., 2018) discovered that when humans write the endings to NLI questions, they introduce subtle but significant class-conditional biases known as annotation artifacts.

Zellers et al. (2018) proposed Adversarial Filtering (AF) to address this. The main idea is to create a dataset D that is adversarial for any arbitrary split of (D_train, D_test); importantly, regardless of the final split, AF produces a dataset that is difficult to model. The HellaSwag authors introduce HellaSwag, a new NLI dataset that uses AF as the underlying workhorse, one that is easy for humans but difficult for machines. When measured on SWAG with varying training set sizes, BERT outperforms the ELMo NLI model (Chen et al., 2017) given only 64 examples. BERT, on the other hand, requires upwards of 16k examples to approach human performance, after which it plateaus.

SWAG's endings were originally generated using a language model and then chosen to fool a discriminator (a two-layer LSTM). This setup was resistant to ELMo models, but the shallow LM produced distributional artifacts that BERT picks up on. To investigate this, the HellaSwag authors applied AF in two settings, comparing generations from Zellers et al. (2018) with those from a fine-tuned GPT (Radford et al., 2018). Surprisingly, the results revealed that the generations used in SWAG are so dissimilar to the human-written endings that AF never reduces accuracy to chance, instead settling at around 75%. GPT's generations, on the other hand, are good enough that BERT accuracy drops below 30% over many random subsplits of the data, highlighting the importance of the generator.

12. Peer-Review

Merits

Problems

13. References

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv.org (2018, October 11).

[2] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer. Deep contextualized word representations. arXiv.org (2018, March 22).

[3] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever. Improving Language Understanding by Generative Pre-Training. OpenAI (2018, June 11).

[4] Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, Qun Liu. ERNIE: Enhanced Language Representation with Informative Entities. arXiv.org (2019, May 17).

[5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv.org (2019, June 19).

[6] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv.org (2019, July 26).

[7] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv.org (2019, September 26).

[8] Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv.org (2019, October 2).

[9] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv.org (2019, October 23).

14. Team Members

Chandra Teja Kommineni, Debajyoti Chakraborty, Ekam Chahal

Code can be found here