🪨 Prehistoric Era of NLP: Rule-Based Systems
- Used handcrafted grammar/syntax rules.
- Struggled to generalize across domains.
- Couldn’t learn from data, since every rule was hardcoded.
Contrastive Pretraining for Sentence-Level Understanding: A BERT-Based Extension.
The Next Sentence Prediction (NSP) objective in BERT has shown limited effectiveness in capturing real-world sentence coherence. Can contrastive learning replace BERT's Next Sentence Prediction to improve sentence-level semantic representation?
This project provides a comprehensive breakdown of the BERT paper, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," authored by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova at Google AI. The project analyzes BERT's methodology, historical context, and real-world applications through an illustrated and interactive blog.

It builds upon the foundational BERT framework, which introduced masked language modeling (MLM) and next sentence prediction (NSP) for learning bidirectional contextual representations. While BERT achieved remarkable performance across a wide range of NLP benchmarks, the NSP objective has shown limitations in effectively modeling inter-sentence semantics. To address this, our work proposes a contrastive learning extension that replaces NSP with a sentence-level triplet loss, leveraging hard negative sampling and mean-pooled token representations passed through a projection head. Our approach improves semantic separation, as demonstrated through t-SNE visualizations and contrastive similarity tests, and is further fine-tuned and evaluated on the MNLI benchmark.

We also explored token-level adaptation of BERT for Named Entity Recognition (NER) by fine-tuning on the CoNLL-2003 dataset without using [MASK] tokens. This experiment achieved strong F1 scores across entity types and demonstrated that BERT generalizes effectively to sequence labeling tasks without a pretrain–fine-tune token mismatch. Together, these experiments explore and extend BERT's capabilities at both the sentence level and the token level.

Acknowledgment: I would like to thank Professor David Bau for his insightful lectures and valuable feedback, which greatly shaped the direction and clarity of this project.
"BERT understands natural language."
BERT was the result of decades of progress, from rule-based systems to statistical models to deep learning. It introduced the pretraining and fine-tuning paradigm that powers today's NLP giants like RoBERTa, ALBERT, DistilBERT, ModernBERT, and even GPT models.
The original BERT model, as proposed by Devlin et al., is pre-trained using two unsupervised objectives: Masked Language Modeling (MLM), where random tokens in a sentence are masked and predicted, and Next Sentence Prediction (NSP), where the model classifies whether a given sentence B logically follows sentence A. While MLM enables BERT to learn rich token-level representations, the NSP objective has been criticized for offering limited value in capturing meaningful inter-sentence relationships, often relying on superficial co-occurrence patterns, as later discussed by Liu et al. in RoBERTa. In our work, we replace the NSP objective with a contrastive learning approach inspired by Gao et al. (SimCSE), which learns to embed semantically similar sentences closer together and dissimilar ones farther apart using a triplet loss over [CLS] representations. We enhance this by incorporating mean pooling (instead of relying solely on the [CLS] token), adding a projection head to shape the embedding space, and introducing hard negatives drawn from different documents to boost contrastive separation. This results in a more semantically aware and discriminative sentence encoder, which generalizes better across downstream tasks like Natural Language Inference (MNLI), outperforming models trained with the original NSP objective.
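To make the replacement concrete, here is a minimal sketch of the contrastive setup described above, assuming a PyTorch/HuggingFace environment. The class name ContrastiveBERT, the margin value, and the sequence length of 64 are illustrative choices; the mean pooling and the 768→256→128 projection head follow the description and the comparison table later in this post.

```python
# Minimal sketch of the contrastive objective (PyTorch + HuggingFace transformers).
# ContrastiveBERT and the hyperparameters here are illustrative, not the exact project code.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class ContrastiveBERT(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        # Projection head: 768 -> 256 -> 128, as listed in the comparison table.
        self.proj = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 128))

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Mean pooling over non-padding tokens instead of relying on [CLS] alone.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.proj(pooled)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = ContrastiveBERT()
loss_fn = nn.TripletMarginLoss(margin=1.0)

def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=64, return_tensors="pt")
    return model(batch["input_ids"], batch["attention_mask"])

# Anchor/positive come from adjacent sentences of one article; the hard negative from another.
anchor = embed(["The sun is hot."])
positive = embed(["It provides light and warmth."])
negative = embed(["Doctors read journals."])
loss = loss_fn(anchor, positive, negative)
loss.backward()
```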
As an additional experiment, we fine-tuned a pretrained BERT-base-cased model for the task of Named Entity Recognition (NER) using the CoNLL-2003 dataset, which includes annotated sequences for entities such as persons (PER), organizations (ORG), locations (LOC), and miscellaneous (MISC). Building on the foundational work by Devlin et al. (2019), which introduced BERT with Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), our approach eliminates reliance on the [MASK] token entirely to avoid the pretraining–fine-tuning mismatch often seen in token classification tasks. Inspired by efforts like Akbik et al. (2018) and Souza et al. (2020), we adopt BERT’s 12-layer bidirectional transformer encoder and apply a simple linear classification head for token-level prediction, rather than more complex alternatives like CRF. The model was trained and evaluated directly on real, unmasked text sequences and achieved an overall F1 score of 93%, with especially strong results on person and location entities. This aligns with recent trends such as Yamada et al. (2020) and Wang et al. (2021), which explore more natural and efficient ways to adapt large language models for downstream tasks like NER.
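For reference, a condensed sketch of this NER fine-tuning setup using the HuggingFace datasets and transformers libraries is shown below; the training hyperparameters are assumptions rather than the exact values used in our runs. AutoModelForTokenClassification adds exactly the simple linear classification head described above.

```python
# Sketch of BERT fine-tuning for NER on CoNLL-2003 (no [MASK] tokens involved).
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

dataset = load_dataset("conll2003")
labels = dataset["train"].features["ner_tags"].feature.names  # O, B-PER, I-PER, ...
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased",
                                                        num_labels=len(labels))

def tokenize_and_align(examples):
    tokenized = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    tokenized["labels"] = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        # Label only the first sub-token of each word; ignore the rest with -100.
        prev, label_ids = None, []
        for wid in word_ids:
            label_ids.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        tokenized["labels"].append(label_ids)
    return tokenized

encoded = dataset.map(tokenize_and_align, batched=True)
trainer = Trainer(
    model=model,
    args=TrainingArguments("ner-bert", num_train_epochs=3, per_device_train_batch_size=16),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```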
Input: Wikitext-2 (Wikipedia-based dataset), used to extract sentence pairs for learning inter-sentence relationships.
Sentences are split into anchor, positive, and hard negative pairs from the same or different articles.
The model learns to pull the anchor and positive close together and push hard negatives away in the embedding space (pair construction is sketched below).
Based on BertModel (no classification head).
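The pair construction could look roughly like the following sketch; the sentence-splitting heuristic and sampling logic are simplified illustrations of the procedure described above, not the exact project code.

```python
# Illustrative construction of (anchor, positive, hard negative) triplets from Wikitext-2.
import random
from datasets import load_dataset

raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def to_sentences(text):
    # Naive sentence splitting for illustration only.
    return [s.strip() for s in text.split(". ") if len(s.strip()) > 20]

# Keep only "articles" with at least two usable sentences.
articles = [s for s in (to_sentences(t) for t in raw["text"] if len(t) > 200) if len(s) >= 2]

def sample_triplet(articles):
    sents = random.choice(articles)
    i = random.randrange(len(sents) - 1)
    anchor, positive = sents[i], sents[i + 1]              # adjacent sentences, same article
    other = random.choice([a for a in articles if a is not sents])
    hard_negative = random.choice(other)                   # sentence from a different article
    return anchor, positive, hard_negative

print(sample_triplet(articles))
```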
Figure 1: Simplified training and fine-tuning pipeline of the original BERT.
Figure 2: Overall pre-training and fine-tuning procedures for BERT (Devlin et al.).
Figure 3: Simplified training and fine-tuning pipeline of contrastive BERT.
Figure 4: Sentence embeddings are generated with BERT using mean pooling and an MLP projection head before the contrastive loss is applied.
To better understand BERT’s capabilities, we ran multiple experiments analyzing its pre-trained behavior and how it responds to different sentence pairs, comparing the original Next Sentence Prediction (NSP) head against our contrastive BERT.
The comparison showcases how our contrastive BERT model, enhanced with mean pooling, a projection head, and triplet loss, outperforms the original BERT model in distinguishing semantically coherent sentence pairs. As shown in the highlighted examples, the original BERT model assigns a high NSP score (1.0) to nearly all pairs, even when the two sentences are semantically unrelated (e.g., “Birds build nests.” vs. “Doctors read journals.”). In contrast, our model produces more context-aware cosine similarity scores—lower for mismatched pairs and higher for genuinely related ones. This indicates that our model better captures sentence-level semantics and is more robust against superficial co-occurrence. The t-SNE plot further illustrates this improvement, revealing tighter clustering of similar sentence pairs, and better separation of dissimilar ones, based on the learned embeddings.
Our model’s generalization is validated through downstream evaluations. The MNLI confusion matrix highlights balanced classification performance across entailment, neutral, and contradiction classes—showing strong adaptation to NLI tasks after fine-tuning. Furthermore, our Named Entity Recognition (NER) module, which reuses the pretrained encoder, achieves industry-grade scores with an overall F1 of 93%, including 96% on PER entities and 90.9% on ORG. These results confirm that the enriched sentence representations produced by our contrastive objective lead to improved performance not only on semantic similarity tasks but also on practical applications like NER. Together, these findings underscore the effectiveness of replacing NSP with contrastive training and demonstrate the broader utility of our model across multiple NLP benchmarks.
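As an illustration of the MNLI fine-tuning step behind these results, a minimal sketch using the GLUE loader is shown below; the checkpoint name "contrastive-bert" is a placeholder for the encoder produced by the contrastive stage, and the hyperparameters are illustrative assumptions.

```python
# Sketch of MNLI fine-tuning on top of the contrastively trained encoder.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

mnli = load_dataset("glue", "mnli")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# "contrastive-bert" is a hypothetical local checkpoint from the triplet-loss stage.
model = AutoModelForSequenceClassification.from_pretrained("contrastive-bert", num_labels=3)

def tokenize(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, padding="max_length", max_length=128)

encoded = mnli.map(tokenize, batched=True)
trainer = Trainer(
    model=model,
    args=TrainingArguments("mnli-finetune", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation_matched"],
)
trainer.train()
```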
Sentence A: "The sun is hot."
Sentence B: "It provides light and warmth."
Original BERT NSP Score (IsNext): 1.0000
Contrastive BERT Cosine Similarity: 0.9651

Sentence A: "Birds build nests."
Sentence B: "Doctors read journals."
Original BERT NSP Score (IsNext): 1.0000
Contrastive BERT Cosine Similarity: 0.5733

Figure 5: t-SNE visualization showing better separation of sentence embeddings.
Figure 6: NER predictions powered by our model achieve excellent entity-level F1 scores.
Figure 7: Balanced performance across entailment, neutral, and contradiction labels after MNLI fine-tuning.
Figure 8: Contrastive BERT yields more nuanced similarity scores across varied sentence pairs.
Figure 9: Console demo comparing BERT’s NSP vs. contrastive scores for real-world sentence inputs.
Figure 10: Contrastive BERT training pipeline with hard negatives and triplet loss for semantically rich sentence embeddings.
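A rough reconstruction of the console comparison shown in Figure 9 might look like the following sketch; here the vanilla bert-base-uncased encoder stands in for the contrastively trained encoder, which in practice would be loaded from its own checkpoint.

```python
# Hypothetical reconstruction of the Figure 9 console demo: NSP probability from the
# original BERT head vs. cosine similarity of mean-pooled sentence embeddings.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel, BertForNextSentencePrediction

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
nsp_model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")  # stand-in for the contrastive encoder

@torch.no_grad()
def nsp_score(a, b):
    batch = tokenizer(a, b, return_tensors="pt")
    logits = nsp_model(**batch).logits
    return torch.softmax(logits, dim=-1)[0, 0].item()  # index 0 = "IsNext"

@torch.no_grad()
def cosine_score(a, b):
    def embed(s):
        batch = tokenizer(s, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state
        mask = batch["attention_mask"].unsqueeze(-1).float()
        return (hidden * mask).sum(1) / mask.sum(1)  # mean pooling over real tokens
    return F.cosine_similarity(embed(a), embed(b)).item()

a, b = "Birds build nests.", "Doctors read journals."
print(f"NSP (IsNext): {nsp_score(a, b):.4f}  Cosine: {cosine_score(a, b):.4f}")
```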
The table below compares our proposed contrastive BERT model with well-established baselines such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), SimCSE (Gao et al., 2021), and a modern large-scale transformer, ModernBERT (Zhang et al.). We highlight architectural differences, sentence-level objectives, contrastive learning strategies, and performance-focused optimizations to show how our model prioritizes semantic separation and fine-tuning efficiency.
Aspect | BERT (Devlin et al., 2019) | RoBERTa (Liu et al., 2019) | SimCSE (Gao et al., 2021) | ModernBERT | Our Work
---|---|---|---|---|---
Sentence-Level Objective | NSP | None | Contrastive (Dropout/NLI) | None (pure MLM) | ✅ Triplet Loss (Anchor + Hard Negative) |
Pooling | [CLS] token | [CLS] token | [CLS] token | Standard output embeddings | ✅ Mean Pooling over tokens |
Negatives | N/A | N/A | Batch / Random Negatives | N/A | ✅ Hard Negatives from other documents |
Projection Head | None | None | 2-layer MLP | None | ✅ MLP: Linear(768→256) → ReLU → Linear(256→128) |
Used for MNLI | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes (Fine-tuned with classifier) |
Base Architecture | BERT | RoBERTa | BERT | Enhanced Transformer (RoPE + GeGLU) | BERT |
Pretraining Tasks | MLM + NSP | MLM (No NSP) | MLM + Contrastive | MLM only (30% masking) | ✅ MLM + Triplet Loss |
Contrastive Training | No | No | Yes (Dropout-based) | No | ✅ Yes (Triplet loss with hard negatives) |
Sequence Length | 512 | 512 | 512 | 8192 (Extended) | 64 |
Tokenizer | BERT Tokenizer | BERT Tokenizer | BERT Tokenizer | Custom BPE | BERT Tokenizer |
Data Size | Wikipedia + BookCorpus | Large corpus, no NSP | Small (MNLI + Wikipedia) | 2T tokens, diverse sources | Small subset (Wikitext + MNLI) |
Evaluation Benchmarks | GLUE (MNLI) | GLUE | GLUE, STS tasks | GLUE, BEIR, CodeSearchNet, StackQA | GLUE (MNLI), similarity tasks |
Focus | Bidirectional context modeling | Corpus scaling + robust pretraining | Sentence embedding training | Efficiency in IR/NLU tasks | ✅ Sentence-level semantics via contrastive learning |
Building on this, a potential academic project could explore hierarchical BERT models that chunk and encode long documents, then reassemble the segments using a global attention layer. This would allow BERT to tackle full-length articles or reports, a clear step forward from its sentence-level focus.
This opens doors to BERT-inspired models that process visual and auditory data alongside text. A promising research direction would be developing a unified encoder that captures cross-modal alignments for applications like image captioning or video Q&A.
Inspired by these works, a natural academic project would involve optimizing BERT through distillation, quantization, and pruning techniques for low-resource environments like smartphones or edge devices—where speed and memory matter most.
This motivates future research into multilingual BERT models trained using meta-learning techniques to better generalize in low-resource language settings, especially for zero-shot translation and semantic understanding.
Given the societal impact of large language models, an important academic extension would be to explore how interpretability tools and fairness-aware training strategies can reduce bias in BERT's predictions and make its decisions more transparent.
Expanding BERT's Context Window: While BERT is effective at understanding sentence-level context, future research could explore expanding its context window to handle entire documents. A modified model could incorporate hierarchical attention mechanisms to capture broader textual dependencies.
Multimodal BERT: An exciting avenue of research would be extending BERT into multimodal learning, integrating text with vision and audio. A "VisualBERT" or "Multimodal-BERT" could be trained to jointly understand text and images for applications in automated video summarization, image captioning, and human-computer interaction.
Memory-Efficient BERT Variants: While BERT delivers state-of-the-art results, its computational cost remains a challenge. Research could focus on developing efficient variants of BERT using model pruning, distillation, and quantization techniques to make deployment feasible on edge devices and mobile applications.
Cross-Language Adaptability: Current multilingual BERT models still struggle with zero-shot language understanding in low-resource languages. Future work could focus on improving BERT’s transfer learning capabilities by incorporating meta-learning techniques or unsupervised domain adaptation.
Ethical and Interpretability Studies: As BERT is increasingly used in real-world applications, understanding its biases and decision-making processes is critical. Future research could explore ways to enhance interpretability and develop fairness-aware training methods to mitigate unintended biases.
Personally, I found that explicitly modeling inter-sentence relationships with hard negatives not only made the model more interpretable but also opened new directions for understanding sentence semantics beyond simple binary classification. One question that emerged from my work is whether combining contrastive objectives with span-level pretraining could further enhance both sentence- and token-level understanding. This experiment strengthened my interest in representation learning and leaves room for future exploration in hybrid objectives and multilingual settings.

By addressing these research challenges, future iterations of BERT could become more efficient, adaptable, and versatile, paving the way for broader AI applications in healthcare, education, and creative industries.