Yinan Chen's A Sentiment Analysis of Medical Text Based on Deep Learning [1] presents a neural network model that combines pre-existing ideas (such as BERT, convolutional neural networks, and fully connected networks) to tackle the problem that medical discourse is often too complex and nuanced to be analyzed with traditional sentiment analysis tools.
The objective of this investigation is to determine whether the conclusions from Chen's paper, specifically about the effectiveness of combining BERT encoding with a convolutional neural network to maximize the accuracy of sentiment analysis, can be generalized to tweets about a wide variety of subjects, which carry a correspondingly wide range of sentiments. The paper emphasized the specific reasons why medical discourse is understudied and suggested that the combination of BERT with a convolutional network alleviates these concerns, but it is worth investigating whether this combination would also maximize sentiment analysis accuracy on a set of tweets covering a broad range of topics, or whether CNNs are inherently superior in circumstances where sentiment is not easily determined.
The tweet samples used in this investigation come from a web-scraped dataset hosted on Kaggle, where the sentiment score was calculated using unigram and bigram analysis and various classification techniques, with the support vector machine (SVM) being the most accurate [2]. These unigrams and bigrams were compared with a dataset from TwittRatr, which compiled the most commonly flagged positive- and negative-sentiment words from the Twitter API.
The paper discusses the potential challenges of conducting accurate sentiment analysis, focusing specifically on medical texts. Understanding sentiment, whether positive, negative, or neutral, in medical discourse is crucial for enhancing service quality and patient satisfaction. However, sentiment analysis in the medical domain presents unique challenges, including limited datasets and the intricate nature of medical language. These problems are amplified in two additional ways when medical discourse takes to social media sites like Twitter. First, there are often many parties involved, and within a single tweet a user might address multiple parties with varying sentiments, such that the composite message has no clear overall sentiment. Second, although the discussion of medical topics should be nuanced, evidence-based, and grounded in objective scientific fact, discussion on social media is frequently clouded by personal opinions and biases.
To tackle these challenges, the author, Yinan Chen, employs deep learning techniques, using the BERT (Bidirectional Encoder Representations from Transformers) model as a foundation and integrating it with several architectures, such as convolutional neural networks (CNNs), fully connected networks (FCNs), and graph convolutional networks (GCNs). Each architecture offers distinct advantages in processing textual data, ranging from capturing local features to understanding complex semantic relationships.
The experiments are conducted on the METS-CoV dataset, a collection of medical-related tweets concerning COVID-19. This dataset provides a source of real-world medical discourse, allowing for comprehensive analysis of sentiment patterns. The findings reveal that CNN models, when combined with BERT, exhibit superior performance in sentiment analysis tasks when compared with other deep learning techniques such as GCN and FCN models, particularly on smaller medical text datasets like METS-CoV. This highlights the significance of model selection and architecture in achieving effective sentiment analysis results within the medical domain. Overall, the paper provides valuable insights into sentiment analysis in medical discourse, offering a roadmap for future research endeavors in NLP applications within the medical field. By bridging the gap between deep learning techniques and medical discourse analysis, this study opens the possibility for developing more efficient and accurate sentiment analysis and generative language models specifically tailored to the nuances of healthcare communication.
The code begins by loading 10,000 randomly sampled tweets from the dataset, each with a pre-assigned positive or negative sentiment classification. Although this implementation aimed to emulate Chen's methodology as closely as possible, it deviated slightly from what was outlined in the original paper. First, this dataset did not include a "neutral" sentiment; this conveniently simplifies the problem into a case study of whether Chen's results might generalize, rather than a decisive test of their generalizability. Second, the classifications in this dataset did not account for nuanced sentiment, in which a single tweet simultaneously addresses multiple parties with independent sentiments; this is a further simplification of Chen's methodology.
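As a concrete illustration of the loading step, the sketch below reads the Kaggle file and binarizes the labels. The file name, column layout, and label encoding (0 = negative, 4 = positive, as in Sentiment140-style dumps) are assumptions and may differ from the exact dataset used here.

```python
import pandas as pd

# Assumed Sentiment140-style file and column layout; the actual Kaggle dump may differ.
df = pd.read_csv(
    "tweets.csv", encoding="latin-1",
    names=["polarity", "id", "date", "query", "user", "text"],
)

# Randomly sample 10,000 tweets and map the polarity column to a binary label.
sample = df.sample(n=10_000, random_state=42)
sample["label"] = (sample["polarity"] == 4).astype(int)  # 1 = positive, 0 = negative
tweets, labels = sample["text"].tolist(), sample["label"].tolist()
```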
Despite these differences in implementation, after the tweets were loaded into a dataset, a BERT encoder was trained on the input tweet data and then used to convert the tweets into vectors suitable for training the two different neural networks. The tweet dataset was split into an 80% training set and a 20% test set to be fed into the neural networks for comparison. CNN and FCN classes were created as subclasses of the PyTorch neural network module, and the BERT-encoded tweets were fed into these networks for training.
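A minimal sketch of the encoding and splitting step is given below, using the Hugging Face transformers library and scikit-learn. The maximum sequence length, batching, and the use of the [CLS] embedding as the sentence vector are assumptions, and any fine-tuning of BERT on the tweet data is omitted for brevity.

```python
import torch
from sklearn.model_selection import train_test_split
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def encode(texts, max_len=64, batch_size=64):
    """Encode tweets into fixed-size [CLS] vectors (one 768-dim vector per tweet)."""
    chunks = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], padding="max_length",
                          truncation=True, max_length=max_len, return_tensors="pt")
        chunks.append(bert(**batch).last_hidden_state[:, 0, :])
    return torch.cat(chunks)

X = encode(tweets)                         # `tweets` and `labels` from the loading step
y = torch.tensor(labels)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80% training / 20% test split
```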
The CNN consists of three convolutional layers, with output dimensions of 128, 256, and 528, respectively, each followed by a ReLU activation function and a max-pooling layer to down-sample the feature maps. The output of the third convolutional layer is flattened and passed through two fully connected layers, the first of which applies a ReLU activation function and dropout regularization to prevent overfitting. The final output, representing class predictions, is obtained from the second fully connected layer. The FCN architecture is much simpler, consisting of three linear layers with ReLU activations after the first and second layers. The first layer has an output dimension of 1000 and the second of 500, with the third layer producing two outputs, one for each possible sentiment classification.
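The sketch below implements the two architectures as described above. The kernel sizes, padding, the hidden width of the CNN's first fully connected layer, and the 768-dimensional input (BERT's sentence vector size) are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class CNN(nn.Module):
    """Three Conv1d blocks (128, 256, and 528 output channels), each followed by
    ReLU and max pooling, then two fully connected layers with dropout."""
    def __init__(self, n_classes=2, dropout=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(128, 256, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(256, 528, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
        )
        # A 768-dim input halves three times under max pooling: 768 -> 384 -> 192 -> 96.
        self.fc1 = nn.Linear(528 * 96, 256)
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(256, n_classes)

    def forward(self, x):                   # x: (N, 768) BERT sentence vectors
        x = self.features(x.unsqueeze(1))   # add a channel dimension -> (N, 1, 768)
        x = self.dropout(torch.relu(self.fc1(x.flatten(1))))
        return self.fc2(x)

class FCN(nn.Module):
    """Three linear layers (768 -> 1000 -> 500 -> 2) with ReLU after the first two."""
    def __init__(self, in_dim=768, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1000), nn.ReLU(),
            nn.Linear(1000, 500), nn.ReLU(),
            nn.Linear(500, n_classes),
        )

    def forward(self, x):
        return self.net(x)
```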
Both the FCN and CNN were trained for five epochs with the Adam optimizer, using cross-entropy loss as the loss function. The networks classified the tweets by predicted sentiment, and the aggregate accuracy and F1 scores were calculated for comparison with the results of Chen's investigation.
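A sketch of this training and evaluation loop follows; the batch size and the use of macro-averaged F1 are assumptions.

```python
import torch
from sklearn.metrics import accuracy_score, f1_score
from torch.utils.data import DataLoader, TensorDataset

def train_and_evaluate(model, X_train, y_train, X_test, y_test,
                       epochs=5, lr=1e-3, batch_size=32):
    """Train with Adam and cross-entropy loss, then report accuracy and F1."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    loader = DataLoader(TensorDataset(X_train, y_train),
                        batch_size=batch_size, shuffle=True)

    model.train()
    for _ in range(epochs):
        for xb, yb in loader:
            optimizer.zero_grad()
            loss_fn(model(xb), yb).backward()
            optimizer.step()

    model.eval()
    with torch.no_grad():
        preds = model(X_test).argmax(dim=1)
    return accuracy_score(y_test, preds), f1_score(y_test, preds, average="macro")

# e.g. cnn_acc, cnn_f1 = train_and_evaluate(CNN(), X_train, y_train, X_test, y_test)
```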
Despite the differences between the implementation laid out in Chen's paper and this investigation, comparing the sentiment accuracy results of the two can still provide insight into the generalizability of Chen's work. Chen considered two different variations of the BERT encoder: bert-base-uncased, which came from a "pretrained modeling suite", and COVID-TWITTER-BERT, whose hyperparameters were tuned specifically for the analysis of online COVID-19 discourse. Because the research question concerns the generalizability of Chen's results, the results of this investigation are compared to those of bert-base-uncased. Chen's experimental setup was replicated to the best of the authors' ability; however, differences in hyperparameters were necessary due to hardware limitations on the part of both authors. For example, while Chen was able to train over 20 epochs, the neural networks described in this investigation were trained for only 5 epochs. Differences in the learning rate, dropout rate, and number of epochs are all potential sources of discrepancy between this investigation and Chen's, and serve as a cautionary measure before generalizing the results of Chen's study to a wider range of discourse topics.
After encoding the tweets with BERT, the convolutional neural network achieved an accuracy of 49.25% and an F1 score of 32.50%, compared to Chen's values of 66.29 ± 0.64 and 48.46 ± 2.60, respectively. The fully connected network achieved an accuracy of 50.75% and an F1 score of 34.17%, compared to Chen's values of 65.99 ± 0.77 and 53.52 ± 1.70. Although the clear separation between Chen's error bars and the results of this study suggests a statistically significant difference, implying that Chen's analysis is not generalizable, several shortcomings prevent decisive conclusions from being drawn. The error bars in Chen's results were calculated by repeating the training of BERT and the neural networks over 5 iterations and taking the standard deviation of the accuracy and F1 values. Again, due to hardware limitations, this investigation could not be repeated enough times to compute reliable statistics, and thus no error bars (standard deviations) are reported. A further discrepancy arises from the fact that Chen's work classified sentiments into three categories (positive, negative, and neutral) and considered multiple targets of sentiment within the same input, while the work detailed here considered only two sentiment classes applied to the entirety of a given data point. Despite this, Chen still found better success with their methodology, as indicated by the higher accuracy and F1 scores reported in the paper.
The most glaring result indicating that this investigation fails to generalize Chen's findings is the roughly 50% accuracy of both the CNN and FCN on a binary classification task. Even after training, the networks predict the sentiment of the tweets no better than random chance, which is most likely caused by improper training of BERT or incorrect tuning of hyperparameters. The baseline accuracy for the BERT encoding was found to be 50.75%; the fact that the neural networks did not improve on this baseline when classifying the test data indicates that they, too, were improperly trained.
This improper training may have multiple root causes. For example, in the case of the CNN, a dropout probability of 0.5 is higher than necessary to avoid overfitting and diminishes the capacity of the model; it is also higher than the 0.1 dropout probability outlined in Chen's paper. Furthermore, Chen's networks use a learning rate of 2 × 10⁻⁵ to slowly iterate through gradient descent, while this analysis used a learning rate of 1 × 10⁻³. Although not necessarily impactful to the results, the integrity of the comparison would certainly not be hurt by aligning the hyperparameters of the two investigations. More iterations of training and testing the BERT encoding and the neural network classifiers would allow for more replicable and definitive results. In general, many factors impact the generalizability of the results of this investigation, some of which were outside the control of the authors and the hardware available to them. Better hyperparameter tuning, a higher volume of training and testing data, and multiple repetitions of initializing and training both the BERT encoding and the neural networks are expected to lead to more replicable and definitive answers to the research question.
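For concreteness, the two configurations can be summarized side by side; this is only a sketch, and only the three values named above are taken from the text.

```python
# Hyperparameters used in this investigation versus those reported for Chen's setup.
HPARAMS_THIS_WORK = {"dropout": 0.5, "learning_rate": 1e-3, "epochs": 5}
HPARAMS_CHEN      = {"dropout": 0.1, "learning_rate": 2e-5, "epochs": 20}
```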
The initial research question had two goals. First, it sought to determine whether the nature of online medical discourse is quantifiably different from that of uncategorized speech; Chen's paper hypothesized that this is true for multiple reasons. The second goal, similar to that of Chen's paper, was to compare the efficacy of different types of neural networks in predicting the sentiment of a set of tweets. Unfortunately, while the structure of this investigation was intended to produce definitive answers to both questions, the execution of creating, training, and testing the encoding and network procedures outlined in Chen's paper could be improved in multiple areas, leading to inconclusive results and the need for further investigation. Future investigations of this research question should be conducted on better-equipped hardware capable of handling a higher volume of training and testing data, more epochs of neural network training, and a lower learning rate to more subtly tune the weights of each network through gradient descent. Given more time and computing power, a deeper dive into this research question could lead to significant progress in natural language processing and generative AI, as networks would be able to better learn and process the emotions and subtext of everyday speech that are obvious to humans but not to machines.
[1] Yinan Chen. A Sentiment Analysis of Medical Text Based on Deep Learning. arXiv preprint arXiv:2404.10503, 2024.
[2] A. Go, R. Bhayani, and L. Huang. Twitter Sentiment Classification Using Distant Supervision. CS224N Project Report, Stanford, 1(12), 2009.
Benjamin Ecsedy
Ali Moeinzad