Major Research Project (MRP): Sentiment Analysis of Online Reviews from Yelp Open Dataset
Background: An applied research project (MRP) must be conducted and presented in partial fulfillment
of the requirements for the Master of Science in Data Science and Analytics at Toronto Metropolitan University.
I wanted to do a project on NLP and chose sentiment analysis of the Yelp Open Dataset.
In the digital age, online reviews have become a central factor in consumer choices. This study focuses on sentiment analysis of Yelp reviews, comparing traditional machine learning (ML) algorithms (Naïve Bayes, Logistic Regression, Random Forest, Support Vector Machines) against the contemporary BERT model. Starting from a dataset of over 6 million reviews, a balanced training set was derived by undersampling the prevalent 5-star reviews. The key objectives are both to categorize reviews as positive or negative and to predict their exact star ratings. While the conventional ML models showed a range of accuracy levels, BERT stood out, particularly in positive/negative sentiment classification, where it reached a flawless accuracy rate. These findings underscore BERT’s potential in complex sentiment tasks, even as traditional models remain capable. The performance of each model is evaluated using classification reports and a confusion matrix.
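The undersampling step mentioned above can be illustrated with a minimal sketch. This is not the project's actual code: the function name undersample_to_minority is hypothetical, the "stars" and "text" keys are assumed to match the review records, and balancing every class down to the size of the smallest one is an assumption about how "balanced" was defined.

```python
import random
from collections import defaultdict


def undersample_to_minority(rows, label_key="stars", seed=42):
    """Randomly drop rows from larger classes so that every class ends up
    with as many rows as the smallest class (assumed balancing rule)."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    if not by_label:
        return []

    # Size of the rarest class; all other classes are cut down to this.
    target = min(len(group) for group in by_label.values())

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))

    rng.shuffle(balanced)  # avoid feeding the model one class at a time
    return balanced


if __name__ == "__main__":
    # Toy data: 5-star reviews dominate, as they do in the full dataset.
    reviews = (
        [{"stars": 5, "text": "great"}] * 6
        + [{"stars": 3, "text": "okay"}] * 3
        + [{"stars": 1, "text": "awful"}] * 2
    )
    sample = undersample_to_minority(reviews)
    print(len(sample), "reviews after balancing")
```

Random undersampling discards data from the majority classes, so it trades some training signal for a label distribution that does not bias the model toward 5-star predictions.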
Challenges: There was only one significant challenge that I encountered while writing this paper:
I lacked the computing power needed to run a deep learning BERT model on 500,000 validation samples.
I used a pre-trained BERT model, fine-tuned on an undersampled training dataset from the Yelp Open Dataset. Unfortunately, my personal computer was not capable of running
either the fine-tuning or the validation. I ended up purchasing a Google Colab subscription to get a more powerful runtime environment.
Even then, I had to reduce the number of validation samples from 500,000 to 250,000 because the Google Colab environment was running out of RAM, leading to
crashes before my code could finish executing.
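One common way to keep memory use bounded during validation is to score the samples in small batches and keep only running confusion-matrix counts, rather than holding every prediction in memory. The sketch below illustrates that idea; it is not what this project did (the project reduced the sample count instead). The names iter_batches, evaluate_in_batches, and predict_batch are hypothetical, and predict_batch stands in for whatever function wraps the fine-tuned model.

```python
from collections import Counter
from itertools import islice


def iter_batches(iterable, batch_size):
    """Yield lists of up to batch_size items without materializing the input."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def evaluate_in_batches(samples, predict_batch, batch_size=256):
    """Score (text, true_label) pairs batch by batch.

    predict_batch is a hypothetical callable: list of texts -> list of labels.
    Only the (true, predicted) counts are kept between batches.
    """
    confusion = Counter()
    for batch in iter_batches(samples, batch_size):
        texts = [text for text, _ in batch]
        truths = [label for _, label in batch]
        predictions = predict_batch(texts)
        confusion.update(zip(truths, predictions))

    total = sum(confusion.values())
    correct = sum(n for (true, pred), n in confusion.items() if true == pred)
    accuracy = correct / total if total else 0.0
    return confusion, accuracy


if __name__ == "__main__":
    # Stand-in predictor for demonstration only; a real run would call the model.
    def keyword_predictor(texts):
        return ["pos" if "good" in t else "neg" for t in texts]

    data = [("good food", "pos"), ("bad service", "neg"), ("good but slow", "neg")]
    matrix, acc = evaluate_in_batches(iter(data), keyword_predictor, batch_size=2)
    print(dict(matrix), acc)
```

Because samples can be a generator that reads reviews from disk, peak memory depends on batch_size rather than on the total number of validation samples. Whether this would have been enough to fit 500,000 samples in the Colab runtime is not something the sketch can establish.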