Cyberbullying Detection on Twitter Using Natural Language Processing and Machine Learning Techniques

Stephen Afrifa; Vijayakumar Varadarajan

doi:10.15157/IJITIS.2022.5.4.1069-1080

Authors

Stephen Afrifa, Tianjin University, Tianjin, China
Vijayakumar Varadarajan, University of New South Wales, Sydney, Australia

Keywords:

Cyberbullying, Natural Language Processing, Machine Learning, Twitter, Random Forest, Support Vector Machine

Abstract

People use social media to discuss and debate topics ranging from entertainment and sports to politics. The growth of social media has also brought an increase in cyberbullying, which is occurring at an alarming pace. Cyberbullying messages can be found in the comment sections of many social media platforms, including Twitter and YouTube. Cyberbullying can cause stress and mental distress, so such messages should be detected early and kept from being published on social media platforms. In this study, we present a system for detecting cyberbullying messages in English using natural language processing (NLP) and machine learning approaches. A total of 16,851 tweets were collected from Twitter. NLP techniques were applied to the dataset to identify the most offensive terms associated with cyberbullying. The NLP results confirm that cyberbullying occurs on the platform and must be addressed promptly. The dataset was also used to train random forest (RF) and support vector machine (SVM) classifiers. The random forest attained an accuracy of 98.5%, surpassing the support vector machine, which attained 90.5%. This high accuracy is attributed to careful data preparation, in which missing values and outliers were handled beforehand. Handling such values eases the analysis of the available data; left untreated, they can reduce a study's statistical power, bias its outcomes, and ultimately weaken the validity of its findings. The root mean square error (RMSE) and mean square error (MSE) were used to analyse the results. The random forest achieved better error scores than the support vector machine. Our findings may be used by agencies and groups to educate individuals about the proper use of social media in order to avoid cyberbullying.
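The evaluation measures named in the abstract (accuracy, MSE, and RMSE) can be illustrated with a minimal sketch. It assumes binary labels encoded as 1 (cyberbullying) and 0 (not cyberbullying); the abstract does not state the encoding, and the label lists and function names below are hypothetical, not the study's data or code.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared error."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels for eight tweets (1 = cyberbullying, 0 = not).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
# Hypothetical classifier output with two mistakes (positions 2 and 5).
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
acc = accuracy(y_true, y_pred)

print(f"accuracy={acc:.3f} mse={mse:.3f} rmse={rmse:.3f}")
```

Under the 0/1 encoding assumed here, each wrong prediction contributes a squared error of exactly 1, so MSE equals the misclassification rate (1 minus accuracy). A classifier with higher accuracy therefore also has lower MSE and RMSE under this assumption.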