Methodology
Data collection
First, a problem identification survey was conducted via Google Forms. Data collection then began for creating the models to detect offensive content in Sinhala, Tamil and English. Detecting profanity in audio signals was convenient for English, owing to the availability of existing tools and datasets. However, since Sinhala and Tamil are regional languages, few technologies support them; the project therefore required most of the NLP components to be built from scratch, or alternative approaches to be employed. Since the project concerns natural language processing, social media posts and comments were chosen as the source of raw data, because data with heavy colloquial language was required. This ensured that profanity could be detected regardless of whether an audio file contains formal or casual speech. A list of offensive keywords was compiled for all three languages and used to manually collect social media posts and comments, mainly from Twitter and Facebook. Initially, mining the raw data through the Twitter API was considered, but the API requires a premium package to access data older than seven days; hence, it was discarded and the team collected the data manually using the keywords.
Preprocessing of data
The comments collected from social media posts were assigned a class based on the presence of offensive keywords. A different approach was taken in the data collection for the multi-class classification model, since it concerns predicting the offensive nature of each word in a sentence; the data corpus for this purpose was therefore made up of short snippets from social media comments. Each corpus was randomly split, with 70% used as the training dataset and the remaining 30% held out as the test dataset for model evaluation. The corpora were preprocessed by removing stopwords, numbers, URLs, punctuation, special characters and duplicates. The stopword lists for Sinhala and Tamil were inspired by NLTK's English stopword list and also included colloquial stopwords.
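The labeling, cleaning and splitting steps can be sketched roughly as follows. This is a minimal illustration only: the keyword list, stopword list, column names and sample comments are hypothetical placeholders, not the project's actual corpora or lists.

```python
import re
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical keyword and stopword lists; the real lists were compiled
# manually for Sinhala, Tamil and English, including colloquial stopwords.
offensive_keywords = {"badword1", "badword2"}
stopwords = {"the", "a", "an"}

def clean_text(text: str) -> str:
    """Remove URLs, numbers, punctuation/special characters and stopwords."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"\d+", " ", text)                # numbers
    text = re.sub(r"[^\w\s]", " ", text)            # punctuation / special characters
    tokens = [t for t in text.lower().split() if t not in stopwords]
    return " ".join(tokens)

def label_comment(text: str) -> int:
    """Assign class 1 (offensive) if any offensive keyword is present, else 0."""
    return int(bool(set(text.lower().split()) & offensive_keywords))

# Hypothetical raw social media comments.
df = pd.DataFrame({"comment": ["sample comment with badword1", "a clean comment"]})
df["label"] = df["comment"].apply(label_comment)
df["clean"] = df["comment"].apply(clean_text)
df = df.drop_duplicates(subset="clean")

# Random 70/30 split into training and test sets.
train_df, test_df = train_test_split(df, train_size=0.7, random_state=42)
```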
The preprocessed datasets were used to generate the features for the machine learning algorithms. The vectorization and classification tasks were conducted using the Scikit-learn library. With regard to feature extraction, the datasets were tested with the TF-IDF vectorizer and the Count vectorizer along with the classifiers. The SVM classifier is known to outperform classifiers such as k-Nearest Neighbors and Naive Bayes; hence, SVM was chosen for the binary classification models. For the multi-class classification models, classifiers such as Decision Tree, Multinomial Naive Bayes, Logistic Regression and SVM were evaluated to determine which offered the most suitable model.
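A comparison of vectorizer and classifier combinations can be sketched with Scikit-learn as shown below. The toy texts and labels are placeholders; in the project, the models were trained and evaluated on the preprocessed 70/30 splits described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Placeholder training/test texts and labels; the real data comes from the
# preprocessed Sinhala, Tamil and English corpora.
train_texts, train_labels = ["offensive text", "normal text"], [1, 0]
test_texts, test_labels = ["another normal text"], [0]

vectorizers = {"tfidf": TfidfVectorizer(), "count": CountVectorizer()}
classifiers = {
    "svm": SVC(kernel="linear"),
    "decision_tree": DecisionTreeClassifier(),
    "multinomial_nb": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Try every vectorizer/classifier combination and report test accuracy.
for vec_name, vectorizer in vectorizers.items():
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)
    for clf_name, clf in classifiers.items():
        clf.fit(X_train, train_labels)
        acc = accuracy_score(test_labels, clf.predict(X_test))
        print(f"{vec_name} + {clf_name}: accuracy = {acc:.3f}")
```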
The secondary filtering model was used to identify the offensive nature of each word in a sentence. This involves a dataset of short snippets of text rather than lengthy sentences, on which the performance of SVM is relatively lower than that of Naive Bayes (NB) variants. Accordingly, combinations of classifiers and vectorizers were evaluated to identify the most suitable model. After the models were finalized, the hyperparameters of the classifiers and vectorizers were tuned for maximum accuracy using the pipelining method.

The system was created using the Python GUI library PyQt5. In the backend, the system has four main modules: the DSP module, the speech recognition module, the NLP module, and the audio replacing module.
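A minimal sketch of how pipelining-based tuning might look with Scikit-learn is given below, using a Pipeline with a grid search over vectorizer and classifier hyperparameters. The parameter grid, scoring choice and toy data are illustrative assumptions rather than the project's exact settings.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Placeholder corpus; the real tuning used each language's training split.
texts = ["offensive text", "normal text", "more offensive text", "another normal text"]
labels = [1, 0, 1, 0]

# Chain the vectorizer and the classifier so both can be tuned together.
pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", SVC()),
])

# Illustrative hyperparameter grid (assumed values, not the project's final ones).
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__kernel": ["linear", "rbf"],
    "classifier__C": [0.1, 1, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```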
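The interaction of the four backend modules can be outlined at a high level as below. Every function here is a hypothetical placeholder standing in for the corresponding module; the actual signal processing, speech recognition, classification and audio editing implementations are not shown.

```python
# High-level flow of the four backend modules; all functions are
# hypothetical placeholders for the actual module implementations.

def dsp_preprocess(audio_path: str) -> str:
    """DSP module: e.g. noise reduction/normalisation of the input audio."""
    return audio_path  # placeholder: returns the processed audio path

def speech_to_text(audio_path: str) -> list[dict]:
    """Speech recognition module: transcribe audio into timed word segments."""
    return [{"word": "example", "start": 0.0, "end": 0.5}]  # placeholder output

def classify_words(segments: list[dict]) -> list[dict]:
    """NLP module: mark each recognised word as offensive or not."""
    for seg in segments:
        seg["offensive"] = False  # placeholder: real models classify each word
    return segments

def replace_offensive_audio(audio_path: str, segments: list[dict]) -> str:
    """Audio replacing module: mute or bleep the offensive time ranges."""
    return audio_path  # placeholder: returns the path of the censored audio

def censor(audio_path: str) -> str:
    """Chain the modules: DSP -> speech recognition -> NLP -> audio replacing."""
    processed = dsp_preprocess(audio_path)
    segments = classify_words(speech_to_text(processed))
    return replace_offensive_audio(processed, segments)
```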
Finally, the system was tested with a small group of users aged 25-30, and the overall response to their experience with the system was positive.