Our Progress on the NLP(Natural Language Processing) aspect of the Project
Since the beginning of our project, as a team, we have been doing background studies on the concepts involved in our project namely, Digital Signal Processing, Speech Recognition and Natural language processing (NLP). I have been handling the NLP aspect of our project for the past months and we have arrived at some conclusions relevant to our scope.
As per Wikipedia,
According to the methodology mentioned above, following the audio signal analysis and speech to text conversion, the NLP module detects the offensive profanity using Machine Learning. So far, we have implemented the machine learning model for English with around 96% using an opensource training dataset which consists of twitter feeds.
As for Sinhala and Tamil datasets, we are currently scraping the social media for feeds and posts which has offensive words in them using keywords to track them down. For the Sinhala language, we have so far obtained around 10,000 entries which are under our manual classification process to train it. As for Tamil, we are in the process of getting the data.
Firstly, we approached the English language, and we created the ML model, after trying out and analyzing several algorithms and their error rates. Since our model has a few errors dealing with nuances, we are planning on trying Deep Learning to tackle the problem.
Thus far this is the progress on the NLP aspect of our project and we are confident that we can attain an accuracy well above 60%(as mentioned in SDS and SRS) using our system.
As per Wikipedia,
NLP is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data.
According to the methodology mentioned above, following the audio signal analysis and speech to text conversion, the NLP module detects the offensive profanity using Machine Learning. So far, we have implemented the machine learning model for English with around 96% using an opensource training dataset which consists of twitter feeds.
As for Sinhala and Tamil datasets, we are currently scraping the social media for feeds and posts which has offensive words in them using keywords to track them down. For the Sinhala language, we have so far obtained around 10,000 entries which are under our manual classification process to train it. As for Tamil, we are in the process of getting the data.
Firstly, we approached the English language, and we created the ML model, after trying out and analyzing several algorithms and their error rates. Since our model has a few errors dealing with nuances, we are planning on trying Deep Learning to tackle the problem.
Thus far this is the progress on the NLP aspect of our project and we are confident that we can attain an accuracy well above 60%(as mentioned in SDS and SRS) using our system.
Comments
Post a Comment