NLTK is a powerful library for working with human language data. You can use it to process text for classification, tokenization, stemming, tagging and parsing. Since I was taking the DataCamp course Natural Language Processing Fundamentals in Python and learning about NLTK, I decided to use the library to write a script that classifies tweets about a search term.
The Good
- Can deal with repeated tweets on a subject
- Once the classifier is trained, the script can classify tweets quickly
- It gives a good overall picture of the sentiment around a subject
- It saves the searched tweets in a file so they can be reused instead of querying Twitter again
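The last point, reusing saved tweets, can be sketched roughly as follows. This is a minimal illustration and not the code from the repo: the `load_or_fetch` helper, the one-JSON-file-per-term layout and the `fetch` callable (standing in for whatever actually queries Twitter) are all assumptions.

```python
import json
from pathlib import Path


def load_or_fetch(term, fetch, cache_dir="tweets"):
    """Return tweets for `term`, reading a local JSON file when one exists.

    `fetch` is a hypothetical callable that takes the search term and
    returns a list of tweet texts (for example, a wrapper around a Twitter client).
    """
    cache_file = Path(cache_dir) / f"{term}.json"
    if cache_file.exists():
        # Reuse the saved search instead of querying Twitter again.
        return json.loads(cache_file.read_text(encoding="utf-8"))

    # First search for this term: fetch, then save for next time.
    tweets = list(fetch(term))
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(tweets), encoding="utf-8")
    return tweets
```

A real script would probably also clean the term before using it as a file name, since search terms can contain characters that are not valid in paths.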
The Not So Good
- NLTK is a bit slow: training the classifier takes a while and could be optimized further
- The classification method needs a large amount of training data to be accurate
- Because of the training data used, only a rating of "Good" or "Bad" is returned
- It uses a bag-of-words model, so the classifier cannot handle sarcasm or double negatives properly
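To see why a bag of words struggles with negation, here is a small sketch. The `bag_of_words` helper is a hypothetical illustration, not necessarily the feature extractor used in the script. It builds the kind of word-presence dictionary that NLTK classifiers such as `NaiveBayesClassifier` accept as a feature set.

```python
def bag_of_words(text):
    """Map each lowercased word to True, ignoring order and repetition."""
    return {word: True for word in text.lower().split()}


a = bag_of_words("not bad it was good")
b = bag_of_words("not good it was bad")

# Opposite meanings, identical features: word order is lost.
print(a == b)  # True
```

Since both sentences produce the same features, any classifier trained on them has to give both the same label, whatever the writer meant.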
How it all came to be
One of the Pybites challenges was to create a Twitter sentiment analysis that rates a subject. Once the analysis is done, the searched term gets a rating of Good, Bad or Neutral.
The Pybites solution uses the textblob library, but since I was learning about NLTK at the time, I decided to use it instead and implement my own version of a classifier. Working on this exercise was extremely fun and taught me a lot.
GitHub repo: https://github.com/FabioRosado/tweetnalytic
Pybites challenge 07: https://pybit.es/codechallenge07.html
Image credits: Unsplash