Sam wants to know what movies are worth watching, so he analysed 25,000 movie reviews. This is a tough natural language processing problem, because each movie only has a small number of reviews (less than 30). It’s nowhere near enough for a deep learning approach to work, so he had to identify and synthesise features himself.
He used BeautifulSoup to pull out some of the HTML structure from the reviews, and then made extensive use of the Python NTLK library.
The bag-of-words model (ignoring grammar, structure or position) worked reasonably well. A naive Bayesian model performed quite well -- as would be expected -- as did a decision tree model, but there was enough noise that a logistic regression won out, getting the review sentiment right 85% of the time. He evaluated all of his models with F1, AUC and precision-recall. He used this to tweak the model a little and just nudge it a little higher.
A logistic regression over a bag-of-words essentially means that there we are assigning a score to each word in the English language (which might be a positive number, a negative number or even zero), and then adding up the scores for each word when it appears. If overall it adds up to a positive number, we count the review positive; if negative the reviewer didn’t like the movie.
He used the Python Scikit learn library (as do most of my students) to calculate the optimal score to assign to each English language word. Since the vocabulary he was working with was around 75,000 words (he didn’t do any stemming or synonym-based simplication) this ran for around 2 days on his laptop before coming up with an answer.
Interestingly, the word “good” is useless as a predictor of whether a movie was good or not! It probably needs more investigation, but perhaps a smarter word grouping that picked up “not good” would help. Or maybe it fails to predict much because of reviews that say things like “while the acting was good, the plot was terrible”.
Sam found plenty of other words that weren’t very good predictors: movie, film, like, just and really. So he turned these into stopwords.
There are other natural language processing techniques that often produce good results, like simply measuring the length of the review, or measuring the lexical dispersion (the richness of vocabulary used). However, these were also ineffective.
What Sam found was a selection of words that, if they are present in a review, indicate that the movie was good. These are “excellent”, “perfect”, “superb”, “funniest” and interestingly: “refreshing”. And conversely, give a movie a miss if people are talking about “worst”, “waste”, “disappointment” and “disappointing”.
What else could this kind of analysis be applied to?
- Do you want to know whether customers will return happier, or go elsewhere looking for something better? This is the kind of analysis that you can apply to your communications from customers (email, phone conversations, twitter comments) if you have sales information in your database.
- Do you want to know what aspects of your products your customers value? If you can get them to write reviews of your products, you can do this kind of natural language processing on them and you will see what your customers talk about when they like your products.