check your tone!
AUTOMATICALLY ASSIGNING SENTIMENT TO TEXTS / INSIGHT DATA SCIENCE DEMO
*CLICK HERE FOR SLIDES OF THE PROJECT DEMO*
The PVLL app is a text messaging service that aims to integrate user-accessible analytics with ongoing conversations across a mobile platform. PVLL is a small startup with fantastic ideas but limited bandwidth to implement them; they simply don't have the person-hours to push forward with all the fun and useful things they want to do right now. I am a Data Science Fellow at Insight Data Science (Palo Alto) and was fortunate to be tasked with working on this unique data set for the team at PVLL.
THE OBJECTIVE: label each conversation with a "Flavor"
PVLL gave me a very broad objective: Label each conversation with a "flavor." Think, in the words of PVLL, "a Myers-Briggs type indicator but for conversations."
Broad objectives are fantastic for showing off creative problem-solving skills. However, identifying the actual problem that needs to be solved is often a challenge, and it must happen before any detailed work starts. This objective was vague, and it took several iterations of looking at the data to fully understand the problem at hand. It was important to dig into the driving forces behind the problem, along with how the solution would be implemented.
I defined the problem as one of assigning sentiment to an ongoing conversation. In our case, a conversation is not a singular event but is made up of multiple messages, each with its own sentiment. Additionally, a conversation is dynamic and changes across time; a text conversation may start and stop between individuals, and dialogue may occur outside of the texting environment.
The solution I came up with looks like this: (1) Each message is automatically assigned a sentiment based on message structure and user profile. (2) The full conversation sentiment is updated, in real time, given the new information. (3) The user receives a quick and easy way to understand the assigned sentiment, based on color and symbols (emoticons).
So why create this data product?
Consider this scenario: You are super busy and need to focus on finishing the project you have been working on for the last week. You really don't have time to keep reading texts sent by your friend, but you want to be available in case something happens. The model I built will notify you if the sentiment of a new message changes, or conflicts with, the sentiment of the current conversation.
Or maybe this scenario: You just met someone and can't quite tell if they are interested in you in *that* way. Maybe you want to check with the app to see if there is an underlying tone to their message... or maybe you want to make sure you are not sending a message with an unintended tone! This project will help with that too.
the major issue
I received over 3 million text messages from the group at PVLL, and the data looked nothing like what I expected! PVLL is protective of privacy and sent text information that was both anonymized and masked. Letters were replaced with a generic "a," and numbers were replaced with a generic "0." Sentence structure (word spacing) and punctuation were intact; however, any hope of using natural language processing to analyze the dataset was quickly lost.
For instance, a sentence that would normally look like this:
Let me know if coffee tomorrow at 11 AM will work!
Looked like this:
Aaa aa aaaa aa aaaaaa aaaaaaaa aa 00 AA aaaa aaaa!
There was a large "oh no!" moment until I realized emoticons were still available in the messages ☺️. Emoticons are packed full of sentiment, and I was able to use them to label my messages. Working with emoticons is actually not a trivial task when mining and parsing text. I had to convert all of my emoticons to Unicode for ease of use, although this created another (difficult but solvable) issue of working with regular expressions, something I just didn't have much experience with.
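As a small sketch of this step (a hypothetical regex and masked message, not PVLL's actual pipeline, and only a rough subset of the Unicode emoticon ranges), pulling emoticons out of a masked message might look like:

```python
import re

# Rough, non-exhaustive Unicode ranges covering common emoji/emoticons.
# A real pipeline would need a fuller set of ranges.
EMOJI_RE = re.compile("[\U0001F300-\U0001F6FF\u2600-\u27BF]")

def extract_emoticons(message):
    """Return the emoticon characters found in a masked message."""
    return EMOJI_RE.findall(message)

# Masked message: letters -> 'a'/'A', digits -> '0', emoticons preserved.
msg = "Aaa aa aaaa 00 AA aaaa! \U0001F60A"
print(extract_emoticons(msg))  # the smiley survives the masking
```

The key point is that because masking preserves every non-letter, non-digit character, the emoticons pass through untouched and can be matched directly on their code points.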
How I solved the problem
Unsupervised learning: Not having language within the text was a major issue: it meant I didn't have any labeled data to train a predictive model on. My first step was to build labels from unlabeled data. I knew I needed to group emoticons in some manner, and I was not willing to curate, by hand, the over 800 emoticons in my data set. Thinking that there might be redundancy across emoticon use, I built a correlation matrix to learn how "related" the most frequent emoticons were to each other. Taking only messages with emoticons, I asked how often emoticon "A" showed up in the same message as emoticon "B," and so forth.

The relatedness structure that came out of this was pretty remarkable and unexpected. Emoticons strongly clustered together, and they clustered in a manner that spanned the full emotional range of a romantic relationship between two people: romantic love sat on one extreme of the correlation matrix, while extreme sadness was negatively correlated with love and clustered on the opposite side. I set my cluster threshold using a dendrogram based on a Euclidean distance metric. The number of groups was then selected by visually checking the emoticons that fell into each group. It turned out that 13 clusters both fit the main clusters on the heatmap and passed a visual check of the emoticons. From these 13 clusters, I was able to label every message containing one of these emoticons by placing it in the respective cluster.
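A minimal sketch of the correlate-then-cluster idea, with a toy random incidence matrix standing in for the real messages-by-emoticons data and SciPy's hierarchical clustering in place of whatever the original code used (the cluster count of 3 here is arbitrary, standing in for the dendrogram-derived 13):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy incidence matrix: rows = messages, columns = emoticons (1 if present).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6))

# Pairwise correlation of emoticon columns: how often two emoticons co-occur
# in the same message, relative to their individual frequencies.
corr = np.corrcoef(X.T)

# Hierarchical clustering on the rows of the correlation matrix with a
# Euclidean metric; a fixed cluster count stands in for the visual
# dendrogram threshold described above.
Z = linkage(corr, method="average", metric="euclidean")
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)  # one cluster id per emoticon column
```

Each emoticon then carries a cluster label, and any message containing that emoticon inherits the label, which is what turns the unlabeled corpus into training data.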
Feature extraction: Next, I took all messages and extracted features from both the individual message and from summary statistics on the user. Message-specific features included the number of lower-case letters, the number of spaces, the number of words, the number of exclamation marks and question marks, etc. User-specific features included summary statistics such as the average number of words used in messages and the average number of texts sent in a day.
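The message-level features can be sketched as a simple counting function (the function and feature names here are hypothetical; the user-level summary statistics would be computed by aggregating these over each user's messages):

```python
def message_features(msg):
    """Structural features recoverable from a masked message (a sketch)."""
    words = msg.split()
    return {
        "n_chars": len(msg),
        "n_lower": sum(c.islower() for c in msg),   # lower-case letters
        "n_upper": sum(c.isupper() for c in msg),   # upper-case letters
        "n_spaces": msg.count(" "),
        "n_words": len(words),
        "n_exclaim": msg.count("!"),
        "n_question": msg.count("?"),
        "n_digits": sum(c.isdigit() for c in msg),
    }

# A masked message still yields all of these counts.
print(message_features("Aaa aa aaaa aa 00 AA aaaa!"))
```

Note that every one of these features survives the letter/digit masking, which is exactly why structure-based features were the right fit for this dataset.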
Supervised learning: Newly built features and assigned labels were then brought together. Using 90% of the data for training and 10% for testing, I applied machine learning to build a predictive model based on messages with emoticons. Both k-nearest neighbors and random forest (26.7% and 31.5% accuracy, respectively) greatly outperformed both a random guess (7.7% accuracy) and a biased guess (18.7% accuracy). This was quite remarkable, given that I only used emoticons to assign a label to each message and did not use any emoticon information in building out the features of the messages.
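The training setup can be sketched as follows, with synthetic data standing in for the real feature matrix and 13-cluster labels (the accuracies this toy prints come from the synthetic data, not the 26.7%/31.5% reported above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 "messages", 8 structural features, 13 labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 8))
y = rng.integers(0, 13, size=1000)
X[:, 0] += 3 * y  # inject signal so the models have something to learn

# 90% train / 10% test, matching the split described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=1)

results = {}
for model in (KNeighborsClassifier(), RandomForestClassifier(random_state=1)):
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    results[type(model).__name__] = acc
    print(type(model).__name__, round(acc, 3))
```

With 13 roughly balanced classes, a uniform random guess lands near 1/13 ≈ 7.7%, which is the baseline both models have to beat.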
In order to apply the predictive model to all text messages, I had to make the major assumption that messages with emoticons are similar in structure to messages without emoticons. This may or may not be a valid assumption; however, with no other sentiment information to base my model on, I needed to make it. This is currently a blind spot in the method, but due to privacy concerns there is little I can do to avoid the issue. Below, I highlight how this model may be improved upon, including the suggestion that another Data Scientist, blind to the relatedness amongst texts, perform natural language processing on messages unlinked and shuffled with respect to the user.
a real-time model
Finally, to assign conversation sentiment, I fed all messages with assigned sentiment into a Hidden Markov Model (HMM). This allows for a real-time conversation-sentiment update: as a new message enters the conversation, the probability of staying in the same sentiment state updates given the new information. The HMM was trained using the Baum-Welch algorithm, and posterior probabilities were assigned using the forward-backward algorithm. The result is a fast model that updates conversation sentiment given new information in the system and may be applied in real time.
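A toy NumPy sketch of the real-time update (two hypothetical sentiment states instead of 13, hand-picked parameters instead of Baum-Welch estimates) shows how the forward algorithm folds each new message into the running belief about the conversation's state:

```python
import numpy as np

# Toy 2-state HMM: states = conversation sentiments,
# observations = per-message sentiment labels (0 or 1).
A = np.array([[0.9, 0.1],    # sticky transitions: conversations tend to
              [0.2, 0.8]])   # stay in their current sentiment state
B = np.array([[0.8, 0.2],    # emission: P(message label | conversation state)
              [0.3, 0.7]])
pi = np.array([0.5, 0.5])    # initial state distribution

def forward_update(alpha, obs):
    """One real-time forward-algorithm step: fold a new message's
    sentiment label into the running state distribution."""
    alpha = (alpha @ A) * B[:, obs]
    return alpha / alpha.sum()  # normalise to a probability distribution

# Initialise with the first observation, then stream the rest in.
alpha = pi * B[:, 0]
alpha /= alpha.sum()
for obs in [0, 0, 1, 1, 1]:   # stream of per-message sentiment labels
    alpha = forward_update(alpha, obs)
print(alpha)  # current belief over conversation sentiment states
```

Each update is a single matrix-vector product, which is what makes the real-time application cheap: the conversation's sentiment belief is carried forward rather than recomputed from the full history.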
HOW CAN THIS BE BETTER?
Feature extraction: I extracted features at a "first order." These are very basic, such as counts of words or letters, and when you really dive into the data you can imagine there is room to build some pretty complex and beautiful features that likely carry strong meaning. For instance, a sentence containing "?!?!" has a very different meaning from a sentence containing "!!" followed by "??". Using my current features, I cannot differentiate between the two. My project timeline was two weeks, so I had very little time to build out these features. However, my modeling code is flexible enough to handle novel features that the people at PVLL may want to include at their choosing.
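As an illustration, a single regex-based feature (hypothetical, not part of the delivered model) would already separate a mixed punctuation run from two independent ones:

```python
import re

def punct_runs(msg):
    """Find runs of terminal punctuation, e.g. '?!?!' vs separate '!!' '??'."""
    return re.findall(r"[!?]{2,}", msg)

print(punct_runs("aaa?!?! aaa"))   # one mixed run
print(punct_runs("aaa!! aaa??"))  # two separate runs
```

Counting the runs, their lengths, and whether they mix "!" and "?" would turn the distinction described above into features the model can actually see.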
Apply natural language processing: Again, this is a flexible model. Although I didn't have actual text language to work with, my model is flexible enough to take on additional features. Specifically, since privacy is a concern for PVLL, my model leaves room for another Data Scientist, someone blind to how messages connect to anonymized users, to receive the raw text and apply NLP to the individual messages. Incorporating this into my model, which leverages user-profile information, would make for quite a robust sentiment-analysis model.
General code edits: There are several places in my code where I had to make quick decisions. For instance, there are cases where I had to make a random guess in order to break a tie when assigning an emoticon group to a message. The correct way to resolve such a tie would be to make a more informed guess based on the actual data, i.e., gather more data points. I took the best approach available given how much time I had to work on the project, but it wasn't the absolutely correct approach. I have commented in my code where these areas are.
I had a lot of fun working with this dataset and really enjoyed chatting with my contact at PVLL. A major insight uncovered was that emoticons have strong, systematic use in messages (hence the structure of relatedness), forming the basis of a new area of study, Natural Emoticon Processing (NEP) :), and allowing for sentiment prediction. The model I built is a substantial improvement upon a baseline guess of sentiment, but there is certainly room to guess better. This project was all about balancing time of implementation against solving for the correct answer; a problem of cost vs. benefit. Given the time available, I was quite happy with the final deliverable given to PVLL: a model in R that is called through a Python wrapper and will easily integrate into their text-processing pipeline.
list of scripts for processing data
All code used to build this project may be found on GitHub here
February 2015 - Keegan Kelsey