Motivation
Survey Analysis: For anyone dealing with large amounts of textual data, drawing a general conclusion from thousands of sentences is a nightmare. Let's say you have surveyed a large group of employees or customers and asked them to write what they think. How would you judge whether people are happy or unhappy? How would you measure what percentage of the survey is positive or negative?
Stock Prices & News Analysis: Measuring bias is not only helpful for survey analysis. One of the biggest challenges for many traders, investors and business owners is analyzing the news, and one of the biggest effects of news is on the stock market. Negative or positive news about giant companies like Microsoft or Tesla, for instance, has a big impact on their stock price. Of course, we can open the browser, spend hours reading what the news says about Microsoft, and then decide whether to buy or sell. However: 1. we don't have time to read every news channel; 2. we can't do it every day by searching Google News for the names of different companies; 3. it is not easy to quantify how good or bad a piece of news is, or how much it is going to affect the price.
Policy & decision making: Finally, for politicians, companies and traders, making a new policy or modifying an existing one requires a way to measure negative and positive news and feedback, in order to get a general idea of whether this is the right time to apply the changes.
The applicability of a bias measure for textual data is not limited to the scenarios mentioned above. So we stop here on how it could be useful and jump straight into our sample scenario.
Data Collection
We used Google News to fetch headlines related to our topic within a particular date range, using the Selenium and PyAutoGUI packages in Python. For this use case the search topic is “Microsoft”, and we collect the news from the first page of the Google search results. A sample of the collected data is shown in the table below.
Microsoft Sample News Headlines Collected From Google News
Data Preparation
Text processing is a must-do step. To achieve this, we apply the following to the text:
Convert the text to lowercase.
Replace all whitespace and newlines with a single space.
Remove all numbers.
Remove any links using regular expressions.
Remove punctuation, stop words and linking words.
Once the text is cleaned, we convert each string to a list of words (tokenization). At the end, we stem each word in the list to get its root, so that related words such as economical and economist map toward a common root. The whole process converts all headlines and briefings to a so-called bag of words. The table below shows how the headlines look after cleaning and conversion to a bag of words:
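The cleaning steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the exact code used for this post: the stop-word list is a tiny made-up sample, and `naive_stem` is a hypothetical suffix stripper standing in for a real stemmer such as the ones shipped with NLTK. Links are removed before numbers here so that URLs containing digits disappear in one piece.

```python
import re
import string

# Assumption: a tiny illustrative stop-word list. A real pipeline would use a
# fuller list, for example the one shipped with NLTK.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "on", "for",
              "is", "are", "with"}

# Assumption: a few suffixes for a toy stemmer. This is NOT a real stemming
# algorithm, only a stand-in to show where stemming fits in the pipeline.
SUFFIXES = ("ative", "ing", "ly", "ed", "es", "s")

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_headline(text):
    """Lowercase, drop links, numbers, punctuation and stop words; tokenize."""
    text = text.lower()
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # remove links
    text = re.sub(r"\d+", " ", text)                   # remove numbers
    text = text.translate(PUNCT_TO_SPACE)              # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()           # collapse whitespace
    return [word for word in text.split() if word not in STOP_WORDS]


def to_bag_of_words(text, stem=True):
    """Return the cleaned tokens, stemmed unless stem=False."""
    tokens = clean_headline(text)
    return [naive_stem(token) for token in tokens] if stem else tokens
```

Calling `to_bag_of_words(headline, stem=False)` keeps the unstemmed tokens, which is useful when both versions of a headline need to be scored.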
Headlines cleaned and converted to bag of words
Measuring Text Bias
There are different methods for measuring text bias. We use two of the most popular ones, VADER and TextBlob. VADER is a rule-based model that relies on a lexicon of words rated as negative or positive. Using VADER, we can estimate how strongly a sentence leans toward negative or positive words.
The other popular method is TextBlob, which is a Python library. TextBlob gives each word a strength/intensity score and then averages over all scores to estimate how much negative or positive weight a sentence carries.
Both techniques report a score between -1 and +1, which we express as a percentage ranging from -100% to +100% to show how negative or positive a sentence is.
The two techniques cover slightly different vocabularies. Also, the scores given to each word (in TextBlob) were assigned manually by humans, so human error is plausible too. Hence I combine both methods to get a more robust polarity measure.
I also realized that in some cases stemming generates words that are misspelled or rarely used in English, which means they are not covered by either the VADER or the TextBlob vocabulary (for instance, in the table of sample preprocessed headlines above, you can see that alternative has been stemmed to altern). Hence we decided to use the polarity measure from both techniques on both the stemmed and the unstemmed text. On the other hand, stemming is still necessary, because a text can contain many different words derived from the same root, where only the root exists in the VADER or TextBlob vocabulary.
My experiments showed that VADER scores are much closer to human perception. However, because of the difference in vocabulary explained above, for some sentences VADER cannot calculate any score while TextBlob still can, or vice versa. Hence I decided to combine the polarities from both methods as follows: if a VADER score is available, pick the VADER score; if VADER has not generated any score, use the TextBlob score. The table below shows a sample output.
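The fallback rule can be sketched as follows. This is a minimal sketch under a few assumptions of mine: "no score" is taken to mean `None` or exactly 0.0, each method tries the unstemmed text before the stemmed text, and the two scorers are passed in as plain functions (`vader_fn` and `textblob_fn` are hypothetical parameter names) so the logic can be read on its own.

```python
def to_percent(score):
    """Scale a polarity in [-1, 1] to a percentage in [-100, 100]."""
    return round(score * 100, 1)


def first_available(scores):
    """Return the first score that is neither None nor 0.0, else None.

    Assumption: a score of exactly 0.0 means the scorer found no known words.
    """
    for score in scores:
        if score is not None and score != 0.0:
            return score
    return None


def headline_polarity(raw_text, stemmed_text, vader_fn, textblob_fn):
    """Prefer VADER; fall back to TextBlob when VADER gives no score.

    Assumption: within each method the unstemmed text is tried first and the
    stemmed text second. The order is an illustrative choice.
    """
    vader = first_available([vader_fn(raw_text), vader_fn(stemmed_text)])
    blob = first_available([textblob_fn(raw_text), textblob_fn(stemmed_text)])
    chosen = first_available([vader, blob])
    return 0.0 if chosen is None else chosen
```

With the `vaderSentiment` and `textblob` packages installed, `vader_fn` could wrap `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` and `textblob_fn` could wrap `TextBlob(text).sentiment.polarity`; both return values between -1 and +1, which `to_percent` turns into the percentages used in this post.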
Polarity scores calculated for sample News about Microsoft
As mentioned before, a negative polarity means the sentence is biased toward negative content; therefore, we treat any sentence with a score below zero as a negative sentence. In our samples the score also reflects the intensity in a way that is pretty close to human perception. For instance, the first sentence in the table (apple, samsung and microsoft accused of 'worst forms' of child labor abuse) has an 89% negative bias, while the fourth sentence (microsoft warns windows 10 users to update immediately) has only a 10% negative bias.
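To go from individual scores to an overall picture, the sign rule can be applied to every headline and the shares counted. The sketch below assumes that scores above zero count as positive and scores of exactly zero as neutral; only the "below zero is negative" rule comes from the text above, and the function names are mine.

```python
from collections import Counter

LABELS = ("negative", "neutral", "positive")


def label(score):
    """Map a polarity score to a label using its sign."""
    if score < 0:
        return "negative"
    if score > 0:  # assumption: anything above zero counts as positive
        return "positive"
    return "neutral"  # assumption: exactly zero is treated as neutral


def share_by_label(scores):
    """Return the percentage of scores that fall under each label."""
    if not scores:
        return {name: 0.0 for name in LABELS}
    counts = Counter(label(score) for score in scores)
    return {name: 100 * counts[name] / len(scores) for name in LABELS}
```

For example, `share_by_label([-0.89, -0.10, 0.2, 0.0])` uses the two Microsoft scores quoted above plus two made-up values, and reports the share of negative, neutral and positive headlines in that small batch.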
Conclusion
We are living in a world of contradictions, double standards and continuous judgements. Every day people generate and read tons of news, articles and papers. Sometimes something horrible (war, invasion, terror attack, natural disaster, rape, gender inequality, racism...) happens in one country or even in one company, and there is no significant aftermath in the global media or the global market. Yet when exactly the same thing happens somewhere else, it attracts a lot of attention in news bulletins and global media, which eventually affects the global market, political and social movements, and major decisions everywhere. In this new era, NLP and AI can help us measure the bias toward a particular topic in each historical period and predict the future more realistically. In a later post I will write about how news bias can help us predict stock prices.