4-Step Reddit vs. NY Stock Exchange Sentiment Analysis

Published March 23, 2021
Updated Oct. 18, 2021

In light of recent headlines surrounding Reddit and the New York Stock Exchange, we at the Unit for Data Science and Analytics at ASU Library decided to do some analysis on the subject. For those unfamiliar, on January 22, 2021, a group of Reddit users organized themselves on the r/wallstreetbets subreddit and successfully performed a short squeeze on GameStop stock. This drove the stock price up dramatically and, in turn, put financial pressure on hedge funds. In the midst of all this, we had the thought: “Is there a relationship between the stock price and activity on the subreddit?” So, we decided to investigate.

First, we had to develop a more formal hypothesis. We believed that the recent spike in GameStop’s stock (GME) price was correlated with the amount of activity on the “/r/wallstreetbets” subreddit. Furthermore, we were curious to see if the overall sentiment of the post titles was also correlated with the stock price. However, before we could perform any of this analysis, we had to collect the data.

Collecting the Data

To start, we needed to retrieve data from the /r/wallstreetbets subreddit. We used the PRAW package in Python to do this; however, a Reddit app must first be created on the site before any data can be pulled. Details on how to create a Reddit app can be found here. The code needed to connect to Reddit through Python is shown below; you will still need to provide your own Reddit app credentials in order to connect successfully. Please note that all subsequent code and packages are Python-based.


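import praw

# Replace the placeholder strings with your own Reddit app credentials
reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                     client_secret="YOUR_CLIENT_SECRET",
                     redirect_uri="http://localhost:8080",
                     user_agent="YOUR_USER_AGENT")
print(reddit.auth.url(["identity"], "...", "permanent"))  # Print the authorization URL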


Once we were able to connect to Reddit, we encountered an issue. The PRAW package has no way to pull posts from specific dates; it only returns the 25 or so most recent posts. To get around this, we used the Pushshift API (credit to Reddit user u/kungming2 for the “submissions_pushshift_praw” function definition). This allowed us to pull posts from specific dates; however, there was another problem. Pushshift only allows a maximum of 100 posts per pull, so we created a loop to pull all of the posts for the full span of the dates. (NOTE: This loop takes several minutes to pull all the data.)


import pandas as pd

# NOTE: submissions_pushshift_praw (credited above) must be defined before running this loop

start_date = 1609484400  # '2021-01-01 00:00:00' (UTC-7) in Unix time
end_date = 1614581999  # '2021-02-28 23:59:59' (UTC-7) in Unix time

rows = []  # List to store one record per post

while start_date < end_date:  # Continue loop until end date is reached
    S = submissions_pushshift_praw(subreddit='wallstreetbets',
                                   start=start_date, end=end_date, limit=10000)  # Pull posts within date range

    for post in S:  # Looping through each post
        try:  # Try/except to catch any erroneous post pulls
            if post.selftext not in ('[removed]', '[deleted]'):  # Skip the deleted posts
                rows.append(
                    {'title': post.title,
                     'flair': post.link_flair_css_class,
                     'score': post.score,
                     'upvote_ratio': post.upvote_ratio,
                     'id': post.id,
                     'subreddit': post.subreddit,
                     'url': post.url,
                     'num_comments': post.num_comments,
                     'body': post.selftext,
                     'created': post.created})  # Retrieve post data and store the record
        except Exception:
            continue  # Continue loop if error is found

    if len(S) < 100:  # A pull below the 100-post maximum is assumed to be the last one
        break
    start_date = max(row['created'] for row in rows)  # Select the next earliest date to pull posts from
    print(start_date)  # An indicator of progress

posts = pd.DataFrame(rows, columns=['title', 'flair', 'score', 'upvote_ratio', 'id',
                                    'subreddit', 'url', 'num_comments', 'body', 'created'])  # Dataframe to store results
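
The start_date and end_date values above are Unix timestamps. As a minimal sketch of how such values can be derived, the snippet below converts the two date strings using a fixed UTC-7 offset. The offset is an assumption inferred from the timestamps themselves (it matches Arizona time, which does not observe daylight saving), and the helper name to_unix is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the timestamps were computed in Arizona time (UTC-7, no daylight saving)
ARIZONA = timezone(timedelta(hours=-7))


def to_unix(date_string, tz=ARIZONA):
    """Convert a 'YYYY-MM-DD HH:MM:SS' string in the given time zone to Unix time."""
    local = datetime.strptime(date_string, '%Y-%m-%d %H:%M:%S').replace(tzinfo=tz)
    return int(local.timestamp())


start_date = to_unix('2021-01-01 00:00:00')
end_date = to_unix('2021-02-28 23:59:59')
print(start_date, end_date)
```

Under that assumption, the two printed values match the constants used in the loop.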


Now that we had the base Reddit data, it was time to apply the sentiment model. The model allows us to determine the overall attitude (whether positive or negative) of each post in our dataset. For example, “I like dogs” would be a positive sentiment whereas “I don’t like cats” would be a negative sentiment. Since our posts had no sentiment labels to start with, we chose to go with a pre-trained model. After researching a few packages, we decided to go with the Flair package. It seemed to give good overall performance while remaining easy to work with. Lastly, since many of the posts in our dataset don’t have text in the body, we chose to predict a sentiment for the titles of the posts only.


from flair.data import Sentence
from flair.models import TextClassifier

flair_sentiment = TextClassifier.load('en-sentiment')  # Load pre-trained model

posts['sentiment'] = ''  # Create the column before filling it
for index, row in posts.iterrows():  # Iterate over the rows of the dataframe
    s = Sentence(row['title'])  # Retrieve title of post
    flair_sentiment.predict(s)  # Predict sentiment
    posts.loc[index, 'sentiment'] = s.labels[0].value  # Add sentiment label ('POSITIVE' or 'NEGATIVE') to dataframe
posts.to_csv('reddit_data_sentiment.csv')  # Export results


The final step in collecting the data was retrieving the GameStop (GME) stock price. This was very straightforward with the pandas_datareader package. In just a few lines of code, we were able to pull the data for the date range we were interested in.


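import pandas as pd
from pandas_datareader import data

start = pd.to_datetime('2021-01-01')
end = pd.to_datetime('2021-02-28')

gme = data.DataReader('GME', 'yahoo', start, end)  # Pull daily GME prices from Yahoo
gme.to_csv('GME_Stocks.csv')  # Export results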


Analysis

Now for the fun part… the analysis! We performed all the subsequent analysis in Excel. We find it easier to manipulate and “play” with the data compared to coding it in Python. Keep in mind that the analysis tool a person uses to analyze data is completely arbitrary. The real science comes in with the methodology.

Opening the two “.csv” files in Excel shows the Reddit posts with their predicted sentiment in one file and the daily GME stock prices in the other.

The first thing we needed to do was to confirm that our data was valid. For the Reddit data, it was rather straightforward. We compared what we were able to gather through our Python script with the actual posts on the website and confirmed that they matched. Thus, our Reddit data was valid. We also wanted to check that our stock price data was valid. To do this, we decided to plot the closing price of the stock over time as a quick check. We did so and observed a clear spike in stock price around the beginning of February. Comparing this graph with a graph generated by Google Finance, we could see that they matched. Thus, we were able to confirm that our stock price data was valid.  

In order to test the hypothesis we established earlier, we needed to compute the correlation between the number of Reddit posts per day and the closing stock price. We first aggregated the Reddit data by counting the number of posts each day, and matched each count with the closing stock price for the same day. Once this was done, we could calculate the correlation with an easy-to-use built-in Excel function.
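
For readers who prefer to stay in Python, the same aggregation can be sketched with pandas. This is not the workbook we used, only an illustration: the function name daily_post_correlation is hypothetical, the toy data is made up, and converting the “created” timestamps to UTC calendar dates is an assumption. Series.corr computes the Pearson correlation by default.

```python
import pandas as pd


def daily_post_correlation(posts, gme):
    """Correlate the number of posts per day with the daily closing price."""
    # 'created' holds Unix timestamps; convert them to calendar dates (UTC assumed)
    post_dates = pd.to_datetime(posts['created'], unit='s').dt.normalize()
    daily_counts = posts.groupby(post_dates).size().rename('post_count')

    # Index the closing prices by calendar date
    close = gme['Close'].copy()
    close.index = pd.to_datetime(close.index).normalize()

    # Inner join keeps only the days present in both datasets
    merged = pd.concat([daily_counts, close], axis=1, join='inner')
    return merged['post_count'].corr(merged['Close'])


# Toy data for illustration only (not real posts or prices)
posts = pd.DataFrame({'created': [1609484400, 1609488000, 1609570800,
                                  1609657200, 1609660800, 1609664400]})
gme = pd.DataFrame({'Close': [10.0, 5.0, 15.0]},
                   index=pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03']))

r = daily_post_correlation(posts, gme)
print(round(r, 3))
```

In the toy data the closing price is exactly five times the daily post count, so the two series are perfectly correlated.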

The correlation between the two variables was -0.109. Keep in mind that correlation is a measure of how closely two sets of data move together. If the two variables are positively correlated, then as the stock price goes up, the Reddit volume also goes up. If one goes up as the other goes down, then they are considered negatively correlated. Correlation itself is measured on a scale from -1 to 1. The closer the value is to 1 (or -1), the more strongly correlated the two sets of data are, while a value near 0 indicates little or no linear relationship. Since the correlation here was very small (-0.109), we concluded that the GME stock price and Reddit post volume were not correlated.
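
To make the scale concrete, here is a small sketch of the Pearson correlation coefficient written out by hand. The function name pearson and the example numbers are ours, chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of the products of the deviations from each mean
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


together = pearson([1, 2, 3, 4], [2, 4, 6, 8])    # both rise together
opposite = pearson([1, 2, 3, 4], [8, 6, 4, 2])    # one rises as the other falls
unrelated = pearson([1, 2, 3, 4], [1, -1, -1, 1])  # no linear trend
print(together, opposite, unrelated)
```

The first pair gives 1, the second gives -1, and the third gives 0. Our observed value of -0.109 sits close to the third case.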

Sentiment Analysis

Here, we wanted to see if the overall attitude, or sentiment, of the Reddit posts was correlated with the stock price. We hypothesized that the more positive the average sentiment was for a given day, the higher the stock price was for that day (and vice versa). In order to quantify the overall sentiment, we first computed the total number of positive and negative posts for each day. From there, we computed a new metric that divides the difference between the positive and negative counts by the total number of posts. If the metric is 0, then there were equal numbers of positive and negative posts. The closer the metric is to 1 (-1), the more positive (negative) the sentiment. The correlation between this metric and the stock price was 0.141. Again, we concluded that the overall sentiment of /r/wallstreetbets and the GME stock price were not correlated.
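
The daily sentiment metric described above can be sketched in pandas as follows. This is an illustration rather than our Excel workbook: the function name daily_sentiment_score is hypothetical, the toy data is made up, and the code assumes the “sentiment” column holds the strings 'POSITIVE' or 'NEGATIVE' and that “created” holds Unix timestamps.

```python
import pandas as pd


def daily_sentiment_score(posts):
    """Return (positive - negative) / total posts for each day."""
    # Convert Unix timestamps to calendar dates (UTC assumed)
    dates = pd.to_datetime(posts['created'], unit='s').dt.normalize()
    # Score each post +1 if positive and -1 if negative
    signed = posts['sentiment'].map({'POSITIVE': 1, 'NEGATIVE': -1})
    # The daily mean of the +1/-1 scores equals (positive - negative) / total
    return signed.groupby(dates).mean().rename('sentiment_score')


# Toy data for illustration only (not real posts)
posts = pd.DataFrame({
    'created': [1609484400, 1609488000,                          # day 1: two posts
                1609570800, 1609574400, 1609578000, 1609581600],  # day 2: four posts
    'sentiment': ['POSITIVE', 'NEGATIVE',
                  'POSITIVE', 'POSITIVE', 'POSITIVE', 'NEGATIVE'],
})

scores = daily_sentiment_score(posts)
print(scores)
```

The first toy day has one positive and one negative post, giving a score of 0. The second has three positive and one negative, giving (3 - 1) / 4 = 0.5.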

 

Conclusion

As you can see, the hypothesis we set out with was completely wrong… and that’s okay! This is all part of the scientific process. Science isn’t about being right or wrong, but rather, is about finding the true answer to your question. Thus, based on this data, we can say that neither the volume nor the sentiment of the Reddit posts was correlated with the GME stock price.

Our methodology was by no means perfect. There are many opportunities to improve this analysis. For example, there may be a much better way to pull the Reddit data that we are unaware of. Another thing to look into is the deleted Reddit posts that were omitted. Perhaps there’s some analysis of interest there? Beyond areas of improvement, there are many other things to try and test. It would be interesting to apply different sentiment analysis packages to the same dataset and see how and why they differ. Something else to try is checking whether the score or number of comments correlates with the stock price. The possibilities are endless, so go out there and put the science in data science!

Feel free to contact us at datascience@asu.edu with any questions or if you would like to share your findings with us.

The source code and Excel workbook can be found here.