Twitter stock price prediction

Published Feb. 03, 2022

After analyzing the correlation between Reddit data and stock prices in a previous blog post, we at the Unit for Data Science and Analytics began to wonder: "What about Twitter data?" We have seen in the past how certain tweets have moved stock prices, for better or for worse. So we formulated the hypothesis that tweets posted to Twitter affect stock prices, or, in other words, that average tweet sentiment is correlated with the change in stock prices. We decided to focus on Tesla and Elon Musk for our modeling and analysis, since we have seen how his tweets have directly influenced prices on the New York Stock Exchange.

Once the hypothesis was formulated, we did some research to see whether this had been attempted before. Sure enough, it has. One particular blog post from another writer stood out. In it, James Briggs outlines a method for pulling data from Twitter, performing sentiment analysis, and applying the results to stock prices. Our student team used James's blog post as a skeleton framework for our own version of the same thing: predicting stock prices from a Twitter data feed.

Data collection

To start, we needed to collect our data. To do this, we tapped into Twitter's API. To use Twitter's API (and consequently our code) yourself, you will need to create a developer account and obtain a bearer token. Twitter's API is very powerful; however, there are limitations, the biggest of which is that we can only pull 100 tweets per request. We want to pull as many tweets as we can within a specific timeframe, which is well over 100. To get around this, we created a loop that steps backwards through the previous week in 60-minute increments, pulling up to 100 tweets per window. We also define the query here to be "tesla OR tsla OR elon musk and -spacex". In other words, we search for mentions of Tesla, TSLA (Tesla's stock ticker symbol), or Elon Musk, and exclude any mentions of SpaceX. We removed mentions of SpaceX since it is closely tied to Elon Musk and we want the data to be as clean as possible. The code to achieve this can be found below. Please note that this section takes several minutes to complete.

______________________________________________________________________________

import requests
import pandas as pd
import time
import regex as re
from datetime import datetime, timedelta

def get_data(tweet):
    data = {
        'id': tweet['id'],
        'created_at': tweet['created_at'],
        'text': tweet['text'],
        'retweet_count': tweet['public_metrics']['retweet_count'],
        'like_count': tweet['public_metrics']['like_count'],
        'reply_count': tweet['public_metrics']['reply_count']
        }
    return data

whitespace = re.compile(r"\s+")
web_address = re.compile(r"(?i)https?:\/\/[a-z0-9.~_\-\/]+")
tesla = re.compile(r"(?i)@Tesla(?=\b)")
user = re.compile(r"(?i)@[a-z0-9_]+")

# setup the API request
endpoint = 'https://api.twitter.com/2/tweets/search/recent'
# BEARER_TOKEN must already be defined (see the note after this block on loading it safely)
headers = {'authorization': f'Bearer {BEARER_TOKEN}'}
params = {
    'query': '(tesla OR tsla OR elon musk and -spacex -is:retweet) (lang:en)',
    'max_results': '100',
    'tweet.fields': 'created_at,lang,public_metrics'
}

dtformat = '%Y-%m-%dT%H:%M:%SZ'  # the date format string required by twitter

# we use this function to subtract 60 mins from our datetime string
def time_travel(now, mins):
    now = datetime.strptime(now, dtformat)
    back_in_time = now - timedelta(minutes=mins)
    return back_in_time.strftime(dtformat)

now = datetime.utcnow() - timedelta(seconds=30)  # Twitter expects UTC, and end_time must be slightly in the past
last_week = now - timedelta(days=6)  # the recent search endpoint only covers roughly the last 7 days
now = now.strftime(dtformat)  # convert now datetime to the format required by the API

df = pd.DataFrame()  # initialize dataframe to store tweets
while True:
    if datetime.strptime(now, dtformat) < last_week:
        # if we have reached 6 days ago, break the loop
        break
    pre60 = time_travel(now, 60)  # get x minutes before 'now'
    # assign from and to datetime parameters for the API
    params['start_time'] = pre60
    params['end_time'] = now
    response = requests.get(endpoint,
                            params=params,
                            headers=headers)  # send the request
    time.sleep(2)
    now = pre60  # move the window 60 minutes earlier
    # iteratively append our tweet data to our dataframe
    for tweet in response.json().get('data', []):  # some windows may contain no tweets
        row = get_data(tweet)  # we defined this function earlier
        if row['like_count'] >= 0 and row['retweet_count'] >= 0 and row['reply_count'] >= 0:
            df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
______________________________________________________________________________
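
If you are running this yourself, one way to avoid pasting the bearer token directly into the notebook is to read it from an environment variable and check the response status before parsing it. A minimal sketch (the variable name TWITTER_BEARER_TOKEN and the status check are our own additions, not part of the original notebook):

______________________________________________________________________________

import os
import requests

# assumes you exported the token beforehand, e.g. in your shell:
#   export TWITTER_BEARER_TOKEN="AAAA..."
BEARER_TOKEN = os.environ['TWITTER_BEARER_TOKEN']

headers = {'authorization': f'Bearer {BEARER_TOKEN}'}
response = requests.get(endpoint, params=params, headers=headers)
if response.status_code != 200:
    # 429 means you have hit the rate limit; anything else is worth inspecting
    raise RuntimeError(f'Request failed: {response.status_code} {response.text}')
______________________________________________________________________________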

Tweet cleaning and sentiment modeling

Now that we have our data, it's time to clean our tweets and apply a sentiment model to each one. For cleaning, we removed excess whitespace, web addresses, and username mentions, since these do not help with sentiment modeling. As for the sentiment modeling, we tested both the Flair model and NLTK's VADER model (SentimentIntensityAnalyzer) and found more stable results from NLTK. Our testing of different sentiment models was not exhaustive, and this continues to be an area of interest for us. In this section we also deal with time zones. Our Twitter data is timestamped in UTC, but our stock ticker data is based on New York's time zone. We need these two time zones to align in order to make proper comparisons, since tweets and stock prices are time-dependent. To do this, we convert Twitter's UTC timestamps to New York time by shifting them back 4 hours.

______________________________________________________________________________

import nltk
nltk.download('vader_lexicon')  # the VADER lexicon is required by SentimentIntensityAnalyzer
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

def is_positive(tweet: str) -> bool:
    """True if tweet has positive compound sentiment, False otherwise."""
    return sia.polarity_scores(tweet)["compound"] > 0

# we will append the cleaned tweets, sentiment labels, and dates here
clean_tweets = []
timestamp = []
binary = []

for created_at in df['created_at']:  # renamed so we don't shadow the time module imported earlier
    timestamp.append(((datetime.strptime(created_at, '%Y-%m-%dT%H:%M:%S.%fZ')
                       - timedelta(hours=4))  # shift UTC to New York time
                      + timedelta(hours=0)    # optional delay between a tweet and the market reaction
                      ).strftime('%Y-%m-%d'))

for tweet in df['text']:
    # use the compiled patterns above to strip out noise before scoring sentiment
    tweet = whitespace.sub(' ', tweet)
    tweet = web_address.sub('', tweet)
    tweet = tesla.sub('Tesla', tweet)
    tweet = user.sub('', tweet)
    binary.append(1 if is_positive(tweet) else 0)  # 1 = positive, 0 = neutral or negative
    clean_tweets.append(tweet)

# add the cleaned text, sentiment labels, and New York dates to the tweets dataframe
df['text_clean'] = clean_tweets
df['binary'] = binary
df['Date'] = timestamp
______________________________________________________________________________
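
One caveat on the fixed 4-hour shift: New York is 4 hours behind UTC only during daylight saving time (it is 5 hours behind in winter). If you want the conversion to hold year-round, Python's zoneinfo module (Python 3.9+) handles this automatically. A small sketch of that alternative, not part of our original notebook:

______________________________________________________________________________

from datetime import datetime
from zoneinfo import ZoneInfo

def to_new_york_date(created_at: str) -> str:
    """Convert a Twitter UTC timestamp string to the New York calendar date."""
    utc_time = datetime.strptime(created_at, '%Y-%m-%dT%H:%M:%S.%fZ').replace(tzinfo=ZoneInfo('UTC'))
    return utc_time.astimezone(ZoneInfo('America/New_York')).strftime('%Y-%m-%d')

# 2 a.m. UTC on Feb 3 is still the evening of Feb 2 in New York
print(to_new_york_date('2022-02-03T02:15:00.000Z'))  # 2022-02-02
______________________________________________________________________________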

Pulling stock data

Here, we pull stock data in a different way than we have previously. In our Reddit data analysis we used the pandas-datareader package; for this analysis we used the yfinance package instead. We chose yfinance because it gave us more data to work with and made it easy to calculate the day-to-day percentage change. We originally plotted the stock price against average sentiment, but we decided to analyze the percentage change instead, since that is what we are really interested in. So, as you can see, there are many ways of solving the same problem. Keep that in mind as you explore data science!

______________________________________________________________________________

import yfinance as yf

tsla = yf.download(
    'TSLA',
    start=datetime.strptime(df['created_at'].min(), '%Y-%m-%dT%H:%M:%S.%fZ').strftime('%Y-%m-%d'),
    end=(datetime.strptime(df['created_at'].max(), '%Y-%m-%dT%H:%M:%S.%fZ') + timedelta(days=2)).strftime('%Y-%m-%d'),
    interval='1d'
)

tsla_stock = tsla.pct_change().reset_index()  # day-over-day percentage change for every column

# convert the pandas timestamps to plain date strings so they match the tweets dataframe
converted = []
for trading_day in tsla_stock['Date']:
    converted.append(trading_day.strftime('%Y-%m-%d'))
tsla_stock['Date'] = converted
______________________________________________________________________________
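
As a quick illustration of what pct_change computes, here is a toy example with made-up closing prices (not real TSLA data):

______________________________________________________________________________

import pandas as pd

prices = pd.Series(
    [100.0, 105.0, 99.75],
    index=pd.to_datetime(['2022-01-31', '2022-02-01', '2022-02-02']),
    name='Close',
)
print(prices.pct_change())
# 2022-01-31         NaN
# 2022-02-01    0.050000   # (105.00 - 100.00) / 100.00
# 2022-02-02   -0.050000   # ( 99.75 - 105.00) / 105.00
______________________________________________________________________________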

Correlational analysis

Now we move on to the actual analysis. To do this, we take the average sentiment across all the tweets for each day and plot it against that day's percentage change in the stock price. If our hypothesis is correct, we would see the percent change in stock prices increase as the average sentiment increases, and vice versa. If the two were perfectly correlated, the points would fall on a straight diagonal line (see the graph below). We also calculated the correlation coefficient to get a more objective measure of how the two variables relate. Discussion of our findings is covered in the next section.

______________________________________________________________________________

means = df.groupby(['Date'], as_index=False).mean(numeric_only=True)  # average daily sentiment
combined = means.merge(tsla_stock, on='Date', how='inner')

# correlation between average daily sentiment and the daily change in the closing price
print(combined['binary'].corr(combined['Close']))

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1)

# plot average daily sentiment against the daily percent change in the closing price
ax.plot(combined['binary'], combined['Close'], 'ro')
ax.set_xlabel('Average daily tweet sentiment')
ax.set_ylabel('Daily percent change in TSLA close')
plt.show()

______________________________________________________________________________
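
If you want a sense of how much weight to put on that coefficient, scipy's pearsonr returns a p-value alongside it. A short sketch on the same merged dataframe (an optional addition, assuming scipy is installed):

______________________________________________________________________________

from scipy.stats import pearsonr

r, p_value = pearsonr(combined['binary'], combined['Close'])
print(f'Pearson r = {r:.3f}, p-value = {p_value:.3f}')
# with only a handful of trading days in a one-week window, expect the estimate to be very noisy
______________________________________________________________________________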

[Figure: Twitter stock price prediction graph, plotting average daily tweet sentiment against the daily percent change in TSLA's closing price]

Results

After all this work, we couldn’t find a strong correlation between average Twitter sentiment and change in stock prices. Our correlation coefficient fluctuated wildly and our graph’s distribution was cloud-like. In other words, our hypothesis was wrong and that’s okay! It’s all a part of the science. We came to the same conclusion in our Reddit analysis.

As mentioned previously, this was not an exhaustive exploration of what is possible with this code. Some areas yet to be explored include changing the query to incorporate different keywords, attempting this analysis on another company besides Tesla, and swapping out the sentiment model. There are surely many more possibilities for further research that we have not thought of. That's where you come in! We hope this analysis sparks your curiosity to explore data and data science. Let's put the science in data science!

Feel free to contact us at datascience@asu.edu with any questions or if you would like to share your findings with us.

The source code can be found on GitHub.