News and Blog

Blog Archive

May 05, 2021 ·

What is quantum computing?

To better understand quantum computers, we must first know the basics of classical computing. A classical computer is essentially the one you are using to read this post! It encodes data as voltage states across transistors, each representing a binary 0 or 1, and performs its computations on those bits.

Unlike classical computers, quantum computers encode data in the quantum states of subatomic particles. They exploit superposition and entanglement to tackle certain complex problems far more efficiently than a classical computer can. For those problems, the result is a computer that can finish in minutes a calculation that would normally take a classical computer years!
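As a toy illustration only (a classical simulation in numpy, not real quantum hardware), a single qubit's state can be written as a two-element complex vector, and a Hadamard gate places it in an equal superposition of 0 and 1. Describing n qubits classically takes 2^n amplitudes, which is one intuition for the scaling discussed in the summary table below.

import numpy as np

ket0 = np.array([1, 0], dtype=complex)  # The |0> state
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)  # Hadamard gate

superposition = H @ ket0  # Equal superposition (|0> + |1>) / sqrt(2)
print(superposition)  # [0.70710678+0.j 0.70710678+0.j]

n_qubits = 10
print(2 ** n_qubits)  # 1024 amplitudes are needed just to describe 10 qubits classically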

 

How will they impact our daily lives?

Quantum computers will most likely NOT replace classical computers for most tasks, at least not at first. We will most likely see them applied to large, complex computations for big companies and organizations, such as modeling how molecules interact with the human body or finding the optimal routes for all package deliveries simultaneously. It’s better to think of quantum computers as something that will work in parallel with classical computers rather than something that will replace them altogether.

 

What does this mean for data scientists? (Why should we care?)

While the research and development are still relatively young, we should expect to see the commercial use of quantum computers within the next few years. Once this happens, development will really begin to accelerate and quantum computers will become commonplace before we know it. Thus, data scientists should begin to get involved now in order to be a driving force in the industry.

 

What is the Unit for Data Science and Analytics doing?

We are looking to help advance the research of quantum computers for use in data science problems. This includes staying on top of current technology trends & developments, working with other staff & faculty within ASU on the topic, and developing algorithms that could be used on quantum computers.

 

How can I get involved?

Contact us directly at datascience@asu.edu. We work with any and all students & faculty on cutting-edge research in data science. We look forward to hearing from you!

 

Summary:

Classical Computing

- Performs calculations using transistors, which represent data as either a 0 or a 1.
- Computational power increases linearly as more transistors are added.
- Has low error rates and can operate at room temperature.
- Well suited for everyday tasks such as video streaming, word processing, and other basic computations.

Quantum Computing

- Performs calculations using qubits, which can hold a superposition of 0 and 1.
- Computational power increases exponentially as more qubits are added.
- Currently has high error rates and must operate at extremely cold temperatures.
- Well suited for optimization problems, simulations, and other exponentially complex computing tasks.

 

Further Learning:

Introductory resources

 

Advanced resources

 

Technology leaders

Mar 23, 2021 ·

In light of recent headlines surrounding Reddit and the New York Stock Exchange, we at the Unit for Data Science and Analytics at ASU Library decided to do some analysis on the subject. For those unfamiliar, on January 22, 2021, a group of Reddit users organized themselves on the r/wallstreetbets subreddit and successfully performed a short squeeze on GameStop stock. This drove the stock price up dramatically and, in turn, put financial pressure on hedge funds. In the midst of all this, we had a thought: “Is there a relationship between the stock price and activity on the subreddit?” So, we decided to investigate.

First, we had to develop a more formal hypothesis. We believed that the recent spike in GameStop’s stock (GME) price was correlated with the amount of activity on the /r/wallstreetbets subreddit. Furthermore, we were curious to see whether the overall sentiment of the post titles was also correlated with the stock price. However, before we could perform any of this analysis, we had to collect the data.

Collecting the Data

To start, we needed to retrieve data from the /r/wallstreetbets subreddit. We used the Praw package in Python to do this; however, a Reddit app must first be created on the site before any data can be pulled. Details on how to create a Reddit app can be found here. The code necessary to connect to Reddit through Python is below; you will still need to provide your own Reddit app credentials in order to connect successfully. Please note that all subsequent code & packages are based in Python.


import praw

reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                     client_secret="YOUR_CLIENT_SECRET",
                     redirect_uri="http://localhost:8080",
                     user_agent="YOUR_USER_AGENT")  # Credentials from your Reddit app
print(reddit.auth.url(["identity"], "...", "permanent"))  # Authorization URL to grant access


Once we were able to connect to Reddit, we encountered an issue. The Praw package has no way to pull posts for specific dates, only the 25 or so most recent posts. To get around this, we used the Pushshift API (credit to Reddit user u/kungming2 for the submissions_pushshift_praw function definition). This allowed us to pull posts from specific dates; however, there was another problem. Pushshift only allows a maximum of 100 posts per pull, so we created a loop to pull all of the posts across the full date range. (NOTE: This loop takes several minutes to pull all the data.)
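For reference, a minimal sketch of what a submissions_pushshift_praw-style helper might look like is shown below. The actual function definition comes from u/kungming2; this version is only an illustrative assumption that queries the public Pushshift endpoint and then hydrates each post through the Praw connection created above.

import requests

def submissions_pushshift_praw(subreddit, start, end, limit=100):
    # Illustrative stand-in for u/kungming2's helper (assumed behavior, not the original)
    url = "https://api.pushshift.io/reddit/search/submission/"
    params = {"subreddit": subreddit, "after": start, "before": end,
              "size": min(limit, 100), "sort": "asc"}  # Pushshift caps each pull at 100 posts
    ids = [item["id"] for item in requests.get(url, params=params).json()["data"]]
    return [reddit.submission(id=post_id) for post_id in ids]  # Fetch full posts via the praw connection above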


import pandas as pd

start_date = 1609484400  # '2021-01-01 00:00:00' in Unix time
end_date = 1614581999  # '2021-02-28 23:59:59' in Unix time

posts = pd.DataFrame(columns=['title', 'flair', 'score', 'upvote_ratio', 'id',
                              'subreddit', 'url', 'num_comments', 'body', 'created'])  # Dataframe to store results

while start_date < end_date:  # Continue looping until the end date is reached
    S = submissions_pushshift_praw(subreddit='wallstreetbets',
                                   start=start_date, end=end_date, limit=10000)  # Pull posts within the date range

    for post in S:  # Loop through each post
        try:  # Try/except to catch any erroneous post pulls
            if post.selftext != '[removed]' and post.selftext != '[deleted]':  # Skip removed/deleted posts
                posts = posts.append(
                    {'title': post.title,
                     'flair': post.link_flair_css_class,
                     'score': post.score,
                     'upvote_ratio': post.upvote_ratio,
                     'id': post.id,
                     'subreddit': post.subreddit,
                     'url': post.url,
                     'num_comments': post.num_comments,
                     'body': post.selftext,
                     'created': post.created}, ignore_index=True)  # Retrieve post data and append to dataframe
        except Exception:
            continue  # Skip the post if an error is raised

    if len(S) < 100:  # The final pull returns fewer than 100 posts
        break
    start_date = posts['created'].max()  # Start the next pull from the newest post collected so far
    print(start_date)  # An indicator of progress


Now that we had the base Reddit data, it was time to apply the sentiment model. The model allows us to determine the overall attitude (positive or negative) of each post in our dataset. For example, “I like dogs” carries a positive sentiment, whereas “I don’t like cats” carries a negative sentiment. Since our data has no sentiment labels to start with, we chose to use a pre-trained model. After researching a few packages, we decided to go with the Flair package, which gave good overall performance while remaining easy to work with. Lastly, since many of the posts in our dataset have no text in the body, we chose to assign a sentiment to the post titles only.


import flair

flair_sentiment = flair.models.TextClassifier.load('en-sentiment')  # Load the pre-trained sentiment model

for index, row in posts.iterrows():  # Iterate over the rows of the dataframe
    s = flair.data.Sentence(row['title'])  # Wrap the post title in a Flair Sentence
    flair_sentiment.predict(s)  # Predict sentiment
    posts.loc[index, 'sentiment'] = str(s.labels[0])  # Add the sentiment label to the dataframe
posts.to_csv('reddit_data_sentiment.csv')  # Export results


The final step in collecting the data was retrieving the GameStop (GME) stock price. This was very straightforward with the pandas_datareader package: in just a few lines of code, we were able to pull the data for the date range we were interested in.


import pandas as pd
from pandas_datareader import data

start = pd.to_datetime('2021-01-01')
end = pd.to_datetime('2021-02-28')

gme = data.DataReader('GME', 'yahoo', start, end)  # Daily GME prices from Yahoo Finance
gme.to_csv('GME_Stocks.csv')


Analysis

Now for the fun part… the analysis! We performed all of the subsequent analysis in Excel, which we find easier for manipulating and “playing” with the data than coding it in Python. Keep in mind that the choice of analysis tool is largely a matter of preference; the real science is in the methodology.

Opening the two .csv files in Excel shows the collected Reddit posts and the daily GME price data.

The first thing we needed to do was to confirm that our data was valid. For the Reddit data, it was rather straightforward. We compared what we were able to gather through our Python script with the actual posts on the website and confirmed that they matched. Thus, our Reddit data was valid. We also wanted to check that our stock price data was valid. To do this, we decided to plot the closing price of the stock over time as a quick check. We did so and observed a clear spike in stock price around the beginning of February. Comparing this graph with a graph generated by Google Finance, we could see that they matched. Thus, we were able to confirm that our stock price data was valid.  
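For readers following along in Python instead of Excel, a quick sanity-check plot like the one described above might look something like the sketch below (it assumes the GME_Stocks.csv file exported earlier, with column names from pandas_datareader's Yahoo output).

import pandas as pd
import matplotlib.pyplot as plt

gme_check = pd.read_csv('GME_Stocks.csv', parse_dates=['Date'])  # File exported earlier
plt.plot(gme_check['Date'], gme_check['Close'])  # Daily closing price
plt.title('GME closing price, Jan-Feb 2021')
plt.xlabel('Date')
plt.ylabel('Close (USD)')
plt.show()  # Compare the shape of this curve against a Google Finance chart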

In order to test the hypothesis we established earlier, we needed to compute the correlation between the number of Reddit posts per day and the closing stock price. We first aggregated the Reddit data by counting the number of posts each day, and matched that up with the stock price for the same day. Once this was done, we could calculate the correlation with an easy-to-use built-in Excel function.
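Although we did this step in Excel, an equivalent Python sketch is shown below for readers who prefer code (the 'created' and 'Close' column names come from the dataframes built earlier; the rest is an assumption about file layout).

import pandas as pd

posts = pd.read_csv('reddit_data_sentiment.csv')
gme = pd.read_csv('GME_Stocks.csv', parse_dates=['Date'])

posts['date'] = pd.to_datetime(posts['created'], unit='s').dt.date  # Unix timestamp -> calendar day
daily_volume = posts.groupby('date').size().reset_index(name='num_posts')  # Posts per day

gme['date'] = gme['Date'].dt.date
merged = daily_volume.merge(gme[['date', 'Close']], on='date')  # Align post counts with closing prices
print(merged['num_posts'].corr(merged['Close']))  # Pearson correlation (reported as -0.109 below)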

As you can see, the correlation between the two variables was -0.109. Keep in mind that correlation measures how closely two sets of data move together. If the two variables were positively correlated, then as the stock price goes up, the Reddit volume also goes up; if one goes up as the other goes down, they are negatively correlated. Correlation is measured on a scale from -1 to 1, and the closer the value is to 1 (or -1), the more strongly correlated the two sets of data are. Since the correlation here was very small (-0.109), we concluded that the GME stock price and Reddit post volume were not meaningfully correlated.

Sentiment Analysis

Here, we wanted to see if the overall attitude, or sentiment, of the Reddit posts was correlated with the stock price. We hypothesized that the more positive the average sentiment was for a given day, the higher the stock price would be for that day (and vice versa). To quantify the overall sentiment, we first computed the total number of positive and negative posts for each day. From there we computed a new metric that divides the difference between the positive and negative counts by the total number of posts. If the metric is 0, there were equal numbers of positive and negative posts; the closer the metric is to 1 (or -1), the more positive (or negative) the overall sentiment. The correlation between this metric and the stock price was 0.141, so again we concluded that the overall sentiment of /r/wallstreetbets and the GME stock price were not meaningfully correlated.
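Again, the original calculation was done in Excel; the sketch below shows one way the same metric could be computed in Python, building on the dataframes from the previous sketch (the 'POSITIVE'/'NEGATIVE' labels are what the Flair model produces; everything else is assumed).

# Daily sentiment metric: (positive posts - negative posts) / total posts
posts['positive'] = posts['sentiment'].astype(str).str.contains('POSITIVE').astype(int)
posts['negative'] = 1 - posts['positive']

daily = posts.groupby('date').agg(pos=('positive', 'sum'),
                                  neg=('negative', 'sum'),
                                  total=('sentiment', 'count'))
daily['sentiment_metric'] = (daily['pos'] - daily['neg']) / daily['total']

merged_sent = daily.reset_index().merge(gme[['date', 'Close']], on='date')
print(merged_sent['sentiment_metric'].corr(merged_sent['Close']))  # Reported as 0.141 above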

 

Conclusion

As you can see, the hypothesis we set out with was completely wrong… and that’s okay! This is all part of the scientific process. Science isn’t about being right or wrong; it’s about finding the true answer to your question. Thus, we can say with confidence that the volume and sentiment of the Reddit posts were not correlated with the GME stock price.

Our methodology was by no means perfect, and there are many opportunities to improve this analysis. For example, there may be a much better way to pull the Reddit data that we are unaware of. Another thing to look into is the deleted Reddit posts that were omitted; perhaps there’s some analysis of interest there? Beyond areas of improvement, there are many other things to try and test. It would be interesting to apply different sentiment analysis packages to the same dataset and see how and why they differ. Something else to try is checking whether the score or number of comments correlates with the stock price. The possibilities are endless, so go out there and put the science in data science!

Feel free to contact us at datascience@asu.edu with any questions or if you would like to share your findings with us.

The source code and Excel workbook can be found here.

Feb 08, 2021 ·

What COVID-19 Dashboards Aren’t Telling Us

These graphics and interactives are supposed to help us get a better understanding of the state of the pandemic. But too often, they offer incomplete pictures.

By MICHAEL SIMEONE, GRACIE VALDEZ, and SHAWN WALKER

FEB 08, 2021, 1:36 PM

Photo illustration by Slate. Photos by New York Times, Giacomo Carra on Unsplash, Longmongdoosi/iStock/Getty Images Plus, Erik Mclean on Unsplash, and the Center for Systems Science and Engineering at Johns Hopkins University.

We are surrounded by charts, graphs, and dashboards that try to summarize and surveil the COVID-19 pandemic in the United States. Multicolor maps of cases by county or ZIP code, jagged time series plots depicting case rates and fatalities. Thin bars lined up side by side in cigarette formation to document tests or hospital capacity. Even when the charts aren’t in front of us, we are discussing them with new household words—“spiking,” “flattening,” “hot spot.”

We haven’t seen this kind of explosion of data visualization since the advent of the Weather Channel. Millions are viewing charts published in newspapers, social media, state departments of health, all to check in on the status of the epidemic at various scales and localities. When was the last time a good portion of the country tuned in daily to a handful of charts?

But these dashboards, none of which existed before March, are experiencing some growing pains. They offer the public a false sense of transparency and surveillance in a time of intense crisis. In the Washington Post, Jacqueline Wernimont characterizes them as “vital, yet flawed,” a turn of phrase that captures exactly how much we need tools to better communicate about the pandemic. Those who make and support these dashboards deserve gratitude and recognition for their efforts, but these resources are not without meaningful flaws that we should aspire to work through. While the purpose of data dashboards is to keep audiences up to date on real-time, reliable data, many COVID dashboards risk confusing audiences because of their design choices and lack of clear explanations about the data. Based on our repeated observations over the past year, COVID-19 dashboards have not fundamentally changed in their appearance or function since their inception. Peeking under the hood at the code that runs them suggests that the vast majority of these sites have been rolled out using off-the-shelf business analytics products, or repurposed visualizations from other topics or stories. This is for good reason: Developing and maintaining a website with live data and interactive charts is intensive in time, money, and computing infrastructure resources. But this also means that these websites do not fully meet the emergent needs of people living through the COVID-19 crisis. And even in the case of journalistic venues where visualizations have been evolving to be more bespoke for the pandemic, visualizations still adhere to conventions that confuse the purpose and message of the data presented. And these collective shortcomings could backfire.

At first glance, these dashboards seem oddly familiar but foreign at the same time. Their design mimics familiar tools such as the activity counter metrics from our Fitbits, screen time reports from iPhones and iPads, and the quarterly sales reports from major companies. But what is their intended purpose? What specific problems or decisions are they supporting for viewers? Visualizing and reporting the summary figures for the COVID-19 pandemic is not as straightforward as measuring the number of steps you take in a day or number of transactions in a week. What counts as a fatality for COVID-19? Are cases counted as a positive test result or a presumed case determined by a patient’s clinical symptoms? When a daily total for new cases is posted, how do you assign a specific date to a new case? Is there a lag period? How often are these updated? What is the difference between a serology test and a PCR test? Are all of them counted together? What about patients who get two kinds of tests (nasal and saliva, for instance)—are both results counted or just one? What’s the rate of false negative results?

Every state manages testing differently, and state and local bureaucracies that span private and public health facilities and government offices further complicate the situation. So there isn’t a single set of answers that will work for every dashboard or graphic. And from Google’s own dashboards to those of states around the U.S., there are not visible, consistent explanations for how the numbers arrived in your internet browser window. Some states place explanation at the forefront, but many others bury this in fine print. The distance between the charismatic set of visuals and the figuring that got us the numbers in the first place presents a vacuum for meaning making. And misinformation loves a vacuum for meaning making.

Seen by a friendly reader who assumes consistent empirical values and practices, these dashboards could be considered helpful information. Seen by a hostile reader who sees any gaps in explanations as an opportunity for other narratives to make sense, these dashboards can be “evidence” used to draw very different conclusions about COVID-19. And these narratives abound. Throughout unofficial channels on Facebook, YouTube, Twitter, Reddit, (formerly) Parler, and message boards, there have been sustained concerns raised about the reliability of reporting COVID-19 data. These concerns more often than not paint a picture of government conspiracy, with the global pandemic a hoax or false flag used to secure power for nefarious actors. Seen in this way, these charts and graphs are proof of a nefarious liberal plot aimed at fooling the public, seizing political power, and robbing citizens of their liberty.

In each of these examples, data and charts are a reason for more distrust. To be fair, no dashboard or chart is going to shift a person’s worldview to suddenly transform their trust in a given institution. Those already convinced of a corrupt plot to ruin America won’t be persuaded by better documentation. But while these extreme visions of government conspiracy bolstered by bad data will not convert everyone who reads them, they do stand to sow significant doubt. And when anyone tries to validate wild claims about suspicious data by going to their state’s dashboard, they have a good chance of encountering information that isn’t well-explained or is flat-out difficult to understand.

These charts and dashboards exist in a time of simultaneous public health and information crisis. Earlier in the pandemic, one team of researchers found that more than 50 percent of social media posts about COVID-19 were likely bot and troll activity. Misinformation is a reliably consistent feature of the information streams that people encounter every day: from simple allegations that hospitals are using COVID-19 to make more money to claims that the virus is part of a global plot to kill off a large portion of the population through the use of an engineered bioweapon paid for by President Barack Obama and Anthony Fauci. What much of this misinformation has in common is that it requires people to distrust the official reports of cases, deaths, hospital beds, and more, and it is far easier to sow this distrust when there are not clear explanations for what we see. What’s more, these dashboards run on interactive web applications that are difficult to archive for outside parties, and so re-creating the reports of a given dashboard for a given day is nearly impossible—it is difficult to check any claim against the historical record of COVID reporting. If the charts look good but are not clearly explained and are not accountable to archiving, they become more available to anyone reading the pandemic as a hoax or conspiracy.

For instance, the Arizona Department of Health Services dashboard for Dec. 9—a time when cases (more than 7,000) and hospitalizations (more than 400) were climbing rapidly—showed what appears to be a falling number of hospitalizations:


The Arizona Department of Health Services dashboard on Dec. 9. Arizona Department of Health Services

The problem is that the cases are updated daily, and those daily tallies can be updated retroactively due to a lag in reporting and processing cases. This is not spelled out explicitly.

That can make it easier for observers to manipulate the information. In early July, Arizona Rep. Andy Biggs tweeted a screenshot of the Arizona Department of Health Services’ COVID-19 dashboard calling on the public to not accept the media’s word and “research the data and facts” for themselves. He claimed that the screenshot showed a huge improvement in COVID-19 cases, which aligned with Biggs’ assertion that COVID-19 is an overblown hoax cooked up by the media and the Democrats.

Andy Biggs/Twitter

And the misinformation involving charts and dashboards need not involve conspiracy theories and overtly political narratives. Many of the charts often obscure some of the biggest challenges of the pandemic. For instance, we have seen higher death rates among minority/low-income communities and tribal nations. But the lack of reporting from these areas—itself the result of the same lack of resources that compounds the impacts of the pandemic—is too often represented in a way that makes it appear as if there is no problem at all. For instance, this chart from the front page of the New York Times, which uses color shading to indicate the capacity of ICU facilities by hospital jurisdiction, presents areas with “no data” as the lightest possible shade for the map. And while there is a labeled scale at the top of the chart, it leads one to believe that the lighter the color, the further the area is from crisis.

Dec. 9. The New York Times

Northeastern Arizona, home to Navajo, Paiute, and Hopi people, is shaded in the lightest possible color. Indeed, many rural areas across the U.S. are treated similarly. But these areas have been some of the hardest hit, and presentations like this one tend to emphasize areas with more robust reporting instead. Clicking through to the full story reveals a more elaborate legend and chart:

The New York Times

 

While this version has a legend designating the lightest shade on the map as “no data,” this is not a responsible presentation. Given the conditions surrounding the pandemic and the impact of COVID-19, and the rhetorical positioning of numbers by this map wherein darker and redder means more dire, “no data” arguably ought to visually resemble the highest values in the data, not the lowest. In other words, if the question is “where is it getting bad?,” then lack of reliable data could be an indicator for things being quite bad. Areas that lack resources, medical facilities, and support for bureaucracies that manage the pandemic will not report data as consistently, and have critical vulnerabilities that are easy to blink past in this rendition. As presented, it is far too easy to mistakenly conclude that there is no COVID problem in areas for which there is no data. And the color similarity between lower ICU bed usage and no data at all encodes two very different situations as far too similar. Using very different colors for nonreporting areas would be less visually appealing, and almost certainly cost the audience more time to understand the chart, but would mitigate the risk of someone making a snap and erroneous conclusion.

The use of bar charts to present raw data can also obscure the view of the public and decision-makers when it comes to vulnerable populations. Take this example from the Virginia Department of Health in June. It offers little sense of how the raw count of affected persons stacks up against the overall demographics of the state. You cannot evaluate any disproportionate impacts by examining these charts unless you consult additional data. But this understanding of what percentage of persons of a given race and ethnicity are affected is crucial to getting any sense of the intersectional impact of the virus. Otherwise, why report cases by ethnicity at all?

The Virginia Department of Health dashboard on Dec. 9. Virginia Department of Health

 

And then there is the consistent reliance on maps and mapping to show the status of the pandemic. In our 50-state survey of state COVID dashboards this summer, nearly every state used a map as the central graphic for their health dashboards. Texas provides a good example. But it is misleading to frame the COVID-19 pandemic as something where spatial proximity matters most. Namely, being close to a county that has a high infection rate does not necessarily mean that your own county is at risk. The virus spreads where people go; county-by-county choropleths (or, maps shaded by area) are insufficient to show this kind of complex interaction among roadways, airports, travelers, and their destinations. These maps offer a false sense of surveillance. North Dakota, which was the “right color” this summer and not close to any affected areas, became the epicenter of one of the worst outbreaks by the end of 2020. Shaded county-by-county COVID maps had no descriptive or predictive value for that outbreak. They show us the world without showing us the parts that matter.

While facts and clarity alone will not solve misinformation or misunderstanding, improvements to COVID dashboards should decrease conjecture, senses of false security, and ambiguity. To be clear, the burden of presenting COVID-19 data is enormous. But there is something perverse about using the same tools to show the massive losses to a historic disease as one might use for fourth-quarter sales by region. What we show and how we show it should rise to the occasion. We should make sure that dashboards offer more, not less, explanatory value than the misinforming narratives that vastly outnumber the charts and graphs published by states and newspapers. When we visualize data about a pandemic, we should think about it as presenting key relationships, risks, and changes, not reporting out a virus or body count dressed up in a variety of formats. If there’s no clear story for the data, there are plenty of available opinions out there on the internet to fill that gap.

Future Tense is a partnership of Slate, New America, and Arizona State University that examines emerging technologies, public policy, and society.

Oct 21, 2020 ·

October 6, 2020

ASU Library data and analytics unit produces 'Misinfo Weekly' featuring professors Michael Simeone and Shawn Walker

2020 has been a year of extraordinary headlines. From a deadly pandemic to devastating wildfires to murder hornets, we’ve gotten used to doing double and even triple takes of the large, black type staring back at us through our screens and scrolling along the bottom of our televisions. Amid all of that, a rise in the spread and abundance of misinformation has made even the savviest among us stop and scratch our heads more than a few times before retweeting.

That’s the bad news. The good news is that we’re not alone, and there are ways to help make sense of a seemingly nonsensical time. That’s what Arizona State University professors Michael Simeone and Shawn Walker want to help listeners of their new podcast be able to do.

This summer, the duo embarked on "Misinfo Weekly," produced by the Unit for Data Science and Analytics at ASU Library, as they began to realize just how quickly misinformation was proliferating with more people at home, constantly tuned into their devices.

“We started realizing that this summer was shaping up to be a summer of misinformation,” Simeone said. “So we took this opportunity to kind of make our research as available as possible to people at a time when we hope it can help them make sense of things.”

Every week, the co-hosts provide some perspective on current events in misinformation, breaking down basic and advanced concepts using data-driven examples.

Past topics have included QAnon conspiracies, health disinformation and COVID-19. The 10th and most recent episode, “The Miserable Case of @sciencing_bi,” looks at how a former professor and science equity activist used a sock puppet Twitter account to gain access and prestige when interacting with academics and professionals online. Things got strange when she killed her fictional character and blamed it on a poor national and organizational response to the COVID-19 pandemic.

ASU Now spoke with Simeone and Walker via Zoom to learn more about what sets their podcast apart from others, whether this is really a unique phenomenon of our time, and how to be better at recognizing misinformation.

Editor's note: Responses have been edited for length and clarity.

Question: What about your podcast is different from others that deal with misinformation?

Simeone: We are definitely trying to make sure that people have a better awareness about misinformation and are better equipped to deal with misinformation, but our approach is more example-driven than rules-driven. So we don't imagine that there's a computer program that's going to solve all the misinformation problems. We don't imagine that there's going to be some kind of training that we can put people through, and after they learn this in 15 or 20 minutes, then they're going to be fine, because misinformation is always evolving. Every episode is an opportunity to learn some new things about misinformation, but we try to turn that into a robust sensibility that somebody can learn. We want it to become a capability for our listeners that they have that is very different from just knowing the 10 rules of spotting misinformation.

Walker: In the first episode, we spend the entire time talking about how difficult it is to navigate mis- and disinformation. Oftentimes it's presented to folks as very simple. Like, if you just follow these steps, then you can tell. But some sites look so professional that unless you really dig under the surface, you don't know where the information is really coming from. So we’re trying to talk about misinformation and identify it in ways that are a bit deeper and less superficial, and also acknowledge the difficulties of doing so and how this is all so complicated. And – shocker – this is not new. Since humans have been able to communicate, mis- and disinformation have been a problem. It’s true that nowadays, social media puts a twist on it. But a lot of folks still act like this is brand new, so we really wanted to present a more accurate picture that this is super complex.

Q: What is the difference between mis- and disinformation? Does it matter?

Walker: Normally the difference is intent. Oftentimes disinformation is described as something meant to intentionally deceive you. So disinformation would be someone knowingly sharing incorrect information on purpose versus someone unintentionally sharing incorrect information, which is misinformation. But one of the things we often talk about is that it really doesn't matter much whether something is mis- or disinformation when you’re talking about impact, because there are a lot of intended and unintended consequences.

Simeone: Right. Just being able to tell when someone is sharing incorrect information on purpose is not the most helpful perspective to have. A more helpful way to complement any kind of training regarding misinformation is to do an inventory of your own vulnerabilities, because whether it's intentional or unintentional, that's how it hits home with somebody. Just because someone is a climate activist and feels very strongly about the environment and they're very educated about it, that doesn't mean that they're immune to misinformation. This whole idea about confusing what is true with how you feel is just grist for the mill for misinformation. And everyone can be vulnerable to that.

Q: Why is there this assumption that misinformation is a new phenomenon and/or what about the social media age makes it unique?

Simeone: I think one of the reasons why people think that the social media age of misinformation is distinct is because of the scale and speed at which it travels. People said a lot of similar things around the invention of the printing press. You know, that information is coming at us too fast and there's just too much information out there. I don't want to have to adjudicate between whether we really are in a unique state or not, but I would say the scale and the speed and the availability of information nowadays is definitely remarkable. And then there’s the gatekeeping. A lot of the traditional gatekeeping mechanisms for publishing just aren't there anymore. It used to be, if you wanted to make something that you said available to the entire United States, there were a number of different checkpoints you had to go through. Now you need a phone and a Twitter account that may or may not be associated with your identity. So even though the history of deceiving people through communications technologies at a broad scale to achieve political ends can go far back in history, the kind of scope and dynamics of today are worth paying attention to.

Walker: Slightly related are the heuristics that we use and the heuristics that we were taught. As an odd sort of side note, I'm teaching an honors seminar about viral misinformation around the election and COVID, and none of my students said they had formal media literacy or information literacy training. I don't know about you, but in high school, I went into the library and we had this whole thing on it. But anyway, there are these heuristics that we have, but the issue is that a lot of these campaigns and a lot of the technology basically breaks them down so that they don't function anymore. Like Michael was saying about gatekeepers, you used to be able to ask yourself, ‘Does this look professional?’ It used to be very expensive to create misinformation. Now it's super cheap. With this switch to digital content, we have much more peer production, we have higher distribution. The vast majority of folks have a 4K camera in their pocket with free editing software. Ten years ago, that would have been impossible.

Q: So recognizing misinformation isn’t as easy as following a 10-step Buzzfeed guide. But do you have any advice for how to stay on top of it?

Simeone: Yes, but first, I feel honor bound to say that it's not like we figured it all out. The podcast is sort of our way of making sure that we engage with the subject as it's transforming, and that’s what we hope it does for our listeners. And there are some decent guidelines out there about making sure you've read what you retweet and basic things like that. But I feel like one thing that we try to emphasize is how important it is to really think through what your individual vulnerabilities might look like to misinformation. If someone wanted to lie to you, what would you want to believe? That's a really important thing to keep in mind. The other thing is to just try to maintain a certain level of humility. And I don't mean for that to sound preachy, but if something online looks so good that you think it's untrue, it probably is. So just approach that stuff with some trepidation; be humble and understand your weaknesses when it comes to deciding if something is fake or not.

Walker: We often talk about these places or opportunities for pause, especially when you feel emotional because of a piece of content. So when you see something and you get riled up, either positively or negatively. That’s another area of vulnerability. So when you feel that, you might want to just pause for a second, just to sort of reflect and think about why it might be making you feel that way.

Q: What else do you hope listeners take away from your podcast?

Walker: I would say besides a bit of laughing at us, hopefully there's some enjoyment in — well, we’re not trying to add levity to the subject, but I hope we make it feel less dire.

Simeone: Yeah, more approachable.

Walker: Yeah. And I hope that in offering varied cases and in the way that we discuss them, I hope that might make people feel less overwhelmed by it.

Simeone: And again, learning by example — because we look at example after example — it really helps you get a sensibility for these kinds of things. And I think that's what we want to do. We want to open up participation on this topic so that it's not just a certain platform's responsibility to censor it, or it's not just an algorithm's decision to block it in my browser. Because if this is something that's important to us, then it requires some kind of collective or broader sustained activity around it.

Walker: And I guess I also hope they walk away with a feeling and an understanding that this is hard, and that we also struggle with it. Too often it’s simplified in ways that make people feel stupid or inadequate and unequipped to deal with it; and that includes us, and this is something we study. So if we can share how we struggle with this, then hopefully that makes everyone else realize it’s OK to struggle with it. It’s OK to be tired. It’s OK to be frustrated. It’s OK to be confused. And we're hoping that we can help you think through that.

 

Emma Greguska

Reporter, ASU Now

(480) 965-9657 emma.greguska@asu.edu

Jul 07, 2020 ·

Gain knowledge and learn how to solve problems with data.

Now available through the ASU Library’s Unit for Data Science and Analytics, a free, new digital learning experience offers credentialing to students, faculty and staff of all levels and disciplines in need of an introductory course in data science.

Foundations of Data Science is a six-module ASU Canvas course, designed by data science experts, aimed at enhancing one’s understanding of using data as a research tool – everything from data visualization and machine learning to natural language processing and model selection. No programming is involved.

The course is available to self-enroll.

In addition, the Unit for Data Science and Analytics offers all-virtual open lab workshops beginning September 9.

Apr 21, 2020 ·

We are excited to share with you a COVID-19 data web browsing tool developed by the ASU Library’s Unit for Data Science and Analytics. The tool is aimed at helping researchers browse and process a vast collection of biomedical research related to COVID-19. The research is being collected and distributed by Kaggle, an online community of data scientists and machine learning practitioners.

Kaggle challenged its online collaborators, including our director, Michael Simeone, and master’s student Steve Jadav, to develop data solutions that will help medical professionals keep up with the rapid acceleration of coronavirus literature. The site uses a special search algorithm to retrieve results for search terms, similar to what search engines use to help make sure results capture the spirit and not just the letters of the search. It also uses a summarization routine that ranks sentences based on their information content and presents, in order, the ones that may be most informative.

“Good information right now is absolutely crucial.”

Let’s Put the Science in Data Science!