Web archives and the Wayback Machine
Episode: S2 E2
Podcast published date:
Thu, 2/25 5:05PM • 56:07
SPEAKERS
Jessica Ogden, Shawn Walker, Michael Simeone
Michael Simeone 00:00
This is Misinfo Weekly, a somewhat weekly program about misinformation in our time. Misinfo Weekly is made by the Unit for Data Science and Analytics at Arizona State University Library.
Shawn Walker 00:11
Hello, today is February 5, 2021. And we have a guest today, Dr. Jessica Ogden. She's a senior research associate at the University of Bristol, an ESRC postdoctoral fellow and a fellow at the Bristol digital futures initiative. Hi, Jessica, how are you today?
Jessica Ogden 00:29
Hi, Shawn, I'm doing fine. Thanks.
Shawn Walker 00:32
We're super excited to have you. So today we're going to talk a bit about web archives: what web archives are, how they're created, lots of issues around that, as well as, you know, misinformation and its connection to web archives. I know we've talked a little bit in previous episodes about web archives, and sort of mentioned them a little bit. But Jessica, what is a web archive?
Jessica Ogden 00:52
Yeah, that's probably a great, great place to start. Maybe if we back up just a little bit and think about why web archives are a thing before we sort of discuss what they are. And I guess I would start with the fact that the web is really ephemeral. So from the kind of beginnings of the web, we know that there's no kind of inbuilt preservation mechanism. So things can be put online and taken offline for lots of different reasons, some purposeful and some not. So sometimes things just become obsolescent because your servers fall down, or somebody doesn't renew their domain name, or sometimes they're deliberately removed. And we see that, at least in the present climate, across different social media platforms, where either algorithms or moderators remove things from online, or indeed individuals themselves. You know, if you put something online and then think, "I actually don't want that online anymore," you end up deleting it. And so ephemerality manifests in lots of different ways online, and hence why sort of different web archiving initiatives have come in to start archiving. Well, not to start; they've been archiving the web for quite a long time, actually.
Michael Simeone 02:01
I think it might be easy to think that the internet is kind of a lossless format, in the sense that if you're a consumer, and you're using some high profile services, or really well established services, sometimes it can feel like you can go all the way back to the beginning of whatever kind of content you want. And it is still possible to find old webpages. A starting place, when we're talking about web archiving and misinformation, is this idea of how much of the web just evaporates.
Jessica Ogden 02:33
Yeah, that's a really, really good question. And as you were talking, I was thinking about the ways that certain platforms reinforce this idea that your data and things are there for all time. And the example can be, you know, how Facebook shows you the things that you posted 10 years ago, or whatever. And it almost falsely reinforces that notion that your posts will be there forever. But to answer your question about the longitudinal or the kind of lifespan of stuff online, I think, you know, most people refer to a study, which is actually quite old now, so sort of from the mid 2000s, which I think documented that the average lifespan of certain kinds of social media content at that time was around 100 days, which is actually really not that long. And I think there are definitely, you know, some other studies that are probably more up to date than that one now. But I would be curious to see how that kind of plays, because it plays differently on different platforms in different domains. Of course, that's always the bookmark that people sort of go to when they're trying to explain how much the web actually isn't online for that long.
Michael Simeone 03:36
Right? So the phenomenon is so big and dispersed, it's not like we can put a percentage on, you know, how much attrition we experience. But that lifespan figure, where you can actually clock the lifespan of a given item, and it lives less than, say, a set of tires on a car, causes us to reevaluate when people say, "Well, why do we need libraries anymore? We have Google."
Shawn Walker 04:01
I just screamed in my head a little bit when you said that, why do we need libraries anymore? Ah, we need libraries. So a question, Jess, would be then: who are the sort of players in these spaces? Because most people think, like, "Oh, I click on links in Google, so Google must be the one that's doing all of this." But you know, who's really doing this?
Jessica Ogden 04:17
Yeah, that's a great question. I think, more or less as a sort of extension of the digital preservation movement of the kind of late 80s and early 90s, what I would classify as sort of conventional memory institutions came into this space when they started thinking about what's called born digital information and media, and extended that into the web. So they started thinking first about the internet, and then of course, with the arrival of the web in the early 1990s, thinking about, well, how we need to be capturing this. This is the kind of new mode of communication. This is where history is taking place, as it were. So we need to have a mechanism for preserving that through time. And so you get institutions, mostly certain national libraries and archives, that start playing in this field, developing, and actually kind of reusing, the tools around web scraping to develop, you know, certain kinds of standards and extended tools to do this, and to develop a sort of field of practice around archival standards and things for the web. And one of the big players that also emerged in the late 90s was the Internet Archive, which I would assume at least some percentage of your listeners will have heard of. They're probably the biggest and kind of the most famous web archive, I would argue. Which is manifested through things like the Wayback Machine where you can go
Michael Simeone 05:32
Synonymous with the Wayback Machine. Yeah.
Jessica Ogden 05:34
Yep. So you can go there, and you can visit, because they've been archiving the web for a really long time. You know, there's a lot of content there where you can go and view the web of the past. But just to sort of round it off, and this is kind of where some of my research has been focused as well, is thinking about some of the other sort of disruptors in the field. So beyond libraries and archives (and the Internet Archive I would definitely classify as a disruptor within libraries and archives as well, which we could unpack if you're interested), there's also community groups, hacker organizations, volunteer collectives that get together online and archive the web in various ways and formats and have different kinds of interests in different parts of the web. And a lot of my research is kind of centered on that. But, so yeah, just to say there's lots of different players out there archiving the web for different reasons. Of course, industry as well, not to forget various companies and corporations who are out there archiving the web for their own purposes, too.
Michael Simeone 06:30
So rather than think about the internet as something that you record, like recording a TV show, where you just want to make sure that someone's recording it all, there are multiple parties with different interests and different philosophies who are recording. They can't record the entire internet, so they're recording different fractions of the internet, and all of those assembled together are a kind of aggregate archive, while they have their own individual archives as well.
Jessica Ogden 06:55
Yeah, that's right. It's a really, really important point too, because, again, some of my research is around this kind of selectivity of web archiving, and how that's determined by who's doing it and what they want to archive, really. And that's kind of a basic premise of, you know, archival theory and library practice: that, you know, selection decisions need to be made. But it manifests really interestingly on the web, because you get these national web archives who are only interested in archiving the UK web, for example. But what is the UK web? Most people don't think about the web in terms of their own country's domain; they think about the web as this kind of internet-connected international thing. And when we start thinking about it in terms of information ecosystems, and issues around misinformation, and other ways in which things travel online, we think about them as interconnected. We don't think about them as...as country specific. And so I think it raises really interesting questions and problems around both how you track those different forms of information that you might be interested in tracking as a scholar or researcher, but also how you record them for posterity for different kinds of uses.
Michael Simeone 08:02
So history is written by the person with the most capable servers.
Jessica Ogden 08:05
[Laughing] Yeah, yeah.
Shawn Walker 08:07
The biggest hard drive wins, basically.
Jessica Ogden 08:08
The most bandwidth. Yeah. That's right.
Shawn Walker 08:11
Well, there's also a time component, correct? Because when we talk about ephemerality, ephemerality is not just things disappearing; ephemerality also is that the web and content on the web is constantly changing. And so what role does time play? Because it's not like we can just archive a website once in 1998, and then, like, poof, we're done. We have it forever.
Jessica Ogden 08:31
Yeah, that's a really, really good point, Shawn. And I think it speaks to some of the work. So one of the case studies that I did as part of my PhD research was around looking at how activists came together after the election of Donald Trump to archive climate science data. And so what they tried to do was build a series of tools around how do you monitor changes online in specific contexts? So they were really interested in how do you monitor the government's activities and information around the climate science agenda, because, you know, they had real big expectations about them being kind of anti climate science, and that they would probably go online and remove all of the access to open government data, as well as, you know, educational resources around climate science. And so they were really concerned about how you prove that. So if you only took a snapshot here and there, it'd be really difficult to know when those things disappeared or when they were removed, and then infer some kind of intentionality around that. And so they had to archive quite a lot, and then develop what were called diffing tools to look at the changes on those pages, to see how the rhetoric around the climate science data was changing over time. And then they would issue these kind of rapid fire policy briefs and documents around what was happening there. And so, as just one example, you can see how that temporality component you're asking about, Shawn, is really important to how and what you can say about web archives, what you can say using them as resources. But it also becomes a really powerful tool for monitoring change over time as well. Which brings with it, you know, all kinds of risks depending on who you're monitoring, I guess.
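As a rough illustration of the diffing idea described above, here is a minimal Python sketch that compares two captures of the same page and lists what was removed and added. It is not the activists' actual tooling; the function name, the capture labels, and the page text are all invented for the example, and it uses only Python's standard `difflib` module.

```python
import difflib


def diff_snapshots(old_text, new_text, old_label, new_label):
    """Return unified-diff lines describing what changed between two captures."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


# Hypothetical captures of one page at two points in time (invented text).
capture_a = "Climate change\nHuman activity is the main cause.\nData portal"
capture_b = "Climate\nData portal"

changes = diff_snapshots(capture_a, capture_b, "capture-A", "capture-B")

# Lines starting with "-" were removed and "+" were added; the "---" and "+++"
# lines are the diff's own file headers, so they are skipped.
removed = [line[1:] for line in changes if line.startswith("-") and not line.startswith("---")]
added = [line[1:] for line in changes if line.startswith("+") and not line.startswith("+++")]

print("Removed:", removed)
print("Added:", added)
```

A comparison like this only says that text changed between two captures. Saying when it changed, as the speakers note, depends on how frequently the page was archived in between.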
Shawn Walker 10:11
So I think that brings up a good point you've been making about this idea of selection. There's also sort of this technical process of how does something get web archived? So can we break those two things down? So maybe first, can you sort of talk about, say you have a webpage, a URL, for example, that you want to archive: an overview of, like, the flow might be helpful for the public to think about. Because it's not like you have someone at, say, the Internet Archive; it's not like you submit a URL to the Internet Archive, and then, like, a pigeon types on a keyboard, goes into a web browser, clicks File, Save, and then drags that up to an Internet Archive server somewhere, right? So how does this magic work?
Michael Simeone 10:51
Well, yeah, actually, that's awesome, right? We don't want anyone to feel like it's magic. So even in the way we've been talking about it, when you say these groups, they just archive these pages. What does that mean? Right, like a golden rule is, anytime there's a complex technical process, people are likely to assume it's magical and takes a lot less time than it actually does. So what's going on?
Shawn Walker 11:10
And probably that it works a lot better than it actually does, too.
Jessica Ogden 11:12
[Laughing] Yeah. Yeah. Yeah, absolutely. I think those are all really good and important points. And I think it was actually one of the starting points for my research, because there's a lot of rhetoric, I guess you could say, around web archives, and, well, all technologies, as you're kind of alluding to, Michael, around sort of black boxing what these technologies are actually doing. And often when you have some level of automation, or semi automation, in those processes, that becomes even further black boxed. And it becomes this kind of magical tool for all of your needs for looking at the past web. And I think, in the case of web archives, in terms of how they work, generally speaking, there have now been a number of standards, in terms of formats and sort of what we could call best practices, I guess, around creating archival versions of the web. So when we say archival versions, or when I say it, rather, I take a really broad view on that. But I also try to sort of couch it in terms of preservation standards, I guess, because, you know, you could think about how if you went into your normal browser, you could just, you know, download a zip drive of a webpage. And that could maybe be considered an archive, which of course, it is a form of archive. But in this sense, the tools are really built to build on standards that are set by libraries and archives, so that they can be kind of preserved against various obsolescence in terms of technologies over time, and you can deposit them in these preservation repositories for so called, you know, all time, which we could unpack, but it's probably best to leave it there. Sorry, I'm rambling a little bit.
But I think in terms of the technologies, they're based on web crawlers, so sort of standard web crawlers where you can kind of semi automate a command line tool, for example, to use the HTTP protocol to go to a website and essentially scrape everything there and put it into this archival format, which then preserves all of the headers and things associated with that page when you made the call. And there's a whole series of sort of metadata associated with that. But there are problems with that approach. And Shawn, you know a lot about these; these problems are often associated with the quality of these archives.
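To make the crawl-and-record step described above a little more concrete, here is a minimal Python sketch of fetching one URL over HTTP and keeping the response headers, the payload, and some capture metadata together. It is an illustration only: the record layout, the function names, and the sample values are assumptions made for this example. Production web archiving tools typically write the standardized WARC format, which this plain dictionary does not attempt to reproduce.

```python
import hashlib
from datetime import datetime, timezone
from urllib.request import urlopen


def build_capture_record(url, status, headers, body, captured_at):
    """Bundle one HTTP response with metadata about when and how it was captured."""
    return {
        "url": url,
        "captured_at": captured_at.isoformat(),
        "status": status,
        # Note: dict() keeps only the last value if a header name repeats.
        "headers": dict(headers),
        "payload_sha256": hashlib.sha256(body).hexdigest(),
        "payload_length": len(body),
        "payload": body,
    }


def capture(url):
    """Fetch a single URL (needs network access) and return a capture record."""
    with urlopen(url, timeout=30) as response:
        body = response.read()
        return build_capture_record(
            url,
            response.status,
            response.headers.items(),
            body,
            datetime.now(timezone.utc),
        )


# Offline example with invented values, so nothing is fetched here.
record = build_capture_record(
    "http://example.org/",
    200,
    [("Content-Type", "text/html")],
    b"<html>hello</html>",
    datetime(2021, 2, 5, tzinfo=timezone.utc),
)
print(record["captured_at"], record["status"], record["payload_length"])
```

This captures only what the server returned for one request. Anything the page builds later in the browser, such as content loaded by JavaScript, is not in the record, which is the quality problem the conversation turns to next.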
Shawn Walker 13:28
Alright, so I guess, if we talk briefly about sort of a webpage, right, it's a complicated beast, not magical and mythical. So I'll stay away from magic, Michael. But it's a complicated beast. It's not just text, right? So a web page includes, what, images; some web pages are very dynamic, in that there's JavaScript or code running in the browser to present things to the user based on their interaction. So how does that cause issues, or not cause issues, with web archives?
Jessica Ogden 13:58
Yeah, well, I think a rule of thumb is, the more dynamic a page is, the more difficult it is to archive it and create a so-called sort of authentic representation of what that web page looked like. And I think, as you alluded to, JavaScript is like a major, major problem for most archives. Although I should say that, you know, there's been a lot of sort of rapid development in these tools, especially over the last few years, which is improving the quality of these representations. But I think one additional contextual thing to throw in is that, of course, we all interact with the web in different ways. Not everybody is sitting at a Chrome browser on a desktop viewing the web. We view the web in lots of different ways: on mobile applications, on different operating systems, on different types of browsers. And we experience the web in different ways. And web archives can't really capture all those different elements. That kind of contextual stuff that happens around the web is also there. You know, they're often a singular representation of something that the computer sees, that the command line sees, that it can interpret through the protocol, not necessarily what you might see as a human being on the screen when you interact with the web.
Michael Simeone 15:14
Yeah, that's an interesting point about what you're actually archiving. Because it's not like anyone has access to the server, and can get that level of fidelity of whatever is on there. All of that I think to lay people is going to sound painstaking and exhausting. And I bet that people are thinking, "I'm glad someone is doing this. Oh, that the internet is incredibly lossy, and ephemeral? Well, I'm glad someone's doing it." Right. But I think it's important to highlight exactly how much hard work this is. And it's not like we're making...there's a new TV drama about web archiving that came out, right. It's not like the kind of work that tends to be glamorized. And so there is so much consequential stuff going on in the world of web archiving. But it's something that I think people kind of repress in their heads as something that absolutely has to happen. And it's very complex.
Jessica Ogden 16:05
Yeah, I think it's interesting. It makes me think of, you know, there's been a lot of research recently around what could be called the maintenance work of data work: the maintenance and repair work that goes around information work, as well as work with data, and how that's reliant on these kind of large scale information infrastructures that underlie the web, but also sort of sit on top of it, to create these kind of levels of preservation. I think it's also important for listeners to see the other side. You know, I'm a great champion of libraries and archives. And I think they're absolutely doing, you know, the work that we need them to do, especially in this kind of current moment, and the kind of contemporary climate that we live in, politics and all the rest. But I do think that there are some really big questions, not just about what's happening in libraries and archives, but in these other collectives and community groups that are archiving, as well as at the kind of corporate and platform level. Web archiving isn't always an inherent good, I guess, would be the kind of question: you know, is it always inherently a good thing? You know, there are some real risks that come with archiving the web, especially at scale, both for individuals who may not want some of the things that they put online to be archived for all time in the Internet Archive, and for kind of specific groups and populations for which the risks are even higher. And I think we see that within some of the work that's been going on around protest movements, and how these web archives can be used as surveillance tools in certain communities to identify people who are exercising, you know, their rights, essentially, to protest and to free speech.
Now, of course, I think this is when it gets into really interesting territory, when we start thinking about what happened at the Capitol and the insurrection, and how similar tools are now being used to identify what happened there. And I think there's a whole host of issues to be unpacked there. But I guess I just want to sort of posit to the listeners and to you all that web archives aren't necessarily always a good thing. Potentially, you know, as complex as the processes that underlie the collection of web archives are, I think we should treat the use of those web archives with equal sort of critical scrutiny as well.
Shawn Walker 18:26
I mean, we also know that some folks use web archives for devious purposes, or to basically keep content online that might be problematic. Like, we know white supremacist groups, for example, create web pages, then ask the Internet Archive to save them. And then their website, of course, is taken down because it's hate speech. But the Internet Archive copy still lives on. And they circulate that. So it's also complicated, right, in, you know, how we choose what to keep, even after we've collected it. Right?
Jessica Ogden 18:53
Yeah, that's right. And I think, you know, and this is sort of where my attention has been turned recently, with what's happening on Parler, but I think elsewhere too, content moderation doesn't stop at the act of removing something from a platform. And I think we really need to be looking at how these different web archives, and it's not just the Internet Archive, it's also other open source and open access web archives, are being used across different platforms to circulate archival links to content that is no longer available online, which has been removed because it's deemed hate speech or whatever else. And I think there are, yeah, some really open questions about the sort of social processes that surround how these links and how archives are used, despite, you know, these kind of high level content moderation policies and algorithms that are being developed to remove things. How do we start understanding both how they're being used on different platforms, but then how we go about either intervening or mitigating that use in certain circumstances? Because it becomes extremely complex when you're dealing with web archives; they are meant to be there. You know, if we believe the Internet Archive's mission, it's to keep these archives online forever. And that's not to say, of course, that they don't remove things. They do have so called dark archives, they do remove things, you know, such as child exploitation images and things of that nature. And there were some big cases around images circulating around terrorism and live streams associated with that, which were archived and removed. But I think the picture becomes more complex when you insert these web archives into that ecosystem.
Michael Simeone 20:34
Yeah, so it sounds like what's going on all the time is contestation over what the history of the internet looks like and will look like. And events like the storming of the Capitol, events like when entire apps are taken offline, and events like having entire communities be deplatformed put a lot of pressure on the folks who are doing this kind of work. And that puts a lot of pressure on us collectively, in terms of the kind of history we're trying to stitch together about what is and was online.
Jessica Ogden 21:08
Yeah, that's right. I think I would love to speak with someone at the Internet Archive and some of these other organizations to understand. I think maybe, Shawn, you said this earlier, but it does seem that web archives are coming more into the public consciousness, I guess, to a certain extent. But I think maybe that recognition of both the need and the work that's happening there is, I think, rising to the fore given these issues, and especially the events of the last, it feels like forever, but the last couple of months, I would say, have really brought those issues to the fore. You know, I don't know if it's worth talking about some of those activities as well, in terms of, you know, how archivists were intervening collectively on Parler, but also some of the other activities around misinformation and the insurrection.
Michael Simeone 21:57
Yeah, I mean, I think this is a great time to wheel the conversation towards misinformation. I don't think it's obvious why web archiving matters so much to not just the study of misinformation, but also any attempts to kind of practically counteract misinformation. So even how to understand and behave around misinformation, web archiving matters to that effort. Why? What are some of the kind of like high level reasons that you think web archiving applies so squarely to folks interested in misinformation?
Jessica Ogden 22:33
Yeah, I mean, I think you sort of touched on it. And this is the most obvious reason, I guess: if you're studying misinformation, or you're studying communities who are propagating misinformation, you know, web archives arguably are a key resource and tool, or at least they could be, and arguably should be, in the study of misinformation. Because, you know, if deplatforming is being advocated as part of the interventions in misinformation, then if you're going to study those communities in some way, you still need to have an archive of what's happening there, even while you're arguing for deplatforming of certain communities. And so there's a tension there, I think, between the interventions and the research: if you don't do the archiving before they're removed from online, then often you don't have the sort of source community that you're aiming to study. But I think, additionally, there's been some recent work that's emerged around, you know, how web archives are actually also being used by people circulating mis and disinformation as a tool to circumvent a lot of these interventions. And I think it becomes very meta when you start talking about web archives, because you also need web archives to study how archives are being used in those contexts. Because if you haven't captured that in some way, you know, that content is often really ephemeral, then you often don't know how they're being used. And so the argument usually comes down to, well, we should archive all the things. Which is, you know, an easy argument to make, but not terribly realistic. Because, you know, the web is a really, really big place. And, you know, decisions have to be made about what to collect and what to archive.
So I think, you know, a lot of libraries and archives and community groups are really advocating for specialists and experts and researchers, especially in this space, to really start intervening to assist with the kind of creation of these collections and target some of that collection, so that we have those as research collections as well.
Shawn Walker 24:38
So, I think maybe to make those examples somewhat concrete, this idea of, you know, how archives may be being used to subvert other things: we can think of, you know, one example of the role of web archives is, you know, former President Donald Trump's Twitter account has been suspended, so those tweets are no longer accessible on twitter.com. But they are somewhat accessible via the Internet Archive and other web archives. Or we can think of QAnon, which often uses direct links into the Internet Archive, archive.is, and other web archives to refer to, say, news articles at specific moments in time before they've been updated, basically to prove their argument, or also to circumvent, say, labels that might be put on content saying, like, this is misinformation, this is problematic. They then sort of, as Michael was saying, circumvent that whole process by linking into the archives. So then they preserve their argument at a very specific moment in time, and can pick a copy of content that says kind of what they wanted it to say, and presents, like, a specific version of reality. Are there other specific examples you were thinking of when you mentioned that?
Jessica Ogden 25:48
Yeah, no, those are really good examples. There's also been, you know, some things emerging in our own work around, you know, and this is kind of hot off the press, so I'm not sure how much I can actually say about this, but there's a kind of social practice around sharing archived links as a sort of precursor to the expectation that a lot of that content will be removed. And I think, again, you know, that's a thing that we don't really fully understand. And I think it kind of speaks to other practices within other kind of hacker subcultures as well around self archiving. So this is a thing for people who sort of study these communities: there's a real push to do a lot of self archiving as a form of kind of creating your own history, or the history of your own community, by archiving it in real time. And often that manifests itself in different subreddits and different QAnon communities where they're doing it as a sort of reflective social practice, where they're creating those archives, as well as sharing that within their own profiles. So, you know, your top 10 favorite QAnon theories are already archived, and they're here, and they're linked in my profile, as just one kind of small example. But I think it's part of a larger social practice, I would argue, online, where these archives form a sort of central component to how these communities, you know, see information and share information and create specific kinds of community identity around that information sharing. And I think, you know, QAnon is probably the "best" contemporary example (the "best" is in air quotes) of how powerful that really is for a community's sort of sense of, yeah, the role of information and how that creates social identity, and the sharing of that information.
Michael Simeone 27:36
Yeah, I mean, this reminds me of some things that we saw when we looked at the Twitter misinformation about the wildfires this summer, when people were alleging that Antifa was starting the wildfires. And one thing that we did see is people sharing personal archives of newspaper articles that were still online. And this practice that you're speaking about is very interesting, right? It can be easy to say, "Oh, well, on the one hand archiving is the side of light, and misinformation is the side of dark. And that's just how it goes." But as soon as I archive something, I am signaling some kind of preemption. And preemption is a form of conflict. And so if I am going to say I am taking a preemptive strategy towards this information, then it already means I'm anticipating some kind of attack. And so these archives sometimes have the flavor, and I don't have a ton of exposure to them, but just looking through the ones that had to do with Twitter and these Antifa fires, alleged Antifa fires, sorry, you know, they had a lot of the flavor of the red yarn board of the conspiracy theorist. [It’s Always Sunny in Philadelphia sound bite].... These personal archives are kind of like the equivalent. But it shows, I think, that, you know, the practice of archiving, not just the archive itself, can actually further somebody being misinformed or disinforming a person. Right? It helps underscore that misinformation is more than just the content: the situation you create around the information can make it feel more devious if you're trying to spin a conspiracy theory. And so putting everything in an archive and handing it to somebody and saying, "Hey, I had to archive this because it's really dangerous" positions that content in a different way than if you were to just link it on NBCNews.com. That part's really interesting.
Shawn, this is also this part of the conversation reminds me of some of your interest in ephemerality and misinformation, and how important it is for those doing misinformation campaigns, either knowingly or unknowingly. How ephemerality is an important tool for them.
Shawn Walker 29:41
Right, and in some of our work that we've collaborated on, and also with Dr. Marco Bastos at University College Dublin, you know, we've seen that in some misinformation campaigns, within about 48 hours a large percentage of those links disappear. So content appears, we see multiple copies of an article kind of saying the same thing, but they look like they're from different legitimate news outlets, and then all that disappears within 48 hours. And so that means it's very difficult for us to go back and have a discussion about the record. We just have kind of what we remember, versus the actual record. And do you think, Jessica, that could potentially be a challenge for web archives: content that's not sticking around? I mean, we can think of, like, the White House website, or our university's website, you know, ASU's website; there are lots of archives of that. And that's not going anywhere; that website, it's always there. Every time the crawler says hello, ASU gives its new, innovative website to the crawler. But what happens with content that kind of appears, sticks around for a couple minutes, and then disappears? How do things actually get into the archive? You know, we have one way we talked about: people create their personal archives. But that's not the only way. And I imagine these archives aren't omnipotent, that they just know every webpage as soon as it pops up. How does this work?
Jessica Ogden 30:59
Yeah, well, it kind of depends on where you look. So there are lots of different mechanisms. And maybe I could speak to one group. So one of the groups that I also studied as part of my PhD research is called Archive Team. And they're sort of a collective, a self-described loose collective of hackers and archivists and librarians, and hobbyists, and writers and so-called loudmouths, who go out and archive the web, but they do it in lots of different ways. And so part of my study was to try and understand what those different mechanisms are, as you're kind of alluding to: the actual, you know, selection practices about how you actually monitor and collect these different sites. And they do it in lots of different ways, and they're really creative about it in several ways, I have to say. So, you know, one way they do it, which doesn't speak to your ephemerality question, which I can come back to, is they use different pages on Wikipedia. So they'll have various bots that monitor the addition of links on Wikipedia, especially around people and events, and use those kind of robots, which are sitting there waiting for new links to be added, as cues to then send those links to the infrastructure that they've built, to then send the crawler out, archive the site, and put it through this kind of complicated pipeline that packages it up in different ways, creates all the metadata around it, and then deposits it in the Internet Archive. They also have other tools; there are communities, so they sit in a sort of Internet Relay Chat room, and people come in and suggest, like, "Hey, I heard this platform is going down, you know, you might want to go and archive it."
They send requests to the robots, but they also have other, you know, mechanisms in terms of social media and other bullhorns to mobilize collectives of people to connect to the tools that they have, in order to essentially crowdsource what's being archived. And often that's driven by, you know, major platforms going offline. So when Vine announced that it was going offline, or Google Plus (we can remember Google Plus); they archived GeoCities way back in the day, back in 2009, as well; and again, they mobilized for Parler a few weeks ago. That's just to say that a lot of the things that they select to archive are driven by the idea that all things online are created equal. So they at least espouse not selecting what they think is important, or what they think is politically salient in that moment. But they try, at least, to kind of demonstrate that all things should be archived. And part of my research was unpacking how that actually manifested. And it turns out, you always make selection decisions, because, you know, as we said, you can't archive it all. And often that manifests itself in sort of, what parts of the platform do you archive? You know, when they were archiving Tumblr, you know, do you collect the notes and the comments that are attached to posts, or do you just collect the post? Well, it turns out, we only have three days to archive this before Tumblr is going to remove, you know, this whole community. So, you know, it's not just about what you think is valuable, but it's often also a relationship between that and the time available before you know something is going to be removed, either by the platform themselves or by other kinds of content moderation. So it's just kind of complex, I guess.
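[Illustrative sketch, not part of the recording. The link-monitoring step described above, where bots watch for links newly added to a Wikipedia page and pass them on to be crawled, might look roughly like the following. This is not Archive Team's actual code: the function names, the simplified URL pattern, and the in-memory queue standing in for a crawler are all assumptions made for illustration.]

```python
from __future__ import annotations

import re

# Simplified pattern for external links in wikitext (an assumption; real
# wikitext also has templates and <ref> tags that this does not parse).
URL_PATTERN = re.compile(r"https?://[^\s\]\|<>\"}]+")


def extract_links(wikitext: str) -> set[str]:
    """Return the set of external URLs mentioned in one revision's wikitext."""
    return set(URL_PATTERN.findall(wikitext))


def newly_added_links(old_revision: str, new_revision: str) -> list[str]:
    """Links present in the new revision but absent from the old one."""
    return sorted(extract_links(new_revision) - extract_links(old_revision))


def enqueue_for_archiving(queue: list[str], links: list[str]) -> list[str]:
    """Append unseen links to a crawl queue (hypothetical stand-in for a crawler)."""
    for link in links:
        if link not in queue:
            queue.append(link)
    return queue


# Hypothetical revisions of a page; the URLs are placeholders.
old_text = "See [http://example.org/a Source A] for details"
new_text = "See [http://example.org/a Source A] and [https://example.com/b?x=1 Source B] for details"

added = newly_added_links(old_text, new_text)
crawl_queue = enqueue_for_archiving([], added)
print(crawl_queue)
```

[In a real pipeline the queue would feed a crawler that writes archival files and metadata; here it only shows that the selection cue is "a link appeared", not a judgment about what is important.]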
Michael Simeone 34:35
Yeah, it sounds like a complex organizational and technological process. And it sounds like it's very likely that if I post some content on a platform, and then I delete it, or my account goes away, unless I have access to the servers of the service itself that might be holding on to my deleted post. There is no Internet Archive of what I've done. If I, if I tweet something, and it misleads 100 people, and then I delete that tweet, finding my post, again, if I've deleted it within 24 hours, and somebody doesn't say, "Hey, we need to archive that tweet," it's just gone.
Jessica Ogden 35:12
Yeah, I think that's, I mean, I think for most people, that's definitely the case. And I think it's also worth saying that it's also platform specific. So I think Facebook is notoriously under-archived. And part of this has to do with the technical complexities around archiving Facebook. But also because it is, as we know, a sort of walled garden and does everything possible to sort of keep people from archiving it as a so-called public platform. Because, you know, at least in its early days it was supposed to be based on friendship and networks, and you only show yourself to people that you know. We know that there's lots of holes within that rhetoric and how it's kind of technically manifested these days. But nevertheless, Facebook is very under-archived and doesn't necessarily exist in most web archives. Which, again, poses really big challenges for how we understand what's going on in some of these communities, and the propagation of information. And back to Shawn, your question about sort of other examples of what's happening, you know, there's been some work recently around COVID-19 and the vaccine rollout, and how web archives are being used, again, to circulate misinformation around the vaccine rollout, especially on Facebook, I've seen. And, you know, it becomes really difficult to understand exactly what's happening there unless you're archiving it. You know, there's, I think, still some really big challenges there for misinfo researchers.
Michael Simeone 36:47
Is it fair to say that Parler is kind of the opposite of Facebook, when it comes to the walled garden countermeasures to archiving? Or have I got it all wrong? But you know, Parler? Yeah, I mean, let me set it up this way, the, the archive of Parler. And I know like listeners are really interested in Parler. I think like our most popular podcast episode, was the one about Parler.
Shawn Walker 37:11
I want to get in real quick to say that I think the idea that there is "the archive of Parler" is interesting. As someone who's collecting a lot of Parler data around the Capitol, like, there isn't actually one archive of Parler; there are, like, hundreds of archives of Parler.
Michael Simeone 37:23
Right.
Shawn Walker 37:23
Some overlap. Some don't. I mean, like, this is a hot mess. And recreating that is, like, mind-blowingly difficult.
Michael Simeone 37:31
Okay, yeah. Okay, so given that, right, we've got all this data from Parler. That leads up to the attack on the Capitol in January 2021. But now we've got, this is a complex thing, it seems like in terms of the data was available, people collected the data. But what's, what's the interest here in making sure that there is an archive of Parler? Why not let it slide off into the abyss? Why do people care so much outside of the immediate application of finding and identifying people who committed criminal acts? So outside of, of that most immediate need, why is Parler so interesting to so many people from an archiving point of view?
Jessica Ogden 38:20
Yeah, I think that's a really great question. I guess my sort of starting point would be that, you know, this is sort of the historical, classic archives take on this, which is: archives are power. You know, they help us understand the world, but they also, you know, create a particular representation of the world and, I think, allow us to tell particular kinds of stories, right. And so in the case of Parler, and, you know, whether or not it's the opposite of Facebook in this case, I think, is a really apt point, in the sense that technically speaking, Parler was extremely unsophisticated. And that's both in its sort of technical apparatus, in terms of how the platform is set up, but also in terms of how it moderated content, you know, as has been revealed by people who've studied this platform far more than me. They weren't interested so much in moderating content, because, of course, certain kinds of content is money, right? So it creates traction and creates certain kinds of community that they were willing and interested in creating. And, again, I mean, my default is always to think as a researcher, so apologies, but, you know, as a researcher, if you want to study any of those things, and hadn't bothered to be archiving it since, you know, last...last October or beforehand, then, well, you're kind of out of luck at this stage if, you know, these archiving activities hadn't taken place. But also the lack of sophistication, both in terms of the security apparatus that surrounded Parler and in terms of, you know, services beginning to back away from Parler in the wake of the Capitol riots, that made it easier to archive. And that's kind of what's happening now: we're sort of unraveling how these archival activities happened and, as Shawn says, how all these different archives are manifested, right, which don't create one representation of what happened, but many.
And so there's a lot of work to be done to sort of understand how those things come together in order to understand the lead-up to the insurrection at the Capitol. I feel like I've meandered from your question a little bit. But it's just to say that without these archives, I think it would be extremely difficult. I mean, you know, there are other questions; we could go to other platforms, right. So there's been a lot of attention on Parler's role in the lead-in to the Capitol riots. But, you know, there were other activities happening on Gab and other social media platforms, which are even more difficult, in some ways, to archive. And so, yeah, there's a sort of tension, I guess.
Michael Simeone 40:53
I mean, it does feel like all these videos aren't all in one place, except for Parler. So there's that, and I think, you know, to the earlier conversation, just because you archive something and share it from that archive, you may have very different aims and goals, right. And I feel like some of this stuff is promotional, is evidence of something that is, you know, politically meaningful for some audiences. At the same time, you know, it also feels like this is an example of how, with misinformation, the rubber meets the road. How you go from a series of messages that promises information warfare, that actually used the language of information warfare, which is, you know, kind of paradoxical, but then turns into something that resembles slightly different modes of warfare. So it feels like the Parler data is a document in some ways, even just outside of, you know, looking at the events that happened or the documentation of the events. There's also a documentation of how people internalize certain ideas. And I think for many people, hearing some of the people at that riot repeat lines back, repeat talking points about, you know, taking your country back, right. Which is one of the biggest Q talking points, especially late, kind of late Q, if we can say, like, the last Q drop was November of 2020, I think. So late Q was, you know, "are you ready to take your country back," and then you hear people resonating with that message. There's a document to that, that I think, if we don't have an archive of these videos, we don't quite capture. It's easy to forget how we can go from an information campaign to something more dire.
Shawn Walker 42:33
But we also kind of when we think about this Parler archive as an example, there's a lot in Parler. In this archive, there is content that you're talking about of videos of the actual storming and occupation of the Capitol. But then there's also people's everyday posts, there's people that sort of got caught up and said things that they might regret, but weren't actively involved, so on and so forth. So what about the ethics of this information? Because, I imagine that users didn't sign up at Parler and expect that eventually their data might be in a web archive that's publicly accessible one day.
Jessica Ogden 43:09
Yeah, I think that's a really important question, especially because, at least, I think we know that, you know, some of the archival activities that happened actually were able to access all posts from, you know, the beginning of Parler through to the end of Parler. Which is almost, you know, never the case for these platforms, because they're usually at such a scale, with such complexities around how you archive them, that you don't often get access to everything. And I think, in that case, you know, of course, what we know is that people did delete posts on Parler, but in actual fact, they didn't really get deleted, depending on what API you used to archive them. So, you know, I think there's still questions around corporate responsibility around security and social media that are intertwined with these kind of ethics questions of, you know, should this stuff be archived in the first place? I mean, I think I tend to come down on the side of yes, as a researcher who sees, you know, given the things we've just been talking about, the value in understanding what's happening in these spaces online. But, you know, as a historical record, as well. But I do think that there are still questions about then how those archives are used. You know, who should have access to them, in what capacity, and for what types of use. And, you know, there's been a few events. I think, Shawn, we were at this kind of ethics and web archiving conference a couple of years ago in the US, you know, which brought a couple hundred people together to talk about these issues. So it's not that they're not being discussed, at least within the sort of small field of practitioners and researchers that are interested in it.
But personally, I would like to see a wider conversation happening, both within the internet research community and in the kind of public consciousness, around, you know, questions of: so what happens to these web archives, you know, who has control over them? Who says that they can have access to something that, you know, maybe I created and put online but no longer want to be online? And I should caveat that with saying that places like the Internet Archive, of course, have mechanisms where you can, you know, write to them and say, like, "Hey, you have my website, please take it down. I don't want it to be available anymore." You know, so not to make it out that they wouldn't do that, because they would, but I think there's still bigger questions around sort of opt in versus opt out, and what the risks are associated with being in a web archive.
Michael Simeone 45:37
Yeah, and at a high level here, there's something counterintuitive. You know, one could assume that, "Oh, I'm trying to understand and mitigate misinformation. That's a good thing. Oh, I'm trying to collect and store and keep an archive of the internet. That's a good thing." But it sounds like in the practice of both of those things, as it pertains to Parler, you can still do harm.
Jessica Ogden 45:59
I think that's right. And I think it becomes clearer when you hold up other examples that we know, you know, some of the work that the Documenting the Now project has been doing to, again, raise the consciousness of these questions of ethics and social media archives in particular. You know, where we've seen the example I raised earlier, especially around some of the Black Lives Matter protests, and, you know, going back to the protests that happened after Ferguson and beyond, where these web archives were really becoming a tool for state surveillance, and where people were being picked up based on their participation in these protests, regardless of whether or not they did anything illegal. And I think the Documenting the Now project has done a lot of really great advocacy work around how you ask questions about what you're collecting, to yourself as an organization, but also to your community groups and the people involved in those archives. And, you know, the people who are part of the archives should have a say in whether or not they want to be represented in these archives, and they've developed some really great tools and guidelines around that. So if anybody's interested, you should definitely check that out. I think there's probably still loads more to be done. I think Parler versus what's happened historically in the Black Lives Matter protests present two different but questionably similar cases. The questions that arise in both are nevertheless, you know, just as important, I guess, in terms of the risks to individuals and that kind of tension between those and state-based surveillance. I'm not sure.
Michael Simeone 47:33
Yeah, yeah, it sounds like the costs and risks and motivations of surveillance really matter. Because web archiving, it sounds like, can be thought of as a kind of surveillance.
Shawn Walker 47:43
And we can think about the issues of, you know, what's in the archive, what's not versus deleted content, right? We have the flip side of these protests, where sometimes famous political actors or public officials say, "Oh, don't archive my content," right. And companies could say don't archive content, in some ways to sort of further their goals or to lack some of that documentation. So in a sense, you know, all web archives are incomplete, right? And so how do we handle this idea of what we might see, say, for example, in the Internet Archive, we might look at a page and the page might look like, Oh, this is a complete page. But it's not necessarily a complete record of that page. Or it might miss content or other things like how, how do we handle that? How do we think about that? Especially within the context of maybe, I'm thinking of this as a misinformation researcher, that oftentimes people go back to the Internet Archive, and we're like, "hey, this link here, this shows what this was at this moment in time." But maybe not right?
Jessica Ogden 48:39
I mean, I think there's probably a lot of work to be done around literacy and how we engage with web archives, I guess, in the sense that I know, because I've been working in and around web archives, that you get these sort of temporal differences between things that are captured on the page. Which kind of sounds like a really sort of dry and technical aspect of poor-quality capture of a page. It just means that, you know, certain things on the page were captured at different times, and essentially the archival snapshot of that page ends up being a mishmash of things from different times and spaces. And so, you know, some of the work at Old Dominion University has really demonstrated some of this stuff around what they called zombie archives, where you get these representations that never were, essentially, so archives that never existed online, because there's sort of this mishmash of content from different times and places. And, albeit that's a technical problem that could be, you know, sussed out technically, and the Internet Archive in particular has done some work around trying to mitigate that in certain kinds of ways by putting flags on the page and reminders and things. But if you aren't aware of those things, and you don't see the flag, or you haven't opened the information page, or you're, you know, just a casual user of the thing, you more than likely will not understand that to be the case. And often, you know, these are used in ways that aren't necessarily compatible with the kind of resource that it is, in the sense that it never existed. And one good example of that, which I think you've seen in your work, Shawn, is these sort of COVID dashboards, where you go back and look at some of the dashboards that have been archived in the Wayback Machine and other web archives, so not just to pick on the Wayback Machine, but in other places as well.
They become a mishmash of data from yesteryear and today. Where you go and look at the dashboard, and they're, they're not accurate representations of COVID numbers in particular, you know, these kind of state dashboards. And so they really can't be used in the way that you might want to use them either as a researcher or as a casual public user of the resource. And so I think there's still a lot of work to be done to articulate those things to, to, you know, the wider public.
Shawn Walker 51:01
And I think, you know, Michael and I have done a lot of work around COVID dashboards and are continuing to, and the national COVID dashboard of Ireland is a great example of that. Because when the page loads, when it fully actually loads, you see, you know, the numbers at the top, and you see a specific date. So, like, January 30, for example: you see the January 30 numbers at the top, but then the graphs that are embedded in the page are from December. So that's, like, quite the contrast in data. But it's not obvious, because you'd have to look: every graph has a little timestamp at the bottom in little tiny letters. And if you don't look at that, you then have this page, like you said, that never existed. So for some dashboards, we have this historical record that never existed. And for other COVID dashboards, we have no historical record of what they looked like; all we have is this page that loads and then says "error, could not grab data." So, you know, it really could be complicated. And I think it's become more in vogue for journalists to use the Internet Archive and other web archives as tools, which I think is really important to stabilize content. But it's not that simple, like you're saying, as just including a link to the Internet Archive or to archive.is or the Library of Congress web archive or a national archive. It's not that simple, because these are really complex infrastructures. And, you know, at first glance, it might not be what it seems.
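[Illustrative sketch, not part of the recording. One rough way to spot the kind of mismatch described above, with top-of-page numbers from one date and embedded graphs from another, is to compare the capture timestamps in Wayback Machine replay URLs, which carry a 14-digit YYYYMMDDhhmmss stamp after "/web/". The function names, the one-day threshold, and the example.org URLs below are assumptions for illustration; this is not how the Internet Archive itself flags such pages, and the code does not fetch anything.]

```python
import re
from datetime import datetime

# A replay URL looks like .../web/20210130120000/<original URL>; embedded
# resources may add a modifier after the stamp, such as "im_" for images.
WAYBACK_TS = re.compile(r"/web/(\d{14})[a-z_]*/")


def capture_time(archived_url):
    """Parse the capture timestamp out of a Wayback-style replay URL."""
    match = WAYBACK_TS.search(archived_url)
    if match is None:
        raise ValueError(f"no 14-digit capture timestamp in {archived_url!r}")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def temporal_drift(page_url, resource_urls, max_days=1.0):
    """Return (url, drift_in_days) for resources captured far from the page.

    max_days is an arbitrary threshold chosen for this sketch.
    """
    page_time = capture_time(page_url)
    flagged = []
    for url in resource_urls:
        drift_days = abs((capture_time(url) - page_time).total_seconds()) / 86400
        if drift_days > max_days:
            flagged.append((url, round(drift_days, 1)))
    return flagged


# Hypothetical snapshot: page captured January 30, one chart from mid-December.
page = "https://web.archive.org/web/20210130120000/https://example.org/dashboard"
resources = [
    "https://web.archive.org/web/20210130120005im_/https://example.org/logo.png",
    "https://web.archive.org/web/20201215120000im_/https://example.org/chart.png",
]
flagged = temporal_drift(page, resources)
print(flagged)
```

[A check like this only covers resources whose capture time is visible in the URL; data loaded by scripts at replay time, as on many dashboards, would need a different approach.]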
Michael Simeone 52:23
And it's made even more maddening by any given chart or graph on a dashboard might not represent the most up to date data. You know, so for instance, many state dashboards update, have a rolling update.
Shawn Walker 52:37
It's just a complicated space. And I think, you know, in talking to Jess, what's been really clear is that web archives are an amazing resource. But they're a complicated resource, a complicated infrastructure, and we're not archiving everything. They're not what they seem at first. But they're also, you know, this amazing tool. To think that we could go back to 1998 and look at the White House website, so we can actually see what it looked like, is just this sort of magical thing, to be able to be transported back in time. But then, like, whoa, this whole thing is actually a pretty complicated living being, to say the least.
Jessica Ogden 53:17
I would maybe pull the conversation back to what I was saying earlier, in the sense that, you know, librarians and archivists are your friends. So, you know, in the event that you do have questions about these sources, I found in my own research they're often the only people that can help answer these questions around, you know, what's happening here with the data, what can I actually use it for, what are the best ways, you know, to get the data out of these complex infrastructures. It gets complicated, of course, when we're dealing with these sort of distributed collectives of online hackers and suchlike, but when we're talking about using, you know, in my own work, I've used archives at the UK Web Archive, being in deep discussion, with lots of hand-holding, with the librarians and the archivists and the web archivists who work there, you know, is essential to getting and using those resources in ways that they, you know, are useful for, and not, you know, exacerbating some of the complexities here by using them in a wrong way, or in ways that they shouldn't be used. And so just a plug for our librarians and archivists out there, because they are, you know, as they always have been, I think, really, you know, integral to how we go about using them as researchers.
Michael Simeone 54:33
Yeah, I think it really underscores the idea that misinformation loves amnesia, and that if we can't have a conversation about history, and kind of collectively reason through our recent past or not so recent past, the capabilities of misinformation increase, potential for misinformation increases, and we can't have that kind of collective kind of thinking through of history without archives.
Shawn Walker 54:59
I think that's an excellent place for us to wrap up here. Jess, do you have any final thoughts you want to add as we head out?
Jessica Ogden 55:05
Thank you both for the conversation. I think it's really brought to the fore for me, you know, some of the thoughts that I've been having about the intersection of web archives and ephemerality online, and information ecosystems. And I think, you know, there's so much more work to be done around what's happening in the current climate around misinformation and, and the role that these archives are playing in that both in the circulation and in the stuff that happens after the various interventions that are, that are happening online. I just think, you know, there's a really fertile ground to continue the conversation. Really, yeah. Thanks a lot for having me here today.
Shawn Walker 55:46
This has been a lot of fun. Well, everyone, thanks for joining us this week. Stay safe and be well.
Michael Simeone 55:51
For questions or comments, use the email address datascience@asu.edu. And to check out more about what we're doing, try library.asu.edu/data.