Research applications with web archives: Collaboration among Archives Unleashed cohorts

Published Feb. 16, 2022

Photo by Riccardo Annandale on Unsplash

Web archives present opportunities to explore and enhance our understanding of our digital cultural heritage. After a quarter-century of intentional collection and curation, they offer a breadth of coverage on topics fast outpacing traditional analogue material. Yet, web archives maintain a mysterious quality.

Over the lifespan of the Archives Unleashed project, we’ve so often heard the questions “How do you do web archival research?” and “Where do you start?” Even the first step of selecting a topic of study can be overwhelming. Initially, we ran a series of datathons to train scholars and provide a hands-on approach to working with archived web content. However, while datathon events contributed to skill-building, their short, intensive two-day model meant that projects lost momentum: teams were unable to go beyond exploratory discovery and missed the opportunity for comprehensive analysis or publication development.

In addition to our collaboration with members from Internet Archive’s Archive-It team to develop a platform for generating web archive derivatives (ARCH), our combined teams seek to support and facilitate research engagement with web archives.

Building off past successes of community building and engagement, the inaugural Archives Unleashed Cohort Program was launched in July 2021.

Five teams were selected to take part in a year-long collaboration and provided with additional resources, support and mentorship to conduct focused research while using web archives as scholarly objects.

These teams, spanning North America and Europe, have selected a wide range of topics to study, including crisis communication, health misinformation, pandemic discourse, comparative feminist media activism, and the development of online commenting systems. They highlight the innovative applications of web archive research within and adjacent to the digital humanities field.

It’s been truly inspiring to see the creative ways teams have approached web archive collections, from the research questions they are asking to the methodologies employed. As such, we are delighted to share highlights from cohort projects and celebrate their successes with collection curators and the broader #webarchiving and research communities.

Archives Unleashed Cohort Project Roundup

AWAC2 Analysing Web Archives of the COVID Crisis through the IIPC Novel Coronavirus dataset

Project Members: Valérie Schafer (University of Luxembourg), Frédéric Clavert (University of Luxembourg), Karin De Wild (Leiden University), Niels Brügger (Aarhus University), Susan Aasman (University of Groningen), Sophie Gebeil (University of Aix-Marseille)

A first distant reading of randomly selected French content (30% sample) using IRaMuTeQ (F. Clavert)

The AWAC2 (Analysing Web Archives of the COVID Crisis through the IIPC Novel Coronavirus Dataset) project was launched by a team from the WARCnet network. The aim is to investigate a transnational event through web archive collections by performing a distant reading of the IIPC COVID-19 web archive collection to shed light on actors, content types and interconnectivity.

The AWAC2 team has so far conducted a preliminary investigation of the corpus content using text-mining software such as IRaMuTeQ, Jupyter Notebooks, Python libraries such as gensim, and the network analysis application Gephi. The main aim of these analyses is to shed light on what is in the IIPC COVID-19 corpus. The use of simple distant reading techniques — such as crawl distribution by language by day or crawl distribution by domain by day — improved our understanding of the content of this fascinating corpus. One of the methodological questions that arose from the preliminary investigations is the issue of duplicates and versions, which is now being investigated.
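To make the idea of a crawl distribution concrete, here is a minimal sketch of counting captures by language by day and by domain by day. It is an illustration only, not the AWAC2 team's code: the record layout (crawl date, language code, domain) and the sample values are assumptions, and a real analysis would read these fields from the collection's derivative files.

```python
from collections import Counter

# Hypothetical records: (crawl date as YYYYMMDD, language code, domain).
# The layout and values are illustrative, not taken from the IIPC corpus.
records = [
    ("20200401", "fr", "example.fr"),
    ("20200401", "fr", "example.fr"),
    ("20200401", "en", "example.org"),
    ("20200402", "fr", "example.fr"),
]


def distribution(rows, key_index):
    """Count captures per (day, key), where key is the field at key_index."""
    return Counter((row[0], row[key_index]) for row in rows)


by_language = distribution(records, 1)  # captures per day and language
by_domain = distribution(records, 2)    # captures per day and domain

for (day, language), count in sorted(by_language.items()):
    print(day, language, count)
```

Counting the same (day, domain) pair more than once, as in the first two records, is also a quick way to spot the duplicates and versions question mentioned above.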

In parallel, our colleague (and the project initiator) Valérie Schafer published a post on the IIPC blog describing the project and asked the web archiving community to vote for a theme that we should investigate in more detail — going beyond the broad big data analyses that we will perform on the corpus. The community voted for us to take a closer look at *Women, Gender and COVID-19* within the collection (domestic violence, care, homeschooling, etc.). We will start exploring this topic soon.

Crisis Communication in the Niagara Region during the COVID-19 Pandemic

Project members: Tim Ribaric, David Sharron, Cal Murgu, Karen Louise Smith, Duncan Koerber (Brock University)

Tim Ribaric and team are exploring COVID-19 from a localized perspective by examining how organizations in the Niagara region have responded to provincial and municipal government COVID-19 mandates. When the pandemic began, the Archives at Brock University quickly started harvesting content from different sources in the Niagara region, including community groups, regional governments, local health units, and local news sources. The end result was the COVID 19 Niagara Archive. The project’s goals are to serve the pursuit of research and to contribute to teaching and learning applications within the University.

In terms of research, an analysis of the web archive content is currently underway using a framework known in media studies as crisis communication: an examination of the content and tone of the information that was shared in the community during the various inflection points of the crisis. This investigation is carried out in part with a search tool called SolrWayback, an easy-to-use platform for indexing and searching large-scale web archival collections. Additional strides are being made using computational notebooks hosted on the Google Colab environment. By carefully extracting full-text derivatives from the complete archive for targeted domains and analyzing them in notebooks, we have created several subsets of the initial dataset. With these datasets in hand, we can now proceed with a more targeted exploration.
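As a rough illustration of carving a subset out of a full-text derivative by targeted domain, the sketch below filters rows of a CSV file against a list of domains. The column names (crawl_date, domain, url, content), the sample rows, and the domain names are all assumptions made for the example, not the project's actual data or notebook code.

```python
import csv
import io

# Hypothetical stand-in for a full-text derivative file; the column names
# and rows are assumptions made for this example.
SAMPLE = """crawl_date,domain,url,content
20200320,region.example,https://region.example/a,Public health update
20200321,news.example,https://news.example/b,Local business closures
20200322,other.example,https://other.example/c,Unrelated page
"""

# Hypothetical list of domains the research question cares about.
TARGET_DOMAINS = {"region.example", "news.example"}


def subset_by_domain(handle, domains):
    """Return only the rows whose domain is in the targeted set."""
    reader = csv.DictReader(handle)
    return [row for row in reader if row["domain"] in domains]


subset = subset_by_domain(io.StringIO(SAMPLE), TARGET_DOMAINS)
for row in subset:
    print(row["crawl_date"], row["domain"], row["content"])
```

In a notebook, the in-memory sample would be replaced by an opened file, and each subset could be written back out for the more targeted exploration described above.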

With respect to teaching and learning, this project has created several curricular and co-curricular opportunities that introduce web archives, via computational notebooks, as an emergent source of information that can be used in varying contexts. Because derivatives of the full archive are collocated and presented in carefully crafted computational notebooks, learners can interact with different components of the data without needing to understand coding. We’ve introduced this style of learning in one of our co-investigator’s classrooms, as well as in unique co-curricular programming through the Brock Library. We believe that web archive collections represent a rich source of information that has the potential to inform various academic disciplines. This project is helping move this notion forward in the Brock community by demonstrating exactly what can be accomplished through examinations of web archives, both in research and in the classroom.

Regular updates are provided through the project website: https://brockdsl.github.io/archives_unleashed/

Mapping and tracking the development of online commenting systems on news websites between 1996–2021

Project members: Anne Helmond (University of Amsterdam/University of Siegen), Johannes Paßmann, Robert Jansma (University of Siegen), Luca Hammer (University of Siegen), Lisa Gerzen (Ruhr University Bochum). Contributors: Dave Wahl (University of Amsterdam), Steffen Reinhard (Ruhr University Bochum), and Theresa Schulte (University of Siegen).

This project traces the adaptation and distribution of online commenting technologies on news websites. Early accounts of online comments conceptualized them as blurring the lines between the producers and consumers of online texts as internet users could now “talk back” to mass media by providing feedback on the editorial content of professional media organizations (Bruns, 2005). Whilst these ideas emphasize the “democratizing” potential of online comments, more recently commenting is seen as a problematic practice that belongs to the “bottom half of the web” (Reagle, 2016). As online comments have become more toxic and have expanded in scale and scope they increasingly require moderation, leading many news websites to shut down their commenting sections (Gillespie, 2018; Wired, 2015).

In collaboration with the Internet Archive and the Archives Unleashed Project, we have created three longitudinal datasets consisting of the top 50 international, German, and Dutch archived news websites between 1996–2021. The new ARCH interface (Ruest et al., 2020) provides access to these newly created datasets in derivative formats and makes the (large) amount of data more manageable and turns it into a more accessible format for researchers. Using these derivatives we aim to trace the dynamics of news websites implementing, changing, and shutting down their commenting systems to understand the evolution and distribution of these technologies. Such changes in technologies also serve as key starting points for qualitative inquiries to understand commenting practices by interviewing the developers of commenting systems and webmasters implementing them (Paßmann 2021).

To detect the presence of a commenting system, we are searching for the code patterns of commenting technologies in the archived HTML code of these news websites (cf. Helmond, 2017; Nielsen, 2019; Owens and Thomas, 2019). We are creating a database of known commenting technologies and their recognition patterns. Using Jupyter Notebooks, the recognition patterns can be used to search for commenting technologies across our three archived news datasets or any other web archive. Initial findings show that not all online comments are well preserved; however, there are still patterns in the archived HTML code showing which commenting technologies websites used. A key contribution of this project is to operationalize an approach for the large-scale analysis of archived HTML code, which may also be used to study the evolution and distribution of other key web technologies such as content management systems, analytics, advertising, social buttons, and trackers.
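The pattern-matching approach described above can be sketched in a few lines of Python. This is an assumption-laden illustration, not the project's database or notebook: the two recognition patterns are simplified guesses at typical embed markers, and the sample HTML snippets are invented.

```python
import re

# Illustrative recognition patterns only. A real database would hold many
# more technologies and more carefully tested patterns.
PATTERNS = {
    "Disqus": re.compile(r"disqus_thread|disqus\.com/embed\.js", re.IGNORECASE),
    "Facebook Comments": re.compile(r"fb-comments", re.IGNORECASE),
}


def detect_commenting_systems(html):
    """Return the sorted names of technologies whose pattern occurs in the HTML."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(html))


# Invented snippets standing in for two archived captures of a news page.
page_with_comments = (
    '<div id="disqus_thread"></div>'
    '<script src="//example.disqus.com/embed.js"></script>'
)
page_without_comments = "<article><p>No comment section here.</p></article>"

print(detect_commenting_systems(page_with_comments))
print(detect_commenting_systems(page_without_comments))
```

Running the same function over every capture of a site, ordered by crawl date, would give a simple timeline of when a commenting technology appears, changes, or disappears.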

Scaffolding development and experimentation with Jupyter Notebook

Everything Old is New Again: A Comparative Analysis of Feminist Media Tactics between the 2nd- to 4th Waves

Project Members: Shana MacDonald (University of Waterloo), Aynur Kadir (University of Waterloo), Brianna Wiens (York University), Sid Heeg (University of Waterloo)

The Everything Old is New Again: A Comparative Analysis of Feminist Media Tactics project is run by the Feminist Think Tank (FTT) research lab at the University of Waterloo. The project aims to explore the shifting conversations and practices that have shaped feminist media over the last four decades. To explore these shifts and flows, the FTT team has been working with several collections within the Internet Archive.

The team is focusing on two specific research questions. The first uses a text-based analytic approach to determine the presence of a set of feminist ‘keywords’ across the collections. The goal is to see if the prominence of each word shifts in its usage over time in order to determine the different priorities of feminist media activism in different eras. The second exploration seeks to determine image content within the collections to see if particular images emerge as dominant across the collection. This will help to map and better understand the role of visual culture in furthering feminist principles at different moments in the historical trajectory of the study.
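For the first research question, one simple way to track how prominent a keyword is over time is to compute its share of all words in each year. The sketch below assumes that approach; the keywords, years, and texts are hypothetical placeholders rather than the FTT team's actual keyword list or data.

```python
import re
from collections import Counter, defaultdict

# Hypothetical keyword list and documents, for illustration only.
KEYWORDS = {"solidarity", "consent"}

documents = [
    (1998, "Solidarity meeting tonight. Solidarity forever."),
    (2018, "Consent matters. Solidarity and consent."),
]


def keyword_share_by_year(docs, keywords):
    """For each year, return each keyword's share of all tokens in that year."""
    counts = defaultdict(Counter)
    totals = Counter()
    for year, text in docs:
        tokens = re.findall(r"[a-z']+", text.lower())
        totals[year] += len(tokens)
        counts[year].update(token for token in tokens if token in keywords)
    return {
        year: {word: counts[year][word] / totals[year] for word in sorted(keywords)}
        for year in totals
        if totals[year]  # skip years with no tokens to avoid dividing by zero
    }


shares = keyword_share_by_year(documents, KEYWORDS)
for year in sorted(shares):
    print(year, shares[year])
```

Using shares rather than raw counts keeps years with more archived text from dominating the comparison.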

To this point, we’ve been spending time getting comfortable with the format of web archives (which is a new endeavour for our team!). It has helped us clarify research questions and shed light on the different kinds of insights offered by large web archival collections, which contrast with, but also complement, our smaller-scale FTT collection of feminist social media tactics (memes, hashtags, etc.) that we’ve been gathering over the last eight years. Working with the ARCH (Archive Research Compute Hub) platform gives us the chance to bring these different approaches to digitally-born artifacts together, guided by our commitment to feminist data and feminist archiving principles, which seek to examine the power embedded within our technologies, interfaces, platforms, and even in our research questions and frames.

Viral health misinformation from Geocities to COVID-19

Project members: Shawn Walker, Michael Simeone, Kristy Roschke, Anna Muldoon, Major Brown (Arizona State University)

Shawn Walker and colleagues from Arizona State University are exploring and comparing two case studies of health misinformation: HIV mis/disinformation circulating on Geocities from the mid-1990s to the early 2000s, and the role of official COVID-19 dashboards in COVID mis/disinformation.

Combining web archives and Twitter data, the team is focused on investigating the IIPC Novel Coronavirus (COVID-19) Archive-It collection, Geocities derivatives collection by Nick Ruest, and Twitter’s COVID-19 stream.

Jupyter notebooks and other computational methods have been critical for experimenting and establishing the scaffolding necessary to conduct large-scale analysis. Currently, the team is engaging with the plain text of the Geocities data, and exploring methods such as coding for keyword and sentiment analysis, and topic modelling to better understand text reuse within the larger corpus.
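As one possible starting point for looking at text reuse in plain-text derivatives, the sketch below compares two pages by the overlap of their three-word sequences (Jaccard similarity over word shingles). This is a deliberately simple technique chosen for illustration, not the team's method, and the sample sentences are invented.

```python
import re


def shingles(text, n=3):
    """Return the set of n-word sequences (shingles) in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(text_a, text_b, n=3):
    """Share of shingles common to both texts, between 0.0 and 1.0."""
    a, b = shingles(text_a, n), shingles(text_b, n)
    if not a and not b:
        return 0.0  # neither text is long enough to compare
    return len(a & b) / len(a | b)


# Invented sample texts standing in for plain-text pages.
page_a = "this miracle cure works in three days"
page_b = "this miracle cure works in three days guaranteed"
page_c = "weather report for tuesday afternoon"

print(jaccard(page_a, page_b))  # high overlap suggests reused text
print(jaccard(page_a, page_c))  # no shared shingles
```

Pairs of pages with a high score would then be candidates for closer reading, alongside the keyword, sentiment, and topic modelling work described above.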

This work contributes to our understanding of current and historical health misinformation as well as the connections between them, and will also garner insights into how historical narratives of health misinformation have been recycled and repurposed.
