UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

The UK Web Archive, the Library's premier resource of archived UK websites

14 posts categorized "Digital scholarship"

25 November 2024

Datasheets for Web Archives Toolkit is now live

By Helena Byrne, Curator of Web Archives

Datasheets for Web Archives Toolkit Banner with authour names and logos
Datasheets for Web Archives Toolkit

Since autumn 2022, Emily Maemura from the University of Illinois and Helena Byrne from the UK Web Archive team at the British Library have been exploring how the Datasheets for Datasets framework, devised for machine learning by Gebru et. al, could be applied to web archives. In order to explore the research question “can we use datasheets to describe the provenance of web archives, supporting research uses?” a series of workshops were organised in 2023. 

These workshops included a card sorting exercise with expertise in web archives as well as general information management. After the card sorting exercise there was a general discussion about using this framework to describe web archive collections.

These workshops formed the core of the guidance documentation published in the Datasheets for Web Archives Toolkit published in the British Library Research Repository.

The Toolkit

This Toolkit provides information on the creation of datasheets for web archives datasets. The datasheet concept is based on past work from Gebru et al. at Microsoft Research. The datasheet template and samples here were developed through a series of workshops with web archives curators, information professionals, and researchers during Spring and Summer 2023. The toolkit is composed of several parts including templates, examples, and guidance documents. Documents in the toolkit are available at a single DOI (https://doi.org/10.22020/rq8z-r112) and include:

  1. Toolkit Overview 
  2. Datasheets Question Guide
  3. Datasheet Blank Template

Implementation 

The UK Web Archive has implemented this framework to publish data sets from its curation software the W3 Annotation Curation Tool (ACT). These data sets are available to view in the UK Web Archive: Data folder in the British Library Research Repository. So far there are just a few collections published but this will grow over the coming months.

18 September 2024

Creating and Sharing Collection Datasets from the UK Web Archive

By Carlos Lelkes-Rarugal, Assistant Web Archivist

We have data, lots and lots of data, which is of unique importance to researchers, but presents significant challenges for those wanting to interact with it. As our holdings grow by terabytes each month, this creates significant hurdles for the UK Web Archive team who are tasked with organising the data and for researchers who wish to access it. With the scale and complexity of the data, how can one first begin to comprehend what it is that they are dealing with and understand how the collection came into being? 

This challenge is not unique to digital humanities. It is a common issue in any field dealing with vast amounts of data. A recent special report on the skills required by researchers working with web archives was produced by the Web ARChive studies network (WARCnet). This report, based on the Web Archive Research Skills and Tools Survey (WARST), provides valuable insights and can be accessed here: WARCnet Special Report - An overview of Skills, Tools & Knowledge Ecologies in Web Archive Research.

At the UK Web Archive, legal and technical restrictions dictate how we can collect, store and provide access to the data. To enhance researcher engagement, Helena Byrne, Curator of Web Archives at the British Library, and Emily Maemura, Assistant Professor at the School of Information Sciences at the University of Illinois Urbana-Champaign, have been collaborating to explore how and which types of datasets can be published. Their efforts include developing options that would enable users to programmatically examine the metadata of the UK Web Archive collections.

Thematic collections and our metadata

To understand this rich metadata, we first have to examine how it is created and where it is held..

Since 2005 we have used a number of applications, systems, and tools to enable us to curate websites. The most recent being the Annotation and Curation Tool (ACT), which enables authenticated users, mainly curators and archivists, to create metadata that define and describe targeted websites. The ACT tool also serves  to help users build collections around topics and themes, such as the UEFA Women's Euro England 2022. To build collections, ACT users first input basic metadata to build a record around a website, including information such as website URLs, descriptions, titles, and crawl frequency. With this basic ACT record describing a website, additional metadata can be added, for example metadata that is used to assign a website record to a collection. One of the great features of ACT is its extensibility, allowing us, for instance, to create new collections.

These collections, which are based around a theme or an event, give us the ability to highlight archived content. The UK Web Archive holds millions of archived websites, many of which may be unknown or rarely viewed, and so to help showcase a fraction of our holdings, we build these collections which draw on the expertise of both internal and external partners.

Exporting metadata as CSV and JSON files

That’s how we create the metadata, but how is it stored? ACT  is a web application and the metadata created through it is stored in a Postgres relational database, allowing authenticated users to input metadata in accordance to the fields within ACT. As the Assistant Web Archivist, I was given the task to extract the metadata from the database, exporting each selected collection as a CSV and JSON file. To get to that stage, the Curatorial team first had to decide which fields were to be exported. 

The ACT database is quite complex, in that there are 50+ tables which need to be considered. To enable local analysis of the database, a static copy is loaded into a database administration application, in this case, DBeaver. Using the free-to-use tool, I was able to create entity relationship diagrams of the tables and provide an extensive list of fields to the curators so that they could determine which fields are the most appropriate to export.

I then worked on a refined version of the list of fields, running a script for the designated Collection and pulling out specific metadata to be exported. To extract the fields and the metadata into an exportable format, I created an SQL (Structured Query Language) script which can be used to export results in both JSON and/or CSV: 

Select

taxonomy.parent_id as "Higher Level Collection",

collection_target.collection_id as "Collection ID",

taxonomy.name as "Collection or Subsection Name",

CASE

     WHEN collection_target.collection_id = 4278 THEN 'Main Collection'

     ELSE 'Subsection'

END AS "Main Collection or Subsection",

target.created_at as "Date Created",

target.id as"Record ID",

field_url.url as "Primary Seed",

target.title as "Title of Target",

target.description as "Description",

target.language as "Language",

target.license_status as "Licence Status",

target.no_ld_criteria_met as "LD Criteria",

target.organisation_id as "Institution ID",

target.updated_at as "Updated",

target.depth as "Depth",

target.scope as "Scope",

target.ignore_robots_txt as "Robots.txt",

target.crawl_frequency as "Crawl Frequency",

target.crawl_start_date as "Crawl Start Date",

target.crawl_end_date as "Crawl End Date"

From

collection_target

Inner Join target On collection_target.target_id = target.id

Left Join taxonomy On collection_target.collection_id = taxonomy.id

Left Join organisation On target.organisation_id = organisation.id

Inner Join field_url On field_url.target_id = target.id

Where

collection_target.collection_id in (4278, 4279, 4280, 4281, 4282, 4283, 4284) And

(field_url.position Is Null Or field_url.position In (0))

JSON Example
JSON output example for the Women’s Euro Collection

Accessing and using the data

The published metadata is available from the BL Research Repository within the UK Web Archive section, in the folder “UK Web Archive: Data”. Each dataset includes the metadata seed list in both CSV and JSON formats, a data dictionary and a datasheet which gives provenance information about how the dataset was created as well as a data dictionary that defines each of the data fields. The first collections selected for publication were:

  1. Indian Ocean Tsunami December 2004 (January-March 2005) [https://doi.org/10.23636/sgkz-g054]
  2. Blogs (2005 onwards) [https://doi.org/10.23636/ec9m-nj89] 
  3. UEFA Women's Euro England 2022 (June-October 2022) [https://doi.org/10.23636/amm7-4y46] 

22 May 2024

Reflections on the IIPC Early Scholars Spring School on Web Archives 2024

By Cameron Huggett, PhD Student (CDP), British Library/Teesside University

IIPC-2024-Paris-Early-Scholars-Summer-School-banner
IIPC Early Scholars Spring School on Web Archives banner

My name is Cameron, and I am currently undertaking an AHRC funded Collaborative Doctoral Partnership (CDP) project, between the British Library and Teesside University. My research centres on racial discourses within association football fanzines and e-zines from c.1975 to the present, and aims to examine the broader connections between football fandom, race and identity. 

I attended the Early Scholars Spring School on Web Archives, prior to commencement of the conference, which allowed me to knowledge share with colleagues from a number of different countries, institutions and disciplines, offering new perspectives on my own research. Within this school, I was fortunate enough to be able to deliver a short lighting talk, outlining my own use of web archiving within my research into the history of racial discourses within football fanzines. This generated an engaging discussion around my methodologies and led me to reflect upon how quantitative techniques can be better adopted within historical research practices.

I also particularly enjoyed discovering more about the collections of the Bibliothèque Nationale de France (BNF) and Institut National de L'audiovisuel (INA). The scope of the collections and innovative user interfaces were particularly impressive. For example, INA had created a programme that allowed the user to view a collection item, such as an election debate broadcast, alongside archived tweets relating to event in real time.

 My primary takeaway was how web archives can be innovatively employed to record the breadth and depth of online communities and discourses, as well as supplement more traditional sources within a historian’s research framework.  

24 January 2024

Exploring Alternative Access: Making the Most of Web Archives During UK Web Archive Downtime

Nicola Bingham, Lead Curator of Web Archiving, British Library

The British Library is continuing to experience disruption following a cyber-attack and are working hard to restore services. Disruption to some services is, however, expected to persist for several months. In the meantime, our buildings are open and we’ve released a searchable online version of our main catalogue, which contains records of the majority of our printed collections as well as some freely available online resources. Our reference team are on hand to answer queries, advise on collection item availability and help with other ways to complete your work. Please email [email protected] or find out more. The disruption is affecting our website, online systems and services. Please see our temporary website for up-to-date information.

Despite the disruption to access to the UK Web Archive, we continue to crawl or acquire copies of websites, as well as add new websites to our acquisition process which is being undertaken with Amazon Web Services in the Cloud, ensuring that the UK Web Archive collection is updated and preserved as usual.

We appreciate that for regular users of the UK Web Archive, the temporary unavailability of this valuable resource is inconvenient and disruptive. There exist several alternative openly accessible web archives that can serve as sources of information while the UK Web Archive is offline.

Other Openly Accessible Web Archives

Internet Archive: Known as the largest and most comprehensive web archive globally, it includes the famous Wayback Machine and boasts an extensive collection of archived web pages.

Understanding the Differences

While the Internet Archive captures a broad spectrum of global content, the UK Web Archive focuses specifically on the UK web. The UK Web Archive offers comprehensive crawls, curated collections, and secondary datasets for research. However, access is primarily restricted to legal deposit libraries, with some resources available openly.

The Internet Archive allows remote access to archived websites, but its search functionalities and scope differ from the UK Web Archive.

Memento Time Travel: This innovative platform operates under the Memento protocol, allowing users to view archived websites across various openly accessible web archives. It acts as a bridge, enabling access to past versions of web resources stored in archives such as the Internet Archive, Archive-It, UK Web Archive, archive.today, GitHub, and more. While it displays links to Mementos, it doesn’t retain the content itself.

Portuguese Web Archive (Arquivo.pt): Developed by the Portuguese Foundation for Science and Technology, this archive aims to preserve and grant access to the Portuguese web domain and its contents. It also archives a significant amount of European Union and transnational content. It's a valuable resource for preserving the digital heritage of Portugal and contributing to the preservation of European and Portuguese-language online information.

UK Government Web Archive: An openly accessible archive preserving UK central government information, encompassing videos, tweets, images, and websites dating from 1996 to the present day.

UK Parliament Web Archive: This openly accessible archive covers parliamentary websites and social media content from 2009 to the present day.

National Records of Scotland Web Archive: Offering open access, this archive allows browsing and searching of websites related to Scotland’s people and history.

Seeking Information and Resources While the UK Web Archive is offline, the UK Web Archive blog remains accessible and serves as a useful source of information about the archive.

Additionally, although the UK Web Archive itself might be temporarily inaccessible, its information pages have been preserved by the Internet Archive, accessible [here] (https://web.archive.org/web/20240000000000*/https://www.webarchive.org.uk).

For those keen on delving deeper, the British Library Research Repository houses supporting documents related to the UK Web Archive, such as collection scoping documents, annual reports, statistics, and research publications. The repository can be accessed [here](https://doi.org/10.23636/hj5v-3c07).

While the UK Web Archive takes a brief hiatus, we hope these alternative resources help. And perhaps embracing these other openly accessible archives might even unveil new avenues and perspectives for exploration.

While we work hard to recover all our online services you can find regular updates on progress published on our Knowledge Matters blog.

28 July 2023

UK-Ireland Digital Humanities Association Launch Event Report from the British Library

By Helena Byrne, Curator of Web Archives, Frankie Perry, Music Manuscripts and Archives Cataloguer and Stella Wisdom, Digital Curator for Contemporary British Collections

UK-Ireland Digital Humanities Association Launch Event Banner with event details
UK-Ireland Digital Humanities Association Launch Event Banner

The First Annual Event for the UK-Ireland Digital Humanities Association took place  on 29th and 30th June 2023 at Senate House, University of London as well as online. The Association “aims to build a collaborative vision for the field, and create new and sustainable long-term partnerships in alignment with the international community”. The programme set across one and half days covered a wide variety of topics and included an opportunity for the Community Interest Groups to meet up. 

The British Library was involved in four presentations either as an individual presentation or as part of a collaborative project. In this blog post we hear back from the British Library colleagues who attended.

Helena Byrne, Curator of Web Archives

I was involved in two collaborative presentations with Sharon Healy (Maynooth University) and Juan-José Boté-Vericad (Universitat de Barcelona). Our first presentation was a lightning talk on day one called 'Finding Web Archives under the ‘Big Tent’ of DH: A Case Study of Ireland and the UK'. This presented one element of a forthcoming chapter in a WARCnet edited collection on web archiving. This presentation reviewed postgraduate courses for the provision of web archiving in information management and digital humanities courses in Britain and Ireland. Our second presentation was part of Panel #2 on day two called 'The Potential of a Reborn Digital Archival Edition for Collating a Corpus of Archived Web Materials'. This presentation outlined a methodology for researchers without coding skills to select, collate and analyse a corpus of archived websites. 

The highlight for me was Panel #3, especially the presentation 'Towards a Critical Black Digital Humanities: A Critical Librarian’s Response' by Naomi L.A Smith (University of West London). This presentation and the discussion that followed highlighted some of the challenges as well as some of the positive action steps that can be taken to ensure digital humanities research is more inclusive. 

Frankie Perry, Postdoctoral Research Assistant, InterMusE project, University of York / Music Manuscripts and Archives Cataloguer, British Library

I gave a paper with Prof. Rachel Cowgill (University of York) who is Principal Investigator on the InterMusE project – a collaborative venture between musicologists, computer scientists, and archive and library specialists funded by the AHRC’s UK-US New Directions for Digital Scholarship in Cultural Institutions programme. The British Library is an institutional partner, with Dr Rupert Ridgewell (Lead Curator, Printed Music) as Co-Investigator; the universities of Swansea and Illinois at Urbana-Champagne are further partners, and we’re also working with the University of Waikato. In our paper, we introduced the complexities of sourcing, digitising, and piecing together ephemera relating to historical musical events (eg. concert programmes, flyers, newspaper reviews), using as our case study materials relating to the British Music Society (1918-1933) and its regional centres and branches. We showed the interface of the digital archive built for the project, which uses a combination of the Greenstone Digital Library system, the Mirador Annotation Viewer, and the SimpleAnnotationServer to make materials browsable, searchable, and interactive for musicologists and community users alike.

I really enjoyed the event and the snapshot it provided into current digital humanities research and techniques. I especially enjoyed a paper by Orla Delaney (Cambridge) on 'Database ethnography and the museum object record', and one by Lisa Griffith (Digital Repository of Ireland) and Laura Molloy (CODATA) titled 'Pathways to collaboration – creating and sharing GLAM image collections as data'.

Stella Wisdom, Digital Curator for Contemporary British Collections

My lightning talk 'Collaborating to Curate and Exhibit Complex Digital Literature' reflected on the cooperation between curators, researchers, experimental writers and creative practitioners to plan and produce the British Library’s Digital Storytelling exhibition (2 June 2023 - 15 October 2023). A hands-on display, which explores the ways that digital innovations have transformed and enhanced our narrative experiences. Showcasing eleven examples of electronic literature that invite readers to become a part of the story themselves, through interactive narratives that respond to user input, reading experiences influenced and personalised by data feeds, and works that draw from multiple platforms and audience participation to create immersive story worlds. Preparing and in some cases modifying these interactive works to display them in a public gallery has only been possible through practical collaborations between Library staff with the writers and games studios who created these digital stories. I shared some insights from my experience of this co-curation work and encouraged attendees to visit the exhibition.

It was a pleasure to meet a number of people in real life who I had only previously spoken with online. A personal highlight was hearing Reham Hosny from the University of Cambridge and Minia University speak about 'DH and E-Lit Communities: Intersectional Perspectives'. In the refreshment breaks at this event I chatted with Reham about her novel, Al-Barrah (The Announcer) and she demonstrated to me how both augmented reality and hologram technologies work with the printed book to immerse readers in this thought provoking narrative.

04 November 2020

Curating culturally themed collections online: The Russia in the UK Collection, UK Web Archive

By Hannah Connell, Collaborative PhD Student, King’s College London; British Library

Title slide from Hannah's presentation with a London Underground map in Russian

 

I spoke about my position as a curator for the Russia in the UK curated collection as part of the recent Engaging with Web Archives conference (EWA), which was held online from the 21st-22nd of September 2020. This conference reflected the breadth of the web archiving community, bringing together speakers from researchers to librarians, as well as curators and web archiving teams from many different countries.

As always, it was inspiring to participate in such a welcoming event. Even online, the conference retained the collaborative atmosphere which has marked my experience of research in web archiving, allowing new researchers to interact with more experienced practitioners and encouraging questions and conversations between researchers, users and archivists.

The researcher-curated collection, Russia in the UK, is part of the UK Web Archive (UKWA). I was particularly pleased to have had the opportunity to present this curated collection, a resource on the Russian-speaking community in the UK, which was first started in November 2017. Such collections play an important role in making the wide range of material preserved in the UKWA more visible to researchers.

Curators are important to the preservation work of the UKWA. Curated collections are collected manually by curators and researchers with specialist knowledge in their field. The role of a curator in creating a UKWA collection involves identifying relevant websites to be included in a collection, and recording the metadata for these websites, including the translation and transliteration of titles and descriptions in other languages.

This collection is valuable both as a resource for further research, and as a means of questioning research practices. It is not possible to capture everything on the web, and collection curators ensure that a representative sample of websites for each thematic collection are selected. The practice of creating and maintaining a collection such as the Russia in the UK  ultimately influences the shape of the collection and the online representation of the diasporic community it will come to reflect. As such, it is important for researchers and users to understand the decisions taken by curators in selecting and capturing websites.

My paper for EWA focused on the creation of a curation guide for curators of new curated collections. This  draws on the ongoing process of curating the Russia in the UK collection, documenting both the provenance of this special collection and reflecting on this process as a model for future collections.  

In documenting the creation of this collection, I hope to enable future researchers to explore and contribute to this record of the online activity of the Russian diaspora in the UK, and to question and develop the curatorial and research practices behind the curation of collections.

You can watch Hannah Connell’s presentation on the EWA YouTube channel.

 

02 November 2020

Digital archaeology in the web of links: reconstructing a late-90s web sphere

By Dr. Peter Webster, Independent Scholar, Historian and Consultant

Fiber cables for the internet

 

The historian of the late 1990s has a problem. The vast bulk of content from the period is no longer on the live web; there are few, if any, indications of what has been lost – no inventory of the 1990s web against which to check. Of the content that was captured by the Internet Archive (more or less the only archive of the Anglophone web of the period), only a superficial layer is exposed to full-text search, and the bulk may only be retrieved by a search for the URL. We do not know what was never archived, and in the archive it is difficult to find what we might want, since there is no means of knowing the URL of a lost resource. Sometimes we need, then, to understand the archived web using only the technical data about itself that it can be made to disclose.

Niels Brügger has defined a web sphere as ‘web material … related to a topic, a theme, an event or a geographic area’.  My paper at the EWA conference presents a method of reconstructing a web sphere, much of which is lost from the live web and exists only in the Internet Archive: the web estate of the many conservative Christian campaign groups in the UK in the 1990s and early 2000s.

This method of web sphere reconstruction is based not on page content but on the relationships between sites, i.e., the web of hyperlinks. The method is iterative, and tracks back and forth between big data and small. Individual archived pages and directories, printed sources, the scholarly record itself, and even traces of previous unsuccessful attempts at web archiving come into play, as does a large dataset held by the British Library. From the more than 2 billion lines in the UK Host Link Graph dataset it is possible to extract the outlines of this particular web sphere.

You can watch Peter Webster’s presentation on his website peterwebster.me

 

Previous studies using a similar method are: 

Webster, Peter. 2019. Lessons from cross-border religion in the Northern Irish web sphere: understanding the limitations of the ccTLD as a proxy for the national web. In The Historical Web and Digital Humanities: the Case of National Web domains, eds Niels Brügger & Ditte Laursen, 110-23. London: Routledge.  http://dx.doi.org/10.17613/yms5-9v95     

Webster, Peter. 2017. Religious discourse in the archived web: Rowan Williams, archbishop of Canterbury, and the sharia law controversy of 2008. In: The Web as History, eds Niels Brügger & Ralph Schroeder, 190-203. London: UCL Press. (Available Open Access at:  https://www.uclpress.co.uk/products/84010)

 

30 October 2020

The UK Web Archive creeps and crawls into the domain of Halloween with byte sized steps

By Helena Byrne, Curator of Web Archives at the British Library

Spider web with one spider some small flies stuck in the web and a dragon fly hovering just above the web.
Creepy crawlers - British Library digitised image from page 79 of "The Child's Book of Poetry. A selection of poems, ballads and hymns"

 

Halloween and the UK Web Archive

From the start of October all the shops and supermarkets were filling up with Halloween costumes, decorations and lots of fun sized confectionery that are easy to share with some of the trick-o-treaters who might be knocking on your door. It is not clear yet how the coronavirus pandemic will impact any of the informal celebrations that take place every year. No doubt the UK Web Archive crawlers are picking up lots of Halloween and 5th of November themed webpages as part of the 2020 Domain Crawl.

Halloween in the UK is often perceived to be a cultural import from North America. A YouGov poll in 2019 showed that only 30% of people surveyed were planning to celebrate the occasion. This Shine graph shows how the popularity of the term on the archived .uk web has increased in popularity over time. 

Click on a point in the graph to see a sample of how the phrase is used.

 

Screenshot of the search for Halloween on the UK Web Archive Shine trends search

 

Halloween History

The tradition of Halloween actually goes back centuries and was widely celebrated by people in Ireland, Britain and northern France. During pagan times, the 1st of November was officially the start of winter, this season was  associated with death as the crops, wildlife and many people died due to the cold and lack of sunlight during this period. Because of this day’s association with death, it was believed that the ghosts of the dead returned to earth on the night of the 31st October. It was during the Reformation that the tradition of celebrating Halloween died out in Britain, especially in England

A recent YouGov poll has shown that Guy Fawkes Night is more popular in Great Britain than Halloween

 

The 5th of November

The commemoration on the 5th of November goes by many names, traditionally it was Guy Fawkes Night but is sometimes referred to as Bonfire Night or Fireworks Night. But there seems to be some regional differences in what term is used and how the night is celebrated. 

What do you call this commemoration?

This is a question we visited back in 2017 and as you can see in the Shine graph in more recent years the term Bonfire Night was used more widely on the archived .uk web. 

 

Screenshot of the search results for Bonfire Night, Guy Fawkes, Gun Powder Plot and Fireworks Night on the UK Web Archive Shine interface

 

Get creative with Halloween at the British Library

Our Assistant Web Archivist, Carlos Lelkes-Rarugal, has designed some short animated videos using recordings from the British Library Sound Archive and images from the Ghosts & Ghoulish Scenes, British Library Flickr. See these on the UK Web Archive, Digital Scholarship and the Sound Archive’s Wildlife Department Twitter accounts.

This and other sounds can be experienced in the Sound Archive at the British Library which has over 260,000 wildlife sound recordings from all over the world. You can hear a selection of some of these recordings on the British Library, Sound & Vision blog, the latest blog post Going Batty for Halloween, gives an overview of the history of bats and Halloween. 

The Digital Scholarship’s latest blog post, Mind Your Paws and Claws, encourages you to use these images and sounds for various creative projects. The Ghosts & Ghoulish Scenes Flickr Album was previously used in the Gothic Off the Map competition

 

Get involved with preserving the UK web

The UK Web Archive aims to archive, preserve and give access to the historic UK web space. We endeavour to include important aspects of British culture and events that shape society. 

Anyone can suggest UK websites to be included in the UK Web Archive by filling in our nominations form: www.webarchive.org.uk/nominate 

We have a Festivals collection, but are there any local Halloween or 5th of November events near you that haven’t been added yet? Equally, if these events have now been cancelled, we would like to add some of these online cancellation notifications to our collection Coronavirus (COVID-19) UK. Browse through what we have so far and please nominate more content!

 

UK Web Archive blog recent posts

Archives

Tags

Other British Library blogs