UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

22 May 2024

Reflections on the IIPC Early Scholars Spring School on Web Archives 2024

By Cameron Huggett, PhD Student (CDP), British Library/Teesside University

IIPC Early Scholars Spring School on Web Archives banner

My name is Cameron, and I am currently undertaking an AHRC-funded Collaborative Doctoral Partnership (CDP) project between the British Library and Teesside University. My research centres on racial discourses within association football fanzines and e-zines from c.1975 to the present, and aims to examine the broader connections between football fandom, race and identity.

I attended the Early Scholars Spring School on Web Archives, held before the conference began, which allowed me to share knowledge with colleagues from a number of different countries, institutions and disciplines, offering new perspectives on my own research. Within this school, I was fortunate enough to deliver a short lightning talk outlining my own use of web archiving in my research into the history of racial discourses within football fanzines. This generated an engaging discussion around my methodologies and led me to reflect upon how quantitative techniques can be better adopted within historical research practices.

I also particularly enjoyed discovering more about the collections of the Bibliothèque Nationale de France (BNF) and Institut National de L'audiovisuel (INA). The scope of the collections and innovative user interfaces were particularly impressive. For example, INA had created a programme that allowed the user to view a collection item, such as an election debate broadcast, alongside archived tweets relating to the event in real time.

My primary takeaway was how web archives can be innovatively employed to record the breadth and depth of online communities and discourses, as well as supplement more traditional sources within a historian’s research framework.

24 January 2024

Exploring Alternative Access: Making the Most of Web Archives During UK Web Archive Downtime

Nicola Bingham, Lead Curator of Web Archiving, British Library

The British Library is continuing to experience disruption following a cyber-attack and we are working hard to restore services. Disruption to some services is, however, expected to persist for several months. In the meantime, our buildings are open and we’ve released a searchable online version of our main catalogue, which contains records of the majority of our printed collections as well as some freely available online resources. Our reference team are on hand to answer queries, advise on collection item availability and help with other ways to complete your work. Please email [email protected] or find out more. The disruption is affecting our website, online systems and services. Please see our temporary website for up-to-date information.

Despite the disruption to access to the UK Web Archive, we continue to crawl and acquire copies of websites, and to add new websites to our acquisition process. This work is being undertaken with Amazon Web Services in the cloud, ensuring that the UK Web Archive collection is updated and preserved as usual.

We appreciate that, for regular users of the UK Web Archive, the temporary unavailability of this valuable resource is inconvenient and disruptive. However, several alternative openly accessible web archives can serve as sources of information while the UK Web Archive is offline.

Other Openly Accessible Web Archives

Internet Archive: Known as the largest and most comprehensive web archive globally, it includes the famous Wayback Machine and boasts an extensive collection of archived web pages.

Understanding the Differences

While the Internet Archive captures a broad spectrum of global content, the UK Web Archive focuses specifically on the UK web. The UK Web Archive offers comprehensive crawls, curated collections, and secondary datasets for research. However, access is primarily restricted to legal deposit libraries, with some resources available openly.

The Internet Archive allows remote access to archived websites, but its search functionalities and scope differ from the UK Web Archive.

Memento Time Travel: This innovative platform operates under the Memento protocol, allowing users to view archived websites across various openly accessible web archives. It acts as a bridge, enabling access to past versions of web resources stored in archives such as the Internet Archive, Archive-It, UK Web Archive, archive.today, GitHub, and more. While it displays links to Mementos, it doesn’t retain the content itself.

Portuguese Web Archive (Arquivo.pt): Developed by the Portuguese Foundation for Science and Technology, this archive aims to preserve and grant access to the Portuguese web domain and its contents. It also archives a significant amount of European Union and transnational content. It's a valuable resource for preserving the digital heritage of Portugal and contributing to the preservation of European and Portuguese-language online information.

UK Government Web Archive: An openly accessible archive preserving UK central government information, encompassing videos, tweets, images, and websites dating from 1996 to the present day.

UK Parliament Web Archive: This openly accessible archive covers parliamentary websites and social media content from 2009 to the present day.

National Records of Scotland Web Archive: Offering open access, this archive allows browsing and searching of websites related to Scotland’s people and history.
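For readers curious about how the Memento protocol behind the Memento Time Travel service listed above works, here is a minimal Python sketch. Under the protocol (RFC 7089), a client asks a "TimeGate" for the capture of a URL closest to a chosen date by sending an `Accept-Datetime` header. The sketch only builds such a request and does not send it. The TimeGate address is an assumption based on the service's public documentation and may change, and the function name is purely illustrative.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Assumed TimeGate endpoint of the Memento Time Travel service; check the
# service's own documentation before relying on it.
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"


def build_timegate_request(url: str, when: datetime) -> Request:
    """Build (but do not send) a Memento TimeGate request for `url` near `when`."""
    # RFC 7089 expects an HTTP-date expressed in GMT in the Accept-Datetime header.
    accept_datetime = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(TIMEGATE + url, headers={"Accept-Datetime": accept_datetime})


request = build_timegate_request(
    "https://www.webarchive.org.uk/",
    datetime(2023, 10, 1, tzinfo=timezone.utc),
)
print(request.full_url)
# urllib normalises header names, hence the lower-case "d" in the lookup.
print(request.get_header("Accept-datetime"))
```

Passing the request to `urllib.request.urlopen` would normally follow the TimeGate's redirect to whichever archive holds the closest capture, which matches the description above: the service points to Mementos rather than storing the content itself.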

Seeking Information and Resources

While the UK Web Archive is offline, the UK Web Archive blog remains accessible and serves as a useful source of information about the archive.

Additionally, although the UK Web Archive itself might be temporarily inaccessible, its information pages have been preserved by the Internet Archive, accessible [here](https://web.archive.org/web/20240000000000*/https://www.webarchive.org.uk).

For those keen on delving deeper, the British Library Research Repository houses supporting documents related to the UK Web Archive, such as collection scoping documents, annual reports, statistics, and research publications. The repository can be accessed [here](https://doi.org/10.23636/hj5v-3c07).

While the UK Web Archive takes a brief hiatus, we hope these alternative resources help. And perhaps embracing these other openly accessible archives might even unveil new avenues and perspectives for exploration.

While we work hard to recover all our online services, you can find regular updates on progress published on our Knowledge Matters blog.

18 October 2023

UK Web Archive Technical Update - Autumn 2023

By Andy Jackson, Web Archive Technical Lead, British Library

This is a summary of what’s been going on since the 2023 Q2 report.

Replication

The most important achievement over the last quarter has been establishing a replica of the UK Web Archive holdings at the National Library of Scotland (NLS). The five servers we’d filled with data were shipped, and our NLS colleagues kindly unpacked and installed them. We visited a few weeks later, finishing off the configuration of the servers so they can be monitored by the NLS staff and remotely managed by us.

This replica contains 1.160 PB of WARCs and logs, covering the period up until February 2023. But, of course, we’ve continued collecting since then, and including the 2023 Domain Crawl, we already have significantly more data held at the British Library (about 160 TB more, ~1.3 PB in total). So, the next stage of the project is to establish processes to monitor and update the remote replica. Hopefully, we can update it over the internet rather than having to ship hardware back and forth, but this is what we’ll be looking into over the next few weeks.
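As a quick check on the figures above, the short Python sketch below adds the extra data to the replica size and works out how much of the current holdings the replica covers. It assumes decimal units (1 PB = 1000 TB), which the post does not state, and the variable names are illustrative only.

```python
# Figures quoted in the post; decimal units are an assumption (1 PB = 1000 TB).
replica_pb = 1.160  # replica at NLS, covering up to February 2023
extra_tb = 160      # approximate additional data held at the British Library

total_pb = replica_pb + extra_tb / 1000
replica_share = replica_pb / total_pb

print(f"Total held at the British Library: ~{total_pb:.2f} PB")
print(f"Share of holdings already in the replica: ~{replica_share:.0%}")
```

On these numbers the replica already covers a little under nine tenths of the collection, so the gap to close is the roughly 160 TB collected since February 2023 plus whatever is added afterwards.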

The 2023 Domain Crawl

As reported before, this year we are running the Domain Crawl on site. It ran into problems with link farms, which caused the number of domains to leap from around 30 million to around 175 million and crashed the crawl process.

2023 Domain Crawl queues over time, showing peak at 175 million queues.

However, we were able to clean up and restart it, and it’s been stable since then. As of the end of this quarter we’ve downloaded 2.8 billion URLs, corresponding to 183 TB of (uncompressed) data.

Legal Deposit Access Service

We’ve continued to work with Webrecorder, who have added citation, search and print functionality to the ePub reader part of the Legal Deposit Access Service. This has been deployed and is available for staff testing, but we are still resolving issues around making it available for realistic testing in reading rooms across the Legal Deposit Libraries.

Browsertrix Cloud Local Deployment

We have worked out most of the issues around getting Browsertrix Cloud deployed in a way that complies with Non-Print Legal Deposit legislation and with our local policies. We are awaiting the 1.7.0 release, which will include everything we need to have a functional prototype service.

Once it’s running, we can start trying out some test crawls, and work on how best to integrate the outputs into our main collection. We need some metadata protocol for marking crawls as ready for ingest, and we need to update our tools to carefully copy the results into our archival store, and support using WACZ files for indexing and access.
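To give a flavour of what "carefully copying" WACZ output might involve, here is a minimal Python sketch of an integrity check that could run before a file is copied into an archival store. It is not our actual tooling. It assumes the WACZ layout described in the public WACZ specification: a ZIP file containing a `datapackage.json` whose `resources` entries each carry a `path` and a `hash` of the form `sha256:<hex>`. The function names and the demo data are hypothetical, and the "ready for ingest" marker mentioned above is not covered because it has not been defined yet.

```python
import hashlib
import io
import json
import zipfile


def wacz_resources_intact(wacz_file) -> bool:
    """Return True if every resource listed in datapackage.json matches its hash."""
    with zipfile.ZipFile(wacz_file) as wacz:
        package = json.loads(wacz.read("datapackage.json"))
        for resource in package.get("resources", []):
            algorithm, _, expected = resource["hash"].partition(":")
            if algorithm != "sha256":
                return False  # this sketch only understands sha256 hashes
            digest = hashlib.sha256()
            try:
                # Stream each member in 1 MiB chunks so large WARCs are not
                # loaded into memory all at once.
                with wacz.open(resource["path"]) as member:
                    for chunk in iter(lambda: member.read(1 << 20), b""):
                        digest.update(chunk)
            except KeyError:
                return False  # a listed file is missing from the ZIP
            if digest.hexdigest() != expected:
                return False
    return True


def make_demo_wacz(payload: bytes, recorded: bytes) -> io.BytesIO:
    """Build a tiny in-memory WACZ-like ZIP whose manifest records the hash of `recorded`."""
    manifest = {
        "resources": [
            {
                "path": "archive/data.warc",
                "hash": "sha256:" + hashlib.sha256(recorded).hexdigest(),
            }
        ]
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as wacz:
        wacz.writestr("archive/data.warc", payload)
        wacz.writestr("datapackage.json", json.dumps(manifest))
    buffer.seek(0)
    return buffer


good = make_demo_wacz(b"example record", b"example record")
damaged = make_demo_wacz(b"corrupted record", b"example record")
print(wacz_resources_intact(good), wacz_resources_intact(damaged))
```

A check like this only confirms that the package is internally consistent. Deciding when a crawl is actually ready for ingest would still need the separate metadata convention described above.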