UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

23 May 2022

Building Event Collections from Web Archives

By Sara Abdollahi, PhD student, L3S Research Center

The world frequently experiences events, such as terrorist attacks, Brexit, and the migrant crisis, that have resulted in a vast amount of event-centric information on the web. Researchers, particularly digital humanities researchers and social scientists who analyse the significant events that influence and shape our societies, can benefit from web archives that reflect the perception of events as they happened at the time.

The Research challenge
Web archiving services provide a preserved state of the web that facilitates its study in the future. The ever-growing size of web archives is one of the main challenges in accessing information for specific research. It is often difficult or even impossible for researchers to find the documents they need. Typically, web archives offer interfaces that let users access information through keyword search. Researchers can type the name of the event they are interested in and retrieve a list of web documents containing that keyword. The returned results are often overwhelming due to their quantity, potential redundancy, and irrelevance, so an additional, intensive cleaning phase is needed to isolate the relevant web documents.

The UK Web Archive (UKWA), like some other web archives, offers manually collected event-centric collections to address this issue, but these can be considerably time-consuming to create. More importantly, these collections might not cover all necessary information related to a specific event.

A Potential Solution
To address this challenge, I propose automatically building event collections from web archives using knowledge graphs. Knowledge graphs such as Wikidata and DBpedia are collections of interlinked real-world entities and concepts.

In this research, I utilise the EventKG knowledge graph which provides structured information about events, their characteristics, and relationships (e.g., sub-events) and can thus be used as a resource for extending and diversifying the search space when building event collections.

Take the Arab Spring as an example: the Tunisian Revolution, the Bahraini protests of 2011, and the 2011 Yemeni revolution are three of its sub-events. The figure below shows how EventKG can be used to create an event collection for the Arab Spring.

Building Event collections diagram

By utilising sub-events to expand the initial user query, a more diverse initial set of documents can be retrieved. This process leads to increased precision and coverage of the final event collection. Traditional methods might miss documents related to sub-events if those documents do not mention the main event. To advance such methods, I demonstrate the impact of event-centric features and relations from a knowledge graph on building event collections.
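To make the query-expansion idea concrete, here is a minimal sketch in Python. It is not the pipeline used in this research: the `SUB_EVENTS` dictionary is a hypothetical stand-in for sub-event relations that would be read from EventKG, the three documents are invented for illustration, and plain substring matching stands in for a web archive's keyword search.

```python
# Hypothetical stand-in for sub-event relations taken from a knowledge graph
# such as EventKG (the real graph is far larger and is queried differently).
SUB_EVENTS = {
    "Arab Spring": [
        "Tunisian Revolution",
        "Bahraini protests of 2011",
        "2011 Yemeni revolution",
    ],
}

# Toy documents, invented purely for illustration.
DOCUMENTS = {
    "doc1": "Analysts look back at the Arab Spring ten years on.",
    "doc2": "The Tunisian Revolution began in December 2010.",
    "doc3": "Local council announces new recycling scheme.",
}


def expand_query(event, sub_events=SUB_EVENTS):
    """Return the event name followed by the names of its known sub-events."""
    return [event] + list(sub_events.get(event, []))


def keyword_search(documents, terms):
    """Return ids of documents that mention at least one term (case-insensitive)."""
    lowered = [term.lower() for term in terms]
    return [
        doc_id
        for doc_id, text in documents.items()
        if any(term in text.lower() for term in lowered)
    ]


# Searching for the main event alone misses the sub-event document.
baseline = keyword_search(DOCUMENTS, ["Arab Spring"])
# Expanding the query with sub-events retrieves it as well.
expanded = keyword_search(DOCUMENTS, expand_query("Arab Spring"))

print(baseline)  # ['doc1']
print(expanded)  # ['doc1', 'doc2']
```

In this toy case, `doc2` never mentions the Arab Spring by name, so only the expanded query finds it. A real system would still need a ranking or filtering step after retrieval, which this sketch leaves out.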

Sara is giving a presentation of this project at IIPC Web Archive Conference 2022 (session 15) - Register for free.

17 May 2022

UK Web Archive Technical Update - Spring 2022

By Andy Jackson, Web Archive Technical Lead, British Library

Hadoop storage and replication
With the live services happily running off both the old and new Hadoop clusters, we have been focusing on setting up and populating our third Hadoop cluster, destined for the National Library of Scotland.

The Legal Deposit libraries have worked together to fund this additional, independent copy of the UK Web Archive holdings. This is primarily for the purposes of preservation, as having a further copy managed by a separate team and organisation will help ensure our records are not lost or damaged. Longer-term, this system can also function as an independent access and research platform, and this is something we hope to explore as part of the Archives of Tomorrow project.

As there is a petabyte of content to replicate, we were initially concerned that the process of migrating the data would take an extremely long time, and possibly put an unsustainable load on our internal network infrastructure. Happily, these worries were unfounded: over the last six weeks, we’ve replicated about 300TB of WARCs, and this has not caused any noticeable network capacity problems. We’ve also been able to start running cluster jobs that calculate checksums for the files on both ends of the replication, so we can verify everything is working.
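For readers curious about what checksum-based verification looks like, the following is a minimal sketch rather than our actual cluster jobs. It assumes both copies are reachable as ordinary local directories (the real holdings live on Hadoop), and the choice of SHA-512 and the function names are illustrative assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def sha512_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large WARCs need not fit in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha512_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_manifests(source, replica):
    """Return (files missing from the replica, files whose checksums differ)."""
    missing = sorted(set(source) - set(replica))
    mismatched = sorted(
        name for name in source if name in replica and source[name] != replica[name]
    )
    return missing, mismatched


if __name__ == "__main__":
    # Tiny demonstration with two temporary directories standing in for the clusters.
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        (Path(src) / "example.warc.gz").write_bytes(b"archived bytes")
        (Path(dst) / "example.warc.gz").write_bytes(b"archived bytes")
        missing, mismatched = compare_manifests(build_manifest(src), build_manifest(dst))
        print("missing:", missing, "mismatched:", mismatched)
```

Computing the manifests independently at each end and comparing only the checksums means the verification step itself moves very little data across the network.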

Computer server

Legal Deposit Access Solution
The current system for accessing Non-Print Legal Deposit material in our reading rooms has accessibility problems, and is being replaced with two components:

  • An enhanced version of PyWB that can render PDFs and ePubs.
  • An ‘NPLD Player’ app that will allow the content to be accessed from reading room PCs that have not been set up to prevent copies of items being accidentally taken away.

Both components are being developed through a contract with Webrecorder.

This quarter has mostly been about laying the groundwork for this (like writing deployment documentation), so we might make more progress next quarter.

Crawlers
We use web browsers to render a lot of seed pages, and this now represents a significant amount of data and includes a lot of duplication of common files and media. To mitigate this, we have enabled deduplication for the browser-based crawling.
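As a rough illustration of the general technique, and not a description of how our crawlers implement it, content-based deduplication can be sketched as follows. The class name is hypothetical and the use of SHA-1 is an assumption; "revisit" is the WARC record type used to note that a payload has already been stored.

```python
import hashlib


class PayloadDeduplicator:
    """Remember payload digests so identical content is stored only once."""

    def __init__(self):
        # Maps payload digest -> URL of the first capture with that payload.
        self._first_seen = {}

    def record(self, url, payload):
        """Return (record_type, digest, original_url) for a captured payload.

        The first capture of a payload is a full 'response' record; later
        captures of identical bytes become 'revisit' records that point back
        to the URL where the payload was first stored.
        """
        digest = hashlib.sha1(payload).hexdigest()
        original = self._first_seen.get(digest)
        if original is None:
            self._first_seen[digest] = url
            return ("response", digest, None)
        return ("revisit", digest, original)


dedup = PayloadDeduplicator()
logo = b"...bytes of a shared logo image..."
first = dedup.record("https://example.org/a/logo.png", logo)
second = dedup.record("https://example.org/b/logo.png", logo)
print(first[0], second[0])  # response revisit
```

Because many rendered pages pull in the same scripts, fonts and images, storing a short pointer instead of a second full copy is where the space saving comes from.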

We’ve also improved monitoring of the process of moving WARCs and logs to Hadoop, so we can spot if backlogs are building up.

Annotation and Curation Tool (W3ACT)
For the core W3ACT service, the only changes have been to fix the links to QA Wayback that were being misdirected to the wrong URL, and upgrade PyWBs to 2.6.4.1.

However, we have been working on embedding additional services behind the W3ACT login. These include:

  • A way to view the logs from the W3ACT crawls.
  • An instance of SolrWayback, configured to search full text indexes from the W3ACT crawls.

Our Danish colleagues have been very helpful, collaborating with us to augment SolrWayback so it could be run with our systems. There are still some gaps (e.g. the internal playback part does not work reliably as our old Solr indexes do not provide all the fields SolrWayback needs) but it’s still very valuable as a way of exploring and evaluating how we might work in the future.

One gap, however, is that we haven’t yet replaced the Storage Report with an up-to-date version that runs across both clusters (ukwa-notebook-apps#12). That should be done early in April.

UKWA Website
The majority of the work has focused on finishing the 'high-level category' view of the UKWA Topics and Themes, finalising the design and pulling together the translations.

In addition, like QA Wayback, the public PyWB service has been updated to 2.6.4.1, and we’ve shifted the services to new hardware.

Finally, we have been laying the groundwork for regular automated regression testing, including testing for accessibility issues. Once established, this will be a huge help, allowing us to modify our services with more confidence, knowing that if we accidentally break any critical functionality, the test suite will catch the problem early. This is particularly important as preparation for larger changes, like integrating static documentation and translations into the main website (ukwa-services#48).

Google Sheets Add-On No Longer Available
A while ago, we experimented with an add-on for Google Sheets that provided a way to query web archive holdings from an online spreadsheet (this COPTR link provides some additional information).

Unfortunately, this has become unavailable due to a particular kind of digital obsolescence: changes to Google’s policies. To make it work again, we have to modify our formal policies and documentation in a way that meets Google’s specific requirements. Realistically, due to other work taking priority, it’s likely to be some time before we are able to look at restoring it.

Read the previous UKWA Technical update (Jan 2021) blog post

11 May 2022

The Queen's Platinum Jubilee in the UK Web Archive

By Daniela Major, PhD Student, School of Advanced Studies, University of London

Whether you’re an avid monarchist, a staunch republican or simply obsessed with Netflix’s “The Crown”, there is no doubt that Elizabeth II has achieved a unique place in history. The 70 years of her reign have been witness to profound changes in world politics and in British society. When she was crowned, Churchill was her Prime Minister, Khrushchev was freshly in charge of the Soviet Union and Eisenhower had just become the President of the United States.

Queen Elizabeth II

Throughout her decades as monarch, Queen Elizabeth has worked with 14 UK Prime Ministers and met 13 American Presidents. She has received state visits from countless foreign leaders, who themselves influenced the shape of 20th- and 21st-century history: from Charles de Gaulle to Mikhail Gorbachev.

During her reign, the United Kingdom went through dramatic changes, from the dismantling of the British Empire to referendums on Welsh devolution and Scottish independence. The Queen’s honours lists depict a country where diversity is celebrated. She has given honours to authors such as V.S. Naipaul and Salman Rushdie, singers like Paul McCartney and Bono, and artists like Paula Rego.

For many reasons, the Platinum Jubilee is a great opportunity to explore this dialogue between the present and the past. How and why we celebrate, or how and why we refuse to do so, places us in a specific historical context: in this case, the 21st-century UK, in a world of constant change.

Queens Platinum Jubilee logos

So far, we have discovered that food is a favourite in every celebration. Fortnum & Mason and the Big Jubilee Lunch are celebrating the Jubilee by sponsoring a competition awarding the best pudding – following the Victoria Sponge, named after Queen Victoria, and Coronation Chicken, created in honour of Elizabeth II’s coronation. The judges include Mary Berry of Great British Bake-Off fame, food historian Regula Ysewijn and MasterChef’s Monica Galetti.

A slew of cultural celebrations is on the cards: The Reading Agency launched the Big Jubilee Read, which chose ten outstanding books from the last seven decades. The Royal Mint has created a commemorative coin and the Royal Philharmonic Concert Orchestra gave a concert at the Royal Albert Hall. Throughout the whole of the UK, town councils are preparing for street parties, tree planting, and jubilee lunches.

This is where you come in. The UK Web Archive wants to know how you are choosing to remember this Jubilee.

  • Are you taking part in the Jubilee’s bake-off?
  • Are you lighting a beacon or attending a street party?
  • Are you going to a protest? Have you written about how the UK cannot have 70 more years of monarchism?

Help us remember this moment in history so that future historical sources reflect the full diversity of public activity. Help us show how people across the UK celebrate important dates, how they look back on their own past, and how they celebrate their present.

If you know of a website worth keeping for posterity, nominate it and make your suggestion.