UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites


Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

16 January 2023

UK Web Archive Technical Update - Winter 2022

By Andy Jackson, Web Archive Technical Lead, British Library

This is a summary of what’s been going on since the update at the start of the autumn.

2022 Domain Crawl Completion

As in previous years, the 2022 Domain Crawl continued to run right up until the end of the year. Overall, things ran smoothly, with only brief outages for upgrading the virtual server over time as the size of the frontier grew.

Graph showing the UK web archive 2022 annual domain crawl

Because we’re running on the cloud, we are paying for how much compute capacity, RAM and disk space we’re using. So, when the crawl is young and the Heritrix3 frontier database is small, it makes sense to use a small computer. But as the crawl frontier grows, so does the amount of RAM the crawler needs to manage the frontier, so we scale up as we go.

This is one of the reasons we spent time making it possible to configure the frontier database so more house-keeping and clean-up processes are run while the crawl is running. This helps Heritrix clear disk space after it has dealt with URLs, and led to significant savings. The 2020 crawl ended up using 45TB of disk space to store the crawl state, and deleting old ‘checkpoint’ files (which can be used to revert the crawl state to a previous point in time) did not help free up more space. But after changing those configuration options, the 2021 and 2022 crawls only needed 15TB of space, and deleting checkpoints was much more effective.

2023 Domain Crawl Planning

We originally moved to the cloud to relieve pressure on the BL networks as staff switched to remote working during the pandemic. But even when COVID restrictions were eased, the library has continued to support staff working remotely where possible. Fortunately, over the last year the library has upgraded many of the network systems across both the London and Boston Spa sites, which means we now have permission to run the 2023 crawl on site.

As there is still some uncertainty as to how this will affect other network users, we are planning to begin the crawl much earlier in the year (perhaps as early as February). This gives us more time to revisit our options if something goes awry.

Internal Collections API

Working with the Archives of Tomorrow project to understand their requirements, we now have an internal API where W3ACT metadata can be downloaded for entire collections, including all sub-collections and target site metadata. Authenticated W3ACT users can retrieve these full collection extracts (including unpublished collections), which are updated daily. The JSON files are available at https://www.webarchive.org.uk/act/static/api-json-including-unpublished/collection/ for logged-in users.

The public version of the API is in the final stages of development, and should be released early in 2023. Unlike the internal API, this will not include collections that are not yet ready for publication.
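As a rough illustration of working with such an extract, the sketch below walks a nested collection and lists every target site along with the collection path it sits under. The field names (`title`, `targets`, `subcollections`) and the sample data are assumptions made for this example; the real W3ACT JSON layout is not described in this post and may differ.

```python
from typing import Iterator

# Hypothetical shape of a collection extract. The real W3ACT JSON
# field names are not given in this post and may differ.
sample_collection = {
    "title": "Example collection",
    "targets": [{"title": "Site A", "url": "https://a.example.org.uk/"}],
    "subcollections": [
        {
            "title": "Example sub-collection",
            "targets": [{"title": "Site B", "url": "https://b.example.org.uk/"}],
            "subcollections": [],
        }
    ],
}


def iter_targets(collection: dict, path: tuple = ()) -> Iterator[tuple]:
    """Yield (collection path, target) pairs, depth first."""
    here = path + (collection["title"],)
    # Targets attached directly to this collection come first.
    for target in collection.get("targets", []):
        yield here, target
    # Then recurse into each sub-collection, extending the path.
    for child in collection.get("subcollections", []):
        yield from iter_targets(child, here)


for where, target in iter_targets(sample_collection):
    print(" / ".join(where), "->", target["url"])
```

Keeping the path alongside each target makes it easy to see which sub-collection a site belongs to when the extract is flattened into a spreadsheet or report.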

W3ACT 2.3.4

Just a few days ago, W3ACT 2.3.4 was released. This included a number of tweaks and bugfixes, including correcting the CSV export feature and adding more export formats (TSV and JSON). For more details, please take a look at the associated release milestone.

There was also an issue with how W3ACT data was used, meaning the subdomains of sites with open access licences were being given the same licence as the ‘parent’ domain. This has now been resolved and access is consistent with the data in W3ACT.
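To make this class of problem concrete, here is a minimal sketch contrasting a suffix-based host match (where any subdomain inherits the parent's licence) with an exact host match. This is only an illustration: the `licensed_hosts` table and both function names are hypothetical, and the post does not say how the actual fix was implemented.

```python
from urllib.parse import urlsplit

# Hypothetical licence table keyed by host; not W3ACT's real data model.
licensed_hosts = {"example.org.uk"}


def host_of(url: str) -> str:
    """Return the lower-cased host of a URL, or '' if there is none."""
    return (urlsplit(url).hostname or "").lower()


def is_open_access_suffix(url: str) -> bool:
    """Over-broad rule: any subdomain inherits the parent's licence."""
    host = host_of(url)
    return any(host == h or host.endswith("." + h) for h in licensed_hosts)


def is_open_access_exact(url: str) -> bool:
    """Stricter rule: only hosts explicitly recorded as licensed match."""
    return host_of(url) in licensed_hosts


url = "https://blog.example.org.uk/post"
print(is_open_access_suffix(url), is_open_access_exact(url))
```

Under the suffix rule the subdomain is treated as licensed even though only the parent host is recorded, which is the kind of mismatch described above.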

Document Harvester Outage

From the 12th of December onwards, the Document Harvester stopped picking up GOV.UK documents properly. This appears to have stemmed from some edits carried out in W3ACT, where the Watched Target that covered the GOV.UK document publication service was merged with the main GOV.UK Target (which was not Watched). This meant the crawler was no longer looking for documents from GOV.UK.

We made the GOV.UK Target into a Watched Target, and then cleared the relevant crawl logs for re-processing. Those logs have now been processed and the missed documents have been identified.

We’re looking at how this happened and will take steps to prevent this happening in the future.

The Application Support team has been working with the Networks team and our Legal Deposit Library partners to start to roll out an initial ‘alpha’ service across all sites. This will help all library staff to try out the system and lay the foundations for a ‘beta’ service in reading rooms. The Project Manager has also been working hard to understand the likely timeline for the project and communicate this to all stakeholders, while keeping the project management triangle in mind.

Additionally, we’re working on setting up a suitable Continuous Deployment pipeline for this service using GitLab CI/CD. This will allow us to analyse, test and safely deploy new versions of the access service without having to manage the system by hand.

CDX Backfill

One of the critical components of the web archive is the content index (CDX), which is an index of all the URLs we have archived, and is required for playback to work. Ours runs on OutbackCDX (from the National Library of Australia), and a subset of its functionality is available via our API.
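For readers unfamiliar with what a CDX index holds, the sketch below parses a single index line, assuming the commonly used 11-field, space-separated CDX layout (URL key, timestamp, original URL, MIME type, status, digest, redirect, robot flags, record length, offset, WARC filename). The sample line is made up for illustration, and the exact fields returned by a given index or API may differ.

```python
from datetime import datetime
from typing import NamedTuple


class CdxRecord(NamedTuple):
    urlkey: str          # canonicalised (SURT-style) form of the URL
    timestamp: datetime  # capture time
    original: str        # URL as crawled
    mimetype: str
    status: str
    digest: str
    redirect: str
    robotflags: str
    length: str          # compressed record length in the WARC
    offset: str          # byte offset of the record in the WARC
    filename: str        # WARC file holding the record


def parse_cdx_line(line: str) -> CdxRecord:
    """Parse one 11-field CDX line into a CdxRecord."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    captured = datetime.strptime(fields[1], "%Y%m%d%H%M%S")
    return CdxRecord(fields[0], captured, *fields[2:])


# Made-up example line, not taken from the real index.
sample = (
    "uk,org,example)/ 20220815120000 https://example.org.uk/ "
    "text/html 200 SHA1DIGESTPLACEHOLDER - - 1234 5678 example.warc.gz"
)
record = parse_cdx_line(sample)
print(record.original, record.timestamp.isoformat(), record.filename)
```

The filename and offset are what let a playback tool jump straight to the right record inside a WARC, which is why playback cannot work without the index.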

In the past, we’ve had problems running large CDX indexing jobs, and this had left us in an unfortunate situation where the 2016, 2018 and 2019 domain crawls were not indexed. During the last few months, we modified the indexing process to (re)process our WARCs and ‘backfill’ the index, which has filled in those gaps.

This also showed that we could process our entire collection (i.e. over 1PB) in a reasonable time (roughly three months depending on the precise workload), which is reassuring. It will likely be necessary to re-build indexes from time to time, and it’s good to know it should be possible to do so in a reasonable amount of time. Also, the act of reading every byte of every WARC is an additional explicit proof that the files have been kept safe over all these years! We know HDFS has been systematically monitoring the files over time, but it’s nice to run an independent check.
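As a back-of-the-envelope check on what that implies, the sketch below works out the average sustained read rate needed to get through 1PB in about three months. It assumes a decimal petabyte and a 90-day run; the post only says "over 1PB" and "roughly three months", so the result is an order-of-magnitude guide rather than a measured figure.

```python
PETABYTE = 10**15  # decimal petabyte; the post says "over 1PB"
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_mb_per_s(total_bytes: int, days: float) -> float:
    """Average throughput in MB/s needed to read total_bytes in the given days."""
    return total_bytes / (days * SECONDS_PER_DAY) / 10**6


# Assumed 90-day run; the actual duration depended on the workload.
rate = sustained_rate_mb_per_s(PETABYTE, days=90)
print(f"{rate:.0f} MB/s")
```

Under these assumptions the average works out at a little under 130 MB/s, sustained around the clock across the whole cluster.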

The 2020, 2021 and 2022 domain crawls will have to wait a little longer, as they are stored on Amazon Web Services and need transferring to the British Library before they can be indexed.

Browsertrix-Cloud

Finally, we’re proud to be part of the IIPC project Browser-based Crawling For All, which contributes to the development of Browsertrix Cloud and attempts to ensure IIPC members can take advantage of it. As part of this, we proposed two sessions for next year’s IIPC conference, both of which have been accepted:

  • A workshop called Browser-Based Crawling For All: Getting Started with Browsertrix Cloud, aimed at helping attendees take advantage of Browsertrix Cloud. We’re particularly interested in uncovering barriers that might prevent adoption.
  • A panel called Browser-Based Crawling For All: The Story So Far, giving an insight into the current state of the project and of Browsertrix Cloud (including any feedback from the workshop).

Hoping to see you there!

12 January 2023

Changes in Nature’s Calendar – Early Bloomers

The Importance of Citizen Science in Monitoring and Adapting to Climatic Change

By Andrea Deri, Cataloguer and UKWA Climate Change Collection’s lead curator

On 1 January 2023, I had my usual walk from Folkestone Gardens via Sue Godfrey Nature Park, Deptford, London Borough of Lewisham to Greenwich Park, Royal Borough of Greenwich. Overcast, temperature in single digits, humid but calm. Trees and shrubs mostly leafless: an accentuating background to patches of bright green mosses.

I was hoping to see some flowers on winter blossoming plants, for example the bell-shaped flowers of clematis ‘Jingle Bell’ in St Alfege Church’s yard, and the spidery flowers of witch hazels in the Royal Observatory Garden in Greenwich. I was also curious what other flowers I would find, earlier than usual, triggered by the warming climate. A month earlier (1 December 2022) I had joined the annual wildflower ‘hunt’ held on the first day of winter, a survey of species in flower in my locality, Deptford’s urban area, organised since 2009 by the Creekside Education Trust and the London Natural History Society, so I expected several early bloomers. Here is Creekside’s blog post of the 2021 wildflower survey.

While the witch hazels (Fig. 1) did not disappoint, I was in for a surprise with clematis ‘Jingle Bell’: only the silky, fluffy seedheads were left, as it had finished flowering earlier this year. I was lucky to see its last flowers on Christmas Eve 2022 (Fig. 2). Other early flowers greeted me on a hazelnut shrub in Sue Godfrey Nature Park (Fig. 4). But I was truly astonished to see daffodils fully opened in a park by Creekside, just across from the Creekside Discovery Centre (Fig. 3).

Witch hazel flower

Figure 1 Witch hazel (Hamamelis sp.) in flower. Photo: Andrea Deri, Royal Observatory Garden, Greenwich, London, 1 January 2023

I started searching for phenology calendars, almanacs, and any information on the blooming time of these species in my local and other areas in order to compare my observations with the “expected” (based on previous years) flowering periods. The online findings supported my assumption: I did observe earlier than expected flowerings, with the most specific data for the hazelnut.

Clematis ‘Jingle Bell’ 
According to the Royal Horticultural Society (RHS), clematis ‘Jingle Bell’ flowers in winter and early spring. Compared to this broad-brush period, my observation this year suggests this individual specimen finished flowering much earlier than expected, and earlier than I had observed in previous years.

Clematis flower

Figure 2 Clematis cirrhosa “Jingle Bells” one bell-shaped flower and fluffy seedheads. Photo: Andrea Deri, St Alfege Church, Greenwich, London, 24 December 2022

Daffodil 
A post on the Daffodil Society prompted me to search RHS’s website for daffodils, where February-March was quoted as the usual flowering period. This is more precise than for the clematis. Early flowering horticultural varieties of daffodil, however, can bloom as early as January, according to one of the Gardeners’ World blog posts. I may have encountered an early flowering garden variety. In addition to its literary associations, this iconic flower may now also have become a conversation starter about the climate crisis. Would its freshness and brightness frame a difficult dialogue in hope?

Daffodil flowers

Figure 3 Daffodils (Narcissus sp.) in flower. Photo: Andrea Deri, near Creekside Discovery Centre, Deptford, London, 1 January 2023

Hazelnut 
The Woodland Trust Nature’s Calendar offered me the tool I had really been looking for: a peer-reviewed database linked to a live map that allowed me to compare my observation with those of fellow observers in the UK at day-level precision.

Hazelnut flower

Figure 4 Hazelnut (Corylus avellana) in flower: crimson female flowers, yellow catkin male flowers. Photo: Andrea Deri, Sue Godfrey Nature Park, Deptford, London, 1 January 2023

Before I signed up to add my hazelnut observation, I took a screenshot of the “Add a Record” webpage on 5 January 2023 that showed the first hazelnut flower sighting on 4 January 2023. (Fig.5.)

Screenshot of Woodland Trust 'Nature's Calendar' website

Figure 5 Screenshot of Nature's Calendar, Woodland Trust. Photo: Andrea Deri, 20:34 GMT, 5 January 2023

Hazelnut first flowering was among the recently recorded data on Nature’s Calendar (Fig. 5). My observation of hazelnut flowers on 1 January 2023 was not extraordinary, but it was earlier than the one featured online. Hazelnut is expected to be in flower in early January according to Nature’s Calendar (downloadable pdf). But as early as 1 January? To answer this question, I had to register to enter my data. When I entered my observation date, I received an automatic note, all in red:

“This date falls outside of the expected range

The date you have entered is unusually early or late for this species and event; please double check the record. If it’s correct we’d like to know more about your observation, so please add a comment before clicking ‘next’ to continue. If possible, a photo is very useful too. Please note that your record will not appear on the live map until it has been checked by the Nature’s Calendar team.”

For evidence, I uploaded one of my photos of the hazelnut flowers (Fig.4.) and a description of the place and circumstances. My hazelnut flowering observations may turn out to be some of the earliest this year. To prove or refute this statement I rely on the Woodland Trust’s online database, the Nature’s Calendar team’s peer-review and keen monitoring of fellow citizen scientists. This type of on-land & online live collaboration in monitoring the slightest phenological changes is gaining increasing importance in addressing local impacts of climatic changes.

Will hazelnut flower earlier and earlier in the future? Only regular visitors can answer this question, by carefully monitoring the same hazelnut shrub, recording the date of the first flowers and uploading the data to Nature’s Calendar.

Nature’s Calendar invites citizen scientists to monitor a carefully selected list of species of shrubs, trees, flowers, grasses, fungi, birds, insects and amphibians throughout the year. Their changes over time will give us information on how these species (plants, animals and fungi) adapt to the unfolding climatic changes. Phenological change data contributes to better decisions in wildlife conservation, among other areas.

While I was browsing, I came across several websites and webpages on various other decisions and local actions related to climate change adaptation. For example: What can I do about climate change in my garden?  What local residents are doing in the boroughs of Lewisham and Greenwich about the climate crisis:  Climate Action Lewisham, Climate Home – a home of creativity, imagination and community activism by young people, Lewisham Climate Action Bond as an example of Local Climate Bonds, Lewisham Climate Emergency Declaration and Action Plan, CAPE Informing Local Action on Climate Change / London Borough of Lewisham, The Climate Emergency website of Royal Borough of Greenwich, Carbon Neutral Greenwich, Greenwich Climate Network. 

Some of the activities and organisations were familiar to me; I was taken aback by others: ‘How could I miss them? I live here!’ A fast-changing landscape of actions and online information. Having saved these sites for further action, I also realised that some of this online content could be highly ephemeral. Uploading my list of URLs to the UKWA Climate Change collection saved local digital content for future research on climatic changes.

Sauntering through streets, gardens and parks has turned into an archival journey, connecting past, present and future. Fit for the first day of the year. Fit for any day, anywhere your interest, experience and local knowledge cross climatic changes.

The Natural History Museum’s community science webpage lists a broad range of UK wildlife monitoring activities related to climatic changes, including the New Year Plant Hunt of the Botanical Society of Britain and Ireland and the upcoming annual Big Garden Birdwatch (27-28 January 2023) organised by the Royal Society for the Protection of Birds since 1979. 

Contribute to the web archive
Your next walk or online stroll may prompt you to nominate some of your local climate initiatives (civil society, governmental, business, media, arts and academia) to the UK Web Archive Climate Change Collection. Many thanks for your consideration.

12 December 2022

Examining sports history through digitised & born digital resources

By Helena Byrne, Curator of Web Archives, The British Library

The Irish Sporting Lives workshop and symposium took place at the Ulster University campus in Belfast on 11-12 November 2022. Day one took the form of a half-day workshop aimed at PhD/ECR researchers. It focused on imparting knowledge about both how to research historical figures and how to write sporting biographies. There were three sessions in the workshop:

  1. Margaret Roberts: It’s not what you research… it’s the way that you research it: that’s what gets results
  2. Helena Byrne: Examining sports history through digitised and born digital resources
  3. Turlough O’Riordan & Terry Clavin: Writing sporting lives

The slide deck and speaker notes on ‘Examining sports history through digitised and born digital resources’ are now available in the British Library Shared Research Repository under a CC BY 4.0 Attribution licence. 

The running time for this session was 70 minutes, so many of the slides were discussed only briefly to allow more time for the activity phase of the workshop. The slides accompanying the notes can be edited by anyone to suit different session lengths. If more time is available, more of it can be spent exploring the different options discussed in the slides. As there was limited time in this workshop, no live demos were given during the presentation. The workshop focused on the subject of sport, but it could be adapted to suit any subject area.

For more general web archiving training materials at a beginner level, please see the International Internet Preservation Consortium (IIPC) Training Materials page: https://netpreserve.org/web-archiving/training-materials/  

The agenda for this session covered: 

  • Warm Up Activity
  • Digital Resources
  • Digitised Newspapers
  • Web Archives
  • Hackathon – Preserve Irish sporting heritage online. 
  • Wrap Up Activity

The session mostly focused on using web archives and only briefly covered digitised newspapers because this was covered in more depth in the first session led by Margaret Roberts.

What sport(s) do you study - word cloud

The warm-up activity collected anonymous information on what type of academic background the workshop participants were from, what their general level of awareness of web archives was, and in particular their awareness of the UK Web Archive. Participation in this activity was optional and not all participants responded to every question. Most of the participants came from a history background, while others came from subjects including English Literature, Law and Sports Management, or were independent researchers studying a wide variety of sports.

There were twelve responses to the question ‘Do you understand the difference between the terms digitised and born digital?’. Six respondents replied ‘yes’, while three said ‘no’ and three said ‘not sure’. The difference between these two terms was clarified in the ‘Digital Resources’ section of the presentation. More in-depth user studies on web archive research conducted by Healy et al. (2022) and Costea (2018) have highlighted that researchers are often confused about the difference between a digital library/digital archive, a database and a web archive.

There were thirteen responses to the question ‘Have you ever used a web archive?’. Six respondents replied ‘yes’, while four said ‘no’ and three said ‘not sure’. There were twelve responses to the question ‘Have you ever used the UK Web Archive?’. Four respondents replied ‘yes’, while six said ‘no’ and two said ‘not sure’.

DIY Web Archiving Strategies - logos of several web archiving companies

The session highlighted different ways that researchers could use DIY web archiving techniques to mitigate the impact that link rot and content drift could have on their research.

In the hackathon part of the session, participants were asked to use some of the DIY web archiving strategies discussed to preserve Irish sporting heritage. Participants could choose from two options:

  1. Add online content used in your research to the relevant web archives. 
  2. Review what web content has already been preserved from your area of study in the UK Web Archive Sports Collections. Then select online content from the web to nominate to the UK Web Archive.

Although approximately 25 minutes were available at the end of the presentation for this activity, it really needs more time and, if possible, pre-workshop preparation to get the best results.

To wrap up this session, participants were asked two questions about how likely they were to use web archives in their research. Both used a scale from 1 (very unlikely) to 5 (very likely). Firstly, participants were asked ‘How likely are you to use a web archive as a resource for your research?’. Seven participants answered this question and the aggregated response was 4.4. Secondly, eight participants responded to the question ‘How likely are you to save content you view online in a web archive?’, and the aggregated response was 3.4.

Although the workshop elicited only a small sample of results, they show that there is an interest in using web archives in academic research, not just as a reference source but as a way of managing online citations in the field of sports studies. It would be beneficial to the research community if those teaching research methods classes could incorporate web archive training into their classes. The training materials published through the British Library Shared Research Repository can be adapted to suit any subject area.

References:

Healy, S., Byrne, H., Schmid, K., Bingham, N., Holownia, O., Kurzmeier, M., & Jansma, R. (2022). Skills, Tools, and Knowledge Ecologies in Web Archive Research. WARCnet Special Report. Aarhus, Denmark: WARCnet, https://cc.au.dk/fileadmin/dac/Projekter/WARCnet/Healy_et_al_Skills_Tools_and_Knowledge_Ecologies.pdf    

Costea, M.-D. (2018). Report on the Scholarly Use of Web Archives. Aarhus, Denmark: NetLab. Retrieved 2019-08-30, from http://netlab.dk/wp-content/uploads/2018/02/Costea_Report_on_the_Scholarly_Use_of_Web_Archives.pdf