UK Web Archive blog

138 posts categorized "Web/Tech"

16 January 2023

UK Web Archive Technical Update - Winter 2022

By Andy Jackson, Web Archive Technical Lead, British Library

This is a summary of what’s been going on since the update at the start of the autumn.

2022 Domain Crawl Completion

As in previous years, the 2022 Domain Crawl continued to run right up until the end of the year. Overall, things ran smoothly, with only brief outages for upgrading the virtual server over time as the size of the frontier grew.

Graph showing the UK web archive 2022 annual domain crawl

Because we’re running on the cloud, we are paying for how much compute capacity, RAM and disk space we’re using. So, when the crawl is young and the Heritrix3 frontier database is small, it makes sense to use a small computer. But as the crawl frontier grows, so does the amount of RAM the crawler needs to manage the frontier, so we scale up as we go.

This is one of the reasons we spent time making it possible to configure the frontier database so more house-keeping and clean-up processes are run while the crawl is running. This helps Heritrix clear disk space after it has dealt with URLs, and led to significant savings. The 2020 crawl ended up using 45TB of disk space to store the crawl state, and deleting old ‘checkpoint’ files (which can be used to revert the crawl state to a previous point in time) did not help free up more space. But after changing those configuration options, the 2021 and 2022 crawls only needed 15TB of space, and deleting checkpoints was much more effective.

2023 Domain Crawl Planning

We originally moved to the cloud to relieve pressure on the BL networks as staff switched to remote working during the pandemic. But even when COVID restrictions were eased, the library has continued to support staff working remotely where possible. Fortunately, over the last year the library has upgraded many of the network systems across both the London and Boston Spa sites, which means we now have permission to run the 2023 crawl on site.

As there is still some uncertainty as to how this will affect other network users, we are planning to begin the crawl much earlier in the year (perhaps as early as February). This gives us more time to revisit our options if something goes awry.

Internal Collections API

Working with the Archives of Tomorrow project to understand their requirements, we now have an internal API where W3ACT metadata can be downloaded for entire collections, including all sub-collections and target site metadata. Authenticated W3ACT users can retrieve these full collection extracts (including unpublished collections), which are updated daily. The JSON files are available to logged-in users.
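To give a feel for working with a nested extract like this, here is a minimal Python sketch that walks a collection, its sub-collections and their targets. The field names (`title`, `targets`, `sub_collections`) and the sample data are assumptions made for illustration; they are not the actual W3ACT export schema.

```python
from typing import Iterator

# Hypothetical extract shape: these field names are illustrative assumptions,
# not the real W3ACT export schema.
collection = {
    "title": "Example Collection",
    "targets": [{"title": "Site A", "urls": ["https://a.example/"]}],
    "sub_collections": [
        {
            "title": "Example Sub-collection",
            "targets": [
                {"title": "Site B", "urls": ["https://b.example/"]},
                {"title": "Site C", "urls": ["https://c.example/"]},
            ],
            "sub_collections": [],
        }
    ],
}


def iter_targets(coll: dict, path: tuple = ()) -> Iterator[tuple]:
    """Yield (collection path, target title) for a collection and all its descendants."""
    here = path + (coll["title"],)
    for target in coll.get("targets", []):
        yield here, target["title"]
    for sub in coll.get("sub_collections", []):
        yield from iter_targets(sub, here)


rows = list(iter_targets(collection))
for coll_path, title in rows:
    print(" / ".join(coll_path), "->", title)
```

A recursive generator keeps the sketch independent of how deeply sub-collections are nested.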

The public version of the API is in the final stages of development, and should be released early in 2023. Unlike the internal API, this will not include collections that are not yet ready for publication.

W3ACT 2.3.4

Just a few days ago, W3ACT 2.3.4 was released. This included a number of tweaks and bugfixes, including correcting the CSV export feature and adding more export formats (TSV and JSON). For more details, please take a look at the associated release milestone.

There was also an issue with how W3ACT data was used, meaning the subdomains of sites with open access licences were being given the same licence as the ‘parent’ domain. This has now been resolved and access is consistent with the data in W3ACT.
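As an illustration of how this kind of over-matching can arise, the following Python sketch contrasts a suffix-based host check with an exact one. It is not the actual W3ACT or access-service code: the host list, the function names and the idea that licences are keyed by host are all assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical data: hosts that have an open access licence recorded against them.
LICENSED_HOSTS = {"example.org.uk"}


def host_of(url: str) -> str:
    """Return the lower-cased host part of a URL, or an empty string."""
    return (urlsplit(url).hostname or "").lower()


def is_open_access_suffix(url: str) -> bool:
    """Over-broad check: every subdomain inherits the parent's licence."""
    host = host_of(url)
    return any(host == h or host.endswith("." + h) for h in LICENSED_HOSTS)


def is_open_access_exact(url: str) -> bool:
    """Stricter check: only hosts that are themselves recorded as licensed."""
    return host_of(url) in LICENSED_HOSTS


sub = "https://blog.example.org.uk/post"
print(is_open_access_suffix(sub), is_open_access_exact(sub))
```

In this sketch the subdomain passes the suffix check but not the exact one, which is the difference between inheriting the parent's licence and following only what has been recorded.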

Document Harvester Outage

From the 12th of December onwards, the Document Harvester had stopped picking up GOV.UK documents properly. This appears to have stemmed from some edits carried out in W3ACT, where the Watched Target that covered the GOV.UK document publication service was merged with the main GOV.UK Target (which was not Watched). This meant the crawler was no longer looking for documents from GOV.UK.

We made the GOV.UK Target into a Watched Target, and then cleared the relevant crawl logs for re-processing. Those logs have now been processed and the missed documents have been identified.

We’re looking at how this happened and will take steps to prevent this happening in the future.

The Application Support team has been working with the Networks team and our Legal Deposit Library partners to start to roll out an initial ‘alpha’ service across all sites. This will help all library staff to try out the system and lay the foundations for a ‘beta’ service in reading rooms. The Project Manager has also been working hard to understand the likely timeline for the project and communicate this to all stakeholders, while keeping the project management triangle in mind.

Additionally, we’re working on setting up a suitable Continuous Deployment pipeline for this service using GitLab CI/CD. This will allow us to analyse, test and safely deploy new versions of the access service without having to manage the system by hand.

CDX Backfill

One of the critical components of the web archive is the content index (CDX), which is an index of all the URLs we have archived, and is required for playback to work. Ours runs on OutbackCDX (from the National Library of Australia), and a subset of its functionality is available via our API.
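To show why playback depends on the index, here is a minimal Python sketch that parses one line in the classic 11-field CDX layout (`N b a m s k r M S V g`). The sample line is made up, and whether our index exposes exactly these fields is an assumption. The point is that each entry maps a URL and timestamp to a WARC file name and byte offset, which is what a playback tool needs to find the archived record.

```python
from typing import NamedTuple


class CdxRecord(NamedTuple):
    urlkey: str      # canonicalised (SURT-style) form of the URL
    timestamp: str   # 14-digit capture time, YYYYMMDDhhmmss
    original: str    # URL as it was crawled
    mimetype: str
    status: str
    digest: str
    redirect: str
    meta: str
    length: int      # compressed record length in bytes
    offset: int      # byte offset of the record within the WARC file
    filename: str    # WARC file that holds the record


def parse_cdx_line(line: str) -> CdxRecord:
    """Split a space-separated 11-field CDX line into named fields."""
    parts = line.strip().split(" ")
    if len(parts) != 11:
        raise ValueError(f"expected 11 fields, got {len(parts)}")
    return CdxRecord(*parts[:8], int(parts[8]), int(parts[9]), parts[10])


# Hypothetical sample line, for illustration only.
sample = (
    "uk,org,example)/ 20220101120000 http://example.org.uk/ text/html 200 "
    "AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHH - - 1234 5678 example-crawl.warc.gz"
)
record = parse_cdx_line(sample)
print(record.filename, record.offset, record.length)
```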

In the past, we’ve had problems running large CDX indexing jobs, and this had left us in an unfortunate situation where the 2016, 2018 and 2019 domain crawls were not indexed. During the last few months, we modified the indexing process to (re)process our WARCs and ‘backfill’ the index, which has filled in those gaps.

This also showed that we could process our entire collection (i.e. over 1PB) in a reasonable time (roughly three months depending on the precise workload), which is reassuring. It will likely be necessary to re-build indexes from time to time, and it’s good to know it should be possible to do so in a reasonable amount of time. Also, the act of reading every byte of every WARC is an additional explicit proof that the files have been kept safe over all these years! We know HDFS has been systematically monitoring the files over time, but it’s nice to run an independent check.
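As a rough back-of-the-envelope check on those figures, the short Python calculation below estimates the sustained read rate implied by processing 1PB in about three months. The decimal petabyte and the 90-day duration are simplifying assumptions; the real collection is "over 1PB" and the time taken depends on the workload.

```python
# Assumptions for illustration: a decimal petabyte and a 90-day run.
PETABYTE = 10**15          # bytes
DAYS = 90                  # "roughly three months"

seconds = DAYS * 24 * 60 * 60
bytes_per_second = PETABYTE / seconds
terabytes_per_day = PETABYTE / DAYS / 1e12

print(f"{bytes_per_second / 1e6:.0f} MB/s sustained")
print(f"{terabytes_per_day:.1f} TB/day")
```

Under these assumptions the job needs to sustain roughly 129 MB/s, or about 11 TB per day, across the whole run.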

The 2020, 2021 and 2022 domain crawls will have to wait a little longer, as they are stored on Amazon Web Services and need transferring to the British Library before they can be indexed.


Finally, we’re proud to be part of the IIPC project Browser-based Crawling For All, which contributes to the development of Browsertrix Cloud and attempts to ensure IIPC members can take advantage of it. As part of this, we proposed two sessions for next year’s IIPC conference, both of which have been accepted:

  • A workshop called Browser-Based Crawling For All: Getting Started with Browsertrix Cloud, aimed at helping attendees take advantage of Browsertrix Cloud. We’re particularly interested in uncovering barriers that might prevent adoption.
  • A panel called Browser-Based Crawling For All: The Story So Far, giving an insight into the current state of the project and of Browsertrix Cloud (including any feedback from the workshop).

Hoping to see you there!

12 January 2023

Changes in Nature’s Calendar – Early Bloomers

The Importance of Citizen Science in Monitoring and Adapting to Climatic Change

By Andrea Deri, Cataloguer and UKWA Climate Change Collection’s lead curator

On 1 January 2023, I had my usual walk from Folkestone Gardens via Sue Godfrey Nature Park, Deptford, London Borough of Lewisham to Greenwich Park, Royal Borough of Greenwich. Overcast, temperature in single digit, humid but calm. Trees and shrubs mostly leafless: an accentuating background to patches of bright green mosses.

I was hoping to see some flowers on winter-blossoming plants, for example the bell-shaped flowers of clematis ‘Jingle Bell’ in St Alfege Church’s yard, and the spidery flowers of witch hazels in the Royal Observatory Garden in Greenwich. I was also curious what other flowers I would find blooming earlier than usual, triggered by the warming climate. A month earlier (1 December 2022) I had joined the annual wildflower ‘hunt’ on the first day of winter, a survey of species in flower in my locality, Deptford’s urban area, organised since 2009 by the Creekside Education Trust and the London Natural History Society, so I expected several early bloomers. Here is Creekside’s blog post of the 2021 wildflower survey. 

While the witch hazels (Fig. 1) did not disappoint, I was in for a surprise with clematis “Jingle Bell”: only the silky, fluffy seedheads were left, as it finished flowering earlier this year. I was lucky to see its last flowers on Christmas Eve 2022 (Fig. 2). Other early flowers greeted me on a hazelnut shrub in Sue Godfrey Nature Park (Fig. 4). But I was truly astonished to see daffodils fully opened in a park by Creekside, just across from the Creekside Discovery Centre (Fig. 3).

Witch hazel flower

Figure 1 Witch hazel (Hamamelis sp.) in flower. Photo: Andrea Deri, Royal Observatory Garden, Greenwich, London, 1 January 2023

I started searching for phenology calendars, almanacs, and any information on the blooming time of these species in my local and other areas in order to compare my observations with the “expected” (based on previous years) flowering periods. The online findings supported my assumption: I did observe earlier than expected flowerings, with the most specific data for the hazelnut.

Clematis ‘Jingle Bell’ 
According to the Royal Horticultural Society (RHS) clematis “Jingle Bell” flowers in winter and early spring. Compared to this broad-brush period, my observation this year suggests this individual specimen finished flowering much earlier than expected and earlier than I had observed this specimen in previous years. 

Clematis flower

Figure 2 Clematis cirrhosa “Jingle Bells” one bell-shaped flower and fluffy seedheads. Photo: Andrea Deri, St Alfege Church, Greenwich, London, 24 December 2022

A post on the Daffodil Society prompted me to do a search on RHS’s website for daffodils where February-March was quoted as the usual flowering period. More precise than for the clematis. Early flowering daffodil horticultural varieties, however, can bloom as early as January, stated one of the Gardeners World blogposts. I may have encountered an early flowering daffodil garden variety. In addition to its literary associations, this iconic flower may have just now become also a conversation starter about the climate crisis. Would its freshness and brightness frame a difficult dialogue in hope? 

Daffodil flowers

Figure 3 Daffodils (Narcissus sp.) in flower. Photo: Andrea Deri, near Creekside Discovery Centre, Deptford, London, 1 January 2023

The Woodland Trust Nature’s Calendar offered me the tool I had really been looking for: a peer-reviewed database linked to a live map that allowed me to compare my observation with those of fellow observers in the UK at day-level precision.

Hazelnut flower

Figure 4 Hazelnut (Corylus avellana) in flower: crimson female flowers, yellow catkin male flowers. Photo: Andrea Deri, Sue Godfrey Nature Park, Deptford, London, 1 January 2023

Before I signed up to add my hazelnut observation, I took a screenshot of the “Add a Record” webpage on 5 January 2023 that showed the first hazelnut flower sighting on 4 January 2023. (Fig.5.)

Screenshot of Woodland Trust 'Nature's Calendar' website

Figure 5 Screenshot of Nature's Calendar, Woodland Trust. Photo: Andrea Deri, 20:34 GMT, 5 January 2023

Hazelnut first flowering was among the recently recorded data of Nature’s Calendar (Fig. 5). My observation of hazelnut flowers on 1 January 2023 was not extraordinary, but it was earlier than the one featured online. Hazelnut is expected to be in flower in early January according to Nature’s Calendar (downloadable pdf). But as early as 1 January? To answer this question, I had to register to enter my data. When I entered my observation date, I received an automatic note, all in red: 

“This date falls outside of the expected range

The date you have entered is unusually early or late for this species and event; please double check the record. If it’s correct we’d like to know more about your observation, so please add a comment before clicking ‘next’ to continue. If possible, a photo is very useful too. Please note that your record will not appear on the live map until it has been checked by the Nature’s Calendar team.”

For evidence, I uploaded one of my photos of the hazelnut flowers (Fig.4.) and a description of the place and circumstances. My hazelnut flowering observations may turn out to be some of the earliest this year. To prove or refute this statement I rely on the Woodland Trust’s online database, the Nature’s Calendar team’s peer-review and keen monitoring of fellow citizen scientists. This type of on-land & online live collaboration in monitoring the slightest phenological changes is gaining increasing importance in addressing local impacts of climatic changes.

Will hazelnut flower earlier and earlier in the future? Only regular visitors can answer this question by carefully monitoring the same hazelnut shrub, recording the date of the first flowers and uploading the data to Nature’s Calendar.

Nature’s Calendar invites citizen scientists to monitor a carefully selected list of species of shrubs, trees, flowers, grasses, fungi, birds, insects and amphibians throughout the year. Their changes over time will give us information on how these species (plants, animals and fungi) adapt to the unfolding climatic changes. Phenological change data contributes to better decisions in wildlife conservation, among others.  

While I was browsing, I came across several websites and webpages on various other decisions and local actions related to climate change adaptation. For example: What can I do about climate change in my garden?  What local residents are doing in the boroughs of Lewisham and Greenwich about the climate crisis:  Climate Action Lewisham, Climate Home – a home of creativity, imagination and community activism by young people, Lewisham Climate Action Bond as an example of Local Climate Bonds, Lewisham Climate Emergency Declaration and Action Plan, CAPE Informing Local Action on Climate Change / London Borough of Lewisham, The Climate Emergency website of Royal Borough of Greenwich, Carbon Neutral Greenwich, Greenwich Climate Network. 

Some of the activities and organisations were familiar to me; I was taken aback by others: ‘How could I miss them? I live here!’ A fast-changing landscape of actions and online information. Having saved these sites for my further actions, I also realised some of this online content could be highly ephemeral. Uploading my list of URLs to the UKWA Climate Change collection saved local digital content for future research on climatic changes.  

Sauntering through streets, gardens and parks has turned into an archival journey, connecting past, present and future. Fit for the first day of the year. Fit for any day, anywhere your interest, experience and local knowledge cross climatic changes.  

The Natural History Museum’s community science webpage lists a broad range of UK wildlife monitoring activities related to climatic changes, including the New Year Plant Hunt of the Botanical Society of Britain and Ireland and the upcoming annual Big Garden Birdwatch (27-28 January 2023) organised by the Royal Society for the Protection of Birds since 1979. 

Contribute to the web archive
Your next walk or online stroll may spark you to nominate some of your local climate initiatives (civil society, governmental, business, media, arts and academia) to the UK Web Archive Climate Change Collection. Many thanks for your consideration. 

12 December 2022

Examining sports history through digitised & born digital resources

By Helena Byrne, Curator of Web Archives, The British Library

The Irish Sporting Lives workshop and symposium took place at the Ulster University campus in Belfast from 11-12 November 2022. Day one took the form of a half-day workshop aimed at PhD/ECR researchers. It focused both on imparting knowledge about how to research historical figures and how to write sporting biographies. There were three sessions in the workshop:

  1. Margaret Roberts: It’s not what you research… it’s the way that you research it: that’s what gets results
  2. Helena Byrne: Examining sports history through digitised and born digital resources
  3. Turlough O’Riordan & Terry Clavin: Writing sporting lives

The slide deck and speaker notes on ‘Examining sports history through digitised and born digital resources’ are now available in the British Library Shared Research Repository under a CC BY 4.0 Attribution licence. 

The running time for this session was 70 minutes, therefore, many of the slides were discussed only briefly to allow more time for the activity phase of the workshop. The slides accompanying the notes can be edited by anyone to suit different session lengths. If more time is available, more time can be spent on exploring the different options discussed in the slides. As there was limited time in this workshop, no live demos were given during the presentation. The workshop focused on the subject of sport, but it could be adapted to suit any subject area. 

For more general web archiving training materials at a beginner level, please see the International Internet Preservation Consortium (IIPC) Training Materials page.

The agenda for this session covered: 

  • Warm Up Activity
  • Digital Resources
  • Digitised Newspapers
  • Web Archives
  • Hackathon – Preserve Irish sporting heritage online. 
  • Wrap Up Activity

The session mostly focused on using web archives and only briefly covered digitised newspapers because this was covered in more depth in the first session led by Margaret Roberts.

What sport(s) do you study - word cloud

The warm-up activity collected anonymous information on what type of academic background the workshop participants were from, what their general level of awareness of web archives was, and in particular their awareness of the UK Web Archive. Participation in this activity was optional and not all participants responded to every question. Most of the participants came from a history background, while others came from subjects including English Literature, Law and Sports Management, or were Independent Researchers who research a wide variety of sports. 

There were twelve responses to the question ‘Do you understand the difference between the terms digitised and born digital?’. Six respondents replied ‘yes’, while three said ‘no’ and three said ‘not sure’. The difference between these two terms was clarified in the ‘Digital Resources’ section of the presentation. More in-depth user studies on web archive research conducted by Healy et al. (2022) and Costea (2018) have highlighted that there is often confusion amongst researchers on the difference between a digital library/digital archive, a database and a web archive.

There were thirteen responses to the question ‘Have you ever used a web archive?’. Six respondents replied ‘yes’, while four said ‘no’ and three said ‘not sure’. There were twelve responses to the question ‘Have you ever used the UK Web Archive?’. Four respondents replied ‘yes’, while six said ‘no’ and two said ‘not sure’.

DIY Web Archiving Strategies - logos of several web archiving companies

The session highlighted different ways that the researchers could use DIY web archiving techniques to mitigate the impact that link rot and content drift could have on their research. 

In the hackathon part of the session, participants were tasked with using some of the DIY web archiving strategies discussed to preserve Irish sporting heritage. Participants could choose from two options: 

  1. Add online content used in your research to the relevant web archives. 
  2. Review what web content has already been preserved from your area of study in the UK Web Archive Sports Collections. Then select online content from the web to nominate to the UK Web Archive.

Although there were approximately 25 minutes available at the end of the presentation for this activity, it would really need more time and, if possible, pre-workshop preparation to get the best results. 

To wrap up this session, participants were asked two questions about how likely they were to use web archives in their research. Firstly, on a scale of 1 meaning very unlikely to 5 very likely, participants were asked ‘How likely are you to use a web archive as a resource for your research?’. Seven participants answered this question and the aggregated response was 4.4. Secondly, eight participants responded to the question ‘How likely are you to save content you view online in a web archive?’. This was also a scale question with 1 meaning very unlikely to 5 very likely, and the aggregated response was 3.4. 

Although the workshop elicited a small sample of results, they show that there is an interest in using web archives in academic research, not just as a reference source but as a way for managing online citations in the field of sports studies. It would be beneficial to the research community if those teaching research method classes could incorporate web archive training into their classes. The training materials published through the British Library Shared Research Repository can be adapted to suit any subject area.


Healy, S., Byrne, H., Schmid, K., Bingham, N., Holownia, O., Kurzmeier, M., & Jansma, R. (2022). Skills, Tools, and Knowledge Ecologies in Web Archive Research. WARCnet Special Report. Aarhus, Denmark: WARCnet.

Costea, M.-D. (2018). Report on the Scholarly Use of Web Archives. Aarhus, Denmark: NetLab. Retrieved 2019-08-30.

07 December 2022

Pride and Visibility in the LGBTQ+ Lives Online Collection

By Ash Green, CILIP LGBTQ+ Network, and Goldsmiths, University of London

The LGBTQ+ Lives Online UK Web Archive collection currently holds over 600 sites, web pages, blogs etc focused on the LGBTQ+ experience of people in the UK. Community and the coming together of individuals is a key aspect of the LGBTQ+ experience, and this is particularly reflected in sites acting as networks; focused on Pride events; and visibility and remembrance days such as Bi Visibility Day, Lesbian Visibility Week, Trans Day of Remembrance, International Day Against Homophobia, Biphobia and Transphobia. These events, networks and days are there to support the community; remind others outside the community we are part of, that we exist; that we celebrate who we are; that the need to highlight and address inequalities continues to remain important despite LGBTQ+ people having existed for millennia.

Pride march with rainbow flags
Gotta Be Worth It from Pexels

Examples of sites in the UK Web Archive under some of these banners include: LGBT Mummies (aiming to support LGBT+ women & people globally on the path to motherhood or parenthood); London Gaymers (a safe place for the LGBT gaming community in London and across the UK to connect with like minded individuals); African Rainbow Family (a non-for-profit charitable organisation that supports lesbian, gay, bisexual, transgender intersexual and queer (LGBTIQ) people of African heritage and the wider Black Asian Minority Ethnic groups); Pride Sports (a focus on increasing participation in sport by lesbians, gay men, bisexual and transgender people as well as the wider community). As you can see from the examples given, many of the informal networks are focused on where other aspects of an individual’s life overlap with being an LGBTQ+ person.

We also have Pride sites archived within the collection, including both local (Pride In Surrey, Glasgow’s Mardi Gla, York Pride) and nationwide (LGBTQYMRU) events. Before the pandemic they were mainly face-to-face events, but between 2020 and 2022, there was an increase in online events as many sought to keep LGBTQ+ people connected in a safe way.

We would like to build the collection of UK sites focused around Pride and awareness/visibility days. We don’t limit our collection of sites to big organisations only – as we have said before, all LGBTQ+ content is welcome, including personal content if it is published in the UK. And even though we would like to develop the areas of the collection highlighted above, we are also still happy to receive submissions around any aspects of LGBTQ+ Lives Online. So, if you know of any online content you think we should be archiving within this collection please nominate it here.

Under the Non-Print Legal Deposit Regulations 2013, the UKWA can archive UK-published websites, but is only able to make the archived version available to people outside the Legal Deposit Libraries’ Reading Rooms if the website owner has given permission. The UK Legal Deposit Libraries are the British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries, Cambridge University Libraries and Trinity College Dublin Library. If you’re curious about what is in the LGBTQ+ collection you can browse through it here.

01 December 2022

History on the move: Curating a collection on the Queen’s Platinum Jubilee

By Daniela Major, PhD Student, School of Advanced Studies, University of London

Note: This blog post was written before the death of Her Majesty Queen Elizabeth II. The Jubilee collection has documented the end of an extraordinary reign and will hopefully serve as a basis for future researchers to understand this historical moment.

Before I started my placement at the UK Web Archive, my project idea was to build a collection about the History of London. I had thought it would give me an opportunity to delve into history blogs and history websites, and to explore how people interpret historical events; it was, however, a Jubilee year, and the opportunity came up instead to curate a collection about this very modern event, which would, moreover, unfold as I built the collection.

Queen's Platinum Jubilee 2022 logo in english and welsh

The particular challenges of this exercise were very attractive to someone who still considers herself an historian. It is fairly straightforward to build a collection about events that have gone past and that have been analysed by countless historians. It is a very different thing to curate a collection about events that are happening, whose consequences remain unknown. In this sense, the Queen’s Platinum Jubilee was a great opportunity because in many ways Queen Elizabeth II already belongs to History. It is entirely possible to historicise her existence and her years in power. It is also possible to use her reign as a way to look into the making of modern Britain and modern Europe, as she was present through many key historical moments in the last 70 years.

A priority which was defined early on was representing different parts of the UK, rather than focusing only on the big cities. We looked into how towns, villages and cities were celebrating the Jubilee, what events they were organizing, where street parties would take place and how councils involved local communities in the celebrations. From a geographical representation came the necessity to represent different voices and opinions, both from the UK and the Commonwealth. It was vital the collection didn’t turn out to be laudatory. Future researchers would be interested in knowing whether there was resistance to the monarchy and whether consensus was real or fabricated.

As with so many questions in History, the answer is both yes and yes. Yes, there is resistance, but yes there is genuine and even widespread appreciation for the Queen.

For the majority of my academic career, I have looked to the past to study it. Historians are used to questioning the archives. We have to question the silences and the omissions, we have to remember who created records, who kept them, and why. Curating this collection placed me firmly on the other side of these interrogations. I was the one deciding what should go into the collection, what should be kept for posterity. The web is vast, content is being produced every minute of every hour. It is not conceivable to include everything. The responsibility is enormous, but it made me all the more aware of the need to hear different sides, so as to not exclude voices which have often been silenced in the past.

The Web affords researchers the possibility to glimpse into facets of life and points of view that many previous historical records have omitted. It is a rich source with enormous democratic potential, and one which will become even more essential in the years to come; it must be protected and looked after. The work that web archivists do, and that I have been privileged enough to take part in, is vital to safeguard the history of the present and the future.

View the Queen's Platinum Jubilee, 2022 collection

Also the Queen's Diamond Jubilee, 2012 collection 

Queen's platinum jubilee collection screenshot

30 November 2022

If Websites Could talk - Part 5

By Hedley Sutton, Team Leader, Asian & African Studies Reference Services

Check out previous episodes in this series - Part 1, Part 2, Part 3 and Part 4.

Over a year has passed since we last eavesdropped on the ongoing debate among U.K. domain websites as to which of them deserves to be recognised as the most extraordinary site of all. 

“We think we should be considered,” said *Heritage Cast Iron Radiators*. “We’re not a site that you come across every day.”

Screenshot of the Carrotworkers collective website

“Agreed, but you could surely say the same about us,” retorted the *Carrotworkers’ Collective*. “What do you reckon, *Angelfish Opinions*?”

There was no response, the latter being in deep conversation about matters piscine with the *Catfish Study Group*.

“Let’s hear it for the mammals!” cried *Platypus Research*. “You’re with us, *Led by Donkeys*, are you not? And you, *Absolute Dogs*? Not quite sure if you count, *Hatching Dragons*?”

“We insects always get overlooked,” muttered the *British Bee Veterinary Association*.

“We know how you feel,” commiserated *Polly Parrot Rescue UK*.

“What about us?” said the *UK Soft Power Group*. “Our charm, our intelligence …”

“Look, we want to take this tired debate to a whole new dimension,” said the *Quantum Communications Hub*. “With the help of the *Cosmic Shambles Network*, nothing can possibly stop us!”

“That’s not quite fair,” said the *Tuneless Choir*. “If you’re going to work together on your bid, then we might well hook up with the *London Vegetable Orchestra*”.

“Wait a minute – two can play at that game,” said the *Museum of Human Kindness*. “Can’t they, *Empathy Museum*?”

Fortunately at this point the *Centre for Effective Dispute Resolution* made a useful suggestion. It was decided that the fairest way forward was for candidate sites to first contact the *UK Anonymisation Network*, and then let the *Academy of Experts* make the final choice.  

And thus it came to pass that the chosen site was … *Much Better Adventures*.

03 November 2022

Calling All Digital Preservers!

By Andy Jackson, Web Archive Digital Lead, British Library

World Digital Preservation Day logo -WDPD2022

The digital preservation community is small and under-resourced. This means we must work together if we want to make the biggest impact. To this end, a small group of us have been attempting to help the members of the digital preservation community better support each other. As it is World Digital Preservation Day, we'd like to encourage you all to (re)discover what we've built so far.

If you'd like to help, we'd love to hear from you....

  • What have we missed from the Awesome List?
  • Can you answer any of the unanswered DigiPres questions? Do you need to ask questions of your own? Are there old questions and answers on mailing lists that need a more visible home, so others can find them again?
  • Can you contribute to the COPTR Tool Registry?
  • Are these resources useful? Should we change our approach?

The last one is really important. We've been in digital preservation long enough to see a lot of portals and projects come and go, and we recognize that making it possible to build on past work sometimes requires changing what we've built so far.

Please get in touch if you have any questions. You could talk to us directly via Twitter or Mastodon, or use the discussion forums. We're happy to hear any and all ideas!

In particular, in the last few weeks, the homepage has been modified and the Awesome List has been set up, based on community feedback. Now would be a great time to get some feedback on what we've been doing!

Thanks for reading, and thanks to everyone who has contributed so far.

Andy Jackson (@anjacks0n) & Paul Wheatley (@prwheatley), on behalf of all the contributors.

With thanks to the Open Preservation Foundation for hosting many of these resources, and to the Digital Preservation Coalition for their support.

18 October 2022

UK Web Archive Technical Update - Autumn 2022

By Andy Jackson, Web Archive Technical Lead, British Library

This is a summary of what’s been going on since the update at the start of the summer.

Website Refresh
On 16 August 2022 we relaunched the UK Web Archive website, although you might not have noticed!

The previous version of the website treated page content like it was software, so updating what the pages said was far too difficult. This quarter, we finally got to release some changes we’d made so that most of the website pages are statically generated from Markdown source held on GitHub, using Hugo. This means we could add in a content management system called NetlifyCMS, which should make editing and translating the pages of our site much easier.

We’ve taken care to match the old website presentation and carefully overlay the new system while falling back on the old system for more complex dynamic pages. You might notice some minor differences to the styling between the two, if you look closely…

An important part of this was our automated accessibility testing. While accessibility evaluation cannot be fully automated, these tools help us manage the process of making changes to our website and minimise the risks of making things worse in time periods between full accessibility evaluations.

Computer server and cables

2022 Domain Crawl Launch
As the British Library networks are in the final stages of being upgraded, 2022 is the last year we expect to run the domain crawl on Amazon Web Services.

We launched the 2022 crawl on the 17th August 2022, and since the British Library is now a member of Nominet we were able to use an up-to-date list of UK domains as our starting point.
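To illustrate what using a list of domains as a starting point can involve, here is a minimal sketch that turns bare domain names into crawl seed URLs. The function name, the `http://` scheme and the comment-line convention are assumptions for illustration, not a description of the actual UKWA tooling.

```python
def domains_to_seeds(domains):
    """Turn an iterable of bare domain names into de-duplicated seed URLs.

    Hypothetical helper: the scheme and normalisation rules are assumptions.
    """
    seen = set()
    seeds = []
    for raw in domains:
        # Normalise case, surrounding whitespace and any trailing root dot.
        domain = raw.strip().lower().rstrip(".")
        # Skip blank lines and comment lines.
        if not domain or domain.startswith("#"):
            continue
        # Keep only the first occurrence of each domain.
        if domain in seen:
            continue
        seen.add(domain)
        seeds.append(f"http://{domain}/")
    return seeds


if __name__ == "__main__":
    for seed in domains_to_seeds(["Example.co.uk\n", "", "example.co.uk", "bl.uk."]):
        print(seed)
```

Normalising before de-duplicating means the same domain listed with different capitalisation only produces one seed.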

So far, we’ve processed around 500 million URLs, totaling over 20TiB of data (uncompressed).

However, we’ve noticed what seems to be an uptick in systems like fail2ban automatically mis-reporting our crawler activity as abusive behaviour. This means we have to put more work into managing our relationship with AWS, and has slowed things down a bit. Nevertheless, we expect the crawl to run successfully until the end of the year, as in previous years.

Hadoop Replication
After many weeks of steady progress, our replica Hadoop storage service is now pretty much at capacity. Filling the thing up with about one petabyte of content took a while, but it’s been taking us a bit longer to be sure we’ve double-checked the transfer worked.
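One common way to double-check a bulk transfer is to compare checksums recorded on each side. The sketch below assumes each cluster can produce a mapping of file path to checksum; the function names and manifest shape are hypothetical, and it is not necessarily how the UKWA replication is verified.

```python
import hashlib


def sha256_of_chunks(chunks):
    """Hash an iterable of byte chunks, so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def compare_manifests(source, replica):
    """Compare two {path: checksum} mappings and report discrepancies.

    Returns the paths missing from the replica and the paths whose
    checksums differ. Extra files on the replica are not reported.
    """
    missing = sorted(p for p in source if p not in replica)
    mismatched = sorted(
        p for p in source if p in replica and source[p] != replica[p]
    )
    return {"missing": missing, "mismatched": mismatched}


if __name__ == "__main__":
    source = {"a.warc.gz": "1", "b.warc.gz": "2", "c.warc.gz": "3"}
    replica = {"a.warc.gz": "1", "b.warc.gz": "x"}
    print(compare_manifests(source, replica))
```

An empty `missing` list and an empty `mismatched` list together indicate that every source file has a matching copy on the replica.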

We are now awaiting a decision on whether we can purchase another server for this cluster, so we can make sure there’s room for the most recent crawls, and for content we expect to get in the near future. Either way, we’ll then start to plan shifting the hardware up to the National Library of Scotland.

Exporting Collection Metadata
Working with the Archives of Tomorrow project, we’ve been developing a way to export our collection metadata so it’s more suitable for reuse.

Having real use cases drive the work has been useful, and over the next weeks we’re hoping to integrate the outputs into the UKWA API so anyone can use that data.

Legal Deposit Access & NPLD Player
Working with Webrecorder we’ve seen some good progress on a new version of PyWB that supports direct rendering of PDFs and ePubs, and on the secure player application that will be used to provide access in some reading rooms.

Much of the work has focussed on the challenges around testing and preparation for a new version of a service that works across multiple independent institutions. But it’s been good to start to get some user feedback on how the system works in practice, which has already flushed out some additional requirements for the first release.

iPres 2022
As covered in this dedicated blog post, iPres 2022 included a presentation partly based on lessons learned from managing the technical aspects of the UK Web Archive. The plan is to publish a longer version of that work later in the year.

Major Outage
After the successes of the iPres conference, we were quickly brought back down to earth by a severe hardware failure on the 25th of September. One of the network switches failed, and the whole UKWA dedicated network locked up in a way that made it difficult to understand and route around the failure.

This took a while to diagnose and resolve, so we moved some critical components onto other machines so our curators and users could use our services. While this was relatively successful, it also showed that some of our automated tasks need breaking down so that different functions can be managed independently. For example, we need crawl launches to be able to proceed even if nothing else is running. These problems meant that our daily crawling activity was delayed and patchy for most of last week.

These complications mean it’s taken a bit longer than expected to undo all the interim changes that were made during the hardware outage. However, as of last week, everything is back to normal.
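The point about breaking automated tasks down so that functions can be managed independently can be sketched as follows. This is only an illustration of the general idea: the task names are hypothetical and it is not the actual UKWA task framework.

```python
def run_independently(tasks):
    """Run each named task on its own, recording failures instead of aborting.

    `tasks` maps a task name to a zero-argument callable.
    """
    results = {}
    for name, task in tasks.items():
        try:
            task()
            results[name] = "ok"
        except Exception as exc:
            # Isolate the failure so that later tasks still get to run.
            results[name] = f"failed: {exc}"
    return results


def update_index():
    # Hypothetical task that depends on a service that is down.
    raise RuntimeError("storage offline")


def launch_crawl():
    # Hypothetical task with no dependency on the failed service.
    pass


if __name__ == "__main__":
    print(run_independently({"update_index": update_index, "launch_crawl": launch_crawl}))
```

Here the crawl launch still runs even though the indexing step fails, which is the behaviour a single monolithic job would not give.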
