UK Web Archive blog

5 posts from October 2022

18 October 2022

UK Web Archive Technical Update - Autumn 2022

By Andy Jackson, Web Archive Technical Lead, British Library

This is a summary of what’s been going on since the update at the start of the summer.

Website Refresh
On 16 August 2022 we relaunched the UK Web Archive website, although you might not have noticed!

The previous version of the website treated page content like it was software, so updating what the pages said was far too difficult. This quarter, we finally got to release some changes we’d made so that most of the website pages are statically generated from Markdown source held on GitHub, using Hugo. This means we could add in a content management system called NetlifyCMS, which should make editing and translating the pages of our site much easier.

We’ve taken care to match the old website presentation and carefully overlay the new system while falling back on the old system for more complex dynamic pages. You might notice some minor differences to the styling between the two, if you look closely…

An important part of this was our automated accessibility testing. While accessibility evaluation cannot be fully automated, these tools help us manage the process of making changes to our website and minimise the risks of making things worse in time periods between full accessibility evaluations.

Computer server and cables

2022 Domain Crawl Launch
As the British Library networks are in the final stages of being upgraded, 2022 is the last year we expect to run the domain crawl on Amazon Web Services.

We launched the 2022 crawl on 17 August 2022, and since the British Library is now a member of Nominet, we were able to use an up-to-date list of UK domains as our starting point.

So far, we’ve processed around 500 million URLs, totalling over 20TiB of data (uncompressed).
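For a rough sense of scale, the two figures quoted here can be combined into an average size per URL. The sketch below treats both numbers as approximations (about 500 million URLs and 20 TiB), so the result is an order-of-magnitude estimate only; the function and constant names are illustrative and not part of any UKWA tooling.

```python
# Approximate figures taken from the paragraph above.
URLS_PROCESSED = 500_000_000
TOTAL_BYTES = 20 * 2**40  # 20 TiB, using binary (1024-based) units


def average_bytes_per_url(total_bytes: int, url_count: int) -> float:
    """Return the mean number of uncompressed bytes stored per URL."""
    return total_bytes / url_count


average = average_bytes_per_url(TOTAL_BYTES, URLS_PROCESSED)
print(f"Roughly {average / 1024:.0f} KiB per URL")
```

On these figures the average works out at roughly 43 KiB per URL.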

However, we’ve noticed what seems to be an uptick in systems like fail2ban automatically mis-reporting our crawler activity as abusive behaviour. This means we have to put more work into managing our relationship with AWS, and has slowed things down a bit. Nevertheless, we expect the crawl to run successfully until the end of the year, as in previous years.

Hadoop Replication
After many weeks of steady progress, our replica Hadoop storage service is now pretty much at capacity. Filling the thing up with about one petabyte of content took a while, but it’s been taking us a bit longer to be sure we’ve double-checked the transfer worked.
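One common way to double-check a bulk transfer is to compare checksums of every file on both sides. The sketch below illustrates that general technique only: it assumes both copies are visible as ordinary directory trees on a local filesystem, which is not how a Hadoop cluster is normally accessed, and the function names are hypothetical rather than anything used by the UK Web Archive.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large archive files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(source_root: Path, replica_root: Path) -> list[str]:
    """List files that are missing from the replica or differ from the source."""
    problems = []
    source_files = sorted(p for p in source_root.rglob("*") if p.is_file())
    for source_file in source_files:
        relative = source_file.relative_to(source_root)
        replica_file = replica_root / relative
        if not replica_file.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(source_file) != sha256_of(replica_file):
            problems.append(f"checksum mismatch: {relative.as_posix()}")
    return problems
```

An empty list from `find_mismatches` means every source file has a byte-identical copy in the replica. At petabyte scale the hashing itself takes a long time, which is one reason verification can take longer than the copy.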

We are now awaiting a decision on whether we can purchase another server for this cluster, so we can make sure there’s room for the most recent crawls, and for content we expect to get in the near future. Either way, we’ll then start to plan shifting the hardware up to the National Library of Scotland.

Exporting Collection Metadata
Working with the Archives of Tomorrow project, we’ve been developing a way to export our collection metadata so it’s more suitable for reuse.

Having real use cases drive the work has been useful, and over the next weeks we’re hoping to integrate the outputs into the UKWA API so anyone can use that data.

Legal Deposit Access & NPLD Player
Working with Webrecorder we’ve seen some good progress on a new version of PyWB that supports direct rendering of PDFs and ePubs, and on the secure player application that will be used to provide access in some reading rooms.

Much of the work has focussed on the challenges around testing and preparation for a new version of a service that works across multiple independent institutions. But it’s been good to start to get some user feedback on how the system works in practice, which has already flushed out some additional requirements for the first release.

iPres 2022
As covered in this dedicated blog post, iPres 2022 included a presentation partly based on lessons learned from managing the technical aspects of the UK Web Archive. The plan is to publish a longer version of that work later in the year.

Major Outage
After the successes of the iPres conference, we were quickly brought back down to earth by a severe hardware failure on the 25th of September. One of the network switches failed, and the whole UKWA dedicated network locked up in a way that made it difficult to understand and route around the failure.

This took a while to diagnose and resolve, so we moved some critical components onto other machines so our curators and users could use our services. While this was relatively successful, it also showed that some of our automated tasks need breaking down so that different functions can be managed independently. For example, we need crawl launches to be able to proceed even if nothing else is running. These problems meant that our daily crawling activity was delayed and patchy for most of last week.

These complications mean it’s taken a bit longer than expected to undo all the interim changes that were made during the hardware outage. However, as of last week, everything is back to normal.

07 October 2022

The UEFA Women’s EURO 2022 Arts and Heritage Programme

by Caterina Loriggio, UEFA Women’s EURO Arts and Heritage Lead

Jan Lyons (Manchester Corinthians) and Gail Redston (Manchester City) looking at the 1921 Ban. Part of Trafford's heritage programme. Photo by Rachel Adams for UEFA WEURO 2022 heritage programme

The UK Web Archive has been collaborating with the UEFA Women’s EURO 2022 Arts and Heritage Programme to develop the UEFA Women's Euro England 2022 web archive collection. In this guest blog post, we hear about the wider arts and heritage programme around the tournament from Caterina Loriggio.

The UEFA Women’s EURO 2022 arts and heritage programme was designed to promote community engagement, develop cultural leadership, support health and wellbeing, reinforce civic pride and to support local economies post-pandemic. Host City partners (Rotherham, Sheffield, Trafford, Wigan, Manchester, Milton Keynes, Brent, Hounslow, Brighton, and Southampton) were all keen to amplify the opportunity the tournament provided to engage and inspire their residents and visitors.

The £3m programme was supported by National Lottery players through Arts Council England and National Lottery Heritage Fund grants and through funding from the Host Cities. It included four arts commissions, eight museum/archive exhibitions, eight outdoor exhibitions, heritage outreach and education programmes, 45 memory films and new online content covering the history of the women’s game. The project also researched for the first time the full line-up of all the women who have played for England over the past 50 years. Many of those women will be honoured at Wembley Stadium on October 7th in front of a sell-out crowd, when they will take a lap of honour during half time in the England v USA match.

It was the first time The FA had ever delivered a cultural programme. A key priority for The FA is to establish female role models for both girls and boys. When Host City partners requested a cultural programme to support the tournament, the Association saw that this could be a great opportunity to further fulfil this objective. It was also clear that partnering with cultural organisations in Host Cities, and with national institutions such as the UK Web Archive and British Library, would be a great way to promote the UK’s cultural sector and a very effective tool to capture, for the first time on a national scale, the hidden history of women’s football.

Prior to writing funding applications, I led, with the support of the Football Supporters’ Association, four online fan consultations to ensure the programme spoke to the wants of women’s football fans. We also commissioned the organisation ‘64 Million Artists’ to lead half-term virtual workshops for young people aged 12 – 18 in Host Cities (many of whom played football). The fans and young people’s feedback was shared with artists, archivists and curators and was clearly reflected in all elements of the programme. The fans were clear that they could ‘never get enough history’.

Archives and contemporary collecting played an important part in the heritage programme. It was apparent that many stories of women’s football (fans as well as players) had been lost already, and that women who had played during the ban (1921-1970) were of an age that if we did not collect their stories now, then there was a real risk that they might never be captured. As well as collecting physical objects for museums and archives like caps, pennants, and programmes, there was a significant degree of online archiving. Many of the Host Cities created online exhibitions, hosted films and imagery on digital archive platforms, and digitally captured objects which retired footballers were happy to loan but not donate. Nationally we made 36 memory films live on The FA website. These will be moved to EnglandFootball.com in time for the 50th Anniversary of the Lionesses in November, plus there will be some new content made especially for the anniversary. We were greatly supported in our programme by The National Football Museum and Getty Images, who gave us access to their photography archives and so enriched all our work. We also sought to create content for the future by commissioning Getty photographers and by running fan and young people’s photography campaigns to capture the atmosphere of match day and the fan experience beyond the pitch. Some of these images will be shared in an online Getty Images Gallery to be launched in November.

It is hoped that the learnings from this programme will help to secure cultural content in future UK bids for major sporting events. I hope that archiving and collecting will remain important components in all these future projects.

Related Links
This is the ninth blog post published so far about the women’s Euros; the others can be found on the UK Web Archive blog under the 'sports' tag.

There is still an active call for nominations for the UEFA Women's Euro England 2022 web archive collection. Anyone can suggest UK published websites to be included in the archive by filling in our nomination form.

06 October 2022

WARCnet Special Report: Skills, Tools and Knowledge Ecologies in Web Archive Research, 2022

by Sharon Healy, Maynooth University (Project Lead)

WARST report image - skills, tools and knowledge ecologies in web archive research

The WARST team are delighted to announce the publication of a WARCnet Special Report, titled: Skills, Tools and Knowledge Ecologies in Web Archive Research. This study is part of a collaborative project by researchers from Maynooth University, the British Library, the International Internet Preservation Consortium, Bayerische Staatsbibliothek, and the University of Siegen. The research team are all members of Web ARChive studies network researching web domains and events (WARCnet).

The study focuses on individuals around the globe who participate in web archive research, in the context of web archiving, curation, and the use of web archives and archived web content for research or other purposes. We consider web archive research to be representative of the processes and activities described in Archive-It’s web archiving life cycle model from appraisal, acquisition, and preservation, to replay, access, use and reuse (Bragg & Hannah, 2013).

The methodology for the study entailed desk research, participation in WARCnet meeting discussions, and an online questionnaire. The study sought to identify and document the skills, tools, and knowledge required to achieve a broad range of goals within the web archiving life cycle and to explore the challenges for participation in web archive research, and the interludes of such challenges across communities of practice. We suggest that there is a perpetual need to examine the roles of skills, tools, and methods associated with the web archiving life cycle as long as internet, web and software technologies keep advancing, upgrading, and changing.

The Executive summary offers an overview of the findings, and is translated into Danish, French, Spanish and Catalan.

The Report is available to download from the WARCnet website:

https://cc.au.dk/fileadmin/dac/Projekter/WARCnet/Healy_et_al_Skills_Tools_and_Knowledge_Ecologies.pdf

A section of the Report that focused on the software, tools and methods used in the web archive research life cycle was presented in a poster at iPres 2022.

05 October 2022

iPres 2022 Conference Report from the UK Web Archive

By Helena Byrne, Nicola Bingham, Dr Andrew Jackson, British Library, Eilidh MacGlone, National Library of Scotland and Caylin Smith, Cambridge University Libraries

IPres2022-logo

iPres is the largest international conference on digital preservation. The conference has been held every year since 2004. The 2022 edition was hosted by the DPC in Glasgow. This meant that the official conference website ipres2022.scot was within scope for the UK Web Archive to preserve. You can view the archived version of the website here: 

https://www.webarchive.org.uk/wayback/archive/20220914105705/https://ipres2022.scot/ 
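The archived address above follows the usual Wayback-style replay pattern: a service prefix, a 14-digit capture timestamp, and then the original URL. The sketch below pulls those parts out of the link quoted here. It assumes the timestamp is in `YYYYMMDDhhmmss` form and in UTC, which is the common convention for this kind of replay URL; the function name and regular expression are illustrative only.

```python
import re
from datetime import datetime, timezone

# Prefix, then a 14-digit timestamp, then the original URL.
REPLAY_PATTERN = re.compile(
    r"^(?P<prefix>https?://.+?/)(?P<timestamp>\d{14})/(?P<original>.+)$"
)


def parse_replay_url(url: str) -> tuple[datetime, str]:
    """Split a replay URL into its capture time and the original URL."""
    match = REPLAY_PATTERN.match(url.strip())
    if match is None:
        raise ValueError(f"Not a recognised replay URL: {url!r}")
    captured = datetime.strptime(match["timestamp"], "%Y%m%d%H%M%S")
    return captured.replace(tzinfo=timezone.utc), match["original"]


captured, original = parse_replay_url(
    "https://www.webarchive.org.uk/wayback/archive/20220914105705/https://ipres2022.scot/"
)
print(captured.isoformat(), original)
```

Read this way, the snapshot linked above was captured on 14 September 2022, during the conference week.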

Screenshot of the iPres 2022 conference website

iPres 2022 was held from Monday 12 to Friday 16 September. There was a mix of presentations over the week with workshops, long papers, short papers, poster presentations and lightning talks as well as show and tell sessions in the form of a ‘Bake Off’. On the final day of the conference, there were a number of site visits to organisations that are running a digital preservation programme.

This year’s conference also coincided with the 20th anniversary celebrations of the DPC, as well as the DPC Preservation Awards that are held every two years. In 2020, the UK Web Archive won The National Archives (UK) Award for Safeguarding the Digital Legacy at the virtual Digital Preservation Awards 2020 ceremony.

There are also a number of awards given at iPres in various categories. This year’s winner of the Angela Dappert Memorial Award, established in 2021, was Dr Andrew Jackson, Technical Lead for the UK Web Archive, for his presentation ‘Design Patterns in Digital Preservation: Understanding Information Flows’.

Many UK Web Archive colleagues from the British Library, National Library of Scotland and Cambridge University Library attended the conference both as delegates and presenters. In this blog post they have reported back on their conference experience.

British Library

Dr Andrew Jackson
As well as presenting my Design Patterns paper, I was also involved in a workshop on format registries in digital preservation. Both sessions were well-attended and seemed to go well, and I’m planning to post about both in more detail in the future. 

I particularly enjoyed the session on DNA storage, especially because of Euan Cochrane’s approach: working with a DNA lab at Yale University to independently verify the work being done by Twist Bioscience.  It’s still a long way from being a storage option we can depend on, but it’s starting to look like it might actually happen!

There were a lot of good quality papers but I particularly enjoyed “Monitoring Bodleian Libraries' Repositories with Micro Services” presented by James Mooney. The overall approach was very similar to how I like to work, from the design of the overall architecture (federated monitoring of resources in situ rather than centralised and ingest-driven) to the style of implementation (microservices combined with best-in-class open source service components).

Nicola Bingham
This was the first iPres conference I have attended. I wish I could have been there in person but due to practicalities, I attended online. One of my highlights was the presentation from William Kilbride, in which he stated that one of the aims of the DPC was to build “the social infrastructure of digital preservation” (as opposed to focussing on technical aspects). I think this has always been true but is now more so than ever, especially when it comes to diversifying our archives and enabling communities to have agency in telling their own stories, as articulated by Tamar Evangelista-Dougherty in her keynote.

Another highlight was hearing from Garth Stewart, Head of Digital Records at National Records of Scotland. Garth presented on NRS’s two-year project to ingest and make available Scottish Government Cabinet Records and had practical advice for negotiating the transfer of good quality metadata from the depositors - it’s all about gaining trust and explaining to depositors that the quality of metadata provided impacts the experience of the end users. I was also intrigued that they had the challenge of building and maintaining two access solutions, one for journalist access and one for the public.

A final highlight for me was the long paper, “A Digital Preservation Wikibase” by Kenneth Seals-Nutt of Yale University. Kenneth’s presentation set down the practical steps taken by Yale University Library’s department of digital preservation to implement a Wikibase instance and how this was used to transform a data set related to software into a knowledge base using technologies of the Semantic Web. This is particularly useful to us at the UK Web Archive as we consider the next steps in our web archiving roadmap. 

Helena Byrne
This was my first time attending iPres but I wasn’t able to make it in person so I was delighted that they had an option to join the conference remotely. I was also involved in a collaborative poster presentation with Katharina Schmid (Bayerische Staatsbibliothek) and Sharon Healy (Maynooth University). Our poster ‘Exploring Software, Tools and Methods used in Web Archive Research’ was part of a bigger study that will be published through WARCnet in the coming weeks. 

There were so many great talks, especially around inclusion and diversity in the wider digital preservation field. This along with activism was also a common theme in the three keynotes. These were all very different in scope so it is hard to pick one over the other but I will definitely be watching back over these in the coming weeks and I will share them with colleagues when they are published online.

National Library of Scotland

Eilidh MacGlone
I was grateful to have the opportunity to attend iPres this year. This was my first experience of the conference, and it was a happy one. There were lots of opportunities to meet up with new people and catch up with those I knew from the preservation world. And it was useful! The continuous improvement models are a very handy way to set achievable targets for professionals who are often the only preservationists in their organisation. I know this will be useful to me, even though I am not on my own. I was fascinated to hear about DNA data storage, which although not yet operating at scale, has interesting properties of robustness at room temperature.

You can read more about one of Eilidh’s takeaways from iPres in her blog post - iPres report: a simple workshop exercise using Robust Links.

Cambridge University Library

Caylin Smith
Glasgow 2022 was the second in-person iPres I’ve attended; I previously attended in 2019 when the conference was held in Amsterdam. I was grateful to attend again this year to present about ongoing research as well as catch up with friends and colleagues in the field and meet some new faces. 

Along with Sara Day-Thomson (Edinburgh University Library) and Patricia Falcao (TATE), I led a workshop on the first day of the conference. Titled “Preserving Complex Digital Objects: Revisited”, this workshop picked up on the workshop we gave at iPres in 2019 and focused on supporting the collection management of digital materials for which few or no solutions currently exist. 

There were many great submissions to iPres this year. One paper on the topic of web archiving that stood out to me was “These Crawls Can Talk. Context Information for Web Collections” by Susanne van den Eijkel and Daniel Steinmeier from the KB (National Library of the Netherlands). I’m looking forward to thinking further about their research in the context of web archiving activities at Cambridge University Libraries. 

The next iPres conference will be held in Champaign-Urbana, Illinois in the U.S.A. from September 19 - 22, 2023.

04 October 2022

iPres report: a simple workshop exercise using Robust Links 

By Eilidh MacGlone, Web Archivist, National Library of Scotland

Inspiration at iPres
I had the opportunity to attend iPres 2022, an international conference dedicated to digital preservation. One of the sessions - Robust Links - run by the Digital Preservation Coalition (DPC), really sparked ideas for me. Robust Links offers anyone the opportunity to make links more permanent and less susceptible to 'link rot'. You add a link and it offers several options, one being to link to a 'memento' version of the web page.
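As a minimal sketch of what such a link can look like in a page, the function below builds an HTML anchor carrying the `data-originalurl`, `data-versionurl` and `data-versiondate` attributes described in the Robust Links documentation. The function name is hypothetical, the example reuses the iPres website and its UK Web Archive snapshot mentioned elsewhere on this blog, and the attribute names should be checked against the current Robust Links guidance before relying on them.

```python
from html import escape


def robust_link(original_url: str, memento_url: str, version_date: str, text: str) -> str:
    """Build an anchor that points at the live page and records an archived copy."""
    return (
        f'<a href="{escape(original_url, quote=True)}" '
        f'data-originalurl="{escape(original_url, quote=True)}" '
        f'data-versionurl="{escape(memento_url, quote=True)}" '
        f'data-versiondate="{escape(version_date, quote=True)}">'
        f"{escape(text)}</a>"
    )


link = robust_link(
    original_url="https://ipres2022.scot/",
    memento_url="https://www.webarchive.org.uk/wayback/archive/20220914105705/https://ipres2022.scot/",
    version_date="2022-09-14",
    text="iPres 2022 conference website",
)
print(link)
```

The anchor still works as an ordinary link if no script is loaded; the extra attributes are what allow a Robust Links script on the page to offer the archived version alongside the live one.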

It initially seemed out of reach, a bit too technical; but, listening, I recalled using Glitch. It is a platform which can handle JavaScript and style sheets. I have known about Robust Links for a few years, but it delighted me to have it function in a page I built. This step was valuable to me: it helped me phrase the question I need to ask within my own organisation.

NLS workshop
I was therefore inspired to include Robust Links in this workshop exercise for National Library of Scotland staff. I asked attendees to create another category for an imaginary "Scottish Music collection". I built this with websites we already collect. I was going to share this as a document file, but it became a web page following a quick refresher on HTML. 

Screenshot of the 'scottish music collection' website 

In this way, Robust Links create a kind of distributed collection through “archived near” links without the risk of cutting each other off. Legal deposit items have to be read by one person at a time, which can make a task that shares the same titles a little tricky. It also gives us the chance to talk about how the new categories interact with the original list. Here were our results: 

Screenshot of the results section of the 'scottish music collection' website

It was also a starting point for retrieving information through public directories. These included OSCR, the charities register for Scotland, and the Companies House register. Finally, it is a kind of crowdsourcing exercise. More than a quarter (six out of twenty-one) were not in the archive.

Colleagues gave positive feedback about our workshop, and this exercise. I plan to continue developing the idea and would love to hear from anyone making their own version.