THE BRITISH LIBRARY

UK Web Archive blog

6 posts from October 2020

26 October 2020

The 1916 Easter Rising Web Archive

By Brendan Power, Digital Preservation Librarian, Library of Trinity College Dublin

Logos of the three Legal Deposit Libraries involved in the collaboration: Bodleian Libraries, Trinity College Dublin and the British Library

At the recent conference, ‘Engaging with Web Archives: Opportunities, Challenges and Potentialities’, I presented a paper on a collaborative project between The Library of Trinity College Dublin (the University of Dublin), the Bodleian Libraries (University of Oxford), and the British Library. The project was carried out in 2015/16 and aimed to identify, collect, and preserve online resources related to the 1916 Easter Rising and the diverse ways it was commemorated and engaged with throughout its centenary in 2016. The Bodleian Libraries primarily collected UK websites under the provisions of the 2013 Non-Print Legal Deposit Regulations (NPLD), while The Library of Trinity College Dublin focused on websites in the .ie domain. Since no legislation exists in the Republic of Ireland to ensure that the .ie domain is preserved, websites within the .ie domain were collected on a voluntary basis, that is, with the express formal permission of the website owners through the signing of a license agreement.

 

We aimed to reflect the variety of ways that the Irish and British states, cultural and educational institutions, as well as communities and individuals, approached the centenary events. These included official commemorative websites, the websites of museums, archives, heritage, cultural, and education institutions, along with traditional and alternative news media websites, blogs, and community websites. These will be invaluable primary sources for analysing how people interpreted and engaged with the Easter Rising in its centenary year. Researchers have reflected on the events organised on the fiftieth anniversary of the Easter Rising in 1966 and how these events were framed, the aspects that were championed, and the critical viewpoints denied expression. In a similar way, the records created throughout the centenary will be an essential resource for researchers in analysing how the generations of 2016 engaged with the legacy of the Easter Rising and the approaches, themes, and tone adopted.

 

The resulting web archive collection contains over 318 seeds, i.e. websites or sub-sections of these. Of these 318 websites, 112 (35%) were selected by The Library of Trinity College Dublin, 190 (60%) by the Bodleian Libraries, and 16 (5%) by curators at the British Library. 118 (37%) of the websites were from the .ie domain, 172 (54%) were from the .uk domain and 28 (9%) were associated with other areas, predominantly the USA. For all websites outside the UK (146), formal permission was sought from the website owners, resulting in 61 licenses to archive and make the archived copies publicly available. We received no response from 83 website owners, and 2 organisations agreed in principle to inclusion in the web archive but were not in a position to sign the license agreement required to allow us to archive the website as they could not affirm that they controlled the copyright of all the content that was to be archived. This meant an overall permissions rate of 42%, with the rate for websites in the .ie domain being even higher, at 51%.
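
The percentages above can be re-derived from the counts given in the paragraph. The short Python sketch below does this as a worked check; the function and variable names are illustrative only and are not part of the project. The 51% rate for the .ie domain cannot be recomputed here because the number of .ie licences is not stated in the text.

```python
def pct(part, whole):
    """Return part/whole as a percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


TOTAL_SEEDS = 318

# Counts exactly as stated in the paragraph above.
by_selector = {"Trinity College Dublin": 112, "Bodleian Libraries": 190, "British Library": 16}
by_domain = {".ie": 118, ".uk": 172, "other": 28}
permission_outcomes = {"licensed": 61, "no response": 83, "unable to sign": 2}

# Websites outside the UK are the .ie sites plus those from other areas.
outside_uk = by_domain[".ie"] + by_domain["other"]

selector_shares = {name: pct(n, TOTAL_SEEDS) for name, n in by_selector.items()}
domain_shares = {name: pct(n, TOTAL_SEEDS) for name, n in by_domain.items()}

# Overall permissions rate: licences received over non-UK websites approached.
permission_rate = pct(permission_outcomes["licensed"], outside_uk)

print(selector_shares)
print(domain_shares)
print(outside_uk, permission_rate)
```

Run as written, this reproduces the figures in the text: shares of 35%, 60% and 5% by selecting library, 37%, 54% and 9% by domain, 146 websites outside the UK, and an overall permissions rate of 42%.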

 

Since the project was completed there have been many helpful reminders of the impact that such work has. These included one organisation that had created a website dedicated to an Easter Rising project which was no longer live on the web. The person who had been responsible for the website had left the organisation and their replacement had no access to the materials that had been on the website. They discovered an e-mail I had sent in 2016 inviting them to participate in the web archive. Once they contacted me, I was able to direct them to the UK Web Archive and, as the organisation had signed the license agreement, they were able to access the archived website immediately from their office. This access saved them the time and staff resources that would have been needed to recreate some of the resources that were available on the archived website. It serves as an example of what embedding sustainability into a project can save in terms of time and staff resources, and demonstrates the positive economic impact that organisations can derive from participation in cultural heritage initiatives such as web archives.

 

The co-curators of this collection have also previously published a paper on the collection, ‘Capturing commemoration: the 1916 Easter Rising web archive project’, in the academic journal Internet Histories.

You can watch Brendan Power’s presentation on the EWA YouTube Channel.

 

21 October 2020

The UK Web Archive and Wimbledon; A Winning Combination

By Robert McNicol, Kenneth Ritchie Wimbledon Library, Wimbledon Lawn Tennis Museum

 

Wimbledon Lawn Tennis Museum Logo

 

Opened in 1977, the Kenneth Ritchie Wimbledon Library, part of the Wimbledon Lawn Tennis Museum, is the most comprehensive collection of tennis publications in the world. We hold books, periodicals, programmes and other publications from more than 90 different countries.

As with everything at Wimbledon, we are always looking for ways to evolve and improve how we do things. That’s why we were delighted to team up with the UK Web Archive to put together a curated collection of tennis websites. The Tennis collection sits within the Sports Collection (Ball Sports Excluding Football) section of the UK Web Archive Sports Collection.

So far, we have added over 70 sites to the Tennis collection but ultimately the aim is to archive all UK-based tennis websites. This includes websites of tennis clubs, governing bodies and media, as well as the websites and social media feeds of individual players. We have already added the Twitter feeds of all world-ranked British players to the collection.

Social media archiving is an area we are particularly interested in and we have been experimenting with using Webrecorder to archive social media feeds to a level not possible on the UK Web Archive. We have recently conducted several trials, using both the manual and auto-pilot functions of Webrecorder to archive the Wimbledon Twitter and Instagram feeds. We have had mixed results from these pilot projects and would be interested in comparing notes with any other organisations that have used Webrecorder to perform social media archiving.

As well as social media feeds, we have been using Webrecorder to archive our own website, Wimbledon.com, which, as a particularly dynamic website, the UK Web Archive struggles to capture fully. Wimbledon.com is this year celebrating its 25th anniversary and by archiving it regularly we will be able to save the information contained in it for researchers of future generations. In the same way, we have also been trialling the archiving of our AELTC Intranet site, Wimbledon Insider.

We’ve greatly enjoyed our collaboration with the UK Web Archive so far and are very grateful for the web archiving advice that they have provided. We hope that our tennis expertise has also been of benefit to the UK Web Archive and the British Library. We look forward to working together for many years to come.

If you would like to nominate a tennis website to be archived, please fill in the public nomination form on the UK Web Archive website or get in touch with me at rmcn@aeltc.com. We’d love to hear from you.

You can watch Robert McNicol’s presentation on the EWA YouTube Channel.

 

19 October 2020

Exploring media events with Shine

By Caio Mello, Doctoral Researcher at the School of Advanced Study, University of London

Computer screen with some HTML code on the screen

This blog post is a summary of the presentation I delivered with my colleague Daniela Major at the conference ‘Engaging with Web Archives: Opportunities, Challenges and Potentialities’ in September 2020. The presentation was entitled ‘Tracking and analysing media events through web archives’.

My research explores the media coverage of the Olympic Games in a cross-cultural, cross-lingual and temporal perspective. I am especially interested in comparing how the concept of 'Olympic legacy' has been approached by the Brazilian and British media considering different locations, languages and social-political contexts. I have written a bit about this before on the UK Web Archive blog in December 2019 and March 2020.

Because of its controversial nature, the term Olympic legacy is used in a variety of contexts and has multiple meanings. Considering its narrative importance in legitimizing the multi-billion investment that cities make to host these events, the main objective of this study is to explore and define the concept of Olympic legacy and how it changes over time.

Here, however, I will focus on my experience of a secondment at the British Library with the UK Web Archive team, where I explored the potential of using the platform Shine to track news articles on Olympic legacy.

Why Shine?

Shine is a tool to explore .uk websites archived by the Internet Archive between 1996 and April 2013. While a big part of the content of the UK Web Archive can only be accessed from inside the British Library, Shine is open access and provides us with search results and URL data that can be easier to manage.

We have developed a pipeline based on five steps: searching, extraction, cleaning, filtering and visualisation. To extract information, we scraped the data using Python notebooks, looking at specific newspapers (like The Guardian) and broadcast websites (like the BBC) using the keyword “Olympic legacy”. Once the URLs have been searched for in Shine and the results extracted, the main challenge is cleaning. Cleaning consists of removing all the information around the main text, such as images, adverts, menus and links. After extracting just the body text of the articles, we saw that many of them did not mention Olympic legacy, because Shine often returns results where the search terms appear only in peripheral locations of the webpage. With the documents we needed in hand, we had to verify whether their content was relevant to our analysis. Sometimes the term Olympic legacy appears but is not necessarily related to the Rio and London Olympics, or it is not the main topic of the article. The process of filtering demanded a huge effort of close reading to identify contexts. At the end, we produced some charts to visualise word trends and topics that pop up around legacy. Although the Shine search results are limited in terms of time (they only run up to 2013), Shine has been very useful as an exploratory tool for conducting preliminary analysis on a small scale, and for building web archive and web scraping methods before applying them to huge amounts of text elsewhere.
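
To make the cleaning and filtering steps more concrete, here is a minimal Python sketch using only the standard library. It is not the notebook code used in the project: the list of "peripheral" tags, the class and function names, and the two sample pages are all assumptions made for illustration. It keeps only text inside paragraph elements that sit outside menus, headers, footers and similar containers, and then checks whether the search phrase survives in that body text.

```python
from html.parser import HTMLParser

# Containers treated as peripheral (menus, adverts, scripts).
# This list is an assumption for illustration, not the project's actual rule set.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class BodyTextExtractor(HTMLParser):
    """Collect text inside <p> elements that are not within a peripheral container."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # greater than 0 while inside a peripheral container
        self.in_paragraph = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p":
            if self.in_paragraph:
                self.chunks.append(" ")   # keep adjacent paragraphs apart
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.chunks.append(data)


def clean(html):
    """Cleaning step: return the body text with whitespace normalised."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def mentions(text, phrase="olympic legacy"):
    """Filtering step (first pass): does the cleaned text contain the phrase?"""
    return phrase in text.lower()


# Hypothetical page where the phrase appears only in a menu and a footer.
page = """<html><body>
<nav><a href="/sport">Olympic legacy</a></nav>
<p>The council discussed local bus routes.</p>
<footer><p>More on Olympic legacy</p></footer>
</body></html>"""

# Hypothetical page where the phrase appears in the article body.
article = "<html><body><p>Debate over the <b>Olympic legacy</b> in east London continues.</p></body></html>"

for name, html in [("page", page), ("article", article)]:
    print(name, mentions(clean(html)))
```

The first sample page matches a raw keyword search but is dropped once only the body text is considered, which is the situation described above. A check like this can only narrow the candidate set; deciding whether an article is really about the Rio or London legacy still requires the close reading described in the filtering step.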

You can watch Caio de Castro Mello Santos & Daniela Cotta de Azevedo Major’s presentation on the EWA YouTube Channel.

*This project has received funding from the European Union’s Horizon 2020 research and innovation programme. For more information: cleopatra-project.eu.

 

14 October 2020

Engaging with Web Archives - Conference Report

By Jason Webber, Web Archive Engagement Manager, The British Library

 

Engaging with Web Archives conference banner

 

Is it possible to have a successful conference when you can no longer meet in person? Going exclusively online doesn’t seem to have stopped the ‘Engaging with Web Archives’ (EWA) Conference from being a superb experience. The co-chairs of the event were Sharon Healy and Michael Kurzmeier, PhD students at Maynooth University.

Originally planned as a more traditional in-person conference in April 2020, the event was re-planned by the EWA team as a completely online conference on 21 and 22 September 2020. It is notable that this was the first web archiving conference in Ireland. Most talks were pre-recorded, which meant that questions could be posed in the chat box and were often answered live by the presenter during the talk. This can be a significant advantage of pre-recorded talks.

The programme was packed with high quality presentations from many areas of web archiving but here I’ll highlight a few that were UK Web Archive (UKWA) projects or used UKWA data. 

 

Highlights

 

A keynote talk, ‘Web archives as sites of collaboration’, was delivered by Professor Jane Winters (School of Advanced Study, University of London). Jane has worked with the UK Web Archive extensively over many years and is one of only a few professors in the UK promoting web archives to students and training them in their use. Jane's talk (link to YouTube).

 

Sara Day Thomson (University of Edinburgh): Developing a Web Archiving Strategy for the Covid-19 Collecting Initiative at the University of Edinburgh. Sara formerly worked for the Digital Preservation Coalition (DPC), led a ‘Web Archiving Task Force’ and, more recently, has been building important collections on Covid-19 with the University of Edinburgh in partnership with UKWA. Sara's talk (link to YouTube).

 

Dr. Brendan Power (The Library of Trinity College Dublin): Leveraging the UK Web Archive in an Irish context: Challenges and Opportunities. With Trinity College Dublin being a UK Legal Deposit Library, we try to work together as much as possible, and this talk highlights what is possible, with specific mention of the Easter Rising collection. Brendan's talk (link to YouTube).

 

Robert McNicol (Kenneth Ritchie Wimbledon Library): The UK Web Archive and Wimbledon: A Winning Combination. We try to represent as many aspects of UK life as possible including sport. This also highlights our cooperation with other libraries and archives. See the Tennis collection. Robert's talk (link to YouTube).

 

Dr. Peter Webster (Independent Scholar, Historian and Consultant): Digital archaeology in the web of links: reconstructing a late-90s web sphere. Peter has conducted several pieces of research utilising the UKWA secondary datasets. These are free and available for download. Peter's talk (link to YouTube).

 

Helena Byrne (Curator of Web Archiving, British Library): From the sidelines to the archived web: What are the most annoying football phrases in the UK? Helena is a curator in the UK Web Archive but also has a keen interest in sport and women’s football in particular. Here, Helena shows how the Trends feature (graphs) in our SHINE service can help guide research in an easy and accessible way. Helena's talk (link to YouTube).

 

Caio de Castro Mello Santos & Daniela Cotta de Azevedo Major (School of Advanced Study, University of London): Tracking and Analysing Media Events through Web Archives. Caio was on a PhD student placement with UKWA as part of the Cleopatra project. Read about some of his work on Olympic legacy on this blog. Caio and Daniela's talk (link to YouTube).

 

Hannah Connell (King’s College London; British Library): Curating culturally themed collections online: The Russia in the UK Special Collection, UK Web Archive. Hannah has worked extensively on collecting for one of the several diaspora community collections. In addition to Russia in the UK, there are London French and Latin America UK. Hannah's talk (link to YouTube).

 

Dr. Jessica Ogden (University of Southampton) & Emily Maemura (University of Toronto): A tale of two web archives: Challenges of engaging web archival infrastructures for research. Jessica has also worked previously with UKWA on a PhD placement looking at the challenges researchers face when using web archives. This vital work helps guide our planning for the future. Jessica and Emily's talk (link to YouTube).

 

Dr. Olga Holownia (International Internet Preservation Consortium): IIPC: training, research, and outreach activities. Olga works full time for the IIPC but has been based within the UK Web Archive team at the British Library. We have been delighted to have worked with and been supported by the IIPC since it began (The British Library is a founding member).

 

Rosita Murchan (Public Record Office of Northern Ireland): PRONI Web Archive: A Collaborative Approach. PRONI maintains their own web archive but also collaborates with the UK Web Archive in collecting material specific to Northern Ireland. This is important as there currently is no Legal Deposit partner in Northern Ireland. Rosita’s talk (link to YouTube).

 

Summary

Whilst it is a shame not to meet people in person, this conference has shown me how online conferences can be a viable way forward. I’m very much looking forward to the next one.

 

See all of the pre-recorded talks on the EWA conference YouTube Channel. You can find Engaging with Web Archives on Twitter and catch up on the conference discussion with the hashtag #EWAvirtual

 

Look out for more in-depth blog posts from EWA conference speakers over the coming weeks on the UK Web Archive blog.

 

07 October 2020

Safeguarding the Digital Legacy: the UK Web Archive is a finalist for the 2020 Digital Preservation Awards

By Ian Cooke, Head of Contemporary British Published Collections at the British Library

2020 Digital Preservation Awards logo

 

Here at the UK Web Archive we are very excited and proud to have been named a finalist for the 2020 Digital Preservation Awards, in the ‘Safeguarding the Digital Legacy’ category.

Alongside the other finalists, we presented at the #WeMissiPRES conference on 23 September. We only had a few minutes, so our ‘lightning talk’ went by in a flash. Here is a slightly extended version of our presentation.

This year, the UK Web Archive celebrates its 15th anniversary. It is 15 years since we first made public an online interface to our newly created Web Archive. It’s important to us that we date from that point as, all through our 15 years, access has been a core part of what we do, and drives how we think about preservation.

Anniversaries are important, because they offer us a point to look back, to give us a longer-term perspective on our work, but also because they prompt us to think about our values as well as our legacy.

So, thinking about our values, preservation and legacy, we want to talk about three things that we are really proud of:

 

The content matters

This has led us in everything else. Communication on the web is primarily about us, about the people and communities that we share our lives with. Preservation of the web matters, because it is vital to how we understand ourselves now, and how we understand our recent past. From our beginnings, we have made the case that the web is not trivial – it should be valued – and we continue to make that case. We do this by creating thematic collections, which put the focus on the subject not the form; by talking publicly and online about our work; and by working with researchers to understand what the archived web can tell us.

Being led by the content can result in complex and innovative technological interventions, such as the continued monitoring and refinement of our domain crawls to ensure that we are as comprehensive as we can be.

It is also about policy and engagement. It’s about making sure we understand the content, and the people creating it. We reach out to communities and groups to help create collections, and this is something we understand better as we have grown. We do this by partnering with specialist archives or community groups, or through public calls for co-operation. An example currently is our LGBTQ+ lives collection, where we are working with the LGBTQ+ network of the Chartered Institute of Library and Information Professionals in the UK and also have been using social media to call for content.

 

We work collaboratively

This has been at the heart of the UK Web Archive, which has always existed as a collaborative venture between organisations – now linking the six Legal Deposit Libraries of the UK. We also engage with our peer institutions, to learn and share experience. Collaboration is vital to build and maintain the capacity that all institutions active in web archiving need to meet the preservation challenges presented by the live web. A key part of that has been the International Internet Preservation Consortium (IIPC), where we are proud to be the host for the Programme and Communications Officer. As well as participating in conferences, workshops and hackathons, we regularly take part in the ‘Online Hours: Supporting Open Source’ calls, which are dedicated to ensuring that the IIPC’s open source initiatives are truly open to members.

We work collaboratively also with researchers, both in collection-building and in research projects using the archived web. Working with researchers helps us to understand ‘real life’ challenges, and inspires the way we build our services and communicate about them. We are immensely proud of our role in the ‘Big UK Domain Data for the Arts and Humanities’ project, which helped us build our ‘Shine’ analytical tool for full-text indexes. More recently, we have been working on research in economic geography – using our postcode data set; and with researchers from the Alan Turing Institute, to understand how our data can be used to analyse word value change over time.

Research use of the UK Web Archive has developed over time. An early, and enduring use, has been a ‘close reading’ of websites. This approach may look at one or a small number of websites and study the content, layout and functionality in detail. Sometimes these studies have a longitudinal aspect, looking at change over time. Our user interface helps researchers find individual websites, or groups of websites, that are relevant to their study. This approach has been supplemented by other research methods which attempt to understand a much larger body of content at scale. This research uses tools and data to understand communication and behaviour on the web. These methods can be mutually supportive, with the results of computational analysis of the web providing supporting context for a close reading of a small number of sites.

 

We work openly

From the start, we have seen access as a vital part of our preservation work. Access helps us to validate the preservation actions that we have taken, and supports wider advocacy for the preservation of born-digital content. We seek permissions to make selected web content more openly available, and look to use existing licences to make other content available. We currently do this with content released under Open Government Licences. We also work to make sure that the data we generate about our collections is available, whether that is the full-text indexes that can be searched in our User Interface, or datasets that we have generated from earlier crawls of the UK domain. Earlier this year, we worked with the National Library of Australia, National Library of New Zealand and the historian Tim Sherratt, to develop tools (using Jupyter Notebooks) that could be re-used to analyse our openly accessible data.

Looking ahead, we want to review and update our curatorial tools to support collaboration and collection building. We want to understand what the barriers are to using the archived web in research, and share more information to help researchers understand our collections. Linked to this, we are developing a research engagement plan, which will make sure that our collections and services continue to develop to meet identified needs.

So, as we look back over our 15-year history, these are three of the things that make us proud, and will continue to inspire us: understanding the value of our collections, working in partnerships and connecting our users and public with our collections. These are values that we know we share with the wider Digital Preservation community, so we are very grateful for this chance to join the celebration.

 

You can watch all of the presentations from this category on the #WeMissiPRES conference YouTube Channel.

 

01 October 2020

Request for Information: Metadata Management Tool for the UK Web Archive

By Helena Byrne, Curator of Web Archives at the British Library

 

What is a Request for Information (RFI)? 

A Request for Information (RFI) is not a tender opportunity, but is part of a market consultation exercise aiming to ensure that the procurement route selected and the options ultimately developed for any procurement are properly informed.

At the conclusion of this RFI process, the information gathered may be used to assess potential suppliers and service offers and produce a shortlist for invitation to tender or procurement under a Government Framework Arrangement. At this stage, no final decision has been taken on the precise procurement route to be followed.

 

The UK Web Archive RFI

The UK Web Archive (UKWA) is a collection shared by the six UK Legal Deposit Libraries: the British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries Oxford, Cambridge University Library and the Library of Trinity College Dublin. UKWA aims to archive, preserve and give access to the historic UK web space. This is achieved through annual domain crawls, the first of which was undertaken in 2013, and more frequent crawls of key websites and specially curated collections which date back as far as 2005. These collections reflect important aspects of British culture and events that shape society.

The UKWA team based at the British Library is seeking to acquire a metadata management tool or set of tools to integrate with our web archive services. This will support the description of websites and web pages in our archive, the creation of topic-based collections and the participation of non-specialists in describing our archived web records. The intention is for this tool to handle the metadata associated with our web archiving services rather than the technical aspects of crawling and storing web content.

Our current Annotation Curation Tool (ACT) covers many functions. However, as the collection has grown in size and the system has aged, some of these features have become difficult to manage and response times to enquiries can be very slow. As a result, the system is becoming more difficult to use, and some basic functions are almost impossible to execute. ACT is a bespoke tool, and in this RFI we are looking to explore off-the-shelf options that can be adapted to suit our requirements and that can be easily modified as these requirements change over time.

 

RFI Timeline

Set out below is the proposed RFI timetable. This is intended as a guide and, whilst the British Library does not intend to depart from this timetable, it reserves the right to do so at any time.

Publish RFI: 1st October 2020

Initial Responses returned by: 6th November 2020

Shortlist and Clarifications: 13th November 2020

Presentations (via video conference): 26th November 2020

RFI Concludes and feedback provided: 10th December 2020

 

British Library e-Tendering Service


To ensure that your organisation is involved in this project at this early stage of engagement please provide details of the most appropriate contact within your organisation’s business development team – ideally your business development director or similar – to allow us to invite them into the 001599 online Request For Information (RFI) process. Please send a named contact email address to tony.cole@bl.uk at your earliest convenience.