THE BRITISH LIBRARY

UK Web Archive blog

94 posts categorized "Web/Tech"

17 March 2021

Shakespeare in the UK Web

By Jason Webber, Web Archive Engagement Manager, The British Library

It's Shakespeare Week (15-21 March). William Shakespeare is almost certainly the most quoted literary figure in English, and the popularity of his plays and poems endures into the digital age. His work is continually taught, examined, analysed and, most of all, quoted on the internet, often in unlikely places: 'Now is the winter of our discontent', for example, appears on the Butterfly Conservation website.

Shakespeare-butterfly

Most Popular?
What are the most popular Shakespeare quotes? Perhaps unsurprisingly, 'To be or not to be' has far and away the most mentions in our SHINE service, which covers all .uk websites collected 1996-2012 (JISC dataset obtained from the Internet Archive):

Shakespeare quotes 01

Shakespeare quotes from SHINE

If we take away "to be or not to be" this graph looks even more interesting:

Shakespeare quotes 02

Shakespeare quotes from SHINE

Want to try your own Shakespeare quotes in our SHINE service?

  1. Go to the trends page of SHINE: www.webarchive.org.uk/shine/graph
  2. Add a word or phrase into the input box, NOTE: phrases should go in quotes e.g. "all that glisters"
  3. To compare multiple words or phrases, separate them with commas e.g. "william shakespeare", "christopher marlowe", "ben jonson"
  4. Click on any point in the graph to see examples of the context the word or phrase was used
  5. Enjoy!
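
If you have a long list of quotes to compare, the formatting rules in steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration: the function name `build_shine_query` is hypothetical and not part of SHINE, and the result is simply text to paste into the input box on the trends page.

```python
def build_shine_query(terms):
    """Format words and phrases for the SHINE trends input box.

    Hypothetical helper, not part of SHINE: multi-word phrases are
    wrapped in double quotes and entries are separated by commas.
    """
    parts = []
    for term in terms:
        cleaned = term.strip().strip('"')  # drop stray spaces and quotes
        if not cleaned:
            continue  # ignore empty entries
        if " " in cleaned:
            parts.append(f'"{cleaned}"')  # phrases go in quotes
        else:
            parts.append(cleaned)  # single words are left bare
    return ", ".join(parts)


print(build_shine_query(["william shakespeare", "christopher marlowe", "ben jonson"]))
```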

Do let us know your own favourite quotes on Twitter: @UKWebArchive

12 March 2021

University of Edinburgh’s Collecting Covid-19 Initiative: Collaborative Collection Building with the UKWA

By Sara Day Thomson (Digital Archivist), Lorraine McLoughlin (Appraisal Archivist), and Aline Brodin (Cataloguing Archivist), University of Edinburgh  

With thanks to Eilidh MacGlone, Web Archivist, National Library of Scotland and UK Web Archive 

Introduction
The University of Edinburgh’s Centre for Research Collections (CRC) – which includes collections housed in libraries, archives, galleries, and museums – launched the Collecting Covid-19 Initiative in late April 2020. The Initiative invites staff, students, and anyone affiliated with the University to donate any materials that document their experiences of the pandemic and lockdown. Websites, photographs, videos, artwork, and all other materials are welcomed. In preserving a range of materials and formats for the long term, the CRC aims to prevent gaps in memory and to preserve a record of the University’s response to the pandemic.  

CRC Montage

As submissions began to come in via our online form, it became evident that online communications and platforms have played a critical role in how the University community has responded to the pandemic and lockdown. Even some submissions in more ‘traditional’ formats, like images or narratives, have been published online and submitted as a URL. In addition, web-based submissions range from ‘flat’ websites to social media posts to content shared on third party platforms. However, with no web archiving programme in place, the collecting team reached out to the UK Web Archive via the National Library of Scotland (NLS) for support in collecting these valuable records of life during the pandemic.

Covid-19 CallOut

In this post, we discuss this collaboration and how Covid-19-related web resources are integrated into the wider collection at the University. We also discuss how the Initiative aligns with existing collecting policies but also provides us with an opportunity to establish approaches for more active collecting. These new approaches are not temporary but will provide lasting innovations that will support more responsive (and therefore representative) collecting of the University’s diverse communities and activities beyond the pandemic.

Selection of Web Resources: Donations and Active Collecting
The team of archivists looking after the Initiative has taken a two-fold approach for considering what to include. Primarily, a range of web-based works have been submitted by members of the community, including student publications and tweets. In addition to these submissions, the team has been actively identifying relevant web resources, such as official University communications and research activities, to capture a meaningful sample. Identified materials include: 

  • University communications such as emails to staff and students, news feeds, and information webpages 
  • Remote learning resources and websites for projects and initiatives created by staff members and research centres  
  • Resources created by and for the University's students and alumni such as networking groups on social media, blogs, and webpages offering advice and guidance 

Edinburgh Uni C19 response

This approach to actively selecting contemporary content for the Archive is relatively unusual (though not unprecedented). Typically, the archivist intervenes at the ‘end of life’ of a collection. The traditional archival process of collecting materials at the end of a project, or even at the end of a researcher’s career - involving multiple conversations and usually in-person donation - does not support active, contemporaneous collecting. 

Websites can change rapidly or disappear altogether. The files or links embedded in websites may break or move location within months (or sooner!). Therefore, archivists don’t have time to wait for web resources to amass over time and don’t have a crystal ball to predict what content will grow into cohesive collections. Web archiving provides a method for capturing contemporary, born digital resources like those surrounding the pandemic in a rapid, proactive way. 

Collaborative Processes for Collection-Building  
Working with the UKWA has allowed us to get started with capturing these web resources through access to their technical infrastructure and, very importantly, their valuable expertise. The UKWA uses a tool with a web interface for selecting and managing web resources – Annotation and Curation Tool – which has made collaborative collection-building much easier. The tool is well-documented (so great for newbies!) and staff possess wide knowledge of methods for capturing and curating web resources. It’s not a surprise that the UKWA has a well-established history of collaborating with external specialists to build topical collections around different subjects. This experience has made it relatively straightforward for us to develop a set of procedures.

W3ACT

Capturing and Contextualising Web Resources 
With the help of Eilidh MacGlone, the Web Archivist at NLS, we have begun to add relevant web resources (either submitted or actively selected) using ACT. We assign these captured resources to a dedicated collection: Collecting Covid-19 Initiative of the University of Edinburgh. This University of Edinburgh collection sits alongside other collections within UKWA related to the coronavirus pandemic. In fact, many of the web resources selected for collection have already been added to the UKWA by other curators like Eilidh. Therefore, the dedicated University of Edinburgh collection both provides a home for the web resources in the CRC’s Initiative and also contributes to the growing collections of web resources documenting this momentous event in the wider UKWA.

By including these web resources in our dedicated collection, we provide important context, often linking them to wider activities at the University or to other related, non-web materials. We can also provide descriptive information supplied at the point of submission by a member of the University community or based on organisational knowledge of the resource and how it relates to our other holdings.  

In addition to adding richer metadata, we enjoy a closer relationship to the creators of these web resources – either through direct consultation or through our existing collecting remit. These relationships enhance the meaning and significance of these archived resources, giving them an anchor to a place and to a community. Our collecting policies also inform the process of review for open access and, where needed, facilitate permission gathering to make as many resources in the collection as possible openly available online.

Integrating Web Resources into the Wider Collection 
As mentioned, the web content selected for the Initiative will sit in a dedicated collection amongst other UKWA topical collections. However, we want to ensure the web resources remain integrated with the materials in other formats in the CRC Initiative’s collection. Though we don’t have anything to share yet, we plan to create catalogue entries for web resources with a link to the UKWA access portal. This way the end user will have a single point of entry to all the materials in the collection, with web resources just one click away. One caveat: without an open access licence, these links will only be accessible via terminals on-site at the Legal Deposit Libraries. We anticipate most of our users at the CRC will expect to be able to view web resources on the web. Therefore, we are highly motivated to ensure as many of the web resources are granted open access licences as possible.

Open access for archived web pages that clearly form part of the University’s web estate and fulfil the criteria of the University Archives’ collecting policy is relatively straightforward. However, many submissions to the Initiative have been created on third party platforms, outside the University’s web estate. Others have been developed collaboratively, with significant contributors from outside the University community. In these instances, it may prove more complicated to grant open access and therefore more complicated to make the resources available remotely online.

In addition to links to the archived web resources themselves, we aim also to create some basic guidance about web archives and how they can be accessed and used. Though plans are still in the works, this guidance would likely sit on our public interface or possibly on individual catalogue records. We hope this informational metadata will help facilitate wider use of archived web resources in research but also prompt users to ensure their own web content is being archived and looked after. First things first, however, we’re busy building our own internal knowledge about web archiving. (So much to do! So many possibilities!) 

A Learning Experience 
As we have begun adding web resources to our collection, we have learned a great deal about web archiving, ACT, and procedures at the UKWA (largely informed by Legal Deposit legislation and restrictions). We’ve found that many types of web resources evade the crawlers, requiring adjustments to records on ACT … and many emails to Eilidh at NLS. More complex pages or content on third party platforms, as opposed to ‘flat’ web pages, pose real challenges to collecting a complete, authentic copy. Ultimately, finding the time to sit down and add web resources to the collection has been the greatest challenge of all. The team of archivists looking after the Collecting Covid-19 Initiative – which includes all formats of content, not just web – have other core responsibilities (and, like most, the added complication of trying to translate our jobs to home working).

The Collecting Covid-19 Initiative is still live and actively receiving submissions. Our Cataloguing Archivist Aline Brodin regularly surveys University outputs to identify relevant resources. We have begun reaching out to different groups and communities across the University to request input into the direction of our collecting and improve diversity and representation. We expect the nature of submissions and identified materials to evolve as the situation evolves and, as we gain experience in web archiving, we expect our procedures and approaches to evolve as well.  

Though we are at the very beginning of our journey, we hope our own little corner of web resources related to the pandemic will enhance wider collections about Covid-19 and how different communities have responded in real time.   

Multiple Approaches  
While the collaboration with the UKWA to build our own collection of web resources related to the pandemic is beyond valuable, there are some limitations to this approach (as discussed above). One is technical – the infrastructure used by the UKWA (the Heritrix-based crawlers) is built for scale, not detail. As a result, there are a few web resources we have struggled to capture. The other limitation is practical – the archives team at Edinburgh only has minimal permissions in the UKWA system (to ensure the integrity of the archived content it holds). Therefore, many basic functions – such as quality assurance and granting open access licences – must go through Eilidh at NLS. The UKWA team are incredibly busy and their capacity to support individual queries is limited (they are after all archiving the UK web…).

Therefore, we have pursued an alternative approach for a small portion of selected content using Webrecorder Desktop. This approach comes with its own limitations. Webrecorder is a tool built to capture complex, often interactive web resources. However, to enable this ‘high-fidelity’ approach, the tool requires a curator to click every link and every button to trigger a capture. This makes Webrecorder a time-consuming approach to capturing web resources – especially large ones. Furthermore, the output of Webrecorder is a WARC file. While WARC files are the gold standard for preservation, they pose a barrier for access. The typical user of CRC collections is unlikely to know what a WARC file is and even less likely to know how to access and view one.  
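
For readers curious about what actually sits inside a WARC file, the sketch below walks through the records in a small, uncompressed example using only the Python standard library. It is an illustration under stated assumptions rather than a description of our workflow: the function name `iter_warc_records` and the sample record are made up for this post, compressed `.warc.gz` files and malformed records are not handled, and real collections are better served by established web archiving tools.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Hypothetical helper for illustration only. Each record is a version
    line ("WARC/1.0"), named header lines, a blank line, then exactly
    Content-Length bytes of payload.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers.get("Content-Length", "0"))
        yield headers, stream.read(length)


# A made-up, single-record example (not a real capture).
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: https://example.org/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

for headers, payload in iter_warc_records(io.BytesIO(sample)):
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

Even this toy example shows why a raw WARC is awkward for a typical reader: the captured page is wrapped in archival metadata and needs replay software to be viewed as a website.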

Conifer CAHSS blog

Despite these limitations, the team has devised a workflow that uses Webrecorder for selected web resources that cannot be captured through UKWA. This capture of a University blog, ‘Covid-19 Perspectives’, for example, was made using Webrecorder Desktop and the similar web-based service Conifer. The WARC files exported from Webrecorder will be ingested into our preservation system and possibly made available to users on request. We’re currently exploring the possibility of an institutional account with Conifer, which provides a web-based service for capturing and sharing archived web resources. This way, we could provide access via a link embedded in our catalogue, exactly the same way as for UKWA resources. This approach would create a more seamless user experience, though it also relies on a third party platform for continued access.

Conclusion
Though our collaboration with the UKWA and experiments with other web archiving tools focus on the Collecting Covid-19 Initiative, we hope to apply the lessons learned to different areas of collecting. The archives team has started conversations with the University’s web team to discuss plans for archiving the web estate as a vital record of the institution's history. I’ve delivered a few tutorials on the basics of web archiving for different staff across the Library, including how-to sessions for Webrecorder Desktop and submitting URLs to the UKWA. I’ve also started discussions with a research data management colleague about building services for researchers to capture and deposit web and social media content as part of their research outputs.

If this experience has taught us anything, it’s that none of these undertakings will be possible without close collaboration and willingness to test out new methods and tools. While the scale of resources that need to be archived can seem daunting, I’m confident the incremental progress we make will ensure a much richer, more authentic record makes it to the future.  

More Information
To learn more about the approach to collecting materials (in all formats) for the University of Edinburgh’s Collecting Covid-19 Initiative, see 'Collecting Covid-19: an initiative to document the University’s community response to the pandemic', by Lorraine McLoughlin and Sara Day Thomson, COVID-19 Perspectives blog, College of Arts, Humanities, and Social Sciences at The University of Edinburgh, https://blogs.ed.ac.uk/covid19perspectives/2021/03/08/collecting-covid-19-an-initiative-to-document-the-universitys-community-response-to-the-pandemic-by-lorraine-mcloughlin-and-sara-day-thomson/

10 March 2021

British Science Week and the UK Web Archive

It is British Science Week!

Britain has a fantastic record of pioneering science that has continued into the digital age. The UK Web Archive aims to archive as much of this online scientific output as we can. Here are just some of the many science websites in the archive. Don't forget that anyone can suggest UK science websites for the archive here: www.webarchive.org.uk/nominate

Science Sparks website
https://www.webarchive.org.uk/wayback/en/archive/20200408080530/https://www.science-sparks.com/

Science collection

The Science collection was started in 2020 and now contains over 200 different websites. The collection is wonderfully diverse - from the British Bryological Society (Mosses and Liverworts) to the online identification site British Bugs. The collection aims to cover all areas of British science, including science communication.

Cambridge Science and Stephen Hawking

We have worked closely with our partner Cambridge University Libraries to capture the amazing science undertaken at Cambridge University. This, of course, includes the work of the late Stephen Hawking, a Professor at the Centre for Theoretical Cosmology, famous for groundbreaking research into black holes, amongst other things.

Centre theoretical cosmology website

Darwin 200

To celebrate the 200th anniversary of the birth of Charles Darwin in 2009, staff at the British Library put together the Darwin 200 collection. Would you like to read the complete works of Darwin? Try Darwin Online.

Darwin online website

Other collections that include science elements are:

Do get in touch with suggestions for inclusion of more UK science websites: www.webarchive.org.uk/nominate

#BritishScienceWeek

25 January 2021

Rabbie Burns and the UK Web Archive

By Jason Webber, Web Archive Engagement Manager, British Library

Born on 25 January 1759, Robert ‘Rabbie’ Burns, sometimes known as the ‘National Bard’, the ‘Bard of Ayrshire’ and the ‘Ploughman Poet’, is rightly famous for his poetry in the Scots dialect. Burns’ legacy remains strong into the digital age and his work has been widely collected and can be seen in the UK Web Archive.

'Editing robert burns' website

This fantastic AHRC-funded project ‘Editing Robert Burns’ aims to produce a multi-volume edition of his work. If you like a pun you can’t help but smile at ‘Daylight Rabbery: The Story of ‘Antique Smith’s’ Robert Burns Forgeries’.

Cutty Sark website

Did you know that the famous Greenwich landmark and former tea clipper ‘Cutty Sark’ gets its name from the Robert Burns poem ‘Tam o’ Shanter’?

But Tam kend what was what fu' brawlie:
There was ae winsome wench and waulie,
That night enlisted in the core,
Lang after ken'd on Carrick shore;
(For mony a beast to dead she shot,
And perish'd mony a bonie boat,
And shook baith meikle corn and bear,
And kept the country-side in fear.)
Her cutty-sark, o' Paisley harn
That while a lassie she had worn,

Burns has a more direct influence on the 21st century with ‘Rabbie Burns Saves the World: an 8 Bit Game’. Play the game here.

8 bit burns website game

Do you know any online Robert Burns resources? We would love to include any in the UK Web Archive. Nominate any UK website here: www.webarchive.org.uk/nominate

Do also check out our ‘Poetry Zines and Journals’ collection.

Happy Burns Night!

25 November 2020

LGBTQ+ Lives Online Web Archive Collection

By Steven Dryden, British Library LGBTQ+ Staff Network & Ash Green, CILIP LGBTQ+ Network

As you’ll have read on this blog, the collaboration between the UK Web Archive (UKWA), the British Library and the CILIP LGBTQ+ Network to develop LGBTQ+ content within the UK Web Archive was launched during summer 2020.

Rainbow tapestry

LGBTQ+ content was already part of the UK Web Archive before the collaboration began, with many sites in other collections overlapping LGBTQ+ themes. Examples include Black and Asian Britain (blackgayblog.com), Gender Equality (Beyond the Binary) and Sport (Graces Cricket Club). Some sites cut across many collections, highlighting the intersectional nature of the UK Web Archive. For example, Gal-Dem features in the News Sites; Zines and Fanzines; Black and Asian Britain; Gender Equality; Women's Issues; Unfinished Business: The Fight for Women’s Rights collections, as well as LGBTQ+ Lives Online. LGBTQ+ Lives Online, much like the lived experience of LGBTQ+ people, does not sit in isolation, disconnected from other aspects of UK offline and online life. LGBTQ+ people play a part in all aspects of the UK community, and are not solely defined by their gender or sexual orientation.

This UK Web Archive collection doesn’t stand in isolation either: it enriches the scope of work already begun at The British Library. LGBTQ Histories aims to explore the experiences and stories encountered in the collections, posing questions about the lived experience of LGBTQ+ people throughout history. The LGBTQ+ Lives Online collection of the UK Web Archive plays a part in CILIP LGBTQ+ Network’s ambition to raise the profile of LGBTQ+ people and to support the development of LGBTQ+ information resources and the work of LGBTQ+ library, information and knowledge workers.

LGBTQ+ Lives Online Collection

UKWA 'ACT' tool

The collection currently contains over 400 sites and web pages in the main collection, with more of these being added to sub-collections every week. Many of the sites were already in the UKWA before the collaboration began, but were not linked to sub-collections. We are still at the stage where we are developing the structure of sub-collections but our initial indexes cover:

Since the launch of this collaborative project, we have been focused on a number of areas to both develop the project and to preserve sites within the collection. This includes:

  • Identifying sites already in the UK Web Archive to be added to the LGBTQ+ Lives Online sub-collections.
  • Identifying new sites not already in the UKWA to be included in the collection.
  • Spreading the word about the project as widely as possible via blog posts and articles such as this; social media; emails targeting specific LGBTQ+, library, and broader diversity organisations and networks.

You can browse through the collection here, and nominate a UK published site or webpage with a focus on LGBTQ+ lives to be included in the collection via: https://www.webarchive.org.uk/en/ukwa/info/nominate. We would especially like to see more nominations that reflect the multicultural nature of UK LGBTQ+ communities and the many diaspora communities based here, including UK sites written in languages other than English.

Though it can often be challenging for us to archive social media accounts, we are able to collect LGBTQ+ Twitter accounts. We have experimented with other methods of archiving social media, though only on a selective basis. We would welcome nominations and projects that might address these challenges and how they might impact on archiving LGBTQ+ experience in the UK.

How can you access these archived websites?

UKWA search results page

Under the Non-Print Legal Deposit Regulations 2013, the UKWA can archive UK published websites but is only able to make the archived version available to people outside the Legal Deposit Libraries' Reading Rooms if the website owner has given permission. The UK Legal Deposit Libraries are the British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries, Cambridge University Library and Trinity College Dublin Library.

Permission has already been granted for some of the websites in UKWA, including Out Stories Bristol, Trans Ageing and Care, Bi Cymru/Wales and Queer Zine Library. As the content of UKWA has mixed access, the message ‘Viewable only on Library premises’ will appear under the title of the website if you need to visit a Legal Deposit Library to view content. If there is no message underneath, then the archived version of the website should be available on your personal device.

Due to the coronavirus pandemic, the reading rooms were closed for a number of weeks but are starting to reopen. This blog post gives an overview of opening hours and how to book a visit at the six UK Legal Deposit Libraries:

https://blogs.bl.uk/webarchive/2020/09/ukwa-available-in-reading-rooms-again.html 

Previous blog posts about the project can be viewed via the following links.

LGBTQ+ Lives Online project introduction

LGBTQ+ Lives Online: Introducing the Lead Curators

24 November 2020

Web Archive Team wins 2020 Digital Preservation Award

By Sophia Chrisafis, Internal Communications Officer, The British Library

On Thursday 5 November the UK Web Archive Team won The National Archives (UK) Award for Safeguarding the Digital Legacy at the Digital Preservation Awards 2020.

The Award was made to the UK Web Archive for ‘15 years of the UK Web Archive’, marking the anniversary of the launch of a public UK Web Archive service.

In all, there are six awards, which are presented every two years. 

Digital Preservation Award 2020

2020 Awards
This year the awards took place online, via Zoom. John Sheridan, Digital Director at The National Archives, introduced the award: 

The National Archives (UK) Award for Safeguarding the Digital Legacy celebrates the practical application of preservation skills to protect at-risk digital objects, drawing attention to the concrete efforts to ensure important elements of our generation’s digital memory can remain available for future generations. It is also for demonstrating a deep understanding of the risks that digital objects face and (the winner) should be an exemplar of digital preservation best practice and why preservation matters.

The winners were announced by judge April Miller, from the World Bank Group, who invited Ian Cooke, the British Library’s Head of Contemporary British Published Collections, to give an acceptance speech.

On behalf of the Web Archiving Team, Ian said:

‘We’re really amazingly pleased to have won.

‘It’s a huge honour for us to be recognised in this way, and to have been among such excellent finalists, such amazing projects, really inspiring ones.

‘We always say it’s not possible to understand the 21st century without the archived web, and we’ve been posting to our blog all week about the diversity and variety of our collections.

‘I’m personally always amazed and incredibly proud of the work Andy Jackson and Nicola Bingham lead for the Web Archive, and also for our whole team, both at the British Library and across the UK legal deposit libraries, and the friends we work with – the International Internet Preservation Consortium, an incredible community – and everyone we’ve worked with around the world for the past 15 years for digital preservation access and development.

‘Thank you so much.’

The UK Web Archive

The UK Web Archive (UKWA) was formed in 2003 as a response to growing awareness of an urgent digital preservation need, to collect and preserve communication using the web.

UKWA is a partnership of the six UK Legal Deposit Libraries: the National Library of Scotland; the National Library of Wales; the Bodleian Libraries, University of Oxford; Cambridge University Libraries; Trinity College, Dublin; and the British Library. In 2020, UKWA celebrated 15 years since making its first collections available publicly online.

Read more about the last 15 years of UKWA in these blog posts:

 

18 November 2020

2020 Domain Crawl Update

By Andy Jackson, Web Archiving Technical Lead at the British Library

 

On the 10th of September the 2020 Domain Crawl got underway. The annual Domain Crawl usually takes about three months to complete. It visits UK published websites on a UK Top Level Domain (TLD) such as .uk, .cymru, .scot or .london, any web content hosted on a server registered in the UK, and all the records manually created by the UK Web Archive teams across the UK Legal Deposit Libraries.
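
To make the first of those scope rules concrete, here is a minimal Python sketch of a TLD check. It is an illustration only and not how the crawler itself is configured: the function name `in_scope_by_tld` is hypothetical, the list holds only the TLDs named above, and the other two routes into scope (UK-hosted servers and manually created records) are deliberately left out.

```python
from urllib.parse import urlsplit

# Only the TLDs named in this post; the real list is longer ("etc.").
UK_TLDS = ("uk", "cymru", "scot", "london")


def in_scope_by_tld(url):
    """Return True if the URL's hostname ends in one of the listed UK TLDs.

    Hypothetical helper: it ignores UK-hosted servers on other TLDs and
    records added manually by curators.
    """
    host = urlsplit(url).hostname  # lower-cased hostname, or None
    if not host:
        return False
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return tld in UK_TLDS


for url in ("https://www.bl.uk/", "https://www.nls.scot/", "https://example.com/"):
    print(url, in_scope_by_tld(url))
```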

 

Update on crawl management

Due to the billions of URLs involved, the Domain Crawl is the most technically difficult crawl we run. As the crawl frontier grows and grows, the strain starts to show, particularly on the disk space required to store all of the status information about the URLs that have been crawled or are awaiting crawling. Worst of all, we found some mysterious problems with how Heritrix3 manages this information, which meant that we could not safely stop and restart long crawls. We could usually restart once, but if we restarted again strange errors would appear, and sometimes these would be serious enough to cause the whole crawl to fail. Fortunately, in the last year, we finally tracked this down and updated the Heritrix3 crawler so that it can be safely stopped and restarted multiple times.

This has made managing the crawler much easier, as we can stop and restart the crawl with confidence if we need to change the software or hardware setup. This makes managing things like disk space much less stressful.

 

Update on the crawl performance 

In the initial phase of the crawl, we threw in the roughly 11 million web hostnames that we have seen in past crawls, which then got whittled down to about 7 million active hosts. After this bumpy start and some system tuning, the crawl settled down and has been pretty consistently processing 250-300 URLs per second.  This is acceptable, but isn’t quite as fast as we would like, so we are analysing the crawl while it runs to try and work out where the bottlenecks are.

 

What we have collected so far

The figure below shows the URLs collected over time.

 

Graph illustrating the number of URLs downloaded in the 2020 Domain Crawl

The rather jagged start shows where we were able to stop and start the crawl in order to tune the initial hardware setup, and the flatter ‘pauses’ later on are from other maintenance activities like growing the available disk space. The advantage of being able to re-tune the crawler as we go is shown by the way the line gets steeper over time, corresponding to the increased crawl rate.

 

In terms of bytes downloaded, we see a similar result:

Graph illustrating the number of TBs downloaded in the 2020 Domain Crawl

As you can see, we are rapidly approaching 90TB of downloaded data, which corresponds to roughly 50TB of compressed WARC.gz data.

Despite starting the crawl relatively late in the year (due to issues around the COVID-19 outbreak), we are making good and stable progress and are on track to download over two billion URLs by the end of the year.
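
As a rough sanity check on these figures, the short Python sketch below turns the crawl rate quoted earlier (250-300 URLs per second) into URLs per day and into the number of days needed to reach two billion URLs. It is back-of-the-envelope arithmetic only: it assumes a constant rate with no pauses, which the graphs above show is not quite the case, and the function names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def urls_per_day(rate_per_second):
    """URLs processed in one day at a constant crawl rate."""
    return rate_per_second * SECONDS_PER_DAY


def days_to_crawl(total_urls, rate_per_second):
    """Days needed to process total_urls at a constant crawl rate."""
    return total_urls / urls_per_day(rate_per_second)


for rate in (250, 300):
    days = days_to_crawl(2_000_000_000, rate)
    print(f"{rate} URLs/s: {urls_per_day(rate):,} URLs/day, {days:.1f} days for 2 billion URLs")

# Ratio of compressed WARC.gz data to downloaded data (50TB out of 90TB).
print(f"compressed size is about {50 / 90:.0%} of the downloaded size")
```

At these rates two billion URLs take somewhere between roughly 77 and 93 days of continuous crawling, which is in line with the "about three months" a Domain Crawl usually takes.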

 

Follow the UK Web Archive on Twitter for the latest updates on the Domain Crawl and other web archiving activities! 

 

09 November 2020

A tale of two web archives: challenges of engaging web archival infrastructures for research

By Jessica Ogden, University of Bristol and Emily Maemura, University of Toronto

Web archives are quickly becoming a key source for studying the historical Web, with many recent projects and publications demonstrating the scholarly opportunities presented by national web archives, in particular. At the same time, research in and on national web archives presents a number of challenges for scholars - where a ‘sociotechnical gap’ (Ackerman 2000) can be observed between the needs of researchers and the affordances of web archives themselves.

Diagram illustrating a web archive conceptual framework

In an effort to better understand the barriers to web archival use in research, our recent paper at the Engaging with Web Archives conference shares the results of a collaborative project which compares and contrasts our experiences of using two national web archives: the UK Web Archive and the Netarchive in Denmark. In 2018, Jessica undertook a three-month research placement with the British Library looking at the challenges and opportunities of using the UKWA for social science research. Around the same time, Emily also spent three months at the Danish National Web Archive, Netarchive, in collaboration with the Royal Library and the University of Aarhus in Denmark. 

Based on our own interactions with these web archives, and interviews with staff and curators, alongside observations of web archiving activities, this paper proposes a conceptual framework that outlines the earliest stages of research alignment and engagement with national web archives. The concepts developed in the paper (orientating, auditing and constructing) provide an avenue for discussing the entanglement of researchers, curators and collections in the research process. In discussion, we make a number of observations regarding the challenges of this form of digital research - including how researchers must unpick the complex constraints of different web archives - and suggest possible ways that existing curatorial infrastructure (tools, people and curatorial knowledge and expertise) could be leveraged to better facilitate researcher engagement in future.  

To learn more about our findings, check out the recording of our EWA 2020 presentation.

Acknowledgments

This work was supported by the Social Sciences and Humanities Research Council (SSHRC) Canada Graduate Scholarship 767-2015-2217 and Michael Smith Foreign Study Supplement. Additional funding was provided by a UKRI/Economic and Social Research Council, National Centre for Research Methods placement fellowship and research funds by the University of Southampton. The authors also gratefully acknowledge the generosity and support provided by staff and researchers at the UKWA, the British Library, the Royal Library and the NetLab at Aarhus University.