UK Web Archive blog

30 June 2025

UK Web Archive Report on Digital Methodologies for the Study of Religion Symposium

By Helena Byrne, Curator of Web Archives

Digital Methodologies for the Study of Religion event details

The UK Web Archive participated in the one-day symposium Digital Methodologies for the Study of Religion on 25th June 2025. This knowledge exchange symposium was organised as part of the ESRC-funded Digital British Islam Project. It was a hybrid event with a mix of online presentations and in-person presentations at Coventry University.

The fourteen presentations were divided into four thematic panels: Panel 1 – Innovative Methods and Platforms, Panel 2 – Digital Archives and Cataloguing, Panel 3 – Mixed Methods and Online-Offline Dynamics and Panel 4 – Emerging Ethical Challenges.

The UK Web Archive participated in Panel 2 – Digital Archives and Cataloguing. The first speaker, Emily Cottrell from Université de Strasbourg, outlined a project that produced an online database to study digitised religious texts. The final two presentations in the panel were from Gary R Bunt from Digital British Islam at University of Wales Trinity Saint David and Anna Grasso from Digital Islam Across Europe at University of Edinburgh. Professor Bunt outlined the scope of the Digital British Islam web archive collection as well as the lessons learnt from developing the curation skills needed to build a web archive collection. Dr Grasso then gave an overview of the Digital Islam Across Europe web archive collection and how they were able to use the ARCH platform through their Archive-It subscription. It was really interesting to hear curatorial insights from these web archive collections and how the data collected can be used to further understand the lived experience of Islamic communities in Britain and across Europe.

The British Library presentation was Using the UK Web Archive to understand religion on the web. This presentation gave a general introduction to the UK Web Archive explaining who is involved in curating the UK Web Archive collections, an overview of Non-Print Legal Deposit and how this shapes curation practices. It gave an overview of how religion is represented within the UK Web Archive. Religions are broadly represented across many of the over one hundred curated collections and there are currently nine individual collections that focus on a topic related to religion. The presentation gave an overview of the recent work we did to publish metadata from the UK Web Archive as data by co-developing the Datasheets for Web Archives Toolkit. So far, the Scottish Churches - Collection Seed List is the only data set related to religion that has been published but keep an eye on the UK Web Archive for updates on when the next phase of data sets will be published.  

Potential research example with the Scottish Churches - Collection Seed List data set

All the presentations gave methodological insights that could be reused by researchers studying a different subject and I would highly recommend checking out the recordings when they are made available through the project website: https://digitalbritishislam.com/

One highlight for anyone who manages a GLAM sector catalogue was the presentation by Dr. Nur Efeoglu, who presented Curating Islam Online: Religious Heritage in UK Museum Digital Catalogues. This presentation focused on reviewing three UK museum catalogues for content related to the Selçuk and Ottoman period. The lessons learnt from this review are valuable for running any effective catalogue. My favourite quote from this presentation was "curation should be a collaboration not a monologue". This is something we try to encourage in the UK Web Archive by collaborating with subject experts to curate collections on various topics and by gathering nominations for the archive from the public.

20 June 2025

RESAW 2025: Report from UK Web Archive Colleagues

RESAW 2025 Conference Banner

Introduction

The RESAW (Research Infrastructure for the Study of Archived Web) 2025 conference took place at the University of Siegen in Germany. It was organized by the Collaborative Research Centre 1187 “Media of Cooperation” at the University of Siegen in cooperation with the Centre for Contemporary and Digital History (C²DH) at the University of Luxembourg.

This was a special conference, as the organisers of this, past and future conferences gave a special presentation (it included cake and balloons) to mark ten years since the first RESAW conference was held in Aarhus, Denmark. They all paid tribute to Niels Brügger from Aarhus University, who founded RESAW and helped develop the RESAW community.

The conference theme, “The Datafied Web”, was explored from a historical perspective. The call for papers stated that “we would like to explore the historical roots, trends, and trajectories that shaped the data-driven paradigm in web development and to examine the genealogies of the datafied and metrified web”. The opening panel discussion aimed to define what is meant by “the datafied web”.

UK Web Archive colleagues from Bodleian Libraries, the British Library and National Library of Scotland attended the conference. There was a packed programme with a variety of presentation forms and workshops that shared best practices and innovative projects in the world of web archiving. In this blog post they report highlights of their conference experience.

Reflections

Helena Byrne - Curator of Web Archives - British Library 

I was part of the panel called Web archives practices along with colleagues from the Portuguese and Belgian web archives. My presentation, Lessons learnt from preparing collections as data: the UK Web Archive experience, gave an overview of the project that spanned from October 2022 to November 2024 to develop a framework for publishing UK Web Archive curated collections as data.

There were so many great presentations and panels at this conference that it is hard to pick just one highlight. The opening panel discussion defining “the datafied web” raised lots of interesting points. In this panel Anne Helmond made the important point that “while the front-end of the web has changed dramatically, the back-end has undergone a deeper transformation” and the study of the web requires a mix of methodologies and resources. Another session that stood out was the panel on Past Metrics. We were reminded in this session about the visitor counters that used to be popular on early versions of websites. This was especially poignant as, just a few days before this presentation, I received an enquiry about a website and used the Memento Time Travel search function to see whether any other web archives held a copy of it. I found one copy from its earlier years. This version had a prominent visitor counter and evoked a nostalgic response, as I realised I hadn’t seen one for many years and had forgotten about this feature.

Beatrice Cannelli - Curatorial and Policy Research Officer (Algorithmic Archive Project) - Bodleian Libraries

At this year’s RESAW conference, my colleague Pierre Marshall and I organised a workshop titled “Towards an ‘Algorithmic Archive’: Developing Collaborative Approaches to Persistent Social and Algorithmic Data Services for Researchers”. The workshop brought together diverse perspectives from practitioners and researchers working with social media data, fostering discussions regarding the development of sustainable strategies to collect social media platforms. The workshop was a valuable opportunity to gather insights for the Algorithmic Archive project, particularly regarding issues and expectations related to short- and long-term access to social media data. 

Among the many engaging sessions, I found the one on “the challenges of archival practices” particularly interesting. Using the case of the web archive at the Aix-Marseille University, the panellists underscored the importance of encouraging critical engagement with issues researchers face, such as data ethics, data surveillance and archival responsibility, especially when dealing with potentially sensitive web archived data. Similarly, the panel on “Data Regimes” reflected on the complexity of data stewardship, where open data policies often clash with ethical concerns, especially when dealing with sensitive content like social media data. This often leaves researchers and librarians to navigate these grey areas without clear guidance, raising questions about reuse and long-term preservation.

Pierre Marshall - Technical Research Officer (Algorithmic Archive Project) - Bodleian Libraries

Vasco Rato gave an overview of arquivo.pt’s API. Arquivo.pt runs a CDX(J) server, and about half of the traffic to the archive comes from the API. Rato mentioned that sometimes people _ask_ for WARCs, but what they really want is just the text or media content of a page. It would be a better user experience to provide text or image search directly through the API. The CDX(J) server also helps anyone wanting to page through the archive without downloading the whole thing. Most researchers don't have the capacity to store and process 1.5PB of WARC files.
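
To illustrate why an index is often enough, here is a minimal sketch of reading CDXJ-style index lines and listing HTML captures without touching any WARC file. It assumes the common CDXJ layout of a URL key, a 14-digit timestamp and a JSON block on each line; the sample lines, the `CdxjRecord` type and the helper names are invented for illustration and are not taken from Arquivo.pt's actual API or data.

```python
import json
from typing import Iterable, Iterator, NamedTuple


class CdxjRecord(NamedTuple):
    """One index entry (hypothetical helper type for this sketch)."""
    urlkey: str
    timestamp: str
    fields: dict


def parse_cdxj(lines: Iterable[str]) -> Iterator[CdxjRecord]:
    """Yield records from lines shaped like '<urlkey> <timestamp> <json>'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first two spaces only: the JSON block contains spaces.
        urlkey, timestamp, payload = line.split(" ", 2)
        yield CdxjRecord(urlkey, timestamp, json.loads(payload))


def html_captures(records: Iterable[CdxjRecord]) -> list:
    """Return (timestamp, url) pairs for captures recorded as text/html."""
    return [
        (r.timestamp, r.fields["url"])
        for r in records
        if r.fields.get("mime") == "text/html"
    ]


# Invented sample index lines; a real index would come from a CDX(J) server.
SAMPLE = [
    'org,example)/ 20100315120000 {"url": "http://example.org/", "mime": "text/html", "status": "200"}',
    'org,example)/logo.png 20100315120005 {"url": "http://example.org/logo.png", "mime": "image/png", "status": "200"}',
    'org,example)/ 20190101000000 {"url": "http://example.org/", "mime": "text/html", "status": "200"}',
]

captures = html_captures(parse_cdxj(SAMPLE))
for timestamp, url in captures:
    print(timestamp, url)
```

Each index line is small, so a researcher can filter by date, URL or media type first and only then decide which records, if any, are worth requesting in full.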

Helge Holzmann of the Internet Archive ran a workshop on the Archives Research Compute Hub (ARCH) service. Holzmann talked us through a series of recipes for the ArchiveSpark library, intended to make it easier for researchers to run data-centric queries against items in the Internet Archive. Besides the content of the workshop, I appreciated Holzmann's use of 2000s-era retro web graphics to illustrate his presentation. We are all here for the datafied web, but beyond the data I'm happy to celebrate the art of the early web.

The BnF also presented their Skyblogs collection, including work on parsing the page markup (back) into a data model for analysis across the corpus.

The common theme I took from these sessions is that there's a lot to learn from making large web datasets usefully available to academics. Hopefully next year Beatrice and I will be back with some examples of what internet researchers could do with our planned social media archive.

Andrea Kocsis - Chancellor’s Fellow in Humanities Informatics, University of Edinburgh / The National Librarian’s Fellow in Digital Scholarship 2024-25, The National Library of Scotland

I was glad to present our work on web archive engagement with Leontien Talboom, where we discussed how to support not only traditional readers and computational users, but also the digitally curious who often fall between categories. I also shared a glimpse into the creative process behind Digital Ghosts, the web archive exhibition I’m currently developing with artist Dorsey Kaufmann and the National Library of Scotland, which will take place in November at Inspace in Edinburgh.

One of the talks that stayed with me was Ian Milligan’s reflection on the ethical challenges of crowdsourced digital archives in the context of 9/11. I plan to bring this ethical dilemma of accessibility, metadata, and data protection into my teaching next year in Future Libraries and Archives at the Edinburgh Futures Institute. The most inspiring talk for me, though, was Nanna Bonde Thylstrup’s keynote on data loss. Her interdisciplinary framing - drawing equally from humanities, sociology, and STEM - challenged the usual discourse of data loss as an evolutionary narrative and instead reframed it as a question of digital politics and infrastructure. Overall, RESAW was inspiring both intellectually and as a generous, thoughtful community of dedicated netpreservers.

Conclusion

Attending the RESAW conference is a great opportunity to exchange ideas, learn about innovative research projects, and foster collaborations in the field of web archive studies. UK Web Archive colleagues contributed significantly through presentations and active participation in other sessions. Participating in conferences in this way supports the recognition and reuse of the UK Web Archive collections as a significant resource in the wider academic discourse on web archiving. We look forward to participating in the next edition of the conference, which will take place in June 2027 at the University of Groningen (Centre for Media and Journalism Studies & Centre for Digital Humanities). The theme for 2027 is “Engaging Public Internet Histories: New Ways of Telling the Story of & with the Web”. Keep an eye out in 2026 for the call for papers for the seventh RESAW conference.

08 May 2025

Marking 80 Years: Documenting VE and VJ Day Commemoration in the UK Web Archive

By Nicola Bingham, Lead Curator of Web Archives, British Library

Home page of the ve-vjday80.gov.uk website

This year marks a significant national milestone: the 80th anniversary of the end of the Second World War. With Victory in Europe (VE) Day falling on 8th May and Victory over Japan (VJ) Day on 15th August, commemorations are planned across the UK to honour the conclusion of a conflict that reshaped the world.

To document this anniversary, the UK Web Archive is curating a special collection titled "VE / VJ Day 80", which will record how people and communities across the UK are commemorating the end of WWII, from national ceremonies to local grassroots events.

Collection Scope

This curated collection focuses on UK-based websites documenting commemorative events, public activities, and community involvement related to VE/VJ Day 80. Rather than a detailed historical retrospective, the collection aims to reflect contemporary responses and engagement with this anniversary.

Key Aspects of UK Commemorations

The collection includes a wide variety of commemorative themes and activities such as:

· National Events: Organised by groups like the Royal British Legion, including parades and memorials.

· Local Celebrations: Street parties, community gatherings, and regional events.

· Church Services: Remembrance services held nationwide.

· Beacon Lighting: Symbolic ceremonies at dusk.

· Remembrance Readings: Recitals of "The Tribute" and similar dedications.

· Veteran Involvement: Honouring the voices and presence of those who served.

· Contrasting voices or critical perspectives of the commemorations.

Why We Are Archiving This

By collecting these websites now, we’re creating a rich and enduring resource for future researchers, historians, educators, and the general public. This collection will preserve not only official narratives but also grassroots and personal perspectives, reflecting the diversity of the UK’s commemorative landscape.

One recent example of how the UK Web Archive supports research is the work of Dr Liam Markey, whose blog post, published earlier this week, describes how he has used archived web content.

Between 2018 and 2023, Liam completed a PhD at the University of Liverpool in collaboration with the British Library, examining how remembrance practices in Britain, particularly the concept of military victimhood, shape national identity and reflect militaristic thinking. His work highlights the value of digital resources like the UK Web Archive in documenting contemporary remembrance culture.

How You Can Contribute

We welcome nominations of websites, blogs, and social media accounts that reflect VE/VJ Day 80 commemorations and perspectives.

Are you organising a public or community event?

Are you sharing your thoughts or experiences online?

If so, we’d love to hear from you.

Please email your suggestions to: [email protected] 

Although the UK Web Archive website is currently offline, our team is actively capturing web content using remotely hosted systems, ensuring this material is preserved for the future.

Here are a few examples of sites already being archived:

Royal British Legion – Remembering the End of WWII (https://www.britishlegion.org.uk/getinvolved/events/remembranceevents/rememberingtheendofthesecondworldwar)

VE Day 80 Community Events (https://www.veday80.org.uk/)

VE/VJ Day 80 (https://ve-vjday80.gov.uk/)

English Cathedrals – VE Day Services (https://www.englishcathedrals.co.uk/latestnews/veday808thmay2025asharedmomentofcelebration/)

Breckland Council – Remembrance Grants & Readings (https://www.breckland.gov.uk/article/24080/VEVJDay80AnniversaryGrants)

Royal Navy – WWII Veterans’ Stories (https://www.royalnavy.mod.uk/news/2025/january/06/20250106ww2veteransurgedtocomeforwardtomark80thanniversary)

Beacon Lighting Guide (Glinton Parish Council) (https://glintonpc.gov.uk/wpcontent/uploads/2024/07/VEDay80AnniversaryGuidev19.pdf)

VE Day Blog Posts from the British Library

This is one of multiple blog posts being published across the British Library blogs this week:

UK Web Archive: https://blogs.bl.uk/webarchive/2025/05/digital-memory-and-the-militarised-past.html 

European Studies: https://blogs.bl.uk/european/2025/05/remembering-sacrifice-celebrating-freedom.html

Newsroom: https://blogs.bl.uk/thenewsroom/2025/04/ve-day-in-the-news.html 

Social Science: https://blogs.bl.uk/socialscience/2025/05/ve-day-voices-from-history-.html 

Untold Lives: https://blogs.bl.uk/untoldlives/2025/04/children-in-war-time.html 

06 May 2025

Digital Memory and the Militarised Past: Commemorating Britain’s World Wars in the 21st Century

By Dr Liam Markey, University of Liverpool

This blog post will explore the immediate legacies of the First World War centenary in Britain, looking towards the culmination of the ongoing commemoration of the Second World War’s 80th anniversary, with VE and VJ Day being commemorated in May and August of this year respectively. It describes how discourse surrounding both world wars has shaped British attitudes and behaviours concerning conflict and military service over the last century, and how changing demographics may serve to consolidate these beliefs in the coming century. Special attention is paid to mixed media texts collected and held by the British Library, demonstrating the significance of the UK Web Archive (UKWA) in particular as a repository of counter-culture discourse in the context of British militarism.

A Second Century of Remembrance
As a response to the centenary of the First World War, between 2018 and 2023 I undertook a PhD at the University of Liverpool in collaboration with the British Library. My research cast a critical gaze upon the act of remembrance in Britain since the end of the First World War, with special attention paid to the concept of ‘military victimhood’ and its potential to mediate militaristic modes of thinking. The project was embarked upon in the wake of the national commemoration of the centenary of the First World War, a watershed moment in Britain which prompted the production of a myriad of state-funded cultural and educational events.

As David Cameron announced in 2012, British commemoration of the centenary would serve to,

“provide the foundations upon which to build an enduring cultural and educational legacy, to put young people front and centre in our commemoration and to ensure that the sacrifice and service of a hundred years ago is still remembered in a hundred years’ time.”


While such an important historical moment arguably provides an invaluable opportunity for critical reflection, my research ascertained that, largely, the centenary instead engendered a consolidation of, and recommitment to, traditional forms of remembrance. Ultimately, the foundations that the centenary provided were not ground-breaking, rather they had already been established during, and enacted since, the end of the First World War itself. This next century of commemoration, as envisioned by Cameron, would be cast in the image of the last, anchored upon the rituals and practices of what is referred to as the ‘1919 model’.

This in and of itself can be regarded as potentially problematic, as my research, and the work of many other scholars, demonstrates the proclivity of such forms of commemoration to perpetuate the core tenets of a militaristic ideology; seeing war glorified, justified, and normalised. This is largely achieved through the sanitisation of war’s power to victimise, with emphasis placed on an idealised vision of warfare and military service. ‘Official’ or ‘dominant’ narratives of commemoration also emphasise the unifying power the rituals of the 1919 model, such as the two-minute silence or the wearing of a poppy, have among the British population and the positive effect enactment has in relation to British victims of war. Commemorative discourse overwhelmingly emphasises notions of debt that the public are duty-bound to fulfil, while avoiding direct reference to war’s inherent violence and propensity to produce victims.

This depiction of warfare present in commemorative practices was chosen to serve a very specific purpose, as a way of alleviating the suffering of the bereaved by acknowledging that their loved ones died in service of a noble ideal. However, with many, if not all, of those for whom the 1919 model was created having now passed from the British population, sentiments of military service as being inherently glorious, core to dominant commemorative narratives, serve to sanitise war for generations of individuals with no personal experience of war’s traumatic reality.

Alongside overt references to war as glorious and necessary within dominant commemorative narratives, my research also uncovered the role of the ‘commemorative deviant’. These are individuals who choose to commemorate war in a manner outside of the official purview, and as such are vilified in the national mainstream media, encouraging others to condemn rather than replicate such behaviours. The majority of such depictions come from print texts taken from three mainstream British newspapers: The Daily Mail, The Daily Mirror, and The Times, collected for analysis from the British Library’s Newsroom. These newspaper texts serve to reinforce specific beliefs and behaviours concerning remembrance over the last century that ultimately perpetuate, rather than challenge, militaristic notions.

Mainstream narratives purport that since the end of the First World War commemoration has been static, with its enactment based on a general consensus, and those rare deviant individuals represent an anomaly rather than a pattern of behaviour visible throughout the last century. However, through access to the UKWA, and close collaboration with the UKWA team, I was able to create a unique digital dataset which challenged such notions and provided a far more expansive view of commemoration as enacted in Britain since 1918. Beyond official black and white narratives of morally righteous consensus and villainous deviance, digital texts demonstrated the complexity of British remembrance. They provided an insight into ‘ground-up’ commemorative initiatives, uncovering attitudes more often than not absent from the mainstream media due to their potential to undermine notions key to the proliferation of dominant commemorative narratives.

Websites collected by the UKWA demonstrated the rich variety of methods by which war has been commemorated in Britain since the First World War, with many serving to challenge and deprivilege assumptions inherent within dominant narratives. These ‘counter’-narratives illustrated the vastness of the category of military victims, many of whom, such as civilians or enemy soldiers, are absent from mainstream commemorative discourse, and whose existence serves to undermine notions of militarism. Many instances of ‘deviancy’ in mainstream thought became in this context simply an alternative perspective, which ultimately facilitated the broadening of knowledge concerning the enactment of British remembrance over the last century.

Take for instance the existence of the white poppy, a symbol denigrated by newspaper texts in the sample as disrespectful and a direct contributor to the suffering of military victims, such as disabled ex-servicemen or the bereaved. Digital texts provide expansive contextual information, highlighting that the white poppy was itself incepted by ex-servicemen and relatives of the war dead as a commitment to peace, remembrance of all victims of war, and as a direct challenge to a militaristic ideology. Digital texts also highlight the existence of otherwise marginalised individuals, such as dissenting ex-servicemen, conscientious objectors, soldiers ‘shot at dawn’, or soldiers severely disfigured as a result of their military service.

Alongside an expanded purview regarding representation of military victims, the digital texts collected from the UKWA also provided access to the thoughts and feelings of the average British citizen, many of which clash with mainstream declarations of consensus and unity. Message boards and amateur websites serve as a medium for dissenting viewpoints, exhibiting the democratising power of the internet. Ultimately, the UKWA provided a much fuller picture of remembrance than the one evident in mainstream media, providing a platform for individuals who have not featured at the forefront of commemoration over the last century, but are nevertheless integral components in wider British narratives of war.

Second World War 80th Anniversary

Seven years on from the centenary of the First World War, we now find ourselves approaching the culmination of the first decade of the second century of British remembrance, and at the apex of the Second World War’s 80th anniversary, concluding in the commemoration of the victories in Europe and Japan; VE Day on 8th May, and VJ Day on 15th August respectively. Preceding the commemoration of victory in 1945, we have also seen other major historical moments of the Second World War commemorated since 2019, such as the Battle of Britain and the D-Day landings.

Thus far, these tentpole national commemorative events have largely been celebrations of victory rather than meditations on the destructive nature of war. Unlike the First, which has largely been portrayed in popular culture as a futile endeavour, the Second World War stands apart as a just war, a struggle between good and evil. In recent years, it is through the lens of the Second World War that official narratives of war in Britain have been constructed, providing a useful template in the guise of which previous and later conflicts can be presented. Such a foregrounding of a single war in mainstream narratives can result in the depoliticising and decontextualising of conflict, providing an ahistorical view of war as a natural and inevitable continuum. While responses to the First World War during its centenary did in part deal with the ambiguous nature of its necessity in being fought, the Second World War is far more widely accepted as entirely justified, as a national struggle for survival. While there is no doubt that defeat of fascism is a cause worthy of celebration, it must not serve to enable a sanitisation of war’s reality by colouring our perception of conflict overall.

The 80th anniversary of the Second World War may well enable dominant narratives of war to become further entrenched in the national psyche, particularly as more and more individuals with first-hand experience of total war pass from the population. For a new generation of Britons, whose primary connection to war is through the mass media, and indeed commemorative events, there is a real danger that a sanitised and depoliticised view of warfare will become the norm, especially through an ever more celebratory depiction adopted by mainstream commemorative initiatives.

As with the First World War centenary, this is where the vital role of repositories such as the UKWA can come into play, providing alternative viewpoints upon the topic of war and ensuring that a wide variety of voices are heard, rather than obscured by the fanfare of national enterprises. In light of the 80th anniversaries of VE and VJ Day, the UKWA will curate a special collection documenting events and activities relating to the end of the Second World War, and invite the public to directly submit relevant websites by emailing [email protected].

Through the creation of such a collection, the UKWA will secure an invaluable repository of digital texts, which will not only serve as a preservation of an important historical event, but also as a vital resource for future scholars. Provided will be a unique insight into national forms of commemoration alongside those enacted by individuals and local communities. Digital texts held in the UKWA collections were central to my own research, offering a window into otherwise marginalised and unseen discourses, demonstrating the vast breadth of public responses to and enactments of remembrance in Britain since the end of the First World War. I hope that, moving forwards into this second century of commemoration, the UKWA’s important work will continue, facilitating significant reflection on remembrance for future generations.

Dr Liam Markey is a Research Associate at the University of Liverpool’s Department of Sociology, Social Policy and Criminology. He completed his PhD in collaboration with the British Library in 2023 and is currently working on a British Academy funded project exploring ethical digital public histories of prisoners and the legacy of enslavement in Georgia, USA.

[email protected]

LastPosts.blog

23 April 2025

Web Archives Collections as Data at the Digital Humanities in the Nordic and Baltic Countries (DHNB) Workshop Report

By Helena Byrne, Curator of Web Archives

DHNB 2025 Conference Banner
DHNB 2025 Conference Banner

The UK Web Archive was one of five web archive organisations represented in the Web Archive Collections as Data workshop held at the Digital Humanities in the Nordic and Baltic Countries (DHNB) 2025 conference at the National Museum of Estonia in Tartu. The UK Web Archive has participated in the 2025, 2024 and 2023 DHNB conferences. The workshop was organised by Olga Holownia, Senior Programme Officer at the International Internet Preservation Consortium (IIPC). It served as an introduction to web archives and web archives collections as data, with a focus on use cases but also on the challenges related to producing, sharing and publishing collections as data.

The first stage of the workshop gave a brief overview of the collections as data movement within the GLAM sector, and introduced the Collections as Data Checklist developed by members of the GLAM Labs community. It also introduced what web archives are and where you can access them, how a selection of web archives are making their collections available as data as well as what are the potential research opportunities for these collections. The panel included Olga Holownia (IIPC), Gustavo Candela (University of Alicante), Helena Byrne (British Library), Jon Carlstedt Tønnessen (National Library of Norway), Anders Klindt Myrvoll (Royal Danish Library), Sophie Ham and Steven Claeyssens (KB, National Library of the Netherlands). 

The UK Web Archive presentation promoted the recently published Datasheets for Web Archives Toolkit and the new metadata data sets that are available through the British Library Research Repository. The presentation gave an overview of how the project started, the background to how the Toolkit was prepared and how it was implemented.

Web Archives Collections as Data Workshop at DHNB 2025. Photographer: Helena Byrne & Carmen Kurg.

 

The activity stage of the workshop focused on how we could adapt the Collections as Data Checklist for web archives. The participants were split into three groups. They reviewed the checklist by asking whether it is applicable to web archives, how it could be adapted where it does not fit, and what solutions could be developed to overcome some of the challenging sections of the checklist. There was a rich discussion amongst the groups, which also benefited from having both researchers and library professionals involved in reviewing the checklist.

Web Archives Collections as Data Workshop at DHNB 2025. Photographer: Carmen Kurg.

The general consensus from the groups was that more detail is needed to accompany the Checklist before it can be applied to web archive collections. Some of the points on the Checklist are particularly difficult to apply to web archive collections. There was a lot of discussion on the first two points, which cover licensing and citation. These are particularly difficult for web archives because of national legislation: most web archives operate on a dark or grey access model, and most onsite terminals used to access web archives have copy and paste functions disabled, so citation can become problematic. However, the participants were positive about the potential to apply an annotated or adapted Collections as Data Checklist specifically for web archives. The brainstorming session at this workshop was the first step in starting a discussion about what resources are needed to improve the process of publishing web archive collections as data. The second of these discussions took place at the IIPC Web Archiving Conference in April 2025.

For a more general report from the DHNB conference, see the Digital Scholarship blog: https://blogs.bl.uk/digital-scholarship/2025/04/dhnb-2025-digital-humanities-in-the-nordic-and-baltic-countries-conference-report.html

25 November 2024

Datasheets for Web Archives Toolkit is now live

By Helena Byrne, Curator of Web Archives

Datasheets for Web Archives Toolkit banner with author names and logos
Datasheets for Web Archives Toolkit

Since autumn 2022, Emily Maemura from the University of Illinois and Helena Byrne from the UK Web Archive team at the British Library have been exploring how the Datasheets for Datasets framework, devised for machine learning by Gebru et al., could be applied to web archives. To explore the research question “can we use datasheets to describe the provenance of web archives, supporting research uses?”, a series of workshops was organised in 2023.

These workshops included a card sorting exercise with participants who had expertise in web archives as well as general information management. After the card sorting exercise there was a general discussion about using this framework to describe web archive collections.

These workshops formed the core of the guidance documentation in the Datasheets for Web Archives Toolkit, which is published in the British Library Research Repository.

The Toolkit

This Toolkit provides information on the creation of datasheets for web archives datasets. The datasheet concept is based on past work from Gebru et al. at Microsoft Research. The datasheet template and samples here were developed through a series of workshops with web archives curators, information professionals, and researchers during Spring and Summer 2023. The toolkit is composed of several parts including templates, examples, and guidance documents. Documents in the toolkit are available at a single DOI (https://doi.org/10.22020/rq8z-r112) and include:

  1. Toolkit Overview 
  2. Datasheets Question Guide
  3. Datasheet Blank Template

Implementation 

The UK Web Archive has implemented this framework to publish data sets from its curation software, the W3 Annotation Curation Tool (ACT). These data sets are available to view in the UK Web Archive: Data folder in the British Library Research Repository. So far only a few collections have been published, but this number will grow over the coming months.

18 September 2024

Creating and Sharing Collection Datasets from the UK Web Archive

By Carlos Lelkes-Rarugal, Assistant Web Archivist

We have data, lots and lots of data, which is of unique importance to researchers, but presents significant challenges for those wanting to interact with it. As our holdings grow by terabytes each month, this creates significant hurdles for the UK Web Archive team who are tasked with organising the data and for researchers who wish to access it. With the scale and complexity of the data, how can one first begin to comprehend what it is that they are dealing with and understand how the collection came into being? 

This challenge is not unique to digital humanities. It is a common issue in any field dealing with vast amounts of data. A recent special report on the skills required by researchers working with web archives was produced by the Web ARChive studies network (WARCnet). This report, based on the Web Archive Research Skills and Tools Survey (WARST), provides valuable insights and can be accessed here: WARCnet Special Report - An overview of Skills, Tools & Knowledge Ecologies in Web Archive Research.

At the UK Web Archive, legal and technical restrictions dictate how we can collect, store and provide access to the data. To enhance researcher engagement, Helena Byrne, Curator of Web Archives at the British Library, and Emily Maemura, Assistant Professor at the School of Information Sciences at the University of Illinois Urbana-Champaign, have been collaborating to explore how and which types of datasets can be published. Their efforts include developing options that would enable users to programmatically examine the metadata of the UK Web Archive collections.

Thematic collections and our metadata

To understand this rich metadata, we first have to examine how it is created and where it is held.

Since 2005 we have used a number of applications, systems, and tools to curate websites. The most recent is the Annotation and Curation Tool (ACT), which enables authenticated users, mainly curators and archivists, to create metadata that define and describe targeted websites. ACT also helps users build collections around topics and themes, such as the UEFA Women's Euro England 2022. To build collections, ACT users first input basic metadata to build a record around a website, including information such as website URLs, descriptions, titles, and crawl frequency. Once this basic ACT record describing a website exists, additional metadata can be added, for example metadata that assigns a website record to a collection. One of the great features of ACT is its extensibility, allowing us, for instance, to create new collections.

These collections, which are based around a theme or an event, give us the ability to highlight archived content. The UK Web Archive holds millions of archived websites, many of which may be unknown or rarely viewed, and so to help showcase a fraction of our holdings, we build these collections which draw on the expertise of both internal and external partners.

Exporting metadata as CSV and JSON files

That’s how we create the metadata, but how is it stored? ACT is a web application and the metadata created through it is stored in a Postgres relational database, allowing authenticated users to input metadata in accordance with the fields within ACT. As the Assistant Web Archivist, I was given the task of extracting the metadata from the database and exporting each selected collection as a CSV and JSON file. To get to that stage, the Curatorial team first had to decide which fields were to be exported.

The ACT database is quite complex, in that there are 50+ tables which need to be considered. To enable local analysis of the database, a static copy is loaded into a database administration application, in this case, DBeaver. Using the free-to-use tool, I was able to create entity relationship diagrams of the tables and provide an extensive list of fields to the curators so that they could determine which fields are the most appropriate to export.

I then worked on a refined version of the list of fields, running a script for the designated Collection and pulling out specific metadata to be exported. To extract the fields and the metadata into an exportable format, I created an SQL (Structured Query Language) script which can be used to export results as JSON, CSV, or both:

SELECT
    taxonomy.parent_id AS "Higher Level Collection",
    collection_target.collection_id AS "Collection ID",
    taxonomy.name AS "Collection or Subsection Name",
    -- 4278 is treated as the main collection; every other ID is a subsection
    CASE
        WHEN collection_target.collection_id = 4278 THEN 'Main Collection'
        ELSE 'Subsection'
    END AS "Main Collection or Subsection",
    target.created_at AS "Date Created",
    target.id AS "Record ID",
    field_url.url AS "Primary Seed",
    target.title AS "Title of Target",
    target.description AS "Description",
    target.language AS "Language",
    target.license_status AS "Licence Status",
    target.no_ld_criteria_met AS "LD Criteria",
    target.organisation_id AS "Institution ID",
    target.updated_at AS "Updated",
    target.depth AS "Depth",
    target.scope AS "Scope",
    target.ignore_robots_txt AS "Robots.txt",
    target.crawl_frequency AS "Crawl Frequency",
    target.crawl_start_date AS "Crawl Start Date",
    target.crawl_end_date AS "Crawl End Date"
FROM
    collection_target
    INNER JOIN target ON collection_target.target_id = target.id
    LEFT JOIN taxonomy ON collection_target.collection_id = taxonomy.id
    LEFT JOIN organisation ON target.organisation_id = organisation.id
    INNER JOIN field_url ON field_url.target_id = target.id
WHERE
    collection_target.collection_id IN (4278, 4279, 4280, 4281, 4282, 4283, 4284) AND
    -- Keep only URLs whose position is 0 or unset
    (field_url.position IS NULL OR field_url.position IN (0))

JSON output example for the Women’s Euro Collection
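
The post does not show how the query results are written out as files. As a minimal sketch, assuming the rows have already been fetched from the database as dictionaries keyed by the column aliases in the SQL script, the Python standard library can produce both formats. The function name rows_to_csv_and_json and the sample row are hypothetical and are not taken from ACT or from the published data sets.

```python
import csv
import io
import json


def rows_to_csv_and_json(rows):
    """Serialise query result rows (a list of dicts) to CSV text and JSON text."""
    if not rows:
        return "", "[]"
    # Use the keys of the first row as the CSV header, in their original order.
    fieldnames = list(rows[0].keys())
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    # ensure_ascii=False keeps non-ASCII titles and descriptions readable.
    json_text = json.dumps(rows, indent=2, ensure_ascii=False)
    return buffer.getvalue(), json_text


# Hypothetical sample row for illustration only; not real archive data.
sample_rows = [
    {
        "Collection ID": 4278,
        "Record ID": 1,
        "Primary Seed": "https://www.example.org/",
        "Title of Target": "Example site",
        "Crawl Frequency": "MONTHLY",
    }
]

csv_text, json_text = rows_to_csv_and_json(sample_rows)
```

Writing both formats from the same list of rows keeps the CSV and JSON versions of a seed list consistent with each other.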

Accessing and using the data

The published metadata is available from the BL Research Repository within the UK Web Archive section, in the folder “UK Web Archive: Data”. Each dataset includes the metadata seed list in both CSV and JSON formats, a datasheet which gives provenance information about how the dataset was created, and a data dictionary that defines each of the data fields. The first collections selected for publication were:

  1. Indian Ocean Tsunami December 2004 (January-March 2005) [https://doi.org/10.23636/sgkz-g054]
  2. Blogs (2005 onwards) [https://doi.org/10.23636/ec9m-nj89] 
  3. UEFA Women's Euro England 2022 (June-October 2022) [https://doi.org/10.23636/amm7-4y46] 
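
Once a seed list has been downloaded, it can be examined programmatically. The sketch below is one possible starting point rather than an official workflow. It assumes the CSV column headers match the aliases used in the SQL script above (for example "Main Collection or Subsection"); the helper name count_by_column and the inline sample text are invented for illustration.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_text, column):
    """Count how many seed records share each value in the given column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


# Invented sample standing in for the contents of a downloaded seed list.
sample_csv = (
    "Record ID,Primary Seed,Main Collection or Subsection\n"
    "1,https://www.example.org/,Main Collection\n"
    "2,https://www.example.com/,Subsection\n"
    "3,https://www.example.net/,Subsection\n"
)

counts = count_by_column(sample_csv, "Main Collection or Subsection")
```

The same function can be pointed at other columns, such as "Crawl Frequency" or "Language", to get a first impression of how a collection was put together.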

31 July 2024

If websites could talk (part 6)

By Ely Nott, Library, Information and Archives Services Apprentice

After another extended break, we return to a conversation between UK domain websites as they try to parse out who among them should be crowned the most extraordinary…

“Where should we start this time?” asked Following the Lights. “Any suggestions?”

“If we’re talking weird and wonderful, clearly we should be considered first.” urged Temporary Temples, cutting off Concorde Memorabilia before they could make a sound.

“We should choose a website with a real grounding in reality.” countered the UK Association of Fossil Hunters.

“So, us, then.” shrugged the Grampian Speleological Group. “Or if not, perhaps the Geocaching Association of Great Britain?”

“We’ve got a bright idea!” said Lightbulb Languages, “Why not pick us?”

“There is no hurry.” soothed the World Poohsticks Championships, “We have plenty of time to think, think, think it over.”

“This is all a bit too exciting for us.” sighed the Dull Men’s Club, who was drowned out by the others.

“The title would be right at gnome with us.” said The Home of Gnome, with a little wink and a nudge to the Clown Egg Gallery, who cracked a smile.

“Don’t be so corny.” chided the Corn Exchange Benevolent Society. “Surely the title should go to the website that does the most social good?”

“Then what about Froglife?” piped up the Society of Recorder Players.

“If we’re talking ecology, we’d like to be considered!” the Mushroom enthused, egged on by Moth Dissection UK. “We have both aesthetic and environmental value.”

“Surely, any discussion of aesthetics should prioritise us.” preened Visit Stained Glass, as Old so Kool rolled their eyes.

The back and forth continued, with time ticking on until they eventually concluded that the most extraordinary site of all had to be… Saving Old Seagulls.

Check out previous episodes in this series by Hedley Sutton - Part 1, Part 2, Part 3, Part 4 and Part 5

 
