UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

17 posts categorized "Research collaboration"

25 November 2024

Datasheets for Web Archives Toolkit is now live

By Helena Byrne, Curator of Web Archives

Datasheets for Web Archives Toolkit Banner with author names and logos
Datasheets for Web Archives Toolkit

Since autumn 2022, Emily Maemura from the University of Illinois and Helena Byrne from the UK Web Archive team at the British Library have been exploring how the Datasheets for Datasets framework, devised for machine learning by Gebru et al., could be applied to web archives. To explore the research question “can we use datasheets to describe the provenance of web archives, supporting research uses?”, a series of workshops was organised in 2023.

These workshops included a card sorting exercise with participants who had expertise in web archives as well as general information management. The card sorting exercise was followed by a general discussion about using this framework to describe web archive collections.

These workshops formed the core of the guidance documentation in the Datasheets for Web Archives Toolkit, published in the British Library Research Repository.

The Toolkit

This Toolkit provides information on the creation of datasheets for web archives datasets. The datasheet concept is based on past work from Gebru et al. at Microsoft Research. The datasheet template and samples here were developed through a series of workshops with web archives curators, information professionals, and researchers during Spring and Summer 2023. The toolkit is composed of several parts including templates, examples, and guidance documents. Documents in the toolkit are available at a single DOI (https://doi.org/10.22020/rq8z-r112) and include:

  1. Toolkit Overview 
  2. Datasheets Question Guide
  3. Datasheet Blank Template

Implementation 

The UK Web Archive has implemented this framework to publish data sets from its curation software, the W3 Annotation Curation Tool (ACT). These data sets are available to view in the UK Web Archive: Data folder in the British Library Research Repository. So far just a few collections have been published, but the number will grow over the coming months.

18 September 2024

Creating and Sharing Collection Datasets from the UK Web Archive

By Carlos Lelkes-Rarugal, Assistant Web Archivist

We have data, lots and lots of data, which is of unique importance to researchers, but presents significant challenges for those wanting to interact with it. As our holdings grow by terabytes each month, this creates significant hurdles for the UK Web Archive team who are tasked with organising the data and for researchers who wish to access it. With the scale and complexity of the data, how can one first begin to comprehend what it is that they are dealing with and understand how the collection came into being? 

This challenge is not unique to digital humanities. It is a common issue in any field dealing with vast amounts of data. A recent special report on the skills required by researchers working with web archives was produced by the Web ARChive studies network (WARCnet). This report, based on the Web Archive Research Skills and Tools Survey (WARST), provides valuable insights and can be accessed here: WARCnet Special Report - An overview of Skills, Tools & Knowledge Ecologies in Web Archive Research.

At the UK Web Archive, legal and technical restrictions dictate how we can collect, store and provide access to the data. To enhance researcher engagement, Helena Byrne, Curator of Web Archives at the British Library, and Emily Maemura, Assistant Professor at the School of Information Sciences at the University of Illinois Urbana-Champaign, have been collaborating to explore how and which types of datasets can be published. Their efforts include developing options that would enable users to programmatically examine the metadata of the UK Web Archive collections.

Thematic collections and our metadata

To understand this rich metadata, we first have to examine how it is created and where it is held.

Since 2005 we have used a number of applications, systems, and tools to enable us to curate websites. The most recent is the Annotation and Curation Tool (ACT), which enables authenticated users, mainly curators and archivists, to create metadata that define and describe targeted websites. ACT also helps users build collections around topics and themes, such as the UEFA Women's Euro England 2022. To build collections, ACT users first input basic metadata to build a record around a website, including information such as website URLs, descriptions, titles, and crawl frequency. With this basic ACT record describing a website, additional metadata can be added, for example metadata that is used to assign a website record to a collection. One of the great features of ACT is its extensibility, allowing us, for instance, to create new collections.

These collections, which are based around a theme or an event, give us the ability to highlight archived content. The UK Web Archive holds millions of archived websites, many of which may be unknown or rarely viewed, and so to help showcase a fraction of our holdings, we build these collections which draw on the expertise of both internal and external partners.

Exporting metadata as CSV and JSON files

That’s how we create the metadata, but how is it stored? ACT is a web application and the metadata created through it is stored in a Postgres relational database, allowing authenticated users to input metadata in accordance with the fields within ACT. As the Assistant Web Archivist, I was given the task of extracting the metadata from the database, exporting each selected collection as a CSV and JSON file. To get to that stage, the Curatorial team first had to decide which fields were to be exported.

The ACT database is quite complex, in that there are 50+ tables which need to be considered. To enable local analysis of the database, a static copy is loaded into a database administration application, in this case, DBeaver. Using the free-to-use tool, I was able to create entity relationship diagrams of the tables and provide an extensive list of fields to the curators so that they could determine which fields are the most appropriate to export.
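As a rough illustration of how such a field list might be compiled, the Python sketch below walks every table in a database and lists its columns. It is not the process used for ACT: it runs against an in-memory SQLite database with two made-up tables as a stand-in, and the helper name `list_fields` is hypothetical. Against a Postgres database the same information would normally be read from `information_schema.columns` instead of SQLite's `PRAGMA table_info`.

```python
import sqlite3


def list_fields(conn):
    """Return {table_name: [column_name, ...]} for every table in the database."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    fields = {}
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        fields[table] = [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
    return fields


# Stand-in schema: these tables and columns are illustrative, not the real ACT schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target (id INTEGER PRIMARY KEY, title TEXT, crawl_frequency TEXT);
    CREATE TABLE taxonomy (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER);
""")

field_list = list_fields(conn)
for table, columns in field_list.items():
    print(f"{table}: {', '.join(columns)}")
```

A flat listing like this is easy to paste into a spreadsheet so that curators can mark which fields should be exported.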

I then worked on a refined version of the list of fields, running a script for the designated Collection and pulling out specific metadata to be exported. To extract the fields and the metadata into an exportable format, I created an SQL (Structured Query Language) script which can be used to export results in both JSON and/or CSV: 

SELECT
    taxonomy.parent_id AS "Higher Level Collection",
    collection_target.collection_id AS "Collection ID",
    taxonomy.name AS "Collection or Subsection Name",
    -- Label the top-level collection (ID 4278); every other ID is a subsection
    CASE
        WHEN collection_target.collection_id = 4278 THEN 'Main Collection'
        ELSE 'Subsection'
    END AS "Main Collection or Subsection",
    target.created_at AS "Date Created",
    target.id AS "Record ID",
    field_url.url AS "Primary Seed",
    target.title AS "Title of Target",
    target.description AS "Description",
    target.language AS "Language",
    target.license_status AS "Licence Status",
    target.no_ld_criteria_met AS "LD Criteria",
    target.organisation_id AS "Institution ID",
    target.updated_at AS "Updated",
    target.depth AS "Depth",
    target.scope AS "Scope",
    target.ignore_robots_txt AS "Robots.txt",
    target.crawl_frequency AS "Crawl Frequency",
    target.crawl_start_date AS "Crawl Start Date",
    target.crawl_end_date AS "Crawl End Date"
FROM
    collection_target
    INNER JOIN target ON collection_target.target_id = target.id
    LEFT JOIN taxonomy ON collection_target.collection_id = taxonomy.id
    LEFT JOIN organisation ON target.organisation_id = organisation.id
    INNER JOIN field_url ON field_url.target_id = target.id
WHERE
    -- The collection and its subsections
    collection_target.collection_id IN (4278, 4279, 4280, 4281, 4282, 4283, 4284)
    -- Keep only URLs with no position set or at position 0
    AND (field_url.position IS NULL OR field_url.position IN (0))

JSON Example
JSON output example for the Women’s Euro Collection
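To show how one query result can feed both export formats, here is a minimal Python sketch. It is an assumption-laden stand-in rather than the workflow used at the UK Web Archive: it uses an in-memory SQLite database instead of Postgres, a cut-down version of the query above, invented sample rows, and a hypothetical helper named `export_rows`.

```python
import csv
import io
import json
import sqlite3


def export_rows(conn, query, params=()):
    """Run a query and return its result set as (csv_text, json_text)."""
    cursor = conn.execute(query, params)
    # Column aliases from the SELECT become the CSV header row and the JSON keys.
    headers = [col[0] for col in cursor.description]
    rows = cursor.fetchall()

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows)

    records = [dict(zip(headers, row)) for row in rows]
    return buffer.getvalue(), json.dumps(records, indent=2, ensure_ascii=False)


# Stand-in data: two invented records, only one of which is in collection 4278.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE collection_target (collection_id INTEGER, target_id INTEGER);
    INSERT INTO target VALUES (1, 'Example site A'), (2, 'Example site B');
    INSERT INTO collection_target VALUES (4278, 1), (9999, 2);
""")

query = """
    SELECT collection_target.collection_id AS "Collection ID",
           target.id AS "Record ID",
           target.title AS "Title of Target"
    FROM collection_target
    INNER JOIN target ON collection_target.target_id = target.id
    WHERE collection_target.collection_id IN (?)
"""

csv_text, json_text = export_rows(conn, query, (4278,))
print(csv_text)
print(json_text)
```

Building both outputs from the same header list keeps the CSV column names and the JSON keys identical, which makes a single data dictionary sufficient for both files.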

Accessing and using the data

The published metadata is available from the BL Research Repository within the UK Web Archive section, in the folder “UK Web Archive: Data”. Each dataset includes the metadata seed list in both CSV and JSON formats, a datasheet which gives provenance information about how the dataset was created, and a data dictionary that defines each of the data fields. The first collections selected for publication were:

  1. Indian Ocean Tsunami December 2004 (January-March 2005) [https://doi.org/10.23636/sgkz-g054]
  2. Blogs (2005 onwards) [https://doi.org/10.23636/ec9m-nj89] 
  3. UEFA Women's Euro England 2022 (June-October 2022) [https://doi.org/10.23636/amm7-4y46] 
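Once a seed list has been downloaded, a few lines of Python are enough to start examining it. The sketch below is only an illustration: the sample rows are invented, and it assumes that the published CSV uses the column headings from the SQL aliases shown earlier (such as "Primary Seed" and "Crawl Frequency"), which should be checked against each dataset's data dictionary. The helper names are hypothetical.

```python
import csv
import io
from collections import Counter

# Invented sample rows standing in for a downloaded seed list CSV.
sample_csv = """Record ID,Primary Seed,Title of Target,Crawl Frequency
1,https://example.org/,Example site A,ANNUAL
2,https://example.com/news,Example site B,MONTHLY
3,https://example.net/,Example site C,ANNUAL
"""


def load_seed_list(text):
    """Parse CSV text into a list of dicts keyed by column heading."""
    return list(csv.DictReader(io.StringIO(text)))


def count_by(records, field):
    """Count how many records share each value of the given field."""
    return Counter(record[field] for record in records)


records = load_seed_list(sample_csv)
frequency_counts = count_by(records, "Crawl Frequency")
print(frequency_counts.most_common())
```

With a real file, `sample_csv` would be replaced by the contents of the downloaded CSV. If the JSON file is an array of objects, `json.load` should yield a comparable list of dictionaries that `count_by` can accept unchanged.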

22 May 2024

Reflections on the IIPC Early Scholars Spring School on Web Archives 2024

By Cameron Huggett, PhD Student (CDP), British Library/Teesside University

IIPC Early Scholars Spring School on Web Archives banner

My name is Cameron, and I am currently undertaking an AHRC funded Collaborative Doctoral Partnership (CDP) project, between the British Library and Teesside University. My research centres on racial discourses within association football fanzines and e-zines from c.1975 to the present, and aims to examine the broader connections between football fandom, race and identity. 

I attended the Early Scholars Spring School on Web Archives, prior to commencement of the conference, which allowed me to share knowledge with colleagues from a number of different countries, institutions and disciplines, offering new perspectives on my own research. Within this school, I was fortunate enough to be able to deliver a short lightning talk, outlining my own use of web archiving within my research into the history of racial discourses within football fanzines. This generated an engaging discussion around my methodologies and led me to reflect upon how quantitative techniques can be better adopted within historical research practices.

I also particularly enjoyed discovering more about the collections of the Bibliothèque Nationale de France (BNF) and Institut National de L'audiovisuel (INA). The scope of the collections and innovative user interfaces were particularly impressive. For example, INA had created a programme that allowed the user to view a collection item, such as an election debate broadcast, alongside archived tweets relating to the event in real time.

My primary takeaway was how web archives can be innovatively employed to record the breadth and depth of online communities and discourses, as well as supplement more traditional sources within a historian’s research framework.

28 July 2023

UK-Ireland Digital Humanities Association Launch Event Report from the British Library

By Helena Byrne, Curator of Web Archives, Frankie Perry, Music Manuscripts and Archives Cataloguer and Stella Wisdom, Digital Curator for Contemporary British Collections

UK-Ireland Digital Humanities Association Launch Event Banner with event details
UK-Ireland Digital Humanities Association Launch Event Banner

The First Annual Event for the UK-Ireland Digital Humanities Association took place on 29th and 30th June 2023 at Senate House, University of London as well as online. The Association “aims to build a collaborative vision for the field, and create new and sustainable long-term partnerships in alignment with the international community”. The programme, set across one and a half days, covered a wide variety of topics and included an opportunity for the Community Interest Groups to meet up.

The British Library was involved in four presentations either as an individual presentation or as part of a collaborative project. In this blog post we hear back from the British Library colleagues who attended.

Helena Byrne, Curator of Web Archives

I was involved in two collaborative presentations with Sharon Healy (Maynooth University) and Juan-José Boté-Vericad (Universitat de Barcelona). Our first presentation was a lightning talk on day one called 'Finding Web Archives under the ‘Big Tent’ of DH: A Case Study of Ireland and the UK'. This presented one element of a forthcoming chapter in a WARCnet edited collection on web archiving. It reviewed the provision of web archiving in postgraduate information management and digital humanities courses in Britain and Ireland. Our second presentation was part of Panel #2 on day two called 'The Potential of a Reborn Digital Archival Edition for Collating a Corpus of Archived Web Materials'. This presentation outlined a methodology for researchers without coding skills to select, collate and analyse a corpus of archived websites.

The highlight for me was Panel #3, especially the presentation 'Towards a Critical Black Digital Humanities: A Critical Librarian’s Response' by Naomi L.A Smith (University of West London). This presentation and the discussion that followed highlighted some of the challenges as well as some of the positive action steps that can be taken to ensure digital humanities research is more inclusive. 

Frankie Perry, Postdoctoral Research Assistant, InterMusE project, University of York / Music Manuscripts and Archives Cataloguer, British Library

I gave a paper with Prof. Rachel Cowgill (University of York) who is Principal Investigator on the InterMusE project – a collaborative venture between musicologists, computer scientists, and archive and library specialists funded by the AHRC’s UK-US New Directions for Digital Scholarship in Cultural Institutions programme. The British Library is an institutional partner, with Dr Rupert Ridgewell (Lead Curator, Printed Music) as Co-Investigator; the universities of Swansea and Illinois at Urbana-Champaign are further partners, and we’re also working with the University of Waikato. In our paper, we introduced the complexities of sourcing, digitising, and piecing together ephemera relating to historical musical events (e.g. concert programmes, flyers, newspaper reviews), using as our case study materials relating to the British Music Society (1918-1933) and its regional centres and branches. We showed the interface of the digital archive built for the project, which uses a combination of the Greenstone Digital Library system, the Mirador Annotation Viewer, and the SimpleAnnotationServer to make materials browsable, searchable, and interactive for musicologists and community users alike.

I really enjoyed the event and the snapshot it provided into current digital humanities research and techniques. I especially enjoyed a paper by Orla Delaney (Cambridge) on 'Database ethnography and the museum object record', and one by Lisa Griffith (Digital Repository of Ireland) and Laura Molloy (CODATA) titled 'Pathways to collaboration – creating and sharing GLAM image collections as data'.

Stella Wisdom, Digital Curator for Contemporary British Collections

My lightning talk 'Collaborating to Curate and Exhibit Complex Digital Literature' reflected on the cooperation between curators, researchers, experimental writers and creative practitioners to plan and produce the British Library’s Digital Storytelling exhibition (2 June 2023 - 15 October 2023). This hands-on display explores the ways that digital innovations have transformed and enhanced our narrative experiences. It showcases eleven examples of electronic literature that invite readers to become a part of the story themselves, through interactive narratives that respond to user input, reading experiences influenced and personalised by data feeds, and works that draw from multiple platforms and audience participation to create immersive story worlds. Preparing and in some cases modifying these interactive works to display them in a public gallery has only been possible through practical collaborations between Library staff and the writers and games studios who created these digital stories. I shared some insights from my experience of this co-curation work and encouraged attendees to visit the exhibition.

It was a pleasure to meet a number of people in real life whom I had only previously spoken with online. A personal highlight was hearing Reham Hosny from the University of Cambridge and Minia University speak about 'DH and E-Lit Communities: Intersectional Perspectives'. In the refreshment breaks at this event I chatted with Reham about her novel, Al-Barrah (The Announcer), and she demonstrated to me how both augmented reality and hologram technologies work with the printed book to immerse readers in this thought-provoking narrative.

03 July 2023

RESAW 2023 Conference Report from the UK Web Archive

By Cui Cui, Bodleian Libraries/University of Sheffield Information School; Nicola Bingham and Helena Byrne, British Library; Alice Austin, Edinburgh University.

RESAW 2023 Exploring the Archived Web During a Highly Transformative Age

2023 was the fifth RESAW conference. RESAW stands for Research Infrastructure for the Study of Archived Web Materials. Established in 2012, it aims to promote a collaborative European research infrastructure for the study of archived web materials and holds a conference every two years. The 2023 conference was held in Marseille on June 5-6 under the theme ‘Exploring the Archived Web During a Highly Transformative Age’. There was a packed programme with a number of UK-based presentations, especially from the UK Web Archive teams based at the Bodleian Libraries, the British Library and the Archive of Tomorrow project partner, the University of Edinburgh.

The keynote presentations from the conference were streamed live and the recording of the day two keynote ‘Saving Ukrainian Cultural Heritage Online' by Sebastian Majstorovic (European University Institute) is available on the Inspé Aix-Marseille YouTube channel.

In this blog post participants from the UK Web Archive teams have reported back on their conference experience.

Bodleian Libraries/University of Sheffield Information School 

Cui Cui, Web Archivist / PhD researcher

The experience of presenting two papers in the fifth RESAW conference turned out to be a highly emotional one for me. The first presentation alongside my fellow web archivist, Alice Austin from University of Edinburgh, marked the end of the Archive of Tomorrow project. The opportunity provided me with a chance to reflect on the work we carried out for the project. The second presentation concluded the initial phase of my PhD research project on participatory web archiving. Presenting at the conference compelled me to summarise the findings from a survey I delivered last year, aiming to gain insights into the current practices of participatory web archiving. This experience not only marked a significant milestone, but also served as a starting point to bring theories and practices together to develop better web archives. 

During a panel discussion titled “Interrogating the logics of web archiving in the era of platformization”, Jessica Ogden, Katie Mackinnon and Emily Maemura posed some critical questions about web archiving practices. Who are we collecting for, what shall we collect and how can we approach this process ethically? They particularly put content creators at the centre of considerations and challenged web archivists to critically reflect on our practices and ethical considerations. It is reassuring that we are not alone in grappling with these complex issues as web archivists. These questions echo the constant dilemmas we face as web archivists. In particular, the Archive of Tomorrow project highlighted the double-bind situations we encountered when dealing with ethical considerations and piloted engagement work with content creators. From both researchers’ and archivists’ perspectives, it is clear that these concerns call for more evidence-based studies and a deeper understanding of the views held by content creators and a wide range of other stakeholders.

Overall, the RESAW conference provided a thought-provoking experience. It allowed me to reflect on our work, consolidate my understanding, and recognise the need for continued efforts to address these complex issues.

British Library

Nicola Bingham, Lead Curator of Web Archives

I felt very privileged to attend this conference at the Mucemlab in Marseille, set in the courtyard of Fort Saint-Jean, with a stunning mix of old and new architecture and amazing sea views. During the conference, I found numerous presentations informative, engaging, thought-provoking and humorous. Two in particular, however, sparked profound reflections on curatorial praxis within the context of my own work.

Henrik Smith-Sivertsen took the audience on a captivating journey into the world of digital music archiving. With a focus on three distinct songs, he illustrated how the mediascapes in which they were published have a significant impact on the archiving process. Through his exploration, he highlighted the challenges of capturing and preserving complex digital objects from social media platforms and streaming services. The question of which version(s) to capture became a pivotal point of discussion, raising awareness of the dynamic nature of digital music and the evolving digital landscape it resides in. A thought-provoking video presentation showcased the different online iterations of Lukas Graham's "7 Years" from 2015. The variations in platforms, remixes, and user-generated content surrounding this song demonstrated the diverse ways in which music proliferates and evolves online. The presentation served as a powerful reminder of the challenges faced by archivists when attempting to capture and preserve such dynamic and multi-faceted digital musical artefacts.

Tiancheng Leo Cao from the University of Texas at Austin's intriguing paper focused on the changing meanings of openness within the museum context. He shed light on the gradual shift from an institution-oriented understanding to an access-oriented interpretation, prioritising the needs and participation of the public. I was struck by how this ideology parallels our thinking in the UK Web Archive where efforts are being made to embed more participation in the curatorial process. By involving communities, ensuring diverse perspectives, and including multiple voices, heritage organisations can create a more inclusive and representative platform for preserving our digital heritage.

Helena Byrne, Curator of Web Archives 

This was my second time attending a RESAW conference. The first I attended was in 2017 as part of the Web Archiving Week event held in London, when the IIPC Web Archiving Conference and RESAW collaborated on organising a full week of web archiving activities. At RESAW 2023 I co-presented two presentations, both on day two of the conference. These were both collaborations that came out of the WARCnet network. The first was a joint presentation with Emily Maemura (University of Illinois) where we fed back some initial findings from the series of workshops we facilitated on ‘Describing Collections with Datasheets for Datasets’. The second was a joint presentation with Sharon Healy (Maynooth University) on ‘Assessing the Scholarly Use of Web Archives in Ireland’. In this presentation we highlighted a section from a much larger report that will be published as part of the WARCnet Papers and Special Reports.

A key highlight for me in the programme was the session 'Building the Next Generation of Web Archive Analysis Service'. This panel gave an overview of the development of the Archives Unleashed project from 2017. The project is now winding up and will be supported by the Internet Archive who will be releasing a subscription service to Archives Research Compute Hub (ARCH) this summer. I've been lucky enough to attend Archives Unleashed events in 2017 and 2019 so it was really great to see how the project has changed over time. I wish the Archives Unleashed team all the best.

University of Edinburgh

Alice Austin, Web Archivist

The Archive of Tomorrow project team took two papers to RESAW this year. The first was a deep-dive into the Trans Health sub-category within the Talking About Health collection. The second, presented jointly with my fellow web archivist Cui Cui of the Bodleian Libraries, delivered a condensed version of the project’s Final Report, and reflected on the challenges, wins and losses of the project as a whole.

A few related themes emerged from this year’s papers. A number of speakers reflected on the value of the archived web as a source for ‘bottom-up’ perspectives on the impact of online spaces in the development of narratives at a personal and social level. Arguing that the events of 9/11 galvanised emerging web archiving efforts, Ian Milligan’s paper explored how the resultant archived pages provide a rich source for future historians wanting to understand how that day evolved; Dana Diminescu’s paper on the archive of the ‘Comme a la maison’ platform examined how changes in the language of hospitality used online can reflect changes in societal understanding of the migrant experience; and Anya Shchetvina’s paper discussed how web-based communication objects can become recontextualised as memory objects.

Another theme concerned how to do web archiving in an age of ‘platformisation’. A trio of papers by Emily Maemura, Jess Ogden, and Kate MacKinnon explored this in detail, raising important questions about how web archiving practices might better serve the communities that they draw from. Camille Riou considered the vulnerability of data in a capitalist world in the context of the withdrawal of Twitter’s API for academic research, and Cade Diehm and Benjamin Royer of the New Design Congress presented an excellent overview of the sector’s readiness to grapple with issues of the polycrisis such as colonialism, privatisation and datafication. 

The sixth RESAW Conference will be held in 2025 at University of Siegen in Germany. The theme for the conference is ‘Histories of the Datafied Web: Infrastructures, metrics, aesthetics’. More details about the conference and the call for papers will be announced in due course. 

28 June 2023

IIPC Web Archiving Conference 2023 Report from the UK Web Archive

By Nicola Bingham, Helena Byrne, Ian Cooke, Carlos Lelkes-Rarugal, Andrew Jackson and Richard Price, British Library; Leontien Talboom, Cambridge University Library; Mark Simon Haydn, National Library of Scotland.

IIPC WAC2023 Conference Banner with details of the online and in person conference details.
IIPC WAC2023 Conference Banner

The IIPC 2023 Web Archiving Conference was hosted by the Netherlands Institute of Sound and Vision in Hilversum and co-organised by KB, National Library of the Netherlands. There was an online session held on May 3rd and the main in-person event took place on May 11th and 12th. There was a packed programme that included Q&A sessions for pre-recorded presentations on the online day, and presentations, workshops, lightning talks and posters at the in-person event. This was the first in-person IIPC conference since 2019, when the event was hosted by the National and University Library in Zagreb (NSK), Croatia.

Many UK Web Archive colleagues from Bodleian Libraries, the British Library, Cambridge University Library and National Library of Scotland attended the conference both as delegates and presenters. In this blog post they have reported back on their conference experience.

British Library

Nicola Bingham, Lead Curator of Web Archiving

Attending the IIPC conference in person for the first time since 2019 was a great experience. The combination of reconnecting with colleagues after four long years and the (literally) colourful ambience of the Beeld & Geluid (Institute for Sound & Vision), created an atmosphere brimming with renewed energy and optimism. I will highlight just a few of the presentations and conversations that were interesting from my point of view.

I enjoyed hearing about De Digitale Stad Herleeft (the Digital City Revived) from Marleen Stikker, founder and ‘mayor’ of DDS, Marieke Brugman of UNESCO and Tjarda de Haan, Bits and Bytes United. Presentations focused on the “webarchaeological excavations” which took place to reconstruct, preserve, store and make accessible this unique digital heritage based on KB’s XS4ALL web collection, which was listed as UNESCO Memory of the World Heritage for the Dutch list and is now under review for the worldwide list.

I enjoyed insights into diversity and co-curation from Jesper Verheof, a Researcher-in-Residence at KB working on "Mapping the Dutch LGBT+ Web Archive". Jesper's work utilises KB's collections to explore the unique web sphere formed by LGBTQ+ - or queer people - and how this evolved over time. It sparked intriguing insights and perspectives which could be applied to our own LGBTQ+ collection.

Collaboration and innovation in web archiving were recurring themes at the conference. Valuable insights were shared by the team from the Library of Congress, emphasising their investment in and education of curators to effectively participate in the web archiving process. 

Finally, I had the privilege of presenting the research by WG2 of the WARCnet project, ‘Surveying the Landscape of COVID-19 Web Collections in European GLAM Institutions’ in a session dedicated to Covid-19 collections. Our findings shed light on the scope of these collections, how they were defined, and the common challenges institutions face in making them accessible for research purposes. 

Helena Byrne, Curator of Web Archives 

I participated in both the online and in-person events, as a collaborator in a presentation on the online day and as co-facilitator of a workshop at the in-person event. I was involved in the ‘Developing a Reborn Digital Archival Edition as an Approach for the Collection, Organisation, and Analysis of Web Archive Sources’ project with Sharon Healy (Maynooth University) and Juan-José Boté-Vericad (Universitat de Barcelona). Along with Emily Maemura (University of Illinois) I facilitated Workshop-01 ‘Describing Collections with Datasheets for Datasets’. This was part of a series of workshops we hosted to see if the Datasheets for Datasets framework could be applied to UK Web Archive collections published as data.

As a participant I found so many great takeaways from this conference. One of the sessions that stands out most for me is ‘Renewal in Web Archiving: Towards More Inclusive Representation and Practices’, held on day two of the conference. The conversations in this session were really useful in helping me ensure that we continue to develop more inclusive collections and opportunities to engage in the curation process. In this session we heard about the next steps for the Archiving the Black Web (ATBW) project. Although this is a USA-based project, its impact will be global as they are currently developing a training programme to improve the curation and research use of the archived black web.

Andrew Jackson, Web Archive Technical Lead

I was involved in a couple of tool workshops during the conference, where it was great to see the interest in shared tooling, and the collaborative commitments this implies. I was also interested in how many of the presentations related to issues around information literacy. For more, see my blog post Reflections on the IIPC Web Archiving Conference 2023.

Ian Cooke, Head of Contemporary British & Irish Publications

This year’s conference was a strong reminder that web archiving is about people - the people whose lives and experiences are expressed in the collections we build; the people whose imaginations shaped the way we use, and have used, the web over time; and the people who are working across collecting, preserving and researching the archived web.

There was a great mix of presentations, blending new developments in technologies, evolving research methods, and approaches to creating and understanding collections, in ways that were accessible to all attendees. Giulia Carla Rossi and I were both pleased to talk about the development of our practice at the British Library, and legal deposit libraries, in collecting ‘emerging formats’.  

The IIPC itself is celebrating its 20th year, and the conference reflected that sense of celebration. It also demonstrated the maturing of practice, and reflection on web archiving methods and goals, at many of the organisations represented. A highlight of the conference was the presentations by Makiba Foster and Zakiya Collier on the Archiving the Black Web project, and the potential of web archiving to contribute to ‘black self-education practices, collective study and librarianship’. Foster and Collier argued for well-resourced institutions to take responsibility for providing support to community heritage organisations in building inclusive collections, and also stressed the need for ethical considerations, in particular regarding the rights of people represented within collections, when building collections.        

Overall, it was a privilege to take part in the conference and to have the time to connect in person with a community of web archive practitioners and researchers, being able to share knowledge and experience and reminding ourselves of what we have in common.

Carlos Lelkes-Rarugal, Assistant Web Archivist

I very much enjoyed my second IIPC annual web archiving conference; 2019 was my first, so I didn’t quite know what to expect. Suffice to say, the 2023 WAC was just as successful, and another enjoyable, unique experience.

There’s such a diverse background of people. I think this is because web archiving is approached very differently, as each organisation has its particular way of going about it, which is why there is such an emphasis on sharing knowledge and information. I attended many talks and learnt about new methods of quality assurance, the infrastructure set-up of institutions, and policies on collecting; whichever presentation it is, you can be sure there’s something innovative going on that could be applied to your domain.

The UK Web Archive itself represents the six UK Legal Deposit Libraries, and as such, we’re inherently maintaining relationships but more importantly trying to build new relationships for new opportunities, collaborations, and potential partnerships. We’re a small team (larger than others) but still relatively small when considering the scope of our work, and I think this is exactly what the IIPC can help with. Like many organisations, the UK Web Archive does at times find web archiving to be a challenge, and as such, the IIPC helps foster a network of people who are willing to share their knowledge and expertise so that we can connect with them to tackle these emerging and ever-evolving challenges. There’s a collective effort to further web archiving; we’re trying to advance a field that has a lot of potential, so if you’re interested, please join this invaluable community.

Richard Price, Head of Contemporary British Collections

I attended this conference to reacquaint myself with web archiving in a little more detail than I have for some years. It was a privilege to attend, seeing so many different kinds of response from the international community and, if I may say so, I felt especially proud of my colleagues at the British Library for their presentations and workshops. If there was a common thread through the papers it was that the problem-solving and information-sharing intrinsic to the web archiving community are values translated from the early days of the web itself – that substantial part of the early Internet that was altruistic and public-minded – and, in today’s archiving world, underpinned by layers of technical, social, and curatorial expertise. Thank you to IIPC and to Sound and Vision at Hilversum, and to all those presenting and attending!

Cambridge University Library

Leontien Talboom, Technical Analyst

This was my first time attending IIPC apart from a very brief appearance on a panel in 2022. I was fortunate enough to be a co-presenter on two talks during the conference. One was with my colleague Mark Haydn where we presented on the datasets that we were able to create during the Archive of Tomorrow project and the other was with my colleague Caylin Smith where we explored the difficulties and opportunities of capturing the University of Cambridge domain. 

Both presentations were really enjoyable and it was great to get feedback and questions from colleagues across our field. As this was my first time attending IIPC I wasn’t sure what to expect. However, I was pleasantly surprised by the wide range of topics and formats discussed. One that really stood out to me was the work of Emily Escamilla who talked about reference rot and what would happen if GitHub was to disappear. This really showcased how much as an academic sector we rely on these types of sources to be around when referencing them, but this is not necessarily a given. 

National Library of Scotland

Mark Haydn, Metadata Analyst

It has been a few years since I've been at an in-person conference, and I had forgotten how nice it can be to visit another city and spend a few days immersed in presentations and conversations with people working in the same area. Sometimes this meant hearing about something immediately relevant to my own metadata work at the National Library of Scotland, like hearing Tom Storrar of the UK Government Web Archive assess how effective their work ramping up collecting early in the pandemic to capture frequent website updates had been, or listening to members of the ResPaDon Project detail their experiences extending regional access to web archives collections across France. Other presentations served as an opportunity to better understand topics being explored further afield: there were many demonstrations of potential uses of AI, not all of them ominous, ranging from automatically producing descriptive summaries of technical metadata, for use in Library of Congress catalogue records, to generating a generic Stirring Plenary Speech at short notice.

As well as listening in, my colleague Leontien Talboom and I presented some of our work on the Archive of Tomorrow project, summarising the progress that's been possible since the development of the British Library's web archive metadata export. We heard about other institutional and international approaches and platforms for looking at web archives at scale, like Archive-It's ARCH tools, and caught fellow Archive of Tomorrow web archivist Cui Cui's discussion of knowledge sharing before heading back to the UK.

The 2024 IIPC Web Archive Conference will be hosted by the Bibliothèque nationale de France (BnF) 24-26 April. Follow the IIPC Twitter account for updates and the call for papers due out in early autumn.

19 October 2021

Clouds and blackberries: how web archives can help us to track the changing meaning of words

By Dr Barbara McGillivray (Turing Fellow), Pierpaolo Basile (Assistant Professor in Computer Science, University of Bari), Dr Marya Bazzi (Turing Fellow) and Dr Jenny Basford, Jason Webber (British Library)

NOTE: This is a re-blog from the Alan Turing Institute, with permission.

The meaning of words changes all the time. Think of the word ‘blackberry’, for example, which has been used for centuries to refer to a fruit. In 1999, a new brand of mobile devices was launched with the name BlackBerry. Suddenly, there was a new way of using this old word. ‘Cloud’ is another example of a well-established word whose association with ‘cloud computing’ only emerged in the past couple of decades. Linguists call this phenomenon ‘semantic change’ and have studied its complex mechanisms for a long time. What has changed in recent years is that we now have access to huge collections of data which can be mined to find these changes automatically. Web archives are a great example of such collections, because they contain a record of the changing content of web pages.

But how can we automatically detect in a huge web archive when a word has changed its meaning? A common strategy is to build geometric representations of words called word embeddings. Word embeddings use lots of data about the context in which words are used so that similar words can be clustered together. We can then do operations on these embeddings, for example to find the words that are closest (and most similar in meaning) to a given word. It’s a useful technique, but building embeddings takes a lot of computing power. Having access to pre-trained embeddings can therefore make a big difference, enabling those in the scientific community without sufficient computational resources to participate in this research.
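To make the idea of "closest" words concrete, here is a minimal sketch of a nearest-neighbour lookup using cosine similarity. The three-dimensional vectors and the helper names are invented for illustration; they are not taken from any real embedding set, where vectors typically have hundreds of dimensions and are learned from text.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only;
# real embeddings are learned from large amounts of text.
EMBEDDINGS = {
    "blackberry": [0.9, 0.1, 0.0],
    "raspberry": [0.8, 0.2, 0.1],
    "jam": [0.7, 0.3, 0.0],
    "phone": [0.1, 0.9, 0.2],
    "email": [0.0, 0.8, 0.3],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(word, embeddings, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    target = embeddings[word]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


print(nearest_neighbours("blackberry", EMBEDDINGS))
```

With these toy vectors, where 'blackberry' sits near the fruit words, the lookup returns the fruit-related neighbours rather than the device-related ones.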

A team of researchers from The Alan Turing Institute and the Universities of Bari, Oxford and Warwick, in collaboration with the UK Web Archive team based at the British Library, has now released DUKweb, a set of large-scale resources that make pre-trained word embeddings freely available. Described in this article, DUKweb was created from the JISC UK Web Domain Dataset (1996-2013), a collection of all .uk websites archived by the Internet Archive between 1996 and 2013. (This dataset is held and maintained by the UK Web Archive, which has been collecting websites since 2005, initially on a selective basis and since 2013 at a whole domain level.) DUKweb contains 1.3 billion word occurrences and two types of word embeddings for each year of the JISC UK Web Domain Dataset. The size of DUKweb is 330GB.

Researchers can use DUKweb to study semantic change in English between 1996 and 2013, looking at, for instance, the effects of the growth of the internet and social media on word meanings. For example, if the word ‘blackberry’ is used mostly to refer to fruits in 1996 and to mobile phones in 2000, the 1996 embedding for this word will be quite different from its 2000 embedding. In this way, we can find words that may have changed meaning in this time period. The figure below (from Tsakalidis et al., 2019) shows four words whose contexts of use have changed in the last couple of decades: ‘blackberry’, ‘cloud’, ‘eta’ and ‘follow’. The bars indicate words most similar to these four words in 2000 (red bars) and in 2013 (blue bars). The scale along the bottom gives a measure of the change.

figure 02 - analysis - clouds, blackberries
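The comparison of a word's embedding in one year against its embedding in another can be sketched as follows. This is an illustration under stated assumptions, not the method used to build DUKweb or the figure: the two-dimensional vectors are invented, the function names are hypothetical, and the vectors for different years are assumed to live in one shared space so that they can be compared directly.

```python
import math

# Hypothetical per-year vectors, invented for illustration. They are assumed
# to share one vector space, so vectors from different years are comparable.
VECTORS_BY_YEAR = {
    2000: {"blackberry": [0.9, 0.1], "cloud": [0.8, 0.2], "river": [0.7, 0.3]},
    2013: {"blackberry": [0.2, 0.9], "cloud": [0.4, 0.8], "river": [0.7, 0.3]},
}


def cosine_distance(a, b):
    """1 minus cosine similarity: 0 means same direction, larger means more different."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_change(vectors_by_year, year_a, year_b):
    """Rank words present in both years from most to least changed."""
    shared = vectors_by_year[year_a].keys() & vectors_by_year[year_b].keys()
    scores = {
        word: cosine_distance(vectors_by_year[year_a][word], vectors_by_year[year_b][word])
        for word in shared
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


for word, score in rank_by_change(VECTORS_BY_YEAR, 2000, 2013):
    print(f"{word}: {score:.3f}")
```

Words near the top of the ranking are candidates for semantic change; a high score is only a signal worth inspecting, since a shift in context of use is not always a shift in meaning.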

The resources that underpin DUKweb are hosted on the British Library’s research repository, and are available for anyone in the world to download, reuse and repurpose for their own projects. This repository is part of the BL’s Shared Research Repository for cultural heritage organisations, which brings together the research outputs produced by participating institutions, and makes them discoverable to anybody with an internet connection. Providing a stable, dedicated location to hold heritage datasets in order to share them with a wider research community has been one of the key drivers in the implementation and development of this repository service. We are grateful to the British Library’s Repository Services team for supporting this collaboration between the UK Web Archive team and the Turing by making the content for DUKweb available.

Read the paper: DUKweb: diachronic word representations from the UK Web Archive corpus

 

11 November 2020

How Remembrance Day has Changed

By Liam Markey, PhD Student, University of Liverpool and the British Library

This blog examines how attitudes to Remembrance (or Armistice) day have changed and evolved over the course of the 20th century and beyond. Read the previous blog on 'Militarism and its role in the commemoration of British war dead' for background on the wider research project.

100 Years
2020 marks 100 years since the erection of a permanent Cenotaph at Whitehall and the interment of the Unknown Warrior in his tomb at Westminster. Along with the 2-minute silence, which was first observed in 1919, and the adoption of the poppy as the symbol of British commemoration in 1921, these practices have been ever present over the past century; they have become intrinsic components of the British collective identity in what is, arguably, a relatively short period of time.

Alleviating suffering and grief
Initially, this commemoration of the dead of the First World War performed two distinct purposes: firstly, practices served to alleviate the suffering of those who had lost loved ones. The bodies of the fallen were not repatriated, so the erection of monuments extolling the sacrifices of the war dead served as focal points of grief and mourning in local communities. Secondly, Remembrancetide (the time of year in which British rituals of commemoration are enacted) was initially a period in which support for disabled ex-servicemen, and those left widowed or orphaned by the First World War, was to be generated. Through the sale of poppies or direct donations, the British public was able to provide financial support for those in need. Collective mourning, such as at the Cenotaph where the monarchy and politicians gathered, was a demonstration of unity and a national thanksgiving to the war dead.

Attitudes to commemoration are not static
Whilst commemorative practices have remained practically unchanged over the past 100 years (only the day on which they are observed has been altered, and for the duration of the Second World War national services were suspended), the same cannot be said for the historical context in which they have been enacted, nor for the thoughts and ideals of those who enact them.

Newspaper Analysis
Analysing the Daily Mail and Daily Mirror newspapers, I have been able to create a small “pseudo” historiography of British attitudes towards commemoration throughout the 20th Century. The text samples from the two newspapers that I have examined range from the 7th-14th November at ten-year intervals starting in 1928 and contain at least one mention of the terms “Armistice” or “Remembrance.” The choice to search within this temporal parameter and for these specific terms was a conscious decision made so as to ensure that texts relating to both Armistice Day and Remembrance Sunday were collected and available for analysis. The interval between samples was a deliberate choice so that each text is taken from a year in which a tenth anniversary of the First World War took place and, in theory, when coverage of the war in the media would be at a heightened state.
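The sampling rule described above can be expressed as a simple filter. The sketch below is illustrative rather than the tooling actually used: the function name is hypothetical, the end year of 1998 is inferred from the samples discussed in this post, and treating the 7th and 14th as inclusive and matching the terms case-insensitively are both assumptions.

```python
import re
from datetime import date

# Ten-year intervals starting in 1928; the 1998 end point is an assumption
# based on the samples discussed in this post.
SAMPLE_YEARS = range(1928, 1999, 10)

# Case-insensitive matching of whole words is an assumption.
TERMS = re.compile(r"\b(armistice|remembrance)\b", re.IGNORECASE)


def in_sample(published: date, text: str) -> bool:
    """True if an article falls inside the sampling window and mentions a term."""
    in_year = published.year in SAMPLE_YEARS
    # 7th to 14th November, treated here as inclusive at both ends.
    in_window = published.month == 11 and 7 <= published.day <= 14
    return in_year and in_window and TERMS.search(text) is not None


print(in_sample(date(1928, 11, 11), "Crowds gather for Armistice Day"))
print(in_sample(date(1929, 11, 11), "Crowds gather for Armistice Day"))
```

Under these assumptions the first call matches, while the second is excluded because 1929 is not one of the sampled years.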

1928
The first text sample is taken from 1928, the ten-year anniversary of the signing of the Armistice in 1918, and provides the largest number of texts from any year. This is for the most part due to the fact that the First World War was a relatively recent event at this point in time. The main emphasis of these texts is on how the British public can aid those left disabled by their experience of the First World War, either through donations to the British Legion’s poppy appeal or by directly purchasing goods made by ex-servicemen. The issue of ‘lasting peace’ is also brought up several times, with many believing that ten years having passed without another World War proves that the cause so many British soldiers died fighting for was not in vain. At this point in time, when commemoration was in many ways an expression of a commitment to peace, the majority of the British public seemed convinced that it was fulfilling its purpose.

1938
However, by 1938 the mood had shifted considerably. With another conflict looming there is less conviction in proclamations of the First World War having achieved this lasting peace. There is an increase in articles discussing the possibility of another war in the near future and the failings of the last 20 years in maintaining peace. There is a palpable anxiety present in the coverage of both the Mail and Mirror as British society faces the stark realisation that the lasting peace so many died for between 1914-18 is on the verge of dissolution.

1948
By 1948 this anxiety had yet to subside, and despite another recent victory over Germany and her allies there is little celebration or indication that the Second World War had done a better job in achieving peace than the First had done, as too little time had yet passed. This sample provides a much smaller number of texts concerned with commemoration, and I am drawn to Jay Winter’s assertion that societies following the Second World War struggled to make sense of the carnage they had experienced as an explanation as to why this was the case:

The limits of language had been reached; perhaps there was no way adequately to express the hideousness and scale of the cruelties of the 1939-1945 war. (Winter, 1995, p.9)

In the wake of the First World War, commemorative practices were conceived so as to soothe the suffering of the bereaved and to attach value and meaning to the sacrifice of the war dead. The aftermath of the Second World War resulted in a disillusionment with this previous tradition as commemoration hinged on the maintenance of peace. Now it was clear that the ‘peace’ so many died to attain was a fiction, and perhaps the lack of coverage in this text sample is demonstrative of a contextual detachment felt in British society towards the commemoration of war. The overarching theme displayed by this text sample is that of a society disillusioned with the concept of war commemoration, yet perceived slights to tradition, such as “gigglers” at Whitehall, are still harshly condemned. Despite there being no overt celebration of the war dead, or victory in the two World Wars present in either paper, it is clear that the bare minimum of traditional commemorative practices were to still be respected and observed.

1958
The texts from 1958 greatly resemble those of 1928, where it was believed that a sufficient period of time had passed since the ending of the First World War and thus it was acceptable to again assert that lasting peace had been achieved. There are a few texts that discuss this idea of lasting peace, specifically one in the Daily Mail titled What a Difference 27 Years Make, which argues that the contrast between the present and 1931, both being 13 years removed from a World War, proves that society is on the right track to avoiding another global conflict.

Another important focus of texts from this period is the issue of the “200,000,” the last remaining veterans of the First World War, and what is perceived to be a lack of financial support from the government as they enter the later stages of their lives. After 1948, where overt reference to ex-servicemen in the texts was absent, this year’s sample brings them back to the fore, reminiscent once more of 1928’s sample. The difference here, however, is that the texts collected prior to the Second World War focused on those ex-servicemen who had been left disabled by their experiences of the First World War. In 1958, media coverage encompasses all ex-servicemen from the First World War due to their age: now that 40 years have passed since the Armistice, veterans’ advanced age means they are all regarded as vulnerable and in need of assistance from the public, be they disabled as a result of the war or not.

1968 and 1978
Both the 1968 and 1978 samples offer an insight into changing attitudes to the First World War in British society. The British mythology of the conflict that is firmly planted in modern popular imagination has its roots in the 1960s and 70s, when a number of influential pieces of media were produced that transformed attitudes to the First World War.

Evident in both text samples is the widening divide between older and younger generations and their attitudes towards the commemoration of war, and wider ideas regarding the relevance of traditional commemorative rituals considering how much time had passed since the Armistice. Both newspapers wrestle with the idea that commemorative practices have become outdated and appeal only to a small minority of the population with personal connections to the First World War, with it being described as “too sentimental” to some. Despite these growing objections, large crowds are still in attendance at remembrance services, many of whom, as the Daily Mirror points out, are young people. These decades depict the future of commemorative tradition as being somewhat in doubt; with the Second World War receding into history, and the First even more so, there is a real feeling in the texts that the commemorative traditions conceived in the wake of the Armistice had started to become outdated.

1988 & 1998
By the late 1980s British interest in commemoration seems to have been reinvigorated, perhaps in no small part due to the Falklands Conflict of 1982, with both the 1988 and 1998 texts bearing a more nationalistic tone than previous samples. With the First World War having all but passed from living memory, emphasis in the texts shifts from the personal stories of those who were directly affected by the conflict towards a more abstract concept of commemoration as an almost celebration of Britishness. Both newspapers in 1988 contain adverts from the British Legion that describe the observance of traditional commemorative practices as a “National Debt,” and especially in the Daily Mail there is a vast increase in articles containing inflammatory and accusatory language directed at those who are not 100% committed to participation. In 1998, meanwhile, the question of whether today’s youth are willing to die for their nation is repeated numerous times throughout Remembrancetide in the Daily Mail.

21st Century
Leading into the 21st Century there is a sense that the initial meaning behind commemoration, which sought to provide support for those mourning the deaths of loved ones, has become outdated now that lived experience of the First World War has passed from the British population. There is a real danger that the language and symbols that vindicated the sacrifice of the war-dead in the wake of the conflict are more likely to inspire militaristic notions in the present day.

Poppies in a field

Summary
Although brief, this piece has, I hope, demonstrated to some degree the fluid nature of British attitudes to commemoration in the 20th Century, and how these attitudes are somewhat representative of wider historical and social change. As my research moves forward it will be most interesting to see the relationship between ‘micro’ discourses and those disseminated by the British media.

Resources such as the UK Web Archive will prove invaluable in exploring these ‘bottom up’ approaches to commemoration, asking how language and symbols popularised in the wake of the First World War, such as the Remembrance Poppy, are reproduced within amateur online remembrance projects and how this usage potentially relates to issues such as nationalism and militarism. Often, mainstream representations of Remembrance focus on the unifying nature of commemoration, and it will be interesting to see whether analysis of materials produced by the average British citizen challenges or confirms this narrative.

UKWA First World War centenary collection - 900+ archived websites (or pages).
