Digital scholarship blog

Enabling innovative research with British Library digital collections


23 December 2024

AI (and machine learning, etc) with British Library collections

Machine learning (ML) is a hot topic, especially when it’s hyped as ‘AI’. How might libraries use machine learning / AI to enrich collections, making them more findable and usable in computational research? Digital Curator Mia Ridge lists some examples of external collaborations, internal experiments and staff training with AI / ML and digitised and born-digital collections.

Background

The trust that the public places in libraries is hugely important to us - all our 'AI' should be 'responsible' and ethical AI. The British Library was a partner in Sheffield University's FRAIM: Framing Responsible AI Implementation & Management project (2024). We've also used lessons from the projects described here to draft our AI Strategy and Ethical Guide.

Many of the projects below have contributed to our Digital Scholarship Training Programme and our Reading Group has been discussing deep learning, big data and AI for many years. It's important that libraries are part of conversations about AI, supporting AI and data literacy and helping users understand how ML models and datasets were created.

If you're interested in AI and machine learning in libraries, museums and archives, keep an eye out for news about the AI4LAM community's Fantastic Futures 2025 conference at the British Library, 3-5 December 2025. If you can't wait that long, join us for the 'AI Debates' at the British Library.

Using ML / AI tools to enrich collections

Generative AI tends to get the headlines, but at the time of writing, tools that use non-generative machine learning to automate specific parts of a workflow have more practical applications for cultural heritage collections. That is, 'AI' is currently more process than product.

Text transcription is a foundational task that makes digitised books and manuscripts more accessible to search, analysis and other computational methods. For example, oral history staff have experimented with speech transcription tools, raising important theoretical and practical questions about automatic speech recognition (ASR) tools and chatbots.
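To give a concrete sense of what these experiments involve, here is a minimal transcription sketch using the open-source Whisper model. Whisper and the filename are illustrative assumptions; the post does not name the specific tools our oral history colleagues tested.

```python
# A minimal ASR sketch using the open-source openai-whisper package.
# The tool choice and the audio filename are illustrative assumptions,
# not the specific software tested by oral history staff.
import whisper

model = whisper.load_model("base")           # small general-purpose model
result = model.transcribe("interview.mp3")   # hypothetical oral history audio
print(result["text"])                        # plain-text transcript
```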

We've used Transkribus and eScriptorium to transcribe handwritten and printed text in a range of scripts and alphabets.

Creating tools and demonstrators through external collaborations

Mining the UK Web Archive for Semantic Change Detection (2021)

This project used word vectors with web archives to track words whose meanings changed over time. Resources: DUKweb (Diachronic UK web) and blog post ‘Clouds and blackberries: how web archives can help us to track the changing meaning of words’.

Graphs showing how words associated with the words blackberry, cloud, eta and follow changed over time.
From blackberries to clouds... word associations change over time
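As a rough sketch of the technique: train separate word embeddings on different time slices of a corpus and compare a word's nearest neighbours across the slices; a shift in neighbours signals a shift in meaning. The toy corpora and gensim below are illustrative assumptions, not the project's actual pipeline.

```python
# Sketch of diachronic semantic change detection with word vectors.
# Tiny toy corpora stand in for time slices of the UK web archive;
# gensim's Word2Vec is an assumption, not the project's exact stack.
from gensim.models import Word2Vec

docs_2000 = [["picked", "a", "blackberry", "from", "the", "hedge"]] * 50
docs_2015 = [["checked", "email", "on", "my", "blackberry", "phone"]] * 50

m_2000 = Word2Vec(docs_2000, vector_size=50, min_count=1, seed=1)
m_2015 = Word2Vec(docs_2015, vector_size=50, min_count=1, seed=1)

# A word's change in meaning shows up as a change in its nearest neighbours.
print(m_2000.wv.most_similar("blackberry", topn=3))
print(m_2015.wv.most_similar("blackberry", topn=3))
```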

Living with Machines (2018-2023)

Our Living With Machines project with The Alan Turing Institute pioneered new AI, data science and ML methods to analyse masses of newspapers, books and maps to understand the impact of the industrial revolution on ordinary people. Resources: short video case studies, our project website, final report and over 100 outputs in the British Library's Research Repository.

Project outputs drew on AI / machine learning / data science methods such as lexicon expansion, computer vision, classification and word embeddings.

Tools and demonstrators created via internal pilots and experiments

Many of these examples were enabled by the skills and enthusiasm for ML experimentation of in-house Research Software Engineers and the Living with Machines (LwM) team, combined with Library staff's long-term knowledge of collections, records and processes.

British Library resources for re-use in ML / AI

Our Research Repository includes datasets suitable for ground truth training, including 'Ground truth transcriptions of 18th & 19th century English language documents relating to botany from the India Office Records'.

Our ‘1 million images’ on Flickr Commons have inspired many ML experiments.

The Library has also shared models and datasets for re-use on the machine learning platform Hugging Face.

13 December 2024

Looking back on the Data Science Accelerator

From April to July this year an Assistant Statistician at the Cabinet Office and a Research Software Engineer at the British Library teamed up as mentee (Catherine Macfarlane, CO) and mentor (Harry Lloyd, BL) for the Data Science Accelerator. In this blog post we reflect on the experience and what it meant for us and our work.

Introduction to the Accelerator

Harry: The Accelerator has been around since 2015, set up as a platform to ‘accelerate’ civil servants at the start of their data science journey who have a project addressing a business need and a real willingness to learn. Successful applicants are paired with mentors from across the Civil Service who have experience in techniques applicable to the problem, working together for one protected day a week for 12 weeks. I was lucky enough to be a mentee in 2020, working on statistical methods to combine different types of water quality data, and my mentor Charlie taught me a lot of what I know. The programme played a huge role in the development of my career, so it was a rewarding moment to come back as a mentor for the April cohort.

Catherine: On joining the Civil Service in 2023, I had the pleasure of becoming part of a talented data team that has motivated me to continually develop my skills. My academic background in Mathematics with Finance provides me with a strong theoretical foundation, but I am striving to improve my practical abilities. I am particularly interested in Artificial Intelligence, which is gaining increasing recognition across government, sparking discussions on its potential to improve efficiency.

I saw the Data Science Accelerator as an opportunity to deepen my knowledge, address a specific business need, and share insights with my team. The prospect of working with a mentor and immersing myself in an environment where diverse projects are undertaken was particularly appealing. A significant advantage was the protected time this project offered - a rare benefit! I was grateful to be accepted and paired with Harry, an experienced mentor who had already completed the programme. Following our first meeting, I felt ready to tackle the upcoming 12 weeks to see what we could achieve!

Photo of the mentee and mentor on a video call
With one of us based in England and the other in Scotland, virtual meetings were the norm. Collaborative tools like screen sharing and GitHub allowed us to work together effectively.

The Project

Catherine: Our team is interested in the annual reports and accounts of Arm’s Length Bodies (ALBs), a category of public bodies funded to deliver a public or government service. The project addressed the challenge my team faces in extracting the highly unstructured information stored in annual reports and accounts. With this information we would be able to enhance the data validation process and reduce the burden on other teams of commissioning data from ALBs. We proposed using Natural Language Processing to retrieve this information, analysing and querying it using a Large Language Model (LLM).

Initially, I concentrated on extracting five features, such as full-time equivalent staff in the organisation, from a sample of ALBs across 13 departments for the financial year 2022/23. After discussions with Harry, we decided to use Retrieval-Augmented Generation (RAG) to develop a question-answering system. RAG is a technique that combines LLMs with relevant external documents to improve the accuracy and reliability of the output. This is done by retrieving documents that are relevant to the questions asked and then asking the LLM to generate an answer based on the retrieved material. We carefully selected a pre-trained LLM while considering ethical factors like model openness.

How a retrieval-augmented generation (RAG) system works. A document in this context is a segmented chunk of a larger text that can be parsed by an LLM.
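To make that flow concrete, here is a minimal sketch of the retrieve-then-generate loop. TF-IDF retrieval and the `ask_llm` placeholder are illustrative assumptions; the actual project used a carefully selected pre-trained LLM and its own document store.

```python
# Minimal retrieval-augmented generation (RAG) loop.
# TF-IDF retrieval and the ask_llm placeholder are illustrative
# assumptions; the chunk texts are invented sample data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "The body employed 1,250 full-time equivalent staff in 2022/23.",
    "Total expenditure for the year was £41.2 million.",
    "The chair of the board is appointed by the sponsoring department.",
]

vectoriser = TfidfVectorizer().fit(chunks)
chunk_vecs = vectoriser.transform(chunks)

def retrieve(question, k=2):
    """Return the k chunks most similar to the question."""
    q_vec = vectoriser.transform([question])
    scores = cosine_similarity(q_vec, chunk_vecs)[0]
    return [chunks[i] for i in scores.argsort()[::-1][:k]]

def ask_llm(prompt):
    # Hypothetical stand-in for a call to a pre-trained LLM.
    return f"[LLM answer based on prompt: {prompt[:60]}...]"

question = "How many full-time equivalent staff does the body have?"
context = "\n".join(retrieve(question))
print(ask_llm(f"Answer using only this context:\n{context}\n\nQ: {question}"))
```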

The first four weeks focused on exploratory analysis, data processing, and labelling, all completed in R, which was essential for preparing the data for input into the language model. The subsequent stages involved model building and evaluation in Python, which required the most time and focus. This was my first time using Python, and Harry’s guidance was extremely beneficial during our pair coding sessions. A definite highlight for me was seeing the pipeline start to generate answers!

To bring all our results together, I created a dashboard in Shiny, ensuring it was accessible to both technical and non-technical audiences. The final stage involved summarising all our hard work from the past 12 weeks in a 10-minute presentation and delivering it to the Data Science Accelerator cohort.

Harry: Catherine’s was the best-planned project of the ones I reviewed, and I suspected she’d be well placed to make best use of the 12 weeks. I wasn’t wrong! We covered a lot of the steps involved in good reproducible analysis. The exploratory work gave us a great sense of the variance in the data, setting up quantitative benchmarks for the language model results drove our development of the RAG system, and I was so impressed that Catherine managed to fit in building a dashboard on top of all of that.

Our Reflections

Catherine: Overall this experience was fantastic. In a short amount of time, we managed to achieve a considerable amount. It was amazing to develop my skills and grow in confidence. Harry was an excellent mentor; he encouraged discussion and asked insightful questions, which made our sessions both productive and enjoyable. A notable highlight was visiting the British Library! It was brilliant to have an in-person session with Harry and meet the Digital Research team.

A key success of the project was meeting the objectives we set out to achieve. Patience was crucial, especially when investigating errors and identifying the root problem. The main challenge was managing such a large project that could be taken in multiple directions. It can be natural to spend a long time on one area, such as exploratory analysis, but we ensured that we completed the key elements that allowed us to move on to the next stage. This balance was essential for the project's overall success.

Harry: We divided our days between time for Catherine to work solo and pair programming. Catherine is a really keen learner, and I think this approach helped her drive the project forward while giving us space to cover foundational programming topics and a new programming language. My other role was keeping an eye on the project timeline. Giving the occasional steer on when to stick with something and when to move on helped (I hope!) Catherine to achieve a huge amount in three months. 

Dashboard
A page from the dashboard Catherine created in the last third of the project.

Ongoing Work

Catherine: Our team recognises the importance of continuing this work. I have developed an updated project roadmap, which includes utilising Amazon Web Services to enhance the speed and memory capacity of our pipeline. Additionally, I have planned to compare various large language models, considering ethical factors, and I will collaborate with other government analysts involved in similar projects. I am committed to advancing this project, further upskilling the team, and keeping Harry updated on our progress.

Harry: RAG, and the semantic rather than keyword search that underlies it, represents a maturation of LLM technology that has the potential to change the way users search our collections. Anticipating that this will be a feature of future library services platforms, we have a responsibility to understand more about how these technologies will work with our collections at scale. We’re currently carrying out experiments with RAG and the linked data of the British National Bibliography to understand how searching like this will change the way users interact with our data.
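A small sketch of the difference between semantic and keyword search: embed records and queries into a shared vector space and rank by similarity, so a query can match records that share none of its keywords. The sentence-transformers library and model name below are assumptions; our British National Bibliography experiments are not tied to this stack.

```python
# Semantic search sketch: rank records by embedding similarity
# rather than keyword overlap. Model choice and records are
# illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

records = [
    "Moby-Dick; or, The Whale. Herman Melville, 1851.",
    "A treatise on the cultivation of orchids. 1885.",
    "Twenty Thousand Leagues Under the Seas. Jules Verne, 1870.",
]
query = "novels about the ocean"  # shares no keywords with the matches

rec_emb = model.encode(records, convert_to_tensor=True)
q_emb = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(q_emb, rec_emb)[0]
for score, rec in sorted(zip(scores.tolist(), records), reverse=True):
    print(f"{score:.2f}  {rec}")
```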

Conclusions

Disappointingly the Data Science Accelerator was wound down by the Office for National Statistics at the end of the latest cohort, citing budget pressures. That has made us one of the last mentor/mentee pairings to benefit from the scheme, which we’re both incredibly grateful for and deeply saddened by. The experience has been a great one, and we’ve each learned a lot from it. We’ll continue to develop RAG at the Cabinet Office and the British Library, and hope to advocate for and support schemes like the Accelerator in the future!

26 November 2024

Working Together: The UV Community Sprint Experience

How do you collaborate on a piece of software with a community of users and developers distributed around the world? Lanie and Saira from the British Library’s Universal Viewer team share their recent experience with a ‘community sprint’... 

Back in July, digital agency Cogapp tested the current version of the Universal Viewer (UV) against Web Content Accessibility Guidelines (WCAG) 2.2 and came up with a list of suggestions to enhance compliance.  

As accessibility is a top priority, the UV Steering Group decided to host a community sprint - an event focused on tackling these suggestions while boosting engagement and fostering collaboration. Sprints are typically internal, but the community sprint was open to anyone from the broader open-source community.

Zoom call showing participants
18 participants from 6 organisations teamed up to make the Universal Viewer more accessible - true collaboration in action!

The sprint took place for two weeks in October. Everyone brought unique skills and perspectives, making it a true community effort.

Software engineers worked on development tasks, such as improving screen reader compatibility, fixing keyboard navigation problems, and enhancing element visibility. Testing engineers ensured functionality, and non-technical participants assisted with planning, translations and management.

The group had different levels of experience, which made it important to provide a supportive environment for learning and collaboration.  

The project board at the end of the Sprint - not every issue was finished, but the sprint was still a success with over 30 issues completed in two weeks.

Some of those involved shared their thoughts on the sprint: 

Bruce Herman - Development Team Lead, British Library: 'It was a great opportunity to collaborate with other development teams in the BL and the UV Community.'

Demian Katz - Director of Library Technology, Villanova University: 'As a long-time member of the Universal Viewer community, it was really exciting to see so many new people working together effectively to improve the project.'

Sara Weale - Head of Web Design & Development, Llyfrgell Genedlaethol Cymru - National Library of Wales: 'Taking part in this accessibility sprint was an exciting and rewarding experience. As Scrum Master, I had the privilege of facilitating the inception, daily stand-ups, and retrospective sessions, helping to keep the team focused and collaborative throughout. It was fantastic to see web developers from the National Library of Wales working alongside the British Library, Falvey Library (Villanova University), and other members of the Universal Viewer Steering Group.

This sprint marked the first time an international, cross-community team came together in this way, and the sense of shared purpose and camaraderie was truly inspiring. Some of the key lessons I took away from the sprint was the need for more precise task estimation, as well as the value of longer sprints to allow for deeper problem-solving. Despite these challenges, the fortnight was defined by excellent communication and a strong collective commitment to addressing accessibility issues.

Seeing the team come together so quickly and effectively highlighted the power of collaboration to drive meaningful progress, ultimately enhancing the Universal Viewer for a more inclusive future.'

BL Test Engineers: 

Damian Burke: 'Having worked on UV for a number of years, this was my first community sprint. What stood out for me was the level of collaboration and goodwill from everyone on the team. How quickly we formed into a working agile team was impressive. From a UV tester's perspective, I learned a lot from using new tools like Vercel and exploring GitHub's advanced functionality.'

Alex Rostron: 'It was nice to collaborate and work with skilled people from all around the world to get a good number of tickets over the line.'

Danny Taylor: 'I think what I liked most was how organised the sprints were. It was great to be involved in my first BL retrospective.'

Miro board with answers to 'what went well during this sprint?' and positive reactions to 'how I feel after the sprint'
A Miro board was used for Sprint planning and the retrospective – a review meeting after the Sprint where we determined what went well and what we would improve for next time.

Experience from the sprint helped us to organise a further sprint within the UV Steering Group for admin-related work, aimed at improving documentation to ensure clearer processes and better support for contributors. Looking ahead, we're planning to release UV 4.1.0 in the new year, incorporating the enhancements we've made - we’ll share another update when the release candidate is ready for review.

Building on the success of the community sprint, we're excited to make these collaborative efforts a key part of our strategic roadmap. Join us and help shape the future of UV!

06 November 2024

Recovered Pages: Crowdsourcing at the British Library

Digital Curator Mia Ridge writes...

While the British Library works to recover from the October 2023 cyber-attack, we're putting some information from our currently inaccessible website into an easily readable and shareable format. This blog post is based on a page captured by the Wayback Machine in September 2023.

Crowdsourcing at the British Library

Screenshot of the Zooniverse interface for annotating a historical newspaper article
Example of a crowdsourcing task

For the British Library, crowdsourcing is an engaging form of online volunteering, supported by digital tools that manage tasks such as transcription, classification and geolocation, making our collections more discoverable.

The British Library has run several popular crowdsourcing projects in the past, including the Georeferencer, for geolocating historical maps, and In the Spotlight, for transcribing important information about historical playbills. We also integrated crowdsourcing activities into our flagship AI / data science project, Living with Machines.


Crowdsourcing Projects at the British Library

  • Living with Machines (2019-2023) created innovative crowdsourced tasks, including tasks that asked the public to closely read historical newspaper articles to determine how specific words were used.
  • Agents of Enslavement (2021-2022) used 18th/19th century newspapers to research slavery in Barbados and create a database of enslaved people.
  • In the Spotlight (2017-2021) was a crowdsourcing project from the British Library that aimed to make digitised historical playbills more discoverable, while also encouraging people to closely engage with this otherwise less accessible collection of ephemera.
  • Canadian wildlife: notes from the field (2021), a project where volunteers transcribed handwritten field notes that accompany recordings of a wildlife collection within the sound archive.
  • Convert a Card (2015) was a series of crowdsourcing projects aimed at converting scanned catalogue cards in Asian and African languages into electronic records. The project template can be found and used on GitHub.
  • Georeferencer (2012 - present) enabled volunteers to create geospatial data from digitised versions of print maps by adding control points to the old and modern maps.
  • Pin-a-Tale (2012) asked people to map literary texts to British places.


Research Projects

The Living with Machines project included a large component of crowdsourcing research through practice, led by Digital Curator Mia Ridge.

Mia was also the Principal Investigator on the AHRC-funded Collective Wisdom project, which worked with a large group of co-authors to produce a book, The Collective Wisdom Handbook: perspectives on crowdsourcing in cultural heritage, through two 'book sprints' in 2021:

This book is written for crowdsourcing practitioners who work in cultural institutions, as well as those who wish to gain experience with crowdsourcing. It provides both practical tips, grounded in lessons often learned the hard way, and inspiration from research across a range of disciplines. Case studies and perspectives based on our experience are woven throughout the book, complemented by information drawn from research literature and practice within the field.

More Information

Our crowdsourcing projects were designed to produce data that can be used in discovery systems (such as online catalogues and our item viewer) through enjoyable tasks that give volunteers an opportunity to explore digitised collections.

Each project involves teams across the Library to supply digitised images for crowdsourcing and ensure that the results are processed and ingested into various systems. Enhancing metadata through crowdsourcing is considered in the British Library's Collection Metadata Strategy.

We previously posted on Twitter as @LibCrowds and currently post occasionally on Mastodon (https://glammr.us/@libcrowds) and via our newsletter.

Past editions of our newsletter are available online.

14 October 2024

Research and Development activities in the Qatar Programme Imaging Team

This blog post is by members of the Imaging Team at British Library/Qatar Foundation Partnership (BLQFP) Programme: Eugenio Falcioni (Imaging and Digital Product Manager), Dominique Russell, Armando Ribeiro and Alexander Nguyen (Senior Imaging Technicians), Selene Marotta (Quality Management Officer), Matthew Lee and Virginia Mazzocato (Senior Imaging Support Technicians).

The Imaging Team has played a pivotal role in the British Library/Qatar Foundation Partnership (BLQFP) Programme since its launch in 2012. However, the journey has not been without hurdles. In October 2023, the infamous cyber-attack on the British Library severely disrupted operations across the organisation, impacting the Imaging Team profoundly. Inspired by the Library's Rebuild & Renew Programme, we used this challenging period to focus on research and development, refining our processes and deepening our understanding of the studio’s work practices. 

At the time of the attack, we were in the process of recruiting new members of the team who brought fresh energy, expertise, and enthusiasm. This also coincided with the appointment of a new Studio Manager. The formation of this almost entirely new team presented challenges as we adapted to the Library's disrupted environment. Yet, our synergy and commitment led us to find innovative ways of working.  Although the absence of an IT infrastructure, and therefore imaging hardware and software, posed significant difficulties for day-to-day activities in photography and digitisation, we had the time to focus on continuous improvement, without the usual pressures of deadlines. We enhanced our digitisation processes and expertise through a combination of quality improvements, strategic collaborations, and the development of innovative tools. Through teamwork and perseverance, we transformed adversity into an opportunity for growth. 

As an Imaging Team, we aim to create the optimal digital surrogate of the items we capture. The BLQFP has defined parameters for imaging which specify criteria such as colour accuracy and resolution, ensuring compliance with International Imaging Standards (such as FADGI or ISO 19264).

During this unusual time, we focused on research and development into imaging standards, and updated our guidelines, resulting in a 150-page document detailing our workflow. This has improved consistency between setups and photographers, and has been fundamental in training new staff. We engaged in skills sharing workshops with Imaging Services, the Library’s core imaging department, and Heritage Made Digital (HMD), the Library’s department that manages digitisation workflows. 

Over the months, we have tested our images and setup: cameras, lighting, and colour targets, all while shooting directly to camera cards and using a laser measuring device to check resolution (PPI). As a result of this work, we feel more confident in producing images that conform to International Imaging Standards, capturing images that truly represent the collection items.

A camera stand with a bound volume with a colour target ruler on top and a laser device next to it.
Colour target on a bound volume
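The resolution check itself reduces to a simple calculation: the pixel width of the capture divided by the physical width measured, converted to inches. A sketch, with illustrative numbers:

```python
# Resolution (PPI) check: pixels across the frame divided by the
# physical width captured, in inches. The example values are
# illustrative, not actual studio measurements.
def pixels_per_inch(pixel_width: int, physical_width_mm: float) -> float:
    """PPI = pixel width of the image / physical width in inches."""
    inches = physical_width_mm / 25.4
    return pixel_width / inches

# e.g. an 11,600-pixel-wide capture of a 400 mm field of view
print(round(pixels_per_inch(11600, 400)))  # ≈ 737 PPI
```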

Alongside our testing, we arranged visits to imaging studios at other institutions where we shared our knowledge and learnt from the working processes of those who are digitising comparable collection material. During these visits, we gained a better understanding of the different imaging set-ups, the various international quality standards followed, and of how images produced are analysed. We also shared our approaches to capturing and stitching oversized items such as maps and foldouts. Lastly, we discussed quality assurance and workflow management tools. Overall, these visits across the sector have been a valuable exercise in making new connections, sharing ideas, and understanding that other institutions face similar problems when digitising collection items.

Without the use of dedicated digitisation software, the capture of items such as manuscripts and large bound volumes has been challenging, as we have been unable to check the images we were producing. For this reason, we prioritised the less demanding items in the collection and postponed quality assurance checks to a later date. We chose to capture 78 rpm records as they required only two shots (front and back), minimising any possible mistakes. The imaging of audio collection items was our first achievement as a team since the cyber-attack: we digitised over 1,100 shellac discs in collaboration with the BLQFP Audio Team, who had previously catalogued and digitised the sound recordings.

A record with a green label reading Columbia
Image of a shellac disc (9CS0024993_ColumbiaGA3) digitised by the BLQFP

Through this capture task we gained the confidence to start capturing more material, beginning with the bindings of all the available bound collection items. The binding capture process is time-consuming and requires a specific setup and positioning of the item to photograph the front, back, spine, edge, head, and tail of each volume. By capturing bindings now, we will be able to streamline the process when we resume the digitisation of entire volumes.

A camera stand with a red-bound volume supported by a frame over cardboard
Capturing the spine of a bound volume, using l-shaped card on support frame

During this time, we were also involved in scoping work to locate and assess the most challenging items and plan a digitisation strategy accordingly. We focused particularly on identifying oversized maps and foldouts, which will be captured in sections and subsequently digitally stitched. This task required frequent visits to the Library’s basement storage areas and collaboration with the BLQFP Workflow Team to optimise and migrate data from the scoping process into existing workflow management systems. By gathering this data, we could determine the physical characteristics of each collection series and select the most suitable capture device. It was also crucial to collaborate with the BLQFP Conservation Team to develop new digitisation tools for capturing oversized foldouts more quickly and securely.

A volume with an insert, folded and unfolded, over two black foam supports
Using c-shaped Plastazote created by the BLQFP Conservation Team to support an oversized fold-out

The past nine months have presented many challenges for our Team. Nevertheless, in the spirit of Rebuild & Renew, we have been able to solve problems and develop creative ways of working, pulling together all our individual skills and experiences. As we expand, we have used this time productively to understand the intricacies of digitising fragile, complex, and oversized material while working to rigorous colour and quality standards. With the imminent return of imaging software, the next step for the BLQFP Imaging Team will be to apply our knowledge and understanding to a mass digitisation environment with the expectations of targets and monthly deliverables.

Team members standing around a stand on which a volume with a large foldout is prepared for photography, with lighting on both sides of the stand
Capturing a large foldout


16 July 2024

'AI and the Digital Humanities' session at CILIP's 2024 conference

Digital Curator Mia Ridge writes... I was invited to chair a session on 'AI and the digital humanities' at CILIP's 2024 conference with Ciaran Talbot (Associate Director AI & Ideas Adoption, University of Manchester Library) and Glen Robson (IIIF Technical Co-ordinator, International Image Interoperability Framework Consortium). Here's a quick post with some reflections on themes in the presentations and the audience Q&A.

A woman stands on stage in front of slides; two men sit at a panel table on the stage
CILIP's photo of our session

I presented a brief overview of some of the natural language processing (NLP) and computer vision methods in the Living with Machines project. That project and other work at the British Library showed that researchers can create innovative Digital Humanities methods and improve collections data with current AI / machine learning tools. But is there a gap between 'utilities' and 'cutting edge research' that AI can't (yet) fill for libraries?

AI (machine learning) makes library, museum and archive collections more accessible in two key ways. Firstly, more and better metadata and links across collections can make individual items more discoverable (e.g. identifying places mentioned in text; visual search to find similar images). Secondly, thinking of 'collections as data' and sharing datasets for research lets others find insights and inspiration.
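As a small illustration of the first point, off-the-shelf named entity recognition can pull place names out of descriptive text; spaCy and its small English model stand in here for whichever tools a given project actually uses.

```python
# Identifying places mentioned in text with off-the-shelf NER.
# spaCy and the en_core_web_sm model are illustrative assumptions.
# Setup: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Printed in Venice in 1499 and later acquired in London.")

for ent in doc.ents:
    if ent.label_ == "GPE":  # geopolitical entities, i.e. place names
        print(ent.text)
```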

Some of the value in AI might lie in the marketing power of the term - we've had the technical ability to view collections across silos for some time, but the institutional will might have lagged behind. Identifying the real gaps that AI can meet is hard, cross-institutional work - you need to understand what time-consuming work could be automated with ML/AI. Ciaran's talk gave a sense of the collaborative, co-creative effort required to understand actual processes and real problems and devise ways to optimise them. An 'anarchy' phase might be part of that process, and a roadmap can help set a shared vision as you work out where AI tools will actually save time or just create more but different work.

Glen gave some great examples of how IIIF can help organisations and researchers, and how AI tools might work with IIIF collections. He highlighted the intellectual property questions raised by 'open access' collections being mined for AI models, and pointed people to HaveIBeenTrained to see if their collections have been scraped.

I was struck by the delicate balance between maintaining trust and secure provenance while also supporting creative and playful uses of AI in collections. Labelling generative AI images and texts is vital. Detecting subtle errors and structural biases requires effort and expertise. As a sector, we need to keep learning, talking and collaborating to understand what generative AI means for users and collection holders.

The first question from the audience was about the environmental impact of AI. I was able to say that our work-in-progress principles for AI at the British Library ask people to consider the environmental impact of AI (not just its carbon footprint, but also water usage and rare minerals mining) in balance with other questions of public value for proposed experiments and projects. Ciaran said that Manchester have appointed a sustainability manager, which is probably something we'll see more of in future. There was a question about what employers are looking for in library and informatics students; about where to go for information and inspiration about AI in libraries (AI4LAM is a good start); and about how to update people's perceptions of libraries and the skills of library professionals.

Thanks to everyone at CILIP for all the work they put into the conference, and the fantastic AV team working in the keynote room at the Birmingham Hilton Metropole.


08 July 2024

Embracing Sustainability at the British Library: Insights from the Digital Humanities Climate Coalition Workshop

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected]. 


Sustainability has become a core value at the British Library, driven by our staff-led Sustainability Group and bolstered by the addition of a dedicated Sustainability Manager nearly a year ago. As part of our ongoing commitment to environmental responsibility, we have been exploring various initiatives to reduce our environmental footprint. One such initiative is our engagement with the Digital Humanities Climate Coalition (DHCC), a collaborative and cross-institutional effort focused on understanding and minimising the environmental impact of digital humanities research.

Screenshot from the Digital Humanities Climate Coalition website

Discovering the DHCC and its toolkit

The Digital Humanities Climate Coalition (DHCC) has been on my radar for some time, primarily due to their exemplary work in promoting sustainable digital practices. The DHCC toolkit, in particular, has proven to be an invaluable resource. Designed to help individuals and organisations make more environmentally conscious digital choices, the toolkit offers practical guidance for building sustainable digital humanities projects. It encourages researchers to adopt climate-responsible practices and supports those who may lack the practical knowledge to devise greener initiatives.

The toolkit is comprehensive, providing tips on the planning and management of research infrastructure and data. It aims to empower researchers to make climate-friendly technological decisions, thereby fostering a culture of sustainability within the digital humanities community.

My primary goal in leveraging the DHCC toolkit is to raise awareness about the environmental impact of digital work and technology use. By doing so, I hope to empower Library staff to make informed decisions that contribute to our sustainability goals. The toolkit’s insights are crucial for anyone involved in digital research, offering both strategic guidance and practical tips for minimising ecological footprints.

Planning a workshop at the British Library

With the support of our Research Development team, I organised a one-day workshop at the British Library, inviting Professor James Baker, Director of Digital Humanities at the University of Southampton and a member of the DHCC, to lead the event. The workshop was designed to introduce the DHCC toolkit and provide guidance on implementing best practices in research projects. The in-person, full-day workshop was held on 5 February 2024.

Workshop highlights

The workshop featured four key sessions:

Session 1: Introductions and Framing: We began with an overview of the DHCC and its work within the GLAM sector, followed by an introduction to sustainability at the British Library, the roles that libraries play in reducing carbon footprint and awareness raising, the Green Libraries Campaign (of which the British Library was a founding partner), and perspectives on digital humanities and the use of computational methods.

CILIP’s Green Libraries Campaign banner

Session 2: Toolkit Overview: Prof Baker introduced the DHCC toolkit, highlighting its main components and practical applications, focusing on grant writing (e.g. recommendations on designing research projects, including Data Management Plans), and working practices (guidance on reducing energy consumption in day-to-day working life, e.g. communication and shared working, travel, and publishing and preserving data). The session included responses from relevant Library teams, on topics such as research project design, data management and our shared research repository.

DHCC publication cover: A Researcher Guide to Writing a Climate Justice Oriented Data Management Plan
DHCC Information, Measurement and Practice Action Group. (2022). A Researcher Guide to Writing a Climate Justice Oriented Data Management Plan (v0.6). Zenodo. https://doi.org/10.5281/zenodo.6451499

Session 3: Advocacy and Influencing: This session focused on strategies for advocating for sustainable practices within one's organisation and influencing others to adopt these practices. We covered the Library’s staff-led Sustainability Group and its activities, after which participants were asked to consider the actions that could be taken at the Library and beyond, taking into account the types of people that might be influenced (senior leaders, colleagues, peers in wider networks/community).

Session 4: Feedback and Next Steps: Participants discussed their takeaways from the workshop and identified actionable steps they could implement in their work. This session included conversations on ways to translate workshop learnings into concrete next steps, and generated light ‘commitments’ for the next week, month and year. One fun way to set oneself a yearly reminder is to schedule an eco-friendly e-card to send to yourself in a year!

Post-workshop follow-up

Three months after the workshop had taken place, we conducted a follow-up survey to gauge its impact. The survey included a mix of agree/disagree statements (see chart below) and optional long-form questions to capture more detailed feedback. While we had only a few responses, survey results were constructive and positive. Participants appreciated the practical insights and reported better awareness of sustainable practices in their digital work.

Participants’ agree/disagree ratings for a series of statements about the DHCC workshop’s impact

Judging from responses to the set of statements above, several participants have embedded toolkit recommendations, made specific changes in their work, shared knowledge and influenced their wider networks. We got additional details on these actions in responses to the open-ended questions that followed.

What did staff members say?

Here are some comments made in relation to making changes and embedding the DHCC toolkit’s recommendations:

“Changes made to working policy and practice to order vegetarian options as standard for events.”

“I have referenced the toolkit in a chapter submitted for a monograph, in relation to my BL/university research.”

“I have discussed the toolkit's recommendations with colleagues re the projects I am currently working on. We agreed which parts of the projects were most carbon intensive and discussed ways to mitigate that.”

“I recommended a workshop on the toolkit to my [research] funding body.”

“Have engaged more with small impacts - less email traffic, fewer attachments, fewer images.”

A couple of comments were made with regard to challenges or barriers to change making. One was about colleagues being reluctant to decrease flying, or travel in general, as a way to reduce one’s carbon footprint. The second point referred to an uncertainty on how to influence internal discussions on software development infrastructure – highlighting the challenge of finding the right path to the right people.

An interesting comment was made in relation to raising environmental concerns and advocating the Toolkit:

“Shared the toolkit with wider professional network at an event at which environmentally conscious and sustainable practices were raised without prompting. Toolkit was well received with expressions of relief that others are thinking along these lines and taking practical steps to help progress the agenda.”

And finally, an excellent point about the energy-intensive use of ChatGPT (or other LLMs), which was covered at the workshop:

“The thing that has stayed with me is what was said about water consumption needed to cool the supercomputers - how every time you run one of those Chat GPT (or equivalent) queries it is the equivalent of throwing a litre of water out the window, and that Microsoft's water use has gone up 30%. I've now been saying this every time someone tells me to use one of these GPT searches. To be honest it has put me off using them completely.”

In summary

The DHCC workshop at the British Library was a great success, underscoring the importance of sustainability in digital humanities, digital projects and digital working. By leveraging the DHCC toolkit, we have taken important steps toward making our digital practices more environmentally responsible, and spreading the word across internal and external networks. Moving forward, we will continue to build on this momentum, fostering a culture of sustainability and empowering our staff to make informed, climate-friendly decisions.

Thank you to workshop contributors, organisers and helpers:

James Baker, Joely Fake, Maja Maricevic, Catherine Ross, Andy Rackley, Jez Cope, Jenny Basford, Graeme Bentley, Stephen White, Bianca Miranda Cardoso, Sarah Kirk-Browne, Andrea Deri, and Deirdre Sullivan.


04 July 2024

DHBN 2024 - Digital Humanities in the Nordic and Baltic Countries Conference Report

This is a joint blog post by Helena Byrne, Curator of Web Archives, Harry Lloyd, Research Software Engineer, and Rossitza Atanassova, Digital Curator.

Conference banner showing Icelandic landscape with mountains
This year’s Digital Humanities in the Nordic and Baltic Countries conference took place at the University of Iceland School of Education in Reykjavik. It was the eighth conference in the series, which was established in 2016, but the first time it was held in Iceland. The theme for the conference was “From Experimentation to Experience: Lessons Learned from the Intersections between Digital Humanities and Cultural Heritage”. There were pre-conference workshops from May 27-29, with the main conference starting on the afternoon of May 29 and finishing on May 31. In her excellent opening keynote Sally Chambers, Head of Research Infrastructure Services at the British Library, discussed the complex research and innovation data space for cultural heritage. Three British Library colleagues report highlights of their conference experience in this blog post.

Helena Byrne, Curator of Web Archives, Contemporary British & Irish Publications.

I presented in the Born Digital session held on May 28. There were four presentations in this session: three related to web archiving and one to Twitter (X) data. I co-presented ‘Understanding the Challenges for the Use of Web Archives in Academic Research’. This presentation examined the challenges for the use of web archives in academic research through a synthesis of the findings from two research studies published through the WARCnet research network. There was lots of discussion after the presentation on how web archives could be used as a research data management tool to help manage online citations in academic publications.

Helena presenting to an audience during the conference session on born-digital archives
Helena presenting in the born-digital archives session

The conference programme was very strong and there were many takeaways that relate to my role. One strong theme was ‘collections as data’. At the UK Web Archive we have just started to publish some of our inactive curated collections as data, so these discussions were very useful. One highlight was the panel ‘Publication and reuse of digital collections: A GLAM Labs approach’. What stood out for me in this session was the checklist for publishing collections as data. It was very reassuring to see that we had pretty much everything covered for the release of the UK Web Archive datasets.

Rossitza and I were kindly offered a tour of the National and University Library of Iceland by Kristinn Sigurðsson, Head of Digital Projects and Development. We enjoyed meeting curatorial staff from the Special Collections who showed us some of the historical maps of Iceland that have been digitised. We also visited the digitisation studio to see how they process periodicals, and spoke to staff involved with web archiving. Thank you to Kristinn and his colleagues for this opportunity to learn about the library’s collections and digital services.

Rossitza and Helena standing by the moat outside the National Library of Iceland building
Rossitza and Helena outside the National and University Library of Iceland


Inscription in Icelandic reading National and University Library of Iceland outside the Library building
The National and University Library of Iceland

Harry Lloyd, Research Software Engineer, Digital Research.

DHNB2024 was a rich conference from my perspective as a research software engineer. Sally Chambers’ opening keynote on Wednesday afternoon demonstrated an extraordinary grasp of the landscape of digital cultural heritage across the EU. By this point there had already been a day and a half of workshops, including a session Rossitza and I presented on Catalogues as Data.

I spent the first half using a Jupyter notebook to explain how we extracted entries from an OCR’d version of the catalogue of the British Library’s collection of 15th century books. We used an explainable algorithm rather than a ‘black-box’ machine learning one, so we walked through the steps involved and discussed where it worked well and where it could be improved. You can follow along by clicking the ‘launch notebook’ button in the ReadMe here.

Harry pointing to an image from the catalogue of printed books on a screen for the workshop audience
Harry explaining text recognition results during the workshop
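For readers who didn't attend, here is a rough sketch of what 'explainable' extraction can look like: a rule anyone can read splits the OCR text into entries wherever a shelfmark-like line begins. The pattern and sample text are invented for illustration; the notebook's actual rules differ.

```python
# Rule-based (explainable) extraction of catalogue entries from OCR text.
# The entry-start pattern and the sample text are illustrative
# assumptions, not the notebook's actual rules or data.
import re

ocr_text = """IB.39624. Cicero, De officiis. Mainz, 1465.
Vellum copy, wanting the first leaf.
IB.55319. Biblia latina. Venice, 1475.
With contemporary annotations."""

# Treat a line beginning with a shelfmark-like token as a new entry.
entry_start = re.compile(r"^I[AB]\.\d+\.", re.MULTILINE)

starts = [m.start() for m in entry_start.finditer(ocr_text)]
entries = [ocr_text[a:b].strip()
           for a, b in zip(starts, starts[1:] + [len(ocr_text)])]

for entry in entries:
    print(entry, "\n---")
```

Unlike a black-box classifier, every decision the rule makes can be inspected and debugged, which is what let us discuss where it worked well and where it could be improved.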

Handing over to Rossitza in the second half to discuss her corpus linguistic analysis worked really well by giving attendees a feel for the complete workflow. This really showed in some great conversations we had with attendees over the following days about tricky problems like where to store the ‘true’ results of OCR. 

A few highlights from the rest of the conference were Clelia LaMonica’s work using a Latin large language model to analyse kinship in texts from Medieval Burgundy. Large language models trained on historic texts are important, as the majority are trained on modern material and struggle with historical language. Jørgen Burchardt presented some refreshingly quantitative work on bias across a digitised newspaper collection, very reminiscent of work by Kaspar Beelen. Overall it was a productive few days, and I very much enjoyed my time in Reykjavik.

Rossitza Atanassova, Digital Curator, Digital Research.

This was my second DHNB conference and I was looking forward to reconnecting with the community of researchers and cultural heritage practitioners, some of whom I had met at DHNB2019 in Copenhagen. Apart from the informal discussions with attendees, I contributed to DHNB2024 in two main ways.

As already mentioned, Harry and I delivered a pre-conference workshop showcasing some processes and methodology we use for working with printed catalogues as data. In the session we used the corpus tool AntConc to perform computational analysis of the descriptions for the British Library’s collection of books published in the 15th century. You can find out more about the project here and reuse the workshop materials published on Zenodo here.
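AntConc is a desktop GUI; for a scripted taste of the same kind of query, a concordance of a term in catalogue descriptions can be produced in a few lines with NLTK. This is an illustrative substitute, with invented sample text, not the workshop's toolchain.

```python
# A scripted stand-in for an AntConc concordance query, using NLTK.
# The sample descriptions are invented for illustration.
from nltk.text import Text

descriptions = (
    "Bound in contemporary calf. Woodcut initials throughout. "
    "Bound with two other tracts. Initials supplied in red."
)
tokens = descriptions.split()  # simple whitespace tokenisation
Text(tokens).concordance("Bound", width=60)  # keyword-in-context display
```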

I also joined the pre-conference meeting of the international GLAM Labs Community held at the National and University Library of Iceland. This was the first in-person meeting of the community in five years and was a productive session during which we brainstormed ‘100 ideas for the GLAM Labs Community’. Afterwards we had a sneak peek of the archive of the National Theatre of Iceland, which is being catalogued and digitised.

The main hall of the Library, with a chessboard on a table with two chairs, a statue of a man holding spectacles, and a stained glass screen.
The main hall of the Library.

The DHNB community is so welcoming and supportive, and attracts many early career digital humanists. I was particularly interested to hear from doctoral students researching the use of AI with digitised archives, and using NLP methods with historical collections. One of the projects that stood out for me was Johannes Widegren’s PhD research into the ethical use of AI to enable access and discovery of Sami cultural heritage, and to develop library and archival practice. 

I was also interested in presentations that discussed workflows for creating Named Entity Recognition resources for historical archives and I plan to try out the open-source Label Studio tool that I learned about. And of course, the poster session is always a highlight and I enjoyed finding out about a range of projects, including computational analysis of Scandinavian runic-texts, digital reconstruction of Gothenburg’s 1923 Jubilee exhibition, and training large language models to track semantic variation in climate change vocabulary in Danish news articles.

A line up of people standing in front of a screen advertising the venue for DHNB25 in Estonia
The poster presentations session chaired by Olga Holownia

We are grateful to all DHNB24 organisers for the warm welcome and a great conference experience, with special thanks to the inspirational and indefatigable Olga Holownia.
