Digital scholarship blog

Enabling innovative research with British Library digital collections


23 December 2024

AI (and machine learning, etc) with British Library collections

Machine learning (ML) is a hot topic, especially when it’s hyped as ‘AI’. How might libraries use machine learning / AI to enrich collections, making them more findable and usable in computational research? Digital Curator Mia Ridge lists some examples of external collaborations, internal experiments and staff training with AI / ML and digitised and born-digital collections.

Background

The trust that the public places in libraries is hugely important to us - all our 'AI' should be 'responsible' and ethical AI. The British Library was a partner in Sheffield University's FRAIM: Framing Responsible AI Implementation & Management project (2024). We've also used lessons from the projects described here to draft our AI Strategy and Ethical Guide.

Many of the projects below have contributed to our Digital Scholarship Training Programme and our Reading Group has been discussing deep learning, big data and AI for many years. It's important that libraries are part of conversations about AI, supporting AI and data literacy and helping users understand how ML models and datasets were created.

If you're interested in AI and machine learning in libraries, museums and archives, keep an eye out for news about the AI4LAM community's Fantastic Futures 2025 conference at the British Library, 3-5 December 2025. If you can't wait that long, join us for the 'AI Debates' at the British Library.

Using ML / AI tools to enrich collections

Generative AI tends to get the headlines, but at the time of writing, tools that use non-generative machine learning to automate specific parts of a workflow have more practical applications for cultural heritage collections. That is, 'AI' is currently more process than product.

Text transcription is a foundational task that makes digitised books and manuscripts more accessible to search, analysis and other computational methods. For example, oral history staff have experimented with speech transcription tools, raising important theoretical and practical questions about automatic speech recognition (ASR) tools and chatbots.

We've used Transkribus and eScriptorium to transcribe handwritten and printed text in a range of scripts and alphabets. For example:

Creating tools and demonstrators through external collaborations

Mining the UK Web Archive for Semantic Change Detection (2021)

This project used word vectors with web archives to track words whose meanings changed over time. Resources: DUKweb (Diachronic UK web) and blog post ‘Clouds and blackberries: how web archives can help us to track the changing meaning of words’.

Graphs showing how words associated with the words blackberry, cloud, eta and follow changed over time.
From blackberries to clouds... word associations change over time
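To make the method concrete, here's a minimal sketch of the diachronic word embeddings idea using the gensim library. The toy corpora are invented for illustration; this is not the DUKweb pipeline itself:

```python
from gensim.models import Word2Vec

# Toy corpora standing in for tokenised web-archive text from two eras.
docs_2000 = [["cloud", "rain", "sky", "weather", "storm"]] * 200
docs_2013 = [["cloud", "storage", "computing", "server", "data"]] * 200

# Train a separate embedding space for each time slice.
model_2000 = Word2Vec(docs_2000, vector_size=50, min_count=1, seed=1)
model_2013 = Word2Vec(docs_2013, vector_size=50, min_count=1, seed=1)

# A word's nearest neighbours in each era hint at its changing meaning.
print("2000:", model_2000.wv.most_similar("cloud", topn=3))
print("2013:", model_2013.wv.most_similar("cloud", topn=3))
```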

Living with Machines (2018-2023)

Our Living With Machines project with The Alan Turing Institute pioneered new AI, data science and ML methods to analyse masses of newspapers, books and maps to understand the impact of the industrial revolution on ordinary people. Resources: short video case studies, our project website, final report and over 100 outputs in the British Library's Research Repository.

Outputs that used AI / machine learning / data science methods such as lexicon expansion, computer vision, classification and word embeddings included:

Tools and demonstrators created via internal pilots and experiments

Many of these examples were enabled by the skills and enthusiasm for ML experiments of in-house Research Software Engineers and the Living with Machines (LwM) team at the British Library, combined with staff members' long-term knowledge of the Library's collections, records and processes:

British Library resources for re-use in ML / AI

Our Research Repository includes datasets suitable for ground truth training, including 'Ground truth transcriptions of 18th & 19th century English language documents relating to botany from the India Office Records'.

Our ‘1 million images’ on Flickr Commons have inspired many ML experiments, including:

The Library has also shared models and datasets for re-use on the machine learning platform Hugging Face.

13 December 2024

Looking back on the Data Science Accelerator

From April to July this year an Assistant Statistician at the Cabinet Office and a Research Software Engineer at the British Library teamed up as mentee (Catherine Macfarlane, CO) and mentor (Harry Lloyd, BL) for the Data Science Accelerator. In this blog post we reflect on the experience and what it meant for us and our work.

Introduction to the Accelerator

Harry: The Accelerator has been around since 2015, set up as a platform to ‘accelerate’ civil servants at the start of their data science journey who have a project with a genuine business need and a real willingness to learn. Successful applicants are paired with mentors from across the Civil Service who have experience in techniques applicable to the problem, working together one protected day a week for 12 weeks. I was lucky enough to be a mentee in 2020, working on statistical methods to combine different types of water quality data, and my mentor Charlie taught me a lot of what I know. The programme played a huge role in the development of my career, so it was a rewarding moment to come back as a mentor for the April cohort.

Catherine: On joining the Civil Service in 2023, I had the pleasure of becoming part of a talented data team that has motivated me to continually develop my skills. My academic background in Mathematics with Finance provides me with a strong theoretical foundation, but I am striving to improve my practical abilities. I am particularly interested in Artificial Intelligence, which is gaining increasing recognition across government, sparking discussions on its potential to improve efficiency.

I saw the Data Science Accelerator as an opportunity to deepen my knowledge, address a specific business need, and share insights with my team. The prospect of working with a mentor and immersing myself in an environment where diverse projects are undertaken was particularly appealing. A significant advantage was the protected time this project offered - a rare benefit! I was grateful to be accepted and paired with Harry, an experienced mentor who had already completed the programme. Following our first meeting, I felt ready to tackle the upcoming 12 weeks to see what we could achieve!

Photo of the mentee and mentor on a video call
With one of us based in England and the other in Scotland, virtual meetings were the norm. Collaborative tools like screen sharing and GitHub allowed us to work together effectively.

The Project

Catherine: Our team is interested in the annual reports and accounts of Arm’s Length Bodies (ALBs), a category of public bodies funded to deliver a public or government service. The project addressed the challenge my team faces in extracting the highly unstructured information stored in annual reports and accounts. With this information we would be able to enhance the data validation process and reduce the burden on other teams of commissioning data from ALBs. We proposed using Natural Language Processing to retrieve this information, analysing and querying it using a Large Language Model (LLM).

Initially, I concentrated on extracting five features, such as full-time equivalent staff in the organisation, from a sample of ALBs across 13 departments for the financial year 22/23. After discussions with Harry, we decided to use Retrieval-Augmented Generation (RAG) to develop a question-answering system. RAG is a technique that combines LLMs with relevant external documents to improve the accuracy and reliability of the output. This is done by retrieving documents that are relevant to the questions asked and then asking the LLM to generate an answer based on the retrieved material. We carefully selected a pre-trained LLM while considering ethical factors like model openness.

RAG
How a retrieval augmented generation system works. A document in this context is a segmented chunk of a larger text that can be parsed by an LLM.
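For readers who like code, here's a minimal sketch of the retrieve-then-generate pattern in Python. The embedding model, the sample chunks and the placeholder LLM call are illustrative assumptions rather than the exact stack we used:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Pre-segmented chunks of an annual report (invented for illustration).
chunks = [
    "The average number of full-time equivalent staff was 1,204.",
    "The Board met six times during the financial year 2022-23.",
]
question = "How many full-time equivalent staff does the body employ?"

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vecs = model.encode(chunks, normalize_embeddings=True)
q_vec = model.encode(question, normalize_embeddings=True)

# Retrieve: cosine similarity is a dot product on normalised vectors.
best_chunk = chunks[int(np.argmax(chunk_vecs @ q_vec))]

# Generate: pass the retrieved context to whichever LLM was selected.
prompt = f"Answer using only this context:\n{best_chunk}\n\nQuestion: {question}"
print(prompt)  # ask_llm(prompt) would produce the grounded answer
```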

The first four weeks focused on exploratory analysis, data processing, and labelling, all completed in R, which was essential for preparing the data for input into the language model. The subsequent stages involved model building and evaluation in Python, which required the most time and focus. This was my first time using Python, and Harry’s guidance was extremely beneficial during our pair coding sessions. A definite highlight for me was seeing the pipeline start to generate answers!

To bring all our results together, I created a dashboard in Shiny, ensuring it was accessible to both technical and non-technical audiences. The final stage involved summarising all our hard work from the past 12 weeks in a 10-minute presentation and delivering it to the Data Science Accelerator cohort.

Harry: Catherine’s was the best planned project of the ones I reviewed, and I suspected she’d be well placed to make best use of the 12 weeks. I wasn’t wrong! We covered a lot of the steps involved in good reproducible analysis. The exploratory work gave us a great sense of the variance in the data, setting up quantitative benchmarks for the language model results drove our development of the RAG system, and I was so impressed that Catherine managed to fit in building a dashboard on top of all of that.

Our Reflections

Catherine: Overall this experience was fantastic. In a short amount of time, we managed to achieve a considerable amount. It was amazing to develop my skills and grow in confidence. Harry was an excellent mentor; he encouraged discussion and asked insightful questions, which made our sessions both productive and enjoyable. A notable highlight was visiting the British Library! It was brilliant to have an in-person session with Harry and meet the Digital Research team.

A key success of the project was meeting the objectives we set out to achieve. Patience was crucial, especially when investigating errors and identifying the root problem. The main challenge was managing such a large project that could be taken in multiple directions. It can be natural to spend a long time on one area, such as exploratory analysis, but we ensured that we completed the key elements that allowed us to move on to the next stage. This balance was essential for the project's overall success.

Harry: We divided our days between time for Catherine to work solo and pair programming. Catherine is a really keen learner, and I think this approach helped her drive the project forward while giving us space to cover foundational programming topics and a new programming language. My other role was keeping an eye on the project timeline. Giving the occasional steer on when to stick with something and when to move on helped (I hope!) Catherine to achieve a huge amount in three months. 

Dashboard
A page from the dashboard Catherine created in the last third of the project.

Ongoing Work

Catherine: Our team recognises the importance of continuing this work. I have developed an updated project roadmap, which includes utilising Amazon Web Services to enhance the speed and memory capacity of our pipeline. Additionally, I have planned to compare various large language models, considering ethical factors, and I will collaborate with other government analysts involved in similar projects. I am committed to advancing this project, further upskilling the team, and keeping Harry updated on our progress.

Harry: RAG, and the semantic rather than keyword search that underlies it, represents a maturation of LLM technology that has the potential to change the way users search our collections. Anticipating that this will be a feature of future library services platforms, we have a responsibility to understand more about how these technologies will work with our collections at scale. We’re currently carrying out experiments with RAG and the linked data of the British National Bibliography to understand how searching like this will change the way users interact with our data.

Conclusions

Disappointingly the Data Science Accelerator was wound down by the Office for National Statistics at the end of the latest cohort, citing budget pressures. That has made us one of the last mentor/mentee pairings to benefit from the scheme, which we’re both incredibly grateful for and deeply saddened by. The experience has been a great one, and we’ve each learned a lot from it. We’ll continue to develop RAG at the Cabinet Office and the British Library, and hope to advocate for and support schemes like the Accelerator in the future!

12 December 2024

Automating metadata creation: an experiment with Parliamentary 'Road Acts'

This post was originally written by Giorgia Tolfo in early 2023 then lightly edited and posted by Mia Ridge in late 2024. It describes work undertaken in 2019, and provides context for resources we hope to share on the British Library's Research Repository in future.

The Living with Machines project used a diverse range of sources, from newspapers and maps to census data. This post discusses the Road Acts, 18th century Acts of Parliament stored at the British Library, as an example of some of the challenges in digitising historical records, and suggests computational methods for reducing some of the overhead of cataloguing Library records during digitisation.

What did we want to do?

Before collection items can be digitised, they need a preliminary catalogue record - there's no point digitising records without metadata for provenance and discoverability. Like many extensive collections, the Road Acts weren't already catalogued. Creating the necessary catalogue records manually wasn't a viable option for the timeframe and budget of the project, so with the support of British Library experts Jennie Grimshaw and Iris O’Brien, we decided to explore automated methods for extracting metadata from digitised images of the documents themselves. The metadata created could then be mapped to a catalogue schema provided by Jennie and Iris. 

Given the complexity of the task and the timeframe, infrastructure and resources of the project, the agency Cogapp was commissioned to do the following:

  • Export metadata for 31 scanned microfilms in a format that matched the required fields in a metadata schema provided by the British Library curators
  • OCR (including normalising the 'long S') to a standard agreed with the Living with Machines project
  • Create a package of files for each Act including: OCR (METS + ALTO) + images (scanned by British Library)

To this end, we provided Cogapp with:

  • Scanned images of the 31 microfilm reels, named using the microfilm ID and the numerical sequential order of the frame
  • The Library's metadata requirements
  • Curators' support to explain and guide them through the metadata extraction and record creation process 

Once all of this was put in place, the process started. However, this is where we encountered the main problem.

First issue: the typeface

After some research and tests we concluded that the typeface (or font, shown in Figure 1) is probably English Blackletter. However, at the time, OCR software - software that uses 'optical character recognition' to transcribe text from digitised images, like ABBYY, Tesseract or Transkribus - couldn't accurately read this font. Running OCR using a generic tool would inevitably lead to poor, if not unusable, results. You can create 'models' for unrecognised fonts by manually transcribing a set of documents, but this can be time-consuming.

Image of a historical document
Figure 1: Page showing typefaces and layout. SPRMicP14_12_016
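For a concrete sense of what 'running OCR using a generic tool' involves, here's a minimal sketch using the pytesseract wrapper. The 'frk' (Fraktur) model is a stand-in assumption; as noted above, no off-the-shelf model read English Blackletter accurately at the time:

```python
from PIL import Image
import pytesseract

# Hypothetical filename; 'frk' (Fraktur) is an assumed stand-in model,
# since no off-the-shelf traineddata handled English Blackletter well.
page = Image.open("SPRMicP14_12_016.png")
text = pytesseract.image_to_string(page, lang="frk")
print(text)
```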

Second issue: the marginalia

As you can see in Figure 2, each Act has marginalia - additional text in the margins of the page. 

This makes the task of recognising the layout of information on the page more difficult. At the time, most OCR software wasn't able to detect marginalia as separate blocks of text. As a consequence, these portions of text are often rendered inline, merged with the main text. Below are some examples of how OCR software with standard settings interpreted the page in Figure 2.

Black and white image of printed page with comments in the margins
Figure 2: Printed page with marginalia. SPRMicP14_12_324

 

OCR generated by ABBYY FineReader:

Qualisicatiori 6s Truitees;

Penalty on acting if not quaiified.

Anno Regni septimo Georgii III. Regis.

9nS be it further enaften, Chat no person ihali he tapable of aftingt ao Crustee in the Crecution of this 9ft, unless be ftall he, in his oton Eight, oj in the Eight of his ©Btfe, in the aftual PofTefli'on anb jogment oj Eeceipt of the Eents ana profits of tanas, Cenements, anb 5)erebitaments, of the clear pearlg Oalue of J?iffp Pounbs} o? (hall be ©eit apparent of some person hatiing such estate of the clear gcatlg 5ia= lue of ©ne hunb?eb Pounbs; o? poffcsseb of, o? intitieb unto, a personal estate to the amount o? Oalue of ©ne thoufanb Pounbs: 9nb if ang Person hcrebg beemeo incapable to aft, ihali presume to aft, etierg such Per* son (hall, so? etierg such ©ffcnce, fojfcit anb pag the @um of jTiftg pounbs to ang person o? 

 

OCR generated by the open source tool Tesseract:

586 Anno Regni ?eptimo Georgi III. Regis.

Qualification

of Truttees;

Penalty on

Gnd be it further enated, That no. Per?on ?hall bÈ

capable of ating as Tru?tËe in the Crecution of thig

A, unle?s he ?hall be, in his own Right, 02 in the

Right of his Wife, in the a‰ual Pofe??ion and En. |

joyment 02 Receipt of the Rents and P2zofits of Lands,

Tenements, and hereditaments, of the clear pearly

Ualue of Fifty Pounds z o? hall be Deir Apparent of

?ome Per?on having ?uch Cfitate of the clear yearly Uga-

lue of Dne Hundred Pounds ; 02 po??e?leD of, 02 intitled

unto, a Per?onal E?tate to the Amount 02 Ualue of One

thou?and Pounds : And if any Per?on hereby deemed

acting if not incapable to ai, ?hall p2e?ume to ait, every ?uch Perz

qualified.

 

OCR generated by Cogapp (without any enhancement):

of Trusteesi

586

Anno Regni ſeptimo Georgii III. Regis.

Qualihcation and be it further enałted, That no perſon thall be

capable of aging as Trulltee in the Erecution of this

ad, unlefs he thall be, in his own Right, of in the

Right of his Wife, in the ađual Polellion and En:

joyment or Receipt of the Rents and Profits of Lands,

Tenements, and hereditaments, of the clear pearly

Ualue of ffifty pounds : oi thall be peir apparent of

ſome Perſon having ſuch Etate of the clear yearly Ua:

lue of Dne hundred Pounds; ou podeled of, od intitled

unto, a Perſonal Elate to the amount ou Ualue of Dne

Penalty on thouſand Pounds : and if any perſon hereby deemed

acting if not incapable to ad, thall preſume to ađ, every ſuch Per-

Qualified.

 

As you can see, the OCR transcription results were too poor to use in our research.

Changing our focus: experimenting with metadata creation

Time was running out fast, so we decided to adjust our expectations about text transcription, and asked Cogapp to focus on generating metadata for the digitised Acts. They have reported on their process in a post called 'When AI is not enough' (which might give you a sense of the challenges!).

Since the title page of each Act has a relatively standard layout it was possible to train a machine learning model to recognise the title, year and place of publication, imprint etc. and produce metadata that could be converted into catalogue records. These were sent on to British Library experts for evaluation and quality control, and potential future ingest into our catalogues.
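As a simplified illustration of the idea (not Cogapp's actual model), once the layout is predictable, structured fields can be pulled from a transcribed title page with a few rules:

```python
import re

# Invented transcription of a title page, for illustration only.
title_page = (
    "An Act for repairing the Road from Leeds to Selby. "
    "Anno Regni septimo Georgii III. Regis. "
    "London: Printed by Mark Baskett, 1767."
)

record = {
    "title": title_page.split(".")[0] + ".",
    "year": re.search(r"\b1[5-8]\d{2}\b", title_page).group(),   # 1500-1899
    "place": re.search(r"(\w+):\s*Printed", title_page).group(1),
}
print(record)  # {'title': 'An Act ...', 'year': '1767', 'place': 'London'}
```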

Conclusion

This experience, although only partly successful in creating fully transcribed pages, explored the potential of producing the basis of catalogue records computationally, and was also an opportunity to test workflows for automated metadata extraction from historical sources. 

Since this work was put on hold in 2019, advances in OCR features built into generative AI chatbots offered by major companies mean that a future project could probably produce good quality transcriptions and better structured data from our digitised images.

If you have suggestions or want to get in touch about the dataset, please email [email protected]

28 August 2024

Open and Engaged 2024: Empowering Communities to Thrive in Open Scholarship

The British Library is delighted to host its annual Open and Engaged Conference on Monday 21 October, in-person and online, as part of International Open Access Week. The conference is supported by the Arts and Humanities Research Council (AHRC) and Research Libraries UK (RLUK).

Save the Date flyer for Open & Engaged 2024 on 21 October, in person and online, and with logos for sponsors UKRI, Arts and Humanities Research Council and RLUK

Open and Engaged 2024: Empowering Communities to Thrive in Open Scholarship will centre on leveraging the power of communities across open scholarship, open infrastructure, emerging technologies, collections as data, equity and integrity, skills development and sustainable models to elevate research of all kinds for the public good. We take a cross-sectoral approach to the conference programme, unifying around shared values for openness, by reflecting on practices within research libraries in both the higher education and GLAM (Galleries, Libraries, Archives, Museums) sectors, as well as national and public libraries.

Everyone interested in the conference topics is welcome to join us on Monday 21 October!

This will be a hybrid event taking place at the British Library’s Knowledge Centre in St. Pancras, London, and streamed online for those unable to attend in-person. 

The event will be recorded and recordings made available in the British Library’s Research Repository.

Registration

Registration is closed for in-person and online attendance. Registrants have been contacted with details. Any questions, please contact [email protected].  

Programme 

Slides and recordings of the talks are available as a collection in the British Library’s Research Repository.

09:30  Registration

10:00  Welcome remarks

10:10  Opening keynote panel: Cross-disciplinary approach to open scholarship

Chaired by Sally Chambers, Head of Research Infrastructure Services at the British Library.

10:50    Empowering communities through equity, inclusivity, and ethics

Chaired by Beth Montague-Hellen, Head of Library and Information Services at the Francis Crick Institute.

This session addresses the role of communities in governance, explores the ethical implications of AI for citizens, highlights the value of public engagement, and discusses the central importance of equity, inclusivity, and integrity in scholarly communications.

11:40  Break

12:10    Deepening partnership in skills development through shared values

Chaired by Kirsty Wallis, Head of Research Liaison at UCL.

This session explores initiatives that foster skills development in libraries with a cross sectoral approach and dives into the role of libraries to support communities in building resilience.

13:00  Lunch

13:45   Open repositories for research of all kinds

Chaired by William J Nixon, Deputy Executive Director at Research Libraries UK (RLUK).

This session addresses the role of infrastructure in supporting open scholarship practices, explores practice as research in relation to diverse outputs and infrastructure, and discusses institutional resilience in digital strategies.

14:45  Break

15:15   Enabling collections as data: from policy to practice  

Chaired by Jez Cope, Data Services Lead at the British Library.

This session dives into digital collections as data, exploring policies and practices across different sectors, public-private partnerships in making collections publicly available, and the dynamics of preservation versus access in national libraries, all while underlining the public good.

16:15   Closing keynote: Stories Change Lives

Chaired by Liz White, Director of Library Partnerships at the British Library.

16:45 Closing remarks

17:00 Networking session

19:00  End

The hashtag for the event is #OpenEngaged on the social media platform of your choice. If you have any questions, please contact us at [email protected].

16 July 2024

'AI and the Digital Humanities' session at CILIP's 2024 conference

Digital Curator Mia Ridge writes... I was invited to chair a session on 'AI and the digital humanities' at CILIP's 2024 conference with Ciaran Talbot (Associate Director AI & Ideas Adoption, University of Manchester Library) and Glen Robson (IIIF Technical Co-ordinator, International Image Interoperability Framework Consortium). Here's a quick post with some reflections on themes in the presentations and the audience Q&A.

A woman stands on stage in front of slides; two men sit at a panel table on the stage
CILIP's photo of our session

I presented a brief overview of some of the natural language processing (NLP) and computer vision methods in the Living with Machines project. That project and other work at the British Library showed that researchers can create innovative Digital Humanities methods and improve collections data with current AI / machine learning tools. But is there a gap between 'utilities' and 'cutting edge research' that AI can't (yet) fill for libraries?

AI (machine learning) makes library, museum and archive collections more accessible in two key ways. Firstly, more and better metadata and links across collections can make individual items more discoverable (e.g. identifying places mentioned in text; visual search to find similar images). Secondly, thinking of 'collections as data' and sharing datasets for research lets others find insights and inspiration.

Some of the value in AI might lie in the marketing power of the term - we've had the technical ability to view collections across silos for some time, but the institutional will might have lagged behind. Identifying the real gaps that AI can meet is hard, cross-institutional work - you need to understand what time-consuming work could be automated with ML/AI. Ciaran's talk gave a sense of the collaborative, co-creative effort required to understand actual processes and real problems and devise ways to optimise them. An 'anarchy' phase might be part of that process, and a roadmap can help set a shared vision as you work out where AI tools will actually save time or just create more but different work.

Glen gave some great examples of how IIIF can help organisations and researchers, and how AI tools might work with IIIF collections. He highlighted the intellectual property questions raised when 'open access' collections are mined for AI models, and pointed people to HaveIBeenTrained to see if their collections have been scraped.

I was struck by the delicate balance between maintaining trust and secure provenance while also supporting creative and playful uses of AI in collections. Labelling generative AI images and texts is vital. Detecting subtle errors and structural biases requires effort and expertise. As a sector, we need to keep learning, talking and collaborating to understand what generative AI means for users and collection holders.

The first question from the audience was about the environmental impact of AI. I was able to say that our work-in-progress principles for AI at the British Library ask people to consider the environmental impact of AI (not just its carbon footprint, but also water usage and rare minerals mining) in balance with other questions of public value for proposed experiments and projects. Ciaran said that Manchester have appointed a sustainability manager, which is probably something we'll see more of in future. There was a question about what employers are looking for in library and informatics students; about where to go for information and inspiration about AI in libraries (AI4LAM is a good start); and about how to update people's perceptions of libraries and the skills of library professionals.

Thanks to everyone at CILIP for all the work they put into the conference, and the fantastic AV team working in the keynote room at the Birmingham Hilton Metropole.

 

08 July 2024

Embracing Sustainability at the British Library: Insights from the Digital Humanities Climate Coalition Workshop

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected]. 

 

Sustainability has become a core value at the British Library, driven by our staff-led Sustainability Group and bolstered by the addition of a dedicated Sustainability Manager nearly a year ago. As part of our ongoing commitment to environmental responsibility, we have been exploring various initiatives to reduce our environmental footprint. One such initiative is our engagement with the Digital Humanities Climate Coalition (DHCC), a collaborative and cross-institutional effort focused on understanding and minimising the environmental impact of digital humanities research.

Screenshot from the Digital Humanities Climate Coalition website
 

Discovering the DHCC and its toolkit

The Digital Humanities Climate Coalition (DHCC) has been on my radar for some time, primarily due to their exemplary work in promoting sustainable digital practices. The DHCC toolkit, in particular, has proven to be an invaluable resource. Designed to help individuals and organisations make more environmentally conscious digital choices, the toolkit offers practical guidance for building sustainable digital humanities projects. It encourages researchers to adopt climate-responsible practices and supports those who may lack the practical knowledge to devise greener initiatives.

The toolkit is comprehensive, providing tips on the planning and management of research infrastructure and data. It aims to empower researchers to make climate-friendly technological decisions, thereby fostering a culture of sustainability within the digital humanities community.

My primary goal in leveraging the DHCC toolkit is to raise awareness about the environmental impact of digital work and technology use. By doing so, I hope to empower Library staff to make informed decisions that contribute to our sustainability goals. The toolkit’s insights are crucial for anyone involved in digital research, offering both strategic guidance and practical tips for minimising ecological footprints.

Planning a workshop at the British Library

With the support of our Research Development team, I organised a one-day workshop at the British Library, inviting Professor James Baker, Director of Digital Humanities at the University of Southampton and a member of the DHCC, to lead the event. The workshop was designed to introduce the DHCC toolkit and provide guidance on implementing best practices in research projects. The in-person, full-day workshop was held on 5 February 2024.

Workshop highlights

The workshop featured four key sessions:

Session 1: Introductions and Framing: We began with an overview of the DHCC and its work within the GLAM sector, followed by an introduction to sustainability at the British Library, the roles that libraries play in reducing carbon footprint and awareness raising, the Green Libraries Campaign (of which the British Library was a founding partner), and perspectives on digital humanities and the use of computational methods.

CILIP’s Green Libraries Campaign banner

Session 2: Toolkit Overview: Prof Baker introduced the DHCC toolkit, highlighting its main components and practical applications, focusing on grant writing (e.g. recommendations on designing research projects, including Data Management Plans), and working practices (guidance on reducing energy consumption in day-to-day working life, e.g. communication and shared working, travel, and publishing and preserving data). The session included responses from relevant Library teams, on topics such as research project design, data management and our shared research repository.

DHCC publication cover: A Researcher Guide to Writing a Climate Justice Oriented Data Management Plan
DHCC Information, Measurement and Practice Action Group. (2022). A Researcher Guide to Writing a Climate Justice Oriented Data Management Plan (v0.6). Zenodo. https://doi.org/10.5281/zenodo.6451499

Session 3: Advocacy and Influencing: This session focused on strategies for advocating for sustainable practices within one's organisation and influencing others to adopt these practices. We covered the Library’s staff-led Sustainability Group and its activities, after which participants were asked to consider the actions that could be taken at the Library and beyond, taking into account the types of people that might be influenced (senior leaders, colleagues, peers in wider networks/community).

Session 4: Feedback and Next Steps: Participants discussed their takeaways from the workshop and identified actionable steps they could implement in their work. This session included conversations on ways to translate workshop learnings into concrete next steps, and generated light ‘commitments’ for the next week, month and year. One fun way to set oneself a yearly reminder is to schedule an eco-friendly e-card to send to yourself in a year!

Post-workshop follow-up

Three months after the workshop had taken place, we conducted a follow-up survey to gauge its impact. The survey included a mix of agree/disagree statements (see chart below) and optional long-form questions to capture more detailed feedback. While we had only a few responses, survey results were constructive and positive. Participants appreciated the practical insights and reported better awareness of sustainable practices in their digital work.

Participants’ agree/disagree ratings for a series of statements about the DHCC workshop’s impact

Judging from responses to the set of statements above, several participants have embedded toolkit recommendations, made specific changes in their work, shared knowledge and influenced their wider networks. We got additional details on these actions in responses to the open-ended questions that followed.

What did staff members say?

Here are some comments made in relation to making changes and embedding the DHCC toolkit’s recommendations:

“Changes made to working policy and practice to order vegetarian options as standard for events.”

“I have referenced the toolkit in a chapter submitted for a monograph, in relation to my BL/university research.”

“I have discussed the toolkit's recommendations with colleagues re the projects I am currently working on. We agreed which parts of the projects were most carbon intensive and discussed ways to mitigate that.”

“I recommended a workshop on the toolkit to my [research] funding body.”

“Have engaged more with small impacts - less email traffic, fewer attachments, fewer images.”

A couple of comments were made with regard to challenges or barriers to change making. One was about colleagues being reluctant to decrease flying, or travel in general, as a way to reduce one’s carbon footprint. The second point referred to an uncertainty on how to influence internal discussions on software development infrastructure – highlighting the challenge of finding the right path to the right people.

An interesting comment was made in relation to raising environmental concerns and advocating the Toolkit:

“Shared the toolkit with wider professional network at an event at which environmentally conscious and sustainable practices were raised without prompting. Toolkit was well received with expressions of relief that others are thinking along these lines and taking practical steps to help progress the agenda.”

And finally, an excellent point about the energy-intensive use of ChatGPT (or other LLMs), which was covered at the workshop:

“The thing that has stayed with me is what was said about water consumption needed to cool the supercomputers - how every time you run one of those Chat GPT (or equivalent) queries it is the equivalent of throwing a litre of water out the window, and that Microsoft's water use has gone up 30%. I've now been saying this every time someone tells me to use one of these GPT searches. To be honest it has put me off using them completely.”

In summary

The DHCC workshop at the British Library was a great success, underscoring the importance of sustainability in digital humanities, digital projects and digital working. By leveraging the DHCC toolkit, we have taken important steps toward making our digital practices more environmentally responsible, and spreading the word across internal and external networks. Moving forward, we will continue to build on this momentum, fostering a culture of sustainability and empowering our staff to make informed, climate-friendly decisions.

Thank you to workshop contributors, organisers and helpers:

James Baker, Joely Fake, Maja Maricevic, Catherine Ross, Andy Rackley, Jez Cope, Jenny Basford, Graeme Bentley, Stephen White, Bianca Miranda Cardoso, Sarah Kirk-Browne, Andrea Deri, and Deirdre Sullivan.

 

04 July 2024

DHBN 2024 - Digital Humanities in the Nordic and Baltic Countries Conference Report

This is a joint blog post by Helena Byrne, Curator of Web Archives, Harry Lloyd, Research Software Engineer, and Rossitza Atanassova, Digital Curator.

Conference banner showing Icelandic landscape with mountains
This year’s Digital Humanities in the Nordic and Baltic countries conference took place at the University of Iceland School of Education in Reykjavik. It was the eighth conference in a series established in 2016, but the first time it was held in Iceland. The theme for the conference was “From Experimentation to Experience: Lessons Learned from the Intersections between Digital Humanities and Cultural Heritage”. There were pre-conference workshops from May 27-29, with the main conference starting on the afternoon of May 29 and finishing on May 31. In her excellent opening keynote Sally Chambers, Head of Research Infrastructure Services at the British Library, discussed the complex research and innovation data space for cultural heritage. Three British Library colleagues report highlights of their conference experience in this blog post.

Helena Byrne, Curator of Web Archives, Contemporary British & Irish Publications.

I presented in the Born Digital session held on May 28. There were four presentations in this session and three were related to web archiving and one related to Twitter (X) data. I co-presented ‘Understanding the Challenges for the Use of Web Archives in Academic Research’. This presentation examined the challenges for the use of web archives in academic research through a synthesis of the findings from two research studies that were published through the WARCnet research network. There was lots of discussion after the presentation on how web archives could be used as a research data management tool to help manage online citations in academic publications.

Helena presenting to an audience during the conference session on born-digital archives
Helena presenting in the born-digital archives session

The conference programme was very strong and there were many takeaways that relate to my role. One strong theme was ‘collections as data’. At the UK Web Archive we have just started to publish some of our inactive curated collections as data, so these discussions were very useful. One highlight was the panel ‘Publication and reuse of digital collections: A GLAM Labs approach’. What stood out for me in this session was the checklist for publishing collections as data. It was very reassuring to see that we had pretty much everything covered for the release of the UK Web Archive datasets.

Rossitza and I were kindly offered a tour of the National and University Library of Iceland by Kristinn Sigurðsson, Head of Digital Projects and Development. We enjoyed meeting curatorial staff from the Special Collections who showed us some of the historical maps of Iceland that have been digitised. We also visited the digitisation studio to see how they process periodicals, and spoke to staff involved with web archiving. Thank you to Kristinn and his colleagues for this opportunity to learn about the library’s collections and digital services.

Rossitza and Helena standing by the moat outside the National Library of Iceland building
Rossitza and Helena outside the National and University Library of Iceland

 

Inscription in Icelandic reading National and University Library of Iceland outside the Library building
The National and University Library of Iceland

Harry Lloyd, Research Software Engineer, Digital Research.

DHNB2024 was a rich conference from my perspective as a research software engineer. Sally Chambers’ opening keynote on Wednesday afternoon demonstrated an extraordinary grasp of the landscape of digital cultural heritage across the EU. By this point there had already been a day and a half of workshops, including a session Rossitza and I presented on Catalogues as Data.

I spent the first half using a Jupyter notebook to explain how we extracted entries from an OCR’d version of the catalogue of the British Library’s collection of 15th century books. We used an explainable algorithm rather than a ‘black-box’ machine learning one, so we walked through the steps involved and discussed where it worked well and where it could be improved. You can follow along by clicking the ‘launch notebook’ button in the ReadMe here.

Harry pointing to an image from the catalogue of printed books on a screen for the workshop audience
Harry explaining text recognition results during the workshop

Handing over to Rossitza in the second half to discuss her corpus linguistic analysis worked really well by giving attendees a feel for the complete workflow. This really showed in some great conversations we had with attendees over the following days about tricky problems like where to store the ‘true’ results of OCR. 

A few highlights from the rest of the conference were Clelia LaMonica’s work using a Latin large language model to analyse kinship in texts from Medieval Burgundy. Large language models trained on historic texts are important, as the majority are trained on modern material and struggle with historical language. Jørgen Burchardt presented some refreshingly quantitative work on bias across a digitised newspaper collection, very reminiscent of work by Kaspar Beelen. Overall it was a productive few days, and I very much enjoyed my time in Reykjavik.

Rossitza Atanassova, Digital Curator, Digital Research.

This was my second DHNB conference and I was looking forward to reconnecting with the community of researchers and cultural heritage practitioners, some of whom I had met at DHNB2019 in Copenhagen. Apart from the informal discussions with attendees, I contributed to DHNB2024 in two main ways.

As already mentioned, Harry and I delivered a pre-conference workshop showcasing some processes and methodology we use for working with printed catalogues as data. In the session we used the corpus tool AntConc to perform computational analysis of the descriptions for the British Library’s collection of books published in the 15th century. You can find out more about the project here and reuse the workshop materials published on Zenodo here.

I also joined the pre-conference meeting of the international GLAM Labs Community held at the National and University Library of Iceland. This was the first in-person meeting of the community in five years and was a productive session during which we brainstormed ‘100 ideas for the GLAM Labs Community’. Afterwards we had a sneak peek of the archive of the National Theatre of Iceland which is being catalogued and digitised.

The main hall of the Library with a chessboard on a table with two chairs, a statue of a man, holding spectacles and a stained glass screen.
The main hall of the Library.

The DHNB community is so welcoming and supportive, and attracts many early career digital humanists. I was particularly interested to hear from doctoral students researching the use of AI with digitised archives, and using NLP methods with historical collections. One of the projects that stood out for me was Johannes Widegren’s PhD research into the ethical use of AI to enable access and discovery of Sami cultural heritage, and to develop library and archival practice. 

I was also interested in presentations that discussed workflows for creating Named Entity Recognition resources for historical archives and I plan to try out the open-source Label Studio tool that I learned about. And of course, the poster session is always a highlight and I enjoyed finding out about a range of projects, including computational analysis of Scandinavian runic-texts, digital reconstruction of Gothenburg’s 1923 Jubilee exhibition, and training large language models to track semantic variation in climate change vocabulary in Danish news articles.

A line up of people standing in front of a screen advertising the venue for DHNB25 in Estonia
The poster presentations session chaired by Olga Holownia

We are grateful to all DHNB24 organisers for the warm welcome and a great conference experience, with special thanks to the inspirational and indefatigable Olga Holownia.

07 May 2024

Recovered Pages: Computing for Cultural Heritage Student Projects

The British Library is continuing to recover from last year’s cyber-attack. While our teams work to restore our services safely and securely, one of our goals in the Digital Research Team is to get some of the information from our currently inaccessible web pages into an easily readable and shareable format. We’ll be sharing these pages via blog posts here, with information recovered from the Wayback Machine, a fantastic initiative of the Internet Archive.  

The next page in this series is all about the student projects that came out of our Computing for Cultural Heritage project with the National Archives and Birkbeck University. This student project page was captured by the Wayback Machine on 7 June 2023.  

 

Computing for Cultural Heritage Student Projects

computing for cultural heritage logo - an image of a laptop with bookshelves as the screen saver

This page provides abstracts for a selection of student projects undertaken as part of a one-year part-time Postgraduate Certificate (PGCert), Computing for Cultural Heritage, co-developed by the British Library, The National Archives and Birkbeck University and funded by the Institute of Coding as part of a £4.8 million university skills drive.

“I have gone from not being able to print 'hello' in Python to writing some relatively complex programs and having a much greater understanding of data science and how it is applicable to my work."

- Jessica Green  

Key points

  • The aim of the trial was to provide professionals working in the cultural heritage sector with an understanding of basic programming and computational analytic tools to support them in their daily work
  • During the Autumn & Spring terms (October 2019-April 2020), 12 staff members from the British Library and 8 staff members from The National Archives completed two new trial modules at Birkbeck University: Demystifying computing for heritage professionals and Work-based Project
  • Birkbeck University have now launched the Applied Data Science (Postgraduate Certificate) based on the outcomes of the trial

Student Projects

 

Transforming Physical Labels into Digital References 

Sotirios Alpanis, British Library
This project aims to use computing to convert data collected during the preparation of archive material for digitisation into a tool that can verify and validate image captures, and subsequently label them. This will take as its input physical information about each document being digitised, perform and facilitate a series of validations throughout image capture and quality assurance, and result in an XML file containing a map of physical labels to digital files. The project will take place within the British Library/Qatar Foundation Partnership (BL/QFP), which is digitising archive material for display on the QDL.qa.

Enhancing national thesis metadata with persistent identifiers

Jenny Basford, British Library 
Working with data from ISNI (International Standard Name Identifier) Agency and EThOS (Electronic Theses Online Service), both based at the British Library, I intend to enhance the metadata of both databases by identifying doctoral supervisors in thesis metadata and matching these data with ISNI holdings. This work will also feed into the European-funded FREYA project, which is concerned with the use of a wide variety of persistent identifiers across the research landscape to improve openness in research culture and infrastructure through Linked Data applications.

A software tool to support the social media activities of the Unlocking Our Sound Heritage Project

Lucia Cavorsi, British Library
Video
I would like to design a software tool able to flag forthcoming anniversaries by comparing all the dates present in SAMI (sound and moving image catalogue – Sound Archive) with the current date. The aim of this tool is to suggest potential content for the Sound Archive’s social media posts. Useful dates in SAMI which could be matched with the current date and provide material for tweets are birth and death dates of performers or authors, radio programme broadcast dates, and recording dates. I would like this tool to also match the subjects currently present in SAMI with the subjects featured in the list of anniversaries 2020 which the social media team uses, for example anniversaries like ‘International HIV day’ and ‘International day of Lesbian visibility’. A Windows pop-up message will be designed to deliver anniversary notifications on the day. If time permits, it would also be useful to analyse which hashtags have been used over the last year by the people who follow or are followed by the Sound Archive Twitter account. By extracting a list of these hashtags, further and more sound-related anniversaries could be added to the list of anniversaries currently used by the UOSH’s social media team.
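As an illustration of the date-matching idea at the heart of this project, a minimal sketch might look like the following; the records are a hypothetical stand-in for SAMI catalogue data:

```python
from datetime import date

# Hypothetical records standing in for dates held in SAMI.
records = [
    {"title": "Interview with a performer", "date": date(1950, 6, 14)},
    {"title": "Radio programme broadcast", "date": date(1973, 11, 2)},
]

today = date.today()
for rec in records:
    d = rec["date"]
    # An anniversary falls when day and month match today's date.
    if (d.day, d.month) == (today.day, today.month):
        years = today.year - d.year
        print(f"Anniversary today: {rec['title']} ({years} years)")
```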

Computing Cholera: Topic modelling the catalogue entries of the General Board of Health

Christopher Day, The National Archives
Blog / Other
The correspondence of the General Board of Health (1848–1871) documents the work of a body set up to deal with cholera epidemics in a period where some English homes were so filthy as to be described as “mere pigholes not fit for human beings”. Individual descriptions for each of these over 89,000 letters are available on Discovery, The National Archives (UK)’s catalogue. Now, some 170 years later, access to the letters themselves has been disrupted by another epidemic, COVID-19. This paper examines how data science can be used to repurpose archival catalogue descriptions, initially created to enhance the ‘human findability’ of records (and favoured by many UK archives due to high digitisation costs), for large-scale computational analysis. The records of the General Board will be used as a case study: their catalogue descriptions topic modelled using a latent Dirichlet allocation model, visualised, and analysed – giving an insight into how new sanitary regulations were negotiated with a divided public during an epidemic. The paper then explores the validity of using the descriptions of historical sources as a source in their own right; and asks how, during a time of restricted archival access, metadata can be used to continue research.
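As an illustration of the topic-modelling step described here, a minimal latent Dirichlet allocation pipeline in scikit-learn might look like this; the sample descriptions are invented and the project's own implementation may differ:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented catalogue-style descriptions, for illustration only.
descriptions = [
    "Letter reporting an outbreak of cholera in Leeds",
    "Petition concerning sewerage and drainage works in a parish",
    "Report on the sanitary condition of common lodging houses",
    "Letter on nuisances and the removal of refuse from dwellings",
]

vec = CountVectorizer(stop_words="english")
doc_term = vec.fit_transform(descriptions)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term)

# Print the most heavily weighted terms for each topic.
terms = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[-4:][::-1]]
    print(f"Topic {i}: {top_terms}")
```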

An Automated Text Extraction Tool for Use on Digitised Maps

Nicholas Dykes, British Library
Blog / Video
Researchers of history often have difficulty geo-locating historical place names in Africa. I would like to apply automated transcription techniques to a digitised archive of historical maps of Africa to create a resource that will allow users to search for text, and discover where, and on which maps that text can be found. This will enable identification and analysis both of historical place names and of other text, such as topographical descriptions. I propose to develop a software tool in Python that will send images stored locally to the Google Vision API, and retrieve and process a response for each image, consisting of a JSON file containing the text found, pixel coordinate bounding boxes for each instance of text, and a confidence score. The tool will also create a copy of each image with the text instances highlighted. I will experiment with the parameters of the API in order to achieve the most accurate results.  I will incorporate a routine that will store each related JSON file and highlighted image together in a separate folder for each map image, and create an Excel spreadsheet containing text results, confidence scores, links to relevant image folders, and hyperlinks to high-res images hosted on the BL website. The spreadsheet and subfolders will then be packaged together into a single downloadable resource.  The finished software tool will have the capability to create a similar resource of interlinked spreadsheet and subfolders from any batch of images.
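A minimal sketch of the core Google Vision API call described in this proposal might look like the following; the filename is hypothetical and the batching, highlighting and spreadsheet steps are omitted:

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()  # assumes credentials are configured

with open("map_sheet_001.png", "rb") as f:  # hypothetical filename
    image = vision.Image(content=f.read())

response = client.text_detection(image=image)

# Entry 0 is the full page text; the rest are individual text instances
# with pixel-coordinate bounding boxes.
for annotation in response.text_annotations[1:]:
    box = [(v.x, v.y) for v in annotation.bounding_poly.vertices]
    print(annotation.description, box)
```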

Reconstituting a Deconstructed Dataset using Python and SQLite

Alex Green, The National Archives
Video
For this project I will rebuild a database and establish the referential integrity of the data from CSV files using Python and SQLite. To do this I will need to study the data, read the documentation, draw an entity relationship diagram and learn more about relational databases. I want to enable users to query the data as they would have been able to in the past. I will then make the code reusable so it can be used to rebuild other databases, testing it with a further two datasets in CSV form. As an additional challenge, I plan to rearrange the data to meet the principles of ‘tidy data’ to aid data analysis.
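As a rough sketch of the approach, with hypothetical table names and CSV files, rebuilding a relational database from CSVs with referential integrity enforced might look like this in Python and SQLite:

```python
import csv
import sqlite3

con = sqlite3.connect("rebuilt.db")
con.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity
con.execute("CREATE TABLE series (id INTEGER PRIMARY KEY, title TEXT)")
con.execute(
    "CREATE TABLE item (id INTEGER PRIMARY KEY, "
    "series_id INTEGER REFERENCES series(id), description TEXT)"
)

# Load parent rows before child rows so the foreign keys resolve.
for table, path in [("series", "series.csv"), ("item", "items.csv")]:
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # skip the header row
        marks = ",".join("?" * len(header))
        con.executemany(f"INSERT INTO {table} VALUES ({marks})", reader)
con.commit()
```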

PIMMS: Developing a Model Pre-Ingest Metadata Management System at the British Library

Jessica Green, British Library
GitHub / Video
I am proposing a solution for analysing a vast amount of ‘legacy’ BL digitised content and preparing it for ingest into the future Digital Asset Management System (DAMPS). This involves building a prototype for a SQL database to aggregate metadata about digitised content and preparing for SIP creation. In addition, I will write basic queries to aid in our ongoing analysis of these TIFF files, including planning for storage, copyright, digital preservation and duplicate analysis. I will use Python to import sample metadata from BL sources like SharePoint, Excel and BL catalogues – currently used for analysis of ‘live’ and ‘legacy’ digitised BL collections. There is at least 1 PB of digitised content on the BL networks alone, as well as on external media such as hard-drives and CDs. We plan to only ingest one copy of each digitised TIFF file set and need to ensure that the metadata is accurate and up-to-date at the point of ingest. This database, the Pre-Ingest Metadata Management System (PIMMS), could serve as a central metadata repository for legacy digitised BL collections until then. I look forward to using Python and SQL, as well as drawing on the coding skills of others, to make these processes more efficient and effective going forward.

Exploring, cleaning and visualising catalogue metadata

Alex Hailey, British Library
Blog / Video
Working with catalogue metadata for the India Office Records (IOR) I will undertake three tasks: 1) converting c430,000 IOR/E index entries to descriptions within the relevant volume entries; 2) producing an SQL database for 46,500 IOR/P descriptions, allowing enhanced search when compared with the BL catalogue; and 3) creating Python scripts for searching, analysis and visualisation, to be demonstrated on dataset(s) and delivered through Jupyter Notebooks.

Automatic generation of unique reference numbers for structured archival data

Graham Jevon, British Library
Blog / Video / GitHub
The British Library’s Endangered Archives Programme (EAP) funds the digital preservation of endangered archival material around the world. Third party researchers digitise material and send the content to the British Library. This is accompanied by an Excel spreadsheet containing metadata that describes the digitised content. EAP’s main task is to clean, validate, and enhance the metadata prior to ingesting it into the Library’s cataloguing system (IAMS). One of these tasks is the creation of unique catalogue reference numbers for each record (each row of data on the spreadsheet). This is a predominantly manual process that is potentially time consuming and subject to human inputting errors. This project seeks to solve this problem. The intention is to create a Windows executable program that will enable users to upload a csv file, enter a prefix, and then click generate. The instant result will be an export of a new csv file, which contains the data from the original csv file plus automatically generated catalogue reference numbers. These reference numbers are not random. They are structured in accordance with an ordered archival hierarchy. The program will include additional flexibility to account for several variables, including language encoding, computational efficiency, data validation, and wider re-use beyond EAP and the British Library.
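A heavily simplified sketch of the core generation step might look like this; the column names and numbering scheme below are assumptions for illustration, not EAP's actual rules:

```python
import csv

prefix = "EAP123"  # hypothetical project prefix entered by the user

with open("metadata.csv", newline="") as src, \
     open("metadata_with_refs.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["reference"])
    writer.writeheader()
    counters = {}  # running sequence number per parent series
    for row in reader:
        series = row["series"]  # hypothetical hierarchy column
        counters[series] = counters.get(series, 0) + 1
        # Reference numbers follow the archival hierarchy, not randomness.
        row["reference"] = f"{prefix}/{series}/{counters[series]}"
        writer.writerow(row)
```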

Automating Metadata Extraction in Born Digital Processing

Callum McKean, British Library
Video
To automate the metadata extraction stage of the Library’s current workflow for born-digital processing using Python, then interrogate and collate the extracted information in new ways using Python’s sqlite3 module.
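
A minimal sketch of what such a pipeline could look like (the directory name and table design are assumptions, not the project's actual code): walk a directory of born-digital files, record basic technical metadata, and collate it in SQLite for interrogation.

```python
import hashlib
import sqlite3
from pathlib import Path

# Hypothetical sketch: extract file-level metadata and store it in SQLite.
con = sqlite3.connect("borndigital.db")
con.execute("""CREATE TABLE IF NOT EXISTS files (
    path TEXT PRIMARY KEY, bytes INTEGER, sha256 TEXT)""")

for p in Path("accession_001").rglob("*"):  # directory name is illustrative
    if p.is_file():
        digest = hashlib.sha256(p.read_bytes()).hexdigest()
        con.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                    (str(p), p.stat().st_size, digest))
con.commit()

# Interrogate: the ten largest files in the accession
for row in con.execute("SELECT path, bytes FROM files ORDER BY bytes DESC LIMIT 10"):
    print(row)
```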

Analysis of peak customer interactions with Reference staff at the British Library: a software solution

Jaimee McRoberts, British Library
Video
The British Library, facing ongoing budget constraints, needs to deploy Reference Services staff efficiently during peak periods of demand. The service would benefit from analysis of existing statistical data recording the timestamp of each customer interaction at a Reference Desk. To do this, a software solution is required to extract, analyse, and output the necessary data. This project report demonstrates a solution using Python with the pandas library, which successfully achieves the required data analysis.
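
The heart of such an analysis can be compact. A sketch, assuming a hypothetical CSV with one timestamp per interaction: bucket interactions by hour, then average across days to reveal the peak hours of demand.

```python
import pandas as pd

# Hypothetical sketch: find peak Reference Desk hours from timestamps.
# Filename and column name are illustrative.
df = pd.read_csv("reference_interactions.csv", parse_dates=["timestamp"])

peaks = (df.set_index("timestamp")
           .resample("h")                # count interactions per hourly bucket
           .size()
           .groupby(lambda ts: ts.hour)  # collapse buckets to hour of day
           .mean())                      # average interactions for each hour

print(peaks.sort_values(ascending=False).head())
```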

Enhancing the data in the Manorial Documents Register (MDR) and making it more accessible

Elisabeth Novitski, The National Archives
Video
To develop computer scripts that take the data from the existing separate and inconsistently formatted files and merge it into a consistent, organised dataset. This data will be loaded into the Manorial Documents Register (MDR) and the National Register of Archives (NRA) to provide users with improved search and access to the manorial document information.
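
The normalise-and-merge step might be sketched like this (filenames and column names are invented for illustration): map each file's idiosyncratic headings onto a shared schema before concatenating.

```python
import pandas as pd

# Hypothetical sketch: harmonise column names across inconsistently
# formatted source files and merge them into one dataset.
RENAMES = {"Manor Name": "manor", "manor_name": "manor",
           "County": "county", "county_name": "county"}

frames = []
for path in ["mdr_part1.csv", "mdr_part2.csv"]:  # filenames are illustrative
    df = pd.read_csv(path).rename(columns=RENAMES)
    frames.append(df[["manor", "county"]])

merged = pd.concat(frames, ignore_index=True).drop_duplicates()
merged.to_csv("mdr_merged.csv", index=False)
```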

Automating data analysis for collection care research at The National Archives: spectral and textual data

Lucia Pereira Pardo, The National Archives
The day-to-day work of a conservation scientist caring for an archival collection involves acquiring experimental data from the varied range of materials present in the physical records (inks, pigments, dyes, binding media, paper, parchment, photographs, textiles, and degradation and restoration products, among others). To this end, we use multiple complementary analytical and testing techniques, such as X-ray fluorescence (XRF), Fourier transform infrared (FTIR) and fibre optic reflectance spectroscopies (FORS), multispectral imaging (MSI), colour and gloss measurements, microfading (MFT) and other accelerated-ageing tests. The outcome of these analyses is a heterogeneous and often large dataset, which can be challenging and time-consuming to process and analyse. The objective of this project is therefore to automate these tasks where possible, or at least to apply computing techniques to optimise the time and effort invested in routine operations, freeing resources for actual research and for the more specialised and creative work of interpreting the results.
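
One routine operation that lends itself to automation is batch-normalising exported spectra. A sketch, assuming each FORS measurement is exported as a two-column CSV of wavelength and reflectance (folder, filenames and column layout are assumptions):

```python
import pandas as pd
from pathlib import Path

# Hypothetical sketch: batch-process exported spectra, normalise each,
# and collect them into one table for comparison.
spectra = {}
for path in Path("fors_exports").glob("*.csv"):  # folder name is illustrative
    s = pd.read_csv(path, names=["wavelength_nm", "reflectance"], header=0)
    s["reflectance"] /= s["reflectance"].max()   # simple max-normalisation
    spectra[path.stem] = s.set_index("wavelength_nm")["reflectance"]

combined = pd.DataFrame(spectra)  # one column per measurement
combined.to_csv("fors_normalised.csv")
```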

Improving efficiencies in content development through batch processing and the automation of workloads

Harriet Roden, British Library
Video
To support and enrich the curriculum, the British Library’s Digital Learning team produces large-scale content packages for online learners through individual projects. Because the workflow relies on other internal teams for content delivery, a substantial amount of resource is spent on routine tasks duplicating collection metadata across various databases. To reduce inefficiencies, increase productivity and improve reliability, my project aimed to alleviate pressures across the workflow through workload automation, delivered in four separate phases.

The Botish Library: building a poetry printing machine with Python

Giulia Carla Rossi, British Library
Blog / Video
This project aims to build a poetry printing machine as a creative output that unites traditional content, new media and Python. The poems will be sourced from the British Library Digitised Books dataset collection, available under a Public Domain Mark; I will sort through the datasets and use Python to identify which titles can be categorised as poetry. I will then create a new dataset comprising these poetry books and their metadata, which will be connected to the printer with a Python script. The poetry printing machine will print randomly selected poems from this new dataset, together with some metadata (e.g. poem title, book title, author and shelfmark ID) that will allow users to easily identify the book.
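
The selection-and-formatting step might be sketched as below; the dataset filename and field names are invented for illustration, and the real machine would send the result to a printer driver rather than stdout.

```python
import json
import random

# Hypothetical sketch: pick a random poem from a derived dataset and
# format it, with identifying metadata, as a printable 'ticket'.
with open("poetry_dataset.json", encoding="utf-8") as f:
    poems = json.load(f)  # assumed: a list of dicts, one per poem

poem = random.choice(poems)
ticket = "\n".join([
    poem["poem_title"],
    poem["text"],
    f'From: {poem["book_title"]} by {poem["author"]}',
    f'Shelfmark: {poem["shelfmark"]}',
])
print(ticket)
```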

Automating data entry in the UOSH Tracking Database

Chris Weaver, British Library
The proposed software solution is a Python script (to feature as a module in a larger script) that extracts data from a web-based tool, either by obtaining JSON data via the site’s API or by accessing the database powering the site directly. The data obtained is then formatted and inserted into the corresponding fields of a Microsoft SQL Server database.
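
The API route could be sketched as follows, assuming a hypothetical JSON endpoint, an ODBC data source name and a simple tracking table (all illustrative):

```python
import requests
import pyodbc

# Hypothetical sketch: pull records from a web tool's JSON API and insert
# them into a Microsoft SQL Server tracking database.
records = requests.get("https://example.org/api/items", timeout=30).json()

con = pyodbc.connect("DSN=uosh_tracking")  # DSN is illustrative
cur = con.cursor()
for rec in records:
    cur.execute(
        "INSERT INTO tracking (item_id, title, status) VALUES (?, ?, ?)",
        rec["id"], rec["title"], rec["status"],
    )
con.commit()
```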

Final Module

Following the completion of the trial, participants had the opportunity to complete their PGCert in Applied Data Science by attending the final module, Analytic Tools for Information Professionals, which was part of the official course launched last autumn. We followed up with some of the participants to hear more about their experience of the full course:

“The third and final module of the computing for cultural heritage course was not only fascinating and enjoyable, it was also really pertinent to my job and I was immediately able to put the skills I learned into practice.  

The majority of the third module focussed on machine learning. We studied a number of different methods and one of these proved invaluable to the Agents of Enslavement research project I am currently leading. This project included a crowdsourcing task which asked the public to draw rectangles around four different types of newspaper advertisement. The purpose of the task was to use the coordinates of these rectangles to crop the images and create a dataset of adverts that can then be analysed for research purposes. To help ensure that no adverts were missed and to account for individual errors, each image was classified by five different people.  

One of my biggest technical challenges was to find a way of aggregating the rectangles drawn by five different people on a single page in order to calculate the rectangles of best fit. If each person only drew one rectangle, it was relatively easy for me to aggregate the results using the coding skills I had developed in the first two modules. I could simply find the average (or mean) of the five different classification attempts. But what if people identified several adverts and therefore drew multiple rectangles on a single page? For example, what if person one drew a rectangle around only one advert in the top left corner of the page; people two and three drew two rectangles on the same page, one in the top left and one in the top right; and people four and five drew rectangles around four adverts on the same page (one in each corner). How would I be able to create a piece of code that knew how to aggregate the coordinates of all the rectangles drawn in the top left and to separately aggregate the coordinates of all the rectangles drawn in the bottom right, and so on?  

One solution to this problem was to use an unsupervised machine learning method to cluster the coordinates before running the aggregation method. Much to my amazement, this worked perfectly and enabled me to successfully process the total of 92,218 rectangles that were drawn and create an aggregated dataset of more than 25,000 unique newspaper adverts.” 

- Graham Jevon, EAP Cataloguer; BL Endangered Archives Programme 
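
For readers curious what that clustering step might look like in code, here is an illustrative sketch (not the project's actual code, and the coordinates are invented) using scikit-learn's DBSCAN to group rectangles by centre point before averaging each cluster into a best-fit box:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Each row is one volunteer's rectangle: x_min, y_min, x_max, y_max
boxes = np.array([
    [10, 12, 110, 212], [12, 10, 108, 215], [11, 11, 112, 210],  # top-left advert
    [400, 15, 520, 220], [398, 12, 522, 218],                    # top-right advert
])

# Cluster rectangles by their centre points, so rectangles around the
# same advert land in the same cluster regardless of how many adverts
# each volunteer marked
centres = np.column_stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                           (boxes[:, 1] + boxes[:, 3]) / 2])
labels = DBSCAN(eps=30, min_samples=2).fit_predict(centres)

# Average each cluster's coordinates into a rectangle of best fit
for label in set(labels) - {-1}:  # -1 marks unclustered outliers
    best_fit = boxes[labels == label].mean(axis=0)
    print(label, best_fit)
```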

“The final module of the course was in some ways the most challenging — requiring a lot of us to dust off the statistics and algebra parts of our brain. However, I think, it was also the most powerful; revealing how machine learning approaches can help us to uncover hidden knowledge and patterns in a huge variety of different areas.  

Completing the course during COVID meant that collection access was limited, so I ended up completing a case study examining how generic tropes have evolved in science fiction across time using a dataset extracted from GoodReads. This work proved to be exceptionally useful in helping me to think about how computers understand language differently; and how we can leverage their ability to make statistical inferences in order to support our own, qualitative analyses. 

In my own collection area, working with born digital archives in Contemporary Archives and Manuscripts, we treat draft material — of novels, poems or anything else — as very important to understanding the creative process. I am excited to apply some of these techniques — particularly Unsupervised Machine Learning — to examine the hidden relationships between draft material in some of our creative archives. 

The course has provided many, many avenues of potential enquiry like this and I’m excited to see the projects that its graduates undertake across the Library.” 

- Callum McKean, Lead Curator, Digital; Contemporary British Collection

“I really enjoyed the Analytics Tools for Data Science module. As a data science novice, I came to the course with limited theoretical knowledge of how data science tools could be applied to answer research questions. The choice of using real-life data to solve queries specific to professionals in the cultural heritage sector was really appreciated as it made everyday applications of the tools and code more tangible. I can see now how curators’ expertise and specialised knowledge could be combined with tools for data analysis to further understanding of and meaningful research in their own collection area."

- Giulia Carla Rossi, Curator, Digital Publications; Contemporary British Collection

Please note this page was originally published in Feb 2021 and some of the resources, job titles and locations may now be out of date.
