Open and Engaged 2024: Empowering Communities to Thrive in Open Scholarship will centre on leveraging the power of communities across open scholarship, open infrastructure, emerging technologies, collections as data, equity and integrity, skills development and sustainable models to elevate research of all kinds for the public good. We take a cross-sectoral approach to the conference programme – unifying around shared values for openness – by reflecting on practices within research libraries in both the higher education and GLAM (Galleries, Libraries, Archives, Museums) sectors, as well as in national and public libraries.
This will be a hybrid event taking place at the British Library’s Knowledge Centre in St. Pancras, London, and streamed online for those unable to attend in-person.
The event will be recorded and recordings made available in the British Library’s Research Repository.
Registration
Please register for Open and Engaged 2024 by filling out this form. Registration will close on Friday 4 October for in-person attendance and Thursday 17 October for online attendance at 18:00 BST.
Registrants will be contacted with details for either in-person attendance or a link to access the online stream closer to the event.
Provisional Programme
Please note that the conference programme is subject to updates as we finalise the lineup of speakers.
09:30 Registration
10:00 Welcome remarks
10:10 Opening keynote panel: Cross disciplinary approach to open scholarship
Earlier this month, I had the pleasure of attending the “Charting the European D-SEA: Digital Scholarship in East Asian Studies” conference held at the Berlin State Library (Staatsbibliothek zu Berlin), also known as the Stabi. The conference, held on 11-12 July 2024, aimed to fill a gap in the European digital scholarship landscape by creating a research community and a space for knowledge exchange on digital scholarship issues across humanities disciplines concerned with East Asian regions and languages.
The event was a dynamic fusion of workshops, presentations and panel discussions. Over three days of workshops (8-10 July), participants were introduced to key digital methods, resources, and databases. These sessions aimed to transmit practical knowledge in digital scholarship, focusing on East Asian collections and data. The subsequent two days were dedicated to the conference proper, featuring a broad range of presentations on various themes.
The reading room in the Berlin State Library, Haus Potsdamer Straße
DH and East Asian Studies in Europe and Beyond
Conference organisers Jing Hu and Brent Ho from the Stabi, and Shih-Pei Chen and Dagmar Schäfer from the Max Planck Institute for the History of Science (MPIWG), set the stage for an enriching exchange of ideas and knowledge. The diversity of topics covered was impressive – from the more established digital resources and research tools to AI applications in historical research – the sessions provided a comprehensive overview of the current state and future directions of the field.
There were so many excellent presentations – and I often wished I could clone myself to attend parallel sessions! As expected, there was much focus on working with AI – machine learning and generative AI – and their potential in historical and humanities research. AI technologies offer powerful tools for data analysis and pattern recognition, and can significantly enhance research capabilities.
Damian Mandzunowski (Heidelberg University) talked about using AI to extract and analyse information from Chinese Comics
Shaojian Li (Renmin University of China) looked into automating the classification of pattern images using deep learning
One notable session was "Reflections on Deep Learning & Generative AI," chaired by Brent Ho and discussed by Clemens Neudecker. The roundtable highlighted the evolving role of AI in humanities research. Calvin Yeh from MPIWG discussed AI's potential to augment, rather than just automate, research processes. He shared intriguing examples of using AI tools like ChatGPT to simulate group discussions and suggest research actions. Hongsu Wang from Harvard University presented on the use of Large Language Models and traditional Transformers in the China Biographical Database (CBDB) project, demonstrating the effectiveness of these models in data extraction and standardisation.
Calvin Yeh (MPIWG) discussed AI for “Augmentation, not only Automation” and experimented with ChatGPT discussing a research approach, designing a research process and simulating a group discussion
Hongsu Wang (Harvard University) talked about extracting and standardising data using LLMs and traditional Transformers in the CBDB project – here showcasing Jeffrey Tharsen’s research to create a network graph using a prompt in ChatGPT
Exploring the Stabi
Our group tour in the Stabi was a personal highlight for me. This historic library, part of the Prussian Cultural Heritage Foundation, is renowned for its extensive collections and commitment to making digitised materials publicly accessible. The library operates from two major public sites – Haus Unter Den Linden and Haus Potsdamer Straße. Tours of both locations were available, but I chose to explore the more recent building, designed by Hans Scharoun and located in the Kulturforum on Potsdamer Straße in West Berlin – the history and architecture of which is fascinating.
A group of the conference delegates enjoying the tour of SBB’s Haus Potsdamer Straße
I really enjoyed catching up with old colleagues and making new connections with fellow scholars passionate about East Asian digital humanities!
To conclude
The Charting the European D-SEA conference at the Stabi was an enriching experience. It offered valuable insights into the integration of digital methods in East Asian studies and allowed me to connect with a global community of scholars. The combination of traditional and more recent digital practices, coupled with the forward-looking discussions on AI and deep learning, made this conference a significant milestone in the field. I look forward to seeing how these conversations evolve and contribute to the broader landscape of digital humanities.
Sustainability has become a core value at the British Library, driven by our staff-led Sustainability Group and bolstered by the addition of a dedicated Sustainability Manager nearly a year ago. As part of our ongoing commitment to environmental responsibility, we have been exploring various initiatives to reduce our environmental footprint. One such initiative is our engagement with the Digital Humanities Climate Coalition (DHCC), a collaborative and cross-institutional effort focused on understanding and minimising the environmental impact of digital humanities research.
Screenshot from the Digital Humanities Climate Coalition website
Discovering the DHCC and its toolkit
The Digital Humanities Climate Coalition (DHCC) has been on my radar for some time, primarily due to their exemplary work in promoting sustainable digital practices. The DHCC toolkit, in particular, has proven to be an invaluable resource. Designed to help individuals and organisations make more environmentally conscious digital choices, the toolkit offers practical guidance for building sustainable digital humanities projects. It encourages researchers to adopt climate-responsible practices and supports those who may lack the practical knowledge to devise greener initiatives.
The toolkit is comprehensive, providing tips on the planning and management of research infrastructure and data. It aims to empower researchers to make climate-friendly technological decisions, thereby fostering a culture of sustainability within the digital humanities community.
My primary goal in leveraging the DHCC toolkit is to raise awareness about the environmental impact of digital work and technology use. By doing so, I hope to empower Library staff to make informed decisions that contribute to our sustainability goals. The toolkit’s insights are crucial for anyone involved in digital research, offering both strategic guidance and practical tips for minimising ecological footprints.
Planning a workshop at the British Library
With the support of our Research Development team, I organised a one-day workshop at the British Library, inviting Professor James Baker, Director of Digital Humanities at the University of Southampton and a member of the DHCC, to lead the event. The workshop was designed to introduce the DHCC toolkit and provide guidance on implementing best practices in research projects. The in-person, full-day workshop was held on 5 February 2024.
Workshop highlights
The workshop featured four key sessions:
Session 1: Introductions and Framing: We began with an overview of the DHCC and its work within the GLAM sector, followed by an introduction to sustainability at the British Library, the roles that libraries play in reducing carbon footprint and awareness raising, the Green Libraries Campaign (of which the British Library was a founding partner), and perspectives on digital humanities and the use of computational methods.
CILIP’s Green Libraries Campaign banner
Session 2: Toolkit Overview: Prof Baker introduced the DHCC toolkit, highlighting its main components and practical applications, focusing on grant writing (e.g. recommendations on designing research projects, including Data Management Plans), and working practices (guidance on reducing energy consumption in day-to-day working life, e.g. communication and shared working, travel, and publishing and preserving data). The session included responses from relevant Library teams, on topics such as research project design, data management and our shared research repository.
DHCC Information, Measurement and Practice Action Group. (2022). A Researcher Guide to Writing a Climate Justice Oriented Data Management Plan (v0.6). Zenodo. https://doi.org/10.5281/zenodo.6451499
Session 3: Advocacy and Influencing: This session focused on strategies for advocating for sustainable practices within one's organisation and influencing others to adopt these practices. We covered the Library’s staff-led Sustainability Group and its activities, after which participants were then asked to consider the actions that could be taken at the Library and beyond, taking into account the types of people that might be influenced (senior leaders, colleagues, peers in wider networks/community).
Session 4: Feedback and Next Steps: Participants discussed their takeaways from the workshop and identified actionable steps they could implement in their work. This session included conversations on ways to translate workshop learnings into concrete next steps, and generated light ‘commitments’ for the next week, month and year. One fun way to set oneself a yearly reminder is to schedule an eco-friendly e-card to send to yourself in a year!
Post-workshop follow-up
Three months after the workshop had taken place, we conducted a follow-up survey to gauge its impact. The survey included a mix of agree/disagree statements (see chart below) and optional long-form questions to capture more detailed feedback. While we had only a few responses, survey results were constructive and positive. Participants appreciated the practical insights and reported better awareness of sustainable practices in their digital work.
Participants’ agree/disagree ratings for a series of statements about the DHCC workshop’s impact
Judging from responses to the set of statements above, several participants have embedded toolkit recommendations, made specific changes in their work, shared knowledge and influenced their wider networks. We got additional details on these actions in responses to the open-ended questions that followed.
What did staff members say?
Here are some comments made in relation to making changes and embedding the DHCC toolkit’s recommendations:
“Changes made to working policy and practice to order vegetarian options as standard for events.”
“I have referenced the toolkit in a chapter submitted for a monograph, in relation to my BL/university research.”
“I have discussed the toolkit's recommendations with colleagues re the projects I am currently working on. We agreed which parts of the projects were most carbon intensive and discussed ways to mitigate that.”
“I recommended a workshop on the toolkit to my [research] funding body.”
“Have engaged more with small impacts - less email traffic, fewer attachments, fewer images.”
A couple of comments were made with regard to challenges or barriers to change making. One was about colleagues being reluctant to decrease flying, or travel in general, as a way to reduce one’s carbon footprint. The second point referred to an uncertainty on how to influence internal discussions on software development infrastructure – highlighting the challenge of finding the right path to the right people.
An interesting comment was made in relation to raising environmental concerns and advocating the Toolkit:
“Shared the toolkit with wider professional network at an event at which environmentally conscious and sustainable practices were raised without prompting. Toolkit was well received with expressions of relief that others are thinking along these lines and taking practical steps to help progress the agenda.”
And finally, an excellent point about the energy-intensive use of ChatGPT (or other LLMs), which was covered at the workshop:
“The thing that has stayed with me is what was said about water consumption needed to cool the supercomputers - how every time you run one of those Chat GPT (or equivalent) queries it is the equivalent of throwing a litre of water out the window, and that Microsoft's water use has gone up 30%. I've now been saying this every time someone tells me to use one of these GPT searches. To be honest it has put me off using them completely.”
In summary
The DHCC workshop at the British Library was a great success, underscoring the importance of sustainability in digital humanities, digital projects and digital working. By leveraging the DHCC toolkit, we have taken important steps toward making our digital practices more environmentally responsible, and spreading the word across internal and external networks. Moving forward, we will continue to build on this momentum, fostering a culture of sustainability and empowering our staff to make informed, climate-friendly decisions.
Thank you to workshop contributors, organisers and helpers:
James Baker, Joely Fake, Maja Maricevic, Catherine Ross, Andy Rackley, Jez Cope, Jenny Basford, Graeme Bentley, Stephen White, Bianca Miranda Cardoso, Sarah Kirk-Browne, Andrea Deri, and Deirdre Sullivan.
This is a repeated and updated blog post by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections. She shares some background information on how a newly advertised post for a Digital Curator for OCR/HTR will help the Library streamline post-digitisation work to make its collections even more accessible to users. Our previous run of this recruitment was curtailed due to the cyber-attack on the Library - but we are now ready to restart the process!
We’ve been digitising our collections for about three decades, opening up access to incredibly diverse and rich collections, for our users to study and enjoy. However, it is important that we further support discovery and digital research by unlocking the huge potential in automatically transcribing our collections.
We’ve done some work over the years towards making our collection items available in machine-readable format, in order to enable full-text search and analysis. Optical Character Recognition (OCR) technology has been around for a while, and there are several large-scale projects that produced OCRed text alongside digitised images – such as the Microsoft Books project. Until recently, Western languages print collections have been the main focus, especially newspaper collections. A flagship collaboration with the Alan Turing Institute, the Living with Machines project, applied OCR technology to UK newspapers, designing and implementing new methods in data science and artificial intelligence, and analysing these materials at scale.
OCR of Bengali books using Transkribus, Two Centuries of Indian Print Project
Machine Learning technologies have been dealing increasingly well with both modern and historical collections, whether printed, typewritten or handwritten. Taking a broader perspective on Library collections, we have been exploring opportunities with non-Western collections too. Library staff have been engaging closely with the exploration of OCR and Handwritten Text Recognition (HTR) systems for English, Bangla, Arabic, Urdu and Chinese. Digital Curators Tom Derrick, Nora McGregor and Adi Keinan-Schoonbaert have teamed up with PRImA Research Lab and the Alan Turing Institute to run four competitions in 2017-2019, inviting providers of text recognition methods to try them out on our historical material.
We have been working with Transkribus as well – for example, Alex Hailey, Curator for Modern Archives and Manuscripts, used the software to automatically transcribe 19th century botanical records from the India Office Records. A digital humanities work strand led by former colleague Tom Derrick saw the OCR of most of our digitised collection of Bengali printed texts, digitised as part of the Two Centuries of Indian Print project. More recently Transkribus has been used to extract text from catalogue cards in a project called Convert-a-Card, as well as from Incunabula print catalogues.
An example of a catalogue card in Transkribus, showing segmentation and transcription
We've also collaborated with Colin Brisson from the READ_Chinese project on Chinese HTR, working with eScriptorium to enhance binarisation, segmentation and transcription models using manuscripts that were digitised as part of the International Dunhuang Programme. You can read more about this work in this brilliant blog post by Peter Smith, who did a PhD placement with us last year.
The British Library is now looking for someone to join us to further improve the access and usability of our digital collections, by integrating a standardised OCR and HTR production process into our existing workflows, in line with industry best practice.
In this guest post, developer Sak Supple describes his work turning digitised images of playbills into fully searchable documents... Digital Curator Mia Ridge says, 'we're absolutely delighted by Sak's work, and hope that his post helps others working with digitised collections'.
Sample playbills from the British Library's collection
This blog post explores the creation of blplaybills.org, a website that showcases data made publicly available by the British Library.
The blplaybills.org website provides a way to search for, view and download archival playbills from Great Britain and Ireland, 1600-1902, as curated by the British Library (BL).
The website is independently produced using assets made available by the British Library under a Creative Commons licence as part of an open data initiative.
The playbill data
Playbills were promotional flyers advertising entertainment events at theatres, fairs and pleasure gardens.
The BL playbills data originated as document scans (digitised from microfilm, the most viable approach for fragile artefacts) in PDF format, each file containing hundreds of individual playbills, grouped by volume (usually organised by theatre, region and/or period of history).
In total there are more than 80,000 scanned playbills available.
Besides the PDFs, there is also metadata describing where in the Library these playbills can be found (volumes, shelfmarks etc). Including this information means researchers can search for information online, and also have the volume reference at hand when visiting the Library.
This data is useful to anyone researching theatre, music, history and literature. Making it easy to find, view and download playbills using simple text searches over the internet is a good way to bring the playbills to a wider audience.
This is how blplaybills.org came into existence: the goal was to turn playbill data from the British Library into a searchable online database and image store.
The workflows
It is notoriously difficult to search PDF documents containing scans.
The text in these playbills is embedded in an image. This makes it especially difficult for computers to search the content of a scan, since a computer will interpret the text as a number of lines and curves within the image, without recognizing it as text.
Because internet technologies are well suited to searching for text, the first challenge is to turn the scanned playbill text into searchable text that a computer can more easily understand.
The chosen approach was to use Optical Character Recognition (OCR) software to capture text contained in the playbills.
OCR is a pattern matching technique, enhanced with machine learning, that finds text in an image by first using text detection algorithms to isolate character images, called glyphs. Each glyph is then broken down into features (lines, loops etc), which are used to find the best match amongst stored, pre-trained glyphs.
The recognised text can then be processed using techniques like contextual analysis and grammar checking to improve accuracy.
The result can then be stored in a computer file to form text that a computer can recognise in the form of characters, words, phrases and sentences.
The resulting text is associated with individual playbills and related metadata, and the text and metadata stored in an online database to make it searchable.
In parallel to the above processes, high and low resolution JPEG versions of individual playbills were generated and uploaded to cloud storage for online access.
The general flow is shown below.
Figure 1: Flow of data from original data to structured online resources
Each of these workflows is discussed in more detail below.
Text generation workflow
Since the goal is to make it possible to search for individual playbills, the first step was to break up PDFs containing multiple playbills into individual documents containing one playbill each.
This was done using open source software called poppler-utils that provides command line utilities for manipulating PDF documents, including generating single page documents from one multipage document.
The next step is to extract text using OCR. In 2018 my research showed that an effective open source solution for this was Tesseract.
Experiments showed that Tesseract produced best results by converting the PDF document to a lossless raster format like TIFF (Tag Image File Format) before running the OCR program. In fact, it was found that changing the size of the document, increasing the resolution and contrast and then converting to TIFF produced good output from Tesseract OCR.
The conversion from PDF to TIFF for each playbill was achieved using open source software called ImageMagick.
This workflow is shown below.
Figure 2: Workflow to produce OCR text for each individual playbill
Doing this for 80,000+ individual playbills was achieved by automating the above workflow and processing multiple playbills in parallel. The individual playbills could be uniquely identified by the name of the original multipage PDF, together with the page number of the playbill.
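As a rough illustration of the steps above, the sketch below builds the three command lines for a single playbill: split with poppler-utils (pdfseparate), convert to TIFF with ImageMagick, then run Tesseract. It is not the code behind blplaybills.org. The function names, the working directory, the identifier format and the specific resolution, resize and contrast settings are all assumptions made for the example, and it uses the ImageMagick 6 command name (convert).

```python
import subprocess
from pathlib import Path


def playbill_id(volume_pdf: str, page: int) -> str:
    """Identify a playbill by its source PDF name plus its page number."""
    return f"{Path(volume_pdf).stem}_p{page:04d}"


def ocr_commands(volume_pdf: str, page: int, workdir: str = "work") -> list:
    """Build (but do not run) the three commands for one playbill."""
    stem = Path(volume_pdf).stem
    pid = playbill_id(volume_pdf, page)
    single_pdf = f"{workdir}/{pid}.pdf"
    tiff = f"{workdir}/{pid}.tiff"
    return [
        # poppler-utils: write one page of the volume to its own PDF;
        # %04d in the output pattern is replaced by the page number.
        ["pdfseparate", "-f", str(page), "-l", str(page),
         volume_pdf, f"{workdir}/{stem}_p%04d.pdf"],
        # ImageMagick: rasterise at a higher resolution, enlarge, stretch
        # the contrast and save as lossless TIFF (values are illustrative).
        ["convert", "-density", "300", single_pdf,
         "-resize", "150%", "-normalize", "-compress", "LZW", tiff],
        # Tesseract: writes the recognised text to <workdir>/<pid>.txt
        ["tesseract", tiff, f"{workdir}/{pid}", "-l", "eng"],
    ]


def run_playbill(volume_pdf: str, page: int) -> None:
    """Run the commands in order, stopping at the first failure."""
    for cmd in ocr_commands(volume_pdf, page):
        subprocess.run(cmd, check=True)
```

Because each playbill is processed independently, calls like run_playbill could be handed to a worker pool (for example concurrent.futures) to process many pages in parallel, assuming the working directory already exists.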
Two other workflows were set up to work in parallel with this:
Convert individual PDF playbills into high and low resolution JPEGs for online viewing
Add metadata to the OCR text (volume, shelfmark, date, theatre etc) to produce a JSON file, and upload and index this information in a searchable online database
JPEG generation
As individual PDF playbills were generated from multipage PDFs, a copy of each single page PDF was sent to the JPEG generation workflow where its arrival triggered the workflow.
ImageMagick was used to create thumbnail and high resolution JPEG versions of the playbill suitable for online viewing.
The resulting JPEG files, identified by the original PDF filename and page number of the playbill, were then uploaded to cloud storage.
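A minimal sketch of this step, assuming the ImageMagick 6 command name (convert), might look like the following. The sizes, densities, quality settings and output file naming are illustrative guesses rather than the values used for the site.

```python
def jpeg_commands(single_pdf: str, out_stem: str) -> dict:
    """Build ImageMagick commands for a thumbnail and a high resolution JPEG.

    out_stem is assumed to combine the original PDF filename and the page
    number, so both outputs can be traced back to the playbill.
    """
    return {
        # Small preview image for search result lists.
        "thumbnail": ["convert", "-density", "72", single_pdf,
                      "-thumbnail", "300x300", "-quality", "80",
                      f"{out_stem}_thumb.jpg"],
        # Larger image for viewing and download.
        "high_res": ["convert", "-density", "300", single_pdf,
                     "-quality", "90", f"{out_stem}_full.jpg"],
    }
```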
JSON generation
A popular choice to store searchable text in JSON format is a database called Elasticsearch. This provides fast indexing and search capabilities, and is available for non-commercial use.
This JSON should include the searchable playbill text and relevant metadata.
Each output from the text generation workflow triggered the JSON generation, allowing metadata for the individual playbill to be merged with OCR text into a single JSON file.
The resulting JSON was uploaded and indexed in an online Elasticsearch database. This became the searchable datastore for the web application that researchers use when visiting blplaybills.org.
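To make the merge step concrete, here is a small sketch of how OCR text and metadata for one playbill could be combined into a single JSON document ready for indexing. The field names (id, page, text) and the sample metadata values are hypothetical, since the post does not describe the actual schema.

```python
import json


def build_document(pdf_name: str, page: int, ocr_text: str, metadata: dict) -> dict:
    """Merge OCR text and metadata for one playbill into one record."""
    doc = dict(metadata)                      # copy so the input is not modified
    doc["id"] = f"{pdf_name}_p{page}"         # source PDF name + page number
    doc["page"] = page
    doc["text"] = " ".join(ocr_text.split())  # collapse OCR line breaks and runs of spaces
    return doc


# Made-up sample values, for illustration only.
sample = build_document(
    "vol_123", 7,
    "THEATRE ROYAL\n  This Evening   will be presented",
    {"volume": "Vol. 123", "shelfmark": "EXAMPLE-1", "theatre": "Theatre Royal"},
)
print(json.dumps(sample, indent=2))
```

Keeping the identifier derived from the PDF name and page number means the JSON record, the OCR text and the JPEG files for a playbill can all be linked without a separate lookup table.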
The search interface
At this point the data is stored in a searchable online database, and images of individual playbills have been made available in online cloud storage.
The next step is to allow researchers to search for, view and download playbills.
The main requirements of the interface are:
Simple text search to return playbills containing matching text
These results to be quickly filtered using faceted search based on date, theatre, location, organisation and volume
Quick copy of playbill text
View and download a high resolution version of the playbill
Responsive design
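The first two requirements map naturally onto an Elasticsearch request that combines a full-text match, optional term filters and one terms aggregation per facet. The sketch below only builds such a request body. The field names are assumptions taken from the facet list above, and it assumes those fields are indexed as exact-match (keyword) values.

```python
def build_search(query_text: str, filters: dict = None,
                 facet_fields=("date", "theatre", "location", "organisation", "volume"),
                 size: int = 20) -> dict:
    """Build an Elasticsearch request body for text search with facets."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Full-text match against the OCR text of each playbill.
                "must": [{"match": {"text": query_text}}],
                # Each selected facet value narrows the results exactly.
                "filter": [{"term": {field: value}}
                           for field, value in (filters or {}).items()],
            }
        },
        # One bucket list per facet, used to draw the filter options.
        "aggs": {field: {"terms": {"field": field}} for field in facet_fields},
    }
```

A search for "Hamlet" narrowed to one theatre would then be build_search("Hamlet", {"theatre": "Theatre Royal"}), with the returned aggregation buckets supplying the counts shown beside each facet value.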
The interface is shown in Figure 3 below.
Figure 3: Online search interface
The web interface is hosted in AWS/EC2 (Amazon Web Services cloud compute service) and uses standard web frameworks used for the creation of single page applications.
Some software development was necessary to create backend workflows, and to automate and integrate them with each other.
This was achieved using a combination of scripting (NodeJS, Bourne shell and Python) and C programs.
The front-end was developed with JavaScript, NodeJS, Angular and HTML5/CSS3.
Recent work and next steps
I recently made some modifications to the above approach to improve the quality of OCR generated text for each playbill.
Specifically, Tesseract has been replaced by a utility called textra (Swift/macOS) that uses the Apple Vision framework for character recognition. This significantly improved the quality of the text generated by the OCR process, resulting in improved search accuracy. This technology was not available in 2018 when blplaybills.org was first created.
Another method to improve the accuracy of search might be to enhance OCR text with text transcribed as part of a crowdsourcing initiative from the British Library: In the Spotlight. This involved members of the public transcribing titles, names and locations in playbills. By adding this information to the indexed data already generated, search accuracy could be further improved.
An interesting piece of research would be to consider if LLMs (Large Language Models) could be fine tuned to enhance the results of traditional OCR techniques.
The goal would be to find a generalised approach that uses modern natural language processing techniques to improve the automatic transcription of less machine-readable archival material such as, but not limited to, these playbills. Ideally these techniques could also be applied to multi-lingual material.
This will be the focus of future work to improve the data behind blplaybills.org.
The British Library is continuing to recover from last year’s cyber-attack. While our teams work to restore our services safely and securely, one of our goals in the Digital Research Team is to get some of the information from our currently inaccessible web pages into an easily readable and shareable format. We’ll be sharing these pages via blog posts here, with information recovered from the Wayback Machine, a fantastic initiative of the Internet Archive.
The next page in this series is all about the student projects that came out of our Computing for Cultural Heritage project with the National Archives and Birkbeck University. This student project page was captured by the Wayback Machine on 7 June 2023.
Computing for Cultural Heritage Student Projects
This page provides abstracts for a selection of student projects undertaken as part of a one-year part-time Postgraduate Certificate (PGCert), Computing for Cultural Heritage, co-developed by British Library, National Archives and Birkbeck University and funded by the Institute of Coding as part of a £4.8 million University skills drive.
“I have gone from not being able to print 'hello' in Python to writing some relatively complex programs and having a much greater understanding of data science and how it is applicable to my work."
- Jessica Green
Key points
The aim of the trial was to provide professionals working in the cultural heritage sector with an understanding of basic programming and computational analytic tools to support them in their daily work
During the Autumn & Spring terms (October 2019-April 2020), 12 staff members from the British Library and 8 staff members from The National Archives completed two new trial modules at Birkbeck University: Demystifying computing for heritage professionals and Work-based Project
Transforming Physical Labels into Digital References
Sotirios Alpanis, British Library
This project aims to use computing to convert data collected during the preparation of archive material for digitisation into a tool that can verify and validate image captures, and subsequently label them. This will take as its input physical information about each document being digitised, perform and facilitate a series of validations throughout image capture and quality assurance, and result in an XML file containing a map of physical labels to digital files. The project will take place within the British Library/Qatar Foundation Partnership (BL/QFP), which is digitising archive material for display on the QDL.qa.
Enhancing national thesis metadata with persistent identifiers
Jenny Basford, British Library
Working with data from ISNI (International Standard Name Identifier) Agency and EThOS (Electronic Theses Online Service), both based at the British Library, I intend to enhance the metadata of both databases by identifying doctoral supervisors in thesis metadata and matching these data with ISNI holdings. This work will also feed into the European-funded FREYA project, which is concerned with the use of a wide variety of persistent identifiers across the research landscape to improve openness in research culture and infrastructure through Linked Data applications.
A software tool to support the social media activities of the Unlocking Our Sound Heritage Project
Lucia Cavorsi, British Library (Video)
I would like to design a software tool able to flag forthcoming anniversaries, by comparing all the dates present in SAMI (sound and moving image catalogue – Sound Archive) with the current date. The aim of this tool is to suggest potential content for the Sound Archive’s social media posts. Useful dates in SAMI which could be matched with the current date and provide material for tweets are: birth and death dates of performers or authors, radio programme broadcast dates, and recording dates. I would like this tool to also match the subjects currently present in SAMI with the subjects featured in the list of 2020 anniversaries which the social media team uses, for example anniversaries like ‘International HIV day’, ‘International day of Lesbian visibility’ etc. A Windows pop-up message will be designed to notify the team of anniversaries on the day. If time permits, it would be convenient to also analyse what hashtags have been used over the last year by the people who are followed by or follow the Sound Archive Twitter account. By extracting a list of these hashtags, further and more sound-related anniversaries could be added to the list of anniversaries currently used by the UOSH’s social media team.
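The core date comparison described in this abstract can be sketched in a few lines. This is an illustration only, not the student's tool: the record structure (a label and a date) is a hypothetical stand-in for whatever a SAMI export provides, and the sample entries are made up.

```python
from datetime import date


def anniversaries(records: list, today: date) -> list:
    """Return (years_ago, label) pairs for records matching today's day and month."""
    hits = []
    for rec in records:
        d = rec["date"]
        # Same day and month, in an earlier year, counts as an anniversary.
        if (d.month, d.day) == (today.month, today.day) and d.year < today.year:
            hits.append((today.year - d.year, rec["label"]))
    return sorted(hits, reverse=True)  # oldest anniversaries first


# Made-up catalogue entries, for illustration only.
catalogue = [
    {"label": "Recording date: example field recording", "date": date(1960, 3, 14)},
    {"label": "Broadcast date: example radio programme", "date": date(1995, 3, 14)},
    {"label": "Birth date: example performer", "date": date(1930, 7, 2)},
]
matches = anniversaries(catalogue, date(2020, 3, 14))
```

Records dated 29 February would only match in leap years under this simple rule, so a real tool would need to decide how to handle them.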
Computing Cholera: Topic modelling the catalogue entries of the General Board of Health
Christopher Day, The National Archives (Blog / Other)
The correspondence of the General Board of Health (1848–1871) documents the work of a body set up to deal with cholera epidemics in a period where some English homes were so filthy as to be described as “mere pigholes not fit for human beings”. Individual descriptions for each of the more than 89,000 letters are available on Discovery, The National Archives (UK)’s catalogue. Now, some 170 years later, access to the letters themselves has been disrupted by another epidemic, COVID-19. This paper examines how data science can be used to repurpose archival catalogue descriptions, initially created to enhance the ‘human findability’ of records (and favoured by many UK archives due to high digitisation costs), for large-scale computational analysis. The records of the General Board will be used as a case study: their catalogue descriptions topic modelled using a latent Dirichlet allocation model, visualised, and analysed – giving an insight into how new sanitary regulations were negotiated with a divided public during an epidemic. The paper then explores the validity of using the descriptions of historical sources as a source in their own right; and asks how, during a time of restricted archival access, metadata can be used to continue research.
An Automated Text Extraction Tool for Use on Digitised Maps
Nicholas Dykes, British Library (Blog / Video)
Researchers of history often have difficulty geo-locating historical place names in Africa. I would like to apply automated transcription techniques to a digitised archive of historical maps of Africa to create a resource that will allow users to search for text, and discover where, and on which maps that text can be found. This will enable identification and analysis both of historical place names and of other text, such as topographical descriptions. I propose to develop a software tool in Python that will send images stored locally to the Google Vision API, and retrieve and process a response for each image, consisting of a JSON file containing the text found, pixel coordinate bounding boxes for each instance of text, and a confidence score. The tool will also create a copy of each image with the text instances highlighted. I will experiment with the parameters of the API in order to achieve the most accurate results. I will incorporate a routine that will store each related JSON file and highlighted image together in a separate folder for each map image, and create an Excel spreadsheet containing text results, confidence scores, links to relevant image folders, and hyperlinks to high-res images hosted on the BL website. The spreadsheet and subfolders will then be packaged together into a single downloadable resource. The finished software tool will have the capability to create a similar resource of interlinked spreadsheet and subfolders from any batch of images.
Reconstituting a Deconstructed Dataset using Python and SQLite
Alex Green, The National Archives (Video)
For this project I will rebuild a database and establish the referential integrity of the data from CSV files using Python and SQLite. To do this I will need to study the data, read the documentation, draw an entity relationship diagram and learn more about relational databases. I want to enable users to query the data as they would have been able to in the past. I will then make the code reusable so it can be used to rebuild other databases, testing it with a further two datasets in CSV form. As an additional challenge, I plan to rearrange the data to meet the principles of ‘tidy data’ to aid data analysis.
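As a rough illustration of what rebuilding a database from CSV files with referential integrity can involve, the sketch below loads two small CSV extracts into SQLite and lets the database enforce the link between them. The table names, column names and sample data are all invented for the example; the project's actual schema is not described in the abstract.

```python
import csv
import io
import sqlite3

# Hypothetical CSV exports; table and column names are assumptions.
SERIES_CSV = "series_id,title\n1,Board minutes\n2,Correspondence\n"
ITEMS_CSV = "item_id,series_id,description\n10,1,Minutes 1850\n11,2,Letter from clerk\n"


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def build_database(series_rows, item_rows):
    """Create an in-memory SQLite database with a parent and a child table."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE series (series_id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE item ("
        "item_id INTEGER PRIMARY KEY, "
        "series_id INTEGER NOT NULL REFERENCES series(series_id), "
        "description TEXT)"
    )
    # Parent rows go in first so that every child row has something to point at.
    conn.executemany(
        "INSERT INTO series VALUES (?, ?)",
        [(int(r["series_id"]), r["title"]) for r in series_rows],
    )
    conn.executemany(
        "INSERT INTO item VALUES (?, ?, ?)",
        [(int(r["item_id"]), int(r["series_id"]), r["description"]) for r in item_rows],
    )
    conn.commit()
    return conn
```

With the foreign key in place, an attempt to insert an item that points at a non-existent series raises `sqlite3.IntegrityError`, which is one way of surfacing orphaned rows in the source files.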
PIMMS: Developing a Model Pre-Ingest Metadata Management System at the British Library
Jessica Green, British Library (GitHub / Video)
I am proposing a solution to analysing and preparing for ingest a vast amount of ‘legacy’ BL digitised content into the future Digital Asset Management System (DAMPS). This involves building a prototype for a SQL database to aggregate metadata about digitised content and preparing for SIP creation. In addition, I will write basic queries to aid in our ongoing analysis about these TIFF files, including planning for storage, copyright, digital preservation and duplicate analysis. I will use Python to import sample metadata from BL sources like SharePoint, Excel and BL catalogues – currently used for analysis of ‘live’ and ‘legacy’ digitised BL collections. There is at least 1 PB of digitised content on the BL networks alone, as well as on external media such as hard-drives and CDs. We plan to only ingest one copy of each digitised TIFF file set and need to ensure that the metadata is accurate and up-to-date at the point of ingest. This database, the Pre-Ingest Metadata Management System (PIMMS), could serve as a central metadata repository for legacy digitised BL collections until then. I look forward to using Python and SQL, as well as drawing on the coding skills from others, to make these processes more efficient and effective going forward.
Exploring, cleaning and visualising catalogue metadata
Alex Hailey, British Library (Blog / Video)
Working with catalogue metadata for the India Office Records (IOR), I will undertake three tasks: 1) converting c. 430,000 IOR/E index entries to descriptions within the relevant volume entries; 2) producing an SQL database for 46,500 IOR/P descriptions, allowing enhanced search when compared with the BL catalogue; and 3) creating Python scripts for searching, analysis and visualisation, to be demonstrated on dataset(s) and delivered through Jupyter Notebooks.
Automatic generation of unique reference numbers for structured archival data
Graham Jevon, British Library (Blog / Video / GitHub)
The British Library’s Endangered Archives Programme (EAP) funds the digital preservation of endangered archival material around the world. Third party researchers digitise material and send the content to the British Library. This is accompanied by an Excel spreadsheet containing metadata that describes the digitised content. EAP’s main task is to clean, validate, and enhance the metadata prior to ingesting it into the Library’s cataloguing system (IAMS). One of these tasks is the creation of unique catalogue reference numbers for each record (each row of data on the spreadsheet). This is a predominantly manual process that is potentially time consuming and subject to human inputting errors. This project seeks to solve this problem. The intention is to create a Windows executable program that will enable users to upload a csv file, enter a prefix, and then click generate. The instant result will be an export of a new csv file, which contains the data from the original csv file plus automatically generated catalogue reference numbers. These reference numbers are not random. They are structured in accordance with an ordered archival hierarchy. The program will include additional flexibility to account for several variables, including language encoding, computational efficiency, data validation, and wider re-use beyond EAP and the British Library.
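To make the idea of hierarchy-driven reference numbers concrete, here is a minimal sketch. It is not the program described above: the level names, the `Level` column, the `EAP999` prefix and the slash-separated numbering are assumptions chosen for illustration, and the rows are assumed to arrive already sorted in archival order.

```python
import csv
import io

# Assumed hierarchy, from broadest to narrowest; the real levels may differ.
LEVELS = ["Series", "File", "Item"]

# Invented sample spreadsheet, exported as CSV.
SAMPLE = (
    "Level,Title\n"
    "Series,Letters\n"
    "File,1901\n"
    "Item,Letter A\n"
    "Item,Letter B\n"
    "File,1902\n"
    "Item,Letter C\n"
    "Series,Photographs\n"
)


def add_references(rows, prefix):
    """Yield copies of `rows` with a hierarchical 'Reference' column added."""
    counters = [0] * len(LEVELS)
    for row in rows:
        depth = LEVELS.index(row["Level"])
        counters[depth] += 1
        # A new parent restarts the numbering of everything beneath it.
        for lower in range(depth + 1, len(LEVELS)):
            counters[lower] = 0
        # Basic validation: a record cannot appear before its parent level.
        if 0 in counters[:depth]:
            raise ValueError(f"{row['Level']} row has no parent: {row}")
        parts = [str(n) for n in counters[: depth + 1]]
        yield {**row, "Reference": "/".join([prefix] + parts)}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
references = [r["Reference"] for r in add_references(rows, "EAP999")]
```

Because each number is derived from the row's position in the hierarchy rather than typed by hand, re-running the function on the same ordered input always produces the same references.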
Automating Metadata Extraction in Born Digital Processing
Callum McKean, British Library (Video)
To automate the metadata extraction section of the Library’s current work-flow for born-digital processing using Python, then interrogate and collate information in new ways using the SQLite module.
Analysis of peak customer interactions with Reference staff at the British Library: a software solution
Jaimee McRoberts, British Library (Video)
The British Library, facing on-going budget constraints, has a need to efficiently deploy Reference Services staff during peak periods of demand. The service would benefit from analysis of existing statistical data recording the timestamp of each customer interaction at a Reference Desk. In order to do this, a software solution is required to extract, analyse, and output the necessary data. This project report demonstrates a solution utilising Python alongside the pandas library which has successfully achieved the required data analysis.
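The kind of peak-period analysis described here can be illustrated with a short sketch. The project itself used pandas; the version below deliberately uses only the Python standard library as a stand-in, and the timestamps, their format and the function names are invented for the example.

```python
from collections import Counter
from datetime import datetime

# Invented timestamps standing in for Reference Desk interaction logs.
TIMESTAMPS = [
    "2019-05-01 10:05", "2019-05-01 10:40", "2019-05-01 14:15",
    "2019-05-02 11:55", "2019-05-02 14:30", "2019-05-02 14:45",
]


def interactions_per_hour(timestamps):
    """Count interactions by hour of day, across all dates."""
    hours = (datetime.strptime(t, "%Y-%m-%d %H:%M").hour for t in timestamps)
    return Counter(hours)


def peak_hours(timestamps, top=2):
    """Return the `top` busiest hours as (hour, count) pairs."""
    return interactions_per_hour(timestamps).most_common(top)
```

The same grouping could be broken down further by weekday or by desk if those fields were recorded alongside each timestamp.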
Enhancing the data in the Manorial Documents Register (MDR) and making it more accessible
Elisabeth Novitski, The National Archives (Video)
To develop computer scripts that will take the data from the existing separate and inconsistently formatted files and merge them into a consistent and organised dataset. This data will be loaded into the Manorial Documents Register (MDR) and National Register of Archives (NRA) to provide the user with improved search ability and access to the manorial document information.
Automating data analysis for collection care research at The National Archives: spectral and textual data
Lucia Pereira Pardo, The National Archives
The day-to-day work of a conservation scientist working for the care of an archival collection involves acquiring experimental data from the varied range of materials present in the physical records (inks, pigments, dyes, binding media, paper, parchment, photographs, textiles, degradation and restoration products, among others). To this end, we use multiple and complementary analytical and testing techniques, such as X-ray fluorescence (XRF), Fourier Transform Infrared (FTIR) and Fibre Optic Reflectance spectroscopies (FORS), multispectral imaging (MSI), colour and gloss measurements, microfading (MFT) and other accelerated ageing tests. The outcome of these analyses is a heterogeneous and often large dataset, which can be challenging and time-consuming to process and analyse. Therefore, the objective of this project is to automate these tasks when possible, or at least to apply computing techniques to optimise the time and efforts invested in routine operations, so that resources are freed for actual research and more specialised and creative tasks dealing with the interpretation of the results.
Improving efficiencies in content development through batch processing and the automation of workloads
Harriet Roden, British Library (Video)
To support and enrich the curriculum, the British Library’s Digital Learning team produces large-scale content packages for online learners through individual projects. Because the workflow for content delivery relies on other internal teams, a substantial amount of resource is spent on routine tasks that duplicate collection metadata across various databases. To reduce inefficiencies, increase productivity and improve reliability, my project aimed to alleviate pressures across the workflow through workload automation, delivered in four separate phases.
The Botish Library: building a poetry printing machine with Python
Giulia Carla Rossi, British Library (Blog / Video)
This project aims to build a poetry printing machine, as a creative output that unites traditional content, new media and Python. The poems will be sourced from the British Library Digitised Books dataset collection, available under Public Domain Mark; I will sort through the datasets and identify which titles can be categorised as poetry using Python. I will then create a new dataset comprising these poetry books and their related metadata, which will then be connected to the printer with a Python script. The poetry printing machine will print randomized poems from this new dataset, together with some metadata (e.g. poem title, book title, author and shelfmark ID) that will allow users to easily identify the book.
Automating data entry in the UOSH Tracking Database
Chris Weaver, British Library
The proposed software solution is a Python script (to feature as a module in a larger script) that extracts data from a web-based tool, either by obtaining data in JSON format via the site’s API or by accessing the database powering the site directly. The data obtained is then formatted and inserted into corresponding fields in a Microsoft SQL Server database.
Final Module
Following the completion of the trial, participants had the opportunity to complete their PGCert in Applied Data Science by attending the final module, Analytic Tools for Information Professionals, which was part of the official course launched last autumn. We followed up with some of the participants to hear more about their experience of the full course:
“The third and final module of the computing for cultural heritage course was not only fascinating and enjoyable, it was also really pertinent to my job and I was immediately able to put the skills I learned into practice.
The majority of the third module focussed on machine learning. We studied a number of different methods and one of these proved invaluable to the Agents of Enslavement research project I am currently leading. This project included a crowdsourcing task which asked the public to draw rectangles around four different types of newspaper advertisement. The purpose of the task was to use the coordinates of these rectangles to crop the images and create a dataset of adverts that can then be analysed for research purposes. To help ensure that no adverts were missed and to account for individual errors, each image was classified by five different people.
One of my biggest technical challenges was to find a way of aggregating the rectangles drawn by five different people on a single page in order to calculate the rectangles of best fit. If each person only drew one rectangle, it was relatively easy for me to aggregate the results using the coding skills I had developed in the first two modules. I could simply find the average (or mean) of the five different classification attempts. But what if people identified several adverts and therefore drew multiple rectangles on a single page? For example, what if person one drew a rectangle around only one advert in the top left corner of the page; people two and three drew two rectangles on the same page, one in the top left and one in the top right; and people four and five drew rectangles around four adverts on the same page (one in each corner). How would I be able to create a piece of code that knew how to aggregate the coordinates of all the rectangles drawn in the top left and to separately aggregate the coordinates of all the rectangles drawn in the bottom right, and so on?
One solution to this problem was to use an unsupervised machine learning method to cluster the coordinates before running the aggregation method. Much to my amazement, this worked perfectly and enabled me to successfully process the total of 92,218 rectangles that were drawn and create an aggregated dataset of more than 25,000 unique newspaper adverts.”
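The cluster-then-aggregate approach described in the testimonial above can be illustrated with a small sketch. The testimonial does not say which unsupervised method was used, so the example swaps in a deliberately simple distance-threshold grouping on rectangle centres; the `(x, y, width, height)` layout, the sample coordinates and the threshold are all assumptions.

```python
from statistics import mean

# Invented rectangles as (x, y, width, height), drawn by different volunteers
# on one page: three around a top-left advert, two around a top-right advert.
RECTS = [
    (10, 10, 100, 50), (14, 12, 96, 52), (12, 8, 104, 48),
    (400, 10, 100, 50), (404, 14, 100, 46),
]


def centre(rect):
    """Return the centre point of a rectangle."""
    x, y, w, h = rect
    return (x + w / 2, y + h / 2)


def cluster_rectangles(rects, max_distance):
    """Group rectangles whose centres lie near an existing cluster's mean centre."""
    clusters = []
    for rect in rects:
        cx, cy = centre(rect)
        for cluster in clusters:
            kx = mean(centre(r)[0] for r in cluster)
            ky = mean(centre(r)[1] for r in cluster)
            if ((cx - kx) ** 2 + (cy - ky) ** 2) ** 0.5 <= max_distance:
                cluster.append(rect)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append([rect])
    return clusters


def aggregate(cluster):
    """Average each coordinate across the rectangles in one cluster."""
    return tuple(mean(values) for values in zip(*cluster))


best_fit = [aggregate(c) for c in cluster_rectangles(RECTS, max_distance=50)]
```

Grouping first means that the averaging step only ever combines rectangles drawn around the same advert, however many adverts each volunteer marked on the page.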
“The final module of the course was in some ways the most challenging — requiring a lot of us to dust off the statistics and algebra parts of our brain. However, I think, it was also the most powerful; revealing how machine learning approaches can help us to uncover hidden knowledge and patterns in a huge variety of different areas.
Completing the course during COVID meant that collection access was limited, so I ended up completing a case study examining how generic tropes have evolved in science fiction across time using a dataset extracted from GoodReads. This work proved to be exceptionally useful in helping me to think about how computers understand language differently; and how we can leverage their ability to make statistical inferences in order to support our own, qualitative analyses.
In my own collection area, working with born digital archives in Contemporary Archives and Manuscripts, we treat draft material — of novels, poems or anything else — as very important to understanding the creative process. I am excited to apply some of these techniques — particularly Unsupervised Machine Learning — to examine the hidden relationships between draft material in some of our creative archives.
The course has provided many, many avenues of potential enquiry like this and I’m excited to see the projects that its graduates undertake across the Library.”
- Callum McKean, Lead Curator, Digital; Contemporary British Collection
“I really enjoyed the Analytics Tools for Data Science module. As a data science novice, I came to the course with limited theoretical knowledge of how data science tools could be applied to answer research questions. The choice of using real-life data to solve queries specific to professionals in the cultural heritage sector was really appreciated as it made everyday applications of the tools and code more tangible. I can see now how curators’ expertise and specialised knowledge could be combined with tools for data analysis to further understanding of and meaningful research in their own collection area.”
- Giulia Carla Rossi, Curator, Digital Publications; Contemporary British Collection
Please note this page was originally published in Feb 2021 and some of the resources, job titles and locations may now be out of date.
The British Library is continuing to recover from last year’s cyber-attack. While our teams work to restore our services safely and securely, one of our goals in the Digital Research Team is to get some of the information from our currently inaccessible web pages into an easily readable and shareable format. We’ll be sharing these pages via blog posts here, with information recovered from the Wayback Machine, a fantastic initiative of the Internet Archive.
The second page in this series is a case study on the impact of our Digital Scholarship Training Programme, captured by the Wayback Machine on 3 October 2023.
Graham Jevon: A Digital Transformation Story
'The Digital Scholarship Training Programme has introduced me to new software, opened my eyes to digital opportunities, provided inspiration for me to improve, and helped me attain new skills'
Key points
Graham Jevon has been an active participant in the Digital Scholarship Training Programme
Through gaining digital skills he has been able to build software to automate tricky processes
Graham went on to become a Coleridge Fellowship scholar, putting these digital skills to good use!
Find out more on what Graham has been up to on his Staff Profile
Did you know? The Digital Scholarship Training Programme has been running since 2012, and creates opportunities for staff to develop necessary skills and knowledge to support emerging areas of modern scholarship.
The Digital Scholarship Training Programme
Since I joined the Library in 2018, the Digital Scholarship Training Programme has been integral to the trajectory of both my personal development and the working practices within my team.
The very first training course I attended at the library was the introduction to OpenRefine. The key thing that I took away from this course was not necessarily the skills to use the software, but simply understanding OpenRefine’s functionality and the possibilities the software offered for my team. This inspired me to spend time after the session devising a workflow that enhanced our cataloguing efficiency and accuracy, enabling me to create more detailed and accurate metadata in less time. With OpenRefine I created a semi-automated workflow that required the kind of logical thinking associated with computer programming, but without the need to understand a computer programming language.
Computing for Cultural Heritage
The use of this kind of logical thinking and the introduction to writing computational expressions within OpenRefine sparked an interest in me to learn a computing language such as Python. I started a free online Python introduction, but without much context to the course my attention quickly waned. When the Digital Scholarship Computing for Cultural Heritage course was announced I therefore jumped at the chance to apply.
I went into the Computing for Cultural Heritage course hoping to learn skills that would enable me to solve cataloguing and administrative problems, skills that would help me process data in spreadsheets more efficiently and accurately. I had one particular problem in mind and I was able to address this problem in the project module of the course. For the project we had to design a software program. I created a program (known as ReG), which automatically generates structured catalogue references for archival collections. I was extremely pleased with the outcome of this project and this piece of software is something that my team now use in our day-to-day activities. An error-prone task that could take hours or days to complete manually in Excel now takes just a few seconds and is always 100% accurate.
This in itself was a great outcome of the course that met my hopes at the outset. But this course did so much more. I came away from the course with a completely new set of data science skills that I could build on and apply in other areas. For example, I recently created another piece of software that helps my team survey any digitisation data that we receive, to help us spot any errors or problems that need fixing.
The British Library Coleridge Research Fellowship
The data science skills were particularly instrumental in enabling me to apply successfully for the British Library’s Coleridge research fellowship. This research fellowship is partly a personal development scheme and it gave me the opportunity to put my new data science skills into practice in a research environment (rather than simply using them in a cataloguing context). My previous academic research experience was based on traditional analogue methods. But for the Coleridge project I used crowdsourcing to extract data for analysis from two collections of newspapers.
The third and final Computing for Cultural Heritage module focussed on machine learning and I was able to apply these skills directly to the crowdsourcing project Agents of Enslavement. The first crowdsourcing task, for example, asked the public to draw rectangles around four specific types of newspaper advertisement. To help ensure that no adverts were missed and to account for individual errors, each image was classified by five different people. I therefore had to aggregate the results. Thanks to the new data science skills I had learned, I was able to write a Python script that used machine learning algorithms to aggregate 92,000 total rectangles drawn by the public into an aggregated dataset of 25,000 unique newspaper advertisements.
The OpenRefine and Computing for Cultural Heritage courses are just two of the many digital scholarship training sessions that I have attended. But they perfectly illustrate the value of the Digital Scholarship Training Programme, which has introduced me to new software, opened my eyes to digital opportunities, provided inspiration for me to improve, and helped me attain new skills that I have been able to put into practice both for the benefit of myself and my team.
The Digital Scholarship Training Programme has been running since 2012, and creates opportunities for staff to develop necessary skills and knowledge to support emerging areas of modern scholarship.
About
This internal and bespoke staff training programme is one of the cornerstones of the Digital Curator Team’s work at the British Library. Running since 2012, it provides colleagues with the space and opportunity to delve into and explore all that digital content and new technologies have to offer in the research domain today. The Digital Curator team oversees the design and delivery of roughly 50-60 training events a year. Since its inception, well over a thousand individual staff members have come through the programme, on average attending three or more courses each, and the Library has seen a step change in its capacity to support innovative digital research.
Objectives
Staff are familiar and conversant with the foundational concepts, methods and tools of digital scholarship.
Staff are empowered to innovate.
Collaborative digital initiatives flourish across subject areas within the Library as well as externally.
Our internal capacity for training and skill-sharing in digital scholarship is a shared responsibility across the Library.
The Programme
What's it all about?
To celebrate our ten-year anniversary, we created a series of video testimonials from the people behind the Training Programme - coordinators, instructors, and attendees. Click 'Watch on YouTube' to view the whole series of videos.
Nora McGregor, Digital Curator, gives a presentation all about the Digital Scholarship Training Programme - where it started, where it's going and what it hopes to accomplish.
Courses
As digital research methods have changed over time, so too have course topics and content. Today's full course catalogue reflects this through a diversity of topics, from cleaning up data and digital storytelling to command-line programming and geo-referencing.
Courses range from half-days to full-day workshops for no more than 15 attendees at a time and are taught mainly by staff members but also external trainers where necessary. Example courses include:
We host a monthly “Hack & Yack” to run alongside the more formal training programme. During these two-hour self-paced casual meet-ups, open to all staff, the group works through a variety of online tutorials on a particular digital topic. Example sessions include:
The Digital Scholarship Reading Group holds informal discussions on the first Tuesday of each month. Each month we discuss an article, conference, podcast or video related to digital scholarship. It's a great way to keep up with new ideas or reality check trends in digital scholarship (including the digital humanities). We welcome people from any department in the Library, and take suggestions for topics that are particularly relevant to diverse teams or disciplines.
Curious about what we cover? Check out this previous blog post that covers the last five years of our Reading Group.
21st Century Curatorship Talk Series
The Digital Scholarship team hosts the 21st Century Curatorship Programme (C21st), a series of professional development talks and seminars, open to all staff, providing a forum for keeping up with new developments and emerging technologies in scholarship, libraries and cultural heritage.
What’s new?
In 2019, the British Library and partners Birkbeck University and The National Archives were awarded £222,420 in funding by the Institute of Coding (IoC) to co-develop a one-year part-time postgraduate Certificate (PGCert), Computing for Cultural Heritage, as part of a £4.8 million University skills drive. The new course aims to provide working professionals, particularly across the GLAM sector (Galleries, Libraries, Archives and Museums), with an understanding of basic programming, analytic tools and computing environments to support them in their daily work.
Further information
For more information on the Training Programme's most recent year, including our performance numbers and topics covered by the training, please see our full-screen, interactive infographic.