Digital scholarship blog

Enabling innovative research with British Library digital collections


23 December 2024

AI (and machine learning, etc) with British Library collections

Machine learning (ML) is a hot topic, especially when it’s hyped as ‘AI’. How might libraries use machine learning / AI to enrich collections, making them more findable and usable in computational research? Digital Curator Mia Ridge lists some examples of external collaborations, internal experiments and staff training with AI / ML and digitised and born-digital collections.

Background

The trust that the public places in libraries is hugely important to us - all our 'AI' should be 'responsible' and ethical AI. The British Library was a partner in Sheffield University's FRAIM: Framing Responsible AI Implementation & Management project (2024). We've also used lessons from the projects described here to draft our AI Strategy and Ethical Guide.

Many of the projects below have contributed to our Digital Scholarship Training Programme and our Reading Group has been discussing deep learning, big data and AI for many years. It's important that libraries are part of conversations about AI, supporting AI and data literacy and helping users understand how ML models and datasets were created.

If you're interested in AI and machine learning in libraries, museums and archives, keep an eye out for news about the AI4LAM community's Fantastic Futures 2025 conference at the British Library, 3-5 December 2025. If you can't wait that long, join us for the 'AI Debates' at the British Library.

Using ML / AI tools to enrich collections

Generative AI tends to get the headlines, but at the time of writing, tools that use non-generative machine learning to automate specific parts of a workflow have more practical applications for cultural heritage collections. That is, 'AI' is currently more process than product.

Text transcription is a foundational task that makes digitised books and manuscripts more accessible to search, analysis and other computational methods. For example, oral history staff have experimented with speech transcription tools, raising important questions and theoretical and practical issues around automatic speech recognition (ASR) tools and chatbots.
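To give a concrete (and deliberately hedged) sense of what such experiments can look like, here is a minimal sketch of running the open-source Whisper speech recognition model in Python. This is an illustration only, not the tool our oral history colleagues used, and the file name is hypothetical.

```python
# pip install openai-whisper  (also requires ffmpeg on the system)
import whisper

# Load a small pre-trained model; larger models are more accurate but slower.
model = whisper.load_model("base")

# Transcribe a digitised recording (the file name here is hypothetical).
result = model.transcribe("oral_history_interview.wav")

print(result["text"])  # the full transcript
for segment in result["segments"]:
    # Each segment has start/end times, useful for aligning text with audio.
    print(f"{segment['start']:.1f}-{segment['end']:.1f}s: {segment['text']}")
```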

We've used Transkribus and eScriptorium to transcribe handwritten and printed text in a range of scripts and alphabets. For example:

Creating tools and demonstrators through external collaborations

Mining the UK Web Archive for Semantic Change Detection (2021)

This project used word vectors with web archives to track words whose meanings changed over time. Resources: DUKweb (Diachronic UK web) and blog post ‘Clouds and blackberries: how web archives can help us to track the changing meaning of words’.
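As a rough illustration of the underlying idea (not the project's actual code), the sketch below trains word vectors on two toy time slices with the gensim library and compares a word's nearest neighbours in each period; the real project worked with diachronic embeddings built from UK web archive text at a vastly larger scale.

```python
# pip install gensim
from gensim.models import Word2Vec

# Toy tokenised corpora for two time slices.
slice_2000 = [
    ["picked", "blackberry", "jam", "from", "the", "hedgerow"],
    ["dark", "cloud", "brought", "rain", "and", "wind"],
] * 50
slice_2013 = [
    ["blackberry", "phone", "email", "keyboard", "handset"],
    ["cloud", "storage", "server", "upload", "backup"],
] * 50

model_2000 = Word2Vec(slice_2000, vector_size=50, min_count=1, seed=1)
model_2013 = Word2Vec(slice_2013, vector_size=50, min_count=1, seed=1)

# One simple signal of semantic change: compare a word's nearest neighbours
# in embeddings trained on different periods.
print("2000s neighbours:", model_2000.wv.most_similar("blackberry", topn=3))
print("2010s neighbours:", model_2013.wv.most_similar("blackberry", topn=3))
```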

Graphs showing how words associated with the words blackberry, cloud, eta and follow changed over time.
From blackberries to clouds... word associations change over time

Living with Machines (2018-2023)

Our Living With Machines project with The Alan Turing Institute pioneered new AI, data science and ML methods to analyse masses of newspapers, books and maps to understand the impact of the industrial revolution on ordinary people. Resources: short video case studies, our project website, final report and over 100 outputs in the British Library's Research Repository.

Outputs that used AI / machine learning / data science methods such as lexicon expansion, computer vision, classification and word embeddings included:

Tools and demonstrators created via internal pilots and experiments

Many of these examples were enabled by the skills and enthusiasm for ML experiments of our in-house Research Software Engineers and the Living with Machines (LwM) team at the British Library, combined with long-standing Library staff knowledge of collection records and processes:

British Library resources for re-use in ML / AI

Our Research Repository includes datasets suitable for ground truth training, including 'Ground truth transcriptions of 18th & 19th century English language documents relating to botany from the India Office Records'.

Our ‘1 million images’ on Flickr Commons have inspired many ML experiments, including:

The Library has also shared models and datasets for re-use on the machine learning platform Hugging Face.
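As an illustration of how such resources can be reused, the snippet below streams a few records from a dataset on Hugging Face with the datasets library. The repository ID shown is illustrative only; check the Library's Hugging Face organisation page for the actual dataset and model names.

```python
# pip install datasets
from itertools import islice
from datasets import load_dataset

# Illustrative repository ID - not necessarily the exact name on Hugging Face.
dataset = load_dataset("TheBritishLibrary/blbooks", split="train", streaming=True)

# Stream a few records without downloading the whole dataset.
for record in islice(dataset, 3):
    print(record)
```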

13 December 2024

Looking back on the Data Science Accelerator

From April to July this year an Assistant Statistician at the Cabinet Office and a Research Software Engineer at the British Library teamed up as mentee (Catherine Macfarlane, CO) and mentor (Harry Lloyd, BL) for the Data Science Accelerator. In this blog post we reflect on the experience and what it meant for us and our work.

Introduction to the Accelerator

Harry: The Accelerator has been around since 2015, set up as a platform to ‘accelerate’ civil servants at the start of their data science journey who have a project that meets a business need and a real willingness to learn. Successful applicants are paired with mentors from across the Civil Service who have experience in techniques applicable to the problem, working together one protected day a week for 12 weeks. I was lucky enough to be a mentee in 2020, working on statistical methods to combine different types of water quality data, and my mentor Charlie taught me a lot of what I know. The programme played a huge role in the development of my career, so it was a rewarding moment to come back as a mentor for the April cohort.

Catherine: On joining the Civil Service in 2023, I had the pleasure of becoming part of a talented data team that has motivated me to continually develop my skills. My academic background in Mathematics with Finance provides me with a strong theoretical foundation, but I am striving to improve my practical abilities. I am particularly interested in Artificial Intelligence, which is gaining increasing recognition across government, sparking discussions on its potential to improve efficiency.

I saw the Data Science Accelerator as an opportunity to deepen my knowledge, address a specific business need, and share insights with my team. The prospect of working with a mentor and immersing myself in an environment where diverse projects are undertaken was particularly appealing. A significant advantage was the protected time this project offered - a rare benefit! I was grateful to be accepted and paired with Harry, an experienced mentor who had already completed the programme. Following our first meeting, I felt ready to tackle the upcoming 12 weeks to see what we could achieve!

Photo of the mentee and mentor on a video call
With one of us based in England and the other in Scotland, virtual meetings were the norm. Collaborative tools like screen sharing and GitHub allowed us to work together effectively.

The Project

Catherine: Our team is interested in the annual reports and accounts of Arm’s Length Bodies (ALBs), a category of public bodies funded to deliver a public or government service.  The project addressed the challenge my team faces in extracting the highly unstructured information stored in annual reports and accounts. With this information we would be able to enhance the data validation process and reduce the burden of commissioning data from ALBs on other teams. We proposed using Natural Language Processing to retrieve this information, analysing and querying it using a Large Language Model (LLM).

Initially, I concentrated on extracting five features, such as full-time equivalent staff in the organisation, from a sample of ALBs across 13 departments for the financial year 22/23. After discussions with Harry, we decided to use Retrieval-Augmented Generation (RAG) to develop a question-answering system. RAG is a technique that combines LLMs with relevant external documents to improve the accuracy and reliability of the output. This is done by retrieving documents that are relevant to the questions asked and then asking the LLM to generate an answer based on the retrieved material. We carefully selected a pre-trained LLM while considering ethical factors like model openness.
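To make the retrieve-then-generate pattern concrete, here is a minimal sketch using the sentence-transformers library. The document chunks, the model name and the `call_llm` placeholder are illustrative, not the project's actual pipeline.

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy chunks standing in for segmented text from an ALB annual report.
chunks = [
    "The average number of full-time equivalent staff during 2022-23 was 412.",
    "Total expenditure on consultancy services was £1.2 million.",
    "The board met six times during the financial year.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

question = "How many full-time equivalent staff did the organisation have?"
question_vector = embedder.encode([question], normalize_embeddings=True)[0]

# Retrieval: rank chunks by cosine similarity (a dot product, since the
# vectors are normalised) and keep the best match.
best_chunk = chunks[int(np.argmax(chunk_vectors @ question_vector))]

# Generation: ask an LLM to answer using only the retrieved context.
# `call_llm` is a placeholder for whichever model or API is being used.
prompt = f"Answer using only this context:\n{best_chunk}\n\nQuestion: {question}"
# answer = call_llm(prompt)
print(prompt)
```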

RAG
How a retrieval augmented generation system works. A document in this context is a segmented chunk of a larger text that can be parsed by an LLM.

The first four weeks focused on exploratory analysis, data processing, and labelling, all completed in R, which was essential for preparing the data for input into the language model. The subsequent stages involved model building and evaluation in Python, which required the most time and focus. This was my first time using Python, and Harry’s guidance was extremely beneficial during our pair coding sessions. A definite highlight for me was seeing the pipeline start to generate answers!

To bring all our results together, I created a dashboard in Shiny, ensuring it was accessible to both technical and non-technical audiences. The final stage involved summarising all our hard work from the past 12 weeks in a 10 minute presentation and delivering it to the Data Science Accelerator cohort.

Harry: Catherine’s was the best planned project of the ones I reviewed, and I suspected she’d be well placed to make best use of the 12 weeks. I wasn’t wrong! We covered a lot of the steps involved in good reproducible analysis. The exploratory work gave us a great sense of the variance in the data, setting up quantitative benchmarks for the language model results drove our development of the RAG system, and I was so impressed that Catherine managed to fit in building a dashboard on top of all of that.

Our Reflections

Catherine: Overall this experience was fantastic. In a short amount of time, we managed to achieve a considerable amount. It was amazing to develop my skills and grow in confidence. Harry was an excellent mentor; he encouraged discussion and asked insightful questions, which made our sessions both productive and enjoyable. A notable highlight was visiting the British Library! It was brilliant to have an in-person session with Harry and meet the Digital Research team.

A key success of the project was meeting the objectives we set out to achieve. Patience was crucial, especially when investigating errors and identifying the root problem. The main challenge was managing such a large project that could be taken in multiple directions. It can be natural to spend a long time on one area, such as exploratory analysis, but we ensured that we completed the key elements that allowed us to move on to the next stage. This balance was essential for the project's overall success.

Harry: We divided our days between time for Catherine to work solo and pair programming. Catherine is a really keen learner, and I think this approach helped her drive the project forward while giving us space to cover foundational programming topics and a new programming language. My other role was keeping an eye on the project timeline. Giving the occasional steer on when to stick with something and when to move on helped (I hope!) Catherine to achieve a huge amount in three months. 

Dashboard
A page from the dashboard Catherine created in the last third of the project.

Ongoing Work

Catherine: Our team recognises the importance of continuing this work. I have developed an updated project roadmap, which includes utilising Amazon Web Services to enhance the speed and memory capacity of our pipeline. Additionally, I have planned to compare various large language models, considering ethical factors, and I will collaborate with other government analysts involved in similar projects. I am committed to advancing this project, further upskilling the team, and keeping Harry updated on our progress.

Harry: RAG, and the semantic rather than keyword search that underlies it, represents a maturation of LLM technology that has the potential to change the way users search our collections. Anticipating that this will be a feature of future library services platforms, we have a responsibility to understand more about how these technologies will work with our collections at scale. We’re currently carrying out experiments with RAG and the linked data of the British National Bibliography to understand how searching like this will change the way users interact with our data.
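As a small, hypothetical illustration of why semantic search matters for catalogues, the sketch below embeds a few invented titles with sentence-transformers and retrieves them by meaning rather than by keyword; it is not our experimental setup, just the general idea.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Invented titles, not real BNB records.
titles = [
    "A history of the steam locomotive in Victorian Britain",
    "Gardening for small urban spaces",
    "Railways and the industrial revolution",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
title_vectors = model.encode(titles, normalize_embeddings=True)

# A keyword search for "trains" matches none of these titles, but a semantic
# search still ranks the railway books highest.
query_vector = model.encode(["books about trains"], normalize_embeddings=True)[0]
scores = title_vectors @ query_vector
for score, title in sorted(zip(scores, titles), reverse=True):
    print(f"{score:.2f}  {title}")
```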

Conclusions

Disappointingly the Data Science Accelerator was wound down by the Office for National Statistics at the end of the latest cohort, citing budget pressures. That has made us one of the last mentor/mentee pairings to benefit from the scheme, which we’re both incredibly grateful for and deeply saddened by. The experience has been a great one, and we’ve each learned a lot from it. We’ll continue to develop RAG at the Cabinet Office and the British Library, and hope to advocate for and support schemes like the Accelerator in the future!

12 December 2024

Automating metadata creation: an experiment with Parliamentary 'Road Acts'

This post was originally written by Giorgia Tolfo in early 2023 then lightly edited and posted by Mia Ridge in late 2024. It describes work undertaken in 2019, and provides context for resources we hope to share on the British Library's Research Repository in future.

The Living with Machines project used a range of diverse sources, from newspapers to maps and census data. This post discusses the Road Acts, 18th century Acts of Parliament stored at the British Library, as an example of some of the challenges in digitising historical records, and suggests computational methods for reducing some of the overhead of cataloguing Library records during digitisation.

What did we want to do?

Before collection items can be digitised, they need a preliminary catalogue record - there's no point digitising records without metadata for provenance and discoverability. Like many extensive collections, the Road Acts weren't already catalogued. Creating the necessary catalogue records manually wasn't a viable option for the timeframe and budget of the project, so with the support of British Library experts Jennie Grimshaw and Iris O’Brien, we decided to explore automated methods for extracting metadata from digitised images of the documents themselves. The metadata created could then be mapped to a catalogue schema provided by Jennie and Iris. 

Given the complexity of the task, the project timeframe, and the infrastructure and resources needed, the agency Cogapp was commissioned to do the following:

  • Export metadata for 31 scanned microfilms in a format that matched the required fields in a metadata schema provided by the British Library curators
  • OCR (including normalising the 'long S') to a standard agreed with the Living with Machines project
  • Create a package of files for each Act including: OCR (METS + ALTO) + images (scanned by British Library)

To this end, we provided Cogapp with:

  • Scanned images of the 31 microfilm reels, named using the microfilm ID and the numerical sequential order of the frame
  • The Library's metadata requirements
  • Curators' support to explain and guide them through the metadata extraction and record creation process 

Once all of this was put in place, the process started. However, this is where we encountered the main problem.

First issue: the typeface

After some research and tests we came to the conclusion that the typeface (or font, shown in Figure 1) is probably English Blackletter. However, at the time, OCR software - software that uses 'optical character recognition' to transcribe text from digitised images, like ABBYY, Tesseract or Transkribus - couldn't accurately read this font. Running OCR using a generic tool would inevitably lead to poor, if not unusable, OCR. You can create 'models' for unrecognised fonts by manually transcribing a set of documents, but this can be time-consuming.
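For readers who want to try this themselves, here is a minimal sketch of running Tesseract over a scanned page via the pytesseract wrapper. The file name is hypothetical, and as noted above the standard English model will struggle with blackletter unless a specially trained model is supplied.

```python
# pip install pytesseract pillow  (the Tesseract binary must also be installed)
from PIL import Image
import pytesseract

# Hypothetical file name for a scanned frame such as the page in Figure 1.
image = Image.open("SPRMicP14_12_016.png")

# 'eng' is the standard English model; blackletter pages generally need a
# specially trained model (e.g. a Fraktur-style traineddata file) to give
# usable results.
text = pytesseract.image_to_string(image, lang="eng")
print(text)
```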

Image of a historical document
Figure 1: Page showing typefaces and layout. SPRMicP14_12_016

Second issue: the marginalia

As you can see in Figure 2, each Act has marginalia - additional text in the margins of the page. 

This makes the task of recognising the layout of information on the page more difficult. At the time, most OCR software wasn't able to detect marginalia as separate blocks of text. As a consequence, these portions of text are often rendered inline, merged with the main text. Some examples showing how OCR software with standard settings interprets the page in Figure 2 are below.
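One way to see the layout problem is to ask the OCR engine for word-level coordinates and treat narrow blocks near the page edge as marginalia. The sketch below uses pytesseract's image_to_data for a crude, illustrative version of this approach; the file name is hypothetical, and dedicated layout analysis tools handle this far more robustly.

```python
# pip install pytesseract pillow
from PIL import Image
import pytesseract

image = Image.open("SPRMicP14_12_324.png")  # hypothetical file name

# image_to_data returns word-level results with bounding boxes, so narrow
# blocks hugging the left edge can be treated as marginalia rather than
# being merged into the main text column.
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

for i, word in enumerate(data["text"]):
    if not word.strip():
        continue
    in_margin = data["left"][i] + data["width"][i] < image.width * 0.25
    print("margin" if in_margin else "body", word)
```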

Black and white image of printed page with comments in the margins
Figure 2: Printed page with marginalia. SPRMicP14_12_324

 

OCR generated by ABBYY FineReader:

Qualisicatiori 6s Truitees;

Penalty on acting if not quaiified.

Anno Regni septimo Georgii III. Regis.

9nS be it further enaften, Chat no person ihali he tapable of aftingt ao Crustee in the Crecution of this 9ft, unless be ftall he, in his oton Eight, oj in the Eight of his ©Btfe, in the aftual PofTefli'on anb jogment oj Eeceipt of the Eents ana profits of tanas, Cenements, anb 5)erebitaments, of the clear pearlg Oalue of J?iffp Pounbs} o? (hall be ©eit apparent of some person hatiing such estate of the clear gcatlg 5ia= lue of ©ne hunb?eb Pounbs; o? poffcsseb of, o? intitieb unto, a personal estate to the amount o? Oalue of ©ne thoufanb Pounbs: 9nb if ang Person hcrebg beemeo incapable to aft, ihali presume to aft, etierg such Per* son (hall, so? etierg such ©ffcnce, fojfcit anb pag the @um of jTiftg pounbs to ang person o? 

 

OCR generated by the open source tool Tesseract:

586 Anno Regni ?eptimo Georgi III. Regis.

Qualification

of Truttees;

Penalty on

Gnd be it further enated, That no. Per?on ?hall bÈ

capable of ating as Tru?tËe in the Crecution of thig

A, unle?s he ?hall be, in his own Right, 02 in the

Right of his Wife, in the a‰ual Pofe??ion and En. |

joyment 02 Receipt of the Rents and P2zofits of Lands,

Tenements, and hereditaments, of the clear pearly

Ualue of Fifty Pounds z o? hall be Deir Apparent of

?ome Per?on having ?uch Cfitate of the clear yearly Uga-

lue of Dne Hundred Pounds ; 02 po??e?leD of, 02 intitled

unto, a Per?onal E?tate to the Amount 02 Ualue of One

thou?and Pounds : And if any Per?on hereby deemed

acting if not incapable to ai, ?hall p2e?ume to ait, every ?uch Perz

qualified.

 

OCR generated by Cogapp (without any enhancement):

of Trusteesi

586

Anno Regni ſeptimo Georgii III. Regis.

Qualihcation and be it further enałted, That no perſon thall be

capable of aging as Trulltee in the Erecution of this

ad, unlefs he thall be, in his own Right, of in the

Right of his Wife, in the ađual Polellion and En:

joyment or Receipt of the Rents and Profits of Lands,

Tenements, and hereditaments, of the clear pearly

Ualue of ffifty pounds : oi thall be peir apparent of

ſome Perſon having ſuch Etate of the clear yearly Ua:

lue of Dne hundred Pounds; ou podeled of, od intitled

unto, a Perſonal Elate to the amount ou Ualue of Dne

Penalty on thouſand Pounds : and if any perſon hereby deemed

acting if not incapable to ad, thall preſume to ađ, every ſuch Per-

Qualified.

 

As you can see, the OCR transcription results were too poor to use in our research.

Changing our focus: experimenting with metadata creation

Time was running out fast, so we decided to adjust our expectations about text transcription, and asked Cogapp to focus on generating metadata for the digitised Acts. They have reported on their process in a post called 'When AI is not enough' (which might give you a sense of the challenges!).

Since the title page of each Act has a relatively standard layout it was possible to train a machine learning model to recognise the title, year and place of publication, imprint etc. and produce metadata that could be converted into catalogue records. These were sent on to British Library experts for evaluation and quality control, and potential future ingest into our catalogues.
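To give a flavour of the general approach (this is emphatically not Cogapp's model, just a simplified rule-based sketch over an invented example), structured fields can be pulled out of OCR'd title-page text like this:

```python
import re

# OCR text from a title page - an invented example in the style of these
# Acts, not a real record.
title_page = """An Act for repairing the Road from Chesterfield
to Matlock in the County of Derby.
Anno Regni septimo Georgii III. Regis.
London: Printed by Mark Baskett, 1767."""

def first_match(pattern, text):
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1) if match else None

record = {
    # Title: everything up to the first full stop, joined onto one line.
    "title": " ".join(title_page.split(".")[0].split()),
    # Year of publication: a four-digit year anywhere on the page.
    "year": first_match(r"\b(1[5-9]\d{2})\b", title_page),
    # Place of publication: a capitalised word followed by a colon at a line start.
    "place_of_publication": first_match(r"^([A-Z][a-z]+):", title_page),
}
print(record)
```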

Conclusion

This experience, although only partly successful in creating fully transcribed pages, explored the potential of producing the basis of catalogue records computationally, and was also an opportunity to test workflows for automated metadata extraction from historical sources. 

Since this work was put on hold in 2019, advances in OCR features built into generative AI chatbots offered by major companies mean that a future project could probably produce good quality transcriptions and better structured data from our digitised images.

If you have suggestions or want to get in touch about the dataset, please email [email protected]

11 December 2024

MIX 2025: Writing With Technologies Call for Submissions

One of the highlights of our Digital Storytelling exhibition last year was hosting the 2023 MIX conference at the British Library in collaboration with Bath Spa University and the MyWorld programme, which explores the future of creative technology innovation.

MIX is an established forum for the discussion and celebration of writing and technology, bringing together researchers, writers, technologists and practitioners from around the world.  Many of the topics covered are relevant to work in the British Library as part of our research into collecting, curating and preserving interactive digital works and emerging formats.

Image text says MIX 2025 Writing With Technologies 2nd July 2025, with organisation logos underneath the text

As a new year draws near, we are looking forward to upcoming events. MIX will be back in Bath at the Locksbrook Campus on Wednesday 2 July 2025 and their call for submissions is currently open until early February. Organisers are looking for proposals for 15 minute papers/presentations or 5 minute lightning talks from technologists, artists, writers and poets, academic researchers and independent scholars, on the following themes:

  • Issues of trust and truth in digital writing
  • The use of generative AI tools by authors, poets and screenwriters
  • Debates around AI and ethics for creative practitioners
  • Emerging immersive storytelling practices

MIX 2025 will investigate the intersection between these themes, including the challenges and opportunities for interactive and locative works, poetry film, screenwriting and writing for games, as well as digital preservation, archiving, enhanced curation and storytelling with AI. Conference organisers are also welcoming papers and presentations on the innovative use of AI tools in creative writing pedagogy. The deadline for submissions is 5pm GMT on Monday 10 February 2025; if you have any enquiries, email [email protected].

As part of the programme, New York Times bestselling writer and publisher Michael Bhaskar, currently working for Microsoft AI and co-author of the book The Coming Wave: AI, Power and the 21st Century’s Greatest Dilemma, will appear in conversation.

To whet your appetite ahead of the next MIX you may want to check out the Writing with Technologies webinar series presented by MyWorld with Bath Spa University’s Centre for Cultural and Creative Industries and the Narrative and Emerging Technologies Lab. This series examines AI’s emerging influence on writing and publishing in various fields through talks from writers, creators, academics, publishing professionals and AI experts. The next webinar, on Wednesday 22nd January 2025, 2-3pm GMT, will discuss AI and Creative Expression; book your free place here.

11 November 2024

British National Bibliography resumes publication

The British National Bibliography (BNB) has resumed publication, following a period of unavailability due to a cyber-attack in 2023.

Having started in 1950, the BNB predates the founding of the British Library, but despite many changes over the years its purpose remains the same: to record the publishing output of the United Kingdom and the Republic of Ireland. The BNB includes books and periodicals, covering both physical and electronic material. It describes forthcoming items up to sixteen weeks ahead of their publication, so it is essential as a current awareness tool. To date, the BNB contains almost 5.5 million records.

As our ongoing recovery from the cyber-attack continues, our Collection Metadata department have developed a process by which the BNB can be published in formats familiar to its many users. Bibliographic records and summaries will be shared in several ways:

  • The database is searchable on the Share Family initiative's BNB Beta platform at https://bl.natbib-lod.org/ (see example record in the image below)
  • Regular updates in PDF format will be made freely available to all users. Initially this will be on request
  • MARC21 bibliographic records will be supplied directly to commercial customers across the world on a weekly basis
Image comprised of five photographs: a shelf of British National Bibliography volumes, the cover of a printed copy of BNB and examples of BNB records
This image includes photographs of the very first BNB entry from 1950 (“Male and female”) and the first one we produced in this new process (“Song of the mysteries”)

Other services, such as Z39.50 access and outputs in other formats, are currently unavailable. We are working towards restoring these, and will provide further information in due course.

The BNB is the first national bibliography to be made available on the Share Family initiative's platform. It is published as linked data, and forms part of an international collaboration of libraries to link and enhance discovery across multiple catalogues and bibliographies.

The resumption of the BNB is the result of adaptations built around long-established collaborative working partnerships, with Bibliographic Data Services (who provide our CIP records) and UK Legal Deposit libraries, who contribute to the Shared Cataloguing Programme.

The International Federation of Library Associations describes bibliographies like the BNB as "a permanent record of the cultural and intellectual output of a nation or country, which is witnessed by its publishing output". We are delighted to be able to resume publication of the BNB, especially as we prepare to celebrate its 75th anniversary in 2025.

For further information about the BNB, please contact [email protected].

Mark Ellison, Collection Metadata Services Manager

29 October 2024

Happy Twelfth Birthday Wikidata!

Today the global Wikidata community is celebrating its 12th birthday! Wikidata originally went live on 29th October 2012, back when Andrew Gray was the British Library’s first Wikipedian in Residence, and it has expanded massively since then.

Wikidata is a free and open knowledge base that can be read and edited by both humans and machines, which acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia and Wikisource. Wikidata content is available under a free license (CC0), exported using standard formats (JSON & RDF), and can be interlinked to other open data sets on the linked data web.

Drawing of four people around a birthday cake

Over the past year Wikidata passed the incredible milestone of 2 billion edits, making it the most edited Wikimedia project of all time. However, this growth has created stability and scaling challenges for the Wikidata Query Service. To address these, the development team have been working on several projects, including splitting the data in the Query Service and releasing the multiple languages code, to handle the current size of Wikidata better.

Heat map of Wikidata’s geographic coverage as of October 2024

Another major focus during the past year has been promoting Wikidata reuse. To make it easier to access Wikidata’s data there is a new REST API. Plus developers who build with Wikidata’s data now have access to a Wikidata developer portal, which holds important information and provides inspiration about what is possible with Wikidata’s data.
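As a small illustration of reusing Wikidata's data programmatically, the sketch below sends a SPARQL query to the Wikidata Query Service from Python. The query assumes Q23308 is the item for the British Library and P195 is the 'collection' property; adjust as needed.

```python
# pip install requests
import requests

# List a few items whose 'collection' (P195) is the British Library (Q23308).
# Heavy use should respect the Query Service's usage policy.
query = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P195 wd:Q23308 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "wikidata-birthday-example/0.1"},
)
for row in response.json()["results"]["bindings"]:
    print(row["itemLabel"]["value"], row["item"]["value"])
```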

The international library community actively engages with Wikidata. In 2019 the IFLA Wikidata Working Group was formed to explore the integration of Wikidata and Wikibase with library systems, and alignment of the Wikidata ontology with library metadata formats such as BIBFRAME, RDA, and MARC. There is also the LD4 Wikidata Affinity Group, who hold Affinity Group Calls and Wikidata Working Hours throughout the year.

If you are new to Wikidata and want to learn more, there are many resources available, including this Zine about Wikidata, created by our recent Wikimedian in Residence Dr Lucy Hinnie, and these videos:

You may also want to check out the online Bibliography of Wikidata, which lists books, academic conference presentations and peer-reviewed papers that focus on Wikidata as their subject.

This post is by Digital Curator Stella Wisdom.

24 October 2024

Southeast Asian Language and Script Conversion Using Aksharamukha

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected]. 

 

The British Library’s vast Southeast Asian collection includes manuscripts, periodicals and printed books in the languages of the countries of maritime Southeast Asia, including Indonesia, Malaysia, Singapore, Brunei, the Philippines and East Timor, as well as on the mainland, from Thailand, Laos, Cambodia, Myanmar (Burma) and Vietnam.

The display of literary manuscripts from Southeast Asia outside of the Asian and African Studies Reading Room in St Pancras (photo by Adi Keinan-Schoonbaert)

 

Several languages and scripts from the mainland were the focus of recent development work commissioned by the Library and done on the script conversion platform Aksharamukha. These include Shan, Khmer, Khuen, and northern Thai and Lao Dhamma (Dhamma, or Tham, meaning ‘scripture’, is the script that several languages are written in).

These and other Southeast Asian languages and scripts pose multiple challenges to us and our users. Collection items in languages using non-romanised scripts are mainly catalogued (and therefore searched by users) using romanised text. For some language groups, users need to search the catalogue by typing in the transliteration of title and/or author using the Library of Congress (LoC) romanisation rules.

Items’ metadata text converted using the LoC romanisation scheme is often unintuitive, and therefore poses a barrier for users, hindering discovery and access to our collections via the online catalogues. In addition, curatorial and acquisition staff spend a significant amount of time manually converting scripts, a slow process which is prone to errors. Other libraries worldwide holding Southeast Asian collections and using the LoC romanisation scheme face the same issues.

Excerpt from the Library of Congress romanisation scheme for Khmer

 

Having faced these issues with the Burmese language, last year we commissioned development work on the open-access platform Aksharamukha, which enables conversion between various scripts, supporting 121 scripts and 21 romanisation methods. Vinodh Rajan, Aksharamukha’s developer, perfectly combines knowledge of languages and writing systems with computer science and coding skills. He added the LoC romanisation system to the platform’s Burmese script transliteration functionality (read about this in my previous post).

The results were outstanding – readers could copy and paste transliterated text into the Library's catalogue search box to check if we have items of interest. This has also greatly enhanced cataloguing and acquisition processes by enabling the creation of acquisition records and minimal records. In addition, our Metadata team updated all of our Burmese catalogue records (ca. 20,000) to include Burmese script, alongside transliteration (side note: these updated records are still unavailable to our readers due to the cyber-attack on the Library last year, but they will become accessible in the future).
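For colleagues who prefer to script conversions rather than use the website, Aksharamukha is also available as a Python package. The sketch below is a hedged illustration: transliterate.process is the package's documented entry point, but the exact source and target identifiers for the LoC/ALA-LC Burmese romanisation are assumptions to check against Aksharamukha's documentation.

```python
# pip install aksharamukha
from aksharamukha import transliterate

# Convert a short Burmese string ("Myanmar") to a romanised form. The target
# scheme name below is an assumption - check Aksharamukha's documentation for
# the exact identifier of the ALA-LC/LoC romanisation option.
romanised = transliterate.process("Burmese", "IASTLOC", "မြန်မာ")
print(romanised)
```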

The time was ripe to expand our collaboration with Vinodh and Aksharamukha. Maria Kekki, Curator for Burmese Collections, has this past year been hosting a Chevening Fellow from Myanmar, Myo Thant Linn. Myo was tasked with cataloguing manuscripts and printed books in Shan and Khuen – but found the romanisation aspect of this work very challenging to do manually. In order to facilitate Myo’s work and maximise the benefit of his fellowship, we needed an LoC romanisation functionality to be available. Aksharamukha was the right place for this – this free, open source, online tool is available to our curators, cataloguers, acquisition staff, and metadata team to use.

Former Chevening Fellow Myo Thant Linn reciting from a Shan manuscript in the Asian and African Studies Reading Room, September 2024 (photo by Jana Igunma)

 

In addition to Maria and Myo’s requirements, Jana Igunma, Ginsburg Curator for Thai, Lao and Cambodian Collections, noted that adding Khmer to Aksharamukha would be immensely helpful for cataloguing our Khmer backlog and would assist with new acquisitions. Northern Thai and Lao Dhamma scripts would be most useful for cataloguing newly acquired print material and adding original scripts to manuscript records. The automation of LoC transliteration could be very cost-effective, saving many hours of cataloguing, acquisitions and metadata teams’ time. Khmer is a great example – it has the most extensive alphabet in the world (74 letters), and its romanisation is extremely complicated and time consuming!

First three leaves with text in a long format palm leaf bundle (សាស្ត្រាស្លឹករឹត/sāstrā slẏk rẏt) containing part of the Buddhist cosmology (សាស្ត្រាត្រៃភូមិ/Sāstrā Traibhūmi) in Khmer script, 18th or 19th century. Acquired by the British Museum from Edwards Goutier, Paris, on 6 December 1895. British Library, Or 5003, ff. 9-11

 

We therefore needed to enhance Aksharamukha’s script conversion functionality with these additional scripts. This could generally be done by referring to existing LoC conversion tables, while taking into account any permutations of diacritics or character variations. However, it has definitely not been as simple as that!

For example, the presence of diacritics instigated a discussion between internal and external colleagues on the use of precomposed vs. decomposed formats in Unicode, when romanising original script. LoC systems use two types of coding schemata, MARC 21 and MARC 8. The former allows for precomposed diacritic characters, and the latter does not – it allows for decomposed format. In order to enable both these schemata, Vinodh included both MARC 8 and MARC 21 as input and output formats in the conversion functionality.
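The difference between the two formats is easy to see with Python's standard unicodedata module; this small example is purely illustrative.

```python
import unicodedata

# 'ā' (a with macron) can be stored precomposed (one code point, NFC) or
# decomposed (base letter plus combining macron, NFD). MARC 21 permits the
# precomposed form; MARC 8-style data expects the decomposed form.
precomposed = unicodedata.normalize("NFC", "ā")
decomposed = unicodedata.normalize("NFD", "ā")

print([hex(ord(c)) for c in precomposed])  # ['0x101']
print([hex(ord(c)) for c in decomposed])   # ['0x61', '0x304']
print(precomposed == decomposed)           # False, although they render identically
```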

Another component, implemented for Burmese in the previous development round, but also needed for Khmer and Shan transliterations, is word spacing. Vinodh implemented word separation in this round as well – although this would always remain something that the cataloguer would need to check and adjust. Note that this is not enabled by default – you would have to select it (under ‘input’ – see image below).

Screenshot from Aksharamukha, showcasing Khmer word segmentation option

 

It is heartening to know that enhancing Aksharamukha has been making a difference. Internally, Myo had been a keen user of the Shan romanisation functionality (though Khuen romanisation is still work-in-progress); and Jana has been using the Khmer transliteration too. Jana found it particularly useful to use Aksharamukha’s option to upload a photo of the title page, which is then automatically OCRed and romanised. This saved precious time otherwise spent on typing Khmer!

It should be mentioned that, when it comes to cataloguing Khmer language books at the British Library, both original Khmer script and romanised metadata are being included in catalogue records. Aksharamukha helps to speed up the process of cataloguing and eliminates typing errors. However, capitalisation and in some instances word separation and final consonants need to be adjusted manually by the cataloguer. Therefore, it is necessary that the cataloguer has a good knowledge of the language.

On the left: photo of a title page of a Khmer language textbook for Grade 9, recently acquired by the British Library; on the right: conversion of original Khmer text from the title page into LoC romanisation standard using Aksharamukha

 

The conversion tool for Tham (Lanna) and Tham (Lao) works best for texts in Pali language, according to its LoC romanisation table. If Aksharamukha is used for works in northern Thai language in Tham (Lanna) script, or Lao language in Tham (Lao) script, cataloguer intervention is always required as there is no LoC romanisation standard for northern Thai and Lao languages in Tham scripts. Such publications are rare, and an interim solution that has been adopted by various libraries is to convert Tham scripts to modern Thai or Lao scripts, and then to romanise them according to the LoC romanisation standards for these languages.

Other libraries have been enjoying the benefits of the new developments to Aksharamukha. Conversations with colleagues from the Library of Congress revealed that present and past commissioned developments on Aksharamukha had a positive impact on their operations. LoC has been developing a transliteration tool called ScriptShifter. Aksharamukha’s Burmese and Khmer functionalities are already integrated into this tool, which can convert over ninety non-Latin scripts into Latin script following the LoC/ALA guidelines. The British Library funding Aksharamukha to make several Southeast Asian languages and scripts available in LoC romanisation has already been useful!

If you have feedback or encounter any bugs, please feel free to raise an issue on GitHub. And, if you’re interested in other scripts romanised using LoC schemas, Aksharamukha has a complete list of the ones that it supports. Happy conversions!

 

08 July 2024

Embracing Sustainability at the British Library: Insights from the Digital Humanities Climate Coalition Workshop

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected]. 

 

Sustainability has become a core value at the British Library, driven by our staff-led Sustainability Group and bolstered by the addition of a dedicated Sustainability Manager nearly a year ago. As part of our ongoing commitment to environmental responsibility, we have been exploring various initiatives to reduce our environmental footprint. One such initiative is our engagement with the Digital Humanities Climate Coalition (DHCC), a collaborative and cross-institutional effort focused on understanding and minimising the environmental impact of digital humanities research.

Screenshot from the Digital Humanities Climate Coalition website
 

Discovering the DHCC and its toolkit

The Digital Humanities Climate Coalition (DHCC) has been on my radar for some time, primarily due to their exemplary work in promoting sustainable digital practices. The DHCC toolkit, in particular, has proven to be an invaluable resource. Designed to help individuals and organisations make more environmentally conscious digital choices, the toolkit offers practical guidance for building sustainable digital humanities projects. It encourages researchers to adopt climate-responsible practices and supports those who may lack the practical knowledge to devise greener initiatives.

The toolkit is comprehensive, providing tips on the planning and management of research infrastructure and data. It aims to empower researchers to make climate-friendly technological decisions, thereby fostering a culture of sustainability within the digital humanities community.

My primary goal in leveraging the DHCC toolkit is to raise awareness about the environmental impact of digital work and technology use. By doing so, I hope to empower Library staff to make informed decisions that contribute to our sustainability goals. The toolkit’s insights are crucial for anyone involved in digital research, offering both strategic guidance and practical tips for minimising ecological footprints.

Planning a workshop at the British Library

With the support of our Research Development team, I organised a one-day workshop at the British Library, inviting Professor James Baker, Director of Digital Humanities at the University of Southampton and a member of the DHCC, to lead the event. The workshop was designed to introduce the DHCC toolkit and provide guidance on implementing best practices in research projects. The in-person, full-day workshop was held on 5 February 2024.

Workshop highlights

The workshop featured four key sessions:

Session 1: Introductions and Framing: We began with an overview of the DHCC and its work within the GLAM sector, followed by an introduction to sustainability at the British Library, the roles that libraries play in reducing carbon footprint and awareness raising, the Green Libraries Campaign (of which the British Library was a founding partner), and perspectives on digital humanities and the use of computational methods.

CILIP’s Green Libraries Campaign banner

Session 2: Toolkit Overview: Prof Baker introduced the DHCC toolkit, highlighting its main components and practical applications, focusing on grant writing (e.g. recommendations on designing research projects, including Data Management Plans), and working practices (guidance on reducing energy consumption in day-to-day working life, e.g. communication and shared working, travel, and publishing and preserving data). The session included responses from relevant Library teams, on topics such as research project design, data management and our shared research repository.

DHCC publication cover: DHCC Information, Measurement and Practice Action Group. (2022). A Researcher Guide to Writing a Climate Justice Oriented Data Management Plan (v0.6). Zenodo. https://doi.org/10.5281/zenodo.6451499

Session 3: Advocacy and Influencing: This session focused on strategies for advocating for sustainable practices within one's organisation and influencing others to adopt these practices. We covered the Library’s staff-led Sustainability Group and its activities, after which participants were asked to consider the actions that could be taken at the Library and beyond, taking into account the types of people who might be influenced (senior leaders, colleagues, peers in wider networks/community).

Session 4: Feedback and Next Steps: Participants discussed their takeaways from the workshop and identified actionable steps they could implement in their work. This session included conversations on ways to translate workshop learnings into concrete next steps, and generated light ‘commitments’ for the next week, month and year. One fun way to set oneself a yearly reminder is to schedule an eco-friendly e-card to send to yourself in a year!

Post-workshop follow-up

Three months after the workshop had taken place, we conducted a follow-up survey to gauge its impact. The survey included a mix of agree/disagree statements (see chart below) and optional long-form questions to capture more detailed feedback. While we had only a few responses, survey results were constructive and positive. Participants appreciated the practical insights and reported better awareness of sustainable practices in their digital work.

Participants’ agree/disagree ratings for a series of statements about the DHCC workshop’s impact

Judging from responses to the set of statements above, at least several participants have embedded toolkit recommendations, made specific changes in their work, shared knowledge and influenced their wider networks. We got additional details on these actions in responses to the open-ended questions that followed.

What did staff members say?

Here are some comments made in relation to making changes and embedding the DHCC toolkit’s recommendations:

“Changes made to working policy and practice to order vegetarian options as standard for events.”

“I have referenced the toolkit in a chapter submitted for a monograph, in relation to my BL/university research.”

“I have discussed the toolkit's recommendations with colleagues re the projects I am currently working on. We agreed which parts of the projects were most carbon intensive and discussed ways to mitigate that.”

“I recommended a workshop on the toolkit to my [research] funding body.”

“Have engaged more with small impacts - less email traffic, fewer attachments, fewer images.”

A couple of comments were made with regard to challenges or barriers to change making. One was about colleagues being reluctant to decrease flying, or travel in general, as a way to reduce one’s carbon footprint. The second point referred to an uncertainty on how to influence internal discussions on software development infrastructure – highlighting the challenge of finding the right path to the right people.

An interesting comment was made in relation to raising environmental concerns and advocating the Toolkit:

“Shared the toolkit with wider professional network at an event at which environmentally conscious and sustainable practices were raised without prompting. Toolkit was well received with expressions of relief that others are thinking along these lines and taking practical steps to help progress the agenda.”

And finally, an excellent point about the energy-intensive use of ChatGPT (or other LLMs), which was covered at the workshop:

“The thing that has stayed with me is what was said about water consumption needed to cool the supercomputers - how every time you run one of those Chat GPT (or equivalent) queries it is the equivalent of throwing a litre of water out the window, and that Microsoft's water use has gone up 30%. I've now been saying this every time someone tells me to use one of these GPT searches. To be honest it has put me off using them completely.”

In summary

The DHCC workshop at the British Library was a great success, underscoring the importance of sustainability in digital humanities, digital projects and digital working. By leveraging the DHCC toolkit, we have taken important steps toward making our digital practices more environmentally responsible, and spreading the word across internal and external networks. Moving forward, we will continue to build on this momentum, fostering a culture of sustainability and empowering our staff to make informed, climate-friendly decisions.

Thank you to workshop contributors, organisers and helpers:

James Baker, Joely Fake, Maja Maricevic, Catherine Ross, Andy Rackley, Jez Cope, Jenny Basford, Graeme Bentley, Stephen White, Bianca Miranda Cardoso, Sarah Kirk-Browne, Andrea Deri, and Deirdre Sullivan.

 
