Digital scholarship blog

Enabling innovative research with British Library digital collections


23 December 2024

AI (and machine learning, etc) with British Library collections

Machine learning (ML) is a hot topic, especially when it’s hyped as ‘AI’. How might libraries use machine learning / AI to enrich collections, making them more findable and usable in computational research? Digital Curator Mia Ridge lists some examples of external collaborations, internal experiments and staff training with AI / ML and digitised and born-digital collections.

Background

The trust that the public places in libraries is hugely important to us - all our 'AI' should be 'responsible' and ethical AI. The British Library was a partner in Sheffield University's FRAIM: Framing Responsible AI Implementation & Management project (2024). We've also used lessons from the projects described here to draft our AI Strategy and Ethical Guide.

Many of the projects below have contributed to our Digital Scholarship Training Programme and our Reading Group has been discussing deep learning, big data and AI for many years. It's important that libraries are part of conversations about AI, supporting AI and data literacy and helping users understand how ML models and datasets were created.

If you're interested in AI and machine learning in libraries, museums and archives, keep an eye out for news about the AI4LAM community's Fantastic Futures 2025 conference at the British Library, 3-5 December 2025. If you can't wait that long, join us for the 'AI Debates' at the British Library.

Using ML / AI tools to enrich collections

Generative AI tends to get the headlines, but at the time of writing, tools that use non-generative machine learning to automate specific parts of a workflow have more practical applications for cultural heritage collections. That is, 'AI' is currently more process than product.

Text transcription is a foundational task that makes digitised books and manuscripts more accessible to search, analysis and other computational methods. For example, oral history staff have experimented with speech transcription tools, raising important questions and theoretical and practical issues for automatic speech recognition (ASR) tools and chatbots.

We've used Transkribus and eScriptorium to transcribe handwritten and printed text in a range of scripts and alphabets. For example:

Creating tools and demonstrators through external collaborations

Mining the UK Web Archive for Semantic Change Detection (2021)

This project used word vectors with web archives to track words whose meanings changed over time. Resources: DUKweb (Diachronic UK web) and blog post ‘Clouds and blackberries: how web archives can help us to track the changing meaning of words’.
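The core technique can be sketched as comparing a word's vector against its old neighbours across time slices: a drop in cosine similarity signals semantic change. The toy three-dimensional vectors below are purely illustrative, not taken from DUKweb:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two word vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy vectors standing in for embeddings trained on two time slices of
# the web archive; the numbers are illustrative only.
cloud_1996 = (0.9, 0.1, 0.0)   # near 'sky', 'rain' in the old space
cloud_2013 = (0.1, 0.2, 0.95)  # near 'server', 'storage' in the new space
sky    = (1.0, 0.0, 0.0)
server = (0.0, 0.0, 1.0)

# A drop in similarity to a word's old neighbours signals semantic change.
print(cosine_similarity(cloud_1996, sky))     # high: 'cloud' meant weather
print(cosine_similarity(cloud_2013, sky))     # low: the meaning has drifted
print(cosine_similarity(cloud_2013, server))  # high: 'cloud' now means computing
```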

Graphs showing how words associated with the words blackberry, cloud, eta and follow changed over time.
From blackberries to clouds... word associations change over time

Living with Machines (2018-2023)

Our Living With Machines project with The Alan Turing Institute pioneered new AI, data science and ML methods to analyse masses of newspapers, books and maps to understand the impact of the industrial revolution on ordinary people. Resources: short video case studies, our project website, final report and over 100 outputs in the British Library's Research Repository.

Outputs that used AI / machine learning / data science methods such as lexicon expansion, computer vision, classification and word embeddings included:

Tools and demonstrators created via internal pilots and experiments

Many of these examples were enabled by the skills and enthusiasm for ML experiments of our in-house Research Software Engineers and the Living with Machines (LwM) team at the British Library, combined with long-term Library staff's knowledge of collections, records and processes:

British Library resources for re-use in ML / AI

Our Research Repository includes datasets suitable for ground truth training, including 'Ground truth transcriptions of 18th & 19th century English language documents relating to botany from the India Office Records'.

Our ‘1 million images’ on Flickr Commons have inspired many ML experiments, including:

The Library has also shared models and datasets for re-use on the machine learning platform Hugging Face.

12 December 2024

Automating metadata creation: an experiment with Parliamentary 'Road Acts'

This post was originally written by Giorgia Tolfo in early 2023 then lightly edited and posted by Mia Ridge in late 2024. It describes work undertaken in 2019, and provides context for resources we hope to share on the British Library's Research Repository in future.

The Living with Machines project used a range of sources, from newspapers and maps to census data. This post discusses the Road Acts, 18th century Acts of Parliament stored at the British Library, as an example of some of the challenges in digitising historical records, and suggests computational methods for reducing some of the overhead of cataloguing Library records during digitisation.

What did we want to do?

Before collection items can be digitised, they need a preliminary catalogue record - there's no point digitising records without metadata for provenance and discoverability. Like many extensive collections, the Road Acts weren't already catalogued. Creating the necessary catalogue records manually wasn't a viable option for the timeframe and budget of the project, so with the support of British Library experts Jennie Grimshaw and Iris O’Brien, we decided to explore automated methods for extracting metadata from digitised images of the documents themselves. The metadata created could then be mapped to a catalogue schema provided by Jennie and Iris. 

Given the complexity of the task, the project timeframe, and the infrastructure and resources needed, the agency Cogapp was commissioned to do the following:

  • Export metadata for 31 scanned microfilms in a format that matched the required fields in a metadata schema provided by the British Library curators
  • OCR (including normalising the 'long S') to a standard agreed with the Living with Machines project
  • Create a package of files for each Act including: OCR (METS + ALTO) + images (scanned by British Library)
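The 'long S' normalisation mentioned above is simple to sketch: 18th-century type uses 'ſ' (U+017F) wherever modern text would use a non-final 's'. This is a minimal illustration of the idea, not Cogapp's actual pipeline:

```python
def normalise_long_s(text: str) -> str:
    """Map the archaic long s (U+017F) to a modern 's'."""
    return text.replace("\u017F", "s")

print(normalise_long_s("Anno Regni ſeptimo Georgii III. Regis."))
# -> Anno Regni septimo Georgii III. Regis.
```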

To this end, we provided Cogapp with:

  • Scanned images of the 31 microfilm reels, named using the microfilm ID and the numerical sequential order of the frame
  • The Library's metadata requirements
  • Curators' support to explain and guide them through the metadata extraction and record creation process 

Once all of this was put in place, the process started. However, this is where we encountered the main problem.

First issue: the typeface

After some research and tests we came to the conclusion that the typeface (or font, shown in Figure 1) is probably English Blackletter. However, at the time, OCR software - software that uses 'optical character recognition' to transcribe text from digitised images, like ABBYY, Tesseract or Transkribus - couldn't accurately read this font. Running OCR using a generic tool would inevitably lead to poor, if not unusable, results. You can create 'models' for unrecognised fonts by manually transcribing a set of documents, but this can be time-consuming.

Image of a historical document
Figure 1: Page showing typefaces and layout. SPRMicP14_12_016

Second issue: the marginalia

As you can see in Figure 2, each Act has marginalia - additional text in the margins of the page. 

This makes the task of recognising the layout of information on the page more difficult. At the time, most OCR software wasn't able to detect marginalia as separate blocks of text. As a consequence these portions of text are often rendered inline, merged with the main text. Some examples showing how OCR software using standard settings interpreted the page in Figure 2 are shown below.

Black and white image of printed page with comments in the margins
Figure 2 Printed page with marginalia. SPRMicP14_12_324

 

OCR generated by ABBYY FineReader:

Qualisicatiori 6s Truitees;

Penalty on acting if not quaiified.

Anno Regni septimo Georgii III. Regis.

9nS be it further enaften, Chat no person ihali he tapable of aftingt ao Crustee in the Crecution of this 9ft, unless be ftall he, in his oton Eight, oj in the Eight of his ©Btfe, in the aftual PofTefli'on anb jogment oj Eeceipt of the Eents ana profits of tanas, Cenements, anb 5)erebitaments, of the clear pearlg Oalue of J?iffp Pounbs} o? (hall be ©eit apparent of some person hatiing such estate of the clear gcatlg 5ia= lue of ©ne hunb?eb Pounbs; o? poffcsseb of, o? intitieb unto, a personal estate to the amount o? Oalue of ©ne thoufanb Pounbs: 9nb if ang Person hcrebg beemeo incapable to aft, ihali presume to aft, etierg such Per* son (hall, so? etierg such ©ffcnce, fojfcit anb pag the @um of jTiftg pounbs to ang person o? 

 

OCR generated by the open source tool Tesseract:

586 Anno Regni ?eptimo Georgi III. Regis.

Qualification

of Truttees;

Penalty on

Gnd be it further enated, That no. Per?on ?hall bÈ

capable of ating as Tru?tËe in the Crecution of thig

A, unle?s he ?hall be, in his own Right, 02 in the

Right of his Wife, in the a‰ual Pofe??ion and En. |

joyment 02 Receipt of the Rents and P2zofits of Lands,

Tenements, and hereditaments, of the clear pearly

Ualue of Fifty Pounds z o? hall be Deir Apparent of

?ome Per?on having ?uch Cfitate of the clear yearly Uga-

lue of Dne Hundred Pounds ; 02 po??e?leD of, 02 intitled

unto, a Per?onal E?tate to the Amount 02 Ualue of One

thou?and Pounds : And if any Per?on hereby deemed

acting if not incapable to ai, ?hall p2e?ume to ait, every ?uch Perz

qualified.

 

OCR generated by Cogapp (without any enhancement):

of Trusteesi

586

Anno Regni ſeptimo Georgii III. Regis.

Qualihcation and be it further enałted, That no perſon thall be

capable of aging as Trulltee in the Erecution of this

ad, unlefs he thall be, in his own Right, of in the

Right of his Wife, in the ađual Polellion and En:

joyment or Receipt of the Rents and Profits of Lands,

Tenements, and hereditaments, of the clear pearly

Ualue of ffifty pounds : oi thall be peir apparent of

ſome Perſon having ſuch Etate of the clear yearly Ua:

lue of Dne hundred Pounds; ou podeled of, od intitled

unto, a Perſonal Elate to the amount ou Ualue of Dne

Penalty on thouſand Pounds : and if any perſon hereby deemed

acting if not incapable to ad, thall preſume to ađ, every ſuch Per-

Qualified.

 

As you can see, the OCR transcription results were too poor to use in our research.

Changing our focus: experimenting with metadata creation

Time was running out fast, so we decided to adjust our expectations about text transcription, and asked Cogapp to focus on generating metadata for the digitised Acts. They have reported on their process in a post called 'When AI is not enough' (which might give you a sense of the challenges!).

Since the title page of each Act has a relatively standard layout it was possible to train a machine learning model to recognise the title, year and place of publication, imprint etc. and produce metadata that could be converted into catalogue records. These were sent on to British Library experts for evaluation and quality control, and potential future ingest into our catalogues.
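The intuition behind extracting fields from a relatively standard title-page layout can be illustrated with a simple rule-based sketch. The sample text, field names and patterns below are hypothetical; the actual work used a trained machine learning model rather than regular expressions:

```python
import re

# Hypothetical OCR'd title-page text, for illustration only.
title_page = """An Act for repairing the Road from Lemsford Mill
to Welwyn in the County of Hertford.
Anno Regni Regis Georgii III. septimo.
London: Printed by Mark Baskett, 1767."""

record = {}

# Title: the opening sentence, up to the first full stop.
m = re.search(r"^(An Act[^.]*)\.", title_page)
if m:
    record["title"] = " ".join(m.group(1).split())

# Year: the first plausible 16th-19th century date on the page.
m = re.search(r"\b(1[5-8]\d{2})\b", title_page)
if m:
    record["year"] = m.group(1)

# Place of publication: a line beginning 'Place: Printed ...'.
m = re.search(r"^(\w+):\s*Printed", title_page, re.M)
if m:
    record["place_of_publication"] = m.group(1)

print(record)
```

Even this toy version shows why human evaluation and quality control remain essential: a single OCR error in the imprint line would silently drop a field.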

Conclusion

This experience, although only partly successful in creating fully transcribed pages, explored the potential of producing the basis of catalogue records computationally, and was also an opportunity to test workflows for automated metadata extraction from historical sources. 

Since this work was put on hold in 2019, advances in OCR features built into generative AI chatbots offered by major companies mean that a future project could probably produce good quality transcriptions and better structured data from our digitised images.

If you have suggestions or want to get in touch about the dataset, please email [email protected]

26 November 2024

Working Together: The UV Community Sprint Experience

How do you collaborate on a piece of software with a community of users and developers distributed around the world? Lanie and Saira from the British Library’s Universal Viewer team share their recent experience with a ‘community sprint’... 

Back in July, digital agency Cogapp tested the current version of the Universal Viewer (UV) against Web Content Accessibility Guidelines (WCAG) 2.2 and came up with a list of suggestions to enhance compliance.  

As accessibility is a top priority, the UV Steering Group decided to host a community sprint - an event focused on tackling these suggestions while boosting engagement and fostering collaboration. Sprints are typically internal, but the community sprint was open to anyone from the broader open-source community.

Zoom call showing participants
18 participants from 6 organisations teamed up to make the Universal Viewer more accessible - true collaboration in action!

The sprint took place for two weeks in October. Everyone brought unique skills and perspectives, making it a true community effort.

Software engineers worked on development tasks, such as improving screen reader compatibility, fixing keyboard navigation problems, and enhancing element visibility. Testing engineers ensured functionality, and non-technical participants assisted with planning, translations and management.

The group had different levels of experience, which made it important to provide a supportive environment for learning and collaboration.  

The project board at the end of the Sprint - not every issue was finished, but the sprint was still a success with over 30 issues completed in two weeks.

Some of those involved shared their thoughts on the sprint: 

Bruce Herman - Development Team Lead, British Library: 'It was a great opportunity to collaborate with other development teams in the BL and the UV Community.'

Demian Katz - Director of Library Technology, Villanova University: 'As a long-time member of the Universal Viewer community, it was really exciting to see so many new people working together effectively to improve the project.'

Sara Weale - Head of Web Design & Development, Llyfrgell Genedlaethol Cymru - National Library of Wales: 'Taking part in this accessibility sprint was an exciting and rewarding experience. As Scrum Master, I had the privilege of facilitating the inception, daily stand-ups, and retrospective sessions, helping to keep the team focused and collaborative throughout. It was fantastic to see web developers from the National Library of Wales working alongside the British Library, Falvey Library (Villanova University), and other members of the Universal Viewer Steering Group.

This sprint marked the first time an international, cross-community team came together in this way, and the sense of shared purpose and camaraderie was truly inspiring. Some of the key lessons I took away from the sprint were the need for more precise task estimation and the value of longer sprints to allow for deeper problem-solving. Despite these challenges, the fortnight was defined by excellent communication and a strong collective commitment to addressing accessibility issues.

Seeing the team come together so quickly and effectively highlighted the power of collaboration to drive meaningful progress, ultimately enhancing the Universal Viewer for a more inclusive future.'

BL Test Engineers: 

Damian Burke: 'Having worked on UV for a number of years, this was my first community sprint. What stood out for me was the level of collaboration and goodwill from everyone on the team. How quickly we formed into a working agile team was impressive. From a UV tester's perspective, I learned a lot from using new tools like Vercel and exploring GitHub's advanced functionality.'

Alex Rostron: 'It was nice to collaborate and work with skilled people from all around the world to get a good number of tickets over the line.'

Danny Taylor: 'I think what I liked most was how organised the sprints were. It was great to be involved in my first BL retrospective.'

Miro board with answers to the question 'what went well during this sprint?'

 

Positive reactions to 'how I feel after the sprint'
A Miro board was used for Sprint planning and the retrospective – a review meeting after the Sprint where we determined what went well and what we would improve for next time.

Experience from the sprint helped us to organise a further sprint within the UV Steering Group for admin-related work, aimed at improving documentation to ensure clearer processes and better support for contributors. Looking ahead, we're planning to release UV 4.1.0 in the new year, incorporating the enhancements we've made - we’ll share another update when the release candidate is ready for review.

Building on the success of the community sprint, we're excited to make these collaborative efforts a key part of our strategic roadmap. Join us and help shape the future of UV!

22 November 2024

Collaborating to improve usability on the Universal Viewer project

Open source software is a valuable alternative to commercial software, but its decentralised nature often leads to less than polished user interfaces. This has also been the case for the Universal Viewer (UV), despite attempts over the years to improve the user experience (UX) for viewing digital collections. Improving the usability of the UV is just one of the challenges that the British Library's UV team have taken on. We've even recruited an expert volunteer to help!

Digital Curator Mia Ridge talks to UX expert Scott Jenson about his background in user experience design, his interest in working with open source software, and what he's noticed so far about the user experience of the Universal Viewer.

Mia: Hi Scott! Could you tell our readers a little about your background, and how you came to be interested in the UX of open source software?

Scott: I’ve been working in commercial software my entire life (Apple, Google and a few startups) and it became clear over time that the profit motive is often at odds with users’ needs. I’ve been exploring open source as an alternative.

Mia: I noticed your posts on Mastodon about looking for volunteer opportunities as you retired from professional work at just about the time that Erin (Product Owner for the Universal Viewer at the British Library) and I were wondering how we could integrate UX and usability work into the Library's plans for the UV. Have you volunteered before, and do you think it'll become a trend for others wondering how to use their skills after retirement?

Scott: Google has a program where you can leave your position for 3 months and volunteer on a project within Google.org. I worked on a project to help California Forestry analyse and map out the most critical areas in need of treatment. It was a lovely project and felt quite impactful. That project was partly what put me on this path.

Mia: Why did you say 'yes' when I approached you about volunteering some time with us for the UV?

Scott: I lived in London for 4 years working for a mobile OS company called Symbian, so I've spent a lot of time there. While living in London, I even wrote my book in the British Library! So we have a lot in common. It was an intersection of opportunity and history I just couldn't pass up.

Mia: And what were your first impressions of the project? 

Scott: It was an impactful project with a great vision of where it needed to go. I really wanted to get stuck in and help if I could.

Mia: We loved the short videos you made that crystallised the issues that users encounter with the UV but find hard to describe. Could you share one?

Scott: The most important one is something that happens to many projects that evolve over time: a patchwork of metaphors that accrue. In this case the current UV has at least 4 different ways to page through a document, 3 of which are horizontal and 1 vertical. This just creates a mishmash of conflicting visual prompts for users and simplifying that will go a long way to improve usability.

Screenshot of the Viewer with target areas marked up
A screenshot from Scott's video showing multiple navigation areas on the UV

How can you help improve the usability of the Universal Viewer?

We shared Scott's first impressions with the UV Steering Group in September, when he noted that the UV screen had 32 'targets' and 8 areas where functionality had been sprinkled over time, making it hard for users to know where to focus. We'd now like to get wider feedback on future directions.

Scott's made a short video that sets out some of the usability issues in the current layout of the Universal Viewer, and some possible solutions. We think it's a great provocation for discussion by the community! To join in and help with our next steps, you can post on the Universal Viewer Slack (request to join here) or GitHub.

06 November 2024

Recovered Pages: Crowdsourcing at the British Library

Digital Curator Mia Ridge writes...

While the British Library works to recover from the October 2023 cyber-attack, we're putting some information from our currently inaccessible website into an easily readable and shareable format. This blog post is based on a page captured by the Wayback Machine in September 2023.

Crowdsourcing at the British Library

Screenshot of the Zooniverse interface for annotating a historical newspaper article
Example of a crowdsourcing task

For the British Library, crowdsourcing is an engaging form of online volunteering, supported by digital tools that manage tasks such as transcription, classification and geolocation, making our collections more discoverable.

The British Library has run several popular crowdsourcing projects in the past, including the Georeferencer, for geolocating historical maps, and In the Spotlight, for transcribing important information about historical playbills. We also integrated crowdsourcing activities into our flagship AI / data science project, Living with Machines.


Crowdsourcing Projects at the British Library

  • Living with Machines (2019-2023) created innovative crowdsourced tasks, including tasks that asked the public to closely read historical newspaper articles to determine how specific words were used.
  • Agents of Enslavement (2021-2022) used 18th/19th century newspapers to research slavery in Barbados and create a database of enslaved people.
  • In the Spotlight (2017-2021) was a crowdsourcing project from the British Library that aimed to make digitised historical playbills more discoverable, while also encouraging people to closely engage with this otherwise less accessible collection of ephemera.
  • Canadian wildlife: notes from the field (2021), a project where volunteers transcribed handwritten field notes that accompany recordings of a wildlife collection within the sound archive.
  • Convert a Card (2015) was a series of crowdsourcing projects aimed at converting scanned catalogue cards in Asian and African languages into electronic records. The project template can be found and used on GitHub.
  • Georeferencer (2012 - present) enabled volunteers to create geospatial data from digitised versions of print maps by adding control points to the old and modern maps.
  • Pin-a-Tale (2012) asked people to map literary texts to British places.

 

Research Projects

The Living with Machines project included a large component of crowdsourcing research through practice, led by Digital Curator Mia Ridge.

Mia was also the Principal Investigator on the AHRC-funded Collective Wisdom project, which worked with a large group of co-authors to produce a book, The Collective Wisdom Handbook: perspectives on crowdsourcing in cultural heritage, through two 'book sprints' in 2021:

This book is written for crowdsourcing practitioners who work in cultural institutions, as well as those who wish to gain experience with crowdsourcing. It provides both practical tips, grounded in lessons often learned the hard way, and inspiration from research across a range of disciplines. Case studies and perspectives based on our experience are woven throughout the book, complemented by information drawn from research literature and practice within the field.

More Information

Our crowdsourcing projects were designed to produce data that can be used in discovery systems (such as online catalogues and our item viewer) through enjoyable tasks that give volunteers an opportunity to explore digitised collections.

Each project involves teams across the Library to supply digitised images for crowdsourcing and ensure that the results are processed and ingested into various systems. Enhancing metadata through crowdsourcing is considered in the British Library's Collection Metadata Strategy.

We previously posted on Twitter @LibCrowds and currently post occasionally on Mastodon https://glammr.us/@libcrowds and via our newsletter.

Past editions of our newsletter are available online.

31 October 2024

Welcome to the British Library’s new Digital Curator OCR/HTR!

Hello everyone! I am Dr Valentina Vavassori, the new Digital Curator for Optical Character Recognition/Handwritten Text Recognition at the British Library.

I am part of the Heritage Made Digital Team, which is responsible for developing and overseeing the digitisation workflow at the Library. I am also an unofficial member of the Digital Research Team, where I promote the reuse and access to the Library’s collections.

My role has both an operational component (integrating and developing OCR and HTR in the digitisation workflow) and a research and engagement component (supporting OCR/HTR projects in the Library). I really enjoy these two sides of my role, as I have a background as a researcher and as a cultural heritage professional.

I joined the British Library from The National Archives, London, where I worked as a Digital Scholarship Researcher in the Digital Research Team. I worked on projects involving data visualisation, OCR/HTR, data modelling, and user experience.

Before that, I completed a PhD in Digital Humanities at King’s College London, focusing on chatbots and augmented reality in museums and their impact on users and museum narratives. Part of my thesis explored how to use these narratives using spatial humanities methods such as GIS. During my PhD, I also collaborated on various digital research projects with institutions like The National Gallery, London, and the Museum of London.

However, I originally trained as an art historian. I studied art history in Italy and worked for a few years in museums. During that time, I realised the potential of developing digital experiences for visitors and the significant impact digitisation can have on research and enjoyment in cultural heritage. I was so interested in the opportunities that I co-founded a start-up which developed a heritage geolocation app for tourists.

Joining the Library has been an amazing opportunity. I am really looking forward to learning from my colleagues and exploring all the potential collaborations within and outside the Library.

24 October 2024

Southeast Asian Language and Script Conversion Using Aksharamukha

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected]. 

 

The British Library’s vast Southeast Asian collection includes manuscripts, periodicals and printed books in the languages of the countries of maritime Southeast Asia, including Indonesia, Malaysia, Singapore, Brunei, the Philippines and East Timor, as well as on the mainland, from Thailand, Laos, Cambodia, Myanmar (Burma) and Vietnam.

The display of literary manuscripts from Southeast Asia outside of the Asian and African Studies Reading Room in St Pancras (photo by Adi Keinan-Schoonbaert)

 

Several languages and scripts from the mainland were the focus of recent development work commissioned by the Library and done on the script conversion platform Aksharamukha. These include Shan, Khmer, Khuen, and northern Thai and Lao Dhamma (Dhamma, or Tham, meaning ‘scripture’, is the script that several languages are written in).

These and other Southeast Asian languages and scripts pose multiple challenges to us and our users. Collection items in languages using non-romanised scripts are mainly catalogued (and therefore searched by users) using romanised text. For some language groups, users need to search the catalogue by typing in the transliteration of title and/or author using the Library of Congress (LoC) romanisation rules.

Items’ metadata text converted using the LoC romanisation scheme is often unintuitive, and therefore poses a barrier for users, hindering discovery and access to our collections via the online catalogues. In addition, curatorial and acquisition staff spend a significant amount of time manually converting scripts, a slow process which is prone to errors. Other libraries worldwide holding Southeast Asian collections and using the LoC romanisation scheme face the same issues.

Excerpt from the Library of Congress romanisation scheme for Khmer

 

Having faced these issues with Burmese language, last year we commissioned development work on the open-access platform Aksharamukha, which enables conversion between various scripts, supporting 121 scripts and 21 romanisation methods. Vinodh Rajan, Aksharamukha's developer, perfectly combines knowledge of languages and writing systems with computer science and coding skills. He added the LoC romanisation system to the platform's Burmese script transliteration functionality (read about this in my previous post).
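At its core, rule-based transliteration of this kind is table-driven: each source character (or sequence) maps to a romanised equivalent. The sketch below covers only Burmese digits, a deliberately tiny subset; the real LoC schemes also handle consonant stacks, vowel signs and diacritics, and this is not Aksharamukha's actual code:

```python
# Burmese digits occupy the Unicode block U+1040 (zero) to U+1049 (nine).
BURMESE_DIGITS = {chr(0x1040 + i): str(i) for i in range(10)}

def romanise_digits(text: str) -> str:
    """Table-driven conversion; characters without a rule pass through."""
    return "".join(BURMESE_DIGITS.get(ch, ch) for ch in text)

print(romanise_digits("၁၉၂၃"))  # -> 1923
```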

The results were outstanding – readers could copy and paste transliterated text into the Library's catalogue search box to check if we have items of interest. This has also greatly enhanced cataloguing and acquisition processes by enabling the creation of acquisition records and minimal records. In addition, our Metadata team updated all of our Burmese catalogue records (ca. 20,000) to include Burmese script, alongside transliteration (side note: these updated records are still unavailable to our readers due to the cyber-attack on the Library last year, but they will become accessible in the future).

The time was ripe to expand our collaboration with Vinodh and Aksharamukha. Maria Kekki, Curator for Burmese Collections, has been hosting a Chevening Fellow from Myanmar, Myo Thant Linn, this past year. Myo was tasked with cataloguing manuscripts and printed books in Shan and Khuen – but found the romanisation aspect of this work very challenging to do manually. In order to facilitate Myo's work and maximise the benefit of his fellowship, we needed a LoC romanisation functionality to be available. Aksharamukha was the right place for this – this free, open source, online tool is available to our curators, cataloguers, acquisition staff, and metadata team to use.

Former Chevening Fellow Myo Thant Linn reciting from a Shan manuscript in the Asian and African Studies Reading Room, September 2024 (photo by Jana Igunma)

 

In addition to Maria and Myo's requirements, Jana Igunma, Ginsburg Curator for Thai, Lao and Cambodian Collections, noted that adding Khmer to Aksharamukha would be immensely helpful for cataloguing our Khmer backlog and would assist with new acquisitions. Northern Thai and Lao Dhamma scripts would be most useful for cataloguing new print acquisitions and adding original scripts to manuscript records. Automating LoC transliteration could be very cost-effective, saving many hours for the cataloguing, acquisitions and metadata teams. Khmer is a great example – it has the most extensive alphabet in the world (74 letters), and its romanisation is extremely complicated and time consuming!

First three leaves with text in a long format palm leaf bundle (សាស្ត្រាស្លឹករឹត/sāstrā slẏk rẏt) containing part of the Buddhist cosmology (សាស្ត្រាត្រៃភូមិ/Sāstrā Traibhūmi) in Khmer script, 18th or 19th century. Acquired by the British Museum from Edwards Goutier, Paris, on 6 December 1895. British Library, Or 5003, ff. 9-11

We therefore needed to enhance Aksharamukha's script conversion functionality with these additional scripts. In principle, this could be done by referring to the existing LoC conversion tables, while taking into account any permutations of diacritics or character variations. In practice, it has not been nearly as simple as that!

For example, the presence of diacritics prompted a discussion between internal and external colleagues about the use of precomposed versus decomposed Unicode formats when romanising original script. LoC systems use two encoding schemata, MARC 21 and MARC-8. The former allows precomposed diacritic characters; the latter does not, requiring the decomposed format (a base letter followed by combining marks). To support both schemata, Vinodh included both MARC-8 and MARC 21 as input and output formats in the conversion functionality.
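The distinction is easy to demonstrate with Python's standard `unicodedata` module. This is a general illustration of Unicode normalisation, not Aksharamukha's or LoC's actual code: a romanised character such as ṅ can be stored as a single precomposed code point (as MARC 21 allows) or decomposed into a base letter plus a combining mark (as MARC-8 requires), and the two forms round-trip through normalisation.

```python
import unicodedata

# 'ṅ' (LATIN SMALL LETTER N WITH DOT ABOVE) appears in LoC romanisations.
precomposed = "\u1E45"                                  # one code point: ṅ
decomposed = unicodedata.normalize("NFD", precomposed)  # 'n' + COMBINING DOT ABOVE

print(len(precomposed))  # 1 code point (precomposed, as in MARC 21)
print(len(decomposed))   # 2 code points (decomposed, as in MARC-8)

# Normalising back to NFC recovers the precomposed character exactly.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

The two strings look identical on screen but are different byte sequences, which is exactly why a converter has to offer both formats explicitly.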

Another component, implemented for Burmese in the previous development round but also needed for the Khmer and Shan transliterations, is word spacing. Vinodh implemented word separation in this round as well, although its output will always remain something the cataloguer needs to check and adjust. Note that it is not enabled by default – you have to select it (under 'input' – see image below).
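Whatever algorithm a given tool uses, a toy greedy longest-match segmenter makes it easy to see why automatic word separation always needs human checking. This is purely an illustration: the English lexicon below stands in for a Khmer or Shan dictionary, and nothing here reflects Aksharamukha's actual implementation.

```python
# Illustrative greedy longest-match segmenter; the lexicon and the
# Latin-script stand-in text are invented for this example.
LEXICON = {"the", "cat", "sat", "on", "mat", "them"}
MAX_LEN = max(len(w) for w in LEXICON)

def segment(text: str) -> list[str]:
    """Split an unspaced string by repeatedly taking the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in LEXICON:
                words.append(text[i:j])
                i = j
                break
        else:  # no dictionary word matches: emit one character and move on
            words.append(text[i])
            i += 1
    return words

print(segment("thecatsatonthemat"))
# → ['the', 'cat', 'sat', 'on', 'them', 'a', 't']
```

Note the greedy error: the matcher prefers 'them' over 'the' + 'mat', leaving the orphans 'a' and 't'. This is exactly the kind of mistake a cataloguer has to catch and correct.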

Screenshot from Aksharamukha, showcasing Khmer word segmentation option

It is heartening to see that enhancing Aksharamukha is making a difference. Internally, Myo was a keen user of the Shan romanisation functionality during his fellowship (Khuen romanisation is still a work in progress), and Jana has been using the Khmer transliteration too. Jana found it particularly useful to upload a photo of a title page, which Aksharamukha then automatically OCRs and romanises – saving precious time otherwise spent typing Khmer!

It should be mentioned that, when cataloguing Khmer-language books at the British Library, both original Khmer script and romanised metadata are included in catalogue records. Aksharamukha speeds up cataloguing and eliminates typing errors. However, capitalisation and, in some instances, word separation and final consonants need to be adjusted manually, so the cataloguer still needs a good knowledge of the language.

On the left: photo of a title page of a Khmer language textbook for Grade 9, recently acquired by the British Library; on the right: conversion of original Khmer text from the title page into LoC romanisation standard using Aksharamukha

The conversion tool for Tham (Lanna) and Tham (Lao) works best for texts in Pali language, according to its LoC romanisation table. If Aksharamukha is used for works in northern Thai language in Tham (Lanna) script, or Lao language in Tham (Lao) script, cataloguer intervention is always required as there is no LoC romanisation standard for northern Thai and Lao languages in Tham scripts. Such publications are rare, and an interim solution that has been adopted by various libraries is to convert Tham scripts to modern Thai or Lao scripts, and then to romanise them according to the LoC romanisation standards for these languages.

Other libraries are also enjoying the benefits of these developments. Conversations with colleagues at the Library of Congress revealed that the commissioned developments to Aksharamukha, past and present, have had a positive impact on their operations. LoC has been developing a transliteration tool called ScriptShifter, which can convert over ninety non-Latin scripts into Latin script following the LoC/ALA guidelines, and Aksharamukha's Burmese and Khmer functionalities are already integrated into it. The British Library's funding of LoC romanisation for several Southeast Asian languages and scripts in Aksharamukha has already proved useful well beyond our own walls!

If you have feedback or encounter any bugs, please feel free to raise an issue on GitHub. And, if you’re interested in other scripts romanised using LoC schemas, Aksharamukha has a complete list of the ones that it supports. Happy conversions!


16 September 2024

memoQfest 2024: A Journey of Innovation and Connection

Attending memoQfest 2024 as a translator was an enriching and insightful experience. Held from 13 to 14 June in Budapest, Hungary, the event stood out as a hub for language professionals and translation technology enthusiasts. 

Streetview 1 of Budapest, near the venue for memoQfest 2024. Captured by the author

Streetview 2 of Budapest, near the venue for memoQfest 2024. Captured by the author

A Well-Structured Agenda 

The conference had a well-structured agenda with over 50 speakers, including two keynote speakers who brought valuable insights into the world of translation.

Jay Marciano, President of the Association for Machine Translation in the Americas (AMTA), delivered his highly anticipated presentation on understanding generative AI and large language models (LLMs). While he acknowledged their significant potential, Marciano expressed only cautious optimism about their future in the industry, stressing the need for a deeper understanding of their limitations. As he laid out, machines can translate faster, but the quality of their output depends greatly on the quality of the training data, especially in particular domains or for specific clients. He believes that translators' roles will evolve: they will become more involved in curating data to improve the quality of machine output than in translation itself.

Dr Mike Dillinger, the former Technical Lead for Knowledge Graphs in the AI Division at LinkedIn, and now a technical advisor and consultant, also delved into the challenges and opportunities presented by AI-generated content in his keynote speech, The Next 'New Normal' for Language Services.  Dillinger holds a nuanced perspective on the intersection of AI, machine translation (MT), and knowledge graphs. As he explained, knowledge graphs can be designed to integrate, organize, and provide context for large volumes of data. They are particularly valuable because they go beyond simple data storage, embedding rich relationships and context. They can therefore make it easier for AI systems to process complex information, enhancing tasks like natural language processing, recommendation engines, and semantic search.  

Dillinger therefore advocated for the integration of knowledge graphs with AI, arguing that high-quality, context-rich data is crucial for improving the reliability and effectiveness of AI systems. Knowledge graphs can significantly enhance the capabilities of LLMs by grounding language in concrete concepts and real-world knowledge, thereby addressing some of the current limitations of AI and LLMs. He concluded that, while LLMs have made significant strides, they often lack true understanding of the text and context. 


Enhancing Translation Technology for BLQFP 

The event also offered hands-on demonstrations of memoQ's latest features and updates such as significant improvements to the In-country Review tool (ICR), a new filter for Markdown files, and enhanced spellcheck.  

Interior of the Pesti Vigado, Budapest's second largest concert hall, and venue for the memoQfest Gala dinner

As a participant, I was keen to explore how some of these features could enhance translation processes at the British Library. Could machine translation (MT), for example, be used to translate catalogue records? Over the last twelve years, the translation team of the British Library/Qatar Foundation Partnership (BLQFP) project has built up a massive translation memory (TM) – a bilingual repository of all our previous translations. A machine could be trained on our terminology and style using this TM and our other bilingual resources, such as our vast and growing term base (TB). With appropriate data curation, MT could be a cost-effective and efficient way to maximise our translation operations.
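To give a flavour of how a TM lookup works, here is a minimal sketch in Python using only the standard library. The segments, the `best_match` helper and the 70% threshold are all invented for illustration; real CAT tools such as memoQ use far more sophisticated fuzzy matching than a simple character-level similarity ratio.

```python
from difflib import SequenceMatcher

# Toy translation memory: (source segment, stored translation) pairs.
TM = [
    ("Letter from the Political Agent", "رسالة من المعتمد السياسي"),
    ("Report on pearl fisheries", "تقرير عن مصائد اللؤلؤ"),
]

def best_match(segment: str, threshold: float = 0.7):
    """Return (source, translation, score) for the closest TM entry, or None."""
    scored = [
        (SequenceMatcher(None, segment.lower(), src.lower()).ratio(), src, tgt)
        for src, tgt in TM
    ]
    score, src, tgt = max(scored)
    return (src, tgt, score) if score >= threshold else None

match = best_match("Letter from the Political Agent, Bahrain")
if match:
    src, tgt, score = match
    print(f"Fuzzy match ({score:.0%}): {src!r} -> {tgt!r}")
```

A new segment that closely resembles a stored one is offered to the translator as a "fuzzy match" to post-edit rather than translate from scratch; anything below the threshold returns nothing and is translated afresh. This is also why the TM cleaning described below matters: a noisy repository surfaces noisy matches.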

There are challenges, however. For example, before it can be used to train a machine, our TM would need to be edited and cleaned, removing repetitive and inappropriate content. We would need to choose the most appropriate translations, while maintaining proper alignment between segments. The same applies to our TB, which would need to be curated. Some of these data curation tasks cannot be pursued at this time, as we remain without access to much of our data following the cyberattack incident. Moreover, these careful preparatory steps would not suffice, as any machine output would still need to be post-edited by skilled human translators. As both the conference’s keynote speakers agreed, it is not yet a simple matter of letting the machines do the work. 

This blog post is by Musa Alkhalifa Alsulaiman, Arabic Translator, British Library/Qatar Foundation Partnership.
