Digital scholarship blog

Enabling innovative research with British Library digital collections


11 November 2024

British National Bibliography resumes publication

The British National Bibliography (BNB) has resumed publication, following a period of unavailability due to a cyber-attack in 2023.

Having started in 1950, the BNB predates the founding of the British Library, but despite many changes over the years its purpose remains the same: to record the publishing output of the United Kingdom and the Republic of Ireland. The BNB includes books and periodicals, covering both physical and electronic material. It describes forthcoming items up to sixteen weeks ahead of their publication, making it an essential current awareness tool. To date, the BNB contains almost 5.5 million records.

As our recovery from the cyber-attack continues, our Collection Metadata department have developed a process by which the BNB can be published in formats familiar to its many users. Bibliographic records and summaries will be shared in several ways:

  • The database is searchable on the Share Family initiative's BNB Beta platform at https://bl.natbib-lod.org/ (see example record in the image below)
  • Regular updates in PDF format will be made freely available to all users. Initially this will be on request
  • MARC21 bibliographic records will be supplied directly to commercial customers across the world on a weekly basis
Image comprised of five photographs: a shelf of British National Bibliography volumes, the cover of a printed copy of the BNB, and examples of BNB records, including the very first BNB entry from 1950 (“Male and female”) and the first entry produced through the new process (“Song of the mysteries”)

Other services, such as Z39.50 access and outputs in other formats, are currently unavailable. We are working towards restoring these, and will provide further information in due course.

The BNB is the first national bibliography to be made available on the Share Family initiative's platform. It is published as linked data, and forms part of an international collaboration of libraries to link and enhance discovery across multiple catalogues and bibliographies.

The resumption of the BNB is the result of adaptations built around long-established collaborative working partnerships with Bibliographic Data Services (who provide our CIP records) and the UK Legal Deposit libraries, who contribute to the Shared Cataloguing Programme.

The International Federation of Library Associations describes bibliographies like the BNB as "a permanent record of the cultural and intellectual output of a nation or country, which is witnessed by its publishing output". We are delighted to be able to resume publication of the BNB, especially as we prepare to celebrate its 75th anniversary in 2025.

For further information about the BNB, please contact [email protected].

Mark Ellison, Collection Metadata Services Manager

31 October 2024

Welcome to the British Library’s new Digital Curator OCR/HTR!

Hello everyone! I am Dr Valentina Vavassori, the new Digital Curator for Optical Character Recognition/Handwritten Text Recognition at the British Library.

I am part of the Heritage Made Digital Team, which is responsible for developing and overseeing the digitisation workflow at the Library. I am also an unofficial member of the Digital Research Team, where I promote access to and reuse of the Library’s collections.

My role has both an operational component (integrating and developing OCR and HTR in the digitisation workflow) and a research and engagement component (supporting OCR/HTR projects in the Library). I really enjoy these two sides of my role, as I have a background as a researcher and as a cultural heritage professional.

I joined the British Library from The National Archives, London, where I worked as a Digital Scholarship Researcher in the Digital Research Team. I worked on projects involving data visualisation, OCR/HTR, data modelling, and user experience.

Before that, I completed a PhD in Digital Humanities at King’s College London, focusing on chatbots and augmented reality in museums and their impact on users and museum narratives. Part of my thesis explored these narratives using spatial humanities methods such as GIS. During my PhD, I also collaborated on various digital research projects with institutions like The National Gallery, London, and the Museum of London.

However, I originally trained as an art historian. I studied art history in Italy and worked for a few years in museums. While working there, I realised the potential of developing digital experiences for visitors and the significant impact digitisation can have on research and enjoyment in cultural heritage. I was so interested in the opportunities that I co-founded a start-up which developed a heritage geolocation app for tourists.

Joining the Library has been an amazing opportunity. I am really looking forward to learning from my colleagues and exploring all the potential collaborations within and outside the Library.

24 October 2024

Southeast Asian Language and Script Conversion Using Aksharamukha

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected]. 

 

The British Library’s vast Southeast Asian collection includes manuscripts, periodicals and printed books in the languages of the countries of maritime Southeast Asia, including Indonesia, Malaysia, Singapore, Brunei, the Philippines and East Timor, as well as on the mainland, from Thailand, Laos, Cambodia, Myanmar (Burma) and Vietnam.

The display of literary manuscripts from Southeast Asia outside of the Asian and African Studies Reading Room in St Pancras (photo by Adi Keinan-Schoonbaert)

 

Several languages and scripts from the mainland were the focus of recent development work commissioned by the Library and done on the script conversion platform Aksharamukha. These include Shan, Khmer, Khuen, and northern Thai and Lao Dhamma (Dhamma, or Tham, meaning ‘scripture’, is the script that several languages are written in).

These and other Southeast Asian languages and scripts pose multiple challenges to us and our users. Collection items in languages using non-romanised scripts are mainly catalogued (and therefore searched by users) using romanised text. For some language groups, users need to search the catalogue by typing in the transliteration of title and/or author using the Library of Congress (LoC) romanisation rules.

Items’ metadata text converted using the LoC romanisation scheme is often unintuitive, and therefore poses a barrier for users, hindering discovery and access to our collections via the online catalogues. In addition, curatorial and acquisition staff spend a significant amount of time manually converting scripts, a slow process which is prone to errors. Other libraries worldwide holding Southeast Asian collections and using the LoC romanisation scheme face the same issues.

Excerpt from the Library of Congress romanisation scheme for Khmer

 

Having faced these issues with the Burmese language, last year we commissioned development work on the open-access platform Aksharamukha, which enables conversion between various scripts, supporting 121 scripts and 21 romanisation methods. Vinodh Rajan, Aksharamukha’s developer, perfectly combines knowledge of languages and writing systems with computer science and coding skills. He added the LoC romanisation system to the platform’s Burmese script transliteration functionality (read about this in my previous post).
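For colleagues who prefer to run conversions in bulk rather than through the web interface, Aksharamukha is also available as a Python package. The sketch below is purely illustrative: it uses the package's generic transliteration call with IAST as the target scheme, and the identifier for the LoC romanisation target added in this work would need to be looked up in Aksharamukha's own script list.

```python
# Illustrative sketch of batch transliteration with the Aksharamukha Python
# package (pip install aksharamukha). IAST is used as the target scheme here;
# the identifier for the LoC romanisation target described in this post should
# be checked against Aksharamukha's supported script list.
from aksharamukha import transliterate

burmese_titles = [
    "မြန်မာ",  # placeholder Burmese-script text
]

for title in burmese_titles:
    romanised = transliterate.process("Burmese", "IAST", title)
    print(title, "->", romanised)
```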

The results were outstanding – readers could copy and paste transliterated text into the Library's catalogue search box to check if we have items of interest. This has also greatly enhanced cataloguing and acquisition processes by enabling the creation of acquisition records and minimal records. In addition, our Metadata team updated all of our Burmese catalogue records (ca. 20,000) to include Burmese script, alongside transliteration (side note: these updated records are still unavailable to our readers due to the cyber-attack on the Library last year, but they will become accessible in the future).

The time was ripe to expand our collaboration with Vinodh and Aksharamukha. Maria Kekki, Curator for Burmese Collections, has this past year been hosting a Chevening Fellow from Myanmar, Myo Thant Linn. Myo was tasked with cataloguing manuscripts and printed books in Shan and Khuen – but found the romanisation aspect of this work very challenging to do manually. In order to facilitate Myo’s work and maximise the benefit of his fellowship, we needed LoC romanisation functionality to be available. Aksharamukha was the right place for this – the free, open-source, online tool is available to our curators, cataloguers, acquisition staff, and metadata team to use.

Former Chevening Fellow Myo Thant Linn reciting from a Shan manuscript in the Asian and African Studies Reading Room, September 2024 (photo by Jana Igunma)

 

In addition to Maria and Myo’s requirements, Jana Igunma, Ginsburg Curator for Thai, Lao and Cambodian Collections, noted that adding Khmer to Aksharamukha would be immensely helpful for cataloguing our Khmer backlog and would assist with new acquisitions. Northern Thai and Lao Dhamma scripts would be most useful for cataloguing newly acquired print material and for adding original scripts to manuscript records. Automating LoC transliteration could be very cost-effective, saving many hours for the cataloguing, acquisitions and metadata teams. Khmer is a great example – it has the most extensive alphabet in the world (74 letters), and its romanisation is extremely complicated and time consuming!

First three leaves with text in a long format palm leaf bundle (សាស្ត្រាស្លឹករឹត/sāstrā slẏk rẏt) containing part of the Buddhist cosmology (សាស្ត្រាត្រៃភូមិ/Sāstrā Traibhūmi) in Khmer script, 18th or 19th century. Acquired by the British Museum from Edwards Goutier, Paris, on 6 December 1895. British Library, Or 5003, ff. 9-11

 

The task, therefore, was to enhance Aksharamukha’s script conversion functionality with these additional scripts. In principle, this could be done by referring to existing LoC conversion tables, while taking into account any permutations of diacritics or character variations. In practice, it has not been as simple as that!

For example, the presence of diacritics instigated a discussion between internal and external colleagues on the use of precomposed vs. decomposed formats in Unicode when romanising original scripts. LoC systems use two encoding schemata, MARC 21 and MARC-8. The former allows for precomposed diacritic characters, while the latter only allows for the decomposed format. In order to support both schemata, Vinodh included both MARC-8 and MARC 21 as input and output formats in the conversion functionality.
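The practical difference between the two formats is easy to see with Python's standard unicodedata module; the sketch below is purely illustrative of a precomposed (NFC) versus a decomposed (NFD) encoding of the same romanised character with a diacritic.

```python
# Illustrative sketch: precomposed (NFC) vs. decomposed (NFD) encodings of the
# same romanised character, using only the Python standard library.
import unicodedata

romanised = "ā"  # 'a' with macron, as it might appear in LoC romanisation

nfc = unicodedata.normalize("NFC", romanised)  # single precomposed code point
nfd = unicodedata.normalize("NFD", romanised)  # base letter + combining macron

print([hex(ord(c)) for c in nfc])  # ['0x101']
print([hex(ord(c)) for c in nfd])  # ['0x61', '0x304']
print(nfc == nfd)                  # False at code-point level, identical on screen
```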

Another component, implemented for Burmese in the previous development round, but also needed for Khmer and Shan transliterations, is word spacing. Vinodh implemented word separation in this round as well – although this would always remain something that the cataloguer would need to check and adjust. Note that this is not enabled by default – you would have to select it (under ‘input’ – see image below).

Screenshot from Aksharamukha, showcasing Khmer word segmentation option

 

It is heartening to know that enhancing Aksharamukha has been making a difference. Internally, Myo had been a keen user of the Shan romanisation functionality (though Khuen romanisation is still work-in-progress); and Jana has been using the Khmer transliteration too. Jana found it particularly useful to use Aksharamukha’s option to upload a photo of the title page, which is then automatically OCRed and romanised. This saved precious time otherwise spent on typing Khmer!

It should be mentioned that, when it comes to cataloguing Khmer language books at the British Library, both original Khmer script and romanised metadata are being included in catalogue records. Aksharamukha helps to speed up the process of cataloguing and eliminates typing errors. However, capitalisation and in some instances word separation and final consonants need to be adjusted manually by the cataloguer. Therefore, it is necessary that the cataloguer has a good knowledge of the language.

On the left: photo of a title page of a Khmer language textbook for Grade 9, recently acquired by the British Library; on the right: conversion of original Khmer text from the title page into LoC romanisation standard using Aksharamukha

 

The conversion tool for Tham (Lanna) and Tham (Lao) works best for texts in Pali language, according to its LoC romanisation table. If Aksharamukha is used for works in northern Thai language in Tham (Lanna) script, or Lao language in Tham (Lao) script, cataloguer intervention is always required as there is no LoC romanisation standard for northern Thai and Lao languages in Tham scripts. Such publications are rare, and an interim solution that has been adopted by various libraries is to convert Tham scripts to modern Thai or Lao scripts, and then to romanise them according to the LoC romanisation standards for these languages.

Other libraries have been enjoying the benefits of the new developments to Aksharamukha. Conversations with colleagues from the Library of Congress revealed that present and past commissioned developments on Aksharamukha have had a positive impact on their operations. LoC has been developing a transliteration tool called ScriptShifter. Aksharamukha’s Burmese and Khmer functionalities are already integrated into this tool, which can convert over ninety non-Latin scripts into Latin script following the LoC/ALA guidelines. The British Library’s funding of Aksharamukha development to make several Southeast Asian languages and scripts available in LoC romanisation has already proved useful well beyond the Library!

If you have feedback or encounter any bugs, please feel free to raise an issue on GitHub. And, if you’re interested in other scripts romanised using LoC schemas, Aksharamukha has a complete list of the ones that it supports. Happy conversions!

 

04 July 2024

DHBN 2024 - Digital Humanities in the Nordic and Baltic Countries Conference Report

This is a joint blog post by Helena Byrne, Curator of Web Archives, Harry Lloyd, Research Software Engineer, and Rossitza Atanassova, Digital Curator.

Conference banner showing Icelandic landscape with mountains
This year’s Digital Humanities in the Nordic and Baltic countries conference took place at the University of Iceland School of Education in Reykjavik. It was the eighth conference in a series established in 2016, but the first to be held in Iceland. The theme for the conference was “From Experimentation to Experience: Lessons Learned from the Intersections between Digital Humanities and Cultural Heritage”. There were pre-conference workshops from May 27-29, with the main conference starting on the afternoon of May 29 and finishing on May 31. In her excellent opening keynote Sally Chambers, Head of Research Infrastructure Services at the British Library, discussed the complex research and innovation data space for cultural heritage. Three British Library colleagues report highlights of their conference experience in this blog post.

Helena Byrne, Curator of Web Archives, Contemporary British & Irish Publications.

I presented in the Born Digital session held on May 28. There were four presentations in this session: three related to web archiving and one to Twitter (X) data. I co-presented ‘Understanding the Challenges for the Use of Web Archives in Academic Research’. This presentation examined the challenges for the use of web archives in academic research through a synthesis of the findings from two research studies published through the WARCnet research network. There was lots of discussion after the presentation on how web archives could be used as a research data management tool to help manage online citations in academic publications.

Helena presenting to an audience during the conference session on born-digital archives
Helena presenting in the born-digital archives session

The conference programme was very strong and there were many takeaways that relate to my role. One strong theme was ‘collections as data’. At the UK Web Archive we have just started to publish some of our inactive curated collections as data, so these discussions were very useful. One highlight was the panel ‘Publication and reuse of digital collections: A GLAM Labs approach’. What stood out for me in this session was the checklist for publishing collections as data. It was very reassuring to see that we had pretty much everything covered for the release of the UK Web Archive datasets.

Rossitza and I were kindly offered a tour of the National and University Library of Iceland by Kristinn Sigurðsson, Head of Digital Projects and Development. We enjoyed meeting curatorial staff from the Special Collections who showed us some of the historical maps of Iceland that have been digitised. We also visited the digitisation studio to see how they process periodicals, and spoke to staff involved with web archiving. Thank you to Kristinn and his colleagues for this opportunity to learn about the library’s collections and digital services.

Rossitza and Helena standing by the moat outside the National Library of Iceland building
Rossitza and Helena outside the National and University Library of Iceland

 

Inscription in Icelandic reading National and University Library of Iceland outside the Library building
The National and University Library of Iceland

Harry Lloyd, Research Software Engineer, Digital Research.

DHNB2024 was a rich conference from my perspective as a research software engineer. Sally Chambers’ opening keynote on Wednesday afternoon demonstrated an extraordinary grasp of the landscape of digital cultural heritage across the EU. By this point there had already been a day and a half of workshops, including a session Rossitza and I presented on Catalogues as Data.

I spent the first half using a Jupyter notebook to explain how we extracted entries from an OCR’d version of the catalogue of the British Library’s collection of 15th century books. We used an explainable algorithm rather than a ‘black-box’ machine learning one, so we walked through the steps involved and discussed where it worked well and where it could be improved. You can follow along by clicking the ‘launch notebook’ button in the ReadMe here.

Harry pointing to an image from the catalogue of printed books on a screen for the workshop audience
Harry explaining text recognition results during the workshop

Handing over to Rossitza in the second half to discuss her corpus linguistic analysis worked really well by giving attendees a feel for the complete workflow. This really showed in some great conversations we had with attendees over the following days about tricky problems like where to store the ‘true’ results of OCR. 

A few highlights from the rest of the conference were Clelia LaMonica’s work using a Latin large language model to analyse kinship in texts from Medieval Burgundy. Large language models trained on historic texts are important as the majority are trained on modern material and struggle with historical language. Jørgen Burchardt presented some refreshingly quantitative work on bias across a digitised newspaper collection, very reminiscent of work by Kaspar Beelen. Overall it was a productive few days, and I very much enjoyed my time in Reykjavik.

Rossitza Atanassova, Digital Curator, Digital Research.

This was my second DHNB conference and I was looking forward to reconnecting with the community of researchers and cultural heritage practitioners, some of whom I had met at DHNB2019 in Copenhagen. Apart from the informal discussions with attendees, I contributed to DHNB2024 in two main ways.

As already mentioned, Harry and I delivered a pre-conference workshop showcasing some processes and methodology we use for working with printed catalogues as data. In the session we used the corpus tool AntConc to perform computational analysis of the descriptions for the British Library’s collection of books published in the 15th century. You can find out more about the project here and reuse the workshop materials published on Zenodo here.

I also joined the pre-conference meeting of the international GLAM Labs Community held at the National and University Library of Iceland. This was the first in-person meeting of the community in five years and was a productive session during which we brainstormed ‘100 ideas for the GLAM Labs Community’. Afterwards we had a sneak peek of the archive of the National Theatre of Iceland, which is being catalogued and digitised.

The main hall of the Library, with a chessboard on a table with two chairs, a statue of a man holding spectacles, and a stained glass screen.
The main hall of the Library.

The DHNB community is so welcoming and supportive, and attracts many early career digital humanists. I was particularly interested to hear from doctoral students researching the use of AI with digitised archives, and using NLP methods with historical collections. One of the projects that stood out for me was Johannes Widegren’s PhD research into the ethical use of AI to enable access and discovery of Sami cultural heritage, and to develop library and archival practice. 

I was also interested in presentations that discussed workflows for creating Named Entity Recognition resources for historical archives, and I plan to try out the open-source Label Studio tool that I learned about. And of course, the poster session is always a highlight and I enjoyed finding out about a range of projects, including computational analysis of Scandinavian runic texts, digital reconstruction of Gothenburg’s 1923 Jubilee exhibition, and training large language models to track semantic variation in climate change vocabulary in Danish news articles.

A line up of people standing in front of a screen advertising the venue for DHNB25 in Estonia
The poster presentations session chaired by Olga Holownia

We are grateful to all DHNB2024 organisers for the warm welcome and a great conference experience, with special thanks to the inspirational and indefatigable Olga Holownia.

26 June 2024

Join the British Library as a Digital Curator, OCR/HTR

This is an updated version of an earlier blog post by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections. She shares some background information on how a newly advertised post of Digital Curator for OCR/HTR will help the Library streamline post-digitisation work to make its collections even more accessible to users. Our previous run of this recruitment was curtailed due to the cyber-attack on the Library – but we are now ready to restart the process!

 

We’ve been digitising our collections for about three decades, opening up access to incredibly diverse and rich collections, for our users to study and enjoy. However, it is important that we further support discovery and digital research by unlocking the huge potential in automatically transcribing our collections.

We’ve done some work over the years towards making our collection items available in machine-readable format, in order to enable full-text search and analysis. Optical Character Recognition (OCR) technology has been around for a while, and there have been several large-scale projects that produced OCRed text alongside digitised images – such as the Microsoft Books project. Until recently, Western-language print collections have been the main focus, especially newspaper collections. A flagship collaboration with the Alan Turing Institute, the Living with Machines project, applied OCR technology to UK newspapers, designing and implementing new methods in data science and artificial intelligence, and analysing these materials at scale.

OCR of Bengali books using Transkribus, Two Centuries of Indian Print Project

Machine Learning technologies have been dealing increasingly well with both modern and historical collections, whether printed, typewritten or handwritten. Taking a broader perspective on Library collections, we have been exploring opportunities with non-Western collections too. Library staff have been engaging closely with the exploration of OCR and Handwritten Text Recognition (HTR) systems for English, Bangla, Arabic, Urdu and Chinese. Digital Curators Tom Derrick, Nora McGregor and Adi Keinan-Schoonbaert teamed up with PRImA Research Lab and the Alan Turing Institute to run four competitions between 2017 and 2019, inviting providers of text recognition methods to try them out on our historical material.

We have been working with Transkribus as well – for example, Alex Hailey, Curator for Modern Archives and Manuscripts, used the software to automatically transcribe 19th century botanical records from the India Office Records. A digital humanities work strand led by former colleague Tom Derrick saw the OCR of most of our digitised collection of Bengali printed texts, digitised as part of the Two Centuries of Indian Print project. More recently Transkribus has been used to extract text from catalogue cards in a project called Convert-a-Card, as well as from Incunabula print catalogues.

An example of a catalogue card in Transkribus, showing segmentation and transcription

We've also collaborated with Colin Brisson from the READ_Chinese project on Chinese HTR, working with eScriptorium to enhance binarisation, segmentation and transcription models using manuscripts that were digitised as part of the International Dunhuang Programme. You can read more about this work in this brilliant blog post by Peter Smith, who did a PhD placement with us last year.

The British Library is now looking for someone to join us to further improve the access and usability of our digital collections, by integrating a standardised OCR and HTR production process into our existing workflows, in line with industry best practice.

For more information and to apply please visit the ad for Digital Curator for OCR/HTR on the British Library recruitment site. Applications close on Sunday 21 July 2024. Please pay close attention to questions asked in the application process. Any questions? Drop us a line at [email protected].

Good luck!

05 April 2024

Curious about using 'public domain' British Library Flickr images?

We regularly get questions from people who want to re-use the images we've published on the British Library's Flickr account. They might be listed as 'No known copyright restrictions' or 'public domain'.

We're always pleased to hear that people want to use images from our Flickr collection, and appreciate folk who get in touch to check if their proposed use - particularly commercial use - is ok.

So, yes, our Public Domain images are out of copyright and available for re-use without restriction, including commercial re-use. You can find out more about our images at https://web.archive.org/web/20230325130540/https://www.bl.uk/about-us/terms-and-conditions/content-on-flickr-and-wikimedia-commons

You don't have to credit the Library when using our public domain images, but we always appreciate credit where possible as a way of celebrating their re-use and to help other people find the collection.

If you'd like to credit us, you can say something like 'Images courtesy of the British Library’s Flickr Collection'.

We also love hearing how people have used our images, so please do let us know ([email protected]) about the results if you do use them.

By Digital Curator Mia Ridge for the British Library's Digital Research team

12 September 2023

Convert-a-Card: Past, Present and Future of Catalogue Cards Retroconversion

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected].

 

It has been more than eight years since the British Library launched its crowdsourcing platform, LibCrowds, in June 2015, with the aim of enhancing access to our collections. The first project series on LibCrowds was called Convert-a-Card, followed by the ever-so-popular In the Spotlight project. The aim of Convert-a-Card was to convert print card catalogues from the Library’s Asian and African Collections into electronic records, for inclusion in our online catalogue Explore.

A significant portion of the Library's extensive historical collections was acquired well before the advent of standard computer-based cataloguing. Consequently, even though the Library's online catalogue offers public access to tens of millions of records, numerous crucial research materials remain discoverable solely through searching the traditional physical card catalogues. The physical cards provide essential information for each book, such as title, author, physical description (dimensions, number of pages, images, etc.), subject and a “shelfmark” – a reference to the item’s location. This information still constitutes the basic set of data to produce e-records in libraries and archives.

Card Catalogue Cabinets in the British Library’s Asian & African Studies Reading Room © Jon Ellis

 

The initial focus of Convert-a-Card was the Library’s card catalogues for Chinese, Indonesian and Urdu books – you can read more about this here and here. Scanned catalogue cards were uploaded to Flickr (and later to our Research Repository), grouped by the physical drawer in which they were originally located. Several of these digitised drawers became projects on LibCrowds.

 

Crowdsourcing Retroconversion

Convert-a-Card on LibCrowds included two tasks:

  1. Task 1 – Search for a WorldCat record match: contributors were asked to look at a digitised card and search the OCLC WorldCat database based on some of the metadata elements printed on it (e.g. title, author, publication date), to see if a record for the book already existed in some form online. If found, they selected the matching record.
  2. Task 2 – Transcribe the shelfmark: if a match was found, contributors then transcribed the Library's unique shelfmark as printed on the card.

Online volunteers worked on Pinyin (Chinese), Indonesian and Urdu records, mainly between 2015 and 2019. Their valuable contributions resulted in lists of new records which were then ingested into the Library's Explore catalogue – making these items so much more discoverable to our users. For cards only partially matched with online records, curators and cataloguers had a special area on the LibCrowds platform through which they could address some of the discrepancies in partial matches and resolve them.

An example of an Urdu catalogue card

 

After much consideration, we have decided to sunset LibCrowds. However, you can see a good snapshot of it thanks to the UK Web Archive (with thanks to Mia Ridge and Filipe Bento for archiving it), or access its GitHub pages – originally set up and maintained by LibCrowds creator Alex Mendes. We have been using mainly Zooniverse for crowdsourcing projects (see for example Living with Machines projects), and you can see here some references to these and other crowdsourcing initiatives. Sunsetting LibCrowds provided us with the opportunity to rethink Convert-a-Card and consider alternative, innovative ways to automate or semi-automate the retroconversion of these valuable catalogue cards.

 

Text Recognition

As a first step, we were looking to automate the retrieval of text from the digitised cards using OCR/Machine Learning. As mentioned, this text includes shelfmark, title, author, place and date of publication, and other information. If extracted accurately enough, this text could be used for WorldCat lookup, as well as for enhancement of existing records. In most cases, the text was typewritten in English, often with additional information, or translation, handwritten in other languages. To start with, we’ve decided to focus only on the typewritten English – with the aspiration to address other scripts and languages in the future.

Last year, we ran some comparative testing with ABBYY FineReader Server (the software generally used for in-house OCR) and Transkribus, to see how accurately they perform this task. We trialled a set of cards with two different versions of ABBYY, and three different models for typewritten Latin scripts in Transkribus (Model IDs 29418, 36202, and 25849). Assessment was done by visually comparing the original text with the OCRed text, examining mainly the key areas of text which are important for this initiative, i.e. the shelfmark, author’s name and book title. For the purpose of automatically recognising the typewritten English on the catalogue cards, Transkribus Model 29418 performed better than the others – and more accurately than ABBYY’s recognition.

An example of a Pinyin card in Transkribus, showing segmentation and transcription

 

Using that as a base model, we incrementally trained a bespoke model to recognise the text on our Pinyin cards. We’ve also normalised the resulting text, for example removing spaces in the shelfmark, or excluding unnecessary bits of data. This model currently extracts the English text only, with a Character Error Rate (CER) of 1.8%. With more training data, we plan on extending this model to other types of catalogue cards – but for now we are testing this workflow with our Chinese cards.
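For readers curious how a figure like 1.8% is arrived at: Character Error Rate is normally computed as the edit distance between the recognised text and a ground-truth transcription, divided by the length of the ground truth. The generic sketch below illustrates the idea; it is not the evaluation code we used.

```python
# Generic sketch of a Character Error Rate (CER) calculation: Levenshtein edit
# distance between recognised text and ground truth, over ground-truth length.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(recognised: str, ground_truth: str) -> float:
    return edit_distance(recognised, ground_truth) / len(ground_truth)

# A CER of 0.018 means roughly 1.8 recognition errors per 100 ground-truth characters.
print(round(cer("The Wafer Revolution", "The Water Revolution"), 3))  # 0.05
```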

 

Entities Extraction

Extracting meaningful entities from the OCRed text is our next step, and there are different ways to do that. One such method – if already using Transkribus for text extraction – is training and applying a bespoke P2PaLA layout analysis model. Such a model could identify text regions, improve automated segmentation of the cards, and help retrieve specific regions for further tasks. Former colleague Giorgia Tolfo tested this with our Urdu cards, with good results. Trying to replicate this for our Chinese cards was not as successful – perhaps because they are less consistent in structure.

Another possible method is to use regular expressions in a programming language. Research Software Engineer (RSE) Harry Lloyd created a Jupyter notebook with Python code to do just that: take the PAGE XML files produced by Transkribus, parse the XML, and extract the title, author and shelfmark from the text. This works exceptionally well, and in the future we’ll expand entity recognition and extraction to other types of data appearing on the cards. But for now, this information suffices to query OCLC WorldCat and see if a matching record exists.
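As an illustration of the kind of processing described above – not Harry's actual notebook – the sketch below reads a PAGE XML file exported from Transkribus, collects the recognised text lines, and uses a regular expression to pick out a shelfmark-like token. The shelfmark pattern and the assumption that the title sits on the first line are made up for illustration only.

```python
# Illustrative sketch (not the project's actual notebook): parse a Transkribus
# PAGE XML export and pull out candidate entities with a regular expression.
# The shelfmark pattern and 'title on first line' rule are illustrative only.
import re
import xml.etree.ElementTree as ET

# Namespace of the PAGE 2013 schema used in Transkribus exports
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def page_xml_lines(path: str) -> list[str]:
    """Return the Unicode text of every TextLine in a PAGE XML file."""
    root = ET.parse(path).getroot()
    return [u.text.strip()
            for u in root.findall(".//pc:TextLine/pc:TextEquiv/pc:Unicode", NS)
            if u.text]

def extract_entities(lines: list[str]) -> dict:
    text = " ".join(lines)
    shelfmark = re.search(r"\b\d{5}\.[a-z]\.\d+\b", text)  # hypothetical pattern
    return {
        "title": lines[0] if lines else None,
        "shelfmark": shelfmark.group(0) if shelfmark else None,
    }

# entities = extract_entities(page_xml_lines("card_0001.xml"))
```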

One of the 26 drawers of Chinese (Pinyin) card catalogues © Jon Ellis

 

Matching Cards to WorldCat Records

Entities extracted from the catalogue cards can now be used to search and retrieve potentially matching records from the OCLC WorldCat database. Pulling out WorldCat records matched with our card records would help us create new records to go into our cataloguing system Aleph, as well as enrich existing Aleph records with additional information. This matching was previously done by volunteers; we now aim to automate the process as much as possible.

Querying WorldCat was initially done using the Z39.50 protocol – the same one originally used in LibCrowds. This is a client-server communications protocol designed to support the search and retrieval of information in a distributed network environment. Building on an excellent start by Victoria Morris and Giorgia Tolfo, who developed a prototype that uses PyZ3950 and PyMARC to query WorldCat, Harry refined the code and tested it successfully for data search and retrieval. Moving forward, we are likely to use the OCLC API for this – which should be a lot more straightforward!
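For readers unfamiliar with Z39.50, a title search with the two libraries named above looks roughly like the sketch below. The host, database name and query string are placeholders only – WorldCat itself requires an authenticated connection, and our production code differs.

```python
# Rough sketch of a Z39.50 title search using PyZ3950 and pymarc, as named above.
# The host, database and query below are placeholders, not WorldCat credentials.
from PyZ3950 import zoom
import pymarc

conn = zoom.Connection("z3950.example.org", 210)  # placeholder Z39.50 server
conn.databaseName = "example-db"                  # placeholder database name
conn.preferredRecordSyntax = "USMARC"             # ask for MARC records back

query = zoom.Query("CCL", 'ti="Journey to the West"')
results = conn.search(query)

for i in range(min(5, len(results))):
    record = pymarc.Record(data=results[i].data)  # parse the raw MARC bytes
    print(record["245"])                          # the title statement field

conn.close()
```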

 

Curator/Cataloguer Disambiguation

Getting potential matches from WorldCat is brilliant, but we would like an easy way for curators and cataloguers to make the final decision on the ideal match – which WorldCat record would be the best basis for creating a new catalogue record on our system. For this purpose, Harry is currently working on a web application based on Streamlit – an open source Python library that enables the building and sharing of web apps. Staff members will be able to use this app to view suggested matches and select the most suitable ones.
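To give a rough idea of what such a review screen can look like – a generic Streamlit sketch with dummy data, not Harry's actual app – a cataloguer might be shown a card's extracted entities next to the candidate WorldCat matches and asked to pick one:

```python
# Generic Streamlit sketch of a match-review screen, with dummy data -- not the
# Library's actual app. Run with: streamlit run review_app.py
import streamlit as st

card = {"title": "Example title", "author": "Example author", "shelfmark": "00000.a.1"}
candidates = ["WorldCat record A", "WorldCat record B", "No suitable match"]

st.title("Convert-a-Card: match review (sketch)")
st.subheader("Catalogue card")
st.json(card)

choice = st.radio("Which WorldCat record matches this card?", candidates)

if st.button("Save decision"):
    # In a real app the decision would be written to a database or file.
    st.success(f"Recorded: {choice}")
```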

I’ll leave it up to Harry to tell you about this work – so stay tuned for a follow-up blog post very soon!

 

03 August 2023

My AHRC-RLUK Professional Practice Fellowship: A year on

A year ago I started work on my RLUK Professional Practice Fellowship project to computationally analyse the descriptions in the Library’s incunabula printed catalogue. As the project comes to a close this week, I would like to give an update on the work from the last few months, leading to the publication of the incunabula printed catalogue data, a featured collection on the British Library’s Research Repository. In a separate blogpost I will discuss the findings from the text analysis and next steps, as well as share my reflections on the fellowship experience.

Since Isaac’s blogpost about the automated detection of the catalogue entries in the OCR files, a lot of effort has gone into improving the code and outputting the descriptions in the format required for the text analysis and as open datasets. With the invaluable help of Harry Lloyd, who had joined the Library’s Digital Research team as Research Software Engineer, we verified the results and identified new rules for detecting sub-entries signalled by ‘Another Copy’ rather than a main entry heading. We also reassembled and parsed the XML files, originally split into two sets per volume for the purpose of generating the OCR, so that the entries are listed in the order in which they appear in the printed volume. We prepared new text files containing all the entries from each volume, with each entry represented as a single line of text, which I could use for the corpus linguistics analysis with AntConc. In consultation with the Curator, Karen Limper-Herz, and colleagues in Collection Metadata, we agreed how best to store the data for evaluation and in preparation to update the Library’s online catalogue.
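To give a flavour of that last preparation step – a simplified sketch, not the project code – collapsing each catalogue entry into a single line of plain text for AntConc might look like this, assuming the entries for a volume have already been extracted as a list of strings:

```python
# Simplified sketch of preparing a one-entry-per-line corpus file for AntConc,
# assuming the entries for a volume are already extracted as a list of strings.
def write_entries_for_antconc(entries: list[str], out_path: str) -> None:
    with open(out_path, "w", encoding="utf-8") as f:
        for entry in entries:
            # Collapse internal line breaks and runs of whitespace so that each
            # catalogue entry occupies exactly one line of the corpus file.
            f.write(" ".join(entry.split()) + "\n")

# Example (hypothetical file name):
# write_entries_for_antconc(volume_one_entries, "bmc_volume_1_entries.txt")
```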

Two women looking at the poster illustrating the text analysis with the incunabula catalogue data
Poster session at Digital Humanities Conference 2023

Whilst all this work was taking place, I started the computational analysis of the English text from the descriptions. The reason for using these partial descriptions was to separate what was merely transcribed from the incunabula from the language used by the cataloguer in their own ‘voice’. I have recorded my initial observations in the poster I presented at the Digital Humanities Conference 2023. Discussing my fellowship project with the conference attendees was extremely rewarding; there was much interest in the way I had used Transkribus to derive the OCR data, some questions about how the project methodology applies to other data, and an agreement on the need to contextualise collections descriptions and reflect on any bias in the transmission of knowledge. In the poster I also highlight the importance of the cross-disciplinary collaboration required for this type of work, which resonated well with the conference theme of Collaboration as Opportunity.

I have started sharing the knowledge gained from the project with members of the GLAM community. At the British Library, Harry, Karen and I ran an informal ‘Hack & Yack’ training session showcasing the project aims and methodology through the use of Jupyter notebooks. I also enjoyed the opportunity to discuss my research at a recent Research Libraries UK Digital Scholarship Network workshop and look forward to further conversations on this topic with colleagues in the wider GLAM community.

We intend to continue to enrich the datasets to enable better access to the collection and the development of new resources for incunabula research and digital scholarship projects. I would like to end by adding my thanks to Graham Jevon, for assisting with the timely publication of the project datasets, and above all to James, Karen and Harry for supporting me throughout this project.

This blogpost is by Dr Rossitza Atanassova, Digital Curator, British Library. She is on Twitter @RossiAtanassova  and Mastodon @[email protected]

 
