Digital scholarship blog


09 April 2025

Wikisource 2025 Conference: Collaboration, Innovation, and the Future of Digital Texts

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected] and Bluesky as @adi-keinan.bsky.social.

 

The Wikisource 2025 Conference, held in the lush setting of Bali, Indonesia, from 14 to 16 February 2025, brought together a global community of Wikimedians, heritage enthusiasts, and open knowledge advocates. Organised by a coalition of Wikisource contributors, the Wikimedia Foundation and Wikimedia Indonesia, the conference served as a dynamic space to discuss the evolving role of Wikisource, explore new technologies, and strengthen collaborations with libraries, cultural institutions, and other global stakeholders.

Wikisource Conference 2025 participants. Photo by Memora Productions for Wikimedia Indonesia.

The conference, themed “Wikisource: Transform & Preserve Heritage Digitally,” featured a rich programme of keynote talks, long presentations, lightning talks, and informal meet-ups. Central themes included governance, technological advancements, community engagement, and the challenge of scaling Wikisource as a set of collaborative, multilingual platforms. We also enjoyed a couple of fantastic cultural events, celebrating the centuries-old, unique heritage of Bali!

Keynotes and Indonesian Partnerships

Following a kick-off session on the state of the Wikisource community and technology, several Indonesian partners shared insights into their work on heritage, preservation, and digital accessibility. Dr Munawar Holil (Kang Mumu) highlighted the efforts of Manassa (the Indonesian Manuscript Society) to safeguard over 121,000 manuscripts, the majority of which remain undigitised, with key collections located in Bali, Jakarta, and Aceh. Challenges include limited public awareness, sacred perceptions requiring ceremonial handling, and structural gaps in institutional training.

Dr Cokorda Rai Adi Paramartha from Udayana University addressed the linguistic diversity of Indonesia – home to 780 languages and 40 scripts, only eight (!) of which are in Unicode – and stressed the importance of developing digital tools like a Balinese keyboard to engage the younger generation. Both speakers underscored the role of community collaboration and technological innovation in making manuscripts more accessible and relevant in the digital age.

Dr Munawar Holil (left), Dr Cokorda Rai Adi Paramartha (right) and session moderator Ivonne Kristiani (WMF; centre).

I had the honour – and the absolute pleasure! – of being invited as one of the keynote speakers for this conference. In my talk I explored collaborations between the British Library and Wikisource, focusing on engaging local communities, raising awareness of library collections, facilitating access to digitised books and manuscripts, and enhancing them with accurate transcriptions.

We have previously collaborated with Bengali communities on two competitions to proofread 19th-century Bengali books digitised as part of the Two Centuries of Indian Print project. More recently, the Library partnered with the Wikisource Loves Manuscripts (WiLMa) project, sharing Javanese manuscripts digitised through the Yogyakarta Digitisation Project. I also highlighted past and present work with Transkribus to develop machine learning models aimed at automating transcription in various languages, encouraged further collaborations that could benefit communities worldwide, and emphasised the potential of such partnerships in expanding access to digitised heritage.

Dr Adi Keinan-Schoonbaert delivering a keynote address at the conference. Photo by Memora Productions for Wikimedia Indonesia.

Another keynote was delivered by Andy Stauder from the READ-COOP. After introducing the cooperative and Transkribus, Andy talked about a key component of their approach – CCR, which stands for Clean, Controllable, and Reliable data – coupled with information extraction (NER) and powered by end-to-end ATR (automated text recognition) models. This approach is essential for both training and processing with large language models (LLMs). The future may move beyond pre-training to embrace active learning, fine-tuning, retrieval-augmented generation (RAG), dynamic prompt engineering, and reinforcement learning, with the aim of generating linked knowledge, such as integration with Wikidata IDs. Community collaboration remains central, as seen in projects like the digitisation of Indonesian palm-leaf manuscripts using Transkribus.

Andy Stauder (READ-COOP) talking about collaboration around the Indonesian palm-leaf manuscripts digitisation

Cassie Chan (Google APAC Search Partnerships) gave a third keynote on Google's role in digitising and curating cultural and literary heritage, aligning with Wikisource’s mission of providing free access to source texts. Projects like Google Books aim to make out-of-copyright works discoverable online, while Google Arts & Culture showcases curated collections such as the Timbuktu Manuscripts, aiding preservation and accessibility. These efforts support Wikimedia goals by offering valuable, context-rich resources for contributors. Additionally, Google's use of AI for cultural exploration – through tools like Poem Postcards and Art Selfie – demonstrates innovative approaches to engaging with global heritage.

Spotlight on Key Themes and Takeaways

The conference featured so many interesting talks and discussions, providing insights into projects, sharing knowledge, and encouraging collaborations. I’ll mention here just a few themes and some key takeaways, from my perspective as someone working with heritage collections, communities, and technology.

Starting with the latter, a major focus was on Optical Character Recognition (OCR) improvements. Enhanced OCR capabilities on Wikisource platforms not only improve text accuracy but also encourage more volunteers to engage in text correction. The implementation of Google OCR, Tesseract and, more recently, Transkribus is driving increased participation, as volunteers enjoy refining text accuracy. Among other speakers, User:Darafsh, Chairman of the Iranian Wikimedians User Group, mentioned the importance of teaching how to use Wikisource and OCR, and the development of Persian OCR at the University of Hamburg. Other talks relating to technology covered the introduction of new extensions, widgets, and mobile apps, highlighting the push to make Wikisource more user-friendly and scalable.

Nicolas Vigneron showcasing the languages for which Google OCR was implemented on Wikisource

Some discussions explored the potential of WiLMa (Wikisource Loves Manuscripts) as a model for coordinating across stakeholders, ensuring the consistency of tools, and fostering engagement with cultural institutions. For example, Irvin Tomas and Maffeth Opiana talked about WiLMa Philippines. This project launched in June 2024 as the first WiLMa project outside of Indonesia, focusing on transcribing and proofreading Central Bikol texts through activities like monthly proofread-a-thons, a 12-hour transcribe-a-thon, and training sessions at universities.

Another interesting topic was that of Wikidata and metadata. The integration of structured metadata remains a key area of development, enabling better searchability and linking across digital archives. Bodhisattwa Mandal (West Bengal Wikimedians User Group) talked about Wikisource content including both descriptive metadata and unstructured text. While most data isn’t yet stored in a structured format, using Wikidata enables easier updates, avoids redundancy, and improves search, queries, and visualisation. There are tools that support metadata enrichment, annotation, and cataloguing, and a forthcoming mobile app will allow Wikidata-based book searches. Annotating text with Wikidata items enhances discoverability and links content more effectively across Wikimedia projects.

Working for the British Library, I (naturally!) picked up on a few collaborative projects between Wikisource and public or national libraries. One talk was about a digitisation project for traditional Korean texts, a three-year collaboration with Wikimedia Korea and the National Library of Korea, successfully revitalising the Korean Wikisource community by increasing participation and engaging volunteers through events and partnerships.

Another project built a Wikisource community in Uganda by training university students, particularly from library information studies, alongside existing volunteers. Through practical sessions, collaborative tasks, and support from institutions like the National Library of Uganda and Wikimedia contributors, participants developed digital literacy and archival skills.

Nanteza Divine Gabriella giving a talk on ‘Training Wikisource 101’ and building a Wikisource community in Uganda

A third Wikisource and libraries talk was about a Wikisource to public library pipeline project, which started in a public library in Hokitika, New Zealand. This pipeline enables scanned public domain books to be transcribed on Wikisource and then made available as lendable eBooks via the Libby app, using OverDrive's Local Content feature. With strong librarian involvement, a clear workflow, and support from a small grant, the project has successfully bridged Wikisource and library systems to increase accessibility and customise reading experiences for library users.

The final session of the conference focused on shaping a future roadmap for Wikisource through community-driven conversation, strategic planning, and partnership development. Discussions emphasised the need for clearer vision, sustainable collaborations with technology and cultural institutions, improved tools and infrastructure, and greater outreach to grow both readership and contributor communities. Key takeaways included aligning with partners’ goals, investing in editor growth, leveraging government language initiatives, and developing innovative workflows. A strong call was made to prioritise people over platforms and to ensure Wikisource remains a meaningful and inclusive space for engaging with knowledge and heritage.

Looking Ahead

The Wikisource 2025 Conference reaffirmed the platform’s importance in the digital knowledge ecosystem. However, sustaining momentum requires ongoing advocacy, technological refinement, and deeper institutional partnerships. Whether through digitising new materials or leveraging already-digitised collections, there is a clear hunger for openly accessible public domain texts.

As the community moves forward, a focus on governance, technology, and strategic partnerships will be essential in shaping the future of Wikisource. The atmosphere was so positive and there was so much enthusiasm and willingness to collaborate – see this fantastic video available via Wikimedia Commons, which successfully captures the sentiment. I’m sure we’re going to see a lot more coming from Wikisource communities in the future!

 

16 December 2024

Closing the language gap: automated language identification in British Library catalogue records

What do you do when you have millions of books and no record of the language they were written in? Collection Metadata Analyst Victoria Morris looks back to describe how she worked on this in 2020...

Context

In an age of online library catalogues, recording the language in which a book (or any other textual resource) is written is vital to library curators and users alike, as it allows them to search for resources in a particular language, and to filter search results by language.

As the graph below illustrates, although language information is routinely added to British Library catalogue records created as part of ongoing annual production, fewer than 30% of legacy records (from the British Library’s foundation catalogues) contain language information. As of October 2018, nearly 4.7 million records lacked any explicit language information. Of these, 78% were also lacking information about the country of publication, so it would not be possible to infer language from the place of publication.

Chart showing language of content records barely increasing over time

The question is: what can be done about this? In most cases, the language of the resource described can be immediately identified by a person viewing the book (or indeed the catalogue record for the book). With such a large number of books to deal with, though, it would be infeasible to start working through them one at a time ... an automated language identification process is required.

Language identification

Language identification (or language detection) refers to the process of determining the natural language in which a given piece of text is written. The texts analysed are commonly referred to as documents.

There are two possible avenues of approach: using either linguistic models or statistical models. Whilst linguistic models have the potential to be more realistic, they are also more complex, relying on detailed linguistic knowledge. For example, some linguistic models involve analysis of the grammatical structure of a document, and therefore require knowledge of the morphological properties of nouns, verbs, adjectives, etc. within all the languages of interest.

Statistical models are based on the analysis of certain features present within a training corpus of documents. These features might be words, character n-grams (sequences of n adjacent characters) or word n-grams (sequences of n adjacent words). These features are examined in a purely statistical, ‘linguistic-agnostic’ manner; words are understood as sequences of letter-like characters bounded by non-letter-like characters, not as words in any linguistic sense. When a document in an unknown language is encountered, its features can be compared to those of the training corpus, and a prediction can thereby be made about the language of the document.
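To make these features concrete, here is a minimal illustrative sketch (not the project's code) of extracting words and character n-grams from a title:

```python
import re

def character_ngrams(text, n=3):
    """Character n-grams: sequences of n adjacent characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_features(text):
    """Words as runs of letter-like characters bounded by non-letter-like ones."""
    return re.findall(r"\w+", text.lower())

title = "Szerelem"  # an illustrative title word that reappears later in this post
print(character_ngrams(title))  # ['Sze', 'zer', 'ere', 'rel', 'ele', 'lem']
print(word_features(title))     # ['szerelem']
```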

Our project was limited to an investigation of statistical models, since these could be more readily implemented using generic processing rules.

What can be analysed?

Since the vast majority of the books lacking language information have not been digitised, the language identification had to be based solely on the catalogue record. The title, edition statement and series title were extracted from catalogue records, and formed the test documents for analysis.

Although there are examples of catalogue records where these metadata elements are in a language different to that of the resource being described (as in, for example, The Four Gospels in Fanti, below), it was felt that the assumption that these elements reflect the language of the resource was reasonable for the majority of catalogue records.

A screenshot of the catalogue record for a book listed as 'The Four Gospels in Fanti'

Measures of success

The effectiveness of a language identification model can be quantified by the measures precision and recall; precision measures the ability of the model not to make incorrect language predictions, whilst recall measures the ability of the model to find all instances of documents in a particular language. In this context, high precision is of greater value than high recall, since it is preferable to provide no information about the language of content of a resource than to provide incorrect information.
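As a concrete reading of the two measures, here is a minimal sketch using made-up counts that mirror the Zulu example discussed under ‘Results’ below:

```python
# Illustrative counts for a single language; not real project figures.
true_positives = 20    # records correctly predicted as Zulu
false_positives = 0    # records wrongly predicted as Zulu
false_negatives = 80   # Zulu records missed or given no prediction

precision = true_positives / (true_positives + false_positives)  # 1.0 (100%)
recall = true_positives / (true_positives + false_negatives)     # 0.2 (20%)
print(precision, recall)
```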

Various statistical models were investigated, with only a Bayesian statistical model based on analysis of words providing anything approaching satisfactory precision. This model was therefore selected for further development.

The Bayesian idea

Bayesian methods are based on a calculation of the probabilities that a book is written in each language under consideration. An assumption is made that the words present within the book title are statistically independent; this is obviously a false assumption (since, for example, adjacent words are likely to belong to the same language), but it allows application of the following proportionality:

An equation: P(D" is in language " l "given that it has features"  f_1…f_n )∝P (D" is in language " l)∏_(i=1)^n▒├ P("feature " f_i " arises in language " l)

The right-hand side of this proportionality can be calculated based on an analysis of the training corpus. The language of the test document is then predicted to be the language which maximises the above probability.

Because of the assumption of word-independence, this method is often referred to as naïve Bayesian classification.

What that means in practice is this: we notice that whenever the word ‘szerelem’ appears in a book title for which we have language information, the language is Hungarian. Therefore, if we find a book title which contains the word ‘szerelem’, but we don’t have language information for that book, we can predict that the book is probably in Hungarian.

Screenshot of catalogue entry with the word 'szerelem' in the title of a book
Szerelem: definitely a Hungarian word => probably a Hungarian title

If we repeat this for every word appearing in every title of each of the 12 million resources where we do have language information, then we can build up a model, which we can use to make predictions about the language(s) of the 4.7 million records that we’re interested in. Simple!
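The sketch below shows one way this word-based naïve Bayes idea can be put into code. It is illustrative only, not the implementation used for the project; in particular, the Laplace smoothing used for unseen words is an assumption on my part.

```python
import math
import re
from collections import Counter, defaultdict

def words(text):
    # Words as runs of letter-like characters, with no linguistic analysis.
    return re.findall(r"\w+", text.lower())

def train(records):
    """records: iterable of (title, language) pairs with known language."""
    language_counts = Counter()         # how many titles per language
    word_counts = defaultdict(Counter)  # language -> word -> count
    for title, language in records:
        language_counts[language] += 1
        word_counts[language].update(words(title))
    return language_counts, word_counts

def predict(title, language_counts, word_counts):
    """Return the language maximising the naïve Bayes score for a title."""
    total_docs = sum(language_counts.values())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    scores = {}
    for language, n_docs in language_counts.items():
        counts = word_counts[language]
        n_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # prior P(D is in language l)
        for w in words(title):
            # Laplace-smoothed estimate of P(feature f_i arises in language l)
            score += math.log((counts[w] + 1) / (n_words + len(vocabulary)))
        scores[language] = score
    return max(scores, key=scores.get), scores

# Toy usage with made-up titles, not catalogue data:
corpus = [("Szerelem", "Hungarian"), ("A history of England", "English")]
language_counts, word_counts = train(corpus)
print(predict("Szerelem es a haboru", language_counts, word_counts)[0])  # Hungarian
```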

Training corpus

The training corpus was built from British Library catalogue records which contain language information. Records recorded as being in ‘Miscellaneous languages’, ‘Multiple languages’, ‘Sign languages’, ‘Undetermined’ and ‘No linguistic content’ were excluded. This yielded 12,254,341 records, of which 9,578,175 were for English-language resources. Words were extracted from the title, edition statement, and series title, and stored in a ‘language bucket’.

Words in English, Hungarian and Volapuk shown above the appropriate language 'bucket'

Language buckets were analysed in order to create a matrix of probabilities, whereby a number was assigned to each word-language pair (for all words encountered within the catalogue, and all languages listed in a controlled list) to represent the probability that that word belongs to that language. Selected examples are listed in the table below; the final row in the table illustrates the fact that shorter words tend to be common to many languages, and are therefore of less use than longer words in language identification.

{Telugu: 0.750, Somali: 0.250}

aaaarrgghh: {English: 1.000}

aaavfleeße: {German: 1.000}

aafjezatsd: {German: 0.333, Low German: 0.333, Limburgish: 0.333}

aanbidding: {Germanic (Other): 0.048, Afrikaans: 0.810, Low German: 0.048, Dutch: 0.095}

نبوغ: {Persian: 0.067, Arabic: 0.200, Pushto: 0.333, Iranian (Other): 0.333, Azerbaijani: 0.067}

metodicheskiĭ: {Russian: 0.981, Kazakh: 0.019}

nuannersuujuaannannginneranik: {Kalâtdlisut: 1.000}

karga: {Faroese: 0.020, Papiamento: 0.461, Guarani: 0.010, Zaza: 0.010, Esperanto: 0.010, Estonian: 0.010, Iloko: 0.176, Maltese: 0.010, Pampanga: 0.010, Tagalog: 0.078, Ladino: 0.137, Basque: 0.029, English: 0.010, Turkish: 0.029}

Results

Precision and recall varied enormously between languages. Zulu, for instance, had 100% precision but only 20% recall; this indicates that all records detected as being in Zulu had been correctly classified, but that the majority of Zulu records had either been mis-classified, or no language prediction had been made. In practical terms, this meant that a prediction “this book is in Zulu” was a prediction that we could trust, but we couldn’t assume that we had found all of the Zulu books. Looking at our results across all languages, we could generate a picture (formally termed a ‘confusion matrix’) to indicate how different languages were performing (see below). The shaded cells on the diagonal represent resources where the language has been correctly identified, whilst the other shaded cells show us where things have gone wrong.

Language confusion matrix

The best-performing languages were Hawaiian, Malay, Zulu, Icelandic, English, Samoan, Finnish, Welsh, Latin and French, whilst the worst-performing languages were Shona, Turkish, Pushto, Slovenian, Azerbaijani, Javanese, Vietnamese, Bosnian, Thai and Somali.

Where possible, predictions were checked by language experts from the British Library’s curatorial teams. Such validation facilitated the identification of off-diagonal shaded areas (i.e. languages for which predictions should be treated with caution), and enabled acceptance thresholds to be set. For example, the model tends to over-predict English, in part due to the predominance of English-language material in the training corpus, thus the acceptance threshold for English was set at 100%: predictions of English would only be accepted if the model claimed that it was 100% certain that the language was English. For other languages, the acceptance threshold was generally between 95% and 99%.
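A minimal sketch of how such per-language acceptance thresholds might be applied to the model's output; apart from the 100% threshold for English described above, the values are placeholders:

```python
# Per-language acceptance thresholds. Only the English value (1.00) comes from
# the text above; the default stands in for the 95-99% range.
THRESHOLDS = {"English": 1.00}
DEFAULT_THRESHOLD = 0.95

def accept(language, confidence):
    """Keep a prediction only if its confidence clears the language's threshold."""
    return confidence >= THRESHOLDS.get(language, DEFAULT_THRESHOLD)

print(accept("English", 0.99))    # False: English is only accepted at 100% certainty
print(accept("Hungarian", 0.97))  # True
```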

Outcomes

Two batches of records have been completed to date. In the first batch, language codes were assigned to 1.15 million records with 99.7% confidence; in the second batch, a further 1 million language codes were assigned with 99.4% confidence. Work on a third batch is currently underway, and it is hoped to achieve at least a further million language code assignments. The graph below shows the impact that this project is having on the British Library catalogue.

Graph showing improvement in the number of 'foundation catalogue' records with languages recorded

The project has already been well-received by Library colleagues, who have been able to use the additional language coding to assist them in identifying curatorial responsibilities and better understanding the collection.

Further reading

For a more in-depth, mathematical write-up of this project, please see a paper written for Cataloging & Classification Quarterly, which is available at: https://doi.org/10.1080/01639374.2019.1700201, and is also in the BL research repository at https://bl.iro.bl.uk/work/6c99ffcb-0003-477d-8a58-64cf8c45ecf5.

12 December 2024

Automating metadata creation: an experiment with Parliamentary 'Road Acts'

This post was originally written by Giorgia Tolfo in early 2023 then lightly edited and posted by Mia Ridge in late 2024. It describes work undertaken in 2019, and provides context for resources we hope to share on the British Library's Research Repository in future.

The Living with Machines project used a range of diverse sources, from newspapers to maps and census data. This post discusses the Road Acts, 18th-century Acts of Parliament stored at the British Library, as an example of some of the challenges in digitising historical records, and suggests computational methods for reducing some of the overhead of cataloguing Library records during digitisation.

What did we want to do?

Before collection items can be digitised, they need a preliminary catalogue record - there's no point digitising records without metadata for provenance and discoverability. Like many extensive collections, the Road Acts weren't already catalogued. Creating the necessary catalogue records manually wasn't a viable option for the timeframe and budget of the project, so with the support of British Library experts Jennie Grimshaw and Iris O’Brien, we decided to explore automated methods for extracting metadata from digitised images of the documents themselves. The metadata created could then be mapped to a catalogue schema provided by Jennie and Iris. 

Due to the complexity of the task, the timeframe of the project, and the infrastructure and resources needed, the agency Cogapp was commissioned to do the following:

  • Export metadata for 31 scanned microfilms in a format that matched the required fields in a metadata schema provided by the British Library curators
  • OCR (including normalising the 'long S') to a standard agreed with the Living with Machines project
  • Create a package of files for each Act including: OCR (METS + ALTO) + images (scanned by British Library)

To this end, we provided Cogapp with:

  • Scanned images of the 31 microfilm reels, named using the microfilm ID and the numerical sequential order of the frame
  • The Library's metadata requirements
  • Curators' support to explain and guide them through the metadata extraction and record creation process 

Once all of this was put in place, the process started; however, this is where we encountered the main problem.

First issue: the typeface

After some research and tests we came to the conclusion that the typeface (or font, shown in Figure 1) is probably English Blackletter. However, at the time, OCR software - software that uses 'optical character recognition' to transcribe text from digitised images, like Abbyy, Tesseract or Transkribus - couldn't accurately read this font. Running OCR using a generic tool would inevitably lead to poor, if not unusable, OCR. You can create 'models' for unrecognised fonts by manually transcribing a set of documents, but this can be time-consuming. 

Image of a historical document
Figure 1: Page showing typefaces and layout. SPRMicP14_12_016
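For readers who want to experiment, the sketch below shows how one of these frames might be run through Tesseract with a Fraktur-oriented model via pytesseract. This is not the project's workflow: whether a model such as 'frk' copes with eighteenth-century English blackletter is exactly the open question described above, and the file name is a placeholder.

```python
# A minimal sketch, assuming Tesseract is installed with the Fraktur-oriented
# 'frk' traineddata; 'page.tif' is a placeholder for one of the scanned frames.
from PIL import Image
import pytesseract

image = Image.open("page.tif")

# Plain-text output using a blackletter/Fraktur model rather than the default 'eng'.
text = pytesseract.image_to_string(image, lang="frk")
print(text[:300])

# Word-level layout data; block numbers hint at whether marginalia were
# segmented as separate regions or merged into the main text block.
data = pytesseract.image_to_data(image, lang="frk", output_type=pytesseract.Output.DICT)
print("distinct text blocks found:", len(set(data["block_num"])))
```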

Second issue: the marginalia

As you can see in Figure 2, each Act has marginalia - additional text in the margins of the page. 

This makes the task of recognising the layout of information on the page more difficult. At the time, most OCR software wasn't able to detect marginalia as separate blocks of text; as a consequence, these portions of text were often rendered inline, merged with the main text. Some examples of how OCR software with standard settings interpreted the page in Figure 2 are shown below.

Black and white image of printed page with comments in the margins
Figure 2: Printed page with marginalia. SPRMicP14_12_324

 

OCR generated by ABBYY FineReader:

Qualisicatiori 6s Truitees;

Penalty on acting if not quaiified.

Anno Regni septimo Georgii III. Regis.

9nS be it further enaften, Chat no person ihali he tapable of aftingt ao Crustee in the Crecution of this 9ft, unless be ftall he, in his oton Eight, oj in the Eight of his ©Btfe, in the aftual PofTefli'on anb jogment oj Eeceipt of the Eents ana profits of tanas, Cenements, anb 5)erebitaments, of the clear pearlg Oalue of J?iffp Pounbs} o? (hall be ©eit apparent of some person hatiing such estate of the clear gcatlg 5ia= lue of ©ne hunb?eb Pounbs; o? poffcsseb of, o? intitieb unto, a personal estate to the amount o? Oalue of ©ne thoufanb Pounbs: 9nb if ang Person hcrebg beemeo incapable to aft, ihali presume to aft, etierg such Per* son (hall, so? etierg such ©ffcnce, fojfcit anb pag the @um of jTiftg pounbs to ang person o? 

 

OCR generated by the open source tool Tesseract:

586 Anno Regni ?eptimo Georgi III. Regis.

Qualification

of Truttees;

Penalty on

Gnd be it further enated, That no. Per?on ?hall bÈ

capable of ating as Tru?tËe in the Crecution of thig

A, unle?s he ?hall be, in his own Right, 02 in the

Right of his Wife, in the a‰ual Pofe??ion and En. |

joyment 02 Receipt of the Rents and P2zofits of Lands,

Tenements, and hereditaments, of the clear pearly

Ualue of Fifty Pounds z o? hall be Deir Apparent of

?ome Per?on having ?uch Cfitate of the clear yearly Uga-

lue of Dne Hundred Pounds ; 02 po??e?leD of, 02 intitled

unto, a Per?onal E?tate to the Amount 02 Ualue of One

thou?and Pounds : And if any Per?on hereby deemed

acting if not incapable to ai, ?hall p2e?ume to ait, every ?uch Perz

qualified.

 

OCR generated by Cogapp (without any enhancement)

of Trusteesi

586

Anno Regni ſeptimo Georgii III. Regis.

Qualihcation and be it further enałted, That no perſon thall be

capable of aging as Trulltee in the Erecution of this

ad, unlefs he thall be, in his own Right, of in the

Right of his Wife, in the ađual Polellion and En:

joyment or Receipt of the Rents and Profits of Lands,

Tenements, and hereditaments, of the clear pearly

Ualue of ffifty pounds : oi thall be peir apparent of

ſome Perſon having ſuch Etate of the clear yearly Ua:

lue of Dne hundred Pounds; ou podeled of, od intitled

unto, a Perſonal Elate to the amount ou Ualue of Dne

Penalty on thouſand Pounds : and if any perſon hereby deemed

acting if not incapable to ad, thall preſume to ađ, every ſuch Per-

Qualified.

 

As you can see, the OCR transcription results were too poor to use in our research.

Changing our focus: experimenting with metadata creation

Time was running out fast, so we decided to adjust our expectations about text transcription, and asked Cogapp to focus on generating metadata for the digitised Acts. They have reported on their process in a post called 'When AI is not enough' (which might give you a sense of the challenges!).

Since the title page of each Act has a relatively standard layout, it was possible to train a machine learning model to recognise the title, year and place of publication, imprint etc. and produce metadata that could be converted into catalogue records. These were sent on to British Library experts for evaluation and quality control, and potential future ingest into our catalogues.
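As a simplified illustration of the kind of rule that can sit alongside such a model (this is not Cogapp's implementation), a regnal-year statement can be picked out of normalised OCR text with a regular expression:

```python
import re

# Hypothetical OCR snippet from an Act's title page, including the long s.
ocr_text = "Anno Regni ſeptimo Georgii III. Regis."

# Normalise the long s, as in the OCR specification agreed for the project.
normalised = ocr_text.replace("ſ", "s")

# Pull out the regnal-year statement, e.g. "septimo Georgii III".
match = re.search(r"Anno Regni\s+(\w+)\s+(Georgii\s+[IVX]+)", normalised)
if match:
    regnal_year, monarch = match.groups()
    print(regnal_year, monarch)  # septimo Georgii III
```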

Conclusion

This experience, although only partly successful in creating fully transcribed pages, explored the potential of producing the basis of catalogue records computationally, and was also an opportunity to test workflows for automated metadata extraction from historical sources. 

Since this work was put on hold in 2019, advances in OCR features built into generative AI chatbots offered by major companies mean that a future project could probably produce good quality transcriptions and better structured data from our digitised images.

If you have suggestions or want to get in touch about the dataset, please email [email protected]

11 November 2024

British National Bibliography resumes publication

The British National Bibliography (BNB) has resumed publication, following a period of unavailability due to a cyber-attack in 2023.

Having started in 1950, the BNB predates the founding of the British Library, but despite many changes over the years its purpose remains the same: to record the publishing output of the United Kingdom and the Republic of Ireland. The BNB includes books and periodicals, covering both physical and electronic material. It describes forthcoming items up to sixteen weeks ahead of their publication, so it is essential as a current awareness tool. To date, the BNB contains almost 5.5 million records.

As our ongoing recovery from the cyber-attack continues, our Collection Metadata department have developed a process by which the BNB can be published in formats familiar to its many users. Bibliographic records and summaries will be shared in several ways:

  • The database is searchable on the Share Family initiative's BNB Beta platform at https://bl.natbib-lod.org/ (see example record in the image below)
  • Regular updates in PDF format will be made freely available to all users. Initially this will be on request
  • MARC21 bibliographic records will be supplied directly to commercial customers across the world on a weekly basis
Image comprised of five photographs: a shelf of British National Bibliography volumes, the cover of a printed copy of BNB and examples of BNB records
This image includes photographs of the very first BNB entry from 1950 (“Male and female”) and the first one we produced in this new process (“Song of the mysteries”)

Other services, such as Z39.50 access and outputs in other formats, are currently unavailable. We are working towards restoring these, and will provide further information in due course.

The BNB is the first national bibliography to be made available on the Share Family initiative's platform. It is published as linked data, and forms part of an international collaboration of libraries to link and enhance discovery across multiple catalogues and bibliographies.

The resumption of the BNB is the result of adaptations built around long-established collaborative working partnerships, with Bibliographic Data Services (who provide our CIP records) and UK Legal Deposit libraries, who contribute to the Shared Cataloguing Programme.

The International Federation of Library Associations describes bibliographies like the BNB as "a permanent record of the cultural and intellectual output of a nation or country, which is witnessed by its publishing output". We are delighted to be able to resume publication of the BNB, especially as we prepare to celebrate its 75th anniversary in 2025.

For further information about the BNB, please contact [email protected].

Mark Ellison, Collection Metadata Services Manager

31 October 2024

Welcome to the British Library’s new Digital Curator OCR/HTR!

Hello everyone! I am Dr Valentina Vavassori, the new Digital Curator for Optical Character Recognition/Handwritten Text Recognition at the British Library.

I am part of the Heritage Made Digital Team, which is responsible for developing and overseeing the digitisation workflow at the Library. I am also an unofficial member of the Digital Research Team, where I promote the reuse of, and access to, the Library’s collections.

My role has both an operational component (integrating and developing OCR and HTR in the digitisation workflow) and a research and engagement component (supporting OCR/HTR projects in the Library). I really enjoy these two sides of my role, as I have a background as a researcher and as a cultural heritage professional.

I joined the British Library from The National Archives, London, where I worked as a Digital Scholarship Researcher in the Digital Research Team. I worked on projects involving data visualisation, OCR/HTR, data modelling, and user experience.

Before that, I completed a PhD in Digital Humanities at King’s College London, focusing on chatbots and augmented reality in museums and their impact on users and museum narratives. Part of my thesis explored how to use these narratives using spatial humanities methods such as GIS. During my PhD, I also collaborated on various digital research projects with institutions like The National Gallery, London, and the Museum of London.

However, I originally trained as an art historian. I studied art history in Italy and worked for a few years in museums. During that time, I realised the potential of developing digital experiences for visitors and the significant impact digitisation can have on research and enjoyment in cultural heritage. I was so interested in the opportunities that I co-founded a start-up which developed a heritage geolocation app for tourists.

Joining the Library has been an amazing opportunity. I am really looking forward to learning from my colleagues and exploring all the potential collaborations within and outside the Library.

24 October 2024

Southeast Asian Language and Script Conversion Using Aksharamukha

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected]. 

 

The British Library’s vast Southeast Asian collection includes manuscripts, periodicals and printed books in the languages of the countries of maritime Southeast Asia, including Indonesia, Malaysia, Singapore, Brunei, the Philippines and East Timor, as well as on the mainland, from Thailand, Laos, Cambodia, Myanmar (Burma) and Vietnam.

The display of literary manuscripts from Southeast Asia outside of the Asian and African Studies Reading Room in St Pancras (photo by Adi Keinan-Schoonbaert)

 

Several languages and scripts from the mainland were the focus of recent development work commissioned by the Library and done on the script conversion platform Aksharamukha. These include Shan, Khmer, Khuen, and northern Thai and Lao Dhamma (Dhamma, or Tham, meaning ‘scripture’, is the script that several languages are written in).

These and other Southeast Asian languages and scripts pose multiple challenges to us and our users. Collection items in languages using non-romanised scripts are mainly catalogued (and therefore searched by users) using romanised text. For some language groups, users need to search the catalogue by typing in the transliteration of title and/or author using the Library of Congress (LoC) romanisation rules.

Items’ metadata text converted using the LoC romanisation scheme is often unintuitive, and therefore poses a barrier for users, hindering discovery and access to our collections via the online catalogues. In addition, curatorial and acquisition staff spend a significant amount of time manually converting scripts, a slow process which is prone to errors. Other libraries worldwide holding Southeast Asian collections and using the LoC romanisation scheme face the same issues.

Excerpt from the Library of Congress romanisation scheme for Khmer

 

Having faced these issues with the Burmese language, last year we commissioned development work on the open-access platform Aksharamukha, which enables conversion between various scripts, supporting 121 scripts and 21 romanisation methods. Vinodh Rajan, Aksharamukha’s developer, perfectly combines knowledge of languages and writing systems with computer science and coding skills. He added the LoC romanisation system to the platform’s Burmese script transliteration functionality (read about this in my previous post).

The results were outstanding – readers could copy and paste transliterated text into the Library's catalogue search box to check if we have items of interest. This has also greatly enhanced cataloguing and acquisition processes by enabling the creation of acquisition records and minimal records. In addition, our Metadata team updated all of our Burmese catalogue records (ca. 20,000) to include Burmese script, alongside transliteration (side note: these updated records are still unavailable to our readers due to the cyber-attack on the Library last year, but they will become accessible in the future).
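For those who prefer to script conversions rather than use the web interface, Aksharamukha is also available as a Python package. The sketch below is illustrative only: 'Burmese' is one of Aksharamukha's script names, but the identifier used here for the LoC romanisation target is an assumption and should be checked against the script list the platform exposes.

```python
# A minimal sketch, assuming the 'aksharamukha' package from PyPI.
from aksharamukha import transliterate

burmese_text = "မြန်မာ"  # illustrative string, not a catalogue record

# 'IASTLoC' is a stand-in for whatever name the LoC romanisation target is
# registered under; consult Aksharamukha's script list for the actual identifier.
romanised = transliterate.process('Burmese', 'IASTLoC', burmese_text)
print(romanised)
```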

The time was ripe to expand our collaboration with Vinodh and Aksharamukha. Maria Kekki, Curator for Burmese Collections, has been hosting a Chevening Fellow from Myanmar, Myo Thant Linn, over the past year. Myo was tasked with cataloguing manuscripts and printed books in Shan and Khuen – but found the romanisation aspect of this work very challenging to do manually. In order to facilitate Myo’s work and maximise the benefit from his fellowship, we needed to have a LoC romanisation functionality available. Aksharamukha was the right place for this – this free, open source, online tool is available to our curators, cataloguers, acquisition staff, and metadata team to use.

Former Chevening Fellow Myo Thant Linn reciting from a Shan manuscript in the Asian and African Studies Reading Room, September 2024 (photo by Jana Igunma)

 

In addition to Maria and Myo’s requirements, Jana Igunma, Ginsburg Curator for Thai, Lao and Cambodian Collections, noted that adding Khmer to Aksharamukha would be immensely helpful for cataloguing our Khmer backlog and assisting with new acquisitions. Northern Thai and Lao Dhamma scripts would be most useful for cataloguing newly acquired printed material and adding original scripts to manuscript records. Automating LoC transliteration could be very cost-effective, saving many hours for the cataloguing, acquisitions and metadata teams. Khmer is a great example: it has the most extensive alphabet in the world (74 letters), and its romanisation is extremely complicated and time consuming!

First three leaves with text in a long format palm leaf bundle (សាស្ត្រាស្លឹករឹត/sāstrā slẏk rẏt) containing part of the Buddhist cosmology (សាស្ត្រាត្រៃភូមិ/Sāstrā Traibhūmi) in Khmer script, 18th or 19th century. Acquired by the British Museum from Edwards Goutier, Paris, on 6 December 1895. British Library, Or 5003, ff. 9-11

 

It was therefore necessary to enhance Aksharamukha’s script conversion functionality with these additional scripts. This could generally be done by referring to existing LoC conversion tables, while taking into account any permutations of diacritics or character variations. However, it definitely has not been as simple as that!

For example, the presence of diacritics instigated a discussion between internal and external colleagues on the use of precomposed vs. decomposed formats in Unicode, when romanising original script. LoC systems use two types of coding schemata, MARC 21 and MARC 8. The former allows for precomposed diacritic characters, and the latter does not – it allows for decomposed format. In order to enable both these schemata, Vinodh included both MARC 8 and MARC 21 as input and output formats in the conversion functionality.
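The precomposed/decomposed distinction maps directly onto Unicode normalisation forms, which is easy to see in Python:

```python
import unicodedata

romanised = "ā"  # 'a' with macron, as produced by LoC romanisation

nfc = unicodedata.normalize("NFC", romanised)  # precomposed: one code point (MARC 21 style)
nfd = unicodedata.normalize("NFD", romanised)  # decomposed: 'a' + combining macron (MARC 8 style)

print(len(nfc), [hex(ord(c)) for c in nfc])  # 1 ['0x101']
print(len(nfd), [hex(ord(c)) for c in nfd])  # 2 ['0x61', '0x304']
```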

Another component, implemented for Burmese in the previous development round, but also needed for Khmer and Shan transliterations, is word spacing. Vinodh implemented word separation in this round as well – although this would always remain something that the cataloguer would need to check and adjust. Note that this is not enabled by default – you would have to select it (under ‘input’ – see image below).

Screenshot from Aksharamukha, showcasing Khmer word segmentation option

 

It is heartening to know that enhancing Aksharamukha has been making a difference. Internally, Myo had been a keen user of the Shan romanisation functionality (though Khuen romanisation is still work-in-progress); and Jana has been using the Khmer transliteration too. Jana found it particularly useful to use Aksharamukha’s option to upload a photo of the title page, which is then automatically OCRed and romanised. This saved precious time otherwise spent on typing Khmer!

It should be mentioned that, when it comes to cataloguing Khmer language books at the British Library, both original Khmer script and romanised metadata are being included in catalogue records. Aksharamukha helps to speed up the process of cataloguing and eliminates typing errors. However, capitalisation and in some instances word separation and final consonants need to be adjusted manually by the cataloguer. Therefore, it is necessary that the cataloguer has a good knowledge of the language.

On the left: photo of a title page of a Khmer language textbook for Grade 9, recently acquired by the British Library; on the right: conversion of original Khmer text from the title page into LoC romanisation standard using Aksharamukha

 

The conversion tool for Tham (Lanna) and Tham (Lao) works best for texts in Pali language, according to its LoC romanisation table. If Aksharamukha is used for works in northern Thai language in Tham (Lanna) script, or Lao language in Tham (Lao) script, cataloguer intervention is always required as there is no LoC romanisation standard for northern Thai and Lao languages in Tham scripts. Such publications are rare, and an interim solution that has been adopted by various libraries is to convert Tham scripts to modern Thai or Lao scripts, and then to romanise them according to the LoC romanisation standards for these languages.

Other libraries have been enjoying the benefits of the new developments to Aksharamukha. Conversations with colleagues from the Library of Congress revealed that present and past commissioned developments on Aksharamukha have had a positive impact on their operations. LoC has been developing a transliteration tool called ScriptShifter. Aksharamukha’s Burmese and Khmer functionalities are already integrated into this tool, which can convert over ninety non-Latin scripts into Latin script following the LoC/ALA guidelines. The British Library’s funding of Aksharamukha to make several Southeast Asian languages and scripts available in LoC romanisation has already proved useful!

If you have feedback or encounter any bugs, please feel free to raise an issue on GitHub. And, if you’re interested in other scripts romanised using LoC schemas, Aksharamukha has a complete list of the ones that it supports. Happy conversions!

 

04 July 2024

DHBN 2024 - Digital Humanities in the Nordic and Baltic Countries Conference Report

This is a joint blog post by Helena Byrne, Curator of Web Archives, Harry Lloyd, Research Software Engineer, and Rossitza Atanassova, Digital Curator.

Conference banner showing Icelandic landscape with mountains
This year’s Digital Humanities in the Nordic and Baltic countries conference took place at the University of Iceland School of Education in Reykjavik. It was the eighth conference in a series established in 2016, but the first time it was held in Iceland. The theme for the conference was “From Experimentation to Experience: Lessons Learned from the Intersections between Digital Humanities and Cultural Heritage”. There were pre-conference workshops from May 27-29, with the main conference starting on the afternoon of May 29 and finishing on May 31. In her excellent opening keynote Sally Chambers, Head of Research Infrastructure Services at the British Library, discussed the complex research and innovation data space for cultural heritage. Three British Library colleagues report highlights of their conference experience in this blog post.

Helena Byrne, Curator of Web Archives, Contemporary British & Irish Publications.

I presented in the Born Digital session held on May 28. There were four presentations in this session: three related to web archiving and one to Twitter (X) data. I co-presented ‘Understanding the Challenges for the Use of Web Archives in Academic Research’. This presentation examined the challenges for the use of web archives in academic research through a synthesis of the findings from two research studies that were published through the WARCnet research network. There was lots of discussion after the presentation on how web archives could be used as a research data management tool to help manage online citations in academic publications.

Helena presenting to an audience during the conference session on born-digital archives
Helena presenting in the born-digital archives session

The conference programme was very strong and there were many takeaways that relate to my role. One strong theme was ‘collections as data’. At the UK Web Archive we have just started to publish some of our inactive curated collections as data. So these discussions were very useful. One highlight was the panel ‘Publication and reuse of digital collections: A GLAM Labs approach’. What stood out for me in this session was the checklist for publishing collections as data. It was very reassuring to see that we had pretty much everything covered for the release of the UK Web Archive datasets.

Rossitza and I were kindly offered a tour of the National and University Library of Iceland by Kristinn Sigurðsson, Head of Digital Projects and Development. We enjoyed meeting curatorial staff from the Special Collections who showed us some of the historical maps of Iceland that have been digitised. We also visited the digitisation studio to see how they process periodicals, and spoke to staff involved with web archiving. Thank you to Kristinn and his colleagues for this opportunity to learn about the library’s collections and digital services.

Rossitza and Helena standing by the moat outside the National Library of Iceland building
Rossitza and Helena outside the National and University Library of Iceland

 

Inscription in Icelandic reading National and University Library of Iceland outside the Library building
The National and University Library of Iceland

Harry Lloyd, Research Software Engineer, Digital Research.

DHNB2024 was a rich conference from my perspective as a research software engineer. Sally Chambers’ opening keynote on Wednesday afternoon demonstrated an extraordinary grasp of the landscape of digital cultural heritage across the EU. By this point there had already been a day and a half of workshops, including a session Rossitza and I presented on Catalogues as Data.

I spent the first half using a Jupyter notebook to explain how we extracted entries from an OCR’d version of the catalogue of the British Library’s collection of 15th century books. We used an explainable algorithm rather than a ‘black-box’ machine learning one, so we walked through the steps involved and discussed where it worked well and where it could be improved. You can follow along by clicking the ‘launch notebook’ button in the ReadMe here.

Harry pointing to an image from the catalogue of printed books on a screen for the workshop audience
Harry explaining text recognition results during the workshop

Handing over to Rossitza in the second half to discuss her corpus linguistic analysis worked really well by giving attendees a feel for the complete workflow. This really showed in some great conversations we had with attendees over the following days about tricky problems like where to store the ‘true’ results of OCR. 

A few highlights from the rest of the conference were Clelia LaMonica’s work using a Latin large language model to analyse kinship in texts from Medieval Burgundy. Large language models trained on historic texts are important, as the majority are trained on modern material and struggle with historical language. Jørgen Burchardt presented some refreshingly quantitative work on bias across a digitised newspaper collection, very reminiscent of work by Kaspar Beelen. Overall it was a productive few days, and I very much enjoyed my time in Reykjavik.

Rossitza Atanassova, Digital Curator, Digital Research.

This was my second DHNB conference and I was looking forward to reconnecting with the community of researchers and cultural heritage practitioners, some of whom I had met at DHNB2019 in Copenhagen. Apart from the informal discussions with attendees, I contributed to DHNB2024 in two main ways.

As already mentioned, Harry and I delivered a pre-conference workshop showcasing some processes and methodology we use for working with printed catalogues as data. In the session we used the corpus tool AntConc to perform computational analysis of the descriptions for the British Library’s collection of books published in the 15th century. You can find out more about the project here and reuse the workshop materials published on Zenodo here.

I also joined the pre-conference meeting of the international GLAM Labs Community held at the National and University Library of Iceland. This was the first in-person meeting of the community in five years and was a productive session during which we brainstormed ‘100 ideas for the GLAM Labs Community’. Afterwards we had a sneak peek of the archive of the National Theatre of Iceland which is being catalogued and digitised.

The main hall of the Library with a chessboard on a table with two chairs, a statue of a man, holding spectacles and a stained glass screen.
The main hall of the Library.

The DHNB community is so welcoming and supportive, and attracts many early career digital humanists. I was particularly interested to hear from doctoral students researching the use of AI with digitised archives, and using NLP methods with historical collections. One of the projects that stood out for me was Johannes Widegren’s PhD research into the ethical use of AI to enable access and discovery of Sami cultural heritage, and to develop library and archival practice. 

I was also interested in presentations that discussed workflows for creating Named Entity Recognition resources for historical archives and I plan to try out the open-source Label Studio tool that I learned about. And of course, the poster session is always a highlight and I enjoyed finding out about a range of projects, including computational analysis of Scandinavian runic-texts, digital reconstruction of Gothenburg’s 1923 Jubilee exhibition, and training large language models to track semantic variation in climate change vocabulary in Danish news articles.

A line up of people standing in front of a screen advertising the venue for DHNB25 in Estonia
The poster presentations session chaired by Olga Holownia

We are grateful to all DHNB24 organisers for the warm welcome and a great conference experience, with special thanks to the inspirational and indefatigable Olga Holownia.

26 June 2024

Join the British Library as a Digital Curator, OCR/HTR

This is a repeated and updated blog post by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections. She shares some background information on how a new post advertised for a Digital Curator for OCR/HTR will help the Library streamline post-digitisation work to make its collections even more accessible to users. Our previous run of this recruitment was curtailed due to the cyber-attack on the Library - but we are now ready to restart the process!

 

We’ve been digitising our collections for about three decades, opening up access to incredibly diverse and rich collections, for our users to study and enjoy. However, it is important that we further support discovery and digital research by unlocking the huge potential in automatically transcribing our collections.

We’ve done some work over the years towards making our collection items available in machine-readable format, in order to enable full-text search and analysis. Optical Character Recognition (OCR) technology has been around for a while, and there are several large-scale projects that produced OCRed text alongside digitised images – such as the Microsoft Books project. Until recently, Western-language print collections have been the main focus, especially newspaper collections. A flagship collaboration with the Alan Turing Institute, the Living with Machines project, applied OCR technology to UK newspapers, designing and implementing new methods in data science and artificial intelligence, and analysing these materials at scale.

OCR of Bengali books using Transkribus, Two Centuries of Indian Print Project

Machine Learning technologies have been dealing increasingly well with both modern and historical collections, whether printed, typewritten or handwritten. Taking a broader perspective on Library collections, we have been exploring opportunities with non-Western collections too. Library staff have been engaging closely with the exploration of OCR and Handwritten Text Recognition (HTR) systems for English, Bangla, Arabic, Urdu and Chinese. Digital Curators Tom Derrick, Nora McGregor and Adi Keinan-Schoonbaert have teamed up with PRImA Research Lab and the Alan Turing Institute to run four competitions between 2017 and 2019, inviting providers of text recognition methods to try them out on our historical material.

We have been working with Transkribus as well – for example, Alex Hailey, Curator for Modern Archives and Manuscripts, used the software to automatically transcribe 19th century botanical records from the India Office Records. A digital humanities work strand led by former colleague Tom Derrick saw the OCR of most of our digitised collection of Bengali printed texts, digitised as part of the Two Centuries of Indian Print project. More recently Transkribus has been used to extract text from catalogue cards in a project called Convert-a-Card, as well as from Incunabula print catalogues.

An example of a catalogue card in Transkribus, showing segmentation and transcription

We've also collaborated with Colin Brisson from the READ_Chinese project on Chinese HTR, working with eScriptorium to enhance binarisation, segmentation and transcription models using manuscripts that were digitised as part of the International Dunhuang Programme. You can read more about this work in this brilliant blog post by Peter Smith, who did a PhD placement with us last year.

The British Library is now looking for someone to join us to further improve the access and usability of our digital collections, by integrating a standardised OCR and HTR production process into our existing workflows, in line with industry best practice.

For more information and to apply please visit the ad for Digital Curator for OCR/HTR on the British Library recruitment site. Applications close on Sunday 21 July 2024. Please pay close attention to questions asked in the application process. Any questions? Drop us a line at [email protected].

Good luck!
