Digital scholarship blog

Enabling innovative research with British Library digital collections

06 November 2024

Recovered Pages: Crowdsourcing at the British Library

Digital Curator Mia Ridge writes...

While the British Library works to recover from the October 2023 cyber-attack, we're putting some information from our currently inaccessible website into an easily readable and shareable format. This blog post is based on a page captured by the Wayback Machine in September 2023.

Crowdsourcing at the British Library

Screenshot of the Zooniverse interface for annotating a historical newspaper article
Example of a crowdsourcing task

For the British Library, crowdsourcing is an engaging form of online volunteering supported by digital tools that manage tasks such as transcription, classification and geolocation – tasks that make our collections more discoverable.

The British Library has run several popular crowdsourcing projects in the past, including the Georeferencer, for geolocating historical maps, and In the Spotlight, for transcribing important information about historical playbills. We also integrated crowdsourcing activities into our flagship AI / data science project, Living with Machines.

Crowdsourcing Projects at the British Library

  • Living with Machines (2019-2023) created innovative crowdsourced tasks, including tasks that asked the public to closely read historical newspaper articles to determine how specific words were used.
  • Agents of Enslavement (2021-2022) used 18th/19th century newspapers to research slavery in Barbados and create a database of enslaved people.
  • In the Spotlight (2017-2021) was a crowdsourcing project from the British Library that aimed to make digitised historical playbills more discoverable, while also encouraging people to closely engage with this otherwise less accessible collection of ephemera.
  • Canadian wildlife: notes from the field (2021), a project where volunteers transcribed handwritten field notes that accompany recordings of a wildlife collection within the sound archive.
  • Convert a Card (2015) was a series of crowdsourcing projects that aimed to convert scanned catalogue cards in Asian and African languages into electronic records. The project template can be found and used on GitHub.
  • Georeferencer (2012 - present) enabled volunteers to create geospatial data from digitised versions of print maps by adding control points to the old and modern maps (a minimal sketch of the underlying idea follows this list).
  • Pin-a-Tale (2012) asked people to map literary texts to British places.
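
For readers curious about what georeferencing with control points involves computationally, here is a minimal, hypothetical sketch in Python. It is not the Georeferencer's actual code: the control point values and the simple affine model are assumptions for illustration only.

```python
# Hypothetical sketch of georeferencing from control points (not Georeferencer's code).
# Each control point pairs a pixel position on the scanned map with a lon/lat picked
# on a modern base map; a least-squares affine transform maps one to the other.
import numpy as np

pixel_points = np.array([(120, 85), (940, 110), (150, 760), (905, 730)], dtype=float)
world_points = np.array([(-0.13, 51.53), (-0.08, 51.53), (-0.13, 51.50), (-0.08, 51.50)])

design = np.hstack([pixel_points, np.ones((len(pixel_points), 1))])   # [x, y, 1] rows
affine, *_ = np.linalg.lstsq(design, world_points, rcond=None)        # 3x2 matrix

def pixel_to_world(x, y):
    """Estimate the lon/lat for a pixel coordinate on the scanned map."""
    return tuple(np.array([x, y, 1.0]) @ affine)

print(pixel_to_world(500, 400))   # rough lon/lat near the centre of the example map
```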

 

Research Projects

The Living with Machines project included a large component of crowdsourcing research through practice, led by Digital Curator Mia Ridge.

Mia was also the Principal Investigator on the AHRC-funded Collective Wisdom project, which worked with a large group of co-authors to produce a book, The Collective Wisdom Handbook: perspectives on crowdsourcing in cultural heritage, through two 'book sprints' in 2021:

This book is written for crowdsourcing practitioners who work in cultural institutions, as well as those who wish to gain experience with crowdsourcing. It provides both practical tips, grounded in lessons often learned the hard way, and inspiration from research across a range of disciplines. Case studies and perspectives based on our experience are woven throughout the book, complemented by information drawn from research literature and practice within the field.

More Information

Our crowdsourcing projects were designed to produce data that can be used in discovery systems (such as online catalogues and our item viewer) through enjoyable tasks that give volunteers an opportunity to explore digitised collections.

Each project involves teams across the Library to supply digitised images for crowdsourcing and ensure that the results are processed and ingested into various systems. Enhancing metadata through crowdsourcing is considered in the British Library's Collection Metadata Strategy.

We previously posted on Twitter as @LibCrowds and currently post occasionally on Mastodon (https://glammr.us/@libcrowds) and via our newsletter.

Past editions of our newsletter are available online.

26 July 2024

Charting the European D-SEA Conference at the Stabi

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected]. 

 

Earlier this month, I had the pleasure of attending the “Charting the European D-SEA: Digital Scholarship in East Asian Studies” conference held at the Berlin State Library (Staatsbibliothek zu Berlin), also known as the Stabi. The conference, held on 11-12 July 2024, aimed to fill a gap in the European digital scholarship landscape by creating a research community and a space for knowledge exchange on digital scholarship issues across humanities disciplines concerned with East Asian regions and languages.

The event was a dynamic fusion of workshops, presentations and panel discussions. Over three days of workshops (8-10 July), participants were introduced to key digital methods, resources, and databases. These sessions aimed to transmit practical knowledge in digital scholarship, focusing on East Asian collections and data. The subsequent two days were dedicated to the conference proper, featuring a broad range of presentations on various themes.

The reading room in the Berlin State Library, Haus Potsdamer Straße

 

DH and East Asian Studies in Europe and Beyond

Conference organisers Jing Hu and Brent Ho from the Stabi, and Shih-Pei Chen and Dagmar Schäfer from the Max Planck Institute for the History of Science (MPIWG), set the stage for an enriching exchange of ideas and knowledge. The diversity of topics covered was impressive: from established digital resources and research tools to AI applications in historical research, the sessions provided a comprehensive overview of the current state and future directions of the field.

There were so many excellent presentations – and I often wished I could clone myself to attend parallel sessions! As expected, there was much focus on working with AI – machine learning and generative AI – and their potential in historical and humanities research. AI technologies offer powerful tools for data analysis and pattern recognition, and can significantly enhance research capabilities.

Damian Mandzunowski (Heidelberg University) talked about using AI to extract and analyse information from Chinese Comics
 
Shaojian Li (Renmin University of China) looked into automating the classification of pattern images using deep learning

One notable session was "Reflections on Deep Learning & Generative AI," chaired by Brent Ho and discussed by Clemens Neudecker. The roundtable highlighted the evolving role of AI in humanities research. Calvin Yeh from MPIWG discussed AI's potential to augment, rather than just automate, research processes. He shared intriguing examples of using AI tools like ChatGPT to simulate group discussions and suggest research actions. Hongsu Wang from Harvard University presented on the use of Large Language Models and traditional Transformers in the China Biographical Database (CBDB) project, demonstrating the effectiveness of these models in data extraction and standardisation.

Calvin Yeh (MPIWG) discussed AI for “Augmentation, not only Automation” and experimented with ChatGPT discussing a research approach, designing a research process and simulating a group discussion
 
Hongsu Wang (Harvard University) talked about extracting and standardising data using LLMs and traditional Transformers in the CBDB project – here showcasing Jeffrey Tharsen’s research to create a network graph using a prompt in ChatGPT

 

Exploring the Stabi

Our group tour in the Stabi was a personal highlight for me. This historic library, part of the Prussian Cultural Heritage Foundation, is renowned for its extensive collections and commitment to making digitised materials publicly accessible. The library operates from two major public sites – Haus Unter Den Linden and Haus Potsdamer Straße. Tours of both locations were available, but I chose to explore the more recent building, designed by Hans Scharoun and located in the Kulturforum on Potsdamer Straße in West Berlin – the history and architecture of which are fascinating.

A group of the conference delegates enjoying the tour of SBB’s Haus Potsdamer Straße

I really enjoyed catching up with old colleagues and making new connections with fellow scholars passionate about East Asian digital humanities!

To conclude

The Charting the European D-SEA conference at the Stabi was an enriching experience, offering deep insight into the integration of digital methods in East Asian studies and allowing me to connect with a global community of scholars. The combination of traditional and more recent digital practices, coupled with the forward-looking discussions on AI and deep learning, made this conference a significant milestone in the field. I look forward to seeing how these conversations evolve and contribute to the broader landscape of digital humanities.

 

08 July 2024

Embracing Sustainability at the British Library: Insights from the Digital Humanities Climate Coalition Workshop

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected]. 

 

Sustainability has become a core value at the British Library, driven by our staff-led Sustainability Group and bolstered by the addition of a dedicated Sustainability Manager nearly a year ago. As part of our ongoing commitment to environmental responsibility, we have been exploring various initiatives to reduce our environmental footprint. One such initiative is our engagement with the Digital Humanities Climate Coalition (DHCC), a collaborative and cross-institutional effort focused on understanding and minimising the environmental impact of digital humanities research.

Screenshot from the Digital Humanities Climate Coalition website
 

Discovering the DHCC and its toolkit

The Digital Humanities Climate Coalition (DHCC) has been on my radar for some time, primarily due to their exemplary work in promoting sustainable digital practices. The DHCC toolkit, in particular, has proven to be an invaluable resource. Designed to help individuals and organisations make more environmentally conscious digital choices, the toolkit offers practical guidance for building sustainable digital humanities projects. It encourages researchers to adopt climate-responsible practices and supports those who may lack the practical knowledge to devise greener initiatives.

The toolkit is comprehensive, providing tips on the planning and management of research infrastructure and data. It aims to empower researchers to make climate-friendly technological decisions, thereby fostering a culture of sustainability within the digital humanities community.

My primary goal in leveraging the DHCC toolkit is to raise awareness about the environmental impact of digital work and technology use. By doing so, I hope to empower Library staff to make informed decisions that contribute to our sustainability goals. The toolkit’s insights are crucial for anyone involved in digital research, offering both strategic guidance and practical tips for minimising ecological footprints.

Planning a workshop at the British Library

With the support of our Research Development team, I organised a one-day workshop at the British Library, inviting Professor James Baker, Director of Digital Humanities at the University of Southampton and a member of the DHCC, to lead the event. The workshop was designed to introduce the DHCC toolkit and provide guidance on implementing best practices in research projects. The in-person, full-day workshop was held on 5 February 2024.

Workshop highlights

The workshop featured four key sessions:

Session 1: Introductions and Framing: We began with an overview of the DHCC and its work within the GLAM sector, followed by an introduction to sustainability at the British Library, the roles that libraries play in reducing carbon footprint and awareness raising, the Green Libraries Campaign (of which the British Library was a founding partner), and perspectives on digital humanities and the use of computational methods.

CILIP’s Green Libraries Campaign banner

Session 2: Toolkit Overview: Prof Baker introduced the DHCC toolkit, highlighting its main components and practical applications, focusing on grant writing (e.g. recommendations on designing research projects, including Data Management Plans), and working practices (guidance on reducing energy consumption in day-to-day working life, e.g. communication and shared working, travel, and publishing and preserving data). The session included responses from relevant Library teams, on topics such as research project design, data management and our shared research repository.

DHCC publication cover: A Researcher Guide to Writing a Climate Justice Oriented Data Management Plan
DHCC Information, Measurement and Practice Action Group. (2022). A Researcher Guide to Writing a Climate Justice Oriented Data Management Plan (v0.6). Zenodo. https://doi.org/10.5281/zenodo.6451499

Session 3: Advocacy and Influencing: This session focused on strategies for advocating for sustainable practices within one's organisation and influencing others to adopt them. We covered the Library’s staff-led Sustainability Group and its activities, after which participants were asked to consider the actions that could be taken at the Library and beyond, taking into account the types of people they might influence (senior leaders, colleagues, peers in wider networks and communities).

Session 4: Feedback and Next Steps: Participants discussed their takeaways from the workshop and identified actionable steps they could implement in their work. This session included conversations on ways to translate workshop learnings into concrete next steps, and generated light ‘commitments’ for the next week, month and year. One fun way to set oneself a yearly reminder is to schedule an eco-friendly e-card to send to yourself in a year!

Post-workshop follow-up

Three months after the workshop took place, we conducted a follow-up survey to gauge its impact. The survey included a mix of agree/disagree statements (see chart below) and optional long-form questions to capture more detailed feedback. While we received only a few responses, the results were constructive and positive. Participants appreciated the practical insights and reported better awareness of sustainable practices in their digital work.

Participants’ agree/disagree ratings for a series of statements about the DHCC workshop’s impact

Judging from responses to the set of statements above, several participants have embedded toolkit recommendations, made specific changes in their work, shared knowledge and influenced their wider networks. Responses to the open-ended questions that followed provided additional detail on these actions.

What did staff members say?

Here are some comments made in relation to making changes and embedding the DHCC toolkit’s recommendations:

“Changes made to working policy and practice to order vegetarian options as standard for events.”

“I have referenced the toolkit in a chapter submitted for a monograph, in relation to my BL/university research.”

“I have discussed the toolkit's recommendations with colleagues re the projects I am currently working on. We agreed which parts of the projects were most carbon intensive and discussed ways to mitigate that.”

“I recommended a workshop on the toolkit to my [research] funding body.”

“Have engaged more with small impacts - less email traffic, fewer attachments, fewer images.”

A couple of comments addressed challenges or barriers to making change. One was about colleagues being reluctant to reduce flying, or travel in general, as a way to cut one’s carbon footprint. The second referred to uncertainty about how to influence internal discussions on software development infrastructure – highlighting the challenge of finding the right path to the right people.

An interesting comment was made in relation to raising environmental concerns and advocating the Toolkit:

“Shared the toolkit with wider professional network at an event at which environmentally conscious and sustainable practices were raised without prompting. Toolkit was well received with expressions of relief that others are thinking along these lines and taking practical steps to help progress the agenda.”

And finally, an excellent point about the energy-intensive use of ChatGPT (or other LLMs), which was covered at the workshop:

“The thing that has stayed with me is what was said about water consumption needed to cool the supercomputers - how every time you run one of those Chat GPT (or equivalent) queries it is the equivalent of throwing a litre of water out the window, and that Microsoft's water use has gone up 30%. I've now been saying this every time someone tells me to use one of these GPT searches. To be honest it has put me off using them completely.”

In summary

The DHCC workshop at the British Library was a great success, underscoring the importance of sustainability in digital humanities, digital projects and digital working. By leveraging the DHCC toolkit, we have taken important steps toward making our digital practices more environmentally responsible, and spreading the word across internal and external networks. Moving forward, we will continue to build on this momentum, fostering a culture of sustainability and empowering our staff to make informed, climate-friendly decisions.

Thank you to workshop contributors, organisers and helpers:

James Baker, Joely Fake, Maja Maricevic, Catherine Ross, Andy Rackley, Jez Cope, Jenny Basford, Graeme Bentley, Stephen White, Bianca Miranda Cardoso, Sarah Kirk-Browne, Andrea Deri, and Deirdre Sullivan.

 

18 March 2024

Handwritten Text Recognition of the Dunhuang manuscripts: the challenges of machine learning on ancient Chinese texts

This blog post is by Peter Smith, DPhil Student at the Faculty of Asian and Middle Eastern Studies, University of Oxford

 

Introduction

The study of writing and literature has been transformed by the mass transcription of printed materials, aided significantly by the use of Optical Character Recognition (OCR). This has enabled textual analysis through a growing array of digital techniques, ranging from simple word searches in a text to linguistic analysis of large corpora – the possibilities are yet to be fully explored. However, printed materials are only one expression of the written word and tend to be more representative of certain types of writing. These may be shaped by efforts to standardise spelling or character variants, they may use more formal or literary styles of language, and they are often edited and polished with great care. They will never reveal the great, messy diversity of features that occur in writings produced by the human hand. What of the personal letters and documents, poems and essays scribbled on paper with no intention of distribution; the unpublished drafts of a major literary work; or manuscript editions of various classics that, before the use of print, were the sole means of preserving ancient writings and handing them on to future generations? These are also a rich resource for exploring past lives and events or expressions of literary culture.

The study of handwritten materials is not new but, until recently, the possibilities for analysing them using digital tools have been quite limited. With the advent of Handwritten Text Recognition (HTR) the picture is starting to change. HTR applications such as Transkribus and eScriptorium are capable of learning to transcribe a broad range of scripts in multiple languages. As the potential of these platforms develops, large collections of manuscripts can be automatically transcribed and consequently explored using digital tools. Institutions such as the British Library are doing much to encourage this process and improve the accessibility of the transcribed works for academic research and the general interest of the public. My recent role in an HTR project at the Library represents one small step in this process and here I hope to provide a glimpse behind the scenes and a look at some of the challenges of developing HTR.

As a PhD student exploring classical Chinese texts, I was delighted to find a placement at the British Library working on HTR of historical Chinese manuscripts. This project proceeded under the guidance of my British Library supervisors Dr Adi Keinan-Schoonbaert and Mélodie Doumy. I was also provided with support and expertise from outside of the Library: Colin Brisson is part of a group working on Chinese Historical documents Automatic Transcription (CHAT). They have already gathered and developed preliminary models for processing handwritten Chinese with the open source HTR application eScriptorium. I worked with Colin to train the software further using materials from the British Library. These were drawn entirely from the fabulous collection of manuscripts from Dunhuang, China, which date back to the Tang dynasty (618–907 CE) and beyond. Examples of these can be seen below, along with reference numbers for each item, and the originals can be viewed on the new website of the International Dunhuang Programme. Some of these texts were written with great care in standard Chinese scripts and are very well preserved. Others are much more messy: cursive scripts, irregular layouts, character corrections, and margin notes are all common features of handwritten work. The writing materials themselves may be stained, torn, or eaten by animals, resulting in missing or illegible text. All these issues have the potential to mislead the ‘intelligence’ of a machine. To overcome such challenges the software requires data – multiple examples of the diverse elements it might encounter and instruction as to how they should be understood.

The challenges encountered in my work on HTR can be examined in three broad categories, reflecting three steps in the HTR process of eScriptorium: image binarisation, layout segmentation, and text recognition.

 

Image binarisation

The first task in processing an image is to reduce its complexity, to remove any information that is not relevant to the output required. One way of doing this is image binarisation, taking a colour image and using an algorithm to strip it of hue and brightness values so that only black and white pixels remain. This was achieved using a binarisation model developed by Colin Brisson and his partners. My role in this stage was to observe the results of the process and identify strengths and weaknesses in the current model. These break down into three different categories: capturing details, stained or discoloured paper, and colour and density of ink.
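
As a rough illustration of what binarisation does, the sketch below applies the classic Otsu global threshold with scikit-image. This is only a stand-in for the idea: the project used a trained binarisation model in eScriptorium rather than a fixed threshold, and the file name here is a hypothetical placeholder.

```python
# Illustrative only: classical binarisation with Otsu's global threshold.
# The project itself used a trained binarisation model in eScriptorium,
# not a fixed threshold like this.
import numpy as np
from skimage import io, color, filters

def binarise(path: str) -> np.ndarray:
    """Return a boolean ink mask (True = dark ink) for a colour manuscript image."""
    image = io.imread(path)
    grey = color.rgb2gray(image[..., :3])       # discard hue, keep luminance
    threshold = filters.threshold_otsu(grey)    # global cut-off from the grey histogram
    return grey < threshold

# Hypothetical usage (the file name is a placeholder, not a real download):
# ink = binarise("S3011_recto_23.jpg")
# io.imsave("S3011_recto_23_bw.png", (~ink * 255).astype(np.uint8))
```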

1. Capturing details

In the process of distinguishing the brushstrokes of characters from other random marks on the paper, it is perhaps inevitable that some thin or faint lines – occurring as a feature of the handwritten text or through deterioration over time – might be lost during binarisation. Typically the binarisation model does very well in picking them out, as seen in figure 1:

Fig 1. Good retention of thin lines (S.3011, recto image 23)

 

While problems with faint strokes are understandable, it was surprising to find that loss of detail was also an issue in somewhat thicker lines. I wasn’t able to determine the cause of this but it occurred in more than one image. See figures 2 and 3:

Fig 2. Loss of detail in thick lines (S.3011, recto image 23)

 

Fig 3. Loss of detail in thick lines (S.3011, recto image 23)

 

2. Stained and discoloured paper

Where paper has darkened over time, the contrast between ink and background is diminished and during binarisation some writing may be entirely removed along with the dark colours of the paper. Although I encountered this occasionally, unless the background was really dark the binarisation model did well. One notable success is its ability to remove the dark colours of partially stained sections. This can be seen in figure 4, where a dark stain is removed while a good amount of detail is retained in the written characters.

Fig 4. Good retention of character detail on heavily stained paper (S.2200, recto image 6)

 

3. Colour and density of ink

The majority of manuscripts are written in black ink, ideal for creating good contrast with most background colourations. In some places however, text may be written with less concentrated ink, resulting in greyer tones that are not so easy to distinguish from the paper. The binarisation model can identify these correctly but sometimes it fails to distinguish them from the other random markings and colour variations that can be found in the paper of ancient manuscripts. Of particular interest is the use of red ink, which is often indicative of later annotations in the margins or between lines, or used for the addition of punctuation. The current binarisation model will sometimes ignore red ink if it is very faint but in most cases it identifies it very well. In one impressive example, shown in figure 5, it identified the red text while removing larger red marks used to highlight other characters written in black ink, demonstrating an ability to distinguish between semantic and less significant information.

Fig 5. Effective retention of red characters and removal of large red marks (S.2200, recto image 7)
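
For illustration only, the naive rule below picks out strongly red pixels by hue and saturation. The project's model is a trained network and does something far more nuanced, as figure 5 shows; this is just a hand-written approximation of "finding red ink", with a hypothetical file name.

```python
# Naive, rule-based sketch of spotting red ink by hue and saturation.
# Purely illustrative: the project's binarisation model is a trained network,
# and the file name below is a placeholder.
import numpy as np
from skimage import io, color

def red_ink_mask(path: str) -> np.ndarray:
    """Return a boolean mask of strongly red pixels in a manuscript image."""
    hsv = color.rgb2hsv(io.imread(path)[..., :3])
    hue, sat, val = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    # Red hues sit near 0 (wrapping around near 1); demand some saturation so
    # that brownish paper tones are not picked up as well.
    return ((hue < 0.05) | (hue > 0.95)) & (sat > 0.35) & (val > 0.2)

# mask = red_ink_mask("S2200_recto_7.jpg")
# print(f"red pixels: {mask.mean():.2%} of the image")
```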

 

In summary, the examples above show that the current binarisation model is already very effective at eliminating unwanted background colours and stains while preserving most of the important character detail. Its response to red ink illustrates a capacity for nuanced analysis. It does not treat every red pixel in the same way, but determines whether to keep it or remove it according to the context. There is clearly room for further training and refinement of the model but it already produces materials that are quite suitable for the next stages of the HTR process.

 

Layout segmentation

Segmentation defines the different regions of a digitised manuscript and the type of content they contain, either text or image. Lines are drawn around blocks of text to establish a text region, and for many manuscripts there is just one per image. Anything outside of the marked regions will simply be ignored by the software. On occasion, additional regions might be used to distinguish writings in the margins of the manuscript. Finally, within each text region the lines of text must also be clearly marked. Once the locations of the lines are established, each line can be assigned a particular type. In this project the options include ‘default’, ‘double line’, and ‘other’ – the purpose of these will be explored below.

All of this work can be automated in eScriptorium using a segmentation model. However, when it comes to analysing Chinese manuscripts, this model was the least developed component in the eScriptorium HTR process and much of our work focused on developing its capabilities. My task was to run binarised images through the model and then manually correct any errors. Figure 6 shows the eScriptorium interface and the initial results produced by the segmentation model. Vertical sections of text are marked with a purple line and the endings of each section are indicated with a horizontal pink line.

Fig 6. Initial results of the segmentation model section showing multiple errors. The text is the Zhuangzi Commentary by Guo Xiang (S.1603)

 

This example shows that the segmentation model is very good at positioning a line in the centre of a vertical column of text. Frequently, however, single lines of text are marked as a sequence of separate lines while other lines of text are completely ignored. The correct output, achieved through manual segmentation, is shown in figure 7. Every line is marked from beginning to end with no omissions or inappropriate breaks.

Fig 7. Results of manual segmentation showing the text region (the blue rectangle) and the single and double lines of text (S.1603)

 

Once the lines of a text are marked, line masks can be generated automatically, defining the area of text around each line. Masks are needed to show the transcription model (discussed below) exactly where it should look when attempting to match images on the page to digital characters. The example in figure 8 shows that the results of the masking process are almost perfect, encompassing every Chinese character without overlapping other lines.

Fig 8. Line masks outline the area of text associated with each line (S.1603)
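
One simple way to picture mask generation is as buffering a baseline into a polygon that covers the surrounding characters, sketched below with shapely. This is a generic geometric illustration under that assumption, not eScriptorium's actual masking algorithm, and the coordinates and buffer width are invented.

```python
# Generic geometric illustration of a line mask: buffer a marked baseline into a
# polygon covering the surrounding characters. Not eScriptorium's algorithm;
# coordinates and the buffer width are invented.
from shapely.geometry import LineString

baseline = LineString([(400, 120), (402, 980)])   # a vertical column of text, in pixels
mask_polygon = baseline.buffer(28)                # ~half a character width on each side

print(mask_polygon.bounds)   # bounding box of the masked area
```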

 

The main challenge with developing a good segmentation model is that manuscripts in the Dunhuang collection have so much variation in layout. Large and small characters mix together in different ways and the distribution of lines and characters can vary considerably. When selecting material for this project I picked a range of standard layouts. This provided some degree of variation but also contained enough repetition for the training to be effective. For example, the manuscript shown above in figures 6–8 combines a classical text written in large characters interspersed with double lines of commentary in smaller writing – in this case, the Zhuangzi Commentary by Guo Xiang. The large text is assigned the ‘default’ line type while the smaller lines of commentary are marked as ‘double line’ text. There is also an ‘other’ line type which can be applied to anything that isn’t part of the main text – margin notes are one example. Line types do not affect how characters are transcribed but they can be used to determine how different sections of text relate to each other and how they are assembled and formatted in the final output files.
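
To make the role of line types concrete, here is a small, hypothetical sketch of how typed lines might be assembled into an output text: ‘default’ lines form the main text, ‘double line’ entries are flagged as commentary, and ‘other’ lines are set aside as annotations. The data structure and field names are invented for illustration and are not eScriptorium's export format.

```python
# Hypothetical sketch of using line types when assembling output; the field
# names and output format are invented, not eScriptorium's export structure.
from dataclasses import dataclass

@dataclass
class Line:
    text: str        # transcribed characters for one segmented line
    line_type: str   # 'default', 'double line' or 'other'

def assemble(lines: list[Line]) -> str:
    """Keep main text and commentary in reading order; set annotations aside."""
    body, annotations = [], []
    for line in lines:
        if line.line_type == "default":
            body.append(line.text)
        elif line.line_type == "double line":
            body.append(f"[commentary] {line.text}")
        else:                      # 'other': margin notes, corrections, later additions
            annotations.append(line.text)
    return "\n".join(body + ["", "-- annotations --"] + annotations)

print(assemble([
    Line("main text line", "default"),
    Line("interlinear commentary", "double line"),
    Line("margin note", "other"),
]))
```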

Fig 9. A section from the Lotus Sūtra with a text region, lines of prose, and lines of verse clearly marked (Or8210/S.1338)

 

Figures 8 and 9, above, represent standard layouts used in the writing of a text but manuscripts contain many elements that are more random. Of these, inter-line annotations are a good example. They are typically added by a later hand, offering comments on a particular character or line of text. Annotations might be as short as a single character (figure 10) or could be a much longer comment squeezed in between the lines of text (figure 11). In such cases these additions can be distinguished from the main text by being labelled with the ‘other’ line type.

Fig 10. Single character annotation in S.3011, recto image 14 (left) and a longer annotation in S.5556, recto image 4 (right)

 

Fig 11. A comment in red ink inserted between two lines of text (S.2200, recto image 5)

 

Other occasional features include corrections to the text. These might be made by the original scribe or by a later hand. In such cases one character may be blotted out and a replacement added to the side, as seen in figure 12. For the reader, these should be understood as part of the text itself but for the segmentation model they appear similar or identical to annotations. For the purpose of segmentation training any irregular features like this are identified using the ‘other’ line type.

Fig 12. Character correction in S.3011, recto image 23.

 

As the examples above show, segmentation presents many challenges. Even the standard features of common layouts offer a degree of variation and in some manuscripts irregularities abound. However, work done on this project has now been used for further training of the segmentation model and reports are promising. The model appears capable of learning quickly, even from relatively small data sets. As the process improves, time spent using and training the model offers increasing returns. Even if some errors remain, manual correction is always possible and segmented images can pass through to the final stage of text recognition.

 

Text recognition

Although transcription is the ultimate aim of this process, it consumed less of my time on the project, so I will keep this section relatively brief. Fortunately, this is another stage where the available model works very well. It had previously been trained on other print and manuscript collections, so a well-established vocabulary set was in place, capable of recognising many of the characters found in historical writings. Dealing with handwritten text is inevitably a greater challenge for a transcription model, but my selection of manuscripts included several carefully written texts. I felt there was a good chance of success and was very keen to give it a go, hoping I might end up with some usable transcriptions of these works. Once the transcription model had been run, I inspected the first page using eScriptorium’s correction interface, as illustrated in figure 13.

Fig 13. Comparison of image and transcription in eScriptorium’s correction interface.

 

The interface presents a single line from the scanned image alongside the digitally transcribed text, allowing me to check each character and amend any errors. I quickly scanned the first few lines hoping I would find something other than random symbols – I was not disappointed! The results weren’t perfect, of course, but one or two lines came through with no errors at all, and generally the character error rate seems very low. After careful correction of the errors that remained and some additional work on the reading order of the lines, I was able to export one complete manuscript transcription, bringing the whole process to a satisfying conclusion.
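
For readers unfamiliar with the metric, character error rate is usually computed as the edit distance between the automatic transcription and a corrected reference, divided by the reference length. The sketch below shows that calculation; the example strings are placeholders rather than project data.

```python
# Sketch of a character error rate (CER) calculation: edit distance between the
# automatic transcription and a corrected reference, divided by reference length.
# The example strings are placeholders, not project data.
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    return levenshtein(prediction, reference) / max(len(reference), 1)

print(cer("transcribed line", "transcribed 1ine"))   # one substitution -> CER of 1/16
```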

 

Final thoughts

Naturally there is still some work to be done. All the models would benefit from further refinement and the segmentation model in particular will require training on a broader range of layouts before it can handle the great diversity of the Dunhuang collection. Hopefully future projects will allow more of these manuscripts to be used in the training of eScriptorium so that a robust HTR process can be established. I look forward to further developments and, for now, am very grateful for the chance I’ve had to work alongside my fabulous colleagues at the British Library and play some small role in this work.

 

14 September 2023

What's the future of crowdsourcing in cultural heritage?

The short version: crowdsourcing in cultural heritage is an exciting field, rich in opportunities for collaborative, interdisciplinary research and practice. It includes online volunteering, citizen science, citizen history, digital public participation, community co-production, and, increasingly, human computation and other systems that will change how participants relate to digital cultural heritage. New technologies for image labelling, text transcription and natural language processing, plus trends in organisations and societies at large, mean constantly changing challenges (and potential). Our white paper is an attempt to make recommendations for funders, organisations and practitioners in the near and distant future. You can let us know what we got right, and what we could improve, by commenting on Recommendations, Challenges and Opportunities for the Future of Crowdsourcing in Cultural Heritage: a White Paper.

The longer version: The Collective Wisdom project was funded by an AHRC networking grant to bring experts from the UK and the US together to document the state of the art in designing, managing and integrating crowdsourcing activities, and to look ahead to future challenges and unresolved issues that could be addressed by larger, longer-term collaboration on methods for digitally-enabled participation.

Our open access Collective Wisdom Handbook: perspectives on crowdsourcing in cultural heritage is the first outcome of the project; our expert workshops were a second.

Mia (me) and Sam Blickhan launched our White Paper for comment on PubPub at the Digital Humanities 2023 conference in Graz, Austria, in July this year, with Meghan Ferriter attending remotely. Our short paper abstract and DH2023 slides are online at Zenodo.

So - what's the future of crowdsourcing in cultural heritage? Head on over to Recommendations, Challenges and Opportunities for the Future of Crowdsourcing in Cultural Heritage: a White Paper and let us know what you think! You've got until the end of September…

You can also read our earlier post on 'community review' for a sense of the feedback we're after - in short, what resonates, what needs tweaking, what examples could we include?

To whet your appetite, here's a preview of our five recommendations. (To find out why we make those recommendations, you'll have to read the White Paper):

  • Infrastructure: Platforms need sustainability. Funding should not always be tied to novelty, but should also support the maintenance, uptake and reuse of well-used tools.
  • Evidencing and Evaluation: Help create an evaluation toolkit for cultural heritage crowdsourcing projects; provide ‘recipes’ for measuring different kinds of success. Shift thinking about value from output/scale/product to include impact on participants' and community well-being.
  • Skills and Competencies: Help create a self-guided skills inventory assessment resource, tool, or worksheet to support skills assessment, and develop workshops to support their integrity and adoption.
  • Communities of Practice: Fund informal meetups, low-cost conferences, peer review panels, and other opportunities for creating and extending community. They should have an international reach, e.g. beyond the UK-US limitations of the initial Collective Wisdom project funding.
  • Incorporating Emergent Technologies and Methods: Fund educational resources and workshops to help the field understand opportunities, and anticipate the consequences of proposed technologies.

What have we missed? Which points do you want to boost? (For example, we discovered how many of our points apply to digital scholarship projects in general). You can '+1' on points that resonate with you, suggest changes to wording, ask questions, provide examples and references, or (constructively, please) challenge our arguments. Our funding only supported participants from the UK and US, so we're very keen to hear from folk from the rest of the world.

04 September 2023

ICDAR 2023 Conference Impressions

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected].

 

Last week I came back from my very first ICDAR conference, inspired and energised for things to come! The International Conference on Document Analysis and Recognition (ICDAR) is the main international event for scientists and practitioners involved in document analysis and recognition. Its 17th edition was held in San José, California, 21-26 August 2023.

ICDAR 2023 featured a three-day conference, including several competitions to challenge the field, as well as post-conference workshops and tutorials. All conference papers were made available as conference proceedings with Springer. 155 submissions were selected for inclusion into the scientific programme of ICDAR 2023, out of which 55 were delivered as oral presentations, and 100 as posters. The conference also teamed up with the International Journal of Document Analysis and Recognition (IJDAR) for a special journal track. 13 papers were accepted and published in a special issue entitled “Advanced Topics of Document Analysis and Recognition,” and were included as oral presentations in the conference programme. Do have a look at the programme booklet for more information!

ICDAR 2023 Logo

Each conference day included a thought-provoking keynote talk. The first one, by Marti Hearst, Professor and Interim Dean of the UC Berkeley School of Information, was entitled “A First Look at LLMs Applied to Scientific Documents.” I learned about three platforms using Natural Language Processing (NLP) methods on PDF documents: ScholarPhi, Paper Plain, and SCIM. These projects help people read academic scientific publications, for example by providing definitions for mathematical notations or generating glossaries for nonce words (e.g. acronyms, symbols, jargon terms); make medical research more accessible through simplified summaries and Q&A; and classify key passages in papers to enable quick and intelligent skimming.

The second keynote talk, “Enabling the Document Experiences of the Future,” was by Vlad Morariu, Senior Research Scientist at Adobe Research. Vlad addressed the need for human-document interaction and took us through some future document experiences: PDFs that re-flow for mobile devices, documents that read themselves, and conversational functionalities such as asking questions and receiving answers. Enabling this type of ultra-responsive document relies on methods such as structural element detection, page layout understanding, and semantic connections.

The third and final keynote talk was by Seiichi Uchida, Distinguished Professor and Senior Vice President, Kyushu University, Japan. In his talk, “What Are Letters?,” Seiichi took us through the four main functions of letters and text: message (transmission of verbalised information), label (disambiguation of objects and environments), design (giving nonverbal information, such as an impression), and code (readability under various noises and deformations). He invited us to contemplate how our lives are affected by the texts around us, and how we could analyse the correlation between our behaviour and the texts that we read.

Prof Seiichi Uchida giving his keynote talk on “What Are Letters?”

When it came to papers submitted for review by the conference committee, the most prominent topic represented in those submissions was handwriting recognition, with a growing number of papers specifically tackling historical documents. Other submission topics included Graphics Recognition, Natural Language Processing for Documents (D-NLP), Applications (including for medical, legal, and business documents), and other types of Document Analysis and Recognition topics (DAR).

Screenshot of a slide showing the main submission topics for ICDAR 2023

Some of the papers that I attended tackled Named Entity Recognition (NER) evaluation methods and genealogical information extraction; papers dealing with Document Understanding, e.g. identifying the internal structure of documents and understanding the relations between different entities; papers on Text and Document Recognition, such as looking into a model for multilingual OCR; and papers looking into Graphics, especially the recognition of table structure and content, as well as extracting data from structural diagrams, for example in financial documents, or flowchart recognition. Papers on Handwritten Text Recognition (HTR) dealt with methods for Writer Retrieval, i.e. identifying documents likely written by specific authors, the creation of generic models, text line detection, and more.

The conference included two poster sessions, featuring an incredibly rich array of poster presentations, as well as doctoral consortia. One of my favourite posters was presented by Mirjam Cuper, Data Scientist at the National Library of the Netherlands (KB), entitled “Unraveling confidence: examining confidence scores as proxy for OCR quality.” Together with colleagues Corine van Dongen and Tineke Koster, she looked into confidence scores provided by OCR engines, which indicate the level of certainty with which a word or character was recognised. However, other factors are at play when measuring OCR quality – you can watch a ‘teaser’ video for this poster.
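
As a toy illustration of the "proxy" question, the sketch below correlates per-page mean confidence with measured accuracy. The numbers are invented and this is not the poster's method or data, just the general idea; it uses statistics.correlation, which requires Python 3.10 or later.

```python
# Toy illustration (invented numbers): how closely does an OCR engine's mean
# word confidence track the accuracy measured against ground truth?
# statistics.correlation requires Python 3.10+.
import statistics

pages = [
    # (mean word confidence reported by the engine, accuracy against ground truth)
    (0.97, 0.99), (0.91, 0.95), (0.83, 0.80), (0.88, 0.72), (0.95, 0.97),
]
confidences = [c for c, _ in pages]
accuracies = [a for _, a in pages]

r = statistics.correlation(confidences, accuracies)   # Pearson's r
print(f"confidence vs accuracy correlation: {r:.2f}")
```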

Conference participants at one of the poster sessions

As mentioned, the conference was followed by three days of tutorials and workshops. I enjoyed the tutorial on Computational Analysis of Historical Documents, co-led by Dr Isabelle Marthot-Santaniello (University of Basel, Switzerland) and Dr Hussein Adnan Mohammed (University of Hamburg, Germany). Presentations focused on the unique challenges, difficulties, and opportunities inherent in working with different types of historical documents. The distinct difficulties posed by historical handwritten manuscripts and ancient artifacts call for an interdisciplinary strategy and state-of-the-art technologies, a fusion that is producing exciting and novel advances in this area. The presentations were interwoven with great questions and a rich discussion, indicative of the audience’s enthusiasm. This tutorial was appropriately followed by a workshop dedicated to Computational Palaeography (IWCP).

I especially looked forward to the next day’s workshop, the 7th edition of Historical Document Imaging and Processing (HIP’23). It was all about making documents accessible in digital libraries, looking at methods addressing OCR/HTR of historical documents, information extraction, writer identification, script transliteration, virtual reconstruction, and much more. This day-long workshop featured papers in four sessions: HTR and Multi-Modal Methods, Classics, Segmentation & Layout Analysis, and Language Technologies & Classification. One of my favourite presentations was by Prof Apostolos Antonacopoulos, talking about his work with Christian Clausner and Stefan Pletschacher on “NAME – A Rich XML Format for Named Entity and Relation Tagging.” Their NAME XML tackles the need to represent named entities in rich and complex scenarios: tags can be overlapping and nested, character-precise, multi-part, and may span non-consecutive words or tokens. This flexible and extensible format captures the relationships between entities and makes annotations interoperable, usable alongside other information (images and other formats), and straightforward to validate.

Prof Apostolos Antonacopoulos talking about “NAME – A Rich XML Format for Named Entity and Relation Tagging”

I’ve greatly enjoyed the conference and its wonderful community, meeting old colleagues and making new friends. Until next time!

 

28 October 2022

Learn more about Living with Machines at events this winter

Digital Curator, and Living with Machines Co-Investigator Dr Mia Ridge writes…

The Living with Machines research project is a collaboration between the British Library, The Alan Turing Institute and various partner universities. Our free exhibition at Leeds City Museum, Living with Machines: Human stories from the industrial age, opened at the end of July. Read on for information about adult events around the exhibition…

Museum Late: Living with Machines, Thursday 24 November, 2022

6 - 10pm Leeds City Museum • £5, booking essential https://my.leedstickethub.co.uk/19101

The first ever Museum Late at Leeds City Museum! Come along to experience the museum after hours with music, weaving, informal workshops, chats with curators, and a pub quiz. Local food and drinks in the main hall.

Full programme: https://museumsandgalleries.leeds.gov.uk/events/leeds-city-museum/museum-late-living-with-machines/

Tickets: https://my.leedstickethub.co.uk/19101

Study Day: Living with Machines, Friday December 2, 2022

10:00 am - 4:00 pm Online • Free but booking essential: https://my.leedstickethub.co.uk/18775

A unique opportunity to hear experts in the field illuminate key themes from the exhibition and learn how exhibition co-curators found stories and objects to represent research work in AI and digital history. This study day is online via Zoom so that you can attend from anywhere.

Full programme: https://museumsandgalleries.leeds.gov.uk/events/leeds-city-museum/living-with-machines-study-day/

Tickets: https://my.leedstickethub.co.uk/18775

Living with Machines Wikithon, Saturday January 7, 2023

1 – 4:30pm Leeds City Museum • Free but booking essential: https://my.leedstickethub.co.uk/19104

Ever wanted to try editing Wikipedia, but haven't known where to start? Join us for a session with our brilliant Wikipedian-in-residence to help improve Wikipedia’s coverage of local lives and topics at an editathon themed around our exhibition. 

Everyone is welcome. You won’t require any previous Wiki experience but please bring your own laptop for this event. Find out more, including how you can prepare, in my blog post on the Living with Machines site, Help fill gaps in Wikipedia: our Leeds editathon.

The exhibition closes the next day, so it really is your last chance to see it!

Full programme: https://museumsandgalleries.leeds.gov.uk/events/leeds-city-museum/living-with-machines-wikithon-exploring-the-margins/

Tickets: https://my.leedstickethub.co.uk/19104

If you just want to try out something more hands on with textiles inspired by the exhibition, there's also a Peg Loom Weaving Workshop, and not one but two Christmas Wreath Workshops.

You can find out more about our exhibition on the Living with Machines website.


20 September 2022

Learn more about what AI means for us at Living with Machines events this autumn

Digital Curator, and Living with Machines Co-Investigator Dr Mia Ridge writes…

The Living with Machines research project is a collaboration between the British Library, The Alan Turing Institute and various partner universities. Our free exhibition at Leeds City Museum, Living with Machines: Human stories from the industrial age, opened at the end of July. Read on for information about adult events around the exhibition…

AI evening panels and workshop, September 2022

We’ve put together some great panels with expert speakers whose thought-provoking examples and questions are guaranteed to get you thinking about the impact of AI. You'll have a chance to ask your own questions in the Q&A, and to mingle with other attendees over drinks.

We’ve also collaborated with AI Tech North to offer an exclusive workshop looking at the practical aspects of ethics in AI. If you’re using or considering AI-based services or tools, this might be for you. Our events are also part of the jam-packed programme of the Leeds Digital Festival #LeedsDigi22, where we’re in great company.

The role of AI in Creative and Cultural Industries

Thu, Sep 22, 17:30 – 19:45 BST

Leeds City Museum • Free but booking required

https://www.eventbrite.com/e/the-role-of-ai-in-creative-and-cultural-industries-tickets-395003043737

How will AI change what we wear, the TV and films we watch, what we read? 

Join our fabulous Chair Zillah Watson (independent consultant, ex-BBC) and panellists Rebecca O’Higgins (Founder KI-AH-NA), Laura Ellis (Head of Technology Forecasting, BBC) and Maja Maricevic (Head of Higher Education and Science, British Library) for an evening that'll help you understand the future of these industries for audiences and professionals alike.

Maja's written a blog post on The role of AI in creative and cultural industries with more background on this event.

 

Workshop: Developing ethical and fair AI for society and business

Thu, Sep 29, 13:30 - 17:00 BST

Leeds City Museum • Free but booking required

https://www.eventbrite.com/e/workshop-developing-ethical-and-fair-ai-for-society-and-business-tickets-400345623537

 

Panel: Developing ethical and fair AI for society and business

Thu, Sep 29, 17:30 – 19:45 BST

Leeds City Museum • Free but booking required

https://www.eventbrite.com/e/panel-developing-ethical-and-fair-ai-for-society-and-business-tickets-395020706567

AI is coming, so how do we live and work with it? What can we all do to develop ethical approaches to AI to help ensure a more equal and just society? 

Our expert Chair, Timandra Harkness, and panellists Sherin Mathew (Founder & CEO of AI Tech UK), Robbie Stamp (author and CEO at Bioss International), Keely Crockett (Professor in Computational Intelligence, Manchester Metropolitan University) and Andrew Dyson (Global Co-Chair of DLA Piper’s Data Protection, Privacy and Security Group) will present a range of perspectives on this important topic.

If you missed our autumn events, we also have a study day and Wikipedia editathon this winter. You can find out more about our exhibition on the Living with Machines website.
