Digital scholarship blog

123 posts categorized "Research collaboration"

29 November 2022

My AHRC-RLUK Professional Practice Fellowship: Four months on

In August 2022 I started work on a project to investigate the legacies of curatorial voice in the descriptions of incunabula collections at the British Library and their future reuse. My research is funded by the collaborative AHRC-RLUK Professional Practice Fellowship Scheme for academic and research libraries which launched in 2021. As part of the first cohort of ten Fellows I embraced this opportunity to engage in practitioner research that benefits my institution and the wider sector, and to promote the role of library professionals as important research partners.

The overall aim of my Fellowship is to demonstrate new ways of working with digitised catalogues that would also improve the discoverability and usability of the collections they describe. The focus of my research is the Catalogue of books printed in the 15th century now in the British Museum (or BMC), published between 1908 and 2007, which describes over 12,700 volumes from the British Library incunabula collection. By using computational approaches and tools with the data derived from the catalogue I will gain new insights into and interpretations of this valuable resource and enable its reuse in contemporary online resources.

Titlepage to volume 2 of the Catalogue of books printed in the fifteenth century now in the British Museum, part 2, Germany, Eltvil-Trier
BMC volume 2 titlepage


This research idea was inspired by a recent collaboration with Dr James Baker, who is also my mentor for this Fellowship, and was further developed in conversations with Dr Karen Limper-Herz, Lead Curator for Incunabula, Adrian Edwards, Head of Printed Heritage Collections, and Alan Danskin, Collections Metadata Standards Manager, who support my research at the Library.

My Fellowship runs until July 2023 with Fridays being my main research days. I began by studying the history of the catalogue, its arrangement and the structure of the item descriptions and their relationship with different online resources. Overall, the main focus of this first phase has been on generating the text data required for the computational analysis and investigations into curatorial and cataloguing practice. This work involved new digitisation of the catalogue and a lot of experimentation using the Transkribus AI-powered platform, which proved best suited for improving the layout and text recognition for the digitised images. During the last two months I have hugely benefited from the expertise of my colleague Tom Derrick, as we worked together on creating the training data and building structure models for the incunabula catalogue images.

An image from Transkribus Lite showing a page from the catalogue with separate regions drawn around columns 1 and 2, and the text baselines highlighted in purple
Layout recognition output for pages with only two columns, including text baselines, viewed on Transkribus Lite

 

An image from Transkribus Lite showing a page from the catalogue alongside the text lines
Text recognition output after applying the model trained with annotations for 2 columns on the page, viewed on Transkribus Lite

 

An image from Transkribus Lite showing a page from the catalogue with separate regions drawn around 4 columns of text separated by a single text block
Layout recognition output for pages with mixed layout of single text block and text in columns, viewed on Transkribus Lite

Whilst the data preparation phase has taken longer than I had planned due to the varied layout of the catalogue, this has been an important part of the process as the project outcomes are dependent on using the best quality text data for the incunabula descriptions. The next phase of the research will involve the segmentation of the records and extraction of relevant information to use with a range of computational tools. I will report on the progress with this work and the next steps early next year. Watch this space and do get in touch if you would like to learn more about my research.

This blog post is by Dr Rossitza Atanassova, Digital Curator for Digitisation, British Library. She is on Twitter @RossiAtanassova and Mastodon @ratanass@glammr.us

28 October 2022

Learn more about Living with Machines at events this winter

Digital Curator, and Living with Machines Co-Investigator Dr Mia Ridge writes…

The Living with Machines research project is a collaboration between the British Library, The Alan Turing Institute and various partner universities. Our free exhibition at Leeds City Museum, Living with Machines: Human stories from the industrial age, opened at the end of July. Read on for information about adult events around the exhibition…

Museum Late: Living with Machines, Thursday 24 November, 2022

6 - 10pm Leeds City Museum • £5, booking essential https://my.leedstickethub.co.uk/19101

The first ever Museum Late at Leeds City Museum! Come along to experience the museum after hours with music, a pub quiz, weaving, informal workshops and chats with curators. Local food and drinks in the main hall.

Full programme: https://museumsandgalleries.leeds.gov.uk/events/leeds-city-museum/museum-late-living-with-machines/

Tickets: https://my.leedstickethub.co.uk/19101

Study Day: Living with Machines, Friday December 2, 2022

10:00 am - 4:00 pm Online • Free but booking essential: https://my.leedstickethub.co.uk/18775

A unique opportunity to hear experts in the field illuminate key themes from the exhibition and learn how exhibition co-curators found stories and objects to represent research work in AI and digital history. This study day is online via Zoom so that you can attend from anywhere.

Full programme: https://museumsandgalleries.leeds.gov.uk/events/leeds-city-museum/living-with-machines-study-day/

Tickets: https://my.leedstickethub.co.uk/18775

Living with Machines Wikithon, Saturday January 7, 2023

1 – 4:30pm Leeds City Museum • Free but booking essential: https://my.leedstickethub.co.uk/19104

Ever wanted to try editing Wikipedia, but haven't known where to start? Join us for a session with our brilliant Wikipedian-in-residence to help improve Wikipedia’s coverage of local lives and topics at an editathon themed around our exhibition. 

Everyone is welcome. You won’t require any previous Wiki experience but please bring your own laptop for this event. Find out more, including how you can prepare, in my blog post on the Living with Machines site, Help fill gaps in Wikipedia: our Leeds editathon.

The exhibition closes the next day, so it really is your last chance to see it!

Full programme: https://museumsandgalleries.leeds.gov.uk/events/leeds-city-museum/living-with-machines-wikithon-exploring-the-margins/

Tickets: https://my.leedstickethub.co.uk/19104

If you just want to try something more hands-on with textiles inspired by the exhibition, there's also a Peg Loom Weaving Workshop, and not one but two Christmas Wreath Workshops.

You can find out more about our exhibition on the Living with Machines website.

Lwm800x400

04 October 2022

Open and Engaged 2022: Climate research in GLAM, digital infrastructure and skills to open collections

As part of International Open Access Week, the British Library is delighted to host its annual Open and Engaged event online on Monday 24 October from 13:00 to 16:30 BST.

Since 2018 the British Library has organised the Open and Engaged Conference to coincide with International Open Access Week.

In line with this year’s #OAWeek theme, Open for Climate Justice, Open and Engaged will address intersections between cultural heritage and climate research through the use of collections, digital infrastructures and skills.

A range of speakers from cultural heritage and higher education institutions will shed light on the theme by answering these questions:

  • What role do library collections and historical datasets play in understanding the impact of climate change?
  • How can digital infrastructure be used for more equitable knowledge sharing?
  • What roles and skills are needed to make research from heritage organisations openly available?

We invite everyone interested in the topic to join us on the day by registering via this online form. Please see the programme below and note that it is subject to minor updates up until the event date.  

Programme – Monday 24 October – British Summer Time (UTC+1)

DOIs now available for each talk

13:00 – 13:10  Opening notes

13:10 – 13:15  Welcome remarks by Rachael Kotarski, the British Library

13:15 – 14:05  Climate research in cultural heritage - moderated by Maja Maricevic, the British Library

13:15 – 13:40 Climate change approach at the British Library. Maja Maricevic, the British Library

13:40 – 14:05 Climate justice at the Royal Botanic Garden Edinburgh. Lorna Mitchell, Royal Botanic Garden Edinburgh

14:05 – 14:30 Break

14:30 – 15:45  Opening up heritage research: Infrastructure and skills – moderated by Susan Miles, the British Library

14:30 – 14:55  “Forever or 5 years”: Sustainability planning for Digital Research Infrastructure for Arts and Humanities. Anna Maria Sichani, Digital Humanities Research Hub, School of Advanced Study, University of London

14:55 – 15:20  Shared Research Repository Service and competency framework for cultural heritage professionals. Jenny Basford and Ilkay Holt, the British Library

15:20 – 15:45 Valuing the breadth and depth of skills in the research library. Claire Knowles, University of Leeds

15:45 – 16:00 Closing remarks from Rachael Kotarski, The British Library

We encourage you to participate in discussion with other attendees and speakers by using the Twitter hashtag #OpenEngaged. If you have any questions, please contact us at openaccess@bl.uk.  

20 September 2022

Learn more about what AI means for us at Living with Machines events this autumn

Digital Curator, and Living with Machines Co-Investigator Dr Mia Ridge writes…

The Living with Machines research project is a collaboration between the British Library, The Alan Turing Institute and various partner universities. Our free exhibition at Leeds City Museum, Living with Machines: Human stories from the industrial age, opened at the end of July. Read on for information about adult events around the exhibition…

AI evening panels and workshop, September 2022

We’ve put together some great panels with expert speakers guaranteed to get you thinking about the impact of AI with their thought-provoking examples and questions. You'll have a chance to ask your own questions in the Q&A, and to mingle with other attendees over drinks.

We’ve also collaborated with AI Tech North to offer an exclusive workshop looking at the practical aspects of ethics in AI. If you’re using or considering AI-based services or tools, this might be for you. Our events are also part of the jam-packed programme of the Leeds Digital Festival #LeedsDigi22, where we’re in great company.

The role of AI in Creative and Cultural Industries

Thu, Sep 22, 17:30 – 19:45 BST

Leeds City Museum • Free but booking required

https://www.eventbrite.com/e/the-role-of-ai-in-creative-and-cultural-industries-tickets-395003043737

How will AI change what we wear, the TV and films we watch, what we read? 

Join our fabulous Chair Zillah Watson (independent consultant, ex-BBC) and panellists Rebecca O’Higgins (Founder KI-AH-NA), Laura Ellis (Head of Technology Forecasting, BBC) and Maja Maricevic, (Head of Higher Education and Science, British Library) for an evening that'll help you understand the future of these industries for audiences and professionals alike. 

Maja's written a blog post on The role of AI in creative and cultural industries with more background on this event.

 

Workshop: Developing ethical and fair AI for society and business

Thu, Sep 29, 13:30 - 17:00 BST

Leeds City Museum • Free but booking required

https://www.eventbrite.com/e/workshop-developing-ethical-and-fair-ai-for-society-and-business-tickets-400345623537

 

Panel: Developing ethical and fair AI for society and business

Thu, Sep 29, 17:30 – 19:45 BST

Leeds City Museum • Free but booking required

https://www.eventbrite.com/e/panel-developing-ethical-and-fair-ai-for-society-and-business-tickets-395020706567

AI is coming, so how do we live and work with it? What can we all do to develop ethical approaches to AI to help ensure a more equal and just society? 

Our expert Chair, Timandra Harkness, and panellists Sherin Mathew (Founder & CEO of AI Tech UK), Robbie Stamp (author and CEO at Bioss International), Keely Crockett (Professor in Computational Intelligence, Manchester Metropolitan University) and Andrew Dyson (Global Co-Chair of DLA Piper’s Data Protection, Privacy and Security Group) will present a range of perspectives on this important topic.

If you missed our autumn events, we also have a study day and Wikipedia editathon this winter. You can find out more about our exhibition on the Living with Machines website.

Lwm800x400

18 July 2022

UK Digital Comics: More of the same but different? [1]

This is a guest post by Linda Berube, an AHRC Collaborative Doctoral Partnership student based at the British Library and City, University of London. If you would like to know more about Linda's research, please do email her at Linda.Berube@city.ac.uk.

When I last wrote a post for the Digital Scholarship blog in 2020 (Berube, 2020), I was a fairly new PhD student, fresh out of the starting blocks, taking on the challenge of UK digital comics research.  My research involves an analysis of the systems and processes of UK digital comics publishing as a means of understanding how digital technology has affected, maybe transformed them. For this work, I have the considerable support of supervisors Ian Cooke and Stella Wisdom (British Library) and Ernesto Priego and Stephann Makri (Human-Computer Interaction Design Centre, City, University of London).

Little did I, or the rest of the world for that matter, know the transformations to daily life that the pandemic was about to bring. The impact was felt no less in the publishing sector, and certainly in comics publishing. Still, despite all the obstacles to meetings, people from traditional[2] large and small press publishers, media and video game companies publishing comics, as well as creators and self-publishers gave generously of their time to discuss comics with me. I am currently speaking with comics readers and observing their reading practices, again all via remote meetings. To all these people, this PhD student owes a debt of gratitude for their enthusiastic participation.

British Comics Publishing: It’s where we’re at

Digital technology has had a significant impact on British comics publishing, but not as pervasively as expected from initial prognostications by scholars and the comics press. Back in 2020, I observed:

  This particular point in time offers an excellent opportunity to consider the digital comics, and specifically UK, landscape. We seem to be past the initial enthusiasm for digital technologies when babies and bathwater were ejected with abandon (see McCloud 2000, for example), and probably still in the middle of a retrenchment, so to speak, of that enthusiasm (see Priego 2011 pp278-280, for example). (Berube, 2020).

But ‘retrenchment’ might be a strong word. According to my research findings to date, and in keeping with those of the broader publishing sector (Thompson, 2010; 2021), the comics publishing process has most definitely been ‘revolutionized’ by digital technology. All comics begin life as digital files until they are published in print. Even those creators who still draw by hand must convert their work to digital versions that can be sent to a publisher or uploaded to a website or publishing platform. And, while print comics have by no means been completely supplanted by digital comics (in fact a significant number of those interviewed voiced a preference for print), reading on digital devices (laptops, tablets, smartphones) has become popular enough for publishers to provide access through ebook and app technology. Even those publishers I interviewed who were most resistant to digital felt compelled ‘to dabble in digital comics’ (as one small press publisher put it) by at least providing PDF versions on Gumroad or some other storefront. The restrictions on print distribution and sales through bookstores during Covid lockdowns compelled some publishers not only to provide more access to digital versions but even to sell digital-exclusive versions, in other words comics offered only digitally.

Everywhere you look, a comic

The visibility of digital comics across sectors including health, economics, education, literacy and even the hard sciences was immediately obvious from a mapping exercise of UK comics publishers, producers and platforms, as well as from interviews. What this means is that comics (the creation and reading of them) are used to teach and to learn about multiple topics, including archiving (specifically UK Legal Deposit) (Figure 1) and anthropology (specifically Smartphones and Smart Ageing) (Figure 2):

Cartoon drawing of two people surrounded by comics and zines
Figure 1: Panel from 'The Legal Deposit and You', by Olivia Hicks (British Library, 2018). Reproduced with permission from the British Library.

 

Cartoon drawing of two women sitting on a sofa looking at and discussing content on a smartphone
Figure 2: Haapio-Kirk, L., Murariu, G., and Hahn, A. (artist) (2022) 'Beyond Anthropomorphism Palestine', Anthropology of Smartphones and Smart Ageing (ASSA) Blog. Based on Maya de Vries and Laila Abed Rabho’s research in Al-Quds (East Jerusalem). Available at: https://wwwdepts-live.ucl.ac.uk/anthropology/assa/discoveries/beyond-anthropomorphism/ . Reproduced with permission.

Moreover, comics in their incarnation as graphic novels have grabbed literary prizes, for example Jimmy Corrigan: the smartest kid on earth (Jonathan Cape, 2001) by Chris Ware won the Guardian First Book Award in 2001, and Sabrina (Granta, 2018) by Nick Drnaso was longlisted for the Man Booker Prize in 2018 (somewhat controversially, see Nally, 2018).

Just Like Reading a Book, But Not…

But by extending the definition of digital comics[3] to include graphic novels mostly produced as ebooks, the ‘same-ness’ of reading in print became evident over the course of interviews with publishers and creators. Publishing a comic in PDF format, whether on a website, on a publishing platform, or as a book, is simply the easiest, most cost-effective way to do it:

  We’re print first in our digital workflow—Outside of graphic novels, with other types of books we occasionally have the opportunity to work with the digital version as a consideration at the outset, in which case the tagging/classes are factored in at the beginning stages (a good example would be a recent straight-to-digital reflowable ebook). This is the exception though, and also does not apply to graphic novels, which are all print-led. (Interview with publisher, December 2020)

Traditional book publishers have not been the only ones taking up comics: gaming and media companies have acquired the rights to comics and to comics brands previously published in print. For more and different sectors, comics have become an increasingly attractive option, especially for their multimedia appeal. However, what these companies do with the comics is a mixture of the same, for instance remaining print-led as described in the comment above, and the different, for example conversion to interactive digital versions and apps with more functionality than the ebook format.

It's How You Read Them

Comics formatted especially for reading on apps, such as 2000 AD, ComiXology, and Marvel Unlimited, vary in the types of reading experience they offer. While some have retained the ‘multi-panel display’ experience of reading a print comic book, others have gone beyond the ‘reads like a book’ experience. ComiXology, a digital distribution platform for comics owned by Amazon, pioneered the ‘guided view’ technology now used by the likes of Marvel and DC, where readers view one panel at a time. Some of the comics readers I have interviewed refer to this as ‘the cinematic experience’: readers page through the comic one panel or scene at a time, as if watching it on film or TV.

These reading technologies do tend to work better on a tablet than on a smartphone. Yet the act of scrolling required to read webcomics on the WEBTOON app (and others, such as Tapas), designed to be read on smartphones, produces that same kind of ‘cinematic’ effect: the readers I have interviewed of comics on both the ComiXology and WEBTOON apps describe exactly the same experience, a build-up of “anticipation” and “tension” that leaves them “on the edge of my seat” as they page or scroll down to the next scene or panel. WEBTOON creators employ certain techniques to create that tension in the vertical format, for example the use of white space between panels: the more space, the more scrolling, the more “edge of the seat” experience. Major comics publishers have started creating ‘vertical’ (scrolling on phones) comics: Marvel launched its Infinity Comics to appeal to the smartphone webcomics reader.

So, it would seem that good old-fashioned comics pacing, combined with publishing through apps designed for digital devices, provides a different, but same, reading experience: a uniquely digital reading experience.

Same But Different: I’m still here

So, here I am, still a PhD student currently conducting research with comics readers, as part of my research and as part of a secondment with the BL supported by AHRC Additional Student Development funding. This additional funding has afforded me the opportunity to employ UX (user behaviour/experience) techniques with readers, primarily through conducting reading observation sessions and activities. I will be following up this blog with an update on this research as well as a call for participation into more reader research.

References 

Berube, L. (2020) ‘Not Just for Kids: UK Digital Comics, from creation to consumption’, British Library Digital Scholarship Blog, 24 August 2020. Available at: https://blogs.bl.uk/digital-scholarship/2020/08/not-just-for-kids-uk-digital-comics-from-creation-to-consumption.html

Drnaso, N. (2018) Sabrina. London, England: Granta Books.

McCloud, Scott (2000) Reinventing Comics: How Imagination and Technology Are Revolutionizing an Art Form.  New York, N.Y: Paradox Press. 

Nally, C. (2018) ‘Graphic Novels Are Novels: Why the Booker Prize Judges Were Right to Choose One for Its Longlist’, The Conversation, 26 July. Available at: https://theconversation.com/graphic-novels-are-novels-why-the-booker-prize-judges-were-right-to-choose-one-for-its-longlist-100562.

Priego, E. (2011) The Comic Book in the Age of Digital Reproduction. [Thesis] University College London. Available at: https://doi.org/10.6084/m9.figshare.754575.v4, pp278-280.

Ware, C. (2001) Jimmy Corrigan: the smartest kid on earth. London, England: Jonathan Cape.

Notes

[1] “More of the same but different”, a phrase used by a comics creator I interviewed in reference to what comics readers want to read.↩︎

[2] By ‘traditional’, I am referring to publishers who contract with comics creators to undertake the producing, publishing, distribution, selling of a comic, retaining rights for a certain period of time and paying the creator royalties. In my research, publishers who transacted business in this way included multinational and small press publishers. Self-publishing is where the creator owns all the rights and royalties, but also performs the production, publishing, distribution work, or pays for a third-party to do so. ↩︎

[3] For this research, digital comics include a diverse selection of what is produced electronically or online: webcomics, manga, applied comics, experimental comics, as well as graphic novels [ebooks].  I have omitted animation. ↩︎

20 April 2022

Importing images into Zooniverse with a IIIF manifest: introducing an experimental feature

Digital Curator Dr Mia Ridge shares news from a collaboration between the British Library and Zooniverse that means you can more easily create crowdsourcing projects with cultural heritage collections. There's a related blog post on Zooniverse, Fun with IIIF.

IIIF manifests - text files that tell software how to display images, sound or video files alongside metadata and other information about them - might not sound exciting, but by linking to them, you can view and annotate collections from around the world. The IIIF (International Image Interoperability Framework) standard makes images (or audio, video or 3D files) more re-usable - they can be displayed on another site alongside the original metadata and information provided by the source institution. If an institution updates a manifest - perhaps adding information from updated cataloguing or crowdsourcing - any sites that display that image automatically get the updated metadata.
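To make the 'text files that tell software how to display images' idea concrete, here is a minimal sketch in Python of what a IIIF Presentation 2.x manifest contains and how software walks it. The manifest below is invented and heavily pared down; real manifests carry far more detail:

```python
# A pared-down, invented IIIF Presentation 2.x manifest, for illustration only.
manifest = {
    "@context": "http://iiif.io/api/presentation/2/context.json",
    "label": "Example playbill volume",
    "metadata": [{"label": "Shelfmark", "value": "Example 123"}],
    "sequences": [{
        "canvases": [
            {"label": "page 1",
             "images": [{"resource": {"@id": "https://example.org/iiif/img1/full/full/0/default.jpg"}}]},
            {"label": "page 2",
             "images": [{"resource": {"@id": "https://example.org/iiif/img2/full/full/0/default.jpg"}}]},
        ]
    }],
}

# Each canvas describes one page: a human-readable label plus an image URL
# that any IIIF-aware viewer can fetch directly from the source institution.
pages = [
    (canvas["label"], canvas["images"][0]["resource"]["@id"])
    for canvas in manifest["sequences"][0]["canvases"]
]
print(pages)
```

Because the image URLs point back to the institution's own servers, a viewer anywhere can display the pages without making local copies, which is exactly what made manifests attractive for In the Spotlight.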

Playbill showing the title after other large text

We've posted before about how we used IIIF manifests as the basis for our In the Spotlight crowdsourced tasks on LibCrowds.com. Playbills are great candidates for crowdsourcing because they are hard to transcribe automatically, and the layout and information present varies a lot. Using IIIF meant that we could access images of playbills directly from the British Library servers without needing server space and extra processing to make local copies. You didn't need technical knowledge to copy a manifest address and add a new volume of playbills to In the Spotlight. This worked well for a couple of years, but over time we'd found it difficult to maintain bespoke software for LibCrowds.

When we started looking for alternatives, the Zooniverse platform was an obvious option. Zooniverse hosts dozens of historical or cultural heritage projects, and hundreds of citizen science projects. It has millions of volunteers, and a 'project builder' that means anyone can create a crowdsourcing project - for free! We'd already started using Zooniverse for other Library crowdsourcing projects such as Living with Machines, which showed us how powerful the platform can be for reaching potential volunteers. 

But that experience also showed us how complicated the process of getting images and metadata onto Zooniverse could be. Using Zooniverse for volumes of playbills for In the Spotlight would require some specialist knowledge. We'd need to download images from our servers, resize them, generate a 'manifest' list of images and metadata, then upload it all to Zooniverse; and repeat that for each of the dozens of volumes of digitised playbills.

Fast forward to summer 2021, when we had the opportunity to put a small amount of funding into some development work by Zooniverse. I'd already collaborated with Sam Blickhan at Zooniverse on the Collective Wisdom project, so it was easy to drop her a line and ask if they had any plans or interest in supporting IIIF. It turned out they had, but hadn't previously had the resources or an interested organisation to take it forward.

We came up with a brief outline of what the work needed to do, taking the ability to recreate some of the functionality of In the Spotlight on Zooniverse as a goal. Therefore, 'the ability to add subject sets via IIIF manifest links' was key. ('Subject set' is Zooniverse-speak for 'set of images or other media' that are the basis of crowdsourcing tasks.) And of course we wanted the ability to set up some crowdsourcing tasks with those items… The Zooniverse developer, Jim O'Donnell, shared his work in progress on GitHub, and I was very easily able to set up a test project and ask people to help create sample data for further testing. 

If you have a Zooniverse project and a IIIF address to hand, you can try out the import for yourself: add 'subject-sets/iiif?env=production' to your project builder URL. e.g. if your project is number #xxx then the URL to access the IIIF manifest import would be https://www.zooniverse.org/lab/xxx/subject-sets/iiif?env=production

Paste a manifest URL into the box. The platform parses the file to present a list of metadata fields, which you can flag as hidden or visible in the subject viewer (the public task interface). When you're happy, you can click a button to upload the manifest as a new subject set (like a folder of items), and your images are imported. (Don't worry if it says '0 subjects'.)
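Conceptually, an import like this maps each canvas in the manifest to a Zooniverse 'subject': a media location plus metadata. The sketch below illustrates that mapping in Python. The function and field names are my own, not the platform's actual internals, though the '#' prefix for volunteer-hidden metadata is the real Zooniverse convention:

```python
def canvases_to_subjects(manifest, hidden_fields=()):
    """Turn IIIF canvases into Zooniverse-style subject dicts.

    Metadata fields named in hidden_fields get a '#' prefix, the
    Zooniverse convention for hiding a field from volunteers.
    """
    subjects = []
    for canvas in manifest["sequences"][0]["canvases"]:
        metadata = {"label": canvas.get("label", "")}
        # Copy manifest-level metadata onto every subject.
        for pair in manifest.get("metadata", []):
            key = pair["label"]
            if key in hidden_fields:
                key = "#" + key
            metadata[key] = pair["value"]
        subjects.append({
            "locations": [canvas["images"][0]["resource"]["@id"]],
            "metadata": metadata,
        })
    return subjects

# A tiny invented manifest to exercise the mapping.
sample = {
    "metadata": [{"label": "Shelfmark", "value": "Example 123"}],
    "sequences": [{"canvases": [
        {"label": "page 1",
         "images": [{"resource": {"@id": "https://example.org/iiif/img1.jpg"}}]},
    ]}],
}
subjects = canvases_to_subjects(sample, hidden_fields=("Shelfmark",))
```

One subject set then corresponds to one manifest, which is why a whole volume of playbills can be added in a single import.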

 

Screenshot of manifest import screen

You can try out our live task and help create real data for testing ingest processes at https://frontend.preview.zooniverse.org/projects/bldigital/in-the-spotlight/classify

This is a very brief introduction, with more to come on managing data exports and IIIF annotations once you've set up, tested and launched a crowdsourced workflow (task). We'd love to hear from you - how might this be useful? What issues do you foresee? How might you want to expand or build on this functionality? Email digitalresearch@bl.uk or tweet @mia_out @LibCrowds. You can also comment on GitHub https://github.com/zooniverse/Panoptes-Front-End/pull/6095 or https://github.com/zooniverse/iiif-annotations

Digital work in libraries is always collaborative, so I'd like to thank British Library colleagues in Finance, Procurement, Technology, Collection Metadata Services and various Collections departments; the Zooniverse volunteers who helped test our first task and of course the Zooniverse team, especially Sam, Jim and Chris for their work on this.

 

12 April 2022

Making British Library collections (even) more accessible

Daniel van Strien, Digital Curator, Living with Machines, writes:

The British Library’s digital scholarship department has made many digitised materials available to researchers. This includes a collection of digitised books created by the British Library in partnership with Microsoft. This is a collection of books that have been digitised and processed using Optical Character Recognition (OCR) software to make the text machine-readable. There is also a collection of books digitised in partnership with Google. 

Since being digitised, this collection has been used for many different projects. These include recent work to augment the dataset with genre metadata and a project using machine learning to tag images extracted from the books. The books have also served as training data for a historic language model.

This blog post will focus on two challenges of working with this dataset: size and documentation, and discuss how we’ve experimented with one potential approach to addressing these challenges. 

One of the challenges of working with this collection is its size. The OCR output is over 20GB. This poses some challenges for researchers and other interested users wanting to work with these collections. Projects like Living with Machines are one avenue in which the British Library seeks to develop new methods for working at scale. For an individual researcher, one of the possible barriers to working with a collection like this is the computational resources required to process it. 

Recently we have been experimenting with a Python library, datasets, to see if this can help make this collection easier to work with. The datasets library is part of the Hugging Face ecosystem. If you have been following developments in machine learning, you have probably heard of Hugging Face already. If not, Hugging Face is a delightfully named company focusing on developing open-source tools aimed at democratising machine learning. 

The datasets library is a tool that aims to make it easier for researchers to share and efficiently process large datasets for machine learning. Whilst this was the library’s original focus, there may also be other use cases for which it can help make datasets held by the British Library more accessible.

Some features of the datasets library:

  • Tools for efficiently processing large datasets 
  • Support for easily sharing datasets via a ‘dataset hub’ 
  • Support for documenting datasets hosted on the hub (more on this later). 

As a result of these and other features, we have recently worked on adding the British Library books dataset to the Hugging Face hub. Making the dataset available via the datasets library makes it more accessible in a few different ways.

Firstly, it is now possible to download the dataset in two lines of Python code: 

```python
from datasets import load_dataset

ds = load_dataset('blbooks', '1700_1799')
```

We can also use the datasets library to process large datasets efficiently. For example, suppose we only want to include data with a high OCR confidence score (this helps to partially filter out text with many OCR errors):

```python
ds = ds.filter(lambda example: example['mean_wc_ocr'] > 0.9)
```

One of the particularly nice features here is that the library uses memory mapping to store the dataset under the hood. This means that you can process data that is larger than the RAM you have available on your machine. This can make the process of working with large datasets more accessible. We could also use this as a first step in processing data before getting back to more familiar tools like pandas. 

```python
dogs_data = ds['train'].filter(lambda example: "dog" in example['text'].lower())
df = dogs_data.to_pandas()
```
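The memory-mapping point is worth a small illustration. The following standard-library sketch is not how the datasets library works internally (it builds on Apache Arrow), but it shows the general principle: a memory-mapped file is paged in by the operating system on demand, so we can scan a file without ever holding all of it in RAM at once. The file name and contents here are invented for the example:

```python
import mmap
import os
import tempfile

# Write a small file standing in for a large on-disk dataset
path = os.path.join(tempfile.mkdtemp(), "books.txt")
with open(path, "w") as f:
    for i in range(10_000):
        f.write(f"record {i}: the dog walked past the library\n")

# Memory-map the file: the OS pages bytes in on demand, so the
# whole file is never loaded into RAM in one go
matches = 0
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    for line in iter(mm.readline, b""):
        if b"dog" in line:
            matches += 1

print(matches)  # 10000
```

The same pattern, scaled up by Arrow's columnar format, is what lets the datasets library filter a 20GB collection on a laptop with far less RAM than the data itself.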

In a follow-on blog post, we’ll dig deeper into the technical details of datasets. Whilst making the technical processing of datasets more accessible is one part of the puzzle, there are also non-technical challenges to making a dataset more usable.

 

Documenting datasets 

One of the challenges of sharing large datasets is documenting the data effectively. Traditionally, libraries have mainly focused on describing material at the ‘item level’, i.e. documenting one item at a time. However, there is a difference between documenting one book and 100,000 books. There are no easy answers to this, but one possible avenue libraries could explore is the use of datasheets. Timnit Gebru et al. proposed the idea in ‘Datasheets for Datasets’. A datasheet aims to provide a structured format for describing a dataset, covering questions such as how and why it was constructed, what the data consists of, and how it could potentially be used. Crucially, datasheets also encourage a discussion of the biases and limitations of a dataset. Whilst you can identify some of these limitations by working with the data, there is also a crucial amount of information known to the curators of the data that might not be obvious to end-users. Datasheets offer one possible way for libraries to begin communicating this information more systematically.

The dataset hub adopts the practice of writing datasheets and encourages users of the hub to write one for their dataset. For the British Library books dataset, we have attempted to write one of these datasheets. Whilst it is certainly not perfect, it hopefully begins to outline some of the challenges of this dataset and gives end-users a better sense of how they should approach it.

14 March 2022

The Lotus Sutra Manuscripts Digitisation Project: the collaborative work between the Heritage Made Digital team and the International Dunhuang Project team

Digitisation has become one of the key tasks for curatorial roles within the British Library. It rests on two main pillars: making collection items accessible to everybody around the world, and preserving unique and sometimes very fragile items. Digitisation involves many different teams and workflow stages, including retrieval, conservation, curatorial management, copyright assessment, imaging, workflow management, quality control, and final publication to online platforms.

The Heritage Made Digital (HMD) team works across the Library to assist with digitisation projects. An excellent example of the collaborative nature of the relationship between the HMD and International Dunhuang Project (IDP) teams is the quality control (QC) of the Lotus Sutra Project’s digital files. It is crucial that images meet the quality standards of the digitisation process. As a Digitisation Officer in HMD, I am in charge of QC for the Lotus Sutra Manuscripts Digitisation Project, which is currently conserving and digitising nearly 800 Chinese Lotus Sutra manuscripts to make them freely available on the IDP website. The manuscripts were acquired by Sir Aurel Stein after they were discovered in a hidden cave in Dunhuang, China in 1900. They are thought to have been sealed there at the beginning of the 11th century. They are now part of the Stein Collection at the British Library and, together with the international partners of the IDP, we are working to make them available digitally.

The majority of the Lotus Sutra manuscripts are scrolls and, after they have been treated by our dedicated Digitisation Conservators, our expert Senior Imaging Technician Isabelle does an outstanding job of imaging the fragile manuscripts. My job is then to prepare the images for publication online. This includes checking that they have the correct technical metadata such as image resolution and colour profile, are an accurate visual representation of the physical object and that the text can be clearly read and interpreted by researchers. After nearly 1000 years in a cave, it would be a shame to make the manuscripts accessible to the public for the first time only to be obscured by a blurry image or a wayward piece of fluff!

With the scrolls measuring up to 13 metres long, most are too long to be imaged in one go. They are instead shot in individual panels, which our Senior Imaging Technicians digitally “stitch” together to form one big image. This gives online viewers a sense of the physical scroll as a whole, in a way that would not be possible in real life for those scrolls that are more than two panels in length unless you have a really big table and a lot of specially trained people to help you roll it out. 

Photo showing the three individual panels of Or.8210/S.1530 with breaks in between
Or.8210/S.1530: individual panels
Photo showing the three panels of Or.8210/S.1530 as one continuous image
Or.8210/S.1530: stitched image

 

This post-processing can create issues, however. Sometimes an error in the stitching process can cause a scroll to appear warped or wonky. In the stitched image for Or.8210/S.6711, the ruled lines across the top of the scroll appeared wavy and misaligned. But when I compared this with the images of the individual panels, I could see that the lines on the scroll itself were straight and unbroken. It is important that the digital images faithfully represent the physical object as far as possible; we don’t want anyone thinking these flaws are in the physical item and writing a research paper about ‘Wonky lines on Buddhist Lotus Sutra scrolls in the British Library’. Therefore, I asked the Senior Imaging Technician to restitch the images together: no more wonky lines. However, we accept that the stitched images cannot be completely accurate digital surrogates, as they are created by the Imaging Technician to represent the item as it would be seen if it were to be unrolled fully.

 

Or.8210/S.6711: distortion from stitching. The ruled line across the top of the scroll is bowed and misaligned

 

Similarly, our Senior Imaging Technician applies ‘digital black’ to make the image background a uniform colour. This is to hide any dust or uneven background and ensure the object is clear. If this is accidentally overused, it can make it appear that a chunk has been cut out of the scroll. Luckily this is easy to spot and correct, since we retain the unedited TIFFs and RAW files to work from.

 

Or.8210/S.3661, panel 8: overuse of digital black when filling in a tear in the scroll, appearing as a large black line down the centre of the image

 

Sometimes the scrolls are wonky, dirty, or incomplete. They are hundreds of years old, and this is where it can become tricky to work out whether there is an issue with the images or with the scroll itself. The stains, tears and dirt shown in the images below are part of the scrolls and their material history. They give clues to how the manuscripts were made, stored, and used. This is all of interest to researchers, and we want to make sure to preserve and display these features in the digital versions. The best part of my job is finding interesting things like this. The fourth image below shows a fossilised insect covering the text of the scroll!

 

Black stains: Or.8210/S.2814, panel 9
Torn and fragmentary panel: Or.8210/S.1669, panel 1
Insect droppings obscuring the text: Or.8210/S.2043, panel 1
Fossilised insect covering text: Or.8210/S.6457, panel 5

 

We want to minimise the handling of the scrolls as much as possible, so we will only reshoot an image if it is absolutely necessary. For example, I would ask a Senior Imaging Technician to reshoot an image if debris is covering the text and makes it unreadable - but only after inspecting the scroll to ensure it can be safely removed and is not stuck to the surface. However, if some debris such as a small piece of fluff, paper or hair, appears on the scroll’s surface but is not obscuring any text, then I would not ask for a reshoot. If it does not affect the readability of the text, or any potential future OCR (Optical Character Recognition) or handwriting analysis, it is not worth the risk of damage that could be caused by extra handling. 

Reshoot: Or.8210/S.6501: debris over text  /  No reshoot: Or.8210/S.4599: debris not covering text

 

These are a few examples of the things to which the HMD Digitisation Officers pay close attention during QC. Only through this careful process can we ensure that the digital images accurately reflect the physicality of the scrolls and represent their original features. By developing a QC process that applies the best techniques and procedures, working to defined standards and guidelines, we succeed in making these incredible items accessible to the world.

Read more about the Lotus Sutra Project here: IDP Blog

IDP website: IDP.BL.UK

And IDP twitter: @IDP_UK

Dr Francisco Perez-Garcia

Digitisation Officer, Heritage Made Digital: Asian and African Collections

Follow us @BL_MadeDigital
