Digital scholarship blog

Enabling innovative research with British Library digital collections

Introduction

Tracking exciting developments at the intersection of libraries, scholarship and technology. Read more

09 October 2023

Strike a Pose Steampunk style! For our Late event with Clockwork Watch on Friday 13th October

This Friday (13th October) the British Library invites you to join the world of Clockwork Watch by Yomi Ayeni, a participatory storytelling project, set in a fantastical retro-futurist vision of Victorian England, with floating cities and sky pirates, which is one of the showcased narratives in our Digital Storytelling exhibition.

Flyer with text saying Late at the Library, Digital Steampunk at the British Library, London. Friday 13 October, 19:30 – 22:30

We are delighted that Dark Box Images will be bringing their portable darkroom to the Late at the Library: Digital Steampunk event and taking portrait photographs. If this appeals to you, then please arrive early to have your picture taken. Photographer Gregg McNeill is an expert in the wet plate collodion process invented by Frederick Scott Archer in 1851. Gregg’s skill in using an authentic Victorian camera creates genuinely remarkable results that appear right in front of your eyes.

Black and white photograph of a woman wearing an elaborate outfit and a mask with her arms outstretched wide with fabric like wings
Wet plate collodion photograph of Jennifer Garside of Wyte Phantom corsetry, taken by Gregg McNeill of Dark Box Images

If you want to pose for the camera at our steampunk Late, or have a portrait drawn by artist Doctor Geof, please don’t be shy, this is an event where guests are encouraged to dress to impress! The aesthetic of steampunk fashion is inspired by Victoriana and 19th Century literature, including Jules Verne’s novels and the Sherlock Holmes stories by Sir Arthur Conan Doyle. Steampunk looks can include hats and googles, tweed tailoring, waistcoats, corsets, fob watches and fans. Whatever your personal style, we encourage you to unleash your creativity when putting together an outfit for this event.

Furthermore, whether you are seeking a new look or some finishing touches, there will be an opportunity to browse a Night Market at this Late event, where you can purchase and admire a range of exquisite hand crafted items created by:

  • Jema Hewitt, a professional costumer and academic, will be bringing some of her unique, handmade jewellery and accessories to the Library Late event. She was one of the originators of the early artistic steampunk scene in the UK, subsequently exhibiting her costume work internationally, and having three how-to-make books published as her alter ego “Emilly Ladybird”. Jema currently specialises as a pattern cutter for film, theatre and TV, as well as lecturing and teaching workshops.
Photograph of jewellery, hats and clothing
Jewellery, hats and clothing created by Jema Hewitt/Emilly Ladybird
  • Doctor Geof, an artist, scientist, comics creator and maker of whimsical objects. His work is often satirical, usually with an historical twist, and features tea, goblins, krakens, steampunk, smut, nuns, bees, cats and more tea. Since 2004 you may have encountered him selling his comics, prints, cards, mugs, pins, and for some reason a lot of embroidered badges (including an Evil Librarian patch!) at various events. As one of the foremost Steampunk artists in the UK, Doctor Geof has worked with and exhibited at the Cutty Sark, Royal Museums Greenwich, and Discovery Museum Newcastle. He is a talented portrait artist, so please seek him out if you would like him to capture your likeness in ink and watercolour.
A round embroidered patch with a cartoon figure wearing goggles and carrying books. Text says "Evil Librarian"
Evil Librarian embroidered patch by Dr Geof

  • Jennifer Garside, a seamstress specialising in modern corsetry, which takes inspiration from historical styles. Her business, Wyte Phantom, opened in 2010, and she has made costumes for opera singers, performers and artists across the world.

  • Tracy Wells, a couture milliner based in the Lake District. She creates all kinds of hats and headpieces, often collaborating with other artists to explore new styles, concepts and genres.
Photograph of a woman wearing a steampunk hat with feathers
Millinery by Tracy Wells
  • Herr Döktor, a renowned inventor, gadgeteer, and contraptionist, who has been working in his Laboratory in the Surrey Hills for the last two decades, building a better future via the prism of history. He will be bringing a small selection of his inventions and scale models of his larger ideas. (His alter ego, Ian Crichton, is a professional model maker with thirty years experience as a toy prototype maker, museum and exhibition designer, and, most recently, building props and models for the film industry, he also lives in the Surrey Hills). 
Photograph of a man wearing a top hat and carrying a model submarine
Herr Döktor, inventor, gadgeteer, and contraptionist. Photograph by Adam Stait
  • Linette Withers established Anachronalia in 2012 to be a full-time bookbinder, producing historically-inspired books, miniature books, and quirky stationery. Her work has been shortlisted for display at the Bodleian Library at the University of Oxford as part of their ‘Redesigning the Medieval Book’ competition and exhibition in 2018 and one of her books is held in the permanent collection of The Lit & Phil in Newcastle after being part of an exhibition of bookbinding in 2021. She also teaches bookbinding in her studio in Leeds.

  • Heather Hayden of Diamante Queen Designs creates handmade vintage inspired, kitsch, macabre, noir accessories for everybody to wear and enjoy. Heather studied fashion and surface pattern design in the 80's near Leeds during the emergence of Gothic culture and has remained interested in the darker side of life ever since. She became fascinated with Steampunk after seeing Datamancer's Steampunk computer, loving the juxtaposition of new and old technology. This inspired her to make steampunk clothing and accessories using old and found items and upcycling as much as possible.
Photograph of a mannequin head wearing a headpiece with tassels, feathers, flowers and beads
Headpiece by Diamante Queen Designs
  • Matthew Chapman of Raphael's Workshop specialises in creating strange and sublime chainmail items, bringing ideas to life in metal that few would ever consider. From collars to corsets, serpents to squids, arms to armour and medals to masterpieces, you should visit his stall and see what creations spark the imagination.
Photograph of a table displaying a range of wearable items of chainmail jewellery and accessories
Chainmail jewellery and accessories created by Raphael's Workshop

We hope that this post has whetted your appetite for the delights available at the Late at the Library: Digital Steampunk event on Friday 13th October at the British Library. Tickets can be booked here.

02 October 2023

Last chance to see the Digital Storytelling exhibition

All good things must come to an end, no I’m not talking about the collapse of a favourite high street chain store beginning with W, but the final few weeks of our Digital Storytelling exhibition, which closes on the 15th October 2023. If you haven’t seen it yet, then this is your last chance to book!

Digital Storytelling showcases eleven different born digital works, including interactive narratives that respond to user input, reading experiences personalised by data feeds, and immersive multimedia story worlds developed through audience participation. From thought provoking autobiographical hypertexts to data journalism, uncanny ghost stories to weather poetry, steampunk literary adaptation to quirky Elizabethan medical comedy. 

Digital Storytelling exhibition image with art from Astrologaster, Seed, 80 Days, and Zombies, Run!

If you want to hear more about this exhibition, Digital Curator Stella Wisdom will be giving two talks later this week. The first of these will be in-person on Thursday evening, 5th October, in Richmond Lending Library for the Richmond Reads season of events, celebrating the joys and benefits of reading. The second will be held online on Friday morning, 6th October, for the DARIAH-EU autumn 2023 Friday Frontiers series.

We are also delighted to share that there is a chapter about interactive digital books written by Giulia Carla Rossi, Curator for Digital Publications, in The Book by Design, which was recently launched by our colleagues in British Library Publishing. Giulia’s chapter discusses innovative Editions at Play publications, including Seed by Joanna Walsh and Breathe by Kate Pullinger, which are both currently displayed in Digital Storytelling.

Before the Digital Storytelling exhibition closes, we'd love you to join us for a party on the evening of Friday 13th October. For one night only, transmedia storyteller Yomi Ayeni will transform the British Library into the Clockwork Watch story world for an immersive steampunk late event.

Genre-bending DJ Sacha Dieu will be spinning the best in Balkan Gypsy, Electro Swing, and Global Beats. Professor Elemental will perform live for us, and we really hope he’ll sing I Love Libraries! You'll also be able to view the Digital Storytelling exhibition, and there will be quieter areas to explore 19th Century London in Minecraft, play board games including Great Scott! The Game of Mad Invention with games librarian Marion Tessier, and to discover poetry with the Itinerant Poetry Librarian.

If you plan to party with us, book your ticket here.

27 September 2023

Late at the Library: Digital Steampunk

Summer may be over, but there is much to look forward to this autumn, including our Late at the Library: Digital Steampunk event on Friday 13th October 2023, where we invite you to immerse yourself in the Clockwork Watch story world, party with chap hop maestro Professor Elemental and explore 19th-century London in Minecraft. If these kind of shenanigans sound right up your street, then book tickets here and join us!

Clockwork Watch by Yomi Ayeni is currently showcased in the British Library’s Digital Storytelling exhibition, which is open until 15 October 2023. Set in a retro-futurist steampunk Victorian England, Clockwork Watch is a participatory story that includes multiple voices and perspectives on themes relating to empire, colonialism, exploitation and resistance, which is told across a range of formats, including a series of graphic novels (there is an overview of these titles here), immersive theatre, role play, and an online newspaper the London Gazette.

Drawing of a a range of people in steampunk clothing,in front of a London skyline
Steampunk Illustration by Brett Walsh

For the evening of Friday 13th October, the British Library will transform into the story world of the next part of the Clockwork Watch narrative. Featuring an auction of the last few remaining properties on Peak B, and the opening of bids for Peak C, new housing developments situated on floating islands hovering over the British Channel. Leggett and Scarper, the estate agents managing these properties, will also be inviting inventors or anyone with a solution to problems plaguing these floating islands, to submit their plans for a chance to win a Golden Ticket to one of the new homes on Peak C.

Illustration of Peak B property development on a floating island
© Clockwork Watch / Graham Leggett 2023

Attendees will be able to explore the streets of Sherlock Holmes’ London in Minecraft created by Blockworks and Lancaster University, visit the Night Market, have a photograph taken with authentic Victorian Dark Box photography, or a portrait drawn by artist Dr Geof, and that’s before the auction begins. But be warned, buying your way into this real estate dreamworld is not straightforward – this night is a golden opportunity for the Clockwork Watch underbelly of pickpockets, rogues and vagabonds.

Dressing up and joining in is heartily encouraged. To prepare for this event, we suggest reading the Clockwork Watch graphic novels, you can order these online, or purchase the first two ominbus editions from the British Library’s onsite shop. Also check out the London Gazette website and this special British Library edition of the newspaper. We hope to see you there!

Cover page of the London Gazette British Library edition
© Clockwork Watch

26 September 2023

Let’s learn together - Join us in the Cultural Heritage Open Scholarship Network

Are you working in Galleries-Libraries-Archives-Museums (GLAM) and cultural heritage organisations as research support and research-active staff? Are you interested in developing knowledge and skills in open scholarship? Would you like to establish good practices, share your experience with others and collaborate? If your answer is yes to one or more of these questions, we invite you to join the Cultural Heritage Open Scholarship Network (CHOSN).

Initiated by the British Library’s Research Infrastructure Services built on the experience of and positive responses received from the open scholarship training programme, which was run earlier this year. CHOSN is a community of practice for research support and research-active staff who work in GLAMs, organisations interested in developing and sharing open scholarship knowledge and skills, organising events, and supporting each other in this area. 

GLAMs demonstrate a significant amount of research showcases, but we may find ourselves with inadequate resources to make that research openly available, gain relevant open scholarship skills to make it happen, or even identify what forms research in these environments. CHOSN aims to provide a platform to create synergy for those aiming for good practice in open scholarship.

CHOSN flyer image, text says: Cultural Heritage Open Scholarship Network (CHOSN). Are you working in Galleries-Libraries-Archives-Museums (GLAMs)? Join Us! To develop knowledge and skills in open scholarship, organise activities to learn and grow, and create a community of practise to collaborate and support each other.

This network can be of interest to anyone who is facilitating, enabling, supporting research activities in GLAM organisations. They include but are not limited to research support staff, research-active staff, librarians, curatorial teams, IT specialists, copyright officers and so on. Anyone interested in the areas of open scholarship and works in cultural heritage organisations are welcome.

Join us in the Cultural Heritage Open Scholarship Network (CHOSN) to;

  • explore research activities, roles in GLAMs and make them visible,
  • develop knowledge and skills in open scholarship,
  • carry out capacity development activities to learn and grow, and
  • create a community of practice to collaborate and support each other.

We have set up a JISC mailing list to start communication with the network, you can join by signing up here. We will shortly organise an online meeting to kick off the network plans, explore how to move forward and to collectively discuss what we would like to do next. This will all be communicated via the CHOSN mailing list.

If you have any questions about CHOSN, we are happy to hear from you at [email protected].

21 September 2023

Convert-a-Card: Helping Cataloguers Derive Records with OCLC APIs and Python

This blog post is by Harry Lloyd, Research Software Engineer in the Digital Research team, British Library. You can sometimes find him at the Rose and Crown in Kentish Town.

Last week Dr Adi Keinan-Schoonbaert delved into the invaluable work that she and others have done on the Convert-a-Card project since 2015. In this post, I’m going to pick up where she left off, and describe how we’ve been automating parts of the workflow. When I joined the British Library in February, Victoria Morris and former colleague Giorgia Tolfo had prototyped programmatically extracting entities from transcribed catalogue cards and searching by title and author in the OCLC WorldCat database for any close matches. I have been building on this work, and addressing the last yellow rectangle below: “Curator disambiguation and resolution”. Namely how curators choose between OCLC results and develop a MARC record fit for ingest into British Library systems.

A flow chart of the Convert-a-card workflow. Digital catalogue cards to Transkribus to bespoke language model to OCR output (shelfmark, title, author, other text) to OCLC search and retrieval and shelfmark correction to spreadsheet with results to curator disambiguation and resolution to collection metadata ingest
The Convert-a-Card workflow at the start of 2023

 

Entity Extraction

We’re currently working with the digitised images from two drawers of cards, one Urdu and one Chinese. Adi and Giorgia used a layout model on Transkribus to successfully tag different entities on the Urdu cards. The transcribed XML output then had ‘title’, ‘shelfmark’ and ‘author’ tags for the relevant text, making them easy to extract.

On the left an image of an Urdu catalogue card, on the right XML describing the transcribed text, including a "title" tag for the title line
Card with layout model and resulting XML for an Urdu card, showing the `structure {type:title;}` parameter on line one

The same method didn’t work for the Chinese cards, possibly because the cards are less consistently structured. There is, however, consistency in the vertical order of entities on the card: shelfmark comes above title comes above author. This meant I could reuse some code we developed for Rossitza Atanassova’s Incunabula project, which reliably retrieved title and author (and occasionally an ISBN).

Two Chinese cards side-by-side, with different layouts.
Chinese cards. Although the layouts are variable, shelfmark is reliably the first line, with title and author following.

 

Querying OCLC WorldCat

With the title and author for each card, we were set-up to query WorldCat, but how to do this when there are over two thousand cards in these two drawers alone? Victoria and Giorgia made impressive progress combining Python wrappers for the Z39.50 protocol (PyZ3950) and MARC format (Pymarc). With their prototype, a lot of googling of ASN.1, BER and Z39.50, and a couple of quiet weeks drifting through the web of references between the two packages, I built something that could turn a table of titles and authors for the Chinese cards into a list of MARC records. I had also brushed up on enough UTF-8 to fix why none of the Chinese characters were encoded correctly.

For all that I enjoyed trawling through it, Z39.50 is, in the words of a 1999 tutorial, “rather hard to penetrate” and nearly 35 years old. PyZ39.50, the Python wrapper, hasn’t been maintained for two years, and making any changes to the code is a painstaking process. While Z39.50 remains widely used for transferring information between libraries, that doesn’t mean there aren’t better ways of doing things, and in the name of modernity OCLC offer a suite of APIs for their services. Crucially there are endpoints on their Metadata API that allow search and retrieval of records in MARCXML format. As the British Library maintains a cataloguing subscription to OCLC, we have access to the APIs, so all that’s needed is a call to the OCLC OAuth Server, a search on the Metadata API using title and author, then retrieval of the MARCXML for any results. This is very straightforward in Python, and with the Requests package and about ten lines of code we can have our MARCXML matches.

Selecting Matches

At all stages of the project we’ve needed someone to select the best match for a card from WorldCat search results. This responsibility currently lies with curators and cataloguers from the relevant collection area. With that audience in mind, I needed a way to present MARC data from WorldCat so curators could compare the MARC fields for different matches. The solution needed to let a cataloguer choose a card, show the card and a table with the MARC fields for each WorldCat result, and ideally provide filters so curators could use domain knowledge to filter out bad results. I put out a call on the cross-government data science network, and a colleague in the 10DS data science team suggested Streamlit.

Streamlit is a Python package that allows fast development of web apps without needing to be a web app developer (which is handy as I’m not one). Adding Streamlit commands to the script that processes WorldCat MARC records into a dataframe quickly turned it into a functioning web app. The app reads in a dataframe of the cards in one drawer and their potential worldcat matches, and presents it as a table of cards to choose from. You then see the image of the card you’re working on and a MARC field table for the relevant WorldCat matches. This side-by-side view makes it easy to scan across a particular MARC field, and exclude matches that have, for example, the wrong physical dimensions. There’s a filter for cataloguing language, sort options for things like number of subject access fields and total number of fields, and the ability to remove bad matches from view. Once the cataloguer has chosen a match they can save a match to the original dataframe, or note that there were no good matches, or only a partial match.

Screenshot from the Streamlit web app, with an image of a Chinese catalogue card above a table containing MARC data for different WorldCat matches relating to the card.
Screenshot from the Streamlit Convert-a-Card web app, showing the card and the MARC table curators use to choose between matches. As the cataloguers are familiar with MARC, providing the raw fields is the easiest way to choose between matches.

After some very positive initial feedback, we sat down with the Chinese curators and had them test the app out. That led to a fun, interactive, user experience focussed feedback session, and a whole host of GitHub issues on the repository for bugs and design suggestions. Behind the scenes discussion on where to host the app and data are ongoing and not straightforward, but this has been a deeply easy product to prototype, and I’m optimistic it will provide a light weight, gentle learning curve complement to full deriving software like Aleph (the Library’s main cataloguing system).

Next Steps

The project currently uses a range of technologies in  Transkribus, the OCLC APIs, and Streamlit, and tying these together has in itself been a success. Going forwards, we have the possibility of extracting non-English text from the cards to look forward to, and the richer list of entities this would make available. Working with the OCLC APIs has been a learning curve, and they’re not working perfectly yet, but they represent a relatively accessible option compared to Z39.50. And my hope for the Streamlit app is that it will be a useful tool beyond the project for wherever someone wants to use Worldcat to help derive records from minimal information. We still have challenges in terms of design, data storage, and hosting to overcome, but these discussions should have their own benefits in making future development easier. The goal for automation part of the project is a smooth flow of data from Transkribus, through OCLC and on to the curators, and while it’s not perfect, we’re definitely getting there.

15 September 2023

London Fashion Week SS24: British Library x Ahluwalia

This year we will be continuing our collaboration with the British Fashion Council running our annual student research competition, which encourages fashion students to use the British Library collections in creating their fashion designs. Once again, we will start the collaboration with a fashion show produced by a leading designer. This year we are delighted to be working with Priya Ahluwalia. Earlier this year Priya worked with the Business and IP centre, contributing to the Inspiring Entrepreneurs’ International Women’s Day event, which discussed how we can best embrace and encourage diversity and inclusion in business.

On 15 September during London Fashion Week Priya will showcase her SS24 collection at the British Library. Following the show, Priya will lead this year’s student competition, focusing on the importance of research in design process. As a part of this competition students across the UK will create fashion portfolios inspired by the Library’s unique collections.

The previous collaborations with the British Fashion Council involved a range of exciting designers such as Nabil El Nayal, Phoebe English, Supriya Lele and Charles Jeffrey.

Photo of fashion event with Pheobe English (2021)
Phoebe English’s fashion installation at the British Library in 2021

 

The previous student work utilised the riches of the Library’s digital and physical collections, with the Flickr collection being especially popular with students. However, the inspiration came from many different directions - from art books, photographs and maps to the reading room bags.

This year’s student competition will be launched in October 2023.

Collage of different images of types of coats at the British Library including a man wearing a traditional Romanian winter coat, and a technical image detailing elements of a winter coat
From the winning portfolio of Mihai Popesku, Middlesex University student, who used the Library collections to research traditional Romanian dress

Update: there's been some great coverage in the fashion press (and social media), including this Vogue article that begins 'Priya Ahluwalia’s show purposefully took place at the British Library. More than just a venue, it tied into the theme of her work: bringing forgotten or untold stories about talented people to attention'.

14 September 2023

What's the future of crowdsourcing in cultural heritage?

The short version: crowdsourcing in cultural heritage is an exciting field, rich in opportunities for collaborative, interdisciplinary research and practice. It includes online volunteering, citizen science, citizen history, digital public participation, community co-production, and, increasingly, human computation and other systems that will change how participants relate to digital cultural heritage. New technologies like image labelling, text transcription and natural language processing, plus trends in organisations and societies at large mean constantly changing challenges (and potential). Our white paper is an attempt to make recommendations for funders, organisations and practitioners in the near and distant future. You can let us know what we got right, and what we could improve by commenting on Recommendations, Challenges and Opportunities for the Future of Crowdsourcing in Cultural Heritage: a White Paper.

The longer version: The Collective Wisdom project was funded by an AHRC networking grant to bring experts from the UK and the US together to document the state of the art in designing, managing and integrating crowdsourcing activities, and to look ahead to future challenges and unresolved issues that could be addressed by larger, longer-term collaboration on methods for digitally-enabled participation.

Our open access Collective Wisdom Handbook: perspectives on crowdsourcing in cultural heritage is the first outcome of the project, our expert workshops were a second.

Mia (me) and Sam Blickhan launched our White Paper for comment on pubpub at the Digital Humanities 2023 conference in Graz, Austria, in July this year, with Meghan Ferriter attending remotely. Our short paper abstract and DH2023 slides are online at Zenodo

So - what's the future of crowdsourcing in cultural heritage? Head on over to Recommendations, Challenges and Opportunities for the Future of Crowdsourcing in Cultural Heritage: a White Paper and let us know what you think! You've got until the end of September…

You can also read our earlier post on 'community review' for a sense of the feedback we're after - in short, what resonates, what needs tweaking, what examples could we include?

To whet your appetite, here's a preview of our five recommendations. (To find out why we make those recommendations, you'll have to read the White Paper):

  • Infrastructure: Platforms need sustainability. Funding should not always be tied to novelty, but should also support the maintenance, uptake and reuse of well-used tools.
  • Evidencing and Evaluation: Help create an evaluation toolkit for cultural heritage crowdsourcing projects; provide ‘recipes’ for measuring different kinds of success. Shift thinking about value from output/scale/product to include impact on participants' and community well-being.
  • Skills and Competencies: Help create a self-guided skills inventory assessment resource, tool, or worksheet to support skills assessment, and develop workshops to support their integrity and adoption.
  • Communities of Practice: Fund informal meetups, low-cost conferences, peer review panels, and other opportunities for creating and extending community. They should have an international reach, e.g. beyond the UK-US limitations of the initial Collective Wisdom project funding.
  • Incorporating Emergent Technologies and Methods: Fund educational resources and workshops to help the field understand opportunities, and anticipate the consequences of proposed technologies.

What have we missed? Which points do you want to boost? (For example, we discovered how many of our points apply to digital scholarship projects in general). You can '+1' on points that resonate with you, suggest changes to wording, ask questions, provide examples and references, or (constructively, please) challenge our arguments. Our funding only supported participants from the UK and US, so we're very keen to hear from folk from the rest of the world.

12 September 2023

Convert-a-Card: Past, Present and Future of Catalogue Cards Retroconversion

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected].

 

It’s been more than eight years, in June 2015, since the British Library launched its crowdsourcing platform, LibCrowds, with the aim of enhancing access to our collections. The first project series on LibCrowds was called Convert-a-Card, followed by the ever-so-popular In the Spotlight project. The aim of Convert-a-Card was to convert print card catalogues from the Library’s Asian and African Collections into electronic records, for inclusion in our online catalogue Explore.

A significant portion of the Library's extensive historical collections was acquired well before the advent of standard computer-based cataloguing. Consequently, even though the Library's online catalogue offers public access to tens of millions of records, numerous crucial research materials remain discoverable solely through searching the traditional physical card catalogues. The physical cards provide essential information for each book, such as title, author, physical description (dimensions, number of pages, images, etc.), subject and a “shelfmark” – a reference to the item’s location. This information still constitutes the basic set of data to produce e-records in libraries and archives.

Card Catalogue Cabinets in the British Library’s Asian & African Studies Reading Room © Jon Ellis
Card Catalogue Cabinets in the British Library’s Asian & African Studies Reading Room © Jon Ellis

 

The initial focus of Convert-a-Card was the Library’s card catalogues for Chinese, Indonesian and Urdu books – you can read more about this here and here. Scanned catalogue cards were uploaded to Flickr (and later to our Research Repository), grouped by the physical drawer in which they were originally located. Several of these digitised drawers became projects on LibCrowds.

 

Crowdsourcing Retroconversion

Convert-a-Card on LibCrowds included two tasks:

  1. Task 1 – Search for a WorldCat record match: contributors were asked to look at a digitised card and search the OCLC WorldCat database based on some of the metadata elements printed on it (e.g. title, author, publication date), to see if a record for the book already exists in some form online. If found, they select the matching record.
  2. Task 2 – Transcribe the shelfmark: if a match was found, contributors then transcribed the Library's unique shelfmark as printed on the card.

Online volunteers worked on Pinyin (Chinese), Indonesian and Urdu records, mainly between 2015 and 2019. Their valuable contributions resulted in lists of new records which were then ingested into the Library's Explore catalogue – making these items so much more discoverable to our users. For cards only partially matched with online records, curators and cataloguers had a special area on the LibCrowds platform through which they could address some of the discrepancies in partial matches and resolve them.

An example of an Urdu catalogue card
An example of an Urdu catalogue card

 

After much consideration, we have decided to sunset LibCrowds. However, you can see a good snapshot of it thanks to the UK Web Archive (with thanks to Mia Ridge and Filipe Bento for archiving it), or access its GitHub pages – originally set up and maintained by LibCrowds creator Alex Mendes. We have been using mainly Zooniverse for crowdsourcing projects (see for example Living with Machines projects), and you can see here some references to these and other crowdsourcing initiatives. Sunsetting LibCrowds provided us with the opportunity to rethink Convert-a-Card and consider alternative, innovative ways to automate or semi-automate the retroconversion of these valuable catalogue cards.

 

Text Recognition

As a first step, we were looking to automate the retrieval of text from the digitised cards using OCR/Machine Learning. As mentioned, this text includes shelfmark, title, author, place and date of publication, and other information. If extracted accurately enough, this text could be used for WorldCat lookup, as well as for enhancement of existing records. In most cases, the text was typewritten in English, often with additional information, or translation, handwritten in other languages. To start with, we’ve decided to focus only on the typewritten English – with the aspiration to address other scripts and languages in the future.

Last year, we ran some comparative testing with ABBYY FineReader Server (the software generally used for in-house OCR) and Transkribus, to see how accurately they perform this task. We trialled a set of cards with two different versions of ABBYY, and three different models for typewritten Latin scripts in Transkribus (Model IDs 29418, 36202, and 25849). Assessment was done by visually comparing the original text with the OCRed text, examining mainly the key areas of text which are important for this initiative, i.e. the shelfmark, author’s name and book title. For the purpose of automatically recognising the typewritten English on the catalogue cards, Transkribus Model 29418 performed better than the others – and more accurately than ABBYY’s recognition.

An example of a Pinyin card in Transkribus, showing segmentation and transcription
An example of a Pinyin card in Transkribus, showing segmentation and transcription

 

Using that as a base model, we incrementally trained a bespoke model to recognise the text on our Pinyin cards. We’ve also normalised the resulting text, for example removing spaces in the shelfmark, or excluding unnecessary bits of data. This model currently extracts the English text only, with a Character Error Rate (CER) of 1.8%. With more training data, we plan on extending this model to other types of catalogue cards – but for now we are testing this workflow with our Chinese cards.

 

Entities Extraction

Extracting meaningful entities from the OCRed text is our next step, and there are different ways to do that. One such method – if already using Transkribus for text extraction – is training and applying a bespoke P2PaLA layout analysis model. Such model could identify text regions, improve automated segmentation of the cards, and help retrieve specific regions for further tasks. Former colleague Giorgia Tolfo tested this with our Urdu cards, with good results. Trying to replicate this for our Chinese cards was not as successful – perhaps due to the fact that they are less consistent in structure.

Another possible method is by using regular expressions in a programming language. Research Software Engineer (RSE) Harry Lloyd created a Jupyter notebook with Python code to do just that: take the PAGE XML files produced by Transkribus, parse the XML, and extract the title, author and shelfmark from the text. This works exceptionally well, and in the future we’ll expand entity recognition and extraction to other types of data appearing on the cards. But for now, this information suffices to query OCLC WorldCat and see if a matching record exists.

One of the 26 drawers of Chinese (Pinyin) card catalogues © Jon Ellis
One of the 26 drawers of Chinese (Pinyin) card catalogues © Jon Ellis

 

Matching Cards to WorldCat Records

Entities extracted from the catalogue cards can now be used to search and retrieve potentially matching records from the OCLC WorldCat database. Pulling out WorldCat records matched with our card records would help us create new records to go into our cataloguing system Aleph, as well as enrich existing Aleph records with additional information. Previously done by volunteers, we aim to automate this process as much as possible.

Querying WorldCat was initially done using the z39.50 protocol – the same one originally used in LibCrowds. This is a client-server communications protocol designed to support the search and retrieval of information in a distributed network environment. With an excellent start by Victoria Morris and Giorgia Tolfo, who developed a prototype that uses PyZ3950 and PyMARC to query WorldCat, Harry built upon this, refined the code, and tested it successfully for data search and retrieval. Moving forward, we are likely to use the OCLC API for this – which should be a lot more straightforward!

 

Curator/Cataloguer Disambiguation

Getting potential matches from WorldCat is brilliant, but we would like to have an easy way for curators and cataloguers to make the final decision on the ideal match – which WorldCat record would be the best one as a basis to create a new catalogue record on our system. For this purpose, Harry is currently working on a web application based on Streamlit – an open source Python library that enables the building and sharing of web apps. Staff members will be able to use this app by viewing suggested matches, and selecting the most suitable ones.

I’ll leave it up to Harry to tell you about this work – so stay tuned for a follow-up blog post very soon!