Digital scholarship blog

5 posts from June 2021

24 June 2021

My placement: Using Transkribus to OCR Two Centuries of Indian Print

I began a work placement with the Two Centuries of Indian Print project at the British Library, working with my supervisor, Digital Curator Tom Derrick, to automatically transcribe the Library’s Bengali books that were digitised and catalogued as part of the project. The OCR application we use for transcription is Transkribus, a leading text recognition application for historical documents. We also use a Google Sheet to record each book’s basic information and job status as we go.

In the first two days, I received training in how to use the Transkribus application through a face-to-face (virtual) demonstration from my supervisor, since I didn't know how to use OCR. He also provided a manual for me to refer to in my practice. There are three main steps to complete a book transcription: uploading the book, running layout analysis, and running text recognition. We upload books from the British Library’s IIIF image viewer to Transkribus. I first needed to confirm the name and digital system number of a book from our team’s shared Google Sheet so that I could find its digital content within the BL online catalogue. At the same time, I recorded the number of pages in the book in the Google Sheet. Then I copied the URL of the IIIF manifest and imported the book into our project's collection in Transkribus. After that, I ran layout analysis in Transkribus. It usually takes several minutes to run, and the more pages there are, the longer it takes. A perfect layout analysis is one where there is one baseline for each line of text on a page.
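The page count recorded in the Google Sheet can also be read from the IIIF manifest itself. The following is a minimal sketch, assuming a manifest that follows the IIIF Presentation API, where pages are listed as canvases (under `sequences` in version 2.x, or as top-level `items` in version 3.0). The sample manifest, its label and its URLs are made up for illustration and are not taken from the British Library viewer.

```python
def count_pages(manifest: dict) -> int:
    """Count the canvases (page images) listed in a IIIF Presentation manifest."""
    # Presentation API 2.x: canvases are nested under one or more sequences.
    if "sequences" in manifest:
        return sum(len(seq.get("canvases", [])) for seq in manifest["sequences"])
    # Presentation API 3.0: canvases are top-level items of type "Canvas".
    return sum(1 for item in manifest.get("items", []) if item.get("type") == "Canvas")


# Hypothetical 2.x-style manifest with three pages (placeholder URLs).
SAMPLE_MANIFEST = {
    "@id": "https://example.org/iiif/book-1/manifest.json",
    "label": "Example book",
    "sequences": [
        {
            "canvases": [
                {"@id": "https://example.org/iiif/book-1/canvas/1"},
                {"@id": "https://example.org/iiif/book-1/canvas/2"},
                {"@id": "https://example.org/iiif/book-1/canvas/3"},
            ]
        }
    ],
}

if __name__ == "__main__":
    print(count_pages(SAMPLE_MANIFEST))
```

In practice the manifest would be downloaded from the URL copied out of the viewer and parsed as JSON before being passed to `count_pages`; that step is left out here so the sketch runs without network access.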

Although Transkribus has been trained on 100+ pages, it still makes mistakes for several reasons. Title or chapter headers whose font size differs significantly from the other text are sometimes missed; patterned dividers and borders on the title page are easily misidentified as text; and sometimes the colour of the paper is too dark, making it difficult to recognise the text. In these cases, the user needs to revise the recognition result manually. After checking the quality of the layout analysis, I could then run text recognition. The final step is to check the results of the text recognition and update the Google Sheet.


Above: A view of a book in the Transkribus application, showing the page images and transcription underneath

During the three weeks of the placement, I handled a total of twelve books. In addition to the routine workflow described earlier, I was fortunate to come across several books that required special handling, and I used them to learn how to deal with various situations. For example, the image above shows the result of text recognition for a page of the first book I dealt with in Transkribus, Dhārāpāta: prathama bhāg. Pāṭhaśālastha śiśu digera śikshārtha/ Cintāmani Pāl. Every word in this book is very short and widely spaced, making it very difficult for Transkribus to identify the layout. Because the book is only 28 pages long, I manually labeled the layout of every page.

In addition to my work, I have had the pleasure of interacting with many British Library curators and investigators who are engaged in digitization. I attended a regular meeting of our project and learnt how the work is divided among the members of the digital project. In addition, my supervisor Tom contacted some colleagues who work on the digitization of Chinese collections and gave me the opportunity to meet them, which has benefited me a lot.

The Principal Investigator for our 2CIP project, Adi, who also has been involved with research and development of Chinese OCR/HTR at the British Library, shared with me the challenges of Chinese OCR/HTR and the progress of current research at the British Library.

Curator for the International Dunhuang Project, Melodie, and a project manager, Tan, presented the research content and outcomes of the project. This project has many partner institutions in different countries that have collections related to the Silk Road. It is a very meaningful digitization project and I admire the development of this project.

The lead Curator for the British Library’s Chinese collections, Sara, introduced me to the different types of Chinese collections and some representative collections in the British Library. She also shared with me the practical problems they encounter when digitizing collections.

Three weeks passed quickly and I gained a lot from my experience at the British Library. In addition to the specifics of how to use Transkribus for text recognition, I have learned about the achievements and problems faced in digitizing Chinese collections from a variety of perspectives.

This is a guest post by UCL Digital Humanities MSc student Xinran Gu.

18 June 2021

The VHS Tapes: Preserving Emerging Formats at the British Library

Researching how to collect, curate and preserve emerging formats is important work for us in the Library. Fortunately, we aren't alone in our quest to understand how to manage born digital collections: we are active members of organisations such as the Digital Preservation Coalition and the Videogame Heritage Society, which are excellent networks and forums for us to share with and learn from fellow GLAM professionals working in this area.

The Videogame Heritage Society (VHS) is a subject specialist network for digital game preservation, led by the National Videogame Museum (NVM), based in Sheffield. They provide advocacy, support and expertise on the preservation of digital games and digital game culture through a network of museums, heritage institutions, developers, publishers, private collectors and anyone with an interest in videogame history.

The VHS launch event on 21 February 2020 was one of the last physical events I attended before the first Covid-19 lockdown started. Due to the global pandemic, the NVM had to completely re-think how to deliver their programme of planned VHS events, and this has produced a new series of online events called VHS Tapes, which started in February 2021.

At these events, VHS lead Mikey has been in conversation with members of the VHS community regarding the many issues surrounding digital game preservation, exhibition, and collection. Recordings of these can be found on the NVM's YouTube channel, in this playlist. They include conversations with the NVM's Conor Clarke, Foteini Aravani from the Museum of London and The Retro Hour Podcast. Not wanting to miss out on the fun, the British Library are invited speakers at an upcoming online VHS Tapes event on Tuesday 29 June 2021, 14:00-15:00. Places are free, but please book here.

Lynda Clark, Giulia Carla Rossi and I will talk about the British Library’s research in collecting, curating and preserving emerging formats, including eBook mobile apps and web-based interactive works, such as those made with tools like Twine, which form the Interactive Narratives and New Media Writing Prize special collections in the UK Web Archive. We’ll discuss digital tools used to build these web archive collections, some of the content and themes of the interactive works collected, and the Library’s plans for the future. We hope to see you there!

A laptop screen showing the interface of the interactive writing tool Twine
An attendee working with the digital interactive writing tool Twine at a 2018 British Library Interactive Fiction Summer School course

This post is by Digital Curator Stella Wisdom (@miss_wisdom)

14 June 2021

Adding Data to Wikidata is Efficient with QuickStatements

Once I was set up on Wikipedia (see Triangulating Bermuda, Detroit and William Wallace), I got started with Wikidata. Wikidata is the part of the Wikimedia universe which deals with structured data, like dates of birth, shelf marks and more.

Adding data to Wikidata is really simple: it just requires logging into Wikidata (or creating an account if you don’t already have one) and then pressing edit on any page you want to edit.

Image of a Wikidata entry about Earth
Editing Wikidata

If the page doesn’t already exist, then creating it is also very simple: just select ‘create a new item’ from the menu on the left-hand side of the page.

When using Wikidata, there are some powerful tools which make adding data quicker and easier. One of these is QuickStatements. Unfortunately, using QuickStatements requires that you have made 50 edits on Wikidata before you make your first batch. Fortunately, making these edits is rather quicker than Citation Hunt (for which, see Triangulating Bermuda, Detroit and William Wallace).

Image of Wikidata menu with 'Create a new item' highlighted
Creating a new item in Wikidata

I made those 50 edits very quickly, by setting up Wikidata item pages for each of the sample items from the India Office Records that we are working with (at the moment we are prioritising adding information about the records; further work will take place before any digitised items are uploaded to Wikimedia platforms). Basic information was added to each of the item pages.

Q107074264 (India Office List January 1885)

Q107074434 (India Office List July 1885)

Q107074463 (India Office List January 1886)

Q107074676 (India Office List July 1886)

Q107074754 (India Office List 1886 Supplement)

Q107074810 (1888-9 Report on the Administration of Bengal)

Q107074801 (1889-90 Report on the Administration of Bengal)

Once I had done this, it became clear that I needed to create more general pages, which could contain the DOIs that link back to the digitised records which are currently only accessible via batch download through the British Library research repository.

Q107134086 Page for administrative reports (V/10/60-1) in general.

Q107136752 Page for India lists (V/13/173-6) in general.

Image of the WikiProject page for the India Office Records
The WikiProject page for the India Office Records

The final preparatory step was to create a WikiProject page, which will facilitate collaboration on the project. This page contains links to all the pages involved in the project and will soon also contain useful resources such as templates for creating new pages as part of the project and queries for using the data.

After this, I began to experiment with QuickStatements, making heavy use of the helpful guide to it available on Wikidata.

I decided to upload information on members of a particular regiment in Bengal, since this was information I could easily copy into a spreadsheet because the versions of the reports in the British Library research repository support Optical Character Recognition (OCR).
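To illustrate how spreadsheet rows can become a QuickStatements batch, here is a minimal sketch in Python. It assumes the version 1 text format, in which each command sits on one line with its parts separated by `|` (or a tab), `CREATE` starts a new item and `LAST` refers to the item just created. The column names, the sample names and descriptions, and the choice of statements are hypothetical and are not taken from my actual spreadsheet; the properties used are P31 (instance of) with Q5 (human), and P21 (sex or gender) with Q6581097 (male).

```python
import csv
import io

# Hypothetical spreadsheet export; names and descriptions are placeholders.
SAMPLE_CSV = """name,description
Example Officer One,member of the 14th Infantry Regiment
Example Officer Two,member of the 14th Infantry Regiment
"""


def rows_to_quickstatements(csv_text: str) -> str:
    """Turn spreadsheet rows into QuickStatements version 1 commands."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["name"].strip()
        description = row["description"].strip()
        lines.append("CREATE")                     # start a new item
        lines.append(f'LAST|Len|"{name}"')         # English label
        lines.append(f'LAST|Den|"{description}"')  # English description
        lines.append("LAST|P31|Q5")                # instance of: human
        lines.append("LAST|P21|Q6581097")          # sex or gender: male (assumed for every row)
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_to_quickstatements(SAMPLE_CSV))
```

The printed text can be pasted into the QuickStatements import box for review before it is run. This sketch does not escape double quotes inside names, so values containing them would need cleaning first.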

Image of the original India Office List containing information on members of the 14th Infantry Regiment
Section of the original India Office List containing information on members of the 14th Infantry Regiment (IOR/V/6/175, page 258)

Finally, once I had done all of this, I met with the curators of the India Office Records for feedback and suggestions. It became clear from this that there was in fact some confusion about the exact identification of the regiment the officers belonged to. Fortunately, it turned out we had identified the correct regiment, but had we made a mistake, a simple batch of QuickStatements edits would have been enough to put it right.

Image of a section of a spreadsheet of members of the 14th Infantry Regiment
Section of my spreadsheet of members of the 14th Infantry Regiment

All in all, I can recommend using Wikidata, and I hope I have shown not only that it can be a useful tool, but also that it is easy to use. The next step for our Wikidata project will be to upload templates and case studies to help and support future volunteer editors to develop it further. We will also add resources to support research on the uploaded data.

Image of Quick Statements for adding gender to each of the pages for the officers
Screenshot of Quick Statements for adding gender to each of the pages for the officers

This is a guest post by UCL Digital Humanities MA student Dominic Kane.

11 June 2021

Libraries & Museums & Archives (Oh My!)

Folks interested in creative reuse of digitised sound recordings may want to come along to the "Libraries & Museums & Archives (Oh My!)" online conference this Saturday (12th June 2021), organised by The Folklore Library & Archive. Cheryl Tipp, the British Library's Curator of Wildlife and Environmental Sounds, and I will give a talk on how the Library’s sound archive has been innovatively used to create atmospheric soundscapes, both in the virtual landscapes of videogames and for physical sites of archaeological interest, such as Creswell Crags Museum and Visitor Centre. We'll also cover how the Wildlife and Environmental Sounds collection has been interpreted in other artistic projects, including visualisations by Andy Thomas and some delightful needlework by textile artist Cat Frampton.

Folklore Library and Archive logo with an open book and a wax seal
The Folklore Library & Archive artwork by Rhi Wynter

In our presentation we'll mention entries in the Off the Map competition, such as Midsummer by Tom Battey, and submissions to the recent Games in the Woods game jam for the Urban Tree Festival, such as Noisy Wood by Ash Green.

We'll also mention the fantastic Faint Signals by Invisible Flock, an interactive virtual woodland sound experience, which has been featured recently by Europeana Pro in an article on Seven tips for digital storytelling with cultural heritage, and in this BBC Culture story about The sounds that make us calmer.

Screen image of Faint Signals abstract virtual woodland
Faint Signals by Invisible Flock

The conference will be held online via Zoom, with ticket money going towards the Folklore Library and Archive’s appeal to save the archive of the late folklorist Venetia Newall. Furthermore, all ticket holders will be able to access video replays of the talks after the event; go here to book your place. There is a stellar line-up of speakers from other organisations, including:

  • Jim Peters, Collections manager (Dept of Prehistory and Europe) from the British Museum talking about his favourite objects from the collections.
  • Alexandra Stockdale-Haley from the National Science and Media Museum, talking about The Cottingley Fairy artefacts and their role in the modern day.
  • Librarians from Senate House Library giving a presentation on The Harry Price Library of Magical Literature.
  • Geraldine Beskin, owner of the Atlantis Bookshop talking about Ghosts of the Theatre.
  • Rachel Morris, co-founder of Metaphor Museum Designers, speaking about the role of the archive in Museums and how to interpret it.
  • Peter Hewitt, founder of the Folklore Museums Network talking about their work bringing museums together.
  • Clare Smith, Historical Collections Curator from the Metropolitan Police Museum giving a talk on The Crime Museum fact vs fiction, and other police artifacts.
  • Lucy Gibbon, Acting Senior Archivist from Orkney Library & Archive will round off the event with stories from the Orkney Archives.

We hope to see you there! Do follow #folklorelibrary for Twitter chat during the conference.

This post is by Digital Curator Stella Wisdom (@miss_wisdom)

02 June 2021

Triangulating Bermuda, Detroit and William Wallace

Last Monday I began a work placement with the British Library working with its Wikimedian-in-Residence, Dr Lucy Hinnie, to add information and text from the India Office Records to Wikisource and Wikidata.

My first day mainly consisted of several different meetings. I was introduced to the team dealing with the India Office Records, which really helped me to get a better sense of the importance of the project and its key objectives. I then attended a metadata workshop (metadata is, generally speaking, data about data, e.g. the author of a book, the time a photo was taken, etc.). This introduced me to the British Library’s current metadata practices and will be very useful when I begin to upload data to Wikidata, in ensuring it is as useful as possible. Finally, I attended a meeting with the curators of the Contemporary British collections, which gave me an overview of the range of the Library’s activities online, its current and future exhibitions and its holdings.

On my second day, I finished my basic Wikipedia training and moved on to getting fully registered, which is needed if you want to add new pages to Wikipedia. This requires 10 edits to existing Wikipedia pages. The fastest way to do this was by completing Citation Hunt, according to Dr Hinnie. What she did not mention was that Citation Hunt is roughly what would happen if the British Library catalogue and the Easter Bunny came together to plan an Easter egg hunt in St Pancras.

Screen grab showing the interface for Citation Hunt
Screenshot of Citation Hunt

Citation Hunt gives a random passage of Wikipedia in need of citation and you can either add a citation or skip to another. As you might imagine, these pages are completely unrelated to one another. As such, Citation Hunt had me trawling the internet for such delights as:

• Proof that William Wallace appeared in Age of Empires II. Unfortunately, ‘I remember that bit from when I played’ does not meet Wikipedia’s reliable source guidelines. (William Wallace - Wikipedia)

• A discussion of the OECD ‘Acquis Communautaire.’ (Acquis communautaire - Wikipedia)

• The amount of RAM in an Atari 1040ST, even though that computer is well and truly before my time. (Atari ST - Wikipedia)

• Evidence that Bill Gates invested in a particular company. (Bill Gates - Wikipedia)

I also found myself lost in the Bermudan Economy (Economy of Bermuda - Wikipedia) and growing into researching commercial agriculture (Ethylene - Wikipedia). Most surreal of all was adding directions from Google Maps for the relative locations of two places in Detroit. (Detroit - Wikipedia) I have never been to Detroit…

Ending my first week, I attended a meeting of the British Library’s Digital Scholarship team. It was really interesting to hear about all the different digital initiatives going on, both within the BL and in partnership with other organisations.

This week, I'm having further training on the tools I will need to use for this project and then, for the remaining four weeks of the placement, I will be uploading and enriching data from the India Office Records.

I look forward to updating you soon on the progress I make!

This is a guest post by UCL Digital Humanities MA student Dominic Kane.