Digital scholarship blog

Enabling innovative research with British Library digital collections

Introduction

Tracking exciting developments at the intersection of libraries, scholarship and technology.

04 May 2023

Webinar on Open Scholarship in GLAMs through Research Repositories

If you work in the galleries, libraries, archives, and museums (GLAM) sector and want to learn more about research repositories, then join us on Thursday 18 May for an online repository training session for cultural heritage professionals.

Image of man looking at a poster that says 'Open Scholarship in GLAMs through Research Repositories - Webinar on 18 May, Thursday - Register at bit.ly/BLrepowebinar'

This event is part of the Library’s Repository Training Programme for Cultural Heritage Professionals. It has been designed, based on the input received from previous repository training events, to explore some areas of open scholarship further. These include, but are not limited to, research activities in GLAM, the benefits of research repositories, scholarly publishing, research data management and digital preservation in scholarly communications.

 

Who is it for?

It is intended for those who are working in cultural heritage or a collection-holding organisation in roles where they are involved in managing digital collections, supporting the research lifecycle from funding to dissemination, providing research infrastructure and developing policies. However, anyone interested in the given topics is welcome to attend!

 

Programme

13.00                  Welcome and introductions

      Susan Miles, Scholarly Communications Specialist, British Library

Session 1          Open scholarship in GLAM research  

13.15                  Repositories to facilitate open scholarship

     Jenny Basford, Repository Services Lead, British Library

13.40                 Scholarly publishing dynamics in the GLAM environment

     Ilkay Holt, Scholarly Communications Lead, British Library

14.05                  Q&A

14.20                 Break time

Session 2          Building openness in GLAM research  

14.40                  Research data management

      Jez Cope, Data Services Lead, British Library

15.05                  Digital preservation and scholarly communications

      Neil Jefferies, Head of Innovation, Bodleian Libraries

15.30                  Q&A

15.45                  Closing

 

Register!

The event will take place from 13.00 to 15.45 on Thursday 18 May. Please register at this link to receive your access link for the online session.

 

What is next?

The last training event of the Library’s Repository Training Programme will be held on 31 May in Cardiff, hosted by the National Museum Cardiff. It will be an update and re-run of the previous face-to-face events. More information about the programme and the registration link can be found in this blog post.

Please contact [email protected] if you have any questions or comments about the events.

 

Previous Events

31 January, in-person, Edinburgh, hosted by the National Museums Scotland

8 March, online, hosted by the British Library

31 March, in-person, York, hosted by the Archaeology Data Service at the University of York

 

About British Library’s Repository Training Programme

The Library’s Repository Training Programme for cultural heritage professionals is funded as part of AHRC’s iDAH programme to support cultural heritage organisations in establishing or expanding open scholarship activities and sharing their outputs through research repositories. You can read more about the scoping report and the development of this training programme in this blog post.

02 May 2023

Detecting Catalogue Entries in Printed Catalogue Data

This is a guest blog post by Isaac Dunford, MEng Computer Science student at the University of Southampton. Isaac reports on his Digital Humanities internship project supervised by Dr James Baker.

Introduction

The purpose of this project has been to investigate and implement different methods for detecting catalogue entries within printed catalogues. For whilst printed catalogues are easy enough to digitise and convert into machine-readable data, dividing that data by catalogue entry requires turning the visual signifiers of divisions between entries - gaps in the printed page, large or upper-case headers, catalogue references - into machine-readable information. The first part of this project involved experimenting with XML-formatted data derived from the 13-volume Catalogue of books printed in the 15th century now at the British Museum (described by Rossitza Atanassova in a post announcing her AHRC-RLUK Professional Practice Fellowship project) and trying to find the best ways to detect individual entries and reassemble them as data (given that the text for a single catalogue entry may be spread across multiple pages of a printed catalogue). The next part involved building a complete system based on this approach that takes the large number of XML files for a volume and outputs all of the catalogue entries in a series of desired formats. This post describes our initial experiments with that data, the approach we settled on, and key features of our approach that you should be able to reapply to your catalogue data. All data and code can be found on the project GitHub repo.

Experimentation

The catalogue data was exported from Transkribus in two different formats: an ALTO XML schema and a PAGE XML schema. The ALTO layout encodes positional information about each element of the text (that is, where each word occurs relative to the top left corner of the page), which makes spatial analysis - such as looking for gaps between lines - possible. However, it also creates data files that are heavily encoded, meaning that it can be difficult to extract the text elements from them. The PAGE schema, by contrast, makes it easier to access the text elements in the files.
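To illustrate the difference between the two exports, here is a minimal Python sketch; it is not the project's code. It assumes the element names commonly used in each schema (TextLine, TextEquiv and Unicode in PAGE; TextLine and String with CONTENT, HPOS and VPOS attributes in ALTO). The sample snippets are simplified stand-ins with namespaces omitted, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, invented samples: real exports carry namespaces and many more attributes.
PAGE_SAMPLE = """<PcGts><Page><TextRegion>
<TextLine><Coords points="120,300 520,300 520,330 120,330"/>
<TextEquiv><Unicode>IB. 39624</Unicode></TextEquiv></TextLine>
</TextRegion></Page></PcGts>"""

ALTO_SAMPLE = """<alto><Layout><Page><PrintSpace><TextBlock>
<TextLine HPOS="120" VPOS="300" WIDTH="400" HEIGHT="30">
<String CONTENT="IB." HPOS="120" VPOS="300"/><SP/>
<String CONTENT="39624" HPOS="180" VPOS="300"/>
</TextLine></TextBlock></PrintSpace></Page></Layout></alto>"""


def local(tag):
    """Strip any '{namespace}' prefix so the code works with or without namespaces."""
    return tag.rsplit("}", 1)[-1]


def page_lines(xml_text):
    """Return the text of each line in a PAGE document (line-level TextEquiv only)."""
    lines = []
    for el in ET.fromstring(xml_text).iter():
        if local(el.tag) != "TextLine":
            continue
        for child in el:  # direct children only, so word-level text is skipped
            if local(child.tag) == "TextEquiv":
                for uni in child:
                    if local(uni.tag) == "Unicode":
                        lines.append(uni.text or "")
    return lines


def alto_lines(xml_text):
    """Return (VPOS, HPOS, text) for each ALTO line, rebuilding text from String elements."""
    lines = []
    for el in ET.fromstring(xml_text).iter():
        if local(el.tag) != "TextLine":
            continue
        words = [s.get("CONTENT", "") for s in el if local(s.tag) == "String"]
        lines.append((float(el.get("VPOS", 0)), float(el.get("HPOS", 0)), " ".join(words)))
    return lines


print(page_lines(PAGE_SAMPLE))
print(alto_lines(ALTO_SAMPLE))
```

In this sketch the PAGE text arrives as a ready-made string per line, whereas the ALTO text has to be reassembled word by word but comes with coordinates attached.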

 

An image of a digitised page from volume 8 of the Incunabula Catalogue and the corresponding Optical Character Recognition file encoded in the PAGE XML Schema
Raw PAGE XML for a page from volume 8 of the Incunabula Catalogue

 

An image of a digitised page from volume 8 of the Incunabula Catalogue and the corresponding Optical Character Recognition file encoded in the ALTO XML Schema
Raw ALTO XML for a page from volume 8 of the Incunabula Catalogue

 

Spacing and positioning

One of the first approaches tried in this project was to use size and spacing to find entries. The intuition behind this is that there is generally a larger amount of white space around the headings in the text than there is between regular lines. And in the ALTO schema, there is information about the size of the text within each line as well as about the coordinates of the line within the page.

However, we found that using the size of the text line and/or the positioning of the lines was not effective, for three reasons. First, blank space between catalogue entries inconsistently contributed to the size of some lines. Second, whenever there were tables within the text, there would be large gaps in spacing compared to the normal text, which in turn caused those tables to be read as divisions between catalogue entries. And third, even though entry headings were visually further to the left on the page than regular text, and therefore should have had the smallest x coordinates, the materiality of the printed page was inconsistently represented as digital data, and so presented regular lines with small x coordinates that could be read - using this approach - as headings.
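For readers who want to see what a spacing-based heuristic of this kind might look like, here is a rough Python sketch. It is an illustration only, not the code we tried: the function name, the (vpos, height, text) tuple layout, the 1.5 factor and the sample lines are all assumptions.

```python
from statistics import median


def lines_after_large_gaps(lines, factor=1.5):
    """Flag lines preceded by unusually large vertical white space.

    `lines` is a list of (vpos, height, text) tuples sorted top to bottom.
    A gap counts as large when it exceeds `factor` times the median gap
    (both the tuple layout and the factor are hypothetical choices).
    """
    gaps = [nxt[0] - (cur[0] + cur[1]) for cur, nxt in zip(lines, lines[1:])]
    if not gaps:
        return []
    typical = median(gaps)
    return [lines[i + 1][2] for i, gap in enumerate(gaps) if gap > factor * typical]


# Invented positions: the fourth line sits below a wider gap than the others.
sample = [
    (100, 20, "first line of an entry"),
    (130, 20, "second line"),
    (160, 20, "third line"),
    (220, 20, "A NEW HEADING"),
    (250, 20, "line after the heading"),
]
print(lines_after_large_gaps(sample))
```

On tidy input like this the heading is flagged correctly, but a table or an inconsistently measured line produces exactly the same signal, which is why a heuristic of this shape proved unreliable on the real data.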

Final Approach

Entry Detection

Our chosen approach uses the data in the PAGE XML schema, and is bespoke to the data for the Catalogue of books printed in the 15th century now at the British Museum as produced by Transkribus (and indeed, to the version of Transkribus: having built our code around some initial exports, running it over the later volumes - which had been digitised last - threw an error due to some slight changes to the exported XML schema).

The code takes the XML input and finds entries using a content-based approach that looks for features at the start and end of each catalogue entry. After experimenting with different approaches, we found that the most consistent way to detect the catalogue entries was to:

  1. Find the “reference number” (e.g. IB. 39624) which is always present at the end of an entry.
  2. Find a date that is always present after an entry heading.

This gave us the ability to contextually infer the presence of a split between two catalogue entries. The main limitation of this approach is the quality of the Optical Character Recognition (OCR) at the points where the references and dates occur in the printed volumes.
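The two steps above can be sketched in Python roughly as follows. This is a simplified illustration under stated assumptions, not the code in the project repo: the two regular expressions, the function names, the three-line window and the sample lines are all hypothetical, and only the reference number IB. 39624 comes from the catalogue itself.

```python
import re

# Hypothetical patterns: a shelfmark-like reference such as "IB. 39624",
# and a four-digit year between 1400 and 1599 standing in for "a date".
REFERENCE = re.compile(r"\bI[A-C]\.\s*\d{3,6}\b")
DATE = re.compile(r"\b1[45]\d{2}\b")


def split_entries(lines):
    """Close an entry each time a line containing a reference number is seen."""
    entries, current = [], []
    for line in lines:
        current.append(line)
        if REFERENCE.search(line):
            entries.append(current)
            current = []
    if current:
        # Trailing lines without a reference: the entry may continue on the next page.
        entries.append(current)
    return entries


def looks_like_entry(entry, window=3):
    """Sanity check: a date near the start and a reference number on the last line."""
    has_date = any(DATE.search(line) for line in entry[:window])
    return has_date and bool(REFERENCE.search(entry[-1]))


# Invented sample lines for illustration.
sample_lines = [
    "FIRST AUTHOR. First title.",
    "1475.",
    "Description of the first book.",
    "IB. 39624.",
    "SECOND AUTHOR. Second title.",
    "1477.",
    "Description of the second book.",
    "IA. 12345.",
    "THIRD AUTHOR. Third title.",
]
entries = split_entries(sample_lines)
print([looks_like_entry(entry) for entry in entries])
```

The final, incomplete entry fails the check because its reference number has not yet been seen, which is the kind of case that arises when an entry runs over a page break. A misread reference or date in the OCR would break the split in the same way.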

 

An image of a digitised page with a catalogue entry and the corresponding text output in XML format
XML of a detected entry

 

Language Detection

The reason for dividing catalogue entries in this way was to facilitate analysis of the catalogue data, specifically analysis that sought to define the linguistic character of descriptions in the Catalogue of books printed in the 15th century now at the British Museum and how those descriptions changed and evolved across the thirteen volumes. Segments of each catalogue entry contain text transcribed from the incunabula that was not written by a cataloguer (and is therefore not part of their cataloguing ‘voice’), and those transcribed sections are in French, Dutch, Old English, and other languages that a machine could detect as not being modern English. To further facilitate research use of the final data, one of the extensions we implemented was therefore to label sections of each catalogue entry by language. This was achieved using a Python library for language detection and then - for a particular output type - replacing non-English sections of text with a placeholder (e.g. NON-ENGLISH SECTION). The language detection model does not detect Old English, and as a result varies between assigning those sections labels for different languages, but it was still able to break the blocks of text in each catalogue entry into English and non-English sections.
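The placeholder step can be sketched as below. This post does not name the language detection library, so the sketch accepts any detector as a function and, to stay self-contained, ships with a deliberately crude stand-in based on common English words. That stand-in, the 0.2 threshold, the function names and the sample sections are all assumptions for illustration; only the placeholder text comes from our output format.

```python
PLACEHOLDER = "NON-ENGLISH SECTION"

# Toy stand-in for a real language detection library (an assumption, not our method).
ENGLISH_HINTS = {"the", "of", "and", "in", "with", "a", "is", "on", "at", "by"}


def toy_detect(text):
    """Return 'en' when at least 20% of the words are common English words."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    if not words:
        return "en"
    hits = sum(w in ENGLISH_HINTS for w in words)
    return "en" if hits / len(words) >= 0.2 else "other"


def english_only(sections, detect=toy_detect):
    """Keep English sections and collapse each run of other sections into one placeholder."""
    output = []
    for section in sections:
        if detect(section) == "en":
            output.append(section)
        elif not output or output[-1] != PLACEHOLDER:
            output.append(PLACEHOLDER)
    return output


# Invented sections: cataloguer's English around transcribed non-English text.
sections = [
    "Folio. With the device of the printer at the end.",
    "Incipit liber primus de vita beata.",
    "Explicit feliciter.",
    "The first leaf is blank.",
]
print(english_only(sections))
```

Passing the detector in as a parameter means a real library can be swapped in without touching the placeholder logic, and it does not matter which non-English label a section receives, only that it is not English.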

 

Text files for catalogue entry number IB39624 showing the full text and the detected English-only sections.
Text outputs of the full and English-only sections of the catalogue entry

 

Poorly Scanned Pages

Another extension for this system was to use the input data to try to determine whether a page had been poorly scanned: for example, where the lines in the XML input read from one column straight into another as a single line (rather than the XML reading order following the visual signifiers of column breaks). The system detects poorly scanned pages by looking at the lengths of all lines in the PAGE XML, establishing which lines deviate substantially from the mean line length, and, if sufficient outliers are found, marking the page as poorly scanned.
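A minimal Python sketch of this kind of check is shown below. The use of the standard deviation to decide what counts as a substantial deviation, the two threshold values and the function name are assumptions made for illustration; the repo may define outliers differently.

```python
from statistics import mean, pstdev


def is_poorly_scanned(lines, z_threshold=2.0, max_outlier_fraction=0.1):
    """Flag a page when too many lines deviate strongly from the mean line length.

    A line is an outlier when its length is more than `z_threshold` standard
    deviations from the mean; the page is flagged when outliers make up more
    than `max_outlier_fraction` of its lines (both values are hypothetical).
    """
    lengths = [len(line) for line in lines if line.strip()]
    if len(lengths) < 2:
        return False
    mu, sigma = mean(lengths), pstdev(lengths)
    if sigma == 0:
        return False  # every line has the same length, so nothing stands out
    outliers = sum(abs(n - mu) > z_threshold * sigma for n in lengths)
    return outliers / len(lengths) > max_outlier_fraction


# Invented pages: one with even line lengths, one where three lines run across two columns.
clean_page = ["x" * n for n in [38, 39, 40, 41, 42] * 4]
merged_columns_page = ["x" * 40] * 18 + ["x" * 80] * 3
print(is_poorly_scanned(clean_page), is_poorly_scanned(merged_columns_page))
```

Lines that run from one column straight into the next are roughly twice as long as their neighbours, so they stand out clearly against the mean.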

Key Features

The key part of this system that can be taken and applied to a different problem is the method for detecting entries. We expect that the fundamental method of looking for marks in the page content to identify the start and end of catalogue entries in the XML files would be applicable to other data derived from printed catalogues. The only parts of the algorithm that would need changing for a new system are the regular expressions used to find the start and end of the catalogue entries. And as long as the XML input comes in the same schema, the code should be able to consistently divide up the volumes into individual catalogue entries.

19 April 2023

Repository Training Day in Cardiff: Research in GLAM and research repositories to facilitate open scholarship activities for cultural heritage organisations

If you work in the galleries, libraries, archives, and museums (GLAM) sector and want to learn more about research repositories, then register for a hybrid repository training day for cultural heritage professionals hosted by the National Museum Cardiff in Wales on 31 May 2023.  

The British Library’s Repository Training Programme for cultural heritage professionals is funded as part of AHRC’s iDAH programme to support GLAM organisations in establishing or expanding open scholarship activities and sharing their outputs through research repositories.  

Manuscript illustration of Cardiff from the 17th Century showing a river, fields, a church and other small buildings
Insert from John Speed's county maps of Wales, first published in The Theatre of the Empire of Great Britain by George Humble (1610), made available by the National Library of Wales via Flickr Commons

Background 

The very first in-person event was in Edinburgh in January, with a follow-up online session in March and a second in-person event in York, hosted by the Archaeology Data Service (ADS) at the University of York on 23 March.  

We had attendees from the British Museum, National Museums Scotland, National Portrait Gallery, Towards a National Collection (AHRC) and the ADS in various roles including scholarly communications librarian, digital archivist, project manager and senior researchers in their organisations.  

The full programme for this event is available in a previous blog post. During the event, conversations took place on a range of topics, from policy development and embedding research culture in organisations to encouraging staff to be involved in research cycles and the different types of workflows in different institutions. The feedback we received from the audience showed a need to explore research data management, scholarly publishing, challenges in smaller organisations, working with emerging formats and building communities of practice in more depth.  

Now looking forward, the last hybrid repository training event will be hosted by the National Museum Cardiff in Wales on Wednesday 31 May. You can see the details below and register here. We are looking forward to meeting everyone who is interested in learning more about research repositories from cultural heritage organisations.  

 

Who is this training for? 

We invite everyone who is working in a cultural heritage or collection-holding organisation in roles where they are involved in managing digital collections, supporting the research lifecycle from funding to dissemination, providing research infrastructure and developing policies. However, anyone interested in the given topics is welcome to attend. 

 

What will you learn? 

This one-day training session is designed as a starting point to a broader set of knowledge that will help you to: 

 

  • Understand the research landscape in cultural heritage organisations, benefits of openness for heritage research, basic concepts of open principles and influencing decision makers 
  • Lay foundations for repository services including stakeholder engagement, policy development, technical overview and project planning 
  • Adopt common principles and frameworks, technical standards and requirements in establishing repository services in a cultural heritage organisation 
  • Explore basics of the scholarly communications ecosystem in the context of cultural heritage practices. 

 

Prerequisites 

No previous knowledge of topics is required. However, an understanding of open access will maximise the benefit of the taught content for attendees.  

 

Programme  

10:30 - Welcome and introductions

10:50 - iDAH Programme 

    Joanna Dunster, Head of (Research) Infrastructure, AHRC

11:05 - Session 1 Opening up heritage research 

This session covers the topics of understanding the research landscape in GLAM organisations, benefits of openness for heritage research, basic concepts of open principles and frameworks. 

    Ilkay Holt, Scholarly Communications Lead, BL

    Susan Miles, Scholarly Communications Specialist, BL

11:45 - Q&A / Discussion

12:00 - Break  

12:15 - Session 2 Getting started with heritage GLAM repositories  

This session covers topics on the role of repository infrastructure in open access to heritage research and positioning research repositories in an organisation including policy and development. 

    Ilkay Holt, Scholarly Communications Lead, BL

    Susan Miles, Scholarly Communications Specialist, BL

12:45 - Lunch

13:30 - Session continues 

13:55 - Q&A / Discussion

14:10 - Session 3: Realising and expanding the benefits 

This session covers a technical overview and the requirements for running a cultural heritage repository, including an overview of BL’s Shared Research Repository, platforms and software, content administration and technical features.   

    Graham Jevon, Digital Services Specialist, BL

    Nora Ramsey, Assistant to Digital Services Specialist, BL

14:30 - Break

14:40 - Session continues

15:00 - Q&A / Discussion 

15:15-15:30 - Closing Remarks

 

Book your place 

In-person sessions are planned for a maximum of 35 people per event and registrants from cultural heritage institutions will be prioritised. Registration for the event is free. Please fill in this form to book your place.

Please note that registrations for in-person attendance will close at 4pm Friday 26th May and confirmation for in-person attendance will be sent to the registered email address.

Registrations for online attendance will close at 6pm on Tuesday 30th May. The Zoom access link will be sent to the registered email address the day prior to the event. 

Members of the Research Infrastructure Services Team at the British Library will be delivering the training programme. The team has over 25 years of broad experience and extensive knowledge in supporting open scholarship across the sector and with international partners. They also provide a Shared Research Repository Service for cultural heritage organisations.  

Please contact [email protected] if you have any questions or comments about this training programme.