Digital scholarship blog

Enabling innovative research with British Library digital collections

2 posts from May 2023

04 May 2023

Webinar on Open Scholarship in GLAMs through Research Repositories

If you work in the galleries, libraries, archives, and museums (GLAM) sector and want to learn more about research repositories, then join us on Thursday 18 May for an online repository training session for cultural heritage professionals.

Image of a man looking at a poster that says 'Open Scholarship in GLAMs through Research Repositories - Webinar on 18 May, Thursday - Register at bit.ly/BLrepowebinar'

This event is part of the Library’s Repository Training Programme for Cultural Heritage Professionals. It has been designed based on input from previous repository training events (this, this and this) to explore some areas of open scholarship further. These include, but are not limited to, research activities in GLAM, the benefits of research repositories, scholarly publishing, research data management, and digital preservation in scholarly communications.


Who is it for?

It is intended for those working in cultural heritage or collection-holding organisations, in roles that involve managing digital collections, supporting the research lifecycle from funding to dissemination, providing research infrastructure, or developing policies. However, anyone interested in the topics listed is welcome to attend!


Programme

13.00   Welcome and introductions
        Susan Miles, Scholarly Communications Specialist, British Library

Session 1: Open scholarship in GLAM research

13.15   Repositories to facilitate open scholarship
        Jenny Basford, Repository Services Lead, British Library

13.40   Scholarly publishing dynamics in the GLAM environment
        Ilkay Holt, Scholarly Communications Lead, British Library

14.05   Q&A

14.20   Break

Session 2: Building openness in GLAM research

14.40   Research data management
        Jez Cope, Data Services Lead, British Library

15.05   Digital preservation and scholarly communications
        Neil Jefferies, Head of Innovation, Bodleian Libraries

15.30   Q&A

15.45   Closing


Register!

The event will take place from 13.00 to 15.45 on Thursday 18 May. Please register at this link to receive your access link for the online session.


What is next?

The last training event of the Library’s Repository Training Programme will be held on 31 May in Cardiff, hosted by National Museum Cardiff. It will be an updated re-run of the previous face-to-face events. More information about the programme and the registration link can be found in this blog post.

Please contact [email protected] if you have any questions or comments about the events.


Previous Events

31 January, in-person, Edinburgh, hosted by National Museums Scotland

8 March, online, hosted by the British Library

31 March, in-person, York, hosted by the Archaeology Data Service at the University of York


About British Library’s Repository Training Programme

The Library’s Repository Training Programme for cultural heritage professionals is funded as part of AHRC’s iDAH programme to support cultural heritage organisations in establishing or expanding open scholarship activities and sharing their outputs through research repositories. You can read more about the scoping report and the development of this training programme in this blog post.

02 May 2023

Detecting Catalogue Entries in Printed Catalogue Data

This is a guest blog post by Isaac Dunford, MEng Computer Science student at the University of Southampton. Isaac reports on his Digital Humanities internship project supervised by Dr James Baker.

Introduction

The purpose of this project has been to investigate and implement different methods for detecting catalogue entries within printed catalogues. Whilst printed catalogues are easy enough to digitise and convert into machine-readable data, dividing that data by catalogue entry requires translating the visual signifiers of divisions between entries - gaps in the printed page, large or upper-case headers, catalogue references - into machine-readable information. The first part of this project involved experimenting with XML-formatted data derived from the 13-volume Catalogue of books printed in the 15th century now at the British Museum (described by Rossitza Atanassova in a post announcing her AHRC-RLUK Professional Practice Fellowship project) and finding the best ways to detect individual entries and reassemble them as data (given that the text for a single catalogue entry may be spread across multiple pages of a printed catalogue). The next part of the project involved building a complete system based on this approach, taking the full set of XML files for a volume and outputting all of the catalogue entries in a series of desired formats. This post describes our initial experiments with that data, the approach we settled on, and the key features of our approach that you should be able to reapply to your own catalogue data. All data and code can be found on the project GitHub repo.

Experimentation

The catalogue data was exported from Transkribus in two different formats: an ALTO XML schema and a PAGE XML schema. The ALTO layout encodes positional information about each element of the text (that is, where each word occurs relative to the top left corner of the page), which makes it helpful for spatial analysis such as looking for gaps between lines. However, it also creates heavily encoded data files, meaning that it can be difficult to extract the text elements from them. The PAGE schema, by contrast, makes it easier to access the text elements in the files.
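
To make that difference concrete, here is a minimal sketch (in Python, and not the project's own code) of pulling the transcribed text out of a PAGE XML export. It assumes the PRImA 2013-07-15 PAGE namespace that Transkribus typically declares; your exports may use a different version.

    # Minimal sketch: extract transcribed text lines from a PAGE XML file.
    # Assumes the 2013-07-15 PAGE namespace; adjust NS if your export differs.
    import xml.etree.ElementTree as ET

    NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

    def page_text_lines(path):
        """Return the transcribed text of each TextLine on a PAGE XML page."""
        root = ET.parse(path).getroot()
        lines = []
        for line in root.iterfind(".//pc:TextLine", NS):
            # Each TextLine carries its transcription in TextEquiv/Unicode.
            unicode_el = line.find("pc:TextEquiv/pc:Unicode", NS)
            if unicode_el is not None and unicode_el.text:
                lines.append(unicode_el.text)
        return lines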

Raw PAGE XML for a page from volume 8 of the Incunabula Catalogue

Raw ALTO XML for a page from volume 8 of the Incunabula Catalogue

Spacing and positioning

One of the first approaches tried in this project was to use size and spacing to find entries. The intuition behind this is that there is generally a larger amount of white space around headings than between regular lines of text. And in the ALTO schema, there is information about the size of the text within each line, as well as the coordinates of that line within the page.
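
As an illustration of that intuition, a hypothetical gap-based detector over the ALTO output might compare the vertical gap above each line with the typical gap on the page, assuming TextLine elements carrying VPOS and HEIGHT attributes (this is a sketch, not the project's code):

    # Sketch: flag lines preceded by an unusually large vertical gap,
    # which - in theory - marks entry headings. Namespace varies by ALTO version.
    import xml.etree.ElementTree as ET

    NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

    def lines_after_large_gaps(path, gap_factor=2.0):
        root = ET.parse(path).getroot()
        # Order lines top-to-bottom by their vertical position on the page.
        lines = sorted(root.iterfind(".//alto:TextLine", NS),
                       key=lambda l: float(l.get("VPOS", 0)))
        gaps = []
        for prev, cur in zip(lines, lines[1:]):
            gap = (float(cur.get("VPOS", 0))
                   - (float(prev.get("VPOS", 0)) + float(prev.get("HEIGHT", 0))))
            gaps.append((gap, cur))
        if not gaps:
            return []
        typical = sorted(g for g, _ in gaps)[len(gaps) // 2]  # median gap
        return [cur for gap, cur in gaps if gap > gap_factor * typical]

As described below, this kind of approach proved unreliable on our data.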

However, we found that using the size of the text line and/or the positioning of the lines was not effective, for three reasons. First, blank space between catalogue entries inconsistently contributed to the size of some lines. Second, whenever there were tables within the text, there would be large gaps in spacing compared to the normal text, which in turn caused those tables to be read as divisions between catalogue entries. And third, even though entry headings sat visually further to the left on the page than regular text, and therefore should have had the smallest x coordinates, the materiality of the printed page was inconsistently represented in the digital data, producing regular lines with small x coordinates that could be read - using this approach - as headings.

Final Approach

Entry Detection

Our chosen approach uses the data in the PAGE XML schema, and is bespoke to the data for the Catalogue of books printed in the 15th century now at the British Museum as produced by Transkribus (and indeed, the version of Transkribus: having built our code around some initial exports, running it over the later volumes - which had been digitised last - threw an error due to some slight changes in the exported XML schema).

The code takes the XML input and finds entries using a content-based approach that looks for features at the start and end of each catalogue entry. After experimenting with different approaches, we found the most consistent way to detect the catalogue entries was to:

  1. Find the “reference number” (e.g. IB. 39624) which is always present at the end of an entry.
  2. Find a date that is always present after an entry heading.

This gave us the ability to contextually infer the presence of a split between two catalogue entries, the main limitation of which is the quality of the Optical Character Recognition (OCR) at the points where the references and dates occur in the printed volumes.
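
A minimal sketch of this two-signal approach might look as follows. The regular expressions here are illustrative stand-ins (the references look like "IB. 39624", and each entry heading is followed by a year), not the exact patterns used in the project, and they would need tuning against the volumes in hand.

    # Illustrative entry detection: close an entry at its reference number,
    # and use the heading date as a second, confirming signal.
    import re

    REFERENCE = re.compile(r"\bI[ABC]\.\s?\d{4,5}\b")  # e.g. "IB. 39624" ends an entry
    YEAR = re.compile(r"\b1[45]\d{2}\b")               # a date follows each entry heading

    def split_entries(lines):
        """Group a volume's text lines into entries, closing each at a reference."""
        entries, current = [], []
        for line in lines:
            current.append(line)
            if REFERENCE.search(line):   # reference number marks the end of an entry
                entries.append(current)
                current = []
        if current:
            entries.append(current)      # trailing lines with no closing reference
        return entries

    def starts_with_heading(entry):
        """Second signal: a date should appear near the top of a detected entry."""
        return any(YEAR.search(line) for line in entry[:3])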

XML of a detected entry

Language Detection

The reason for dividing catalogue entries in this way was to facilitate analysis of the catalogue data, specifically analysis that sought to define the linguistic character of descriptions in the Catalogue of books printed in the 15th century now at the British Museum and how those descriptions changed and evolved across the thirteen volumes. Segments of each catalogue entry contain text transcribed from the incunabula that was not written by a cataloguer (and is therefore not part of their cataloguing ‘voice’), and those transcribed sections are in French, Dutch, Old English, and other languages that a machine could detect as not being modern English. To further facilitate research use of the final data, one of the extensions we implemented was therefore to label the sections of each catalogue entry by language. This was achieved using a Python library for language detection and then - for a particular output type - replacing non-English sections of text with a placeholder (e.g. NON-ENGLISH SECTION). And whilst the language detection model does not detect Old English, and as a result varies between assigning those sections labels for different languages, it was still able to break the blocks of text in each catalogue entry into English and non-English sections.
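
For illustration, here is a minimal sketch of that labelling step, assuming the langdetect library (a common Python option, and not necessarily the one used in the project):

    # Sketch: label text sections by language and mask non-English ones.
    from langdetect import detect, DetectorFactory
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # langdetect is stochastic; fix the seed for repeatability

    def mask_non_english(sections, placeholder="NON-ENGLISH SECTION"):
        """Replace non-English blocks of a catalogue entry with a placeholder."""
        out = []
        for text in sections:
            try:
                lang = detect(text)
            except LangDetectException:  # empty or undetectable text
                lang = "unknown"
            out.append(text if lang == "en" else placeholder)
        return out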

Text outputs of the full and English-only sections of catalogue entry IB39624

Poorly Scanned Pages

Another extension for this system was to use the input data to try to determine whether a page had been poorly scanned: for example, where the lines in the XML input read from one column straight into another as a single line (rather than the XML reading order following the visual signifiers of column breaks). The system detects poorly scanned pages by looking at the lengths of all lines in the PAGE XML schema, establishing which lines deviate substantially from the mean line length, and marking the page as poorly scanned if sufficient outliers are found.
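
A minimal sketch of that check, with illustrative thresholds rather than the project's tuned values:

    # Sketch: flag a page as poorly scanned if too many line lengths
    # deviate substantially from the page's mean line length.
    from statistics import mean, stdev

    def is_poorly_scanned(line_lengths, z=2.0, max_outliers=5):
        if len(line_lengths) < 2:
            return False
        mu, sigma = mean(line_lengths), stdev(line_lengths)
        if sigma == 0:
            return False  # all lines the same length: nothing suspicious
        outliers = sum(1 for n in line_lengths if abs(n - mu) > z * sigma)
        return outliers > max_outliers

Merged columns show up as lines roughly twice the typical length, so they sit far above the mean and count as outliers.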

Key Features

The key part of this system that can be taken and applied to a different problem is the method for detecting entries. We expect that the fundamental method of looking for marks in the page content to identify the start and end of catalogue entries in the XML files would be applicable to other data derived from printed catalogues. The only parts of the algorithm that would need changing for a new system are the regular expressions used to find the start and end of each catalogue entry. And as long as the XML input comes in the same schema, the code should be able to consistently divide the volumes into individual catalogue entries.