Digital scholarship blog

139 posts categorized "Research collaboration"

26 September 2023

Let’s learn together - Join us in the Cultural Heritage Open Scholarship Network

Do you work in a Galleries-Libraries-Archives-Museums (GLAM) or cultural heritage organisation as research support or research-active staff? Are you interested in developing knowledge and skills in open scholarship? Would you like to establish good practices, share your experience with others and collaborate? If your answer is yes to one or more of these questions, we invite you to join the Cultural Heritage Open Scholarship Network (CHOSN).

CHOSN was initiated by the British Library’s Research Infrastructure Services, building on the experience of, and positive responses to, the open scholarship training programme run earlier this year. It is a community of practice for research support and research-active staff who work in GLAM organisations and are interested in developing and sharing open scholarship knowledge and skills, organising events, and supporting each other in this area.

GLAMs produce and showcase a significant amount of research, but we may find ourselves without adequate resources to make that research openly available, to gain the open scholarship skills needed to make it happen, or even to identify what counts as research in these environments. CHOSN aims to provide a platform that creates synergy among those aiming for good practice in open scholarship.

CHOSN flyer image, text says: Cultural Heritage Open Scholarship Network (CHOSN). Are you working in Galleries-Libraries-Archives-Museums (GLAMs)? Join Us! To develop knowledge and skills in open scholarship, organise activities to learn and grow, and create a community of practise to collaborate and support each other.

This network may be of interest to anyone who facilitates, enables or supports research activities in GLAM organisations, including but not limited to research support staff, research-active staff, librarians, curatorial teams, IT specialists and copyright officers. Anyone who is interested in open scholarship and works in a cultural heritage organisation is welcome.

Join us in the Cultural Heritage Open Scholarship Network (CHOSN) to:

  • explore research activities, roles in GLAMs and make them visible,
  • develop knowledge and skills in open scholarship,
  • carry out capacity development activities to learn and grow, and
  • create a community of practice to collaborate and support each other.

We have set up a JISC mailing list to start communication with the network; you can join by signing up here. We will shortly organise an online meeting to kick off the network plans, explore how to move forward and collectively discuss what we would like to do next. This will all be communicated via the CHOSN mailing list.

If you have any questions about CHOSN, we are happy to hear from you at [email protected].

14 September 2023

What's the future of crowdsourcing in cultural heritage?

The short version: crowdsourcing in cultural heritage is an exciting field, rich in opportunities for collaborative, interdisciplinary research and practice. It includes online volunteering, citizen science, citizen history, digital public participation, community co-production, and, increasingly, human computation and other systems that will change how participants relate to digital cultural heritage. New technologies like image labelling, text transcription and natural language processing, plus trends in organisations and societies at large mean constantly changing challenges (and potential). Our white paper is an attempt to make recommendations for funders, organisations and practitioners in the near and distant future. You can let us know what we got right, and what we could improve by commenting on Recommendations, Challenges and Opportunities for the Future of Crowdsourcing in Cultural Heritage: a White Paper.

The longer version: The Collective Wisdom project was funded by an AHRC networking grant to bring experts from the UK and the US together to document the state of the art in designing, managing and integrating crowdsourcing activities, and to look ahead to future challenges and unresolved issues that could be addressed by larger, longer-term collaboration on methods for digitally-enabled participation.

Our open access Collective Wisdom Handbook: perspectives on crowdsourcing in cultural heritage is the first outcome of the project; our expert workshops were a second.

Mia (me) and Sam Blickhan launched our White Paper for comment on pubpub at the Digital Humanities 2023 conference in Graz, Austria, in July this year, with Meghan Ferriter attending remotely. Our short paper abstract and DH2023 slides are online at Zenodo.

So - what's the future of crowdsourcing in cultural heritage? Head on over to Recommendations, Challenges and Opportunities for the Future of Crowdsourcing in Cultural Heritage: a White Paper and let us know what you think! You've got until the end of September…

You can also read our earlier post on 'community review' for a sense of the feedback we're after - in short, what resonates, what needs tweaking, what examples could we include?

To whet your appetite, here's a preview of our five recommendations. (To find out why we make those recommendations, you'll have to read the White Paper):

  • Infrastructure: Platforms need sustainability. Funding should not always be tied to novelty, but should also support the maintenance, uptake and reuse of well-used tools.
  • Evidencing and Evaluation: Help create an evaluation toolkit for cultural heritage crowdsourcing projects; provide ‘recipes’ for measuring different kinds of success. Shift thinking about value from output/scale/product to include impact on participants' and community well-being.
  • Skills and Competencies: Help create a self-guided skills inventory assessment resource, tool, or worksheet to support skills assessment, and develop workshops to support their integrity and adoption.
  • Communities of Practice: Fund informal meetups, low-cost conferences, peer review panels, and other opportunities for creating and extending community. They should have an international reach, e.g. beyond the UK-US limitations of the initial Collective Wisdom project funding.
  • Incorporating Emergent Technologies and Methods: Fund educational resources and workshops to help the field understand opportunities, and anticipate the consequences of proposed technologies.

What have we missed? Which points do you want to boost? (For example, we discovered how many of our points apply to digital scholarship projects in general). You can '+1' on points that resonate with you, suggest changes to wording, ask questions, provide examples and references, or (constructively, please) challenge our arguments. Our funding only supported participants from the UK and US, so we're very keen to hear from folk from the rest of the world.

06 September 2023

Open and Engaged 2023: Community over Commercialisation

The British Library is delighted to host its annual Open and Engaged Conference on Monday 30 October, in-person and online, as part of International Open Access Week.

Open and Engaged 2023: Community over Commercialisation, includes headshots of speakers and lists location as The British Library, London and contact as openaccess@bl.uk

In line with this year’s #OAWeek theme, Open and Engaged 2023: Community over Commercialisation will address approaches and practices in open scholarship that prioritise the best interests of the public and the research community. The programme will focus on community governance, public-private collaborations and community building, keeping the public good at the heart of the talks. It will underline different priorities and approaches for Galleries-Libraries-Archives-Museums (GLAMs) and the cultural sector in the context of open access.

We invite everyone interested in the topic to join us on Monday, 30 October!

This will be a hybrid event taking place at the British Library’s Knowledge Centre in St. Pancras, London, and streamed online for those unable to attend in-person.

You can register for Open and Engaged 2023 by filling in this form by Thursday, 26 October 18:00 BST. Please note that the places for in-person attendance are now full and the form is available only for online booking.

Registrants will be contacted with details for either in-person attendance or a link to access the online stream closer to the event.

Programme

Note that clocks change back to GMT in UK on Sunday, 29 October.

9:30     Registration opens for in-person attendees. Entrance Hall at the Knowledge Centre.

10:00   Welcome

10:10   Keynote from Monica Westin, Senior Product Manager at the Internet Archive

Commercial Break: Imagining new ownership models for cultural heritage institutions.

10:40   Session on public-private collaborations for public good chaired by Liz White, Director of Library Partnerships at the British Library.

  • Balancing public-private partnerships with responsibilities to our communities. Mia Ridge, Digital Curator, Western Heritage Collections, The British Library
  • Where do I stand? Deconstructing Digital Collections [Research] Infrastructures: A perspective from Towards a National Collection. Javier Pereda, Senior Researcher of the Towards a National Collection (TaNC)
  • "This is not IP I'm familiar with." The strange afterlife and untapped potential of public domain content in GLAM institutions. Douglas McCarthy, Head of Library Learning Centre, Delft University of Technology.

11:40   Break

12:10   Lightning talks on community projects chaired by Graham Jevon, Digital Service Specialist at the British Library.

  • The Turing Way: Community-led Resources for Open Research and Data Science. Emma Karoune, Senior Research Community Manager, The Alan Turing Institute.
  • Open Online Tools for Creating Interactive Narratives. Giulia Carla Rossi, Curator for Digital Publications and Stella Wisdom, Digital Curator for Contemporary British Collections, The British Library

12:45   Lunch

13:30   Session on the community-centred infrastructure in practice chaired by Jenny Basford, Repository Services Lead at the British Library.

  • AHRC, Digital Research Infrastructure and where we want to go with it. Tao Chang, Associate Director, Infrastructure & Major Programmes, Arts and Humanities Research Council (AHRC)
  • The critical role of repositories in advancing open scholarship. Kathleen Shearer, Executive Director, Confederation of Open Access Repositories (COAR). (Remote talk)
  • Investing in the Future of Open Infrastructure. Kaitlin Thaney, Executive Director, Invest in Open Infrastructure (IOI). (Remote talk)

14:30   Break

15:00   Session on the role of research libraries in prioritizing the community chaired by Ian Cooke, Head of Contemporary British Publications at the British Library.

  • Networks of libraries supporting open access book publishing. Rupert Gatti, Co-founder and Director of Open Book Publishers, Director of Studies in Economics at Trinity College, Cambridge
  • Collective action for driving open science agenda in Africa and Europe. Iryna Kuchma, Open Access Programme Manager at EIFL. (Remote talk)
  • The Not So Quiet Rights Retention Revolution: Research Libraries, Rights and Supporting our Communities. William Nixon, Deputy Executive Director at RLUK-Research Libraries UK

16:00   Closing remarks

Social media hashtag for the event is #OpenEngaged. If you have any questions, please contact us at [email protected].

03 August 2023

My AHRC-RLUK Professional Practice Fellowship: A year on

A year ago I started work on my RLUK Professional Practice Fellowship project to analyse computationally the descriptions in the Library’s incunabula printed catalogue. As the project comes to a close this week, I would like to give an update on the work of the last few months, which led to the publication of the incunabula printed catalogue data as a featured collection on the British Library’s Research Repository. In a separate blogpost I will discuss the findings from the text analysis and next steps, as well as share my reflections on the fellowship experience.

Since Isaac’s blogpost about the automated detection of the catalogue entries in the OCR files, a lot of effort has gone into improving the code and outputting the descriptions in the format required for the text analysis and as open datasets. With the invaluable help of Harry Lloyd, who had joined the Library’s Digital Research team as Research Software Engineer, we verified the results and identified new rules for detecting sub-entries signalled by Another Copy rather than a main entry heading. We also reassembled and parsed the XML files, originally split in two sets per volume for the purpose of generating the OCR, so that the entries are listed in the order in which they appear in the printed volume. We prepared new text files containing all the entries from each volume, with each entry represented as a single line of text, which I could use for the corpus linguistics analysis with AntConc. In consultation with the Curator, Karen Limper-Herz, and colleagues in Collection Metadata, we agreed how best to store the data for evaluation and in preparation to update the Library’s online catalogue.

Two women looking at the poster illustrating the text analysis with the incunabula catalogue data
Poster session at Digital Humanities Conference 2023

Whilst all this work was taking place, I started the computational analysis of the English text from the descriptions. The reason for using these partial descriptions was to separate what was merely transcribed from the incunabula from the language used by the cataloguer in their own ‘voice’. I have recorded my initial observations in the poster I presented at the Digital Humanities Conference 2023. Discussing my fellowship project with the conference attendees was extremely rewarding; there was much interest in the way I had used Transkribus to derive the OCR data, some questions about how the project methodology applies to other data, and agreement on the need to contextualise collections descriptions and reflect on any bias in the transmission of knowledge. In the poster I also highlight the importance of the cross-disciplinary collaboration required for this type of work, which resonated well with the conference theme of Collaboration as Opportunity.

I have started disseminating the knowledge gained from the project with members of the GLAM community. At the British Library Harry, Karen and I ran an informal ‘Hack & Yack’ training session showcasing the project aims and methodology through the use of Jupyter notebooks. I also enjoyed the opportunity to discuss my research at a recent Research Libraries UK Digital Scholarship Network workshop and look forward to further conversations on this topic with colleagues in the wider GLAM community. 

We intend to continue to enrich the datasets to enable better access to the collection and the development of new resources for incunabula research and digital scholarship projects. I would like to end by adding my thanks to Graham Jevon, for assisting with the timely publication of the project datasets, and above all to James, Karen and Harry for supporting me throughout this project.

This blogpost is by Dr Rossitza Atanassova, Digital Curator, British Library. She is on Twitter @RossiAtanassova  and Mastodon @[email protected]

 

14 July 2023

Share Family: British National Bibliography (Beta) service is live

Contents

Introduction

Share Family and National Bibliographies

       What is a National bibliography?

       BNB in the Share Family

Benefits

Future developments

Beta service

Further information

 

Introduction

The British National Bibliography (BNB), first published in January 1950, is a weekly listing of new books and journals published or distributed in the United Kingdom and the Republic of Ireland.  Over the last seventy-three years, the BNB has adapted to changing customer needs by embracing new technologies, from cards in the 1950s to mark-up languages for data exchange in the 1970s and CD-ROM in the 1980s. The BNB now provides online access to details of over 5 million publications and forthcoming titles, ranging in scope from computer science to history, from novels to textbooks.

 

Two examples of bibliographies including information like title, author, place of publication, year, description, prices etc.
1. Examples of British National Bibliography records, April 19th 2023.

In 2011, the Library launched the Linked Open Data BNB.  At that time, linked data was an emerging technology using Web protocols to link data sets, as envisaged in Sir Tim Berners-Lee’s concept of a Semantic Web[1].  Our initial foray into linked data was successful from a technical perspective. We were able to convert BNB data held in Machine Readable Cataloging (MARC) format into linked data structures and make it available in a variety of schemas under an open licence.  Nevertheless, we lacked the capacity to re-model our data in order to realise the potential of linked data.  As the technology matured, we began to look around for partners with whom we could collaborate to take BNB forward.

As described in my September 2020 blogpost, British Library Joins Share-VDE Linked Data Community, the British Library joined the Share Community (now the Share Family) to develop our linked data service. The Share Linked Data Environment is “a global family built on collaboration that brings libraries, archives and museums together with a common goal and joins their knowledge in an ever-widening network of inter-connected bibliographic data.” (Share Family, 2022).

 

Share Family and National Bibliographies

“The Share Family is a suite of innovative tools and services, developed and driven by libraries, for libraries, in an international collaborative, consortial effort. Share-VDE enables the discovery of knowledge to increase user engagement with library and cultural heritage collections.”[2]

Screenshot: Share family components showing layers like Advanced API, Advanced Entity Model, Authority Service, Deliverables etc.
2. Share family components[3].

The Share Family has supported us through the transition from our traditional MARC data to linked open data.  We provided a full copy of the British National Bibliography to the Share team for identification and clustering of entities, e.g. works, publications, persons. Working with colleagues from other institutions on Share-VDE working groups, we contribute to the development of the underlying data structures and the presentation of data.  This collaborative approach has enabled delivery of the British National Bibliography as the first institutional tenant of the Share Family National Bibliographies Portal.

What is a National bibliography?

“National bibliographies are a permanent record of the cultural and intellectual output of a nation or country, which is witnessed by its publishing output. They gather the bibliographic information of current publications to preserve and provide ongoing access to this record.”

IFLA Bibliography Section

The IFLA (International Federation of Library Associations and Institutions) Register of national bibliographies contains 52 entries, ranging from Andorra to Vietnam.  National bibliographies vary in scope, but each provides insights into the intellectual and cultural history of society, literature and publishing.  The Share Family National Bibliographies Portal offers the potential for clustering and searching multiple national bibliographies on a single platform.

BNB in the Share Family

Screenshot of the BNB home screen stating 'Search for people, original works and publications'
3. Screenshot: BNB home screen.

The British Library is proud that the British National Bibliography is the first tenant selected for the Share Family National Bibliographies Portal.

BNB is now available to explore in Beta: https://bl.natbib-lod.org. You can search for publications, original works and people, as illustrated by these examples:

You can use the national bibliography to search for a specific publication, such as a large print edition of the novel Small island by Andrea Levy.

Screenshot: Bibliographic description of large print edition of Small Island by Andrea Levy.
4. Screenshot: Bibliographic description of large print edition of Small Island by Andrea Levy.

 

You can also find original works inspired by earlier works:

Screenshot: Results set for publication of the work, Small island by Helen Edmundson
5. Screenshot: Results set for publication of the work, Small island by Helen Edmundson.

 

Alternatively, you can search for works by a specific author… 

Screenshot showing original works by Douglas Adams
6. Screenshot: Original works by Douglas Adams.

 

…or about a specific person

Screenshot showing original works about Douglas Adams
7. Screenshot: Original works about Douglas Adams.

 

…or by organization

Screenshot showing results set for BBC
8. Screenshot: Results set for BBC.

 

Benefits

What benefit do we expect to gain from this collaboration?

  • We profit from practical experience our collaborators have gained through other linked data initiatives
  • We gain access to a state of the art, extensible infrastructure designed for library data
  • We gain a new channel for dissemination of the BNB, in aggregation with other national bibliographies

We are able to re-tool our metadata for the 21st Century:

  • Our data will be remodelled and clustered making it more compatible with current data models, including the IFLA Library Reference Model, RDA: Resource Description and Access, and Bibframe
  • Our data will be enriched with URIs that will make it more effective in linked data environments
  • The entity-centred view of the British National Bibliography offers new perspectives for researchers

 

Future developments

Conversion of the BNB and publication in the National Bibliographies Portal is only the beginning. 

  • The BNB data from the Cluster Knowledge base will also be published in the triple store
  • Original records will be available to the British Library as Bibframe 2.0, for dissemination or reuse as linked data
  • Users will be provided with access to the data via data dumps and a SPARQL endpoint
  • Our MARC records will be enriched with original Share URIs and URIs from external sources
  • Other national bibliographies will join BNB in the national bibliographies portal

The British National Bibliography represents only a fraction of the Library’s data.   You can explore the British Library’s collection through our catalogue, which we plan to contribute to Share-VDE in future.

 

Beta service

The British National Bibliography in the Share Family is being made available in Beta. The service is still being tested. The interface and the functionality are subject to change and may not work for everyone.  You can tell us what you think about the service or report problems by contacting [email protected].

 

Further information:

British National Bibliography https://bnb.bl.uk  

Share VDE http://www.share-family.org/

Share Family wiki https://wiki.share-vde.org/wiki/Main_Page

Share VDE Virtual Discovery Environment in linked open data https://svde.org/

National Bibliographies in Linked Open Data https://natbib-lod.org

British National Bibliography Linked Open Data Portal https://bl.natbib-lod.org

 

Footnotes

[1] Berners-Lee, Tim; James Hendler; Ora Lassila (May 17, 2001). "The Semantic Web". Scientific American, 284(5): 34-43.

[2] Share-VDE: supporting the creation, management and discovery of linked open data for libraries: executive summary. Share-VDE Executive Committee. December 7th, 2022. Share-VDE Website (viewed 19th June 2023)

[3] Share Family – Linked data ecosystem. How does it work?  http://www.share-family.org/  (viewed on 23rd June 2023)

04 May 2023

Webinar on Open Scholarship in GLAMs through Research Repositories

If you work in the galleries, libraries, archives, and museums (GLAM) sector and want to learn more about research repositories, then join us on Thursday 18 May for an online repository training session for cultural heritage professionals.

Image of man looking at a poster that says 'Open Scholarship in GLAMs through Research Repositories - Webinar on 18 May, Thursday - Register at bit.ly/BLrepowebinar'

This event is part of the Library’s Repository Training Programme for Cultural Heritage Professionals. It is designed based on the input received from previous repository training events (this, this and this) to explore some areas of open scholarship further. These include, but are not limited to, research activities in GLAM, benefits of research repositories, scholarly publishing, research data management and digital preservation in scholarly communications.

 

Who is it for?

It is intended for those who are working in cultural heritage or a collection-holding organisation in roles where they are involved in managing digital collections, supporting the research lifecycle from funding to dissemination, providing research infrastructure and developing policies. However, anyone interested in the given topics is welcome to attend!

 

Programme

13.00                  Welcome and introductions

      Susan Miles, Scholarly Communications Specialist, British Library

Session 1          Open scholarship in GLAM research  

13.15                  Repositories to facilitate open scholarship

     Jenny Basford, Repository Services Lead, British Library

13.40                 Scholarly publishing dynamics in the GLAM environment

     Ilkay Holt, Scholarly Communications Lead, British Library

14.05                  Q&A

14.20                 Break time

Session 2          Building openness in GLAM research  

14.40                  Research data management

      Jez Cope, Data Services Lead, British Library

15.05                  Digital preservation and scholarly communications

      Neil Jefferies, Head of Innovation, Bodleian Libraries

15.30                  Q&A

15.45                  Closing

 

Register!

The event will take place from 13.00 to 15.45 on Thursday 18 May. Please register at this link to receive your access link for the online session.

 

What is next?

The last training event of the Library’s Repository Training Programme will be held on 31 May in Cardiff, hosted by the National Museums Cardiff. It will be an update and re-run of the previous face-to-face events. More information about the programme and registration link can be found in this blog post.

Please contact [email protected] if you have any questions or comments about the events.

 

Previous Events

31 January, in-person, Edinburgh, hosted by the National Museums Scotland

8 March, online, hosted by the British Library

31 March, in-person, York, hosted by Archaeology Data Service at the University of York

 

About British Library’s Repository Training Programme

The Library’s Repository Training Programme for cultural heritage professionals is funded as part of AHRC’s iDAH programme to support cultural heritage organisations in establishing or expanding open scholarship activities and sharing their outputs through research repositories. You can read more about the scoping report and the development of this training programme in this blog post.

02 May 2023

Detecting Catalogue Entries in Printed Catalogue Data

This is a guest blog post by Isaac Dunford, MEng Computer Science student at the University of Southampton. Isaac reports on his Digital Humanities internship project supervised by Dr James Baker.

Introduction

The purpose of this project has been to investigate and implement different methods for detecting catalogue entries within printed catalogues. For whilst printed catalogues are easy enough to digitise and convert into machine readable data, dividing that data by catalogue entry requires turning visual signifiers of divisions between entries - gaps in the printed page, large or upper-case headers, catalogue references - into machine-readable information. The first part of this project involved experimenting with XML-formatted data derived from the 13-volume Catalogue of books printed in the 15th century now at the British Museum (described by Rossitza Atanassova in a post announcing her AHRC-RLUK Professional Practice Fellowship project) and trying to find the best ways to detect individual entries and reassemble them as data (given that the text for a single catalogue entry may be spread across multiple pages of a printed catalogue). The next part of the project involved building a complete system based on this approach to take the large number of XML files for a volume and output all of the catalogue entries in a series of desired formats. This post describes our initial experiments with that data, the approach we settled on, and key features of our approach that you should be able to reapply to your catalogue data. All data and code can be found on the project GitHub repo.

Experimentation

The catalogue data was exported from Transkribus in two different formats: an ALTO XML schema and a PAGE XML schema. The ALTO layout encodes positional information about each element of the text (that is, where each word occurs relative to the top left corner of the page), which makes spatial analysis - such as looking for gaps between lines - possible. However, it also creates data files that are heavily encoded, meaning that it can be difficult to extract the text elements from them. The PAGE schema, by contrast, makes it easier to access the text elements in the files.

 

An image of a digitised page from volume 8 of the Incunabula Catalogue and the corresponding Optical Character Recognition file encoded in the PAGE XML Schema
Raw PAGE XML for a page from volume 8 of the Incunabula Catalogue

 

An image of a digitised page from volume 8 of the Incunabula Catalogue and the corresponding Optical Character Recognition file encoded in the ALTO XML Schema
Raw ALTO XML for a page from volume 8 of the Incunabula Catalogue
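
To make the difference between the two formats concrete, here is a minimal sketch of reading one line of text from each. It is not the project's code: the two XML snippets are heavily simplified, hand-written stand-ins for real Transkribus exports, and the helper names (local, alto_lines, page_lines) are hypothetical. It assumes the usual ALTO convention of String elements carrying CONTENT, HPOS and VPOS attributes, and the usual PAGE convention of a TextEquiv/Unicode element holding the text of each TextLine.

```python
import xml.etree.ElementTree as ET

# Simplified, invented samples; real exports carry far more structure.
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine HPOS="100" VPOS="200" WIDTH="600" HEIGHT="40">
      <String CONTENT="IB." HPOS="100" VPOS="200" WIDTH="60" HEIGHT="40"/>
      <SP/>
      <String CONTENT="39624" HPOS="170" VPOS="200" WIDTH="120" HEIGHT="40"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

PAGE_SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page><TextRegion><TextLine>
    <TextEquiv><Unicode>IB. 39624</Unicode></TextEquiv>
  </TextLine></TextRegion></Page>
</PcGts>"""


def local(tag):
    """Strip the XML namespace so the code works across schema versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """ALTO: rebuild each line from its word-level String elements,
    keeping the line's position on the page."""
    lines = []
    for el in ET.fromstring(xml_text).iter():
        if local(el.tag) == "TextLine":
            words = [s.get("CONTENT", "") for s in el if local(s.tag) == "String"]
            lines.append({
                "text": " ".join(words),
                "hpos": float(el.get("HPOS", "0")),
                "vpos": float(el.get("VPOS", "0")),
            })
    return lines


def page_lines(xml_text):
    """PAGE: the line text is already available in TextEquiv/Unicode."""
    lines = []
    for el in ET.fromstring(xml_text).iter():
        if local(el.tag) == "TextLine":
            for equiv in el:
                if local(equiv.tag) == "TextEquiv":
                    for uni in equiv:
                        if local(uni.tag) == "Unicode":
                            lines.append(uni.text or "")
    return lines
```

The ALTO reader has to reassemble words but gets coordinates for free; the PAGE reader gets the text directly, which is the trade-off described above.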

 

Spacing and positioning

One of the first approaches tried in this project was to use size and spacing to find entries. The intuition behind this is that there is generally a larger amount of white space around the headings in the text than there is between regular lines. And in the ALTO schema, there is information about the size of the text within each line as well as about the coordinates of the line within the page.

However, we found that using the size of the text line and/or the positioning of the lines was not effective for three reasons. First, blank space between catalogue entries inconsistently contributed to the size of some lines. Second, whenever there were tables within the text, there would be large gaps in spacing compared to the normal text, that in turn caused those tables to be read as divisions between catalogue entries. And third, even though entry headings were visually further to the left on the page than regular text, and therefore should have had the smallest x coordinates, the materiality of the printed page was inconsistently represented as digital data, and so presented regular lines with small x coordinates that could be read - using this approach - as headings.
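
The spacing heuristic, and one of the ways it goes wrong, can be illustrated with a short sketch. This is an illustration of the idea only, not the code we ran: the function name gap_candidates, the threshold factor and the sample coordinates are all invented for the example.

```python
from statistics import median


def gap_candidates(lines, factor=1.5):
    """Return the text of lines preceded by an unusually large vertical gap.

    Each line is a dict with 'text', 'vpos' (top edge) and 'height'.
    A gap counts as large when it exceeds `factor` times the median gap.
    """
    ordered = sorted(lines, key=lambda line: line["vpos"])
    pairs = list(zip(ordered, ordered[1:]))
    gaps = [b["vpos"] - (a["vpos"] + a["height"]) for a, b in pairs]
    if not gaps:
        return []
    typical = median(gaps)
    return [b["text"] for (a, b), gap in zip(pairs, gaps) if gap > factor * typical]


# Invented coordinates: a heading after a wide gap, and a table row after
# an even wider one.
sample_lines = [
    {"text": "first line of an entry", "vpos": 100, "height": 40},
    {"text": "second line", "vpos": 150, "height": 40},
    {"text": "third line", "vpos": 200, "height": 40},
    {"text": "NEXT ENTRY HEADING", "vpos": 300, "height": 40},
    {"text": "line after the heading", "vpos": 350, "height": 40},
    {"text": "row inside a table", "vpos": 500, "height": 40},
]
flagged = gap_candidates(sample_lines)
```

With these made-up values the heading is flagged, but so is the table row, which is the kind of false division between entries described above.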

Final Approach

Entry Detection

Our chosen approach uses the data in the PAGE XML schema, and is bespoke to the data for the Catalogue of books printed in the 15th century now at the British Museum as produced by Transkribus. It is also bespoke to the version of Transkribus: having built our code around some initial exports, running it over the later volumes - which had been digitised last - threw an error due to some slight changes to the exported XML schema.

The code takes the XML input and finds entries using a content-based approach that looks for features at the start and end of each catalogue entry. After experimenting with different approaches, we found that the most consistent way to detect the catalogue entries was to:

  1. Find the “reference number” (e.g. IB. 39624) which is always present at the end of an entry.
  2. Find a date that is always present after an entry heading.

This gave us the ability to contextually infer the presence of a split between two catalogue entries. The main limitation of this approach is the quality of the Optical Character Recognition (OCR) at the points where the references and dates occur in the printed volumes.
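The two steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's code: both regular expressions are hypothetical (the reference pattern is modelled only on the example IB. 39624, and the date pattern simply assumes a four-digit fifteenth-century year), the lookahead of two lines is arbitrary, and the sample lines are invented apart from that one reference number.

```python
import re

# Hypothetical patterns, guessed from the single example "IB. 39624".
REFERENCE = re.compile(r"\bI[A-C]\.?\s?\d{3,6}\b")
# Assumption: the date after a heading contains a year of the form 14xx.
DATE = re.compile(r"\b14\d{2}\b")


def split_entries(lines, lookahead=2):
    """Group lines into entries.

    A split is inferred after a line containing a reference number when a
    date appears within the next `lookahead` lines (i.e. a new heading and
    its date follow), or when the reference is the last line of the input.
    """
    entries, current = [], []
    for i, line in enumerate(lines):
        current.append(line)
        if REFERENCE.search(line):
            upcoming = lines[i + 1 : i + 1 + lookahead]
            at_end = i == len(lines) - 1
            if at_end or any(DATE.search(nxt) for nxt in upcoming):
                entries.append(current)
                current = []
    if current:  # trailing lines with no closing reference number
        entries.append(current)
    return entries


# Invented sample lines; only "IB. 39624" comes from the catalogue example above.
sample_lines = [
    "HEADING ONE.",
    "1 January 1470.",
    "Description line.",
    "IB. 39624",
    "HEADING TWO.",
    "1472.",
    "Description line.",
    "IB. 39625",
]
entries = split_entries(sample_lines)
print(len(entries))
```

Requiring both signals, rather than the reference number alone, is one way to make the split robust to reference-like strings appearing mid-entry, although it remains dependent on the OCR reading both features correctly.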

 

An image of a digitised page with a catalogue entry and the corresponding text output in XML format
XML of a detected entry

 

Language Detection

The reason for dividing catalogue entries in this way was to facilitate analysis of the catalogue data, specifically analysis that sought to define the linguistic character of descriptions in the Catalogue of books printed in the 15th century now at the British Museum and how those descriptions changed and evolved across the thirteen volumes. Segments of each catalogue entry contain text transcribed from the incunabula that was not written by a cataloguer (and is therefore not part of their cataloguing ‘voice’). Those transcribed sections are in French, Dutch, Old English, and other languages that a machine could detect as not being modern English. To further facilitate research use of the final data, one of the extensions we implemented was therefore to label sections of each catalogue entry by language. This was achieved using a Python library for language detection and then - for a particular output type - replacing non-English sections of text with a placeholder (e.g. NON-ENGLISH SECTION). The language detection model does not recognise Old English, and as a result varies between assigning those sections labels for different languages. Even so, the language detection was still able to break blocks of text in each catalogue entry into English and non-English sections.
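A minimal sketch of the placeholder step is shown below. The library used in the project is not named here, so the detector is passed in as a function; toy_detect is a deliberately crude stand-in based on a handful of common English words, included only so that the example runs on its own. In practice a real language detection library would be supplied instead (for instance, the detect function from the langdetect package, if that were the library chosen). All names and sample sentences are invented for illustration.

```python
PLACEHOLDER = "NON-ENGLISH SECTION"

# Crude stand-in vocabulary; a real detector would replace this entirely.
ENGLISH_HINTS = {"the", "of", "and", "in", "is", "with", "on", "at", "a", "to"}


def toy_detect(text):
    """Return 'en' if enough words are common English words, else 'other'."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    if not words:
        return "unknown"
    hits = sum(w in ENGLISH_HINTS for w in words)
    return "en" if hits / len(words) >= 0.2 else "other"


def english_only(sections, detect=toy_detect):
    """Keep English sections and replace everything else with a placeholder."""
    return [s if detect(s) == "en" else PLACEHOLDER for s in sections]


# Invented sections standing in for the parts of one catalogue entry.
sections = [
    "The title is printed in red on the first leaf.",
    "Incipit liber primus de vita beata.",
    "A copy with the arms of the owner.",
]
print(english_only(sections))
```

Because the output only depends on whether a section is labelled English or not, a detector that mislabels Old English as some other non-English language still produces a usable split, which matches the behaviour described above.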

 

Text files for catalogue entry number IB39624 showing the full text and the detected English-only sections.
Text outputs of the full and English-only sections of the catalogue entry

 

Poorly Scanned Pages

Another extension for this system was to use the input data to try to determine whether a page had been poorly scanned: for example, where the lines in the XML input read from one column straight into another as a single line (rather than the XML reading order following the visual signifiers of column breaks). The system detects poorly scanned pages by looking at the lengths of all lines in the PAGE XML, establishing which lines deviate substantially from the mean line length, and marking the page as poorly scanned if sufficient outliers are found.
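The check can be sketched as below. The thresholds are assumptions: here a line counts as an outlier when its length is more than two standard deviations from the mean, and a page is flagged when it has at least two such lines. The actual definitions of "deviate substantially" and "sufficient outliers" used in the project may differ.

```python
from statistics import mean, pstdev


def is_poorly_scanned(lines, z=2.0, min_outliers=2):
    """Flag a page whose line lengths include several strong outliers.

    `z` and `min_outliers` are illustrative values, not the project's settings.
    """
    lengths = [len(line) for line in lines]
    if len(lengths) < 2:
        return False
    mu, sigma = mean(lengths), pstdev(lengths)
    if sigma == 0:  # every line has the same length, so nothing stands out
        return False
    outliers = [n for n in lengths if abs(n - mu) > z * sigma]
    return len(outliers) >= min_outliers


# Invented pages: one with regular line lengths, and one where two lines
# run across both columns and are therefore much longer than the rest.
good_page = ["x" * n for n in (38, 40, 42, 40, 39, 41)]
bad_page = ["x" * 40] * 18 + ["x" * 90] * 2

print(is_poorly_scanned(good_page), is_poorly_scanned(bad_page))
```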

Key Features

The key part of this system that can be taken and applied to a different problem is the method for detecting entries. We expect that the fundamental method of looking for marks in the page content to identify the start and end of catalogue entries in the XML files would be applicable to other data derived from printed catalogues. The only parts of the algorithm that would need changing for a new system are the regular expressions used to find the start and end of the catalogue entries. And as long as the XML input comes in the same schema, the code should be able to consistently divide up the volumes into individual catalogue entries.

03 April 2023

Topics in contemporary Digital Scholarship via five years of our Reading Group

Since March 2016, the Digital Scholarship Reading Group at the British Library has discussed articles, videos, podcasts, blog posts and chapters that touch on digital scholarship in libraries. I've shared our readings up to May 2018 and taken a thematic look at our readings at the intersection of digital scholarship and anti-racism in July 2020.

As the Living with Machines project draws to an end this (northern) summer, I thought I'd give an updated list of our readings since June 2018. I started including more pieces on deep learning, machine learning, AI ('artificial intelligence'), big data, data science, digital history, digitised newspapers, and user experience design for digital collections when we began discussing what became Living with Machines in early 2017. This was partly a way for me to catch up with relevant topics, and partly to lay the groundwork for LwM across the organisation. You can see that reflected in our topics up to May 2018 and onward.

Of course, the group continued to cover other topics, and sessions were suggested and/or led by colleagues including Adi Keinan-Schoonbaert, Annabel Gallop, Graham Jevon, Jez Cope, Lucy Hinnie, Mary Stewart, Nora McGregor, Sarah Miles, Sarah Stewart and Stella Wisdom. Especial thanks to Rossitza Atanassova and Deirdre Sullivan who’ve been helping me run the group in recent years. In 2021 we started using the January session to invite colleagues across the Library to look around and pick topics for discussion in the year ahead.

So what did we discuss from June 2018 to the end of 2022?
