Digital scholarship blog

236 posts categorized "Data"

30 November 2022

Skills and Training Needs to Open Heritage Research Through Repositories: Scoping report and Repository Training Programme for cultural heritage professionals

Do you think the repository landscape is mature enough in the heritage sector? Are the policies, infrastructure and skills in place to open up heritage research through digital repositories? Our brief analysis shows that research activity in GLAMs needs better acknowledgement, established digital repositories for dissemination of outputs and empowered staff to make use of repository services. At the British Library, we published a report called Scoping Skills and Developing Training Programme for Managing Repository Services in Cultural Heritage Organisations. We looked at the roles and people involved in the research workflow in GLAMs, and their skills needs to share heritage research openly through digital repositories in order to develop a training program for cultural heritage professionals.

 

Making heritage research openly available

Making research openly available to everyone increases the reach and impact of the work, driving increased value for money in research investment, and helps to make research reusable for everyone. ‘Open’ in this context is not only about making research freely accessible but also about ensuring the research is shared with rich metadata, licensed for reuse, including persistent identifiers, and is discoverable. Communicating research in GLAM contexts goes beyond journal articles. Digital scholarship, practice-based and computational research approaches generate a wide range of complex objects that need to be shared, reused to inform practice, policy and future research, and cannot necessarily be assessed with common metrics and rankings of academia.

The array of research activity in GLAMs needs to be addressed in the context of research repositories. If you look at OpenDOAR and Re3data, the global directories of open repositories, the number of repositories in the cultural heritage sector is still small compared to academic institutions. There is an increasing need to establish repositories for heritage research and to empower cultural heritage professionals to make use of repository services. Staff who are involved in supporting research activities, managing digital collections, and providing research infrastructure in GLAM organisations must be supported with capacity development programmes to establish open scholarship activities and share their research outputs through research repositories.

 

Who is involved in the research activities and repository services?

This question is important considering that staff may not be explicitly research-active, yet research is regularly conducted in addition to day-to-day jobs in GLAMs. In addition, organisations are not primarily driven by a research agenda in the heritage sector. The study we undertook as part of an AHRC funded repository infrastructure project showed us that cultural heritage professionals are challenged by the invisibility of forms of research conducted in their day-to-day jobs as well as lack of dedicated time and staff to work around open scholarship.

In order to bring clarity to the personas involved in research activities and link them to competencies and training needs later on for the purpose of this work, we defined five profiles that carry out and contribute to research in cultural heritage organisations. These five profiles illustrate the researcher as a core player, alongside four other profiles involved in making research happen, and ensuring it can be published, shared, communicated and preserved.

 

A 5 column chart showing 'researchers', 'curators and content creators', 'infomediaries', 'infrastructure architects', and 'policy makers' as the key personas identified.
Figure 1. Profiles identified in the cultural heritage institutions to conduct, facilitate, and support research workflow.

 

 

Consultation on training needs for repository services

We explored the skill gaps and training needs of GLAM professionals from curation to rights management, and open scholarship to management of repository services. In addition to scanning the training landscape for competency frameworks, existing programmes and resources, we conducted interviews to explore training requirements relevant to repository services. Finally, we validated initial findings in a consultative workshop with cultural heritage professionals, to hear their experience and get input to a competency framework and training curriculum.

Interviews highlighted that there is a lack of knowledge and support in cultural heritage organisations, where institutional support and training is not guaranteed for research communication or open scholarship. In terms of types of research activities, the workshop brought interesting discussions about what constitutes ‘research’ in the cultural heritage context and what makes it different to research in a university context. The event underlined the fact that cultural heritage staff profiles for producing, supporting, and communicating the research are different to the higher education landscape at many levels.

 

Discussion board showing virtual post its stuck to a canvas with a river in the background, identifying three key areas: 'What skills and knowledge do we already have?', 'What training elements are required?', and 'What skills and knowledge do we need?' (with the second question acting as a metaphorical bridge over the river).
Figure 2: Discussion board from the Skills and Training Breakout Session in virtual Consultative Workshop held on 28/04/2022.

 

The interviews and the consultative workshop highlighted that the ways of research conducted and communicated in the cultural heritage sector (as opposed to academia) should be taken into account in identifying skills needed and developing training programmes in the areas of open scholarship.

 

Competency framework and curriculum for repository training programme

There is a wealth of information, valuable project outputs, and a number of good analytical works available to identify gaps and gain new skills, particularly in the areas of open science, scholarly communications and research data management. However, adjusting and adopting these works to the context of cultural heritage organisations and relevant professionals will increase their relevance and uptake. Derived from our desk research and workshop analysis, we developed a competency framework that sets out the knowledge and skills required to support open scholarship for the personas present in GLAM organisations. Topic clusters used in the framework are as follows:

  1. Repository Service management
  2. Curation & data stewardship
  3. Metadata management
  4. Preservation
  5. Scholarly publishing
  6. Assessment and impact
  7. Advocacy and communication
  8. Capacity development

The proposed curriculum was designed by considering the pathways to develop, accelerate and manage a repository service. It contains only the areas that we identify as a priority to deliver the most value to cultural heritage organisations. Five teaching modules are considered in this preliminary work: 

  1. Opening up heritage research
  2. Getting started with GLAM repositories
  3. Realising and expanding the benefits
  4. Exploring the scholarly communications ecosystem
  5. Topics for future development

A complete version of the competency framework and the curriculum can be found in the report and is also available as a Google spreadsheet. They will drive increased uptake and use of repositories across AHRC’s investments, increasing value for money from both research funding and infrastructure funding.

 

What is next?

From January to July2023, we, at the British Library, will prepare a core set of materials based on this curriculum and deliver training events in a combination of online and in-person workshops. Training events are being planned to take place in Scotland, North England, Wales in person in addition to several online sessions. Both the framework and the training curriculum will be refined as we receive feedback and input from the participants of these events throughout next year. Event details will be announced in collaboration with host institutions in this blog as well as on our social media channels. Watch this space for more information.

If you have any feedback or questions, please contact us at openaccess@bl.uk.

29 November 2022

My AHRC-RLUK Professional Practice Fellowship: Four months on

In August 2022 I started work on a project to investigate the legacies of curatorial voice in the descriptions of incunabula collections at the British Library and their future reuse. My research is funded by the collaborative AHRC-RLUK Professional Practice Fellowship Scheme for academic and research libraries which launched in 2021. As part of the first cohort of ten Fellows I embraced this opportunity to engage in practitioner research that benefits my institution and the wider sector, and to promote the role of library professionals as important research partners.

The overall aim of my Fellowship is to demonstrate new ways of working with digitised catalogues that would also improve the discoverability and usability of the collections they describe. The focus of my research is the Catalogue of books printed in the 15th century now at the British Museum (or BMC) published between 1908 and 2007 which describes over 12,700 volumes from the British Library incunabula collection. By using computational approaches and tools with the data derived from the catalogue I will gain new insights into and interpretations of this valuable resource and enable its reuse in contemporary online resources. 

Titlepage to volume 2 of the Catalogue of books printed in the fifteenth century now in the British Museum, part 2, Germany, Eltvil-Trier
BMC volume 2 titlepage


This research idea was inspired by a recent collaboration with Dr James Baker, who is also my mentor for this Fellowship, and was further developed in conversations with Dr Karen Limper-Herz, Lead Curator for Incunabula, Adrian Edwards, Head of Printed Heritage Collections, and Alan Danskin, Collections Metadata Standards Manager, who support my research at the Library.

My Fellowship runs until July 2023 with Fridays being my main research days. I began by studying the history of the catalogue, its arrangement and the structure of the item descriptions and their relationship with different online resources. Overall, the main focus of this first phase has been on generating the text data required for the computational analysis and investigations into curatorial and cataloguing practice. This work involved new digitisation of the catalogue and a lot of experimentation using the Transkribus AI-empowered platform that proved best-suited for improving the layout and text recognition for the digitised images. During the last two months I have hugely benefited from the expertise of my colleague Tom Derrick, as we worked together on creating the training data and building structure models for the incunabula catalogue images.

An image from Transkribus Lite showing a page from the catalogue with separate regions drawn around columns 1 and 2, and the text baselines highlighted in purple
Layout recognition output for pages with only two columns, including text baselines, viewed on Transkribus Lite

 

An image from Transkribus Lite showing a page from the catalogue alongside the text lines
Text recognition output after applying the model trained with annotations for 2 columns on the page, viewed on Transkribus Lite

 

An image from Transkribus Lite showing a page from the catalogue with separate regions drawn around 4 columns of text separated by a single text block
Layout recognition output for pages with mixed layout of single text block and text in columns, viewed on Transkribus Lite

Whilst the data preparation phase has taken longer than I had planned due to the varied layout of the catalogue, this has been an important part of the process as the project outcomes are dependent on using the best quality text data for the incunabula descriptions. The next phase of the research will involve the segmentation of the records and extraction of relevant information to use with a range of computational tools. I will report on the progress with this work and the next steps early next year. Watch this space and do get in touch if you would like to learn more about my research.

This blogpost is by Dr Rossitza Atanassova, Digital Curator for Digitisation, British Library. She is on Twitter @RossiAtanassova  and Mastodon @ratanass@glammr.us

28 October 2022

Learn more about Living with Machines at events this winter

Digital Curator, and Living with Machines Co-Investigator Dr Mia Ridge writes…

The Living with Machines research project is a collaboration between the British Library, The Alan Turing Institute and various partner universities. Our free exhibition at Leeds City Museum, Living with Machines: Human stories from the industrial age, opened at the end of July. Read on for information about adult events around the exhibition…

Museum Late: Living with Machines, Thursday 24 November, 2022

6 - 10pm Leeds City Museum • £5, booking essential https://my.leedstickethub.co.uk/19101

The first ever Museum Late at Leeds City Museum! Come along to experience the museum after hours with music, pub quiz, weaving, informal workshops, chats with curators, and a quiz. Local food and drinks in the main hall.

Full programme: https://museumsandgalleries.leeds.gov.uk/events/leeds-city-museum/museum-late-living-with-machines/

Tickets: https://my.leedstickethub.co.uk/19101

Study Day: Living with Machines, Friday December 2, 2022

10:00 am - 4:00 pm Online • Free but booking essential: https://my.leedstickethub.co.uk/18775

A unique opportunity to hear experts in the field illuminate key themes from the exhibition and learn how exhibition co-curators found stories and objects to represent research work in AI and digital history. This study day is online via Zoom so that you can attend from anywhere.

Full programme: https://museumsandgalleries.leeds.gov.uk/events/leeds-city-museum/living-with-machines-study-day/

Tickets: https://my.leedstickethub.co.uk/18775

Living with Machines Wikithon, Saturday January 7, 2023

1 – 4:30pm Leeds City Museum • Free but booking essential: https://my.leedstickethub.co.uk/19104

Ever wanted to try editing Wikipedia, but haven't known where to start? Join us for a session with our brilliant Wikipedian-in-residence to help improve Wikipedia’s coverage of local lives and topics at an editathon themed around our exhibition. 

Everyone is welcome. You won’t require any previous Wiki experience but please bring your own laptop for this event. Find out more, including how you can prepare, in my blog post on the Living with Machines site, Help fill gaps in Wikipedia: our Leeds editathon.

The exhibition closes the next day, so it really is your last chance to see it!

Full programme: https://museumsandgalleries.leeds.gov.uk/events/leeds-city-museum/living-with-machines-wikithon-exploring-the-margins/

Tickets: https://my.leedstickethub.co.uk/19104

If you just want to try out something more hands on with textiles inspired by the exhibition, there's also a Peg Loom Weaving Workshop, and not one but two Christmas Wreath Workshops.

You can find out more about our exhibition on the Living with Machines website.

Lwm800x400

20 September 2022

Learn more about what AI means for us at Living with Machines events this autumn

Digital Curator, and Living with Machines Co-Investigator Dr Mia Ridge writes…

The Living with Machines research project is a collaboration between the British Library, The Alan Turing Institute and various partner universities. Our free exhibition at Leeds City Museum, Living with Machines: Human stories from the industrial age, opened at the end of July. Read on for information about adult events around the exhibition…

AI evening panels and workshop, September 2022

We’ve put together some great panels with expert speakers guaranteed to get you thinking about the impact of AI with their thought-provoking examples and questions. You'll have a chance to ask your own questions in the Q&A, and to mingle with other attendees over drinks.

We’ve also collaborated with AI Tech North to offer an exclusive workshop looking at the practical aspects of ethics in AI. If you’re using or considering AI-based services or tools, this might be for you. Our events are also part of the jam-packed programme of the Leeds Digital Festival #LeedsDigi22, where we’re in great company.

The role of AI in Creative and Cultural Industries

Thu, Sep 22, 17:30 – 19:45 BST

Leeds City Museum • Free but booking required

https://www.eventbrite.com/e/the-role-of-ai-in-creative-and-cultural-industries-tickets-395003043737

How will AI change what we wear, the TV and films we watch, what we read? 

Join our fabulous Chair Zillah Watson (independent consultant, ex-BBC) and panellists Rebecca O’Higgins (Founder KI-AH-NA), Laura Ellis (Head of Technology Forecasting, BBC) and Maja Maricevic, (Head of Higher Education and Science, British Library) for an evening that'll help you understand the future of these industries for audiences and professionals alike. 

Maja's written a blog post on The role of AI in creative and cultural industries with more background on this event.

 

Workshop: Developing ethical and fair AI for society and business

Thu, Sep 29, 13:30 - 17:00 BST

Leeds City Museum • Free but booking required

https://www.eventbrite.com/e/workshop-developing-ethical-and-fair-ai-for-society-and-business-tickets-400345623537

 

Panel: Developing ethical and fair AI for society and business

Thu, Sep 29, 17:30 – 19:45 BST

Leeds City Museum • Free but booking required

https://www.eventbrite.com/e/panel-developing-ethical-and-fair-ai-for-society-and-business-tickets-395020706567

AI is coming, so how do we live and work with it? What can we all do to develop ethical approaches to AI to help ensure a more equal and just society? 

Our expert Chair, Timandra Harkness, and panellists Sherin Mathew (Founder & CEO of AI Tech UK), Robbie Stamp (author and CEO at Bioss International), Keely Crockett (Professor in Computational Intelligence, Manchester Metropolitan University) and Andrew Dyson (Global Co-Chair of DLA Piper’s Data Protection, Privacy and Security Group) will present a range of perspectives on this important topic.

If you missed our autumn events, we also have a study day and Wikipedia editathon this winter. You can find out more about our exhibition on the Living with Machines website.

Lwm800x400

27 June 2022

IIIF-yeah! Annual Conference 2022

At the beginning of June Neil Fitzgerald, Head of Digital Research, and myself attended the annual International Image Interoperability Framework (IIIF) Showcase and Conference in Cambridge MA. The showcase was held in Massachusetts’s Institute of Technology’s iconic lecture theatre 10-250 and the conference was held in the Fong Auditorium of Boylston Hall on Harvard’s campus. There was a stillness on the MIT campus, in contrast Harvard Yard was busy with sightseeing members of the public and the dismantling of marquees from the end of year commencements in the previous weeks. 

View of the Massachusetts Institute of Technology Dome IIIF Consortium sticker reading IIIF-yeah! Conference participants outside Boylston Hall, Harvard Yard


The conference atmosphere was energising, with participants excited to be back at an in-person event, the last one being held in 2019 in Göttingen, with virtual meetings held in the meantime. During the last decade IIIF has been growing as reflected by the fast expanding community  and IIIF Consortium, which now comprises 63 organisations from across the GLAM and commercial sectors. 

The Showcase on June 6th was an opportunity to welcome those new to IIIF and highlight recent community developments. I had the pleasure of presenting the work of British Library and Zooninverse to enable new IIIF functionality on Zooniverse to support our In the Spotlight project which crowdsources information about the Library’s historical playbills collection. Other presentations covered the use of IIIF with audio, maps, and in teaching, learning and museum contexts, and the exciting plans to extend IIIF standards for 3D data. Harvard University updated on their efforts to adopt IIIF across the organisation and their IIIF resources webpage is a useful resource. I was particularly impressed by the Leventhal Map and Education Center’s digital maps initiatives, including their collaboration on Allmaps, a set of open source tools for curating, georeferencing and exploring IIIF maps (learn more).

 The following two days were packed with brilliant presentations on IIIF infrastructure, collections enrichment, IIIF resources discovery, IIIF-enabled digital humanities teaching and research, improving user experience and more. Digirati presented a new IIIF manifest editor which is being further developed to support various use cases. Ed Silverton reported on the newest features for the Exhibit tool which we at the British Library have started using to share engaging stories about our IIIF collections.

 Ed Silverton presenting a slide about the Exhibit tool Conference presenters talking about the Audiovisual Metadata Platform Conference reception under a marquee in Harvard Yard

I was interested to hear about Getty’s vision of IIIF as enabling technology, how it fits within their shared data infrastructure and their multiple use cases, including to drive image backgrounds based on colour palette annotations and the Quire publication process. It was great to hear how IIIF has been used in digital humanities research, as in the Mapping Colour in History project at Harvard which enables historical analysis of artworks though pigment data annotations, or how IIIF helps to solve some of the challenges of remote resources aggregation for the Paul Laurence Dunbar initiative.

There was also much excitement about the Detekiiif browser extension for Chrome and Firefox that detects IIIF resources in websites and helps collect and export IIIF manifests. Zentralbibliothek Zürich’s customised version ZB-detektIIIF allows scholars to create IIIF collections in JSON-LD and link to the Mirador Viewer. There were several great presentations about IIIF players and tools for audio-visual content, such as Avalon, Aviary, Clover, Audiovisual Metadata Platform and Mirador video extension. And no IIIF Conference is ever complete without a #FunWithIIIF presentation by Cogapp’s Tristan Roddis this one capturing 30 cool projects using IIIF content and technology! 

We all enjoyed lots of good conversations during the breaks and social events, and some great tours were on offer. Personally I chose to visit the Boston Public Library’s Leventhal Map and Education Centre and exhibition about environment and social justice, and BPL Digitisation studio, the latter equipped with the Internet Archive scanning stations and an impressive maps photography room.

Boston Public Library book trolleys Boston Public Library Maps Digitisation Studio Rossitza Atanassova outside Boston Pubic Library


I was also delighted to pay a visit to the Harvard Libraries digitisation team who generously showed me their imaging stations and range of digitised collections, followed by a private guided tour of the Houghton Library’s special collections and beautiful spaces. Huge thanks to all the conference organisers, the local committee, and the hosts for my visits, Christine Jacobson, Bill Comstock and David Remington. I learned a lot and had an amazing time. 

Finally, all presentations from the three days have been shared and some highlights captured on Twitter #iiif. In addition this week the Consortium is offering four free online workshops to share IIIF best practices and tools with the wider community. Don’t miss your chance to attend. 

This post is by Digital Curator Rossitza Atanassova (@RossiAtanassova)

16 June 2022

Working With Wikidata and Wikimedia Commons: Poetry Pamphlets and Lotus Sutra Manuscripts

Greetings! I’m Xiaoyan Yang, from Beijing, China, an MSc student at University College London. It was a great pleasure to have the opportunity to do a four-week placement at the British Library and Wikimedia UK under the supervision of Lucy Hinnie, Wikimedian in Residence, and Stella Wisdom, Digital Curator, Contemporary British Collections. I mainly focused on the Michael Marks Awards for Poetry Pamphlets Project and Lotus Sutra Project, and the collaboration between the Library and Wikimedia.

What interested you in applying for a placement at the Library?

This kind of placement, in world-famous cultural institutions such as the Library and Wikimedia is  a brand-new experience for me. Because my undergraduate major is economic statistics, most of my internships in the past were in commercial and Internet technology companies. The driving force of my interest in digital humanities research, especially related data, knowledge graph, and visualization, is to better combine information technologies with cultural resources, in order to reach a wider audience, and promote the transmission of cultural and historical memory in a more accessible way.

Libraries are institutions for the preservation and dissemination of knowledge for the public, and the British Library is one of the largest and best libraries in the world without doubt. It has long been a leader and innovator in resource protection and digitization. The International Dunhuang Project (IDP) initiated by the British Library is now one of the most representative transnational collaborative projects of digital humanistic resources in the field. I applied for a placement opportunity hoping to learn more about the usage of digital resources in real projects and the process of collaboration from the initial design to the following arrangement. I also wanted  to have the chance to get involved in the practice of linked data, to accumulate experience, and find the direction of future improvements.

I would like to thank Dr Adi Keinan-Schoonbaert for her kind introduction to the British Library's Asian and African Digitization projects, especially the IDP, which has enabled me to learn more about the librarian-led practices in this area. At the same time, I was very happy to sit in on the weekly meetings of the Digital Scholarship Team during this placement, which allowed me to observe how collaboration between different departments are carried out and managed in a large cultural resource organization like the British Library.

Excerpt from Lotus Sutra Or.8210 S.155. An old scroll of parchment showing vertical lines of older Chinese script.
Excerpt from Lotus Sutra Or.8210 S.155. Kumārajīva, CC BY 4.0, via Wikimedia Commons

What is the most surprising thing you have learned?

In short, it is so easy to contribute knowledge at Wikimedia. In this placement, one of my very first tasks was to upload information about winning and shortlisted poems of the Michael Marks Awards for Poetry Pamphlets for each year from 2009 to the latest, 2021, to Wikidata. The first step was to check whether this poem and its author and publisher already existed in Wikidata. If not, I created an item page for it. Before I started, I thought the process would be very complicated, but after I started following the manual, I found it was actually really easy. I just need to click "Create a new Item". 

I always remember that the first item of people that I created was Sarah Jackson, one of the shortlist winners of this award in 2009. The unique QID was automatically generated as Q111940266. With such a simple operation, anyone can contribute to the vast knowledge world of Wiki. Many people who I have never met may read this item page  in the future, a page created and perfected by me at this moment. This feeling is magical and full of achievement for me. Also, there are many useful guides, examples and batch loading tools such as Quickstatements that help the users to start editing with joy. Useful guides include the Wikidata help pages for Quickstatements and material from the University of Edinburgh.

Image of a Wikimedia SPARQL query to determine a list of information about the Michael Marks Poetry Pamphlet uploads.
An example of one of Xiaoyan’s queries - you can try it here!

How do you hope to use your skills going forward?

My current dissertation research focuses on the regional classic Chinese poetry in the Hexi Corridor. This particular geographical area is deeply bound up with the Silk Road in history and has inspired and attracted many poets to visit and write. My project aims to build a proper ontology and knowledge map, then combining with GIS visualization display and text analysis, to explore the historical, geographic, political and cultural changes in this area, from the perspective of time and space. Wikidata provides a standard way to undertake this work. 

Thanks to Dr Martin Poulter’s wonderful training and Stuart Prior’s kind instructions, I quickly picked up some practical skills on Wiki queries construction. The layout design of the timeline and geographical visualization tools offered by Wiki query inspired me to improve my skills in this field more in the future. What’s more, although I haven’t had a chance to experience Wikibase yet, I am very interested in it now, thanks to Dr Lucy Hinnie and Dr Graham Jevon’s introduction, I will definitely try it in future.

Would you like to share some Wiki advice with us?

Wiki is very self-learning friendly: on the Help page various manuals and examples are presented, all of which are very good learning resources. I will keep learning and exploring in the future.

I do want to share my feelings and a little experience with Wikidata. In the Michael Marks Awards for Poetry Pamphlets Project, all the properties used to describe poets, poems and publishers can be easily found in the existing Wikidata property list. However, in the second Lotus Sutra Project, I encountered more difficulties. For example, it is difficult to find suitable items and properties to represent paragraphs of scrolls’ text content and binding design on Wikidata, and this information is more suitable to be represented on WikiCommons at present.

However, as I learn more and more other Wikidata examples, I understand more and more about Wikidata and the purpose of these restrictions. Maintaining concise structured data and accurate correlation is one of the main purposes of Wikidata. It encourages reuse of existing properties as well as imposing more qualifications on long text descriptions. Therefore, this feature of Wikidata needs to be taken into account from the outset when designing metadata frameworks for data uploading.

In the end, I would like to sincerely thank my direct supervisor Lucy for her kind guidance, help, encouragement and affirmation, as well as the British Library and Wikimedia platform. I have received so much warm help and gained so much valuable practical experience, and I am also very happy and honored that by using my knowledge and technology I can make a small contribution to linked data. I will always cherish the wonderful memories here and continue to explore the potential of digital humanities in the future.

This post is by Xiaoyan Yang, an MSc student at University College London, and was edited by Wikimedian in Residence Dr Lucy Hinnie (@BL_Wikimedian) and Digital Curator Stella Wisdom (@miss_wisdom).

12 April 2022

Making British Library collections (even) more accessible

Daniel van Strien, Digital Curator, Living with Machines, writes:

The British Library’s digital scholarship department has made many digitised materials available to researchers. This includes a collection of digitised books created by the British Library in partnership with Microsoft. This is a collection of books that have been digitised and processed using Optical Character Recognition (OCR) software to make the text machine-readable. There is also a collection of books digitised in partnership with Google. 

Since being digitised, this collection of digitised books has been used for many different projects. This includes recent work to try and augment this dataset with genre metadata and a project using machine learning to tag images extracted from the books. The books have also served as training data for a historic language model.

This blog post will focus on two challenges of working with this dataset: size and documentation, and discuss how we’ve experimented with one potential approach to addressing these challenges. 

One of the challenges of working with this collection is its size. The OCR output is over 20GB. This poses some challenges for researchers and other interested users wanting to work with these collections. Projects like Living with Machines are one avenue in which the British Library seeks to develop new methods for working at scale. For an individual researcher, one of the possible barriers to working with a collection like this is the computational resources required to process it. 

Recently we have been experimenting with a Python library, datasets, to see if this can help make this collection easier to work with. The datasets library is part of the Hugging Face ecosystem. If you have been following developments in machine learning, you have probably heard of Hugging Face already. If not, Hugging Face is a delightfully named company focusing on developing open-source tools aimed at democratising machine learning. 

The datasets library is a tool aiming to make it easier for researchers to share and process large datasets for machine learning efficiently. Whilst this was the library’s original focus, there may also be other uses cases for which the datasets library may help make datasets held by the British Library more accessible. 

Some features of the datasets library:

  • Tools for efficiently processing large datasets 
  • Support for easily sharing datasets via a ‘dataset hub’ 
  • Support for documenting datasets hosted on the hub (more on this later). 

As a result of these and other features, we have recently worked on adding the British Library books dataset library to the Hugging Face hub. Making the dataset available via the datasets library has now made the dataset more accessible in a few different ways.

Firstly, it is now possible to download the dataset in two lines of Python code: 

Image of a line of code: "from datasets import load_dataset ds = load_dataset('blbooks', '1700_1799')"

We can also use the Hugging Face library to process large datasets. For example, we only want to include data with a high OCR confidence score (this partially helps filter out text with many OCR errors): 

Image of a line of code: "ds.filter(lambda example: example['mean_wc_ocr'] > 0.9)"

One of the particularly nice features here is that the library uses memory mapping to store the dataset under the hood. This means that you can process data that is larger than the RAM you have available on your machine. This can make the process of working with large datasets more accessible. We could also use this as a first step in processing data before getting back to more familiar tools like pandas. 

Image of a line of code: "dogs_data = ds['train'].filter(lamda example: "dog" in example['text'].lower()) df = dogs_data_to_pandas()

In a follow on blog post, we’ll dig into the technical details of datasets in some more detail. Whilst making the technical processing of datasets more accessible is one part of the puzzle, there are also non-technical challenges to making a dataset more usable. 

 

Documenting datasets 

One of the challenges of sharing large datasets is documenting the data effectively. Traditionally libraries have mainly focused on describing material at the ‘item level,’ i.e. documenting one dataset at a time. However, there is a difference between documenting one book and 100,000 books. There are no easy answers to this, but libraries could explore one possible avenue by using Datasheets. Timnit Gebru et al. proposed the idea of Datasheets in ‘Datasheets for Datasets’. A datasheet aims to provide a structured format for describing a dataset. This includes questions like how and why it was constructed, what the data consists of, and how it could potentially be used. Crucially, datasheets also encourage a discussion of the bias and limitations of a dataset. Whilst you can identify some of these limitations by working with the data, there is also a crucial amount of information known by curators of the data that might not be obvious to end-users of the data. Datasheets offer one possible way for libraries to begin more systematically commuting this information. 

The dataset hub adopts the practice of writing datasheets and encourages users of the hub to write a datasheet for their dataset. For the British library books, we have attempted to write one of these datacards. Whilst it is certainly not perfect, it hopefully begins to outline some of the challenges of this dataset and gives end-users a better sense of how they should approach a dataset. 

18 March 2022

Looking back at LibCrowds: surveying our participants

'In the Spotlight' is a crowdsourcing project from the British Library that aims to make digitised historical playbills more discoverable, while also encouraging people to closely engage with this otherwise less accessible collection. Digital Curator Dr Mia Ridge writes...

If you follow our @LibCrowds account on twitter, you might have noticed that we've been working on refreshed versions of our In the Spotlight tasks on Zooniverse. That's part of a small project to enable the use of IIIF manifests on Zooniverse - in everyday language, it means that many, many more digitised items can form the basis of crowdsourcing tasks in the Zooniverse Project Builder, and In the Spotlight is the first project to use this new feature. Along with colleagues in Printed Heritage and BL Labs, I've been looking at our original Pybossa-based LibCrowds site to plan a 'graceful ending' for first phase of the project on LibCrowds.com.

As part of our work documenting and archiving the original LibCrowds site, I'm delighted to share summary results from a 2018 survey of In the Spotlight participants, now published on the British library's Research Repository: https://doi.org/10.23636/w4ee-yc34. Our thanks go to Susan Knight, Customer Insight Coordinator, for her help with the survey.

The survey was designed to help us understand who In the Spotlight participants were, and to help us prioritise work on the project. The 22 question survey was based on earlier surveys run by the Galaxy Zoo and Art UK Tagger projects, to allow comparison with other crowdsourcing projects, and to contribute to our understanding of crowdsourcing in cultural heritage more broadly. It was open to anyone who had contributed to the British Library's In the Spotlight project for historical playbills. The survey was distributed to LibCrowds newsletter subscribers, on the LibCrowds community forum and on social media.

Some headline findings from our survey include:

  • Respondents were most likely to be a woman with a Masters degree, in full-time employment, in London or Southeast UK, who contributes in a break between other tasks or 'whenever they have spare time'.
  • 76% of respondents were motivated by contributing to historical or performance research

Responses to the question 'What was it about this project which caused you to spend more time than intended on it?':

  • Easy to do
  • It's so entertaining
  • Every time an entry is completed you are presented with another item which is interesting and
  • illuminating which provides a continuous temptation regarding what you might discover next
  • simplicity
  • A bit of competitiveness about the top ten contributors but also about contributing something useful
  • I just got carried away with the fun
  • It's so easy to complete
  • Easy to want to do just a few more
  • Addiction
  • Felt I could get through more tasks
  • Just getting engrossed
  • It can be a bit addictive!
  • It's so easy to do that it's very easy to get carried away.
  • interested in the [material]

The summary report contains more rich detail, so go check it out!

 

Crowdsourcing projects from the British Library. 2,969 Volunteers. 265,648 Contributions. 175 Projects
Detail of the front page of libcrowds.com; Crowdsourcing projects from the British Library. 2,969 Volunteers. 265,648 Contributions. 175 Projects

Digital scholarship blog recent posts

Archives

Tags

Other British Library blogs