Digital scholarship blog


27 June 2022

IIIF-yeah! Annual Conference 2022

At the beginning of June Neil Fitzgerald, Head of Digital Research, and I attended the annual International Image Interoperability Framework (IIIF) Showcase and Conference in Cambridge, MA. The showcase was held in the Massachusetts Institute of Technology's iconic lecture theatre 10-250, and the conference in the Fong Auditorium of Boylston Hall on Harvard's campus. There was a stillness on the MIT campus; in contrast, Harvard Yard was busy with sightseeing members of the public and the dismantling of marquees from the previous weeks' end-of-year commencements.

View of the Massachusetts Institute of Technology dome; IIIF Consortium sticker reading 'IIIF-yeah!'; conference participants outside Boylston Hall, Harvard Yard


The conference atmosphere was energising, with participants excited to be back at an in-person event, the last having been held in Göttingen in 2019, with virtual meetings in the interim. IIIF has grown steadily over the last decade, as reflected by its fast-expanding community and the IIIF Consortium, which now comprises 63 organisations from across the GLAM and commercial sectors.

The Showcase on June 6th was an opportunity to welcome those new to IIIF and highlight recent community developments. I had the pleasure of presenting the work of the British Library and Zooniverse to enable new IIIF functionality on Zooniverse, in support of our In the Spotlight project, which crowdsources information about the Library's historical playbills collection. Other presentations covered the use of IIIF with audio, maps, and in teaching, learning and museum contexts, as well as the exciting plans to extend IIIF standards for 3D data. Harvard University gave an update on their efforts to adopt IIIF across the organisation; their IIIF resources webpage is a useful starting point. I was particularly impressed by the Leventhal Map and Education Center's digital maps initiatives, including their collaboration on Allmaps, a set of open source tools for curating, georeferencing and exploring IIIF maps (learn more).

The following two days were packed with brilliant presentations on IIIF infrastructure, collections enrichment, IIIF resource discovery, IIIF-enabled digital humanities teaching and research, improving user experience, and more. Digirati presented a new IIIF manifest editor, which is being further developed to support various use cases. Ed Silverton reported on the newest features of the Exhibit tool, which we at the British Library have started using to share engaging stories about our IIIF collections.

Ed Silverton presenting a slide about the Exhibit tool; conference presenters talking about the Audiovisual Metadata Platform; the conference reception under a marquee in Harvard Yard

I was interested to hear about Getty's vision of IIIF as an enabling technology, how it fits within their shared data infrastructure, and their multiple use cases, including using colour palette annotations to drive image backgrounds and the Quire publication process. It was great to hear how IIIF has been used in digital humanities research, as in the Mapping Colour in History project at Harvard, which enables historical analysis of artworks through pigment data annotations, or how IIIF helps to solve some of the challenges of remote resource aggregation for the Paul Laurence Dunbar initiative.

There was also much excitement about the detektIIIF browser extension for Chrome and Firefox, which detects IIIF resources on websites and helps collect and export IIIF manifests. Zentralbibliothek Zürich's customised version, ZB-detektIIIF, allows scholars to create IIIF collections in JSON-LD and link to the Mirador Viewer. There were several great presentations about IIIF players and tools for audio-visual content, such as Avalon, Aviary, Clover, the Audiovisual Metadata Platform and the Mirador video extension. And no IIIF Conference is ever complete without a #FunWithIIIF presentation by Cogapp's Tristan Roddis, this one capturing 30 cool projects using IIIF content and technology!

We all enjoyed lots of good conversations during the breaks and social events, and some great tours were on offer. I chose to visit the Boston Public Library's Leventhal Map and Education Center, with its exhibition about environment and social justice, and the BPL digitisation studio, the latter equipped with Internet Archive scanning stations and an impressive maps photography room.

Boston Public Library book trolleys; Boston Public Library maps digitisation studio; Rossitza Atanassova outside Boston Public Library


I was also delighted to pay a visit to the Harvard Library digitisation team, who generously showed me their imaging stations and a range of digitised collections, followed by a private guided tour of the Houghton Library's special collections and beautiful spaces. Huge thanks to all the conference organisers, the local committee, and the hosts for my visits, Christine Jacobson, Bill Comstock and David Remington. I learned a lot and had an amazing time.

Finally, all presentations from the three days have been shared, and some highlights were captured on Twitter under #iiif. In addition, this week the Consortium is offering four free online workshops to share IIIF best practices and tools with the wider community. Don't miss your chance to attend.

This post is by Digital Curator Rossitza Atanassova (@RossiAtanassova)

16 June 2022

Working With Wikidata and Wikimedia Commons: Poetry Pamphlets and Lotus Sutra Manuscripts

Greetings! I'm Xiaoyan Yang, from Beijing, China, an MSc student at University College London. It was a great pleasure to have the opportunity to do a four-week placement at the British Library and Wikimedia UK under the supervision of Lucy Hinnie, Wikimedian in Residence, and Stella Wisdom, Digital Curator, Contemporary British Collections. I mainly focused on the Michael Marks Awards for Poetry Pamphlets Project and the Lotus Sutra Project, and on the collaboration between the Library and Wikimedia.

What interested you in applying for a placement at the Library?

This kind of placement, in world-famous cultural institutions such as the Library and Wikimedia, was a brand-new experience for me. Because my undergraduate major was economic statistics, most of my past internships were in commercial and Internet technology companies. My interest in digital humanities research, especially in data, knowledge graphs, and visualisation, is driven by the wish to better combine information technologies with cultural resources, in order to reach a wider audience and promote the transmission of cultural and historical memory in a more accessible way.

Libraries are institutions for the preservation and dissemination of knowledge for the public, and the British Library is without doubt one of the largest and best libraries in the world. It has long been a leader and innovator in resource preservation and digitisation. The International Dunhuang Project (IDP), initiated by the British Library, is now one of the field's most representative transnational collaborative digital humanities projects. I applied for a placement hoping to learn more about the use of digital resources in real projects and the process of collaboration, from initial design through to subsequent delivery. I also wanted the chance to get involved in the practice of linked data, to accumulate experience, and to find directions for future improvement.

I would like to thank Dr Adi Keinan-Schoonbaert for her kind introduction to the British Library's Asian and African digitisation projects, especially the IDP, which enabled me to learn more about librarian-led practices in this area. At the same time, I was very happy to sit in on the weekly meetings of the Digital Scholarship team during this placement, which allowed me to observe how collaboration between different departments is carried out and managed in a large cultural organisation like the British Library.

Excerpt from the Lotus Sutra, Or.8210 S.155: an old scroll showing vertical lines of Chinese script. Kumārajīva, CC BY 4.0, via Wikimedia Commons

What is the most surprising thing you have learned?

In short, it is so easy to contribute knowledge to Wikimedia. In this placement, one of my very first tasks was to upload information about winning and shortlisted poems of the Michael Marks Awards for Poetry Pamphlets, for each year from 2009 to the latest, 2021, to Wikidata. The first step was to check whether a poem and its author and publisher already existed in Wikidata. If not, I created an item page for it. Before I started, I thought the process would be very complicated, but once I began following the manual, I found it was actually really easy. I just needed to click "Create a new Item".

I always remember that the first person item I created was for Sarah Jackson, one of the poets shortlisted for the award in 2009. The unique QID was automatically generated as Q111940266. With such a simple operation, anyone can contribute to the vast knowledge world of Wiki. Many people I have never met may read this item page in the future, a page created and refined by me at this moment. That feeling is magical and gives me a real sense of achievement. There are also many useful guides, examples and batch loading tools, such as QuickStatements, that help users start editing with joy. Useful guides include the Wikidata help pages for QuickStatements and material from the University of Edinburgh.
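For a flavour of what batch editing looks like, here is a minimal sketch of a QuickStatements (version 1) batch that creates a new person item. The columns are tab-separated, and the label and description shown are illustrative placeholders rather than actual project data:

    CREATE
    LAST	Len	"Example Poet"
    LAST	Den	"poet shortlisted for the Michael Marks Awards for Poetry Pamphlets"
    LAST	P31	Q5

CREATE makes a new item, LAST refers back to the item just created, Len and Den set the English label and description, and P31 ('instance of') with the value Q5 records that the item is a human.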

Image of a Wikidata SPARQL query listing information about the Michael Marks Poetry Pamphlet uploads.
An example of one of Xiaoyan’s queries - you can try it here!

How do you hope to use your skills going forward?

My current dissertation research focuses on regional classical Chinese poetry in the Hexi Corridor. This geographical area is deeply bound up with the history of the Silk Road and has inspired and attracted many poets to visit and write. My project aims to build a suitable ontology and knowledge graph, then combine them with GIS visualisation and text analysis to explore the historical, geographic, political and cultural changes in this area from the perspectives of time and space. Wikidata provides a standard way to undertake this work.

Thanks to Dr Martin Poulter's wonderful training and Stuart Prior's kind instructions, I quickly picked up some practical skills in constructing Wiki queries. The timeline and geographical visualisation tools offered by the Wikidata Query Service inspired me to develop my skills in this field further. What's more, although I haven't yet had a chance to try Wikibase, I am now very interested in it, thanks to Dr Lucy Hinnie and Dr Graham Jevon's introduction, and I will definitely try it in future.

Would you like to share some Wiki advice with us?

Wiki is very friendly for self-directed learning: the Help pages present various manuals and examples, all of which are very good learning resources. I will keep learning and exploring in the future.

I do want to share my feelings and a little experience with Wikidata. In the Michael Marks Awards for Poetry Pamphlets Project, all the properties needed to describe poets, poems and publishers could easily be found in the existing Wikidata property list. However, in the second project, on the Lotus Sutra, I encountered more difficulties. For example, it is hard to find suitable items and properties to represent the paragraphs of a scroll's text content or its binding design on Wikidata; at present this information is better represented on Wikimedia Commons.

However, as I studied more Wikidata examples, I understood more and more about Wikidata and the purpose of these restrictions. Maintaining concise structured data and accurate correlations is one of the main purposes of Wikidata: it encourages the reuse of existing properties and discourages long free-text descriptions. This feature of Wikidata therefore needs to be taken into account from the outset when designing metadata frameworks for data uploading.

In the end, I would like to sincerely thank my direct supervisor Lucy for her kind guidance, help, encouragement and affirmation, as well as the British Library and Wikimedia. I have received so much warm help and gained so much valuable practical experience, and I am very happy and honoured that I could use my knowledge and skills to make a small contribution to linked data. I will always cherish the wonderful memories from this placement and continue to explore the potential of digital humanities in the future.

This post is by Xiaoyan Yang, an MSc student at University College London, and was edited by Wikimedian in Residence Dr Lucy Hinnie (@BL_Wikimedian) and Digital Curator Stella Wisdom (@miss_wisdom).

12 April 2022

Making British Library collections (even) more accessible

Daniel van Strien, Digital Curator, Living with Machines, writes:

The British Library's digital scholarship department has made many digitised materials available to researchers. These include a collection of books digitised in partnership with Microsoft and processed using Optical Character Recognition (OCR) software to make the text machine-readable, as well as a collection of books digitised in partnership with Google.

Since being digitised, this collection of books has been used for many different projects, including recent work to augment the dataset with genre metadata and a project using machine learning to tag images extracted from the books. The books have also served as training data for a historic language model.

This blog post focuses on two challenges of working with this dataset, size and documentation, and discusses one potential approach we've experimented with for addressing them.

One of the challenges of working with this collection is its size. The OCR output is over 20GB. This poses some challenges for researchers and other interested users wanting to work with these collections. Projects like Living with Machines are one avenue in which the British Library seeks to develop new methods for working at scale. For an individual researcher, one of the possible barriers to working with a collection like this is the computational resources required to process it. 

Recently we have been experimenting with a Python library, datasets, to see if this can help make this collection easier to work with. The datasets library is part of the Hugging Face ecosystem. If you have been following developments in machine learning, you have probably heard of Hugging Face already. If not, Hugging Face is a delightfully named company focusing on developing open-source tools aimed at democratising machine learning. 

The datasets library aims to make it easier for researchers to share and efficiently process large datasets for machine learning. Whilst that was the library's original focus, there are also other use cases where it may help make datasets held by the British Library more accessible.

Some features of the datasets library:

  • Tools for efficiently processing large datasets 
  • Support for easily sharing datasets via a ‘dataset hub’ 
  • Support for documenting datasets hosted on the hub (more on this later). 
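If you want to try the examples below yourself, the library can be installed from PyPI in the usual way (this install command is a standard step, not something specific to the British Library dataset):

    pip install datasets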

As a result of these and other features, we have recently worked on adding the British Library books dataset to the Hugging Face hub. Making the dataset available via the datasets library makes it more accessible in a few different ways.

Firstly, it is now possible to download the dataset in two lines of Python code: 

    from datasets import load_dataset
    ds = load_dataset('blbooks', '1700_1799')

We can also use the library to process large datasets. For example, we may want to include only data with a high OCR confidence score (this partially helps filter out text with many OCR errors):

    ds.filter(lambda example: example['mean_wc_ocr'] > 0.9)

One of the particularly nice features here is that the library uses memory mapping to store the dataset under the hood. This means that you can process data that is larger than the RAM you have available on your machine. This can make the process of working with large datasets more accessible. We could also use this as a first step in processing data before getting back to more familiar tools like pandas. 

    dogs_data = ds['train'].filter(lambda example: "dog" in example['text'].lower())
    df = dogs_data.to_pandas()

In a follow-on blog post, we'll dig into the technical details of datasets more deeply. Whilst making the technical processing of datasets more accessible is one part of the puzzle, there are also non-technical challenges to making a dataset more usable.

 

Documenting datasets 

One of the challenges of sharing large datasets is documenting the data effectively. Traditionally, libraries have mainly focused on describing material at the 'item level', i.e. documenting one item at a time. However, there is a difference between documenting one book and 100,000 books. There are no easy answers to this, but one avenue libraries could explore is the use of datasheets. Timnit Gebru et al. proposed the idea in 'Datasheets for Datasets': a datasheet provides a structured format for describing a dataset, covering questions such as how and why it was constructed, what the data consists of, and how it could potentially be used. Crucially, datasheets also encourage a discussion of the biases and limitations of a dataset. Whilst you can identify some of these limitations by working with the data, there is also a crucial amount of information known to the curators of the data that might not be obvious to end-users. Datasheets offer one possible way for libraries to begin communicating this information more systematically.

The dataset hub adopts the practice of writing datasheets and encourages users of the hub to write one for their dataset. For the British Library books, we have attempted to write one of these dataset cards. Whilst it is certainly not perfect, it hopefully begins to outline some of the challenges of this dataset and gives end-users a better sense of how to approach it.

18 March 2022

Looking back at LibCrowds: surveying our participants

'In the Spotlight' is a crowdsourcing project from the British Library that aims to make digitised historical playbills more discoverable, while also encouraging people to closely engage with this otherwise less accessible collection. Digital Curator Dr Mia Ridge writes...

If you follow our @LibCrowds account on Twitter, you might have noticed that we've been working on refreshed versions of our In the Spotlight tasks on Zooniverse. That's part of a small project to enable the use of IIIF manifests on Zooniverse; in everyday language, it means that many, many more digitised items can form the basis of crowdsourcing tasks in the Zooniverse Project Builder, and In the Spotlight is the first project to use this new feature. Along with colleagues in Printed Heritage and BL Labs, I've been looking at our original Pybossa-based LibCrowds site to plan a 'graceful ending' for the first phase of the project on LibCrowds.com.

As part of our work documenting and archiving the original LibCrowds site, I'm delighted to share summary results from a 2018 survey of In the Spotlight participants, now published on the British Library's Research Repository: https://doi.org/10.23636/w4ee-yc34. Our thanks go to Susan Knight, Customer Insight Coordinator, for her help with the survey.

The survey was designed to help us understand who In the Spotlight participants were and to help us prioritise work on the project. The 22-question survey was based on earlier surveys run by the Galaxy Zoo and Art UK Tagger projects, to allow comparison with other crowdsourcing projects and to contribute to our understanding of crowdsourcing in cultural heritage more broadly. It was open to anyone who had contributed to the British Library's In the Spotlight project for historical playbills, and was distributed to LibCrowds newsletter subscribers, on the LibCrowds community forum and on social media.

Some headline findings from our survey include:

  • The typical respondent was a woman with a Master's degree, in full-time employment, in London or South East UK, who contributes in a break between other tasks or 'whenever they have spare time'.
  • 76% of respondents were motivated by contributing to historical or performance research

Responses to the question 'What was it about this project which caused you to spend more time than intended on it?':

  • Easy to do
  • It's so entertaining
  • Every time an entry is completed you are presented with another item which is interesting and illuminating which provides a continuous temptation regarding what you might discover next
  • simplicity
  • A bit of competitiveness about the top ten contributors but also about contributing something useful
  • I just got carried away with the fun
  • It's so easy to complete
  • Easy to want to do just a few more
  • Addiction
  • Felt I could get through more tasks
  • Just getting engrossed
  • It can be a bit addictive!
  • It's so easy to do that it's very easy to get carried away.
  • interested in the [material]

The summary report contains more rich detail, so go check it out!

 

Detail of the front page of libcrowds.com; Crowdsourcing projects from the British Library. 2,969 Volunteers. 265,648 Contributions. 175 Projects

10 March 2022

Scoping the connections between trusted arts and humanities data repositories

CONNECTED: Connecting trusted Arts and Humanities data repositories is a newly funded activity, supported by the AHRC. It is led by the British Library, with the Archaeology Data Service and the Oxford Text Archive as co-investigators, and is supported by consultants from the MoreBrains Cooperative. The CONNECTED team believes that improving the discovery and curation of heritage and emergent content types in the arts and humanities will increase the impact of cultural resources and enhance equity. Great work is already being done on discovery services for the sector, so we decided to look upstream and focus on facilitating repository and archive deposit.

The UK boasts a dynamic institutional repository environment in the HE sector, as well as a range of subject- and field-specific repositories. Although a distributed repository landscape is now firmly established, challenges and inefficiencies remain that reduce its impact. These include issues around discovery and access, but also questions around interoperability, the relationship between specialised and general infrastructures, and potential duplication of effort from an author/depositor perspective. Greater coherence and interoperability would effectively unite different trusted repository services into a resilient distributed data service, which can grow over time as new services are required and developed. Alongside the other projects funded as part of 'Scoping future data services for the arts and humanities', CONNECTED will help to deliver this unified network.

As practice in the creative arts becomes more digital and the digital humanities continue to thrive, the diversity of ways in which this research is expressed continues to grow. Researchers are increasingly able to combine artefacts, documents, and materials in new and innovative ways; practice-based research in the arts is creating a diverse range of (often complex) outputs with new curation and discovery needs; and heritage collections often contain artefacts with large amounts of annotation and commentary amassed over years or centuries, across multiple formats, and with rich contextual information. This expansion is already exposing the limitations of our current information systems, with the potential for vital context and provenance to become invisible. Without careful future-proofing, the risks of information loss and limits on access will only grow. Metadata creation, deposit, preservation, and discovery strategies should therefore be tailored to meet the very different needs of the arts and humanities.

A number of initiatives aim to improve interoperability between metadata sources in ways that are more oriented towards the needs of the arts and humanities. Drawing these together with insights gained from the capabilities (and limitations) of bibliographic and data-centric metadata and discovery systems will help to generate robust services in the complex, evolving landscape of arts and humanities research and creation.

The CONNECTED project will assemble experts, practitioners, and researchers to map current gaps in the content curation and discovery ecosystem and weave together the strengths and potentials of a range of platforms, standards, and technologies in the service of the arts and humanities community. Our activities will run until the end of May, and will comprise three phases:

Phase 1 - Discovery

We will focus on repository or archive deposit as a foundation for the discovery and preservation of diverse outputs, and also as a way to help capture the connections between those objects and the commentary, annotation, and other associated artefacts. 

A data service for the arts and humanities must be developed with researcher needs as a priority, so the project team will engage in a series of semi-structured interviews with a variety of stakeholders including researchers, librarians, curators, and information technologists. The interviews will explore the following ideas:

  • What do researchers need when engaging in discovery of both heritage materials and new outputs?
  • Are there specific needs that relate to different types of content or use-cases? For example, research involving multimedia or structured information processing at scale?
  • What can the current infrastructure support, and where are the gaps between what we have and what we need?
  • What are the feasible technical approaches to transform information discovery?

Phase 2 - Data service programme scoping and planning

The findings from phase 1 will be synthesised using a commercial product strategy approach known as canvas analysis. Based on initial impressions from the semi-structured interviews, it is likely that an agile, product, or value proposition canvas will be used to synthesise the findings and structure thinking so that a coherent and robust strategy can be developed. Outputs from the canvas exercise will then feed into a fully costed and scoped product roadmap and budget for a national data deposit service for the arts and humanities.

Phase 3 - Scoping a unified archiving solution

Building on the partnerships and conversations from the previous phases, the feasibility of a unified ‘deposit switchboard’ will be explored. The purpose of such a switchboard is to enable researchers, curators, and creators to easily deposit items in the most appropriate repository or archive in their field for the object type they are uploading. Using insights gained from the landscaping interviews in phase 1, the team will identify potential pathways to developing a routing service for channelling content to the most appropriate home.

We will conclude with a virtual community workshop to explore the challenges and desirability of the switchboard approach, with a special focus on the benefits this could bring to the uploader of new content and resources.

This is an ambitious project, through which we hope to deliver:

  • A fully costed and scoped technical and organisational roadmap to build the required components and framework for the National Collection
  • Improved usage of resources in the wider GLAM and institutional network, including of course the Archaeology Data Service, The British Library's Shared Research Repository, and the Oxford Text Archive
  • Steps towards a truly community-governed data infrastructure for the arts and humanities as part of the National Collection

As a result of this work, access to UK cultural heritage and outputs will be accelerated and simplified, the impact of the arts and humanities will be enhanced, and we will help the community to consolidate the UK's position as a global leader in digital humanities and infrastructure.

This post is from Rachael Kotarski (@RachPK), Principal Investigator for CONNECTED, and Josh Brown from MoreBrains.

14 February 2022

PhD Placement on Mapping Caribbean Diasporic Networks through Correspondence

Every year the British Library hosts a range of PhD placement scheme projects. If you are interested in applying for one of these, the 2022 opportunities are advertised here. There are currently 15 projects available across Library departments, all starting from June 2022 onwards and ending before March 2023. If you would like to work with born-digital collections, you may want to read last week's Digital Scholarship blog post about two projects on enhanced curation, hybrid archives and emerging formats. However, if you are interested in Caribbean diasporic networks and want to experiment with creating network analysis visualisations, then read on to find out more about the "Mapping Caribbean Diasporic Networks through correspondence (2022-ACQ-CDN)" project.

This is an exciting opportunity to be involved with the preliminary stages of a project to map the Caribbean Diasporic Network evident in the ‘Special Correspondence’ files of the Andrew Salkey Archive. This placement will be based in the Contemporary Literary and Creative Archives team at the British Library with support from Digital Scholarship colleagues. The successful candidate will be given access to a selection of correspondence files to create an item level dataset and explore the content of letters from the likes of Edward Kamau Brathwaite, C.L.R. James, and Samuel Selvon.

Photograph of Andrew Salkey
Photograph of Andrew Salkey, from the Andrew Salkey Archive, Deposit 10310. With kind permission of Jason Salkey.

The main outcome envisaged for this placement is a dataset, built from a sample of ten files, linking and mapping the correspondents' names, the locations they were writing from, and the dates of the correspondence in a spreadsheet. The placement student will also learn how to use the Gephi Open Graph Visualisation Platform to create a visual representation of this network, associating individuals with each other and mapping their movement across the world between the 1950s and 1990s.
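As an illustration of the kind of spreadsheet involved, here is a minimal, hypothetical sketch; the correspondents' names come from the archive described above, but the locations, dates and letter counts are invented placeholders:

    correspondent,location,date
    C.L.R. James,London,1965-03-12
    Samuel Selvon,Port of Spain,1958-07-02

From an item-level listing like this, one can derive the 'nodes' and 'edges' tables that network software such as Gephi expects: for example, a nodes sheet of people (Id, Label) and an edges sheet recording who wrote to whom and how often (Source, Target, Weight).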

Gephi is open-source software for visualising and analysing networks. Its documentation includes a step-by-step guide to getting started, the first step of which is to upload a spreadsheet detailing your 'nodes' and 'edges'. To show how Gephi can be used, we've included an example below, created by previous British Library research placement student Sarah FitzGerald from the University of Sussex, who used data from the Endangered Archives Programme (EAP) to create a Gephi visualisation of all EAP applications received between 2004 and 2017.

Gephi network visualisation diagram
Network visualisation of EAP Applications created by Sarah FitzGerald

In this visualisation the size of each country relates to the number of applications it features in, as country of archive, country of applicant, or both. The colours show related groups. Each line shows the direction and frequency of applications: the line always travels in a clockwise direction from country of applicant to country of archive, and the thicker the line, the more applications. Where the country of applicant and country of archive are the same, the line becomes a loop. If you want to read more about the other visualisations that Sarah created during her project, please check out these two blog posts:

We hope this new PhD placement will offer the successful candidate the opportunity to develop their specialist knowledge through access to the extensive correspondence series in the Andrew Salkey archive, and to undertake practical research in a curatorial context by improving the accessibility of linked metadata for this collection material. This project is a vital building block in improving the Library’s engagement with this material and exploring the ways it can be accessed by a wider audience.

If you want to apply, details are available on the British Library website at https://www.bl.uk/research-collaboration/doctoral-research/british-library-phd-placement-scheme. Applications for all 2022/23 PhD placements close on Friday 25 February 2022, 5pm GMT. The application form and guidelines are available online here. Please address any queries to research.development@bl.uk.

This post is by Digital Curator Stella Wisdom (@miss_wisdom) and Eleanor Casson (@EleCasson), Curator in Contemporary Archives and Manuscripts.

26 January 2022

Which Came First: The Author or the Text? Wikidata and the New Media Writing Prize

Congratulations to the 2021 New Media Writing Prize (NMWP) winners, who were announced at a recent Bournemouth University online event: Joannes Truyens and collaborators (Main Prize), Melody MOU (Student Award) and Daria Donina (FIPP Journalism Award 2021). The main prize winner, 'Neurocracy', is an experimental dystopian narrative that takes place over 10 episodes through Omnipedia, an imagined future version of Wikipedia in 2049. This seemed like a very apt jumping-off point for today's blog post, which discusses a recent project in which we added NMWP data to Wikidata.

Omnipedia, an imagined futuristic version of Wikipedia from Neurocracy by Joannes Truyens

Note: if you wish to read 'Neurocracy' and are prompted for credentials, use the username NewMediaWritingPrize1 and the password N3wMediaWritingPrize!. You can learn more about the work in this article and listen to an interview with the author in this podcast episode.

Working With Wikidata

Dr Martin Poulter describes learning how to work with Wikidata as being like learning a language. When I first heard this description, I didn’t understand: how could something so reliant on raw data be anything like the intricacies of language learning?

It turns out, Martin was completely correct.

Imagine a stack of data as slips of paper. Each slip has an individual piece of data on it: an author’s name, a publication date, a format, a title. How do you start to string this data together so that it makes sense?

One of the beautiful things about Wikidata is that it is both machine and human readable. In order for it to work this way, and for us to upload it effectively, thinking about the relationships between these slips of paper is essential.

In 2021, I had an opportunity to see what Martin was talking about when he spoke about language, as I was asked to work with a set of data about NMWP shortlisted and winning works, which the British Library has collected in the UK Web Archive. You can read more about this special collection here and here.


About the New Media Writing Prize

The New Media Writing Prize was founded in 2010 to showcase exciting and inventive stories and poetry that integrate a variety of digital formats, platforms, and media. One of the driving forces in setting up and establishing the prize was Chris Meade, director of if:book uk, a 'think and do tank' exploring digital and collaborative possibilities for writers and readers. He was the lead sponsor of the if:book uk New Media Writing Prize and of the Dot Award, which he created in honour of his mother, Dorothy, and he chaired every NMWP awards evening from 2010 onwards. Very sadly, Chris passed away on 13 January 2022, and the recent 2021 awards event was dedicated to him and his family.

Recognising the significance of the NMWP, in recent years the British Library created the New Media Writing Prize Special Collection as part of its emerging formats work. With 11 years of metadata about a born-digital collection, this was an ideal dataset for me to work with in order to establish a methodology for Wikidata uploads at the Library.

Last year I was fortunate to collaborate with Tegan Pyke, a PhD placement student in the Contemporary British Publications Collections team, supervised by Giulia Carla Rossi, Curator for Digital Publications. Tegan's project examined the digital preservation challenges of complex digital objects, developing and testing a quality assurance process for examining works in the NMWP collection; if you want to read more, a report is available here. For the Wikidata work, Tegan and Giulia provided two spreadsheets of data (or slips of paper!), and my aim was to upload linked data covering the authors, their works, and the award itself - who had been shortlisted, who had won, and when.

Simple, right?

Getting Started

I thought so - until I began to structure my uploads. There were some key questions that needed to be answered about how these relationships would be built, and I needed to start somewhere. Should I upload the authors or the texts first? Should I go through the prize year by year, or be led by other information? And what about texts with multiple authors?

Suddenly it all felt a bit more intimidating!

I was fortunate to attend some Wikidata training run by Wikimedia UK late last year. Martin was our trainer, and one piece of advice he gave us was indispensable: if you’re not sure where to start, literally write it out with pencil and paper. What is the relationship you’re trying to show, in its simplest form? This is where language framing comes in especially useful: thinking about the basic sentence structures I’d learned in high school German became vital.

Image shows four simple sentences: 'Christine Wilks won NMWP in 2010. Christine Wilks wrote Underbelly. Underbelly won NMWP in 2010. NMWP was won by Christine Wilks in 2010.' Christine Wilks is circled in green, NMWP in purple, and Underbelly in yellow. QIDs are listed: Q108810306, highlighted in green; Q108459688, highlighted in purple; Q109237591, highlighted in yellow. Properties are listed: P166, highlighted in blue; P800, highlighted in turquoise; P585, highlighted in orange.
Image by the author, notes own.

The Numbers Bit

You can see from this image how the framework develops: specific items, like nouns, are given identification numbers when they become a Wikidata item. This is their QID. The relationships between QIDs, sort of like the adjectives and verbs, are defined as properties and have P numbers. So Christine Wilks is now Q108810306, and her relationship to her work, Underbelly, or Q109237591, is defined with P800 which means ‘notable work’.

Q108810306 - P800 - Q109237591

You can upload this relationship using the visual editor on Wikidata, by clicking fields and entering data. If you have a large amount of information (remember those slips of paper!), tools like QuickStatements become very useful. Dominic Kane blogged about his experience of this system during his British Library student placement project in 2021.

The intricacies of language also matter on Wikidata: the nuance and inference we draw from specific terms is important. The concept of 'winning' an award became a subject of semantic debate: Wikidata's taxonomy advises using 'award received' in the case of a literary prize, as it is less of an active sporting competition than something like a marathon or an athletic event.
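Following the pattern shown earlier, the award relationship can be written as a statement with a qualifier: the author item receives P166 ('award received') pointing at the NMWP item, with P585 ('point in time') recording the year. For Christine Wilks, using the QIDs and properties from the image above, that is roughly:

    Q108810306 - P166 - Q108459688 (qualifier: P585 - 2010)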

Asking Questions of the Data

Ultimately we upload information to Wikidata so that it can be queried. Querying uses SPARQL, a language which allows users to draw information and patterns from vast swathes of data. Querying can be complex: to return to the language analogy, you have to phrase the query in precisely the right way to get the information you want.

One of the lessons I learned during the NMWP uploads was the importance of a unifying property. Users will likely query this data with a view to surveying results and finding patterns. Each author and work, therefore, needed to be linked to the prize and the collection itself (pictured above). By adding the collection's QID as the value of the property P6379 ('has works in the collection') on each item, we create a web of data that links every shortlisted author over the 11-year period.
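To make that concrete, here is a minimal sketch of how such a query could be run from Python against the public Wikidata Query Service. The property P6379 is real; the QID below is the NMWP item from the example image above, used purely for illustration, and the actual collection item linked via P6379 may differ:

    import requests

    # Find items linked via P6379 ('has works in the collection')
    # to the illustrative NMWP item, with English labels.
    query = """
    SELECT ?author ?authorLabel WHERE {
      ?author wdt:P6379 wd:Q108459688 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """

    response = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "nmwp-example/0.1 (demo script)"},
    )
    for row in response.json()["results"]["bindings"]:
        print(row["authorLabel"]["value"])

The same SPARQL can also be pasted straight into the Wikidata Query Service web interface.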

Getting Started

To have a look at some of the NMWP data, here are some queries I prepared earlier. Please note that data from the 2021 competition has not yet been uploaded!

Authors who won NMWP

Works that won NMWP

Authors nominated for NMWP

Works nominated for NMWP

If you fancy trying some queries but don’t know where to start, I recommend these tutorials:

Tutorials

Resources About SPARQL

This post is by Wikimedian in Residence Dr Lucy Hinnie (@BL_Wikimedian).

30 November 2021

BL Labs Online Symposium 2021, Special Climate Change Edition: Speakers Announced!

BL Labs 9th Symposium – Special Climate Change Edition is taking place on Tuesday 7 December 2021. This special event is devoted to looking at computational research and climate change.

A polar bear jumping off an iceberg with the rear of a ship showing. Image captioned: 'A Bear Plunging Into The Sea'
British Library digitised image from page 303 of "A Voyage of Discovery, made under the orders of the Admiralty, in his Majesty's ships Isabella and Alexander for the purpose of exploring Baffin's Bay, and enquiring into the possibility of a North-West Passage".

To help us explore a range of complex issues at the intersection of computational research and climate change we are delighted to announce our expert panel:

  • Schuyler Esprit – Founding Director of Create Caribbean Research Institute & Research Officer at the School of Graduate Studies and Research at the University of the West Indies
  • Helen Hardy – Science Digital Programme Manager at the Natural History Museum, London, responsible for mass digitisation of the Museum’s collections of 80 million items
  • Joycelyn Longdon – Founder of ClimateInColour, a platform at the intersection of climate science and social justice, and PhD student on the Artificial Intelligence for Environmental Risk programme at the University of Cambridge
  • Gavin Shaddick – Chair of Data Science and Statistics, University of Exeter, Director of the UKRI funded Centre for Doctoral Training in Environmental Intelligence: Data Science and AI for Sustainable Futures, co-Director of the University of Exeter-Met Office Joint Centre for Excellence in Environmental Intelligence and an Alan Turing Fellow
  • Richard Sandford – Professor of Heritage Evidence, Foresight and Policy at the Institute of Sustainable Heritage at University College London
  • Joseph Walton – Research Fellow in Digital Humanities and Critical and Cultural Theory at the University of Sussex

Join us for this exciting discussion, addressing issues such as how digitisation can improve research efficiency, the pros and cons of AI and machine learning in relation to climate change, and the links between new technologies, climate and social justice.

You can see more details about our panel and book your place here.
