Digital scholarship blog


11 November 2020

BL Labs Online Symposium 2020: Book your place for Tuesday 15-Dec-2020

Posted by Mahendra Mahey, Manager of BL Labs

The BL Labs team are pleased to announce that the eighth annual British Library Labs Symposium 2020 will be held online on Tuesday 15 December 2020, from 13:45 to 16:55* (see note below). The event is FREE, but you must book a ticket in advance to reserve your place. Last year's event was the largest we have ever held, so please don't miss out; book early, and see more information here!

*Please note that directly after the Symposium, we are organising an experimental online networking and mingling session between 16:55 and 17:30!

The British Library Labs (BL Labs) Symposium is an annual event and awards ceremony showcasing innovative projects that use the British Library's digital collections and data. It provides a platform for highlighting and discussing the use of the Library’s digital collections for research, inspiration and enjoyment. The awards this year will recognise outstanding use of the British Library's digital content in the categories of Research, Artistic, Educational, Community and British Library staff contributions.

This is our eighth annual symposium and you can see videos from previous Symposia in 2019, 2018, 2017, 2016, 2015 and 2014, as well as our launch event in 2013.

Dr Ruth Ahnert, Professor of Literary History and Digital Humanities at Queen Mary University of London, and Principal Investigator on 'Living With Machines' at The Alan Turing Institute
Ruth Ahnert will be giving the BL Labs Symposium 2020 keynote this year.

We are very proud to announce that this year's keynote will be delivered by Ruth Ahnert, Professor of Literary History and Digital Humanities at Queen Mary University of London, and Principal Investigator on 'Living With Machines' at The Alan Turing Institute.

Her work focuses on Tudor culture, book history, and digital humanities. She is author of The Rise of Prison Literature in the Sixteenth Century (Cambridge University Press, 2013), editor of Re-forming the Psalms in Tudor England, as a special issue of Renaissance Studies (2015), and co-author of two further books: The Network Turn: Changing Perspectives in the Humanities (Cambridge University Press, 2020) and Tudor Networks of Power (forthcoming with Oxford University Press). Recent collaborative work has taken place through AHRC-funded projects ‘Living with Machines’ and 'Networking the Archives: Assembling and analysing a meta-archive of correspondence, 1509-1714’. With Elaine Treharne she is series editor of the Stanford University Press’s Text Technologies series.

Ruth's keynote is entitled: Humanists Living with Machines: reflections on collaboration and computational history during a global pandemic

You can follow Ruth on Twitter.

There will be Awards announcements throughout the event for the Research, Artistic, Community, Teaching & Learning and Staff categories, and this year we are asking the audience to vote for their favourite shortlisted project: a people's BL Labs Award!

There will be a final talk near the end of the conference and we will announce the speaker for that session very soon.

So don't forget to book your place for the Symposium today. We predict it will be another full house (our first one online) and we don't want you to miss out; see more detailed information here.

We look forward to seeing new faces and meeting old friends again!

For any further information, please contact labs@bl.uk

04 November 2020

Transforming Legacy Indexes into Catalogue Entries

This guest post is by Alex Hailey, Curator of Modern Archives and Manuscripts. He's on Twitter as @ajrhailey.

In late 2019 I was lucky enough to join BL and National Archives staff to trial a PG Certificate in Computing for Cultural Heritage at Birkbeck. The course provided an introduction to programming with Python, the basics of SQL, and using the two to work with data. Fellow attendees Graham, Nick, Chris and Giulia have written about their work previously, and I am going to briefly introduce one of my project tasks addressing issues with legacy metadata within the India Office Records.

 

The original data

The IOR/E/4 Correspondence with India series consists of 1,112 volumes dating from 1703-1858: four series of letters received by the East India Company (EIC) Court of Directors from the administration in India, and four series of dispatches sent to India. Catalogue entries for these volumes contain only basic information – title, dates, language, reference and former references – and subject, name and place access to the dispatches is provided through 72 index volumes (reference IOR/Z/E/4), which contain around 430,000 entries.

Sample catalogue record titled Pensions, Carnatic, Proceedings respecting from Reference IOR/Z/E/4/42/P133
Sample catalogue record of an index entry, IOR/Z/E/4/42/P133

The original indexes were produced from 1901-1929 by staff of the Secretarial Bureau, led by indexing pioneer Mary Petherbridge; my colleague Antonia Moon has written about Petherbridge’s work in a previous post. When these indexes were converted to the catalogue in the early 2010s, entries within the index volumes were entered as child or sub-items of the index volumes themselves, with information on the related correspondence volumes entered into the free-text Related material field, as shown in the image above.

 

Problem and solution

This approach has caused some issues. Firstly, users attempting to order the related correspondence regularly end up trying to place an order for an index volume instead, which is frustrating. Secondly, it is practically impossible to determine the whole contents of a particular volume quickly and easily, which hinders access and use.

Manually working through 430,000 entries to group them by volume would be an impossible task, but I was able to use Python and a library called Pandas, which has a number of useful features for examining and manipulating catalogue data: methods for reading and writing data from multiple sources, flexible reshaping of datasets, and methods for aggregation, indexing, and splitting and replacing strings (including with regular expressions).

Using Pandas I was able to separate information in the Related material field, restructure the data so that each instance of an index entry formed an individual record, and then group these by volume and further arrange them alphabetically or by page order.
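As a rough illustration of the kind of Pandas steps involved (this is a sketch rather than my exact project code, and the filename, column names and reference format are invented for the example):

```python
import pandas as pd

# Illustrative export of the index entries; column names are assumptions
df = pd.read_csv('ior_z_e_4_index_entries.csv')

# The Related material field holds one or more volume references in free text,
# e.g. "IOR/E/4/881 p.87; IOR/E/4/902 p.377" - split it and give each
# index entry / volume reference pair its own row
df['related'] = df['related_material'].str.split(';')
entries = df.explode('related')
entries['related'] = entries['related'].str.strip()

# Pull the volume reference and page number out of each related-material string
entries[['volume', 'page']] = entries['related'].str.extract(
    r'(IOR/E/4/\d+)\D*(\d+)')
entries['page'] = pd.to_numeric(entries['page'])

# Group by volume, then arrange alphabetically by entry title or by page order
by_title = entries.sort_values(['volume', 'title'])
by_page = entries.sort_values(['volume', 'page'])
by_page.to_csv('ior_e_4_entries_by_volume.csv', index=False)
```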

 

Index entries for reference IOR/Z/E/4/42/P133 split into separate records


Outputs and analysis

Examining these outputs gave us new insights into the data. We now know that the indexes cover 230 volumes of the dispatches only. We were also able to identify incomplete references originally recorded in the Related material field, as well as what appear to be keying errors (references which fall outside of the range of the dispatches series). We can now follow these up and correct errors in the catalogue which were previously unknown.

Comparing the data at volume level arranged alphabetically and by page order, we could appreciate just how much depth there was to the index. Traditional indexes are written with a lot of information redundancy, which isn’t immediately apparent until you group the entries according to their location within a particular volume:

Example of index entries arranged by page order, for example, 'Chart, Maps & Surveys, Harbours, Dalrymples' plans of, sent to India, pp87, 377' followed by 'East Indian Ports, Plans of Dalrymple publishing, pp87, 377' etc.
Example of index entries arranged by page order

After discussion with the IOR team we have decided to take the alphabetically arranged data and import it to the archives catalogue, so that users selecting a dispatches volume are presented with the relevant index entries immediately.

The original dataset and derived datasets have been uploaded to the Library’s research repository where they are available for download and reuse under a CC0 licence.

To enable further analysis of the index data I have also tried my hand at creating a Jupyter Notebook to use with the derived data. This is intended to introduce colleagues to using Notebooks, Python and the Pandas library to examine catalogue metadata: conducting basic queries, producing a visualisation and exporting subsets for further investigation.
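For a flavour of what the Notebook does, a minimal sketch is below; the filename and column names are illustrative, and it assumes the pandas, wordcloud and matplotlib packages are installed:

```python
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Assumed filename and column names - adjust to match the derived dataset
entries = pd.read_csv('ior_z_e_4_derived.csv')

# A basic query: how many index entries point at each dispatches volume?
print(entries['volume'].value_counts().head(10))

# Export a subset for further investigation, e.g. everything mentioning Madras
madras = entries[entries['title'].str.contains('Madras', case=False, na=False)]
madras.to_csv('madras_entries.csv', index=False)

# A quick visualisation: a wordcloud of the index entry titles
cloud = WordCloud(width=800, height=400, background_color='white')
cloud.generate(' '.join(entries['title'].dropna()))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```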

Wordcloud based on terms contained in the IOR/Z/E/4 data, generated within the Jupyter Notebook. Some of the larger, highlighted words are 'respecting', 'Army', 'India', 'Administration', 'Department', 'Madras', etc. Some small words include 'late', 'allowances', 'paid', 'appointment', 'repair', etc.
Wordcloud based on terms contained in the IOR/Z/E/4 data, generated within the Jupyter Notebook.

My Birkbeck project also included work to create place and institution authority files for the Proceedings of the Governments of India series using keyword extraction with existing catalogue metadata, and this will be discussed in a future post.

Huge thanks must go to Nora McGregor, Jo Pugh and the folks at Birkbeck Department of Computer Science for developing the course and providing us with this opportunity; Antonia Moon and the IOR team for helpful discussions about the IOR data; and the rest of the cohort for moral support when the computer just wouldn’t behave.

Alex Hailey

Curator of Modern Archives and Manuscripts

23 October 2020

BL Labs Public Award Runner Up (Research) 2019 - Automated Labelling of People in Video Archives

Example people identified in TV news related programme clips
People 'automatically' identified in digital TV news related programme clips.

Guest blog post by Andrew Brown (PhD researcher),  Ernesto Coto (Research Software Engineer) and Andrew Zisserman (Professor) of the Visual Geometry Group, Department of Engineering Science, University of Oxford, and BL Labs Public Award Runner-up for Research, 2019. Posted on their behalf by Mahendra Mahey, Manager of BL Labs.

In this work, we automatically identify and label (tag) people in large video archives without the need for any manual annotation or supervision. The project was carried out with the British Library on a sample of 106 videos from their “Television and radio news” archive, a large collection of news programmes from the last 10 years. This archive serves as an important and fascinating resource for researchers and the general public alike. However, the sheer scale of the data, coupled with a lack of relevant metadata, makes indexing, analysing and navigating this content an increasingly difficult task. Relying on human annotation is no longer feasible, and without an effective way to navigate these videos, this bank of knowledge is largely inaccessible.

As users, we are typically interested in human-centric queries such as:

  • “When did Jeremy Corbyn first appear in a Newsnight episode?” or
  • “Show me all of the times when Hugh Grant and Shirley Williams appeared together.”

Currently this is nigh on impossible without trawling through hundreds of hours of content. 

We posed the following research question:

Is it possible to enable automatic person-search capabilities such as this in the archive, without the need for any manual supervision or labelling?

The answer is “yes”, and the method is described next.

Video Pre-Processing

The basic unit which enables person labelling in videos is the face-track: a group of consecutive face detections within a shot that correspond to the same identity. Face-tracks are extracted from all of the videos in the archive. The task of labelling the people in the videos is then to assign a label to each of these extracted face-tracks. The video below gives an example of two face-tracks found in a scene.


Two face-tracks found in British Library digital news footage by Visual Geometry Group - University of Oxford.

Techniques at Our Disposal

The base technology used for this work is a state-of-the-art convolutional neural network (CNN), trained for facial recognition [1]. The CNN extracts feature-vectors (a list of numbers) from face images, which indicate the identity of the depicted person. To label a face-track, the distance between the feature-vector for the face-track, and the feature-vector for a face-image with known identity is computed. The face-track is labelled as depicting that identity if the distance is smaller than a certain threshold (i.e. they match). We also use a speaker recognition CNN [2] that works in the same way, except it labels speech segments from unknown identities using speech segments from known identities within the video.
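As a simple sketch of that matching step (illustrative only, not the pipeline code used in the project; the distance threshold would need tuning):

```python
import numpy as np

def match_identity(track_vec, known_vecs, names, threshold=1.0):
    """Label a face-track by comparing its feature-vector with known identities.

    track_vec  : 1-D feature-vector for the face-track
    known_vecs : 2-D array, one row per known identity
    names      : identity label for each row of known_vecs
    threshold  : maximum distance accepted as a match (needs tuning)
    """
    # L2-normalise so Euclidean distance behaves consistently across vectors
    track_vec = track_vec / np.linalg.norm(track_vec)
    known_vecs = known_vecs / np.linalg.norm(known_vecs, axis=1, keepdims=True)

    distances = np.linalg.norm(known_vecs - track_vec, axis=1)
    best = int(np.argmin(distances))
    return names[best] if distances[best] < threshold else None
```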

Labelling the Face-Tracks

Our method for automatically labelling the people in the video archive is divided into three main stages:

(1) Our first labelling method uses what we term a “celebrity feature-vector bank”, which consists of names of people that are likely to appear in the videos, and their corresponding feature-vectors. The names are automatically sourced from IMDB cast lists for the programmes (the titles of the programmes are freely available in the meta-data). Face-images for each of the names are automatically downloaded from image-search engines. Incorrect face-images and people with no images of themselves on search engines are automatically removed at this stage. We compute the feature-vectors for each identity and add them to the bank alongside the names. The face-tracks from the video archives are then simply labelled by finding matches in the feature-vector bank.

Face-tracks from the video archives are labelled by finding matches in the feature-vector bank.
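A minimal sketch of building such a bank might look like the following, assuming a helper function that wraps the face-recognition CNN; each identity's downloaded images are averaged into a single feature-vector, which can then be matched against face-tracks as in the earlier sketch:

```python
import numpy as np

def build_feature_bank(images_per_name, extract_features):
    """Average the feature-vectors of each person's downloaded face images.

    images_per_name  : dict mapping a name to a list of face images
                       (names with no usable images have already been removed)
    extract_features : assumed callable wrapping the face-recognition CNN
    """
    bank_names, bank_vecs = [], []
    for name, images in images_per_name.items():
        vecs = np.stack([extract_features(img) for img in images])
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
        bank_names.append(name)
        bank_vecs.append(vecs.mean(axis=0))   # one averaged vector per identity
    return bank_names, np.stack(bank_vecs)
```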

(2) Our second labelling method uses the idea that if a name is spoken, or found displayed in a scene, then that person is likely to be found within that scene. The task is then to automatically determine whether there is a correspondence or not. Text is automatically read from the news videos using Optical Character Recognition (OCR), and speech is automatically transcribed using Automatic Speech Recognition (ASR). Names are identified and they are searched for on image search engines. The top ranked images are downloaded and the feature-vectors are computed from the faces. If any are close enough to the feature-vectors from the face-tracks present in the scene, then that face-track is labelled with that name. The video below details this process for a written name.


Using text or spoken word and face recognition to identify a person in a news clip.

(3) For our third labelling method, we use speaker recognition to identify any non-labelled speaking people. We use the labels from the previous two stages to automatically acquire labelled speech segments from the corresponding labelled face-tracks. For each remaining non-labelled speaking person, we extract the speech feature-vector and compute the distance of it to the feature-vectors of the labelled speech segments. If one is close enough, then the non-labelled speech segment and corresponding face-track is assigned that name. This process manages to label speaking face-tracks with visually challenging faces, e.g. deep in shadow or at an extremely non-frontal pose.

Indexing and Searching Identities

The results of our work can be browsed via a web search engine of our own design. A search bar allows users to specify the person or group of people that they would like to search for. People’s names are efficiently indexed so that the complete list of names can be filtered as the user types in the search bar. The search results are returned instantly with their associated metadata (programme name, date and time) and can be displayed in multiple ways. The video associated with each search result can be played, visualising the location and the name of all identified people in the video. See the video below for more details. This allows the archive videos to be easily navigated using person-search, thus opening them up for use by the general public.


Archive videos easily navigated using person-search.
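The name filtering itself needs nothing exotic; a sketch of prefix filtering over a sorted name list (illustrative, not our production index) is:

```python
from bisect import bisect_left, bisect_right

def build_index(names):
    """Return a sorted, lower-cased copy of the name list for prefix lookups."""
    return sorted(n.lower() for n in names)

def prefix_search(index, prefix):
    """Return every name in the sorted index that starts with `prefix`."""
    prefix = prefix.lower()
    lo = bisect_left(index, prefix)
    # '\uffff' sorts after any character that can follow the prefix
    hi = bisect_right(index, prefix + '\uffff')
    return index[lo:hi]

index = build_index(["Jeremy Corbyn", "Hugh Grant", "Shirley Williams"])
print(prefix_search(index, "jer"))   # ['jeremy corbyn']
```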

For examples of more of our Computer Vision research and open-source software, visit the Visual Geometry Group website.

This work was supported by the EPSRC Programme Grant Seebibyte EP/M013774/1

[1] Qiong Cao, Li Shen, Weidi Xie, Omkar M. Parkhi, and Andrew Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In Proc. International Conference on Automatic Face & Gesture Recognition, 2018.

[2] Joon Son Chung, Arsha Nagrani and Andrew Zisserman. VoxCeleb2: Deep Speaker Recognition. INTERSPEECH, 2018

BL Labs Public Awards 2020

Inspired by this work that uses the British Library's digital archived news footage? Have you done something innovative using the British Library's digital collections and data? Why not consider entering your work for a BL Labs Public Award 2020 and win fame, glory and even a bit of money?

This year's Public and Staff Awards 2020 are open for submission; the deadline for entry for both is Monday 30 November 2020.

Whilst we welcome projects on any use of our digital collections and data (especially in the research, artistic, educational and community categories), we are particularly interested in entries to our public awards that focus on anti-racist work, on the pandemic, or on computational methods such as Jupyter Notebooks.

25 September 2020

Making Data Into Sound

This is a guest post by Anne Courtney, Gulf History Cataloguer with the Qatar Digital Library, https://www.qdl.qa/en 

Sonification

Over the summer, I’ve been investigating the sonification of data. On the Qatar Project (QDL), we generate a large amount of data, and I wanted to experiment with different methods of representing it. Sonification was a new technique for me, which I learnt about through this article: https://programminghistorian.org/en/lessons/sonification.

 

What is sonification?

Sonification is a method of representing data in an aural format rather than a visual format, such as a graph. It is particularly useful for showing changes in data over time. Different trends are highlighted depending on the choices made during the process, in the same way as they would be when drawing a graph.

 

How does it work?

First, all the data must be put in the right format:

An example of data in Excel showing listed longitude points where the Palsgrave anchored
Figure 1: Excel data of longitude points where the Palsgrave anchored

Then, the data is used to generate a MIDI file. The Programming Historian provides an example Python script for this, and by changing parts of it, it is possible to change the tempo, note length, scale, and other features.

Python script ready to output a midi file of occurrences of Anjouan over time
Figure 2: Python script ready to output a midi file of occurrences of Anjouan over time
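For anyone who wants to try this themselves, the heart of such a script is only a few lines. A minimal sketch using the MIDITime library (the one used in the Programming Historian lesson) might look like this, with made-up note data and an example filename:

```python
from miditime.miditime import MIDITime

# 120 bpm, written out as anjouan.mid (filename is just an example)
mymidi = MIDITime(120, 'anjouan.mid')

# Each note is [beat, pitch, velocity (attack), duration in beats]
notes = [
    [0,    60, 100, 2],
    [12,   64, 100, 1],
    [12.5, 67, 100, 1],
    [30,   60, 100, 2],
]

mymidi.add_track(notes)
mymidi.save_midi()
```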

Finally, to overlay the different MIDI files, edit them, and change the instruments, I used MuseScore, freely downloadable music notation software. Other alternatives include LMMS and GarageBand:

A music score with name labels of where the Discovery, Palsgrave, and Mary anchored on their journeys, showing different pitches and musical notations.
Figure 3: The score of the voyages of the Discovery, Palsgrave, and Mary, labelled to show the different places where they anchored.

 

The sound of authorities

Each item which the Qatar project catalogues has authority terms linked to it, which list the main subjects and places connected to the item. As each item is dated, it is possible to trace trends in subjects and places over time by assigning the dates of the items to the authority terms. Each authority term ends up with a list of dates when it was mentioned. By assigning different instruments to the different authorities, it is possible to hear how they are connected to each other.

This sound file contains the sounds of places connected with the trade in enslaved people, and how they intersect with the authority term ‘slave trade’. The file begins in 1700 and finishes in 1900. One of the advantages of sonification is that the silence is as eloquent as the data. The authority terms are mentioned more at the end of the time period than the start, and so the piece becomes noisier as the British increasingly concern themselves with these areas. The pitch of the instruments is determined, in this instance, by the months of the records in which they are mentioned.
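A sketch of the kind of mapping involved is below; the numbers are illustrative rather than the exact values used for the piece. Each dated mention of an authority term becomes a note whose position in the piece is set by the record's date and whose pitch is set by its month:

```python
from datetime import date

START, END = date(1700, 1, 1), date(1900, 1, 1)   # span of the piece
TOTAL_BEATS = 200                                  # roughly one beat per year
BASE_PITCH = 48                                    # C3; the month is added on top

def record_to_note(record_date, velocity=100, duration=1):
    """Turn one dated mention of an authority term into a MIDITime-style note."""
    position = (record_date - START).days / (END - START).days
    beat = round(position * TOTAL_BEATS, 2)
    pitch = BASE_PITCH + record_date.month     # later months sound higher
    return [beat, pitch, velocity, duration]

print(record_to_note(date(1856, 7, 14)))   # -> [156.53, 55, 100, 1]
```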

Authorities

The authority terms are represented by these instruments:

Anjouan: piccolo

Madagascar: cello

Zanzibar: horn

Mauritius: piano

Slave Trade: tubular bell

 

Listening for ships

Ships

This piece follows the journeys of three ships from March 1633 to January 1637. In this example, the pitch is important because it represents longitude; the further east the ships travel, the higher the pitch. The Discovery and the Palsgrave mostly travelled together from Gravesend to India, and they both made frequent trips between the Gulf and India. The Mary set out from England in April 1636 to begin her own journey to India. The notes represent the time the ships spent in harbour, and the silence is the time spent at sea. The Discovery is represented by the flute, the Palsgrave by the violin, and the Mary by the horn.
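As an illustration of that longitude-to-pitch mapping (with an assumed longitude range and pitch range, not the exact values used for the piece):

```python
# Assumed longitude range of the voyages (roughly Gravesend to western India)
LON_MIN, LON_MAX = 0.0, 80.0
PITCH_MIN, PITCH_MAX = 48, 84      # three octaves, C3 to C6

def longitude_to_pitch(lon):
    """Scale a longitude to a MIDI pitch: the further east, the higher the note."""
    fraction = (lon - LON_MIN) / (LON_MAX - LON_MIN)
    return round(PITCH_MIN + fraction * (PITCH_MAX - PITCH_MIN))

print(longitude_to_pitch(0.4))    # near the start of the voyage: a low note
print(longitude_to_pitch(56.0))   # in the Gulf: noticeably higher
print(longitude_to_pitch(72.8))   # on the west coast of India: higher still
```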

14 September 2020

Digital geographical narratives with Knight Lab’s StoryMap

Visualising the journey of a manuscript’s creation

Working for the Qatar Digital Library (QDL), I recently catalogued British Library oriental manuscript 2361, a musical compendium copied in Mughal India during the reign of Aurangzeb (1618-1707; ruled from 1658). The QDL is a British Library-Qatar Foundation collaborative project to digitise and share Gulf-related archival records, maps and audio recordings as well as Arabic scientific manuscripts.

Portrait of Aurangzeb on a horse
Figure 1: Equestrian portrait of Aurangzeb. Mughal, c. 1660-70. British Library, Johnson Album, 3.4. Public domain.

The colophons to Or. 2361's fourteen texts contain an unusually large – but jumbled-up – quantity of information about the places and dates at which it was copied and checked, revealing that it was largely created during a journey taken by the imperial court in 1663.

Example of handwritten bibliographic information: Colophon to the copy of Kitāb al-madkhal fī al-mūsīqī by al-Fārābī
Figure 2: Colophon to the copy of Kitāb al-madkhal fī al-mūsīqī by al-Fārābī, transcribed in Delhi, 3 Jumādá I, 1073 hijrī/14 December 1662 CE, and checked in Lahore, 22 Rajab 1073/2 March 1663. Or. 2361, f. 240r.

Seeking to make sense of the mass of bibliographic information and unpick the narrative of the manuscript’s creation, I recorded all this data in a spreadsheet. This helped to clarify some patterns, but wasn’t fun to look at! To accompany an Asian and African Studies blog post, I wanted to find an interactive digital tool to develop the visual and spatial aspects of the story and convey the landscapes and distances experienced by the manuscript’s scribes and patron during its mobile production.

Screen shot of a spreadsheet of copy data for Or. 2361 showing information such as dates, locations, scribes etc.
Figure 3: Dull but useful spreadsheet of copy data for Or. 2361.

Many fascinating digital tools can present large datasets, including map co-ordinates. However, I needed to retell a linear, progressive narrative with fewer data points. Inspired by a QNF-BL colleague’s work on Geoffrey Prior’s trip to Muscat, I settled on StoryMap, one of an expanding suite of open-source reporting, data management, research, and storytelling tools developed by Knight Lab at Northwestern University, USA.

 

StoryMap: Easy but fiddly

Requiring no coding ability, the back-end of this free, easy-to-use tool resembles PowerPoint. The user creates a series of slides to which text, images, captions and copyright information can be added. Links to further online media, such as the millions of images published on the QDL, can easily be added.

Screen shot of someone editing in StoryMap
Figure 4: Back-end view of StoryMap's authoring tool.

The basic incarnation of StoryMap is accessed via an author interface which is intuitive and clear, but has its quirks. Slide layouts can’t be varied, and image manipulation must be completed pre-upload, which can get fiddly. Text was faint unless entirely in bold, especially against a backdrop image. A bug randomly rendered bits of uploaded text as hyperlinks, whereas intentional hyperlinks are not obvious.

 

The mapping function

StoryMap’s most interesting feature is an interactive map that uses OpenStreetMap data. Locations are inputted as co-ordinates, or manually by searching for a place-name or dropping a pin. This geographical data links together to produce an overview map summarised on the opening slide, with subsequent views zooming to successive locations in the journey.

Screen shot showing a preview of StoryMap with location points dropped on a world map
Figure 5: StoryMap summary preview showing all location points plotted.

I had to add location data manually as the co-ordinates input function didn’t work. Only one of the various map styles suited the historical subject-matter; however its modern street layout felt contradictory. The ‘ideal’ map – structured with global co-ordinates but correct for a specific historical moment – probably doesn’t exist (one for the next project?).

Screen shot of a point dropped on a local map, showing modern street layout
Figure 6: StoryMap's modern street layout implies New Delhi existed in 1663...

With clearly signposted advanced guidance, a support forum, and a link to a GitHub repository, more technically minded users could take StoryMap to the next level, not least by importing custom maps via Mapbox. Alternative platforms such as Esri’s Classic Story Maps can of course also be explored.

However, for many users, Knight Lab StoryMap’s appeal will lie in its ease of use and accessibility; it produces polished, engaging outputs quickly with a bare minimum of technical input and is easy to embed in web-text or social media. Thanks to Knight Lab for producing this free tool!

See the finished StoryMap, A Mughal musical miscellany: The journey of Or. 2361.

 

This is a guest post by Jenny Norton-Wright, Arabic Scientific Manuscripts Curator from the British Library Qatar Foundation Partnership. You can follow the British Library Qatar Foundation Partnership on Twitter at @BLQatar.

11 September 2020

BL Labs Public Awards 2020: enter before NOON GMT Monday 30 November 2020! REMINDER

The sixth BL Labs Public Awards 2020 formally recognises outstanding and innovative work that has been carried out using the British Library’s data and / or digital collections by researchers, artists, entrepreneurs, educators, students and the general public.

The closing date for entering the Public Awards is NOON GMT on Monday 30 November 2020 and you can submit your entry any time up to then.

Please help us spread the word! We want to encourage anyone interested to submit over the next few months; who knows, you could even win fame and glory (priceless!). We really hope to have another year of fantastic projects, inspired by our digital collections and data, to showcase at our annual online awards symposium on 15 December 2020 (which is open for registration too)!

This year, BL Labs is commending work in four key areas that have used or been inspired by our digital collections and data:

  • Research - A project or activity that shows the development of new knowledge, research methods, or tools.
  • Artistic - An artistic or creative endeavour that inspires, stimulates, amazes and provokes.
  • Educational - Quality learning experiences created for learners of any age and ability that use the Library's digital content.
  • Community - Work that has been created by an individual or group in a community.

What kind of projects are we looking for this year?

Whilst we are really happy for you to submit work on any subject that uses our digital collections, in this significant year we are particularly interested in entries that focus on anti-racist work or on projects about lockdown and the global pandemic. We are also curious and keen to have submissions that have used Jupyter Notebooks to carry out computational work on our digital collections and data.

After the submission deadline has passed, entries will be shortlisted and selected entrants will be notified via email by midnight on Friday 4th December 2020. 

A prize of £150 in British Library online vouchers will be awarded to the winner and £50 in the same format to the runner-up in each Awards category at the Symposium. Of course, if you enter, it will at least be a chance to showcase your work to a wide audience, and in the past this has often resulted in major collaborations.

The talent of the BL Labs Awards winners and runners-up over the last five years has led to a remarkable and varied collection of innovative projects, described in our 'Digital Projects Archive'. In 2019, the Awards commended work in four main categories – Research, Artistic, Community and Educational:

BL Labs Award Winners for 2019
(Top-Left) Full-Text search of Early Music Prints Online (F-TEMPO) - Research, (Top-Right) Emerging Formats: Discovering and Collecting Contemporary British Interactive Fiction - Artistic
(Bottom-Left) John Faucit Saville and the theatres of the East Midlands Circuit - Community commendation
(Bottom-Right) The Other Voice (Learning and Teaching)

For further detailed information, please visit BL Labs Public Awards 2020, or contact us at labs@bl.uk if you have a specific query.

Posted by Mahendra Mahey, Manager of British Library Labs.

07 September 2020

When is a persistent identifier not persistent? Or an identifier?

This guest post is by Jez Cope, Data Services Lead with Research Services at the British Library. He is on Twitter @jezcope.

Ever wondered what that bar code on the back of every book is? It’s an ISBN: an International Standard Book Number. Every modern book published has an ISBN, which uniquely identifies that book, and anyone publishing a book can get an ISBN for it, whether they are an individual or a huge publishing house. It’s a little more complex than that in practice, but generally speaking it’s 1 book, 1 ISBN. Right? Right.

Except…

If you search an online catalogue, such as WorldCat or The British Library for the ISBN 9780393073775 (or the 10-digit equivalent, 0393073777) you’ll find results appear for two completely different books:

  1. Waal FD. The Bonobo and the Atheist: In Search of Humanism Among the Primates. New York: W. W. Norton & Co.; 2013. 304 p. http://www.worldcat.org/oclc/1167414372
  2. Lodge HC. The Storm Has Many Eyes; a Personal Narrative. 1st edition. New York: New York Norton; 1973. http://www.worldcat.org/oclc/989188234

A screen grab of the main catalogue showing a search for ISBN 0393073777 with the above two results

In fact, things are so confused that the cover of one book gets pulled in for the other as well. Investigate further and you’ll see that it’s not a glitch: both books have been assigned the same ISBN. Others have found the same:

“However, if the books do not match, it’s usually one of two issues. First, if it is the same book but with a different cover, then it is likely the ISBN was reused for a later/earlier reprinting. … In the other case of duplicate ISBNs, it may be that an ISBN was reused on a completely different book. This shouldn’t happen because ISBNs are supposed to be unique, but exceptions have been found.” — GoodReads Librarian Manual: ISBN-10, ISBN-13 and ASINS

While most publishers stick to the rules about never reusing an ISBN, it’s apparently common knowledge in the book trade that ISBNs from old books get reused for newer books, sometimes accidentally (due to a typo), sometimes intentionally (to save money), and that has some tricky consequences.

I recently attended a webinar entitled “Identifiers in Heritage Collections - how embedded are they?” from the Persistent Identifiers as IRO Infrastructure (“HeritagePIDs”) project, part of AHRC’s Towards a National Collection programme. As quite often happens, the question was raised: what Persistent Identifier (PID) should we use for books and why can’t we just use ISBNs? Rod Page, who gave the demo that prompted this discussion, also wrote a short follow-up blog post about what makes PIDs work (or not) which is worth a look before you read the rest of this.

These are really valid questions and worth considering in more detail, and to do that we need to understand what makes a PID special. We call them persistent, and indeed we expect some sort of guarantee that a PID remains valid for the long term, so that we can use it as a link or placeholder for the referent without worrying that the link will get broken. But we also expect PIDs to be actionable: a PID can be made into a valid URL by following some rules, so that we can directly obtain the object referenced, or at least some information about it.

Actionability implies two further properties: an actionable identifier must be

  1. Unique: guaranteed to have only one identifier for a given object (of a given type); and
  2. Unambiguous: guaranteed that a single identifier refers to only one object

Where does this leave us with ISBNs?

Well, first up, they’re not actionable: given an ISBN, there’s no canonical way to obtain information about the book referenced, although in practice there are a number of databases that can help. There is, in fact, an actionable ISBN standard: ISBN-A permits converting an ISBN into a DOI with all the benefits of the underlying DOI and Handle infrastructure. Sadly, creation of an ISBN-A isn’t automatic and publishers have to explicitly create the ISBN-A DOI in addition to the already-created ISBN; most don’t.

More than that, though, it’s hard to make them actionable since ISBNs fail on both uniqueness and unambiguity. Firstly, as seen in the example I gave above, ISBNs do get recycled. They’re not supposed to be:

“Once assigned to a monographic publication, an ISBN can never be reused to identify another monographic publication, even if the original ISBN is found to have been assigned in error.” — International ISBN Agency. ISBN Users’ Manual [Internet]. Seventh Edition. London, UK: International ISBN Agency; 2017 [cited 2020 Jul 23]. Available from: https://www.isbn-international.org/content/isbn-users-manual

Yet they are, so we can’t rely on their precision.[1]

Secondly, and perhaps more problematic in day-to-day use, a given book may have multiple ISBNs. To an extent this is reasonable: different editions of the same book may have different content, or at the very least different page numbering, so a PID should be able to distinguish these for accurate citation. Unfortunately the same edition of the same book will frequently have multiple ISBNs; in particular each different format (hardback, paperback, large print, ePub, MOBI, PDF, …) is expected to have a distinct ISBN. Even if all that changes is the publisher, a new ISBN is still created:

“We recently encountered a case where a publisher had licensed a book to another publisher for a different geographical market. Both books used the same ISBN. If the publisher of the book changes (even if nothing else about the book has changed), the ISBN must also change.” — Everything you wanted to know about the ISBN but were too afraid to ask

Again, this is reasonable since the ISBN is primarily intended for stock-keeping by booksellers[2], and for them the difference between a hardback and a paperback is important because they differ in price if nothing else. This has bitten more than one librarian when trying to merge data from two different sources (such as usage and pricing) using the ISBN as the “obvious” merge key. It makes bibliometrics harder too, since you can’t easily pull out a list of all citations of a given edition in the literature just from a single ISBN.

So where does this leave us?

I’m not really sure yet. ISBNs as they are currently specified and used by the book industry aren’t really fit for purpose as a PID. But they’re there and they sort-of work and establishing a more robust PID for books would need commitment and co-operation from authors, publishers and libraries. That’s not impossible: a lot of work has been done recently to make the ISSN (International Standard Serial Number, for journals) more actionable.

But perhaps there are other options. Where publishers, booksellers and libraries are primarily interested in IDs for stock management, authors, researchers and scholarly communications librarians are more interested in the scholarly record as a whole and tracking the flow of ideas (and credit for those) which is where PIDs come into their own. Is there an argument for a coalition of these groups to establish a parallel identifier system for citation & credit that’s truly persistent? It wouldn’t be the first time: ISNIs (International Standard Name Identifiers) and ORCIDs (Open Researcher and Contributor IDs) both identify people, but for different purposes in different roles and with robust metadata linking the two where possible.

I’m not sure where I’m going with this train of thought so I’ll leave it there for now, but I’m sure I’ll be back. The more I dig into this the more there is to find, including the mysterious, long-forgotten and no-longer accessible Book Item & Component Identifier proposal. In the meantime, if you want a persistent identifier and aren’t sure which one you need these Guides to Choosing a Persistent Identifier from Project FREYA should get you started.


  1. Actually, as my colleague pointed out, even DOIs potentially have this problem, although I feel they can mitigate it better with metadata that allows rich expression of relationships between DOIs.  ↩︎

  2. In fact, the newer ISBN-13 standard is simply an ISBN-10 encoded as an “International Article Number”, the standard barcode format for almost all retail products, by sticking the “Bookland” country code of 978 on the front and recalculating the check digit. ↩︎
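To make footnote 2 concrete, here is a small sketch of that ISBN-10 to ISBN-13 conversion; it reproduces the pair of numbers quoted earlier in this post:

```python
def isbn10_to_isbn13(isbn10):
    """Prefix 978 and recalculate the check digit, as footnote 2 describes."""
    digits = isbn10.replace('-', '').replace(' ', '')
    core = '978' + digits[:9]                  # drop the old ISBN-10 check digit
    weights = [1, 3] * 6                       # ISBN-13 weights alternate 1 and 3
    total = sum(int(d) * w for d, w in zip(core, weights))
    return core + str((10 - total % 10) % 10)

print(isbn10_to_isbn13('0393073777'))   # -> 9780393073775, the ISBN discussed above
```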

04 September 2020

British Library Joins Share-VDE Linked Data Community

This blog post is by Alan Danskin, Collection Metadata Standards Manager, British Library. metadata@bl.uk

What is Share-VDE and why has the British Library joined the Share-VDE Community?

Share-VDE is a library-driven initiative bringing library catalogues together in a shared Virtual Discovery Environment. It uses linked data technology to create connections between bibliographic information contributed by different institutions.

Example SVDE page showing Tim Berners-Lee linked info to publications, wikipedia, and other external sites
Figure 1: SVDE page for Sir Tim Berners-Lee

For example, searching for Sir Tim Berners-Lee retrieves metadata contributed by different members, including links to his publications. The search also returns links to external sources of information, including Wikipedia.

The British Library will be the first institution to contribute its national bibliography to Share-VDE and we also plan to contribute our catalogue data. By collaborating with the Share-VDE community we will extend access to information about our collections and services and enable information to be reused.

The Library also contributes to Share-VDE by participating in community groups working to develop the metadata model and Share-VDE functionality. This provides us with a practical approach for bridging differences between the IFLA Library Reference Model (LRM) and the Bibframe initiative, led by the Library of Congress.

Share-VDE is promoted by the international bibliographic agency Casalini Libri and @Cult, a solutions developer working in the cultural heritage sector.

Andrew MacEwan, Head of Metadata at the British Library, explained that, “Membership of the Share-VDE community is an exciting opportunity to enrich the Library’s metadata and open it up for re-use by other institutions in a linked data environment.”

Tiziana Possemato, Chief Information Officer at Casalini Libri and Director of @Cult, said "We are delighted to collaborate with the British Library and extremely excited about unlocking the wealth of data in its collections, both to further enrich the Virtual Discovery Environment and to make the Library's resources even more accessible to users."

For further information about:

  • SHARE-VDE
  • Linked Data
  • Linked Open Data

The British Library is the national library of the United Kingdom and one of the world's greatest research libraries. It provides world class information services to the academic, business, research and scientific communities and offers unparalleled access to the world's largest and most comprehensive research collection. The Library's collection has developed over 250 years and exceeds 150 million separate items representing every age of written civilisation and includes books, journals, manuscripts, maps, stamps, music, patents, photographs, newspapers and sound recordings in all written and spoken languages. Up to 10 million people visit the British Library website - www.bl.uk - every year where they can view up to 4 million digitised collection items and over 40 million pages.

Casalini Libri is a bibliographic agency producing authority and bibliographic data; a library vendor, supplying books and journals, and offering a variety of collection development and technical services; and an e-content provider, working both for publishers and libraries.

@Cult is a software development company, specializing in data conversion for LD; and provider of Integrated Library System and Discovery tools, delivering effective and innovative technological solutions to improve information management and knowledge sharing.
