THE BRITISH LIBRARY

Digital scholarship blog

105 posts categorized "Tools"

23 October 2020

BL Labs Public Award Runner Up (Research) 2019 - Automated Labelling of People in Video Archives


People 'automatically' identified in digital TV news related programme clips.

Guest blog post by Andrew Brown (PhD researcher),  Ernesto Coto (Research Software Engineer) and Andrew Zisserman (Professor) of the Visual Geometry Group, Department of Engineering Science, University of Oxford, and BL Labs Public Award Runner-up for Research, 2019. Posted on their behalf by Mahendra Mahey, Manager of BL Labs.

In this work, we automatically identify and label (tag) people in large video archives without the need for any manual annotation or supervision. The project was carried out with the British Library on a sample of 106 videos from their “Television and radio news” archive; a large collection of news programs from the last 10 years. This archive serves as an important and fascinating resource for researchers and the general public alike. However, the sheer scale of the data, coupled with a lack of relevant metadata, makes indexing, analysing and navigating this content an increasingly difficult task. Relying on human annotation is no longer feasible, and without an effective way to navigate these videos, this bank of knowledge is largely inaccessible.

As users, we are typically interested in human-centric queries such as:

  • “When did Jeremy Corbyn first appear in a Newsnight episode?” or
  • “Show me all of the times when Hugh Grant and Shirley Williams appeared together.”

Currently this is nigh on impossible without trawling through hundreds of hours of content. 

We posed the following research question:

Is it possible to enable automatic person-search capabilities such as this in the archive, without the need for any manual supervision or labelling?

The answer is “yes”, and the method is described next.

Video Pre-Processing

The basic unit which enables person labelling in videos is the face-track; a group of consecutive face detections within a shot that correspond to the same identity. Face-tracks are extracted from all of the videos in the archive. The task of labelling the people in the videos is then to assign a label to each one of these extracted face-tracks. The video below gives an example of two face-tracks found in a scene.


Two face-tracks found in British Library digital news footage by Visual Geometry Group - University of Oxford.

Techniques at Our Disposal

The base technology used for this work is a state-of-the-art convolutional neural network (CNN), trained for facial recognition [1]. The CNN extracts feature-vectors (a list of numbers) from face images, which indicate the identity of the depicted person. To label a face-track, the distance between the feature-vector for the face-track, and the feature-vector for a face-image with known identity is computed. The face-track is labelled as depicting that identity if the distance is smaller than a certain threshold (i.e. they match). We also use a speaker recognition CNN [2] that works in the same way, except it labels speech segments from unknown identities using speech segments from known identities within the video.
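To make the matching step concrete, here is a minimal sketch (not the authors' actual code) of how a face-track might be labelled against a bank of known identities: each identity is represented by an L2-normalised feature-vector, and the track is assigned the closest identity only if the distance falls below a threshold. The 512-dimensional random vectors, names and threshold value below are purely illustrative.

    import numpy as np

    def label_face_track(track_vec, bank_vecs, bank_names, threshold=1.0):
        """Return the name of the closest known identity, or None if no match is close enough."""
        dists = np.linalg.norm(bank_vecs - track_vec, axis=1)  # distance to every known identity
        best = int(np.argmin(dists))
        return bank_names[best] if dists[best] < threshold else None

    # Toy example: random vectors stand in for CNN feature-vectors
    rng = np.random.default_rng(0)
    bank = rng.normal(size=(3, 512))
    bank /= np.linalg.norm(bank, axis=1, keepdims=True)
    names = ["Jeremy Corbyn", "Hugh Grant", "Shirley Williams"]
    query = bank[1] + 0.01 * rng.normal(size=512)   # a "new" face-track close to the second identity
    query /= np.linalg.norm(query)
    print(label_face_track(query, bank, names))     # -> Hugh Grant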

Labelling the Face-Tracks

Our method for automatically labelling the people in the video archive is divided into three main stages:

(1) Our first labelling method uses what we term a “celebrity feature-vector bank”, which consists of names of people that are likely to appear in the videos, and their corresponding feature-vectors. The names are automatically sourced from IMDB cast lists for the programmes (the titles of the programmes are freely available in the meta-data). Face-images for each of the names are automatically downloaded from image-search engines. Incorrect face-images and people with no images of themselves on search engines are automatically removed at this stage. We compute the feature-vectors for each identity and add them to the bank alongside the names. The face-tracks from the video archives are then simply labelled by finding matches in the feature-vector bank.

Face-tracks from the video archives are labelled by finding matches in the feature-vector bank.

(2) Our second labelling method uses the idea that if a name is spoken, or found displayed in a scene, then that person is likely to be found within that scene. The task is then to automatically determine whether there is a correspondence or not. Text is automatically read from the news videos using Optical Character Recognition (OCR), and speech is automatically transcribed using Automatic Speech Recognition (ASR). Names are identified and they are searched for on image search engines. The top ranked images are downloaded and the feature-vectors are computed from the faces. If any are close enough to the feature-vectors from the face-tracks present in the scene, then that face-track is labelled with that name. The video below details this process for a written name.


Using text or spoken word and face recognition to identify a person in a news clip.
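The name-spotting step itself cannot be reproduced exactly here, but as a rough illustration (using the off-the-shelf spaCy library, which the project does not necessarily use), candidate person names can be pulled out of OCR or ASR text before being sent to an image search engine:

    import spacy  # requires: pip install spacy && python -m spacy download en_core_web_sm

    nlp = spacy.load("en_core_web_sm")

    def candidate_names(text):
        """Return the distinct PERSON entities found in a piece of OCR or ASR output."""
        doc = nlp(text)
        return sorted({ent.text for ent in doc.ents if ent.label_ == "PERSON"})

    ocr_text = "Newsnight: Jeremy Corbyn responds to questions from Emily Maitlis"
    print(candidate_names(ocr_text))  # e.g. ['Emily Maitlis', 'Jeremy Corbyn']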

(3) For our third labelling method, we use speaker recognition to identify any non-labelled speaking people. We use the labels from the previous two stages to automatically acquire labelled speech segments from the corresponding labelled face-tracks. For each remaining non-labelled speaking person, we extract the speech feature-vector and compute the distance of it to the feature-vectors of the labelled speech segments. If one is close enough, then the non-labelled speech segment and corresponding face-track is assigned that name. This process manages to label speaking face-tracks with visually challenging faces, e.g. deep in shadow or at an extremely non-frontal pose.

Indexing and Searching Identities

The results of our work can be browsed via a web search engine of our own design. A search bar allows users to specify the person or group of people that they would like to search for. People’s names are efficiently indexed so that the complete list of names can be filtered as the user types in the search bar. The search results are returned instantly with their associated metadata (programme name, date and time) and can be displayed in multiple ways. The video associated with each search result can be played, visualising the location and the name of all identified people in the video. See the video below for more details. This allows the archive videos to be easily navigated using person-search, thus opening them up for use by the general public.


Archive videos easily navigated using person-search.

For examples of more of our Computer Vision research and open-source software, visit the Visual Geometry Group website.

This work was supported by the EPSRC Programme Grant Seebibyte EP/M013774/1

[1] Qiong Cao, Li Shen, Weidi Xie, Omkar M. Parkhi, and Andrew Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In Proc. International Conference on Automatic Face & Gesture Recognition, 2018.

[2] Joon Son Chung, Arsha Nagrani and Andrew Zisserman. VoxCeleb2: Deep Speaker Recognition. In Proc. INTERSPEECH, 2018.

BL Labs Public Awards 2020

Inspired by this work that uses the British Library's digital archived news footage? Have you done something innovative using the British Library's digital collections and data? Why not consider entering your work for a BL Labs Public Award 2020 and win fame, glory and even a bit of money?

This year’s public and staff awards 2020 are open for submission; the deadline for entry for both is Monday 30 November 2020.

Whilst we welcome projects on any use of our digital collections and data (especially in research, artistic, educational and community categories), we are particularly interested in entries in our public awards that have focused on anti-racist work, about the pandemic or that are using computational methods such as the use of Jupyter Notebooks.

25 September 2020

Making Data Into Sound


This is a guest post by Anne Courtney, Gulf History Cataloguer with the Qatar Digital Library, https://www.qdl.qa/en 

Sonification

Over the summer, I’ve been investigating the sonification of data. On the Qatar Digital Library project (QDL), we generate a large amount of data, and I wanted to experiment with different methods of representing it. Sonification was a new technique for me, which I learnt about through this article: https://programminghistorian.org/en/lessons/sonification.

 

What is sonification?

Sonification is the representation of data in an aural format rather than a visual one, such as a graph. It is particularly useful for showing changes in data over time. Different trends are highlighted depending on the choices made during the process, in the same way as they would be when drawing a graph.

 

How does it work?

First, all the data must be put in the right format:

Figure 1: Excel data of longitude points where the Palsgrave anchored

Then, the data is used to generate a midi file. The Programming Historian provides an example Python script for this, and by changing parts of it, it is possible to change the tempo, note length, scale, and other features.

Figure 2: Python script ready to output a midi file of occurrences of Anjouan over time
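As a very rough sketch of the same idea (not the Programming Historian script itself; this one uses the midiutil package, and the data points and scaling are made up), each row of data becomes a note whose pitch is scaled from the value being sonified:

    from midiutil import MIDIFile  # pip install MIDIUtil

    # Hypothetical data: (month index, longitude in degrees east) for one ship
    readings = [(0, 0.4), (3, 18.5), (7, 56.3), (12, 72.8), (15, 58.0)]

    def to_pitch(longitude, low=48, high=84):
        """Scale a longitude of 0-90 degrees east onto a MIDI pitch range."""
        return int(low + (high - low) * (longitude / 90.0))

    midi = MIDIFile(1)  # a single track
    midi.addTempo(track=0, time=0, tempo=120)
    for month, longitude in readings:
        midi.addNote(track=0, channel=0, pitch=to_pitch(longitude),
                     time=month, duration=2, volume=100)

    with open("voyage.mid", "wb") as f:
        midi.writeFile(f)

The resulting .mid file can then be opened in notation software for layering and instrument changes, as described below.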

Finally, to overlay the different midi files, edit them, and change the instruments, I used MuseScore, freely downloadable music notation software. Other alternatives include LMMS and GarageBand:

Figure 3: The score of the voyages of the Discovery, Palsgrave, and Mary, labelled to show the different places where they anchored.

 

The sound of authorities

Each item which the Qatar project catalogues has authority terms linked to it, which list the main subjects and places connected to the item. As each item is dated, it is possible to trace trends in subjects and places over time by assigning the dates of the items to the authority terms. Each authority term ends up with a list of dates when it was mentioned. By assigning different instruments to the different authorities, it is possible to hear how they are connected to each other.

This sound file contains the sounds of places connected with the trade in enslaved people, and how they intersect with the authority term ‘slave trade’. The file begins in 1700 and finishes in 1900. One of the advantages of sonification is that the silence is as eloquent as the data. The authority terms are mentioned more at the end of the time period than the start, and so the piece becomes noisier as the British increasingly concern themselves with these areas. The pitch of the instruments is determined, in this instance, by the months of the records in which they are mentioned.

Authorities

The authority terms are represented by these instruments:

Anjouan: piccolo

Madagascar: cello

Zanzibar: horn

Mauritius: piano

Slave Trade: tubular bell

 

Listening for ships

Ships

This piece follows the journeys of three ships from March 1633 to January 1637. In this example, the pitch is important because it represents longitude; the further east the ships travel, the higher the pitch. The Discovery and the Palsgrave mostly travelled together from Gravesend to India, and they both made frequent trips between the Gulf and India. The Mary set out from England in April 1636 to begin her own journey to India. The notes represent the time the ships spent in harbour, and the silence is the time spent at sea. The Discovery is represented by the flute, the Palsgrave by the violin, and the Mary by the horn.

14 September 2020

Digital geographical narratives with Knight Lab’s StoryMap


Visualising the journey of a manuscript’s creation

Working for the Qatar Digital Library (QDL), I recently catalogued British Library oriental manuscript 2361, a musical compendium copied in Mughal India during the reign of Aurangzeb (1618-1707; ruled from 1658). The QDL is a British Library-Qatar Foundation collaborative project to digitise and share Gulf-related archival records, maps and audio recordings as well as Arabic scientific manuscripts.

Figure 1: Equestrian portrait of Aurangzeb. Mughal, c. 1660-70. British Library, Johnson Album, 3.4. Public domain.

The colophons to the fourteen texts in Or. 2361 contain an unusually large – but jumbled-up – quantity of information about the places and dates at which they were copied and checked, revealing that the manuscript was largely created during a journey taken by the imperial court in 1663.

Figure 2: Colophon to the copy of Kitāb al-madkhal fī al-mūsīqī by al-Fārābī, transcribed in Delhi, 3 Jumādá I, 1073 hijrī/14 December 1662 CE, and checked in Lahore, 22 Rajab 1073/2 March 1663. Or. 2361, f. 240r.

Seeking to make sense of the mass of bibliographic information and unpick the narrative of the manuscript’s creation, I recorded all this data in a spreadsheet. This helped to clarify some patterns, but wasn’t fun to look at! To accompany an Asian and African Studies blog post, I wanted to find an interactive digital tool to develop the visual and spatial aspects of the story and convey the landscapes and distances experienced by the manuscript’s scribes and patron during its mobile production.

Figure 3: Dull but useful spreadsheet of copy data for Or. 2361.

Many fascinating digital tools can present large datasets, including map co-ordinates. However, I needed to retell a linear, progressive narrative with fewer data points. Inspired by a QNF-BL colleague’s work on Geoffrey Prior’s trip to Muscat, I settled on StoryMap, one of an expanding suite of open-source reporting, data management, research, and storytelling tools developed by Knight Lab at Northwestern University, USA.

 

StoryMap: Easy but fiddly

Requiring no coding ability, the back-end of this free, easy-to-use tool resembles PowerPoint. The user creates a series of slides to which text, images, captions and copyright information can be added. Links to further online media, such as the millions of images published on the QDL, can easily be added.

Figure 4: Back-end view of StoryMap's authoring tool.

The basic incarnation of StoryMap is accessed via an author interface which is intuitive and clear, but has its quirks. Slide layouts can’t be varied, and image manipulation must be completed pre-upload, which can get fiddly. Text was faint unless entirely in bold, especially against a backdrop image. A bug randomly rendered bits of uploaded text as hyperlinks, whereas intentional hyperlinks were not obvious.

 

The mapping function

StoryMap’s most interesting feature is an interactive map that uses OpenStreetMap data. Locations are inputted as co-ordinates, or manually by searching for a place-name or dropping a pin. This geographical data links together to produce an overview map summarised on the opening slide, with subsequent views zooming to successive locations in the journey.

Figure 5: StoryMap summary preview showing all location points plotted.

I had to add location data manually as the co-ordinates input function didn’t work. Only one of the various map styles suited the historical subject-matter; however its modern street layout felt contradictory. The ‘ideal’ map – structured with global co-ordinates but correct for a specific historical moment – probably doesn’t exist (one for the next project?).

Figure 6: StoryMap's modern street layout implies New Delhi existed in 1663...

With clearly signposted advanced guidance, support forum, and a link to a GitHub repository, more technically-minded users could take StoryMap to the next level, not least in importing custom maps via Mapbox. Alternative platforms such as Esri’s Classic Story Maps can of course also be explored.

However, for many users, Knight Lab StoryMap’s appeal will lie in its ease of usage and accessibility; it produces polished, engaging outputs quickly with a bare minimum of technical input and is easy to embed in web-text or social media. Thanks to Knight Lab for producing this free tool!

See the finished StoryMap, A Mughal musical miscellany: The journey of Or. 2361.

 

This is a guest post by Jenny Norton-Wright, Arabic Scientific Manuscripts Curator from the British Library Qatar Foundation Partnership. You can follow the British Library Qatar Foundation Partnership on Twitter at @BLQatar.

11 September 2020

BL Labs Public Awards 2020: enter before 0700 GMT Monday 30 November 2020!


The sixth BL Labs Public Awards 2020 formally recognises outstanding and innovative work that has been carried out using the British Library’s data and / or digital collections by researchers, artists, entrepreneurs, educators, students and the general public.

The closing date for entering the Public Awards is 0700 GMT on Monday 30 November 2020 and you can submit your entry any time up to then.

Please help us spread the word! We want to encourage anyone interested to submit over the next few months. Who knows, you could even win fame and glory; priceless! We really hope to have another year of fantastic projects, inspired by our digital collections and data, to showcase at our annual online awards symposium on 15 December 2020 (which is open for registration too)!

This year, BL Labs is commending work in four key areas that have used or been inspired by our digital collections and data:

  • Research - A project or activity that shows the development of new knowledge, research methods, or tools.
  • Artistic - An artistic or creative endeavour that inspires, stimulates, amazes and provokes.
  • Educational - Quality learning experiences created for learners of any age and ability that use the Library's digital content.
  • Community - Work that has been created by an individual or group in a community.

What kind of projects are we looking for this year?

Whilst we are really happy for you to submit your work on any subject that uses our digital collections, in this significant year, we are particularly interested in entries that may have a focus on anti-racist work or projects about lock down / global pandemic. We are also curious and keen to have submissions that have used Jupyter Notebooks to carry out computational work on our digital collections and data.

After the submission deadline has passed, entries will be shortlisted and selected entrants will be notified via email by midnight on Friday 4th December 2020. 

A prize of £150 in British Library online vouchers will be awarded to the winner and £50 in the same format to the runner up in each Awards category at the Symposium. Of course, if you enter, it will at least be a chance to showcase your work to a wide audience, and in the past this has often resulted in major collaborations.

The talent of the BL Labs Awards winners and runners-up over the last five years has led to the production of a remarkable and varied collection of innovative projects described in our 'Digital Projects Archive'. In 2019, the Awards commended work in four main categories – Research, Artistic, Community and Educational:

BL Labs Award Winners for 2019
(Top-Left) Full-Text search of Early Music Prints Online (F-TEMPO) - Research, (Top-Right) Emerging Formats: Discovering and Collecting Contemporary British Interactive Fiction - Artistic
(Bottom-Left) John Faucit Saville and the theatres of the East Midlands Circuit - Community commendation
(Bottom-Right) The Other Voice (Learning and Teaching)

For further detailed information, please visit BL Labs Public Awards 2020, or contact us at labs@bl.uk if you have a specific query.

Posted by Mahendra Mahey, Manager of British Library Labs.

07 September 2020

When is a persistent identifier not persistent? Or an identifier?


Ever wondered what that bar code on the back of every book is? It’s an ISBN: an International Standard Book Number. Every modern book published has an ISBN, which uniquely identifies that book, and anyone publishing a book can get an ISBN for it whether an individual or a huge publishing house. It’s a little more complex than that in practice but generally speaking it’s 1 book, 1 ISBN. Right? Right.

Except…

If you search an online catalogue, such as WorldCat or The British Library for the ISBN 9780393073775 (or the 10-digit equivalent, 0393073777) you’ll find results appear for two completely different books:

  1. Waal FD. The Bonobo and the Atheist: In Search of Humanism Among the Primates. New York: W. W. Norton & Co.; 2013. 304 p. http://www.worldcat.org/oclc/1167414372
  2. Lodge HC. The Storm Has Many Eyes; a Personal Narrative. 1st edition. New York: New York Norton; 1973. http://www.worldcat.org/oclc/989188234

A screen grab of the main catalogue showing a search for ISBN 0393073777 with the above two results

In fact, things are so confused that the cover of one book gets pulled in for the other as well. Investigate further and you’ll see that it’s not a glitch: both books have been assigned the same ISBN. Others have found the same:

“However, if the books do not match, it’s usually one of two issues. First, if it is the same book but with a different cover, then it is likely the ISBN was reused for a later/earlier reprinting. … In the other case of duplicate ISBNs, it may be that an ISBN was reused on a completely different book. This shouldn’t happen because ISBNs are supposed to be unique, but exceptions have been found.” — GoodReads Librarian Manual: ISBN-10, ISBN-13 and ASINS

While most publishers stick to the rules about never reusing an ISBN, it’s apparently common knowledge in the book trade that ISBNs from old books get reused for newer books, sometimes accidentally (due to a typo), sometimes intentionally (to save money), and that has some tricky consequences.

I recently attended a webinar entitled “Identifiers in Heritage Collections - how embedded are they?” from the Persistent Identifiers as IRO Infrastructure (“HeritagePIDs”) project, part of AHRC’s Towards a National Collection programme. As quite often happens, the question was raised: what Persistent Identifier (PID) should we use for books and why can’t we just use ISBNs? Rod Page, who gave the demo that prompted this discussion, also wrote a short follow-up blog post about what makes PIDs work (or not) which is worth a look before you read the rest of this.

These are really valid questions and worth considering in more detail, and to do that we need to understand what makes a PID special. We call them persistent, and indeed we expect some sort of guarantee that a PID remains valid for the long term, so that we can use it as a link or placeholder for the referent without worrying that the link will get broken. But we also expect PIDs to be actionable: an identifier can be turned into a valid URL by following some rules, so that we can directly obtain the object referenced, or at least some information about it.

Actionability implies two further properties: an actionable identifier must be

  1. Unique: guaranteed to have only one identifier for a given object (of a given type); and
  2. Unambiguous: guaranteed that a single identifier refers to only one object

Where does this leave us with ISBNs?

Well, first up, they’re not actionable: given an ISBN, there’s no canonical way to obtain information about the book referenced, although in practice there are a number of databases that can help. There is, in fact, an actionable ISBN standard: ISBN-A permits converting an ISBN into a DOI, with all the benefits of the underlying DOI and Handle infrastructure. Sadly, creation of an ISBN-A isn’t automatic and publishers have to explicitly create the ISBN-A DOI in addition to the already-created ISBN; most don’t.

More than that, though, it’s hard to make them actionable since ISBNs fail on both uniqueness and unambiguity. Firstly, as seen in the example I gave above, ISBNs do get recycled. They’re not supposed to be:

“Once assigned to a monographic publication, an ISBN can never be reused to identify another monographic publication, even if the original ISBN is found to have been assigned in error.” — International ISBN Agency. ISBN Users’ Manual [Internet]. Seventh Edition. London, UK: International ISBN Agency; 2017 [cited 2020 Jul 23]. Available from: https://www.isbn-international.org/content/isbn-users-manual

Yet they are, so we can’t rely on their precision.[1]

Secondly, and perhaps more problematic in day-to-day use, a given book may have multiple ISBNs. To an extent this is reasonable: different editions of the same book may have different content, or at the very least different page numbering, so a PID should be able to distinguish these for accurate citation. Unfortunately the same edition of the same book will frequently have multiple ISBNs; in particular each different format (hardback, paperback, large print, ePub, MOBI, PDF, …) is expected to have a distinct ISBN. Even if all that changes is the publisher, a new ISBN is still created:

“We recently encountered a case where a publisher had licensed a book to another publisher for a different geographical market. Both books used the same ISBN. If the publisher of the book changes (even if nothing else about the book has changed), the ISBN must also change.” — Everything you wanted to know about the ISBN but were too afraid to ask

Again, this is reasonable since the ISBN is primarily intended for stockkeeping by book sellers[2], and for them the difference between a hardback and paperback is important because they differ in price if nothing else. This has bitten more than one librarian when trying to merge data from two different sources (such as usage and pricing) using the ISBN as the “obvious” merge key. It makes bibliometrics harder too, since you can’t easily pull out a list of all citations of a given edition in the literature, just from a single ISBN.

So where does this leave us?

I’m not really sure yet. ISBNs as they are currently specified and used by the book industry aren’t really fit for purpose as a PID. But they’re there and they sort-of work and establishing a more robust PID for books would need commitment and co-operation from authors, publishers and libraries. That’s not impossible: a lot of work has been done recently to make the ISSN (International Standard Serial Number, for journals) more actionable.

But perhaps there are other options. Where publishers, booksellers and libraries are primarily interested in IDs for stock management, authors, researchers and scholarly communications librarians are more interested in the scholarly record as a whole and tracking the flow of ideas (and credit for those) which is where PIDs come into their own. Is there an argument for a coalition of these groups to establish a parallel identifier system for citation & credit that’s truly persistent? It wouldn’t be the first time: ISNIs (International Standard Name Identifiers) and ORCIDs (Open Researcher and Contributor IDs) both identify people, but for different purposes in different roles and with robust metadata linking the two where possible.

I’m not sure where I’m going with this train of thought so I’ll leave it there for now, but I’m sure I’ll be back. The more I dig into this the more there is to find, including the mysterious, long-forgotten and no-longer accessible Book Item & Component Identifier proposal. In the meantime, if you want a persistent identifier and aren’t sure which one you need these Guides to Choosing a Persistent Identifier from Project FREYA should get you started.


  1. Actually, as my colleague pointed out, even DOIs potentially have this problem, although I feel they can mitigate it better with metadata that allows rich expression of relationships between DOIs.  ↩︎

  2. In fact, the newer ISBN-13 standard is simply an ISBN-10 encoded as an “International Article Number”, the standard barcode format for almost all retail products, by sticking the “Bookland” country code of 978 on the front and recalculating the check digit. ↩︎
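For the curious, that conversion is easy to reproduce. Here is a minimal sketch (plain Python, no libraries), which also confirms that the 10- and 13-digit forms quoted at the top of this post are the same ISBN:

    def isbn10_to_isbn13(isbn10):
        """Convert an ISBN-10 to an ISBN-13 by prefixing 978 and recomputing the check digit."""
        digits = "978" + isbn10.replace("-", "")[:9]   # drop the old check digit
        total = sum((1 if i % 2 == 0 else 3) * int(d) for i, d in enumerate(digits))
        return digits + str((10 - total % 10) % 10)

    print(isbn10_to_isbn13("0393073777"))  # 9780393073775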

04 September 2020

British Library Joins Share-VDE Linked Data Community


This blog post is by Alan Danskin, Collection Metadata Standards Manager, British Library. metadata@bl.uk

What is Share-VDE and why has the British Library joined the Share-VDE Community?

Share-VDE is a library-driven initiative bringing library catalogues together in a shared Virtual Discovery Environment. It uses linked data technology to create connections between bibliographic information contributed by different institutions.

Figure 1: SVDE page for Sir Tim Berners-Lee

For example, searching for Sir Tim Berners-Lee retrieves metadata contributed by different members, including links to his publications. The search also returns links to external sources of information, including Wikipedia.

The British Library will be the first institution to contribute its national bibliography to Share-VDE and we also plan to contribute our catalogue data. By collaborating with the Share-VDE community we will extend access to information about our collections and services and enable information to be reused.

The Library also contributes to Share-VDE by participating in community groups working to develop the metadata model and Share-VDE functionality. This provides us with a practical approach for bridging differences between the IFLA Library Reference Model (LRM) and the BIBFRAME initiative, led by the Library of Congress.

Share-VDE is promoted by the international bibliographic agency Casalini Libri and @Cult, a solutions developer working in the cultural heritage sector.

Andrew MacEwan, Head of Metadata at the British Library, explained that, “Membership of the Share-VDE community is an exciting opportunity to enrich the Library’s metadata and open it up for re-use by other institutions in a linked data environment.”

Tiziana Possemato, Chief Information Officer at Casalini Libri and Director of @Cult, said "We are delighted to collaborate with the British Library and extremely excited about unlocking the wealth of data in its collections, both to further enrich the Virtual Discovery Environment and to make the Library's resources even more accessible to users."

For further information, see:

  • Share-VDE
  • Linked Data
  • Linked Open Data

The British Library is the national library of the United Kingdom and one of the world's greatest research libraries. It provides world class information services to the academic, business, research and scientific communities and offers unparalleled access to the world's largest and most comprehensive research collection. The Library's collection has developed over 250 years and exceeds 150 million separate items representing every age of written civilisation and includes books, journals, manuscripts, maps, stamps, music, patents, photographs, newspapers and sound recordings in all written and spoken languages. Up to 10 million people visit the British Library website - www.bl.uk - every year where they can view up to 4 million digitised collection items and over 40 million pages.

Casalini Libri is a bibliographic agency producing authority and bibliographic data; a library vendor, supplying books and journals, and offering a variety of collection development and technical services; and an e-content provider, working both for publishers and libraries.

@Cult is a software development company, specializing in data conversion for LD; and provider of Integrated Library System and Discovery tools, delivering effective and innovative technological solutions to improve information management and knowledge sharing.

14 July 2020

Legacies of Catalogue Descriptions and Curatorial Voice: Training Sessions


This guest post is by James Baker, Senior Lecturer in Digital History and Archives at the University of Sussex.

This month the team behind "Legacies of Catalogue Descriptions and Curatorial Voice: Opportunities for Digital Scholarship" ran two training sessions as part of our Arts and Humanities Research Council funded project. Each standalone session provided instruction in using the software tool AntConc and approaches from computational linguistics for the purposes of examining catalogue data. The objectives of the sessions were twofold: to test our in-development training materials, and to seek feedback from the community in order to better understand their needs and to develop our training offer.

Rather than host open public training, we decided to foster existing partnerships by inviting a small number of individuals drawn from attendees at events hosted as part of our previous Curatorial Voice project (funded by the British Academy). In total thirteen individuals from the UK and US took part across the two sessions, with representatives from libraries, archives, museums, and galleries.

Screenshots of the Carpentries-style lesson, 'Computational Analysis of Catalogue Data', about analysing catalogue data in AntConc


The training was delivered in the style of a Software Carpentry workshop, drawing on their wonderful lesson template, pedagogical principles, and rapid response to moving coding and data science instruction online in light of the Covid-19 crisis (see ‘Recommendations for Teaching Carpentries Workshops Online’ and ‘Tips for Teaching Online from The Carpentries Community’). In terms of content, we started with the basics: how to get data into AntConc, the layout of AntConc, and settings in AntConc. After that we worked through two substantial modules. The first focused on how to generate, interact with, and interpret a word list, and this was followed by a module on searching, adapting, and reading concordances. The tasks and content of both modules avoided generic software instruction and instead focused on the analysis of free text catalogue fields, with attendees asked to consider what they might infer about a catalogue from its use of tense, what a high volume of capitalised words might tell us about cataloguing style, and how adverb use might be a useful proxy for the presence of controlled vocabulary.

Tasks in the Searching Concordances section
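For anyone curious about what this kind of analysis looks like under the hood, here is a very rough Python sketch (not part of the lesson, which teaches AntConc's graphical interface) of building a word list and a simple keyword-in-context concordance from free-text catalogue descriptions:

    import re
    from collections import Counter

    descriptions = [
        "Satirical print depicting a politician riding a donkey.",
        "A satirical print. Depicts two gentlemen arguing violently.",
    ]

    # Word list: token frequencies across all descriptions
    tokens = [t.lower() for d in descriptions for t in re.findall(r"[A-Za-z']+", d)]
    print(Counter(tokens).most_common(5))

    def concordance(term, texts, width=25):
        """Print each occurrence of `term` with a little context either side."""
        for text in texts:
            for m in re.finditer(term, text, flags=re.IGNORECASE):
                s, e = m.start(), m.end()
                print(f"...{text[max(0, s - width):s]}[{m.group()}]{text[e:e + width]}...")

    concordance("satirical", descriptions)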

Running Carpentries-style training over Zoom was new to me, and was - frankly - very odd. During live coding I missed hearing the clack of keyboards as people followed along in response. I missed seeing the sticky notes go up as people completed the task at hand. During exercises I missed hearing the hubbub that accompanies pair programming. And more generally, without seeing the micro-gestures of concentration, relief, frustration, and joy on the faces of learners, I felt somehow isolated as an instructor from the process of learning.

But from the feedback we received the attendees appear to have been happy. It seems we got the pace right (we assumed teaching online would be slower than face-to-face, and it was). The attendees enjoyed using AntConc and were surprised, to quote one attendee, "to see just how quickly you could draw some conclusions". The breakout rooms we used for exercises were a hit. And importantly we have a clear steer on next steps: that we should pivot to a dataset that better reflects the diversity of catalogue data (for this exercise we used a catalogue of printed images that I know very well), that learners would benefit from having a list of suggested readings and resources on corpus linguistics, and that we might - to quote one attendee - provide "more examples up front of the kinds of finished research that has leveraged this style of analysis".

These comments and more will feed into the development of our training materials, which we hope to complete by the end of 2020 and which - in line with the open values of the project - is happening in public. In the meantime, the materials are there for the community to use, adapt and build on (more or less) as they wish. Should you take a look and have any thoughts on what we might change or include for the final version, we always appreciate an email or a note on our issue tracker.

"Legacies of Catalogue Descriptions and Curatorial Voice: Opportunities for Digital Scholarship" is a collaboration between the Sussex Humanities Lab, the British Library, and Yale University Library that is funded under the Arts and Humanities Research Council (UK) “UK-US Collaboration for Digital Scholarship in Cultural Institutions: Partnership Development Grants” scheme. Project Reference AH/T013036/1.

21 April 2020

Clean. Migrate. Validate. Enhance. Processing Archival Metadata with Open Refine


This blogpost is by Graham Jevon, Cataloguer, Endangered Archives Programme 

Creating detailed and consistent metadata is a challenge common to most archives. Many rely on an army of volunteers with varying degrees of cataloguing experience. And no matter how diligent any team of cataloguers are, human error and individual idiosyncrasies are inevitable.

This challenge is particularly pertinent to the Endangered Archives Programme (EAP), which has hitherto funded in excess of 400 projects in more than 90 countries. Each project is unique and employs its own team of one or more cataloguers based in the particular country where the archival content is digitised. But all this disparately created metadata must be uniform when ingested into the British Library’s cataloguing system and uploaded to eap.bl.uk.

Finding an efficient, low-cost method to process large volumes of metadata generated by hundreds of unique teams is a challenge; one that, in 2019, EAP sought to alleviate using the freely available open source software Open Refine – a power tool for processing data.

This blog highlights some of the ways that we are using Open Refine. It is not an instructional how-to guide (though we are happy to follow-up with more detailed blogs if there is interest), but an introductory overview of some of the Open Refine methods we use to process large volumes of metadata.

Initial metadata capture

Our metadata is initially created by project teams using an Excel spreadsheet template provided by EAP. In the past year we have completely redesigned this template in order to make it as user friendly and controlled as possible.

Screenshot of spreadsheet

But while Excel is perfect for metadata creation, it is not best suited for checking and editing large volumes of data. This is where Open Refine excels (pardon the pun!), so when the final completed spreadsheet is delivered to EAP, we use Open Refine to clean, validate, migrate, and enhance this data.

Workflow diagram

Replicating repetitive tasks

Open Refine came to the forefront of our attention after a one-day introductory training session led by Owen Stephens where the key takeaway for EAP was that a sequence of functions performed in Open Refine can be copied and re-used on subsequent datasets.

Screenshot of the Open Refine software (1)

This encouraged us to design and create a sequence of processes that can be re-applied every time we receive a new batch of metadata, thus automating large parts of our workflow.

No computer programming skills required

Building this sequence required no computer programming experience (though this can help); just logical thinking, a generous online community willing to share their knowledge and experience, and a willingness to learn Open Refine’s GREL language and generic regular expressions. Some functions can be performed simply by using Open Refine’s built-in menu options. But the limits of Open Refine’s capabilities are almost infinite; the more you explore and experiment, the further you can push the boundaries.

Initially, it was hoped that our whole Open Refine sequence could be repeated in one single large batch of operations. The complexity of the data and the need for archivist intervention meant that it was more appropriate to divide the process into several steps. Our workflow is divided into 7 stages:

  1. Migration
  2. Dates
  3. Languages and Scripts
  4. Related subjects
  5. Related places and other authorities
  6. Uniform Titles
  7. Digital content validation

Each of these stages performs one or more of four tasks: clean, migrate, validate, and enhance.

Task 1: Clean

The first part of our workflow provides basic data cleaning. Across all columns it trims any white space at the beginning or end of a cell, removes any double spaces, and capitalises the first letter of every cell. In just a few seconds, this tidies the entire dataset.

Task 1 Example: Trimming white space (menu option)

Trimming whitespace on an individual column is an easy function to perform, as Open Refine has a built-in “Common transform” for exactly this purpose.

Screenshot of the Open Refine software (2)

Although this is a simple function to perform, we no longer need to repeatedly select this menu option for each column of each dataset we process because this task is now part of the workflow that we simply copy and paste.

Task 1 Example: Capitalising the first letter (using GREL)

Capitalising the first letter of each cell is less straightforward for a new user as it does not have a built-in function that can be selected from a menu. Instead it requires a custom “Transform” using Open Refine’s own expression language (GREL).

Screenshot of the Open Refine software (3)


Having to write an expression like this should not put off any Open Refine novices. This is an example of Open Refine’s flexibility and many expressions can be found and copied from the Open Refine wiki pages or from blogs like this. The more you copy others, the more you learn, and the easier you will find it to adapt expressions to your own unique requirements.

Moreover, we do not have to repeat this expression again. Just like the trim whitespace transformation, this is also now part of our copy and paste workflow. One click performs both these tasks and more.
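For readers who like to see the logic spelled out, the equivalent of these first cleaning steps in Python (purely illustrative; our actual workflow runs inside Open Refine) might look like this:

    import pandas as pd

    df = pd.DataFrame({"Title": ["  letter to the  governor ", " petition from  traders  "]})

    def clean_cell(value):
        value = " ".join(value.split())       # trim ends and collapse double spaces
        return value[:1].upper() + value[1:]  # capitalise the first letter

    df["Title"] = df["Title"].map(clean_cell)
    print(df["Title"].tolist())  # ['Letter to the governor', 'Petition from traders']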

Task 2: Migrate

As previously mentioned, the listing template used by the project teams is not the same as the spreadsheet template required for ingest into the British Library’s cataloguing system. But Open Refine helps us convert the listing template to the ingest template. In just one click, it renames, reorders, and restructures the data from the human friendly listing template to the computer friendly ingest template.

Task 2 example: Variant Titles

The ingest spreadsheet has a “Title” column and a single “Additional Titles” column where all other title variations are compiled. It is not practical to expect temporary cataloguers to understand how to use the “Title” and “Additional Titles” columns on the ingest spreadsheet. It is much more effective to provide cataloguers with a listing template that has three prescriptive title columns. This helps them clearly understand what type of titles are required and where they should be put.

Spreadsheet snapshot

The EAP team then uses Open Refine to move these titles into the appropriate columns (illustrated above). It places one in the main “Title” field and concatenates the other two titles (if they exist) into the “Additional Titles” field. It also creates two new title type columns, which the ingest process requires so that it knows which title is which.

This is just one part of the migration stage of the workflow, which performs several renaming, re-ordering, and concatenation tasks like this to prepare the data for ingest into the British Library’s cataloguing system.
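As a hedged illustration of this migration step (the column names and sample titles here are invented; the real listing and ingest templates differ), the variant-title concatenation could be expressed like this:

    import pandas as pd

    listing = pd.DataFrame({
        "Title (main)":     ["Letter book", "Court register"],
        "Title (original)": ["Daftar-i maktubat", ""],
        "Title (other)":    ["", "Register of proceedings"],
    })

    ingest = pd.DataFrame()
    ingest["Title"] = listing["Title (main)"]
    # Concatenate whichever variant titles exist, separated by a delimiter
    ingest["Additional Titles"] = listing[["Title (original)", "Title (other)"]].apply(
        lambda row: "|".join(t for t in row if t), axis=1
    )
    print(ingest)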

Task 3: Validate

While cleaning and preparing the data for migration is important, it is also vital that we check that the data is accurate and reliable. But who has the time, inclination, or eye stamina to read thousands of rows of data in an Excel spreadsheet? What we require is a computational method to validate data. Perhaps the best way of doing this is to write a bespoke computer program. This indeed is something that I am now working on while learning to write computer code using the Python language (look out for a further blog on this later).

In the meantime, though, Open Refine has helped us to validate large volumes of metadata with no programming experience required.

Task 3 Example: Validating metadata-content connections

When we receive the final output from a digitisation project, one of our most important tasks is to ensure that all of the digital content (images, audio and video recordings) correlates with the metadata on the spreadsheet and vice versa.

We begin by running a command line report on the folders containing the digital content. This provides us with a csv file which we can read in Excel. However, the data is not presented in a neat format for comparison purposes.

Spreadsheet snapshot (2)

Restructuring data ready for validation comparisons

For this particular task what we want is a simple list of all the digital folder names (not the full directory) and the number of TIFF images each folder contains. Open Refine enables just that, as the next image illustrates.

Screenshot of the Open Refine software (4)

Constructing the sequence that restructures this data required careful planning and good familiarity with Open Refine and the GREL expression language. But after the data had been successfully restructured once, we never have to think about how to do this again. As with other parts of the workflow, we now just have to copy and paste the sequence to repeat this transformation on new datasets in the same format.

Cross referencing data for validation

With the data in this neat format, we can now do a number of simple cross referencing checks. We can check that:

  1. Each digital folder has a corresponding row of metadata – if not, this indicates that the metadata is incomplete
  2. Each row of metadata has a corresponding digital folder – if not, this indicates that some digital folders containing images are missing
  3. The actual number of TIFF images in each folder exactly matches the number of images recorded by the cataloguer – if not, this may indicate that some images are missing.

For each of these checks we use Open Refine’s cell.cross expression to cross reference the digital folder report with the metadata listing.

In the screenshot below we can see the results of the first validation check. Each digital folder name should match the reference number of a record in the metadata listing. If we find a match it returns that reference number in the “CrossRef” column. If no match is found, that column is left blank. By filtering that column by blanks, we can very quickly identify all of the digital folders that do not contain a corresponding row of metadata. In this example, before applying the filter, we can already see that at least one digital folder is missing metadata. An archivist can then investigate why that is and fix the problem.

Screenshot of the Open Refine software (5)
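Outside Open Refine, the same three checks can be sketched in a few lines of Python (hypothetical folder names and counts, and pandas in place of cell.cross), which may help clarify what the workflow is doing:

    import pandas as pd

    folder_report = pd.DataFrame({   # from the command line report on the digital content
        "folder": ["EAP001_1_1", "EAP001_1_2", "EAP001_1_3"],
        "tiff_count": [120, 98, 45],
    })
    metadata = pd.DataFrame({        # from the cataloguer's listing
        "reference": ["EAP001_1_1", "EAP001_1_2"],
        "image_count": [120, 90],
    })

    folders, refs = set(folder_report["folder"]), set(metadata["reference"])
    print("Folders with no metadata:", folders - refs)   # check 1
    print("Metadata with no folder:", refs - folders)    # check 2

    merged = folder_report.merge(metadata, left_on="folder", right_on="reference")
    print(merged[merged["tiff_count"] != merged["image_count"]])  # check 3: count mismatches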

Task 4: Enhance

We enhance our metadata in a number of ways. For example, we import authority codes for languages and scripts, and we assign subject headings and authority records based on keywords and phrases found in the titles and description columns.

Named Entity Extraction

One of Open Refine’s most dynamic features is its ability to connect to other online databases, and thanks to the generous support of Dandelion API we are able to use its service to identify entities such as people, places, organisations, and titles of work.

In just a few simple steps, Dandelion API reads our metadata and returns new linked data, which we can filter by category. For example, we can list all of the entities it has extracted and categorised as a place or all the entities categorised as people.

Screenshot of the Open Refine software (6)

Not every named entity it finds will be accurate. In the above example “Baptism” is clearly not a place. But it is much easier for an archivist to manually validate a list of 29 phrases identified as places, than to read 10,000 scope and content descriptions looking for named entities.
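For anyone who wants to experiment outside Open Refine, the same extraction can be requested directly from Dandelion. The sketch below reflects our reading of the dataTXT-NEX documentation; the token is a placeholder, and the exact endpoint and response fields should be checked against Dandelion's own documentation before relying on them:

    import requests

    DANDELION_TOKEN = "YOUR_API_TOKEN"   # placeholder - register at dandelion.eu for a real token
    text = "Baptism record photographed in Valparaiso, Chile, by a travelling missionary."

    response = requests.get(
        "https://api.dandelion.eu/datatxt/nex/v1/",
        params={"text": text, "token": DANDELION_TOKEN, "include": "types"},
        timeout=30,
    )
    response.raise_for_status()

    for annotation in response.json().get("annotations", []):
        # Each annotation carries the matched text ("spot"), a resolved entity title and its types
        print(annotation.get("spot"), "->", annotation.get("title"), annotation.get("types"))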

Clustering inconsistencies

If there is inconsistency in the metadata, the returned entities might contain multiple variants. This can be overcome using Open Refine’s clustering feature. This identifies and collates similar phrases and offers the opportunity to merge them into one consistent spelling.

Screenshot of the Open Refine software (7)

Linked data reconciliation

Having identified and validated a list of entities, we then use other linked data services to help create authority records. For this particular task, we use the Wikidata reconciliation service. Wikidata is a structured data sister project to Wikipedia. And the Open Refine reconciliation service enables us to link an entity in our dataset to its corresponding item in Wikidata, which in turn allows us to pull in additional information from Wikidata relating to that item.

For a South American photograph project we recently catalogued, Dandelion API helped identify 335 people (including actors and performers). By subsequently reconciling these people with their corresponding records in Wikidata, we were able to pull in their job title, date of birth, date of death, unique persistent identifiers, and other details required to create a full authority record for that person.

Screenshot of the Open Refine software (8)
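Open Refine does the reconciliation for us, but the sort of enrichment described above can also be fetched straight from Wikidata once an item has been matched. Here is a small sketch using the public SPARQL endpoint rather than the reconciliation service (with Q392, Bob Dylan, standing in for one of the reconciled performers; the properties used are occupation P106, date of birth P569 and date of death P570):

    import requests

    QUERY = """
    SELECT ?occupationLabel ?birth ?death WHERE {
      wd:Q392 wdt:P106 ?occupation .
      OPTIONAL { wd:Q392 wdt:P569 ?birth . }
      OPTIONAL { wd:Q392 wdt:P570 ?death . }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """

    response = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "metadata-enrichment-sketch/0.1 (example)"},
        timeout=60,
    )
    for row in response.json()["results"]["bindings"]:
        print(row["occupationLabel"]["value"],
              row.get("birth", {}).get("value"),
              row.get("death", {}).get("value"))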

Creating individual authority records for 335 people would otherwise take days of work. It is a task that previously we might have deemed infeasible. But Open Refine and Wikidata drastically reduce the human effort required.

Summary

In many ways, that is the key benefit. By placing Open Refine at the heart of our workflow for processing metadata, it now takes us less time to do more. Our workflow is not perfect. We are constantly finding new ways to improve it. But we now have a semi-automated method for processing large volumes of metadata.

This blog puts just some of those methods in the spotlight. In the interest of brevity, we refrained from providing step-by-step detail. But if there is interest, we will be happy to write further blogs to help others use this as a starting point for their own metadata processing workflows.