The BL Labs team are pleased to announce that the eighth annual British Library Labs Symposium 2020 will be held online on Tuesday 15 December 2020, from 13:45 to 16:55* (see note below). The event is FREE, but you must book a ticket in advance to reserve your place. Last year's event was the largest we have ever held, so please don't miss out: book early. See more information here!
*Please note that directly after the Symposium we are organising an experimental online mingling and networking session between 16:55 and 17:30!
The British Library Labs (BL Labs) Symposium is an annual event and awards ceremony showcasing innovative projects that use the British Library's digital collections and data. It provides a platform for highlighting and discussing the use of the Library’s digital collections for research, inspiration and enjoyment. The awards this year will recognise outstanding use of the British Library's digital content in the categories of Research, Artistic, Educational, Community and British Library staff contributions.
Ruth Ahnert will be giving the BL Labs Symposium 2020 keynote this year.
We are very proud to announce that this year's keynote will be delivered by Ruth Ahnert, Professor of Literary History and Digital Humanities at Queen Mary University of London, and Principal Investigator on 'Living With Machines' at The Alan Turing Institute.
Her work focuses on Tudor culture, book history, and digital humanities. She is author of The Rise of Prison Literature in the Sixteenth Century (Cambridge University Press, 2013), editor of Re-forming the Psalms in Tudor England, a special issue of Renaissance Studies (2015), and co-author of two further books: The Network Turn: Changing Perspectives in the Humanities (Cambridge University Press, 2020) and Tudor Networks of Power (forthcoming with Oxford University Press). Recent collaborative work has taken place through the AHRC-funded projects ‘Living with Machines’ and ‘Networking the Archives: Assembling and analysing a meta-archive of correspondence, 1509-1714’. With Elaine Treharne she is series editor of Stanford University Press’s Text Technologies series.
Ruth's keynote is entitled: Humanists Living with Machines: reflections on collaboration and computational history during a global pandemic
There will be Awards announcements throughout the event for the Research, Artistic, Community, Teaching & Learning and Staff categories, and this year we are going to ask the audience to vote for their favourite among the shortlisted projects: a people's BL Labs Award!
There will be a final talk near the end of the conference and we will announce the speaker for that session very soon.
World Digital Preservation Day (WDPD) is held on the first Thursday of every November, providing an opportunity for the international digital preservation community to connect and celebrate the positive impact that digital preservation has. Follow #WDPD2020 for discussion throughout the day. Our colleagues in the UK Web Archive (UKWA) have already blogged for WDPD about their Coronavirus Collection, which includes preservation of the ‘Children of Lockdown’ project website.
Here in Digital Scholarship we enjoy collaborating with the British Library's Digital Preservation and UKWA teams. Last year we hosted a six-month post-doctoral placement, ‘Emerging Formats: Discovering and Collecting Contemporary British Interactive Fiction’, in which Lynda Clark created an Interactive Narratives UKWA collection and evaluated how crawlers captured web-hosted works of interactive fiction.
This research project was part of the Library’s ongoing Emerging Formats work, which acknowledges that without intervention, many culturally valuable digital artefacts are at risk of being lost. Interactive narratives are particularly endangered due to the ‘hobbyist’ nature of many creators, meaning they do not necessarily subscribe to standardised practices. However, this also means that digital interactive fiction is created by and for a wide variety of creators and audiences, including various marginalised groups.
This guest post is by Alex Hailey, Curator of Modern Archives and Manuscripts. He's on Twitter as @ajrhailey.
In late 2019 I was lucky enough to join BL and National Archives staff to trial a PG Certificate in Computing for Cultural Heritage at Birkbeck. The course provided an introduction to programming with Python, the basics of SQL, and using the two to work with data. Fellow attendees Graham, Nick, Chris and Giulia have written about their work previously, and I am going to briefly introduce one of my project tasks addressing issues with legacy metadata within the India Office Records.
The original data
The IOR/E/4 Correspondence with India series consists of 1,112 volumes dating from 1703-1858: four series of letters received by the East India Company (EIC) Court of Directors from the administration in India, and four series of dispatches sent to India. Catalogue entries for these volumes contain only basic information – title, dates, language, reference and former references – and subject, name and place access to the dispatches is provided through 72 index volumes (reference IOR/Z/E/4), which contain around 430,000 entries.
Sample catalogue record of an index entry, IOR/Z/E/4/42/P133
The original indexes were produced from 1901-1929 by staff of the Secretarial Bureau, led by indexing pioneer Mary Petherbridge; my colleague Antonia Moon has written about Petherbridge’s work in a previous post. When these indexes were converted to the catalogue in the early 2010s, entries within the index volumes were entered as child or sub-items of the index volumes themselves, with information on the related correspondence volumes entered into the free-text Related material field, as shown in the image above.
Problem and solution
This approach has caused some issues. Firstly, users attempting to order the related correspondence regularly end up trying to place an order for an index volume instead, which is frustrating. Secondly, it makes it practically impossible to determine the whole contents of a particular volume in a quick and easy manner, which frustrates access and use.
Manually working through 430,000 entries to group the entries by volume would be an impossible task, but I was able to use Python and a library called Pandas, which has a number of useful features for examining and manipulating catalogue data: methods for reading and writing data from multiple sources, flexible reshaping of datasets, and methods for aggregation, indexing, splitting and replacing strings, including regular expressions.
Using Pandas I was able to separate information in the Related material field, restructure the data so that each instance of an index entry formed an individual record, and then group these by volume and further arrange them alphabetically or by page order.
Index entries for reference IOR/Z/E/4/42/P133 split into separate records
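The reshaping described above can be sketched with Pandas roughly as follows. This is a minimal illustration rather than the code used in the project: the column names (`index_ref`, `title`, `related_material`), the entry titles, the volume references, and the assumed `volume p. page` layout separated by semicolons are all hypothetical stand-ins for the real catalogue export.

```python
import pandas as pd

# Hypothetical index entries: one row per index entry, with the related
# correspondence volumes packed into a single free-text field.
entries = pd.DataFrame(
    {
        "index_ref": ["IOR/Z/E/4/42/P133", "IOR/Z/E/4/42/P134"],
        "title": ["Pepper trade", "Pensions"],
        "related_material": [
            "IOR/E/4/900 p. 12; IOR/E/4/901 p. 340",
            "IOR/E/4/900 p. 7",
        ],
    }
)

# 1. Split the free-text field into a list and give each reference its own row.
entries["related_material"] = entries["related_material"].str.split(";")
records = entries.explode("related_material").reset_index(drop=True)
records["related_material"] = records["related_material"].str.strip()

# 2. Pull the volume reference and page number out of each reference.
parts = records["related_material"].str.extract(
    r"^(?P<volume>\S+)\s+p\.\s*(?P<page>\d+)$"
)
records = records.join(parts)
records["page"] = records["page"].astype(int)

# 3. Arrange by volume, then by page order or alphabetically by title.
by_page = records.sort_values(["volume", "page"]).reset_index(drop=True)
by_title = records.sort_values(["volume", "title"]).reset_index(drop=True)

# Number of index entries that point at each correspondence volume.
entries_per_volume = records.groupby("volume").size()
```

In a real export, references that fail to match the pattern would come back as missing values from `str.extract`. Filtering for those rows is one way to surface incomplete references of the kind mentioned in the next section.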
Outputs and analysis
Examining these outputs gave us new insights into the data. We now know that the indexes cover 230 volumes of the dispatches only. We were also able to identify incomplete references originally recorded in the Related material field, as well as what appear to be keying errors (references which fall outside of the range of the dispatches series). We can now follow these up and correct errors in the catalogue which were previously unknown.
Comparing the data at volume level arranged alphabetically and by page order, we could appreciate just how much depth there was to the index. Traditional indexes are written with a lot of information redundancy, which isn’t immediately apparent until you group the entries according to their location within a particular volume:
Example of index entries arranged by page order
After discussion with the IOR team we have decided to take the alphabetically arranged data and import it to the archives catalogue, so that users selecting a dispatches volume are presented with the relevant index entries immediately.
The original dataset and derived datasets have been uploaded to the Library’s research repository where they are available for download and reuse under a CC0 licence.
To enable further analysis of the index data I have also tried my hand at creating a Jupyter Notebook to use with the derived data. This is intended to introduce colleagues to using Notebooks, Python and the Pandas library to examine catalogue metadata, conducting basic queries, producing a visualisation and exporting subsets for further investigation.
Wordcloud based on terms contained in the IOR/Z/E/4 data, generated within the Jupyter Notebook.
My Birkbeck project also included work to create place and institution authority files for the Proceedings of the Governments of India series using keyword extraction with existing catalogue metadata, and this will be discussed in a future post.
Huge thanks must go to Nora McGregor, Jo Pugh and the folks at Birkbeck Department of Computer Science for developing the course and providing us with this opportunity; Antonia Moon and the IOR team for helpful discussions about the IOR data; and the rest of the cohort for moral support when the computer just wouldn’t behave.
I’m not a summer creature; autumn is my favourite time of the year and I especially love Halloween. It is a perfect excuse for reading ghost stories, watching folk horror films and playing spooky videogames. If this sounds like fun to you too, then I recommend taking a look at the games created for Gothic Novel Jam.
It is always a pleasure to see how creatives use the Flickr images to make new works, such as animations, like The Phantom Monk shown below, made by my talented colleague Carlos Rarugal from the UK Web Archive. He has animated a few spooky creatures for Halloween, which will be shared from the Wildlife, Web Archive and Digital Scholarship Twitter accounts. My colleague Cheryl Tipp has been Going batty for Halloween, making a Flappy Bat online game using Scratch, and the UK Web Archive have been celebrating their crawlers with this blog post.
Video created by Carlos Rarugal, using a British Library digitised image from page 377 of "The Lancashire Witches. A novel". Audio is Thunder, Eric & May Nobles, Wales, 1989 (W Thunder r3 C1) and Grey Wolf, Tom Cosburn, Canada, 1995 (W1CDR0000681 BD9)
If you enjoy making games and works of interactive fiction, then you may want to sign up to participate in AdventureX Game Jam, which is taking place online, during 14-28 November 2020. The jam's theme will be announced when AdvXJam opens on the 14th November. You are invited to interpret the theme in any way you choose, and AdventureX are very open-minded about what constitutes a narrative game. All genres, styles and game engines are welcome, as they are very keen to encourage participants to get involved regardless of background or experience level.
Sadly, 2020 is not a year for in-person parties! However, I hope you'll raise a socially distanced glass safely at home to celebrate the eighth birthday of Wikidata, which first went live on 29th October 2012.
You can follow the festivities on social media with posts tagged #WikidataBirthday and read a message from the development team here. The WikiCite 2020 virtual conference kicked the celebrations off a few days early, with sessions about open citations and linked bibliographic data (videos online here) and, depending on what time you read this post, you may still be able to join a 24-hour online meetup, where people can drop in to chat to others about Wikidata.
If you are reading this post and wondering what Wikidata is, then you might want to read this introduction. Essentially it "is a document-oriented database, focused on items, which represent topics, concepts, or objects. Each item is allocated a unique, persistent identifier, a positive integer prefixed with the upper-case letter Q, known as a "QID". This enables the basic information required to identify the topic that the item covers to be translated without favouring any language."
Many libraries around the world have been actively adding data about their collections to Wikidata, and a number of groups to support and encourage this work have been established.
The IFLA Wikidata Working Group was formed in late 2019 to explore and advocate for the use of and contribution to Wikidata by library and information professionals, and to support the integration of Wikidata and Wikibase with library systems and the alignment of the Wikidata ontology with library metadata formats such as BIBFRAME, RDA, and MARC.
This group was originally due to host a satellite event for the World Library and Information Congress 2020 in Dublin, which was sadly cancelled due to Covid-19. However, this event was quickly converted into the Wikicite + Libraries series of six online discussions about open citations, language revitalisation, knowledge equity, access to scholarly publications, linking and visualising bibliographic data. The recordings have all been made available online via a YouTube playlist.
They have also set up a mailing list (email@example.com) and held an online launch party on the 8th October (slides). If you would like to attend their next meeting, it will be on the 24th November; the booking form is here.
Another online community for librarians working with Wikidata is the LD4 Wikidata Affinity Group, which explores how libraries can contribute to and leverage Wikidata as a platform for publishing, linking, and enriching library linked data. They meet biweekly via Zoom. At each meeting, either the co-facilitators or an invited guest give a presentation or a demonstration, followed by a wider discussion of any issues that members have encountered and an opportunity for sharing helpful resources.
If you work in libraries and are curious about Wikidata, I highly recommend attending these groups. If you are looking for an introductory guide, then Practical Wikidata for Librarians is an excellent starting point. There is also Library Carpentry Wikidata currently in development, which is shaping up to be a very useful resource.
It can't be all work and no play though, so I'm celebrating Wikidata's birthday with a seasonal slice of Frankencolin the Caterpillar cake!
People 'automatically' identified in digital TV news related programme clips.
Guest blog post by Andrew Brown (PhD researcher), Ernesto Coto (Research Software Engineer) and Andrew Zisserman (Professor) of the Visual Geometry Group, Department of Engineering Science, University of Oxford, and BL Labs Public Award Runner-up for Research, 2019. Posted on their behalf by Mahendra Mahey, Manager of BL Labs.
In this work, we automatically identify and label (tag) people in large video archives without the need for any manual annotation or supervision. The project was carried out with the British Library on a sample of 106 videos from their “Television and radio news” archive; a large collection of news programs from the last 10 years. This archive serves as an important and fascinating resource for researchers and the general public alike. However, the sheer scale of the data, coupled with a lack of relevant metadata, makes indexing, analysing and navigating this content an increasingly difficult task. Relying on human annotation is no longer feasible, and without an effective way to navigate these videos, this bank of knowledge is largely inaccessible.
As users, we are typically interested in human-centric queries such as:
“When did Jeremy Corbyn first appear in a Newsnight episode?” or
“Show me all of the times when Hugh Grant and Shirley Williams appeared together.”
Currently this is nigh on impossible without trawling through hundreds of hours of content.
We posed the following research question:
Is it possible to enable automatic person-search capabilities such as this in the archive, without the need for any manual supervision or labelling?
The answer is “yes”, and the method is described next.
Video Pre-Processing
The basic unit which enables person labelling in videos is the face-track: a group of consecutive face detections within a shot that correspond to the same identity. Face-tracks are extracted from all of the videos in the archive. The task of labelling the people in the videos is then to assign a label to each one of these extracted face-tracks. The video below gives an example of two face-tracks found in a scene.
Two face-tracks found in British Library digital news footage by Visual Geometry Group - University of Oxford.
Techniques at Our Disposal
The base technology used for this work is a state-of-the-art convolutional neural network (CNN), trained for facial recognition [1]. The CNN extracts feature-vectors (a list of numbers) from face images, which indicate the identity of the depicted person. To label a face-track, the distance between the feature-vector for the face-track and the feature-vector for a face-image with known identity is computed. The face-track is labelled as depicting that identity if the distance is smaller than a certain threshold (i.e. they match). We also use a speaker recognition CNN [2] that works in the same way, except it labels speech segments from unknown identities using speech segments from known identities within the video.
Labelling the Face-Tracks
Our method for automatically labelling the people in the video archive is divided into three main stages:
(1) Our first labelling method uses what we term a “celebrity feature-vector bank”, which consists of names of people that are likely to appear in the videos, and their corresponding feature-vectors. The names are automatically sourced from IMDB cast lists for the programmes (the titles of the programmes are freely available in the metadata). Face-images for each of the names are automatically downloaded from image-search engines. Incorrect face-images and people with no images of themselves on search engines are automatically removed at this stage. We compute the feature-vectors for each identity and add them to the bank alongside the names. The face-tracks from the video archives are then simply labelled by finding matches in the feature-vector bank.
Face-tracks from the video archives are labelled by finding matches in the feature-vector bank.
(2) Our second labelling method uses the idea that if a name is spoken, or found displayed in a scene, then that person is likely to be found within that scene. The task is then to automatically determine whether there is a correspondence or not. Text is automatically read from the news videos using Optical Character Recognition (OCR), and speech is automatically transcribed using Automatic Speech Recognition (ASR). Names are identified and they are searched for on image search engines. The top ranked images are downloaded and the feature-vectors are computed from the faces. If any are close enough to the feature-vectors from the face-tracks present in the scene, then that face-track is labelled with that name. The video below details this process for a written name.
Using text or spoken word and face recognition to identify a person in a news clip.
(3) For our third labelling method, we use speaker recognition to identify any non-labelled speaking people. We use the labels from the previous two stages to automatically acquire labelled speech segments from the corresponding labelled face-tracks. For each remaining non-labelled speaking person, we extract the speech feature-vector and compute the distance of it to the feature-vectors of the labelled speech segments. If one is close enough, then the non-labelled speech segment and corresponding face-track is assigned that name. This process manages to label speaking face-tracks with visually challenging faces, e.g. deep in shadow or at an extremely non-frontal pose.
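The matching step shared by all three stages can be sketched as follows. This is a minimal illustration under stated assumptions, not our implementation: the post does not specify the distance measure or the threshold, so Euclidean distance, the toy three-dimensional vectors, the placeholder names and the threshold of 0.3 are all hypothetical (feature-vectors produced by a real CNN have far more dimensions).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature-vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def label_track(track_vector, bank, threshold):
    """Return the closest identity in `bank`, or None if nothing is within `threshold`."""
    best_name, best_distance = None, float("inf")
    for name, known_vector in bank.items():
        distance = euclidean(track_vector, known_vector)
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name if best_distance < threshold else None


# Hypothetical feature-vector bank: name -> feature-vector of a known face.
bank = {
    "Person A": [0.9, 0.1, 0.0],
    "Person B": [0.0, 0.8, 0.6],
}

# Hypothetical face-tracks extracted from a video.
tracks = {
    "track-1": [0.88, 0.12, 0.05],  # very close to Person A
    "track-2": [0.5, 0.5, 0.5],     # not close to anyone in the bank
}

labels = {
    track_id: label_track(vector, bank, threshold=0.3)
    for track_id, vector in tracks.items()
}
```

Here `track-1` is labelled and `track-2` stays unlabelled, which is the behaviour the threshold is meant to give: a face-track with no sufficiently close match is left for a later stage rather than forced onto the nearest name.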
Indexing and Searching Identities
The results of our work can be browsed via a web search engine of our own design. A search bar allows users to specify the person or group of people that they would like to search for. People’s names are efficiently indexed so that the complete list of names can be filtered as the user types in the search bar. The search results are returned instantly with their associated metadata (programme name, date and time) and can be displayed in multiple ways. The video associated with each search result can be played, visualising the location and the name of all identified people in the video. See the video below for more details. This allows the archive videos to be easily navigated using person-search, thus opening them up for use by the general public.
Archive videos easily navigated using person-search.
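One simple way to filter a list of names as the user types is a sorted index searched by prefix. The sketch below is only an illustration of that idea: the post does not say how the names are actually indexed, and the function names `build_index` and `filter_names` are hypothetical.

```python
from bisect import bisect_left


def build_index(names):
    """Sort names case-insensitively once, so each query is a binary search."""
    return sorted((name.casefold(), name) for name in set(names))


def filter_names(index, prefix):
    """Return every indexed name that starts with `prefix`, ignoring case."""
    key = prefix.casefold()
    # Jump straight to the first entry that could match the prefix.
    start = bisect_left(index, (key,))
    matches = []
    for folded, original in index[start:]:
        if not folded.startswith(key):
            break  # entries are sorted, so no later entry can match
        matches.append(original)
    return matches


# Names taken from the example queries earlier in this post.
index = build_index(["Jeremy Corbyn", "Hugh Grant", "Shirley Williams"])
suggestions = filter_names(index, "sh")
```

Because the index is sorted, each keystroke costs a binary search plus one pass over the matching names, rather than a scan of the full list.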
[1] Qiong Cao, Li Shen, Weidi Xie, Omkar M. Parkhi, and Andrew Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In Proc. International Conference on Automatic Face & Gesture Recognition, 2018.
[2] Joon Son Chung, Arsha Nagrani and Andrew Zisserman. VoxCeleb2: Deep Speaker Recognition. INTERSPEECH, 2018.
BL Labs Public Awards 2020
Inspired by this work that uses the British Library's digital archived news footage? Have you done something innovative using the British Library's digital collections and data? Why not consider entering your work for a BL Labs Public Award 2020 and win fame, glory and even a bit of money?
Whilst we welcome projects on any use of our digital collections and data (especially in research, artistic, educational and community categories), we are particularly interested in entries in our public awards that focus on anti-racist work, that are about the pandemic, or that use computational methods such as Jupyter Notebooks.
The course is designed for graduates who are new to computer science – which was perfect for me, as I had no previous coding knowledge besides some very basic HTML and CSS. It was a very steep learning curve, starting from scratch and ending with developing my own piece of software, but it was great to see how code could be applied to everyday issues to facilitate and automate parts of our workload. The fact that it was targeted at information professionals and that we could use existing datasets to learn from real life examples made it easier to integrate study with work. After a while, I started to look at the everyday tasks in my to-do list and wonder “Can this be solved with Python?”
After a taught module (Demystifying Computing with Python), students had to work on an individual project module and develop a piece of software based on their work (to solve an issue, facilitate a task, re-use and analyse existing resources). I had an idea of the themes I wanted to explore – as Curator of Digital Publications, I’m interested in new media and platforms used to deliver content, and how text and stories are shaped by these tools. When I read about French company Short Édition and the short story vending machine in Canary Wharf I knew I had found my project.
My project is to build a stand-alone printer that prints random poems from a dataset of out-of-copyright texts. A little portable Bot-ish (sic!) Library to showcase the British Library collections and fill the world with more poetry.
A Short Story Station in Canary Wharf, London and my own sketch of a printing machine. (photo by the author)
For my project, I decided to use the British Library’s “Digitised printed books (18th-19th century)” collection. This comprises over 60,000 volumes of 18th and 19th century texts, digitised in partnership with Microsoft and made available under Public Domain Mark. My work focused on the metadata dataset and the dataset of OCR derived text (shout out to the Digital Research team for kindly providing me with this dataset, as its size far exceeded what my computer is able to download).
The British Library actively encourages researchers to use its “digital collection and data in exciting and innovative ways” and projects with similar goals to mine had been undertaken before. In 2017, Dr Jennifer Batt worked with staff at the British Library on a data mining project: her goal was to identify poetry within a dataset of 18th Century digitised newspapers from the British Library’s Burney Collection. In her research, Batt argued that employing a set of recurring words didn’t help her find poetry within the dataset, as only very few of the poems included key terms like ‘stanza’ and ‘line’ – and none included the word ‘poem’. In my case, I chose to work with the metadata dataset first, as a way of filtering books based on their title. While, as Batt proved, it’s unlikely that a poem itself includes a term defining its poetry style, I was quite confident that such terms might appear in the title of a poetry collection.
My first step then was to identify books containing poetry, by searching through the metadata dataset using key words associated with poetry. My goal was not to find all the poetry in the dataset, but to identify books containing some form of poetry, that could be reused to create my printer dataset. I used the Poetry Foundation’s online “Glossary of Poetic Terms - Forms & Types of Poems” to identify key terms to use, eliminating the anachronisms (no poetry slam in the 19th century, I'm afraid) and ambiguous terms (“romance” returned too many results that weren’t relevant to my research). The result was 4580 book titles containing one or more poetry-related words.
My list of poetry terms used to search through the dataset
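A title search of this kind can be sketched in a few lines of Python. This is only an illustration, not the code I wrote for the project: the five terms are a small hypothetical subset of my list, the three records are invented, and matching whole words only (with an optional plural "s") is an assumption made here so that "ode" does not match "modern".

```python
import re

# Hypothetical subset of the poetry terms used for the search.
POETRY_TERMS = ["ballad", "elegy", "ode", "sonnet", "verse"]

# Whole-word, case-insensitive match with an optional plural "s".
TERM_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, POETRY_TERMS)) + r")s?\b",
    re.IGNORECASE,
)


def is_poetry_title(title):
    """True if the title contains at least one poetry-related term."""
    return TERM_PATTERN.search(title) is not None


# Invented metadata records standing in for rows of the metadata dataset.
records = [
    {"id": "001", "title": "Sonnets and other verses"},
    {"id": "002", "title": "A Modern History of Lancashire"},
    {"id": "003", "title": "An Ode to Autumn, with an elegy"},
]

poetry_books = [record for record in records if is_poetry_title(record["title"])]
```

Ambiguous terms such as "romance" can simply be left out of `POETRY_TERMS`, which is what I did when they returned too many irrelevant results.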
Once I solved the problem of extracting single poems, the issue was ‘reshaping’ the text to match the print edition. Line breaks are essential to the meaning of a poem and the OCR text was just one continuous string of text that completely disregarded the metre and rhythm of the original work. The rationale behind my choice of book was also that sonnets present a fairly regular structure, which I was hoping could be of use when reshaping the text. The idea of using the poem’s metre as a tool to determine line length seemed the most effective choice: by knowing the type of metre used (iambic pentameter, terza rima, etc.) it’s possible to anticipate the number of syllables for each line and where line breaks should occur.
So I created a function to count how many syllables a word has following English grammar rules. As is often the case with coding, someone has likely already encountered the same problem as you and, if you’re lucky, they have found a solution: I used a function found online as my base (thank you, StackOverflow), building on it in order to cover as many grammar rules (and exceptions) as I was aware of. I used the same model and adapted it to Italian grammar rules, in order to account for the Italian sonnets in the book as well. I then decided to combine the syllable count with the use of capitalisation at the beginning of a line. This increased the chances of a successful result in case the syllable count returned a wrong result (which might happen whenever typos appear in the OCR text).
The same sonnet restructured so that each line is a new string (above), and matches the line breaks in the print edition (below)
Example of sonnet from Legend of the Death of Antar, an eastern romance. The function that divides the poems into lines could be adapted to accommodate breaks between stanzas as well.
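To give a flavour of the approach, here is a heavily simplified sketch of combining a syllable count with capitalisation to restore line breaks. It is not the function I actually used: the syllable rules are a rough heuristic of my own for this example (it over-counts some words, such as "lovely"), the `slack` parameter is an assumption, and the sample text is the opening of Shakespeare's Sonnet 18 rather than a poem from my dataset.

```python
import re


def count_syllables(word):
    """Rough English syllable count: vowel groups minus a silent final 'e'."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    count = len(re.findall(r"[aeiouy]+", word))
    # A final 'e' is usually silent ("compare"), except in endings like "-le".
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def split_lines(text, target=10, slack=3):
    """Rebuild line breaks in a continuous string of verse.

    A new line starts before a capitalised word once the current line has
    reached `target` syllables, or before any word once it has overshot the
    target by `slack` syllables (a fallback for miscounts and OCR typos).
    """
    lines, current, syllables = [], [], 0
    for word in text.split():
        reached_target = word[0].isupper() and syllables >= target
        too_long = syllables >= target + slack
        if current and (reached_target or too_long):
            lines.append(" ".join(current))
            current, syllables = [], 0
        current.append(word)
        syllables += count_syllables(word)
    if current:
        lines.append(" ".join(current))
    return lines


ocr_text = (
    "Shall I compare thee to a summer's day? Thou art more lovely and more "
    "temperate: Rough winds do shake the darling buds of May, And summer's "
    "lease hath all too short a date:"
)
restored = split_lines(ocr_text)
```

Requiring the full target before breaking on a capital keeps a mid-line proper noun such as "May" from starting a new line, but it also shows how fragile the rule is: a line that is under-counted by even one syllable would run on into the next.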
Main challenges and gaps in research
Typos in the OCR text: Errors and typos were introduced when the books in the collection were first digitised, which translated into exceptions to the rules I devised for identifying and restructuring poems. In order to ensure the text of every poem has been correctly captured and that typos have been fixed, some degree of manual intervention might be required.
Scalability: The variety of poetry styles and book structures, paired with the lack of tagging around verse text, make it impossible to find a single formula that can be applied to all cases. What I created is quite dependent on a book having one poem per page, and using capitalisation in a certain way.
Time constraint: the time limit we had to deliver the project - and my very-recently-acquired-and-still-very-much-developing skill set - meant I had to focus on a limited number of books and had to prioritise writing the software over building the printer itself.
One of the outputs of this project is a JSON file containing a dictionary of poetry books. After searching for poetry terms, I paired the poetry titles and relative metadata with their pages from the OCR dataset, so the resulting file combines useful data from the two original datasets (book IDs, titles, authors’ names and the OCR text of each book). It’s also slightly easier to navigate compared to the OCR dataset as books can be retrieved by ID, and each page is an item in a list that can be easily called. One of the next steps will be to upload this onto the British Library data repository, in the hope that people might be encouraged to use it and conduct further research around this data collection.
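As an illustration of how such a file can be navigated, the sketch below builds a tiny dictionary with the same general shape and retrieves a page from it. The key names (`title`, `author`, `pages`), the book ID and the page contents are hypothetical, and the actual file may be organised differently.

```python
import json

# Hypothetical structure: book ID -> metadata plus a list of OCR page texts.
poetry_books = {
    "000000001": {
        "title": "Example Sonnets",
        "author": "A. N. Author",
        "pages": ["Title page text", "First sonnet text", "Second sonnet text"],
    }
}

# Round-trip through JSON, as would happen when writing and re-reading the file.
serialised = json.dumps(poetry_books, ensure_ascii=False, indent=2)
loaded = json.loads(serialised)


def get_page(books, book_id, page_number):
    """Return the OCR text of a page (counting from 1) for a given book ID."""
    return books[book_id]["pages"][page_number - 1]


second_page = get_page(loaded, "000000001", 2)
```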
Component parts of the Adafruit IoT Pi Printer Project Pack. (photo by the author)
My aim when working on this project was for the printer to be used to showcase British Library collections; the idea was for it to be located in a public area in the Library, to reach new audiences that might not necessarily be there for research purposes. The printer could also be reprogrammed to print different genres and be customised for different occasions (e.g. exhibitions, anniversary celebrations, etc.). All of this was planned before Covid-19 happened, so it might be necessary to slightly adapt things now - and any suggestions on this are very welcome! :)
Perhaps you know of a project that developed new forms of knowledge, or an activity that delivered commercial value to the library. Did the person or team create an artistic work that inspired, stimulated, amazed and provoked? Do you know of a project developed by the Library where quality learning experiences were generated using the Library’s digital content?
You may nominate a current member of British Library staff, a team, or yourself (if you are a member of staff), for the Staff Award using this form.
The deadline for submission is NOON (GMT), Monday 30 November 2020.
Nominees will be highlighted on Tuesday 15 December 2020 at the online British Library Labs Annual Symposium where some (winners and runners-up) will also be asked to talk about their projects (everyone is welcome to attend, you just need to register).
You can see the projects submitted by members of staff and public for the awards in our online archive.