THE BRITISH LIBRARY

Digital scholarship blog

21 April 2020

Clean. Migrate. Validate. Enhance. Processing Archival Metadata with Open Refine

This blogpost is by Graham Jevon, Cataloguer, Endangered Archives Programme 

Creating detailed and consistent metadata is a challenge common to most archives. Many rely on an army of volunteers with varying degrees of cataloguing experience. And no matter how diligent any team of cataloguers are, human error and individual idiosyncrasies are inevitable.

This challenge is particularly pertinent to the Endangered Archives Programme (EAP), which has hitherto funded in excess of 400 projects in more than 90 countries. Each project is unique and employs its own team of one or more cataloguers based in the particular country where the archival content is digitised. But all this disparately created metadata must be uniform when ingested into the British Library’s cataloguing system and uploaded to eap.bl.uk.

Finding an efficient, low-cost method to process large volumes of metadata generated by hundreds of unique teams is a challenge – one that, in 2019, EAP sought to alleviate using the freely available open source software Open Refine: a power tool for processing data.

This blog highlights some of the ways that we are using Open Refine. It is not an instructional how-to guide (though we are happy to follow up with more detailed blogs if there is interest), but an introductory overview of some of the Open Refine methods we use to process large volumes of metadata.

Initial metadata capture

Our metadata is initially created by project teams using an Excel spreadsheet template provided by EAP. In the past year we have completely redesigned this template in order to make it as user friendly and controlled as possible.

[Screenshot of the metadata spreadsheet template]

But while Excel is perfect for metadata creation, it is not best suited for checking and editing large volumes of data. This is where Open Refine excels (pardon the pun!), so when the final completed spreadsheet is delivered to EAP, we use Open Refine to clean, validate, migrate, and enhance this data.

[Workflow diagram]

Replicating repetitive tasks

Open Refine came to the forefront of our attention after a one-day introductory training session led by Owen Stephens where the key takeaway for EAP was that a sequence of functions performed in Open Refine can be copied and re-used on subsequent datasets.

[Screenshot of Open Refine software]

This encouraged us to design and create a sequence of processes that can be re-applied every time we receive a new batch of metadata, thus automating large parts of our workflow.

No computer programming skills required

Building this sequence required no computer programming experience (though this can help); just logical thinking, a generous online community willing to share their knowledge and experience, and a willingness to learn Open Refine’s GREL language and generic regular expressions. Some functions can be performed simply by using Open Refine’s built-in menu options. But the limits of Open Refine’s capabilities are almost infinite; the more you explore and experiment, the further you can push the boundaries.

Initially, it was hoped that our whole Open Refine sequence could be repeated in one single large batch of operations. But the complexity of the data and the need for archivist intervention meant that it was more appropriate to divide the process into several steps. Our workflow is divided into seven stages:

  1. Migration
  2. Dates
  3. Languages and Scripts
  4. Related subjects
  5. Related places and other authorities
  6. Uniform Titles
  7. Digital content validation

Each of these stages performs one or more of four tasks: clean, migrate, validate, and enhance.

Task 1: Clean

The first part of our workflow provides basic data cleaning. Across all columns it trims any white space at the beginning or end of a cell, removes any double spaces, and capitalises the first letter of every cell. In just a few seconds, this tidies the entire dataset.

Task 1 Example: Trimming white space (menu option)

Trimming whitespace on an individual column is an easy function to perform, as Open Refine has a built-in "Common transform" that performs this function.

[Screenshot: Open Refine's built-in common transforms menu]

Although this is a simple function to perform, we no longer need to repeatedly select this menu option for each column of each dataset we process because this task is now part of the workflow that we simply copy and paste.
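
For those who prefer expressions to menus, this cleanup can also be written as a single custom GREL transform applied to each column. The following one-liner is a sketch of the general pattern, not necessarily EAP's exact expression:

    value.trim().replace(/\s{2,}/, " ")

This trims leading and trailing white space and collapses any run of two or more spaces into one.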

Task 1 Example: Capitalising the first letter (using GREL)

Capitalising the first letter of each cell is less straightforward for a new user as it does not have a built-in function that can be selected from a menu. Instead it requires a custom “Transform” using Open Refine’s own expression language (GREL).

[Screenshot: a custom GREL transform in Open Refine]

Having to write an expression like this should not put off any Open Refine novices. This is an example of Open Refine’s flexibility and many expressions can be found and copied from the Open Refine wiki pages or from blogs like this. The more you copy others, the more you learn, and the easier you will find it to adapt expressions to your own unique requirements.
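
For reference, a commonly shared GREL pattern for capitalising the first letter (not necessarily the exact expression EAP uses) is:

    value.substring(0, 1).toUppercase() + value.substring(1)

It takes the first character of the cell, upper-cases it, and appends the rest of the cell unchanged.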

Moreover, we do not have to repeat this expression again. Just like the trim whitespace transformation, this is also now part of our copy and paste workflow. One click performs both these tasks and more.

Task 2: Migrate

As previously mentioned, the listing template used by the project teams is not the same as the spreadsheet template required for ingest into the British Library’s cataloguing system. But Open Refine helps us convert the listing template to the ingest template. In just one click, it renames, reorders, and restructures the data from the human friendly listing template to the computer friendly ingest template.

Task 2 example: Variant Titles

The ingest spreadsheet has a “Title” column and a single “Additional Titles” column where all other title variations are compiled. It is not practical to expect temporary cataloguers to understand how to use the “Title” and “Additional Titles” columns on the ingest spreadsheet. It is much more effective to provide cataloguers with a listing template that has three prescriptive title columns. This helps them clearly understand what type of titles are required and where they should be put.

[Spreadsheet snapshot: the three title columns]

The EAP team then uses Open Refine to move these titles into the appropriate columns (illustrated above). It places one in the main “Title” field and concatenates the other two titles (if they exist) into the “Additional Titles” field. It also creates two new title type columns, which the ingest process requires so that it knows which title is which.
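
As a hedged sketch of what that concatenation might look like in GREL – the column names here are illustrative, not EAP's actual template headings – an expression along these lines could populate the "Additional Titles" field:

    filter([
        if(cells["Title 2"] != null, cells["Title 2"].value, ""),
        if(cells["Title 3"] != null, cells["Title 3"].value, "")
    ], t, t != "").join("; ")

It gathers whichever of the two variant titles exist and joins them with a single separator, leaving the field blank when neither is present.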

This is just one part of the migration stage of the workflow, which performs several renaming, re-ordering, and concatenation tasks like this to prepare the data for ingest into the British Library’s cataloguing system.

Task 3: Validate

While cleaning and preparing the data for migration is important, it is also vital that we check that the data is accurate and reliable. But who has the time, inclination, or eye stamina to read thousands of rows of data in an Excel spreadsheet? What we require is a computational method to validate data. Perhaps the best way of doing this is to write a bespoke computer program. This is indeed something that I am now working on while learning to write computer code using the Python language (look out for a further blog on this later).

In the meantime, though, Open Refine has helped us to validate large volumes of metadata with no programming experience required.

Task 3 Example: Validating metadata-content connections

When we receive the final output from a digitisation project, one of our most important tasks is to ensure that all of the digital content (images, audio and video recordings) correlates with the metadata on the spreadsheet, and vice versa.

We begin by running a command line report on the folders containing the digital content. This provides us with a CSV file which we can read in Excel. However, the data is not presented in a neat format for comparison purposes.

[Spreadsheet snapshot: raw command line report output]

Restructuring data ready for validation comparisons

For this particular task what we want is a simple list of all the digital folder names (not the full directory) and the number of TIFF images each folder contains. Open Refine enables just that, as the next image illustrates.

[Screenshot: the restructured folder list in Open Refine]
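
As a flavour of what that restructuring involves, one step – reducing a full directory path to its final folder name – might be written in GREL like this (a sketch, not the exact expression we use):

    value.split(/[\/\\]/).reverse()[0]

The expression splits the path on forward or back slashes and keeps the last segment; the rest of the sequence handles tasks such as counting the TIFF images per folder.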

Constructing the sequence that restructures this data required careful planning and good familiarity with Open Refine and the GREL expression language. But after the data had been successfully restructured once, we never have to think about how to do this again. As with other parts of the workflow, we now just have to copy and paste the sequence to repeat this transformation on new datasets in the same format.

Cross referencing data for validation

With the data in this neat format, we can now do a number of simple cross referencing checks. We can check that:

  1. Each digital folder has a corresponding row of metadata – if not, this indicates that the metadata is incomplete
  2. Each row of metadata has a corresponding digital folder – if not, this indicates that some digital folders containing images are missing
  3. The actual number of TIFF images in each folder exactly matches the number of images recorded by the cataloguer – if not, this may indicate that some images are missing.

For each of these checks we use Open Refine’s cell.cross expression to cross reference the digital folder report with the metadata listing.
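
The basic shape of such an expression, following the pattern in Open Refine's own documentation (the project and column names here are illustrative):

    cell.cross("Metadata listing", "Reference Number").cells["Reference Number"].value[0]

If the folder name matches a reference number in the metadata project, this returns that reference number; if there is no match, it returns nothing, which is what allows the blank filtering described below.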

In the screenshot below we can see the results of the first validation check. Each digital folder name should match the reference number of a record in the metadata listing. If we find a match it returns that reference number in the “CrossRef” column. If no match is found, that column is left blank. By filtering that column by blanks, we can very quickly identify all of the digital folders that do not contain a corresponding row of metadata. In this example, before applying the filter, we can already see that at least one digital folder is missing metadata. An archivist can then investigate why that is and fix the problem.

[Screenshot: cross-referencing results in Open Refine]

Task 4: Enhance

We enhance our metadata in a number of ways. For example, we import authority codes for languages and scripts, and we assign subject headings and authority records based on keywords and phrases found in the titles and description columns.

Named Entity Extraction

One of Open Refine’s most dynamic features is its ability to connect to other online databases and thanks to the generous support of Dandelion API we are able to use its service to identify entities such as people, places, organisations, and titles of work.

In just a few simple steps, Dandelion API reads our metadata and returns new linked data, which we can filter by category. For example, we can list all of the entities it has extracted and categorised as a place or all the entities categorised as people.

[Screenshot: entities extracted by Dandelion API in Open Refine]

Not every named entity it finds will be accurate. In the above example “Baptism” is clearly not a place. But it is much easier for an archivist to manually validate a list of 29 phrases identified as places, than to read 10,000 scope and content descriptions looking for named entities.

Clustering inconsistencies

If there is inconsistency in the metadata, the returned entities might contain multiple variants. This can be overcome using Open Refine’s clustering feature. This identifies and collates similar phrases and offers the opportunity to merge them into one consistent spelling.

[Screenshot: Open Refine's clustering feature]

Linked data reconciliation

Having identified and validated a list of entities, we then use other linked data services to help create authority records. For this particular task, we use the Wikidata reconciliation service. Wikidata is a structured data sister project to Wikipedia. And the Open Refine reconciliation service enables us to link an entity in our dataset to its corresponding item in Wikidata, which in turn allows us to pull in additional information from Wikidata relating to that item.
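
Once an entity has been matched, its Wikidata identifier can be copied into a new column with a one-line GREL expression, and Open Refine's "Add columns from reconciled values" option can then fetch linked properties:

    cell.recon.match.id

This returns the Q-number of the matched Wikidata item (or nothing if the cell is unmatched).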

For a South American photograph project we recently catalogued, Dandelion API helped identify 335 people (including actors and performers). By subsequently reconciling these people with their corresponding records in Wikidata, we were able to pull in their job title, date of birth, date of death, unique persistent identifiers, and other details required to create a full authority record for that person.

[Screenshot: Wikidata reconciliation results in Open Refine]

Creating individual authority records for 335 people would otherwise take days of work. It is a task that previously we might have deemed infeasible. But Open Refine and Wikidata drastically reduce the human effort required.

Summary

In many ways, that is the key benefit. By placing Open Refine at the heart of our workflow for processing metadata, it now takes us less time to do more. Our workflow is not perfect. We are constantly finding new ways to improve it. But we now have a semi-automated method for processing large volumes of metadata.

This blog puts just some of those methods in the spotlight. In the interest of brevity, we refrained from providing step-by-step detail. But if there is interest, we will be happy to write further blogs to help others use this as a starting point for their own metadata processing workflows.

20 April 2020

BL Labs Research Award Winner 2019 - Tim Crawford - F-Tempo

Posted on behalf of Tim Crawford, Professorial Research Fellow in Computational Musicology at Goldsmiths, University of London and BL Labs Research Award winner for 2019 by Mahendra Mahey, Manager of BL Labs.

Introducing F-TEMPO

Early music printing

Music printing, introduced in the later 15th century, enabled the dissemination of the greatest music of the age, which until that time was the exclusive preserve of royal and aristocratic courts or the Church. A vast repertory of all kinds of music is preserved in these prints, and they became the main conduit for the spread of the reputation and influence of the great composers of the Renaissance and early Baroque periods, such as Josquin, Lassus, Palestrina, Marenzio and Monteverdi. As this music became accessible to the increasingly well-heeled merchant classes, entirely new cultural networks of taste and transmission became established and can be traced in the patterns of survival of these printed sources.

Music historians have tended to neglect the analysis of these patterns in favour of a focus on a canon of ‘great works’ by ‘great composers’, with the consequence that there is a large sub-repertory of music that has not been seriously investigated or published in modern editions. By including this ‘hidden’ musical corpus, we could explore for the first time, for example, the networks of influence, distribution and fashion, and the effects on these of political, religious and social change over time.

Online resources of music and how to read them

Vast amounts of music, mostly audio tracks, are now available using services such as Spotify, iTunes or YouTube. Music is also available online in great quantity in the form of PDF files rendering page-images of either original musical documents or modern, computer-generated music notation. These are a surrogate for paper-based books used in traditional musicology, but offer few advantages beyond convenience. What they don’t allow is full-text search, unlike the text-based online materials which are increasingly the subject of ‘distant reading’ in the digital humanities.

With good score images, Optical Music Recognition (OMR) programs can sometimes produce useful scores from printed music of simple texture; however, in general, OMR output contains errors due to misrecognised symbols. The results often amount to musical gibberish, severely limiting the usefulness of OMR for creating large digital score collections. Our OMR program is Aruspix, which is highly reliable on good images, even when they have been digitised from microfilm.

Here is a screen-shot from Aruspix, showing part of the original page-image at the top, and the program’s best effort at recognising the 16th-century music notation below. It is not hard to see that, although the program does a pretty good job on the whole, there are not a few recognition errors. The program includes a graphical interface for correcting these, but we don’t make use of that for F-TEMPO for reasons of time – even a few seconds of correction per image would slow the whole process catastrophically.

The Aruspix user-interface

Finding what we want – error-tolerant encoding

Although OMR is far from perfect, online users are generally happy to use computer methods on large collections containing noise; this is the principle behind the searches in Google Books, which are based on Optical Character Recognition (OCR).

For F-TEMPO, from the output of the Aruspix OMR program, for each page of music, we extract a ‘string’ representing the pitch-name and octave for the sequence of notes. Since certain errors (especially wrong or missing clefs or accidentals) affect all subsequent notes, we encode the intervals between notes rather than the notes themselves, so that we can match transposed versions of the sequences or parts of them. We then use a simple alphabetic code to represent the intervals in the computer.

Here is an example of a few notes from a popular French chanson, showing our encoding method.

A few notes from a Crequillon chanson, and our encoding of the intervals
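
To make the idea concrete, here is a minimal Python sketch of this style of interval encoding – an illustration of the principle rather than F-TEMPO's actual code:

    # Map note letters to semitone pitch classes.
    NOTE = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

    def pitch_number(name):
        # e.g. "F#4" -> octave 4, pitch class 6 -> 54 semitones
        letter, octave = name[0], int(name[-1])
        accidentals = name[1:-1]
        semitones = NOTE[letter] + accidentals.count("#") - accidentals.count("b")
        return octave * 12 + semitones

    def encode(pitches):
        numbers = [pitch_number(p) for p in pitches]
        intervals = [b - a for a, b in zip(numbers, numbers[1:])]
        # Shift each interval onto a letter; "M" stands for a unison.
        return "".join(chr(ord("M") + i) for i in intervals)

    # Two transpositions of the same melody yield the same code string:
    assert encode(["C4", "E4", "D4", "G4"]) == encode(["D4", "F#4", "E4", "A4"])

Because only the differences between successive notes are kept, an OMR error such as a wrong clef – which shifts every subsequent note by the same amount – corrupts at most one interval rather than the whole string, and transposed versions of a melody produce identical codes.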

F-TEMPO in action

F-TEMPO uses state-of-the-art, scalable retrieval methods, providing rapid searches of almost 60,000 page-images for those similar to a query-page in less than a second. It successfully recovers matches when the query page is not complete, e.g. when page-breaks are different. Also, close non-identical matches, as between voice-parts of a polyphonic work in imitative style, are highly ranked in results; similarly, different works based on the same musical content are usually well-matched.

Here is a screen-shot from the demo interface to F-TEMPO. The ‘query’ image is on the left, and searches are done by hitting the ‘Enter’ or ‘Return’ key in the normal way. The list of results appears in the middle column, with the best match (usually the query page itself) highlighted and displayed on the right. As other results are selected, their images are displayed on the right. Users can upload their own images of 16th-century music that might be in the collection to serve as queries; we have found that even photos taken with a mobile phone work well. However, don’t expect coherent results if you upload other kinds of image!

The F-TEMPO user interface

The F-TEMPO web-site can be found at: http://f-tempo.org

Click on the ‘Demo’ button to try out the program for yourself.

What more can we do with F-TEMPO?

Using the full-text search methods enabled by F-TEMPO’s API we might begin to ask intriguing questions, such as:

  • ‘How did certain pieces of music spread and become established favourites throughout Europe during the 16th century?’
  • ‘How well is the relative popularity of such early-modern favourites reflected in modern recordings since the 1950s?’
  • ‘How many unrecognised arrangements are there in the 16th-century repertory?’

In early testing we identified an instrumental ricercar as a wordless transcription of a Latin motet, hitherto unknown to musicology. As the collection grows, we are finding more such unexpected concordances, and can sometimes identify the composers of works labelled in some printed sources as by ‘Incertus’ (Uncertain). We have also uncovered some interesting conflicting attributions which could provoke interesting scholarly discussion.

Early Music Online and F-TEMPO

From the outset, this project has been based on the Early Music Online (EMO) collection, the result of a 2011 JISC-funded Rapid Digitisation project between the British Library and Royal Holloway, University of London. This digitised about 300 books of early printed music at the BL from archival microfilms, producing black-and-white images which have served as an excellent proof of concept for the development of F-TEMPO. The c.200 books judged suitable for our early methods in EMO contain about 32,000 pages of music, and form the basis for our resource.

The current version of F-TEMPO includes just under 30,000 more pages of early printed music from the Polish National Library, Warsaw, as well as a few thousand from the Bibliothèque nationale, Paris. We will soon be incorporating no fewer than a further half-a-million pages from the Bavarian State Library collection in Munich, as soon as we have run them through our automatic indexing system.

(This work was funded for the past year by the JISC / British Academy Digital Humanities Research in the Humanities scheme. Thanks are due to David Lewis, Golnaz Badkobeh and Ryaan Ahmed for technical help and their many suggestions.)

08 April 2020

Legacies of Catalogue Descriptions and Curatorial Voice: a new AHRC project

This guest post is by James Baker, Senior Lecturer in Digital History and Archives at the School of History, Art History and Philosophy, University of Sussex. James has a background in the history of the printed image, archival theory, art history, and computational analysis. He is author of The Business of Satirical Prints in Late-Georgian England (2017), the first monograph on the infrastructure of the satirical print trade circa 1770-1830, and a member of the Programming Historian team.

I love a good catalogue. Whether describing historic books, personal papers, scientific objects, or works of art, catalogue entries are the stuff of historical research: brief insights into many possible avenues of discovery. As a historian, I am trained to think critically about catalogues and the entries they contain, to remember that they are always crafted by people, institutions, and temporally specific ways of working, and to consider what that reality might do to my understanding of the past those catalogues and entries represent. Recently, I've started to make these catalogues my objects of historical study, to research what they contain, the labour that produced them, and the socio-cultural forces that shaped that labour, with a particular focus on the anglophone printed catalogue circa 1930-1990.

One motivation for this is purely historical: to elucidate what I see as an important historical phenomenon. But another is about now, about how those catalogues are used and reused in the digital age. Browse the shelves of a university library and you'll quickly see that circumstances of production are encoded into the architecture of the printed catalogue: title pages, prefaces, fonts, spines, and the quality of paper are all signals of their historical nature. But when their entries are moved into a database and online - as many have been over the last 30 years - these cues become detached, and their replacement, a bibliographic citation, is insufficient to evoke their historical specificity and does little to alert the user to the myriad of texts they are navigating each time they search an online catalogue.

It is these interests and concerns that underpin "Legacies of Catalogue Descriptions and Curatorial Voice: Opportunities for Digital Scholarship", a collaboration between the Sussex Humanities Lab, the British Library, and Yale University Library. This 12-month project funded by the Arts and Humanities Research Council aims to open up new and important directions for computational, critical, and curatorial analysis of collection catalogues. Our pilot research will investigate the temporal and spatial legacy of a catalogue I know well - the landmark ‘Catalogue of Political and Personal Satires Preserved in the Department of Prints and Drawings in the British Museum’, produced by Mary Dorothy George between 1930 and 1954, 1.1 million words of text to which all scholars of the long-eighteenth century printed image are indebted, and which forms the basis of many catalogue entries at other institutions, not least those of our partners at the Lewis Walpole Library. We are particularly interested in tracing the temporal and spatial legacies of this catalogue, and plan to repurpose corpus linguistic methods developed in our "Curatorial Voice" project (generously funded by the British Academy) to examine the enduring legacies of Dorothy George's "voice" beyond her printed volumes.

[Photo: participants at the workshop working in small groups, drawing images on paper]
Some things we got up to at our February 2019 Curatorial Voice workshop. What a difference a year makes!

But we also want to demonstrate the value of these methods to cultural institutions. Alongside their collections, catalogues are central to the identities and legacies of these institutions. And so we posit that being better able to examine their catalogue data can help cultural institutions get on with important catalogue-related work: to target precious cataloguing and curatorial labour towards the records that need the most attention, to produce empirically-grounded guides to best practice, and to enable more critical user engagement with 'legacy' catalogue records (for more info, see our paper ‘Investigating Curatorial Voice with Corpus Linguistic Techniques: the case of Dorothy George and applications in museological practice’, Museum & Society, 2020).

[Visualisation: a table of boxes of black and red lines representing spatial and non-spatial sentence parts in the descriptions of the satirical prints]
An analysis of our BM Satire Descriptions corpus (see doi.org/10.5281/zenodo.3245037 for how we made it and doi.org/10.5281/zenodo.3245017 for our methods). In this visualization - a snapshot of a bigger interactive - one box represents a single description, red lines are sentence parts marked ‘spatial’, and black lines are sentence parts marked as ‘non-spatial’. This output was based on iterative machine learning analysis with Method52. The data used is published by ResearchSpace under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Over the course of the "Legacies" project, we had hoped to run two capability building workshops aimed at library, archives, and museum professionals. The first of these was due to take place at the British Library this May, and the aim of the workshop was to test our still very much work-in-progress training module on the computational analysis of catalogue data. Then Covid-19 hit and, like most things in life, the plan had to be dropped.

The new plan is still in development, but the project team know that we need input from the community to make the training module of greatest benefit to that community. The current plan is that in late summer we will run some ad hoc virtual training sessions on computational analysis of catalogue data. And so we are looking for library, archives, and museum professionals who produce or work with catalogue data to be our crash test dummies, to run through parts of the module, to tell us what works, what doesn't, and what is missing. If you'd be interested in taking part in one of these training sessions, please email James Baker and tell me why. We look forward to hearing from you.

"Legacies of Catalogue Descriptions and Curatorial Voice: Opportunities for Digital Scholarship" is funded under the Arts and Humanities Research Council (UK) “UK-US Collaboration for Digital Scholarship in Cultural Institutions: Partnership Development Grants” scheme. Project Reference AH/T013036/1.

11 February 2020

Call for participants: April 2020 book sprint on the state of the art in crowdsourcing in cultural heritage

[Update, March 2020: like so much else, our plans for the 'Collective Wisdom' project have been thrown out by the COVID-19 pandemic. We have an extension from our funders and will look to confirm dates when the global situation (especially around international flights) becomes clearer. In the meantime, the JISCMail Crowdsourcing list has some discussion on starting and managing projects in the current context.]

One of the key outcomes of our AHRC UK-US Partnership Development Grant, 'From crowdsourcing to digitally-enabled participation: the state of the art in collaboration, access, and inclusion for cultural heritage institutions', is the publication of an open access book written through a collaborative 'book sprint'. We'll work with up to 12 other collaborators to write a high-quality book that provides a comprehensive, practical and authoritative guide to crowdsourcing and digitally-enabled participation projects in the cultural heritage sector. Could you be one of our collaborators? Read on!

The book sprint will be held at the Peale Center for Baltimore History and Architecture from 19 to 24 April 2020. We've added a half-day debriefing session to the usual five-day sprint, so that we can capture all the ideas that didn't make it into the book and start to shape the agenda for a follow-up workshop to be held at the British Library in October. Due to the pace of writing and facilitation, participants must be able to commit to five and a half days in order to attend.

We have some confirmed participants already - including representatives from FromThePage, King’s College London Department of Digital Humanities, the Virginia Tech Department of Computer Science, and the Colored Conventions Project, plus the project investigators Mia Ridge (British Library), Meghan Ferriter (Library of Congress) and Sam Blickhan (Zooniverse) - with additional places to be filled by this open call for participation. 

An open call enables us to include folk from a range of backgrounds and experiences. This matches the ethos of the book sprint model, which states that 'diversity in participants—perspectives, experience, job roles, ethnicity, gender—creates a better work dynamic and a better book'. Participants will have the opportunity to not only create this authoritative text, but to facilitate the formation of an online community of practice which will serve as a resource and support system for those engaging with crowdsourcing and digitally-enabled participation projects.

We're looking for participants who are enthusiastic, experienced and engaged, with expertise at any point in the life cycle of crowdsourcing and digital participation. Your expertise might have been gained through hands-on experience on projects or by conducting research in areas from co-creation with heritage organisations or community archives to HCI, human computation and CSCW. We have a generous definition of 'digitally-enabled participation', including not-entirely-digital volunteering projects around cultural heritage collections, and activities that go beyond typical collection-centric 'crowdsourcing' tasks like transcription, classification and description. Got questions? Please email digitalresearch@bl.uk!

How to apply

  1. Read the Book Sprint FAQs to make sure you're aware of the process and commitment required
  2. Fill in this short Google Form by midnight GMT February 26th

What happens next?

We'll review applications and let people know by the end of February 2020.

We're planning to book travel and accommodation for participants as soon as dates and attendance are confirmed - this helps keep costs down and also means that individuals aren't out of pocket while waiting for reimbursement. The AHRC fund will pay for travel and accommodation for all book sprint participants. We will also host a follow-up workshop at the British Library in October and hope to provide travel and accommodation for book sprint participants.

We'll be holding a pre-sprint video call (on March 18, 19 or 20) to put faces to names and think about topics that people might want to research in advance and collect as an annotated bibliography for use during the sprint. 

If you can't make the book sprint but would still like to contribute, we've got you covered! We'll publish the first version of the book online for comment and feedback. Book sprints don't allow for remote participation, so this is our best way of including the vast amounts of expertise not in the room.

You can sign up to the British Library's crowdsourcing newsletters for updates, or join our Crowdsourcing group on Humanities Commons set up to share progress and engage in discussion with the wider community. 

New project! 'From crowdsourcing to digitally-enabled participation: the state of the art in collaboration, access, and inclusion for cultural heritage institutions'

[Update, March 2020: like so much else, our plans for the 'Collective Wisdom' project have been thrown out by the COVID-19 pandemic. We have an extension from our funders and will look to confirm dates when the global situation (especially around international flights) becomes clearer. In the meantime, the JISCMail Crowdsourcing list has some discussion on starting and managing projects in the current context.]

We - Mia Ridge (British Library), Meghan Ferriter (Library of Congress) and Sam Blickhan (Zooniverse) - are excited to announce that we've been awarded an AHRC UK-US Partnership Development Grant. Our overarching goals are:

  • To foster an international community of practice in crowdsourcing in cultural heritage
  • To capture and disseminate the state of the art and promote knowledge exchange in crowdsourcing and digitally-enabled participation
  • To set a research agenda and generate shared understandings of unsolved or tricky problems that could lead to future funding applications

How will we do that?

We're holding a five day collaborative 'book sprint' (or writing workshop) at the Peale Center for Baltimore History and Architecture in April 2020. Working with up to 12 other collaborators, we'll write a high-quality book that provides a comprehensive, practical and authoritative guide to crowdsourcing and digitally-enabled participation projects in the cultural heritage sector. We want to provide an effective road map for cultural institutions hoping to use crowdsourcing for the first time and a resource for institutions already using crowdsourcing to benchmark their work.

In the spirit of digital participation, we'll publish a commentable version of the book online with an open call for feedback from the extended international community of crowdsourcing practitioners, academics and volunteers. We're excited about including the expertise of those unable to attend the book sprint in our final open access publication.

The book sprint will close with a short debrief session to capture suggestions about gaps in the field and sketch the agenda for the closing workshop. 

In October 2020 we're holding a workshop at the British Library for up to 25 participants to interrogate, refine and advance questions raised during the year and identify high-priority gaps and emerging challenges in the field that could be addressed by future research collaborations. We'll work with a community manager to ensure that remote participants are integrated into the event as fully as possible, which will lower our carbon footprint and let people contribute without getting on a plane.

We'll publish a white paper reporting on this workshop, outlining emerging, intractable and unsolved challenges that could be addressed by further funding for collaborative work. 

Finally, we want this project to help foster the wonderful community of crowdsourcing practitioners, participants and researchers by hosting events and online discussion. 

Why now?

For several years, crowdsourcing has provided a framework for online participation with, and around, cultural heritage collections. This popularity leads to increased participant expectations while also attracting criticism such as accusations of ‘free labour’. Now, the introduction of machine learning and AI methods, and co-creation and new models of ownership and authorship present significant challenges for institutions used to managing interactions with collections on their own terms. 

How can you get involved?

Our call for participants in our April Book Sprint is now open!

Our final workshop will be held in mid- or late-October. The easiest way to get updates, such as calls for contributors and links to blog posts, is to sign up for the British Library's crowdsourcing newsletters or join the Crowdsourcing group on Humanities Commons.

29 November 2019

Introducing Filipe Bento - BL Labs Technical Lead

Posted by Filipe Bento, BL Labs Technical Lead

I am passionate about libraries and digital initiatives within them, and am particularly interested in Open Knowledge, scholarly communication, scientific information dissemination, (Linked) Open Data, and all the innovative services that can be offered to promote their ultimate dissemination and usage, not only within academia but also within the wider community, such as industry and society. I have over twenty years' experience in developing and supporting library tools, some of which have facilitated automation over manual methods to make life easier for people who work in or use libraries.

Before working at the British Library, I was an independent consultant in the areas of digital strategies and initiatives, library technologies, information management, digital policies, Software as a Service (SaaS) and Open Source Software (OSS). Prior to that, I worked at EBSCO Information Services in several roles: firstly as the Discovery Service Engineering Support Team Manager (Europe and Latin America), and for three years as the Software Services, Application Programming Interfaces (API) and Applications (Apps) manager. My last role at EBSCO was implementing and managing the EBSCO App Store, which involved working with several departments within the organisation, such as marketing and legal.

Giving a talk at the National Congress of BAD (Portuguese Librarians, Archivists and Documentalists Association) in the Azores

I helped the University of Aveiro's Library become the first Portuguese adopter of the reference Open Source Software (OSS) OJS (Open Journal Systems), and implemented the institutional digital repository DSpace for the university (which included a massive data transformation and records deposit, often from citations exported from Scopus). I started my career as a lecturer and then as a computer specialist at the University of Aveiro's Library, coordinating the development of information systems for its many branches for over fifteen years.

My PhD research in Information and Communication in Digital Platforms gave me the opportunity to connect with my professional interests in libraries, especially in the areas of information discovery. In my PhD, I was able to implement VuFind with innovative community features, as a proposal for the university, which involved engaging actively in its developer community, providing general and technical support in the process. My thesis is available via the link "Search 4.0: Integration and Cooperation Confluence in Scientific Information Discovery".

University of Aveiro (main campus), Portugal

I have also been very active in a number of communities: I am a former chairman of the board of USE.pt, the Portuguese Ex Libris Systems' Users Association, and a former member of the DigiMedia Research Center - Digital Media and Interaction at the University of Aveiro.

In my personal life I have been a radio and club DJ and have worked on a number of personal music projects. I enjoy photography and video and am a keen traveller. I especially like being behind the wheel of cars and motorbikes, and the propellers of drones.

I am really excited to be joining the BL Labs team, as I believe it provides an excellent opportunity to apply my skills, knowledge and expertise in library digital collections, systems, data and APIs in a digital scholarship and wider context. I am really looking forward to offering practical advice and implementations for providing access to data, data curation, data visualisation, text and data mining, and interactive web-based computing environments such as Jupyter Notebooks, to name a few. BL Labs and the British Library offer a rich, innovative and stimulating environment to explore what their staff and users want to do with their incredible and diverse digital collections.

03 October 2019

BL Labs Symposium (2019): Book your place for Mon 11-Nov-2019

Posted by Mahendra Mahey, Manager of BL Labs

The BL Labs team are pleased to announce that the seventh annual British Library Labs Symposium will be held on Monday 11 November 2019, from 9:30 - 17:00* (see note below) in the British Library Knowledge Centre, St Pancras. The event is FREE, and you must book a ticket in advance to reserve your place. Last year's event was the largest we have ever held, so please don't miss out and book early!

*Please note that directly after the Symposium, we have teamed up with an interactive/immersive theatre company called 'Uninvited Guests' for a specially organised early evening event for Symposium attendees (the full cost is £13, with some concessions available). Read more at the bottom of this posting!

The Symposium showcases innovative and inspiring projects which have used the British Library's digital content. Last year's Award winners drew attention to artistic, research, teaching & learning, and commercial activities that used our digital collections.

The annual event provides a platform for the development of ideas and projects, facilitating collaboration, networking and debate in the Digital Scholarship field, as well as focusing on the creative reuse of the British Library's and other organisations' digital collections and data in many other sectors. Read what groups of Library and Information Science Master's students from City University London (#CityLIS) said about the Symposium last year.

We are very proud to announce that this year's keynote will be delivered by scientist Armand Leroi, Professor of Evolutionary Biology at Imperial College, London.

Professor Armand Leroi from Imperial College will be giving the keynote at this year's BL Labs Symposium (2019)

Professor Armand Leroi is an author, broadcaster and evolutionary biologist.

He has written and presented several documentary series on Channel 4 and BBC Four. His latest documentary, The Secret Science of Pop (BBC Four, 2017), presented the results of an analysis - carried out with colleagues from Queen Mary University - of over 17,000 western pop songs from the US Billboard top 100 charts between 1960 and 2010, with further work published through the Royal Society. Armand has a special interest in how we can apply techniques from evolutionary biology to ask important questions about culture, humanities and what is unique about us as humans.

Previously, Armand presented Human Mutants, a three-part documentary series about human deformity for Channel 4, accompanied by the award-winning book Mutants: On Genetic Variety and the Human Body. He also wrote and presented a two-part series, What Makes Us Human, also for Channel 4. On BBC Four, Armand presented the documentaries What Darwin Didn't Know and Aristotle's Lagoon, also releasing the book The Lagoon: How Aristotle Invented Science, looking at Aristotle's impact on science as we know it today.

Armand's keynote will reflect on his interest and experience in applying techniques he has used over many years from evolutionary biology, such as bioinformatics, data-mining and machine learning, to ask meaningful 'big' questions about culture, humanities and what makes us human.

The title of his talk will be 'The New Science of Culture'. Armand will follow in the footsteps of previous prestigious BL Labs keynote speakers: Dan Pett (2018); Josie Fraser (2017); Melissa Terras (2016); David De Roure and George Oates (2015); Tim Hitchcock (2014); Bill Thompson and Andrew Prescott in 2013.

The symposium will be introduced by the British Library's new Chief Librarian, Liz Jolly. The day will include an update and exciting news from Mahendra Mahey (BL Labs Manager at the British Library) about the work of BL Labs, highlighting innovative collaborations it has been working on, including how it is working with Labs around the world to share experiences, knowledge and lessons learned. There will be news from the Digital Scholarship team about the exciting projects they have been working on, such as Living with Machines and other initiatives, together with a special insight from the British Library's Digital Preservation team into how they attempt to preserve our digital collections and data for future generations.

Throughout the day, there will be several announcements and presentations showcasing work from projects nominated for the BL Labs Awards 2019, which recognise work that has used the British Library's digital content in artistic, research, educational and commercial activities.

There will also be a chance to find out who has been nominated and recognised for the British Library Staff Award 2019 which highlights the work of an outstanding individual (or team) at the British Library who has worked creatively and originally with the British Library's digital collections and data (nominations close midday 5 November 2019).

As is our tradition, the Symposium will have plenty of opportunities for networking throughout the day, culminating in a reception for delegates and British Library staff to mingle and chat over a drink and nibbles.

Finally, we have teamed up with the interactive/immersive theatre company 'Uninvited Guests', who will give a specially organised performance for BL Labs Symposium attendees directly after the symposium. This participatory performance will take the audience on a journey through a world that is on the cusp of a technological disaster. Our period of history could vanish forever from human memory because digital information will be wiped out for good. How can we leave a trace of our existence to those born later? Don't miss the chance to book for this unique 5pm event, specially organised to coincide with the end of the BL Labs Symposium. For more information and booking (spaces are limited), please visit here (the full cost is £13, with some concessions available). Please note: if you are unable to join the 5pm show, there will be another performance at 19:45 the same evening (book here for that one).

So don't forget to book your place for the Symposium today, as we predict it will be another full house and we don't want you to miss out.

We look forward to seeing new faces and meeting old friends again!

For any further information, please contact labs@bl.uk

02 October 2019

The 2019 British Library Labs Staff Award - Nominations Open!

Looking for entries now!

[Image: a set of four light bulbs, the third switched on - a metaphor for an 'idea']
Nominate a British Library staff member or a team that has done something exciting, innovative and cool with the British Library’s digital collections or data.

The 2019 British Library Labs Staff Award, now in its fourth year, gives recognition to current British Library staff who have created something brilliant using the Library’s digital collections or data.

Perhaps you know of a project that developed new forms of knowledge, or an activity that delivered commercial value to the library. Did the person or team create an artistic work that inspired, stimulated, amazed and provoked? Do you know of a project developed by the Library where quality learning experiences were generated using the Library’s digital content? 

You may nominate a current member of British Library staff, a team, or yourself (if you are a member of staff), for the Staff Award using this form.

The deadline for submission is 12:00 (BST), Tuesday 5 November 2019.

Nominees will be highlighted on Monday 11 November 2019 at the British Library Labs Annual Symposium where some (winners and runners-up) will also be asked to talk about their projects.

You can see the projects submitted by members of staff for the last two years' awards in our online archive, as well as blogs for last year's winners and runners-up.

The Staff Award complements the British Library Labs Awards, introduced in 2015, which recognise outstanding work that has been done in the broader community. Last year's Staff Award winner focused on the brilliant work of the 'Polonsky Foundation England and France Project: Digitising and Presenting Manuscripts from the British Library and the Bibliothèque nationale de France, 700–1200'.

The runner up for the BL Labs Staff Award last year was the 'Digital Documents Harvesting and Processing Tool (DDHAPT)' which was designed to overcome the problem of finding individual known documents in the United Kingdom's Legal Deposit Web Archive.

In the public competition, last year's winners drew attention to artistic, research, teaching & learning, and commercial activities that used our digital collections.

British Library Labs is a project within the Digital Scholarship department at the British Library that supports and inspires the use of the Library's digital collections and data in exciting and innovative ways. It was previously funded by the Andrew W. Mellon Foundation and is now solely funded by the British Library.

If you have any questions, please contact us at labs@bl.uk.