THE BRITISH LIBRARY

Digital scholarship blog

Enabling innovative research with British Library digital collections


15 February 2018

BL Labs 2017 Symposium: Git Lit, Learning & Teaching Award Runner Up

Applications of Distributed Version Control Technologies Toward the Creation of 50,000 Digital Scholarly Editions

The British Library maintains a collection of roughly 50,000 digital texts, scanned from public-domain books, most of which were originally published in the 19th century. As scanned books, their text format is Analyzed Layout and Text Object (ALTO) Extensible Markup Language (XML), a verbose markup format created by Optical Character Recognition (OCR) software, and one which is only marginally human-readable. Our project, Git-Lit, converts each text to the plain text format Markdown, creates version-controlled repositories for each using the distributed version control system Git, and posts the repositories to the project management platform GitHub, where they can be edited by anyone. Along the way, websites for each text, optimized for readability, are automatically generated via GitHub Pages. These websites integrate with the annotation platform Hypothes.is, enabling them to be annotated. In this way, Git-Lit aims to make this collection of British Library electronic texts discoverable, readable, editable, annotatable, and downloadable.
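ALTO records every recognized word as an XML attribute with layout coordinates, which is why it is so verbose. As a minimal sketch of the kind of extraction Git-Lit performs (the project's actual pipeline is more involved, and the ALTO fragment below is illustrative, not taken from a British Library file):

```python
import xml.etree.ElementTree as ET

# A tiny illustrative ALTO fragment; real files carry coordinates
# and many more layout elements around each <String>.
ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <TextBlock>
      <TextLine>
        <String CONTENT="It"/><SP/><String CONTENT="was"/><SP/>
        <String CONTENT="a"/><SP/><String CONTENT="dark"/>
      </TextLine>
      <TextLine>
        <String CONTENT="and"/><SP/><String CONTENT="stormy"/><SP/>
        <String CONTENT="night."/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""

NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

def alto_to_plain_text(alto_xml: str) -> str:
    """Join the CONTENT attributes of ALTO <String> elements into
    plain text, emitting one output line per <TextLine>."""
    root = ET.fromstring(alto_xml)
    lines = []
    for text_line in root.iter("{http://www.loc.gov/standards/alto/ns-v2#}TextLine"):
        words = [s.attrib["CONTENT"] for s in text_line.findall("alto:String", NS)]
        lines.append(" ".join(words))
    return "\n".join(lines)

print(alto_to_plain_text(ALTO))
```

The plain text produced this way can then be wrapped in Markdown front matter and committed to a Git repository.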

A Screenshot of the Website Automatically Generated from the British Library Electronic Text


The biggest advantage of using a distributed version control system like Git is that it leverages the kinds of decentralized collaboration workflows that have long been in use in software development. Open-source software and web development, for which Git and GitHub were originally designed, is a much-studied methodology with a strong track record of sustaining large collaborative projects. Rather than maintaining a central silo for serving code and electronic texts, the decentralized approach ensures a plurality of textual versions. Since anyone may copy ("fork") a project, modify it, and create their own version, there is no one central, canonical text, but many. Each version may freely borrow ("pull") from others, request that others integrate its changes ("pull request"), and discuss potential changes ("issues") using GitHub's project management features. This workflow streamlines collaboration and encourages external contributions. Furthermore, since each change ("commit") must be accompanied by a message describing and motivating it, the Git platform encourages the kind of editorial documentation necessary for scholarly editing. We like to think of Git-based editing, therefore, as scholarly editing, and GitHub-based collaboration as a democratization of scholarly editing.

Furthermore, since GitHub allows instant editing of texts in the web browser, it is a simple and intuitive way of crowdsourcing the text cleanup process. OCR'd texts are often full of errors, and GitHub allows any reader who spots an obvious OCR error to correct it on the spot. The analogous process of reporting errors to centralized text repositories like Project Gutenberg has been known to take several years. On GitHub, however, it is instantaneous.

Not the least advantage of this setup is the automated creation of websites from the plain text sources. Not only does this transform the Markdown into a clean, readable edition of the text, but it provides integration with the annotation platform Hypothes.is. Hypothes.is allows for social annotation of a text, making it ideal for classroom use. Professors may assign a British Library text as a course reading and require that their students annotate it, an activity which can generate discussions in the limitless virtual margins of this electronic textual space.

The Git-Lit project has so far posted around 50 texts to GitHub, as prototypes, with the full corpus of roughly 50,000 texts soon to come. After the full corpus is processed in this way, we'll begin enhancing some of the metadata. So far, we have developed techniques for probabilistically inferring the language of each text, and using Ben Schmidt's document vectorization method, Stable Random Projection, we have been able to probabilistically infer Library of Congress classifications as well. This enables the automatic generation of sub-corpora such as PR (English literature) or PS (American literature).
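The project's actual language inference is probabilistic and more sophisticated than we can show here; as a rough illustration of the underlying idea, a toy stopword-overlap guesser might look like this (the stopword lists and the three languages are illustrative only):

```python
# Illustrative stopword lists; a real system would use trained
# probabilistic models over much larger vocabularies.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "a", "in", "is", "it", "was"},
    "fr": {"le", "la", "et", "les", "des", "un", "une", "est", "dans"},
    "de": {"der", "die", "und", "das", "ein", "ist", "nicht", "mit"},
}

def guess_language(text: str) -> str:
    """Guess the language whose stopword list overlaps most
    with the tokens of the input text."""
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & sw) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(guess_language("it was the best of times and the worst of times"))
```

With language labels in hand, per-language sub-corpora can be generated automatically in the same way as the Library of Congress sub-corpora described above.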

In the coming year, we hope to integrate the Git-Lit transformed British Library texts into a structured database, further enhancing the discoverability of the texts. We have just received a micro-grant from NYC-DH to help launch Corpus-DB, a project also aiming to produce textual corpora, and through Corpus-DB we will soon create a SQL database containing the original metadata, our enhanced and inferred metadata, and other aggregated book data gleaned from public APIs. This will allow readers and computational text analysts to download groups of British Library electronic texts. Users interested in downloading, say, all novels set in London will be able to get a complete full-text dump of all public-domain novels in this category by visiting a URL such as api.corpus-db.org/novels/setting/London. We expect that this will greatly streamline the corpus creation process that takes up so much of the time of a computational text analysis.

Both Git-Lit and Corpus-DB are open-source projects, open to contributions from anyone, regardless of skill. If you'd like to contribute to our project in some way, get in contact with us, and we'll tell you how you can help.

Jonathan Reeve

Jonathan Reeve is a third-year graduate student in the Department of English and Comparative Literature at Columbia University, where he specializes in computational literary analysis. Find his recent experiments at jonreeve.com.

If this blog post has stimulated your interest in working with the British Library's digital collections, start a project and enter it for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library to find out who wins.

Posted by BL Labs on behalf of Jonathan Reeve

14 February 2018

BL Labs 2017 Symposium: Movable Type, Commercial Award Winner

Movable Type is a tabletop word game, and something of a love letter to classic books and authors, made completely of custom playing cards. While the game’s appearance might remind you of Scrabble, it has some tricks up its sleeve to give it a much more modern and dynamic feel.

Figure 1: Movable Type Card Game


The initial idea for Movable Type was born around two years ago. I had been making games for some years and knew I wanted to do something with a word game. My main objective was to have a game that was very interactive, easy to grasp, and tactical, while also being very quick to play. As much as I love word games, some of them have a tendency to outstay their welcome. 

The central mechanism in Movable Type is called card-drafting. This method allows players to pick their letter cards each round – this does away with the large amounts of luck you find in many classic word games, and instead shifts attention onto the tactical decisions of the players. It also means that rules are kept very simple and that players can take their turns simultaneously, creating a much more dynamic play environment.

Movable Type - The Cards


The prototype for Movable Type was only a few weeks old when I settled on the art style I was going to use in the final product. I’ve been a long-time fan of the British Library Flickr account, which lets users browse through images from public domain books. Once I had spotted the large collection of initial capitals, I was sold!

Movable Type - Illustrated Letters


My wife, Tiffany Moon, is a graphic designer by trade. She helped clean up the images and present these beautiful pieces of art in a colourful new fashion, appropriate for a retail product. I also wanted portraits of some of my favourite authors to be in the game, so I commissioned Alisdair Wood to create woodcut-style images of ten classic, influential and diverse literary figures. Without the initial capitals from the British Library collection, which set the game's overall style, Movable Type would not look half as impressive, and it certainly wouldn't resonate with me and with players the way it does.

Movable Type - Illustrated Famous People


I launched Movable Type on the crowdfunding platform Kickstarter last year. Upon its release, it won the Imirt Irish Game Award for Best Analog Game and was second runner-up for Game of the Year. It was well received at several public events and sold out its initial print run, so I decided that a second edition of the game was in order. That bigger and better second edition is funding on Kickstarter and should be in some select retail stores by mid to late 2018 (fingers crossed!).

Movable Type's receiving the Commercial Category BL Labs Award was a huge boon for the reputation of the game and for me as a game designer. It was also a genuine honour to be at the British Library for the event and to share my product with the audience there.


Posted by BL Labs on behalf of Robin David O’Keeffe

13 February 2018

BL Labs 2017 Symposium: Samtla, Research Award Runner Up

Samtla (Search And Mining Tools for Labelling Archives) was developed to address a need in the humanities for research tools that help to search, browse, compare, and annotate documents stored in digital archives. The system was designed in collaboration with researchers at Southampton University, whose research involved locating shared vocabulary and phrases across an archive of Aramaic Magic Texts from Late Antiquity. The archive contained texts written in Aramaic, Mandaic, Syriac, and Hebrew languages. Due to the morphological complexity of these languages, where morphemes are attached to a root morpheme to mark gender and number, standard approaches and off-the-shelf software were not flexible enough for the task, as they tended to be designed to work with a specific archive or user group. 

Figure 1: Samtla supports tolerant search allowing queries to be matched exactly and approximately. (Click to enlarge image)

Samtla is designed to extract the same or similar information that may be expressed by authors in different ways, whether in the choice of vocabulary or the grammar. Traditionally, search and text mining tools have been based on words, which limits their use to corpora in languages where 'words' can be easily identified and extracted from the text, e.g. languages with a whitespace character like English, French, and German. Word models tend to fail when the language is morphologically complex, like Aramaic and Hebrew. Samtla addresses these issues by adopting a character-level approach stored in a statistical language model. This means that rather than extracting words, we extract character sequences representing the morphology of the language, which we then use to match the search terms of the query and rank the documents according to the statistics of the language. Character-based models are language independent, as there is no need to preprocess the document, and we can locate words and phrases with great flexibility. As a result, Samtla compensates for variability in language use, spelling errors made by users when they search, and errors in the document introduced by the digitisation process (e.g. OCR errors).
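Samtla's actual system ranks documents with a statistical language model over character sequences; as a simplified sketch of why character sequences help with morphologically complex languages, here is a toy character n-gram index in Python (the sample documents and the choice of n=4 are illustrative):

```python
from collections import defaultdict

def char_ngrams(text, n=4):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_index(documents, n=4):
    """Map each n-gram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index

def search(index, query, n=4):
    """Rank documents by how many of the query's n-grams they contain,
    so inflected or misspelled forms still partially match."""
    scores = defaultdict(int)
    for gram in char_ngrams(query, n):
        for doc_id in index.get(gram, ()):
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

docs = {1: "the singing of songs", 2: "a book of psalms"}
index = build_index(docs)
print(search(index, "singer"))  # matches doc 1 via the shared "sing" gram
```

Because matching happens below the word level, a query like "singer" still retrieves a document containing only "singing", with no stemming or other language-specific preprocessing required.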

Figure 2: Samtla's document comparison tool displaying a semantically similar passage between two Bibles from different periods. (Click to enlarge image)

The British Library have been very supportive of the work, openly providing access to their digital archives. The archives ranged in domain, topic, language, and scale, which enabled us to test Samtla's flexibility to its limits. One of the biggest challenges we faced was indexing larger-scale archives of several gigabytes. Some archives also contained a scan of the original document together with metadata about the structure of the text. This provided a basis for developing new tools that brought researchers closer to the original object, including highlighting named entities over both the raw text and the scanned image.

Currently we are focusing on developing approaches for leveraging the semantics underlying text data, in order to help researchers find semantically related information. Semantic annotation is also useful for labelling text data with named entities and sentiments. Our current aim is to develop approaches for annotating text data in any language or domain, which is challenging because languages encode the semantics of a text in different ways.

As a first step we are offering labelled data to researchers, as part of a trial service, in order to help speed up the research process, or provide tagged data for machine learning approaches. If you are interested in participating in this trial, then more information can be found at www.samtla.com.

Figure 3: Samtla's annotation tools label the texts with named entities to provide faceted browsing and data layers over the original image. (Click to enlarge image)



Posted by BL Labs on behalf of Dr Martyn Harris, Prof Dan Levene, Prof Mark Levene and Dr Dell Zhang.

05 February 2018

8th Century Arabic science meets today's computer science

Or, Announcing a Competition for the Automatic Transcription of Historical Arabic Scientific Manuscripts 

“An impartial view of Digital Humanities (DH) scholarship in the present day reveals a stark divide between ‘the West and the rest’…Far fewer large-scale DH initiatives have focused on Asia and the non-Western world than on Western Europe and the Americas…Digital databases and text corpora – the ‘raw material’ of text mining and computational text analysis – are far more abundant for English and other Latin alphabetic scripts than they are for Chinese, Japanese, Korean, Sanskrit, Hindi, Arabic and other non-Latin orthographies…Troves of unread primary sources lie dormant because no text mining technology exists to parse them.”

-Dr. Thomas Mullaney, Associate Professor of Chinese History at Stanford University

Supporting the use of Asian & African Collections in digital scholarship means shining a light on this stark divide and seeking ways to close the gap. In this spirit, we are excited to announce the ICFHR2018 Competition on Recognition of Historical Arabic Scientific Manuscripts.

Add MS 7474_0043.script

The Competition

Drawing together experts from the British Library, The Alan Turing Institute, the Qatar Digital Library and the PRImA Research Lab, our aim in launching this competition is to play an active role in advancing the state of the art in handwritten text recognition technologies for Arabic. For our first challenge we are focussing on finding an optimal solution for accurately and automatically transcribing historical Arabic scientific handwritten manuscripts.

Though such technologies are still in their infancy, unlocking historical handwritten Arabic manuscripts for large-scale text analysis has the potential to truly transform research. In conjunction with the competition, we hope to build a substantial image and ground-truth dataset and make it freely available, to support continued efforts in this area.

Enter the Competition

Organisers

Apostolos Antonacopoulos, Professor of Pattern Recognition, University of Salford, and head of the PRImA research lab
Christian Clausner, Research Fellow at the Pattern Recognition and Image Analysis (PRImA) research lab
Nora McGregor, Digital Curator at the British Library, Asian & African Collections
Daniel Lowe, Curator at the British Library, Arabic Collections
Daniel Wilson-Nunn, PhD student at the University of Warwick and Turing PhD student based at the Alan Turing Institute
Bink Hallum, Arabic Scientific Manuscripts Curator at the British Library/Qatar Foundation Partnership

Further reading

For more on recent Digital Research Team text recognition and transcription projects see:

 

This post is by Nora McGregor, Digital Curator, British Library. She is on Twitter as @ndalyrose.

Building a Handwritten Arabic Manuscript Ground Truth Dataset: يد واحدة لا تصفّق ('one hand can't clap')

Are you able to read handwritten Arabic from historical manuscripts such as these? Then we could use your help!

In conjunction with our ICFHR2018 Competition on Recognition of Historical Arabic Scientific Manuscripts it is our aim to build a substantial image and ground truth dataset that can be used as the basis for advancing research in historical handwritten Arabic text analysis. This data will be made freely available for anyone wishing to advance the state-of-the-art in optical character recognition technology. 

What is Ground Truth?

The Impact Centre of Competence in Digitisation explains:

In digital imaging and OCR, ground truth is the objective verification of the particular properties of a digital image, used to test the accuracy of automated image analysis processes. The ground truth of an image’s text content, for instance, is the complete and accurate record of every character and word in the image. This can be compared to the output of an OCR engine and used to assess the engine’s accuracy, and how important any deviation from ground truth is in that instance.

The task of creating such a dataset is enormous, however, so we're looking to build a network of people who might be interested in sparing some time to transcribe a page or two.
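The comparison described in the quote above is typically scored as a character error rate (CER): the edit distance between the ground truth and the OCR output, divided by the length of the ground truth. A minimal sketch in Python (the sample strings are invented for illustration):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edits needed to turn the OCR output into the ground truth,
    normalised by the ground truth's length."""
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)

gt = "in the beginning"
ocr = "in tbe beginn1ng"   # two character substitutions
print(character_error_rate(gt, ocr))  # 2 edits / 16 characters = 0.125
```

The lower the CER against a trusted ground-truth transcription, the better the recognition engine, which is exactly why hand-transcribed pages are so valuable for the competition.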

If you're interested in learning more, and possibly contributing, we would love to hear from you! Please send us your details and we'll be in touch about upcoming workshops and activities to be held both in London and remotely.

 

03 February 2018

Fashion Design Competition Winner Announced

The British Library has recently run a fashion design competition using our digital collections to inspire a new generation of fashion designers. In partnership with the British Fashion Council and the fashion house Teatum Jones, students from universities across the UK were invited to create a portfolio telling a story inspired by the British Library's Flickr collection.

The Flickr Commons collection contains over one million images 'snipped' from around 65,000 digitised books, largely from the 19th century. This collection is ideal for creative researchers and has already inspired artists and designers.

Fashion competition 1

This competition challenged students’ creativity and story-telling through design and research skills. Working to a brief set by the Library and Teatum Jones, students were given free rein to find inspiration from across the Library’s collections, ranging from South American costumes to aviation.

To open the competition and welcome students to the Library, we held a Fashion Research Masterclass as part of our postgraduate open day programme in October 2017. The day featured talks from experts across the Library and an inspirational show and tell focusing on a variety of collections that could inspire fashion research and design.

Fashion competition 2

The competition judging took place in January 2018, with 8 finalists from across the UK presenting their portfolios to a panel of specialists from the fashion sector including Paul Baptiste (Accessories Buying Manager, Fenwick), Catherine Teatum and Rob Jones (Creative Directors, Teatum Jones), Judith Rosser-Davies (British Fashion Council) and Mahendra Mahey (British Library Labs). Unfortunately we could only have one winner and we are delighted to announce Alanna Hilton from Edinburgh College of Art as our winner.

Alanna’s collection ‘Unlabelled’ has designs “that reject labelling [and] where consumers of all ages, sizes, ethnicities and genders can find beautiful clothes”. For winning the competition, Alanna received a financial prize from the British Fashion Council and membership of the British Library membership scheme. All of the submissions were incredibly strong, and we were very pleased to see the students all taking such different design routes, inspired by our collections.

The Library will continue to explore how we facilitate the creative research process, as well as how to link it to the business support provided by our Business and IP Centre. Our work with the fashion students highlighted the real variety in the ways researchers access our collections, find inspiration and use the Library.

We are continuing to work with the British Fashion Council, exploring how to better reach this part of the research community and we will have an update from our winning designer later in the year. The project is also grateful to Innovate UK for their support.

This post is by Edmund Connolly from the British Library's Higher Education & Cultural Engagement department; he is on Twitter as @Ed_Connolly.

02 February 2018

Converting Privy Council Appeals Metadata to Linked Data

To continue the series of posts on metadata about appeals to the Judicial Committee of the Privy Council, this post describes the process of converting this data to Linked Data. In the previous post, I briefly explained the concept of Linked Data and outlined the potential benefits of applying this approach to the JCPC dataset. An earlier post explained how cleaning the data enabled me to produce some initial visualisations; a post on the Social Science blog provides some historical context about the JCPC itself.

Data Model

In my previous post, I included the following diagram to show how the Linked JCPC Data might be structured.

JCPCDataModelHumanReadable_V1_20180104

To convert the dataset to Linked Data using this model, each entity represented by a blue node, and each class and property represented by the purple and green nodes, needs a unique identifier known as a Uniform Resource Identifier (URI). For the entities, I generated these URIs myself based on guidelines provided by the British Library, using the following structure:

  • http://data.bl.uk/jcpc/id/appeal/...
  • http://data.bl.uk/jcpc/id/judgment/...
  • http://data.bl.uk/jcpc/id/location/...

In the above URIs, the ‘...’ is replaced by a unique reference to a particular appeal, judgment, or location, e.g. a combination of the judgment number and year.
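Minting such URIs can be automated once the local-name convention is fixed. As a small sketch (the underscore separator and the sample judgment number are my own assumptions, not necessarily the convention used in the actual dataset):

```python
BASE = "http://data.bl.uk/jcpc/id"

def mint_uri(entity_type: str, *parts) -> str:
    """Build a URI like http://data.bl.uk/jcpc/id/judgment/1895_12
    from an entity type and the fields that make the entity unique,
    e.g. a judgment number and year."""
    local_name = "_".join(str(p) for p in parts)
    return f"{BASE}/{entity_type}/{local_name}"

print(mint_uri("judgment", 1895, 12))
```

Deriving the local name deterministically from the source fields means the same appeal, judgment, or location always receives the same URI on re-runs of the conversion.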

To ensure that the data can easily be understood by a computer and linked to other datasets, the classes and properties should be represented by existing URIs from established ontologies. An ontology is a controlled vocabulary (like a thesaurus) that not only defines terms relating to a subject area, but also defines the relationships between those terms. Generic properties and classes, such as titles, dates, names and locations, can be represented by established ontologies like Dublin Core, Friend of a Friend (FOAF) and vCard.

After considerable searching I was unable to find any online ontologies that precisely represent the legal concepts in the JCPC dataset. Instead, I decided to use relevant terms from Wikidata, where available, and to create terms in a new JCPC ontology for those entities and concepts not defined elsewhere. Taking this approach allowed me to concentrate my efforts on the process of conversion, but the possibility remains to align these terms with appropriate legal ontologies in future.

An updated version of the data model shows the ontology terms used for classes and properties (purple and green boxes):

JCPCDataModel_V9_20180104

Rather than include the full URI for each property or class, the first part of the URI is represented by a prefix, e.g. ‘foaf’, which is followed by the specific term, e.g. ‘name’, separated by a colon.
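Expanding a prefixed term back to its full URI is a simple table lookup. In this sketch the prefix table is illustrative, and 'dc' is assumed to map to the Dublin Core terms namespace (some datasets instead use the older dc/elements/1.1 namespace):

```python
# Illustrative prefix table; 'dc' here is assumed to mean Dublin Core terms.
PREFIXES = {
    "dc": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

def expand(curie: str) -> str:
    """Expand a prefixed term like 'foaf:name' to its full URI."""
    prefix, term = curie.split(":", 1)
    return PREFIXES[prefix] + term

print(expand("foaf:name"))
```

Linked Data serialisations such as Turtle declare these prefixes once at the top of the file, so every triple can use the short form.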

More Data Cleaning

The data model diagram also helped identify fields in the spreadsheet that required further cleaning before conversion could take place. This cleaning largely involved editing the Appellant and Respondent fields to separate multiple parties that originally appeared in the same cell and to move descriptive information to the Appellant/Respondent Description column. For those parties whose names were identical, I additionally checked the details of the case to determine whether they were in fact the same person appearing in multiple appeals/judgments.

Reconciliation

Reconciliation is the process of aligning identifiers for entities in one dataset with the identifiers for those entities in another dataset. If these entities are connected using Linked Data, this process implicitly links all the information about the entity in one dataset to the entity in the other dataset. For example, one of the people in the JCPC dataset is H. G. Wells – if we link the JCPC instance of H. G. Wells to his Wikidata identifier, this will then facilitate access to further information about H. G. Wells from Wikidata:

ReconciliationExample_V1_20180115

 Rather than look up each of these entities manually, I used a reconciliation service provided by OpenRefine, a piece of software I used previously for cleaning the JCPC data. The reconciliation service automatically looks up each value in a particular column from an external source (e.g. an authority file) specified by the user. For each value, it either provides a definite match or a selection of possible matches to choose from. Consultant and OpenRefine guru Owen Stephens has put together a couple of really helpful screencasts on reconciliation.

While reconciliation is very clever, it still requires some human intervention to ensure accuracy. The reconciliation service will match entities with similar names, but they might not necessarily refer to exactly the same thing. As we know, many people have the same name, and the same place names appear in multiple locations all over the world. I therefore had to check all matches that OpenRefine said were ‘definite’, and discard those that matched the name but referred to an incorrect entity.

Locations

I initially looked for a suitable gazetteer or authority file to which I could link the various case locations. My first port of call was Geonames, the standard authority file for linking location data. This was encouraging, as it does include alternative and historical place names for modern places. However, it doesn't contain any additional information about the dates for which each name was valid, or the geographical boundaries of the place at different times (the historical/political nature of the geography of this period was highlighted in a previous post). I additionally looked for openly-available digital gazetteers for the relevant historical period (1860-1998), but unfortunately none yet seem to exist. However, I have recently become aware of the University of Pittsburgh’s World Historical Gazetteer project, and will watch its progress with interest. For now, Geonames seems like the best option, while being aware of its limitations.

Courts

Although there have been attempts to create standard URIs for courts, there doesn’t yet seem to be a suitable authority file to which I could reconcile the JCPC data. Instead, I decided to use the Virtual International Authority File (VIAF), which combines authority files from libraries all over the world. Matches were found for most of the courts contained in the dataset.

Parties

For the parties involved in the cases, I initially also used VIAF, which resulted in few definite matches. I therefore additionally decided to reconcile Appellant, Respondent, Intervenant and Third Party data to Wikidata. This was far more successful than VIAF, resulting in a combined total of about 200 matches. As a result, I was able to identify cases involving H. G. Wells, Bob Marley, and Frederick Deeming, one of the prime suspects for the Jack the Ripper murders. Due to time constraints, I was only able to check those matches identified as ‘definite’; more could potentially be found by looking at each party individually and selecting any appropriate matches from the list of possible options.

Conversion

Once the entities were separated from each other and reconciled to external sources (where possible), the data was ready to convert to Linked Data. I did this using LODRefine, a version of OpenRefine packaged with plugins for producing Linked Data. LODRefine converts an OpenRefine project to Linked Data based on an ‘RDF skeleton’ specified by the user. RDF stands for Resource Description Framework, and is the standard by which Linked Data is represented. It describes each relationship in the dataset as a triple, comprising a subject, predicate and object. The subject is the entity you’re describing, the object is either a piece of information about that entity or another entity, and the predicate is the relationship between the two. For example, in the data model diagram we have the following relationship:

  AppealTitleTriple_V1_20180108

This is a triple, where the URI for the Appeal is the subject, the URI dc:title (the property ‘title’ in the Dublin Core terms vocabulary) is the predicate, and the value of the Appeal Title column is the object. I expressed each of the relationships in the data model as a triple like this one in LODRefine’s RDF skeleton. Once this was complete, it was simply a case of clicking LODRefine’s ‘Export’ button and selecting one of the available RDF formats. Having previously spent considerable time writing code to convert data to RDF, I was surprised and delighted by how quick and simple this process was.
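In practice LODRefine handles the serialization, but to make the triple concrete, here is a sketch that writes one such statement in Turtle by hand; the appeal reference and title are hypothetical, and 'dc:title' is assumed to expand to the Dublin Core terms namespace:

```python
DCTERMS = "http://purl.org/dc/terms/"   # assumed expansion of the 'dc' prefix
JCPC = "http://data.bl.uk/jcpc/id/"

def to_turtle(subject: str, predicate: str, obj: str) -> str:
    """Serialize one triple as a Turtle statement: URIs go in angle
    brackets, literal values in double quotes."""
    o = f"<{obj}>" if obj.startswith("http") else f'"{obj}"'
    return f"<{subject}> <{predicate}> {o} ."

# A hypothetical appeal and title, standing in for one spreadsheet row.
print(to_turtle(JCPC + "appeal/1933_0077", DCTERMS + "title", "Example Appeal Title"))
```

Each row of the cleaned spreadsheet yields a handful of statements like this one, and the full export is simply their concatenation.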

Publication

The Linked Data version of the JCPC dataset is not yet available online as we’re currently going through the process of ascertaining the appropriate licence to publish it under. Once this is confirmed, the dataset will be available to download from data.bl.uk in both RDF/XML and Turtle formats.

The next post in this series will look at what can be done with the JCPC data following its conversion to Linked Data.

This post is by Sarah Middle, a PhD placement student at the British Library researching the appeal cases heard by the Judicial Committee of the Privy Council (JCPC). Sarah is on Twitter as @digitalshrew.

01 February 2018

BL Labs 2017 Symposium: A large-scale comparison of world music corpora with computational tools, Research Award Winner


By Maria Panteli, Emmanouil Benetos, and Simon Dixon from the Centre for Digital Music, Queen Mary University of London

The comparative analysis of world music cultures has been the focus of several ethnomusicological studies in the last century. With the advances of Music Information Retrieval and the increased accessibility of sound archives, large-scale analysis of world music with computational tools is today feasible. We combine music recordings from two archives, the Smithsonian Folkways Recordings and the British Library Sound Archive, to create one of the largest world music corpora studied so far (8,200 geographically balanced recordings sampled from a total of 70,000 recordings). This work was submitted for the 2017 British Library Labs Awards, Research category.

Our aim is to explore relationships of music similarity between different parts of the world. The history of cultural exchange goes back many years and music, an essential cultural identifier, has travelled beyond country borders. But is this true for all countries? What if a country is geographically isolated or its society resisted external musical influence? Can we find such music examples whose characteristics stand out from other musics in the world? By comparing folk and traditional music from 137 countries we aim to identify geographical areas that have developed a unique musical character.

Maria Panteli fig 1

Methodology: Signal processing and machine learning methods are combined to extract meaningful music representations from the sound recordings. Data mining methods are applied to explore music similarity and identify outlier recordings.

We use digital signal processing tools to extract music descriptors from the sound recordings capturing aspects of rhythm, timbre, melody, and harmony. Machine learning methods are applied to learn high-level representations of the music and the outcome is a projection of world music recordings to a space respecting music similarity relations. We use data mining methods to explore this space and identify music recordings that are most distinct compared to the rest of our corpus. We refer to these recordings as ‘outliers’ and study their geographical patterns. More details on the methodology are provided here.
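The study's actual pipeline learns feature representations with machine learning before mining for outliers; as a toy sketch of that final data mining step, here is a distance-from-centroid outlier detector over plain feature vectors (the vectors and the 2-standard-deviation threshold are illustrative, not the study's actual criterion):

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def find_outliers(vectors, threshold=2.0):
    """Flag recordings whose feature vector lies more than `threshold`
    standard deviations above the mean distance to the centroid."""
    dims = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    dists = [euclidean(v, centroid) for v in vectors]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return [i for i, d in enumerate(dists) if std and (d - mean) / std > threshold]

# Nine tightly clustered recordings and one distant one (index 9).
vectors = [[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5],
           [0, 0.5], [0.5, 0], [1, 0.5], [0.5, 1], [9, 9]]
print(find_outliers(vectors))  # the distant recording is flagged
```

In the real study each vector would be a learned representation of a recording's rhythm, timbre, melody, and harmony, and the flagged indices would correspond to the 'outlier' recordings whose geographical patterns are analysed below.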

 

Maria Panteli fig 2

Distribution of outliers per country: The colour scale corresponds to the normalised number of outliers per country, where 0% indicates that none of the recordings of the country were identified as outliers and 100% indicates that all of the recordings of the country are outliers.

We observed that out of 137 countries, Botswana had the most outlier recordings compared to the rest of the corpus. Music from China, characterised by bright timbres, was also found to be relatively distinct compared to music from its neighbouring countries. Analysis with respect to different features revealed that African countries such as Benin and Botswana showed the largest number of rhythmic outliers, with recordings often featuring polyrhythms. Harmonic outliers originated mostly from South and Southeast Asian countries such as Pakistan and Indonesia, and African countries such as Benin and Gambia, with recordings often featuring inharmonic instruments such as the gong and bell. You can explore and listen to music outliers in this interactive visualisation. The datasets and code used in this project are included in this link.

Maria Panteli fig 3

Interactive visualisation to explore and listen to music outliers.

This line of research makes a large-scale comparison of recorded music possible, a significant contribution to ethnomusicology, and one we believe will help us better understand the music cultures of the world.

Posted by British Library Labs.