Digital scholarship blog

12 posts from February 2018

05 February 2018

Building a Handwritten Arabic Manuscript Ground Truth Dataset يد واحدة لا تصفـّق

Are you able to read handwritten Arabic from historical manuscripts such as these? Then we could use your help!

In conjunction with our ICFHR2018 Competition on Recognition of Historical Arabic Scientific Manuscripts, we aim to build a substantial image and ground truth dataset that can serve as the basis for advancing research in historical handwritten Arabic text analysis. This data will be made freely available to anyone wishing to advance the state of the art in optical character recognition technology.

What is Ground Truth?

The Impact Centre of Competence in Digitisation explains:

In digital imaging and OCR, ground truth is the objective verification of the particular properties of a digital image, used to test the accuracy of automated image analysis processes. The ground truth of an image’s text content, for instance, is the complete and accurate record of every character and word in the image. This can be compared to the output of an OCR engine and used to assess the engine’s accuracy, and how important any deviation from ground truth is in that instance.
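As a generic illustration of the comparison described above (not the competition's official evaluation metric), the sketch below scores an OCR output against a ground truth transcription using character-level edit distance:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """CER = edit distance divided by ground-truth length."""
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)

# One substituted character out of ten:
print(character_error_rate("manuscript", "manuscr1pt"))  # 0.1
```

The lower the character error rate, the less the OCR output deviates from ground truth; a complete and accurate ground truth transcription is what makes this measurement possible.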

The task of creating such a dataset is enormous, however, so we're looking to build a network of people who might be interested in sparing some time to transcribe a page or two.

If you're interested in learning more, and possibly contributing, we would love to hear from you! Please send us your details and we'll be in touch about upcoming workshops and activities to be held both in London and remotely.

 

03 February 2018

Fashion Design Competition Winner Announced

The British Library has recently run a fashion design competition using our digital collections to inspire a new generation of fashion designers. In partnership with the British Fashion Council and fashion house Teatum Jones, students from universities across the UK were invited to create a fashion portfolio telling an inspirational story drawn from the British Library’s Flickr Collection.

The Flickr Commons Collection contains over 1 million images ‘snipped’ from around 65,000 digitised books, largely from the 19th century. This collection is ideal for creative researchers and has already inspired artists and designers.

Fashion competition 1

This competition challenged students’ creativity and story-telling through design and research skills. Working to a brief set by the Library and Teatum Jones, students were given free rein to find inspiration from across the Library’s collections, ranging from South American costumes to aviation.

To open the competition and welcome students to the Library, we held a Fashion Research Masterclass as part of our postgraduate open day programme in October 2017. The day featured talks from experts across the Library and an inspirational show and tell focusing on a variety of collections that could inspire fashion research and design.

Fashion competition 2

The competition judging took place in January 2018, with 8 finalists from across the UK presenting their portfolios to a panel of specialists from the fashion sector including Paul Baptiste (Accessories Buying Manager, Fenwick), Catherine Teatum and Rob Jones (Creative Directors, Teatum Jones), Judith Rosser-Davies (British Fashion Council) and Mahendra Mahey (British Library Labs). Unfortunately we could only have one winner and we are delighted to announce Alanna Hilton from Edinburgh College of Art as our winner.

Alanna’s collection ‘Unlabelled’ has designs “that reject labelling [and] where consumers of all ages, sizes, ethnicities and genders can find beautiful clothes”. For winning the competition, Alanna received a financial prize from the British Fashion Council and a British Library membership. All of the submissions were incredibly strong, and we were very pleased to see the students taking such different design routes, inspired by our collections.

The Library will continue to explore how we facilitate the creative research process, as well as how we link it to the business support provided by our Business and IP Centre. Our work with the fashion students highlighted the real variety in the ways researchers access our collections, find inspiration and use the Library.

We are continuing to work with the British Fashion Council, exploring how to better reach this part of the research community and we will have an update from our winning designer later in the year. The project is also grateful to Innovate UK for their support.

This post is by Edmund Connolly from the British Library's Higher Education & Cultural Engagement department; he is on Twitter as @Ed_Connolly.

02 February 2018

Converting Privy Council Appeals Metadata to Linked Data

To continue the series of posts on metadata about appeals to the Judicial Committee of the Privy Council, this post describes the process of converting this data to Linked Data. In the previous post, I briefly explained the concept of Linked Data and outlined the potential benefits of applying this approach to the JCPC dataset. An earlier post explained how cleaning the data enabled me to produce some initial visualisations; a post on the Social Science blog provides some historical context about the JCPC itself.

Data Model

In my previous post, I included the following diagram to show how the Linked JCPC Data might be structured.

[Diagram: JCPC Linked Data model, human-readable version (JCPCDataModelHumanReadable_V1_20180104)]

To convert the dataset to Linked Data using this model, each entity represented by a blue node, and each class and property represented by the purple and green nodes, needs a unique identifier known as a Uniform Resource Identifier (URI). For the entities, I generated these URIs myself based on guidelines provided by the British Library, using the following structure:

  • http://data.bl.uk/jcpc/id/appeal/...
  • http://data.bl.uk/jcpc/id/judgment/...
  • http://data.bl.uk/jcpc/id/location/...

In the above URIs, the ‘...’ is replaced by a unique reference to a particular appeal, judgment, or location, e.g. a combination of the judgment number and year.
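A minimal sketch of how such a URI might be minted follows; note that the exact slug format used for the JCPC data isn't specified here, so the `year_number` scheme below is an assumption for illustration only:

```python
BASE = "http://data.bl.uk/jcpc/id"

def judgment_uri(number: int, year: int) -> str:
    # Hypothetical slug scheme: combine the judgment year and number
    # into a single unique reference for the judgment entity.
    return f"{BASE}/judgment/{year}_{number}"

print(judgment_uri(12, 1895))  # http://data.bl.uk/jcpc/id/judgment/1895_12
```

Whatever the scheme, the essential requirement is that each appeal, judgment, or location receives exactly one stable URI, so that every statement about it can be attached to the same identifier.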

To ensure that the data can easily be understood by a computer and linked to other datasets, the classes and properties should be represented by existing URIs from established ontologies. An ontology is a controlled vocabulary (like a thesaurus) that not only defines terms relating to a subject area, but also defines the relationships between those terms. Generic properties and classes, such as titles, dates, names and locations, can be represented by established ontologies like Dublin Core, Friend of a Friend (FOAF) and vCard.

After considerable searching I was unable to find any online ontologies that precisely represent the legal concepts in the JCPC dataset. Instead, I decided to use relevant terms from Wikidata, where available, and to create terms in a new JCPC ontology for those entities and concepts not defined elsewhere. Taking this approach allowed me to concentrate my efforts on the process of conversion, but the possibility remains to align these terms with appropriate legal ontologies in future.

An updated version of the data model shows the ontology terms used for classes and properties (purple and green boxes):

[Diagram: updated JCPC data model showing ontology terms for classes and properties (JCPCDataModel_V9_20180104)]

Rather than include the full URI for each property or class, the first part of the URI is represented by a prefix, e.g. ‘foaf’, which is followed by the specific term, e.g. ‘name’, separated by a colon.
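This prefix-plus-local-name convention can be sketched as a simple lookup; the namespace URIs below are the standard ones for Dublin Core terms and FOAF:

```python
# Standard namespace URIs for two of the ontologies mentioned above.
PREFIXES = {
    "dc":   "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

def expand(curie: str) -> str:
    """Expand a prefixed name like 'foaf:name' to its full URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local

print(expand("foaf:name"))  # http://xmlns.com/foaf/0.1/name
print(expand("dc:title"))   # http://purl.org/dc/terms/title
```

Serialisation formats such as Turtle perform exactly this expansion when a document declares its prefixes, which keeps the data human-readable without sacrificing globally unique identifiers.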

More Data Cleaning

The data model diagram also helped identify fields in the spreadsheet that required further cleaning before conversion could take place. This cleaning largely involved editing the Appellant and Respondent fields to separate multiple parties that originally appeared in the same cell and to move descriptive information to the Appellant/Respondent Description column. For those parties whose names were identical, I additionally checked the details of the case to determine whether they were in fact the same person appearing in multiple appeals/judgments.

Reconciliation

Reconciliation is the process of aligning identifiers for entities in one dataset with the identifiers for those entities in another dataset. If these entities are connected using Linked Data, this process implicitly links all the information about the entity in one dataset to the entity in the other dataset. For example, one of the people in the JCPC dataset is H. G. Wells – if we link the JCPC instance of H. G. Wells to his Wikidata identifier, this will then facilitate access to further information about H. G. Wells from Wikidata:

[Diagram: reconciliation example linking the JCPC instance of H. G. Wells to his Wikidata identifier (ReconciliationExample_V1_20180115)]

Rather than look up each of these entities manually, I used a reconciliation service through OpenRefine, a piece of software I had previously used for cleaning the JCPC data. The reconciliation service automatically looks up each value in a particular column against an external source (e.g. an authority file) specified by the user. For each value, it provides either a definite match or a selection of possible matches to choose from. Consultant and OpenRefine guru Owen Stephens has put together a couple of really helpful screencasts on reconciliation.

While reconciliation is very clever, it still requires some human intervention to ensure accuracy. The reconciliation service will match entities with similar names, but they might not necessarily refer to exactly the same thing. As we know, many people have the same name, and the same place names appear in multiple locations all over the world. I therefore had to check all matches that OpenRefine said were ‘definite’, and discard those that matched the name but referred to an incorrect entity.
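The underlying problem can be illustrated with a simple string-similarity score (OpenRefine's reconciliation scoring is more sophisticated; `difflib` here is just a stand-in, and the candidate names are invented):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude name-similarity score between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

query = "H. G. Wells"
for candidate in ["H. G. Wells", "H. Wells", "George Wells"]:
    score = similarity(query, candidate)
    # Even a perfect string match can refer to a different person,
    # which is why every 'definite' match still needs a human check.
    print(f"{candidate}: {score:.2f}")
```

A score of 1.0 only tells us the names are identical, not that the entities are; checking the case details against the external record is what turns a string match into a genuine identification.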

Locations

I initially looked for a suitable gazetteer or authority file to which I could link the various case locations. My first port of call was Geonames, the standard authority file for linking location data. This was encouraging, as it does include alternative and historical place names for modern places. However, it doesn't contain any additional information about the dates for which each name was valid, or the geographical boundaries of the place at different times (the historical/political nature of the geography of this period was highlighted in a previous post). I additionally looked for openly-available digital gazetteers for the relevant historical period (1860-1998), but unfortunately none yet seem to exist. However, I have recently become aware of the University of Pittsburgh’s World Historical Gazetteer project, and will watch its progress with interest. For now, Geonames seems like the best option, while being aware of its limitations.

Courts

Although there have been attempts to create standard URIs for courts, there doesn’t yet seem to be a suitable authority file to which I could reconcile the JCPC data. Instead, I decided to use the Virtual International Authority File (VIAF), which combines authority files from libraries all over the world. Matches were found for most of the courts contained in the dataset.

Parties

For the parties involved in the cases, I initially also used VIAF, which resulted in few definite matches. I therefore additionally decided to reconcile Appellant, Respondent, Intervenant and Third Party data to Wikidata. This was far more successful than VIAF, resulting in a combined total of about 200 matches. As a result, I was able to identify cases involving H. G. Wells, Bob Marley, and Frederick Deeming, one of the prime suspects for the Jack the Ripper murders. Due to time constraints, I was only able to check those matches identified as ‘definite’; more could potentially be found by looking at each party individually and selecting any appropriate matches from the list of possible options.

Conversion

Once the entities were separated from each other and reconciled to external sources (where possible), the data was ready to convert to Linked Data. I did this using LODRefine, a version of OpenRefine packaged with plugins for producing Linked Data. LODRefine converts an OpenRefine project to Linked Data based on an ‘RDF skeleton’ specified by the user. RDF stands for Resource Description Framework, and is the standard by which Linked Data is represented. It describes each relationship in the dataset as a triple, comprising a subject, predicate and object. The subject is the entity you’re describing, the object is either a piece of information about that entity or another entity, and the predicate is the relationship between the two. For example, in the data model diagram we have the following relationship:

[Diagram: example triple linking an Appeal to its title via dc:title (AppealTitleTriple_V1_20180108)]

This is a triple, where the URI for the Appeal is the subject, the URI dc:title (the property ‘title’ in the Dublin Core terms vocabulary) is the predicate, and the value of the Appeal Title column is the object. I expressed each of the relationships in the data model as a triple like this one in LODRefine’s RDF skeleton. Once this was complete, it was simply a case of clicking LODRefine’s ‘Export’ button and selecting one of the available RDF formats. Having previously spent considerable time writing code to convert data to RDF, I was surprised and delighted by how quick and simple this process was.
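Such a triple can also be written out by hand in Turtle syntax; the sketch below is a minimal illustration in which the appeal slug and title are hypothetical, and a real Turtle file would additionally declare the `dc:` prefix with an `@prefix` line:

```python
def turtle_triple(subject_uri: str, predicate_curie: str, literal: str) -> str:
    # Serialise one subject-predicate-object triple in Turtle syntax:
    # the subject URI in angle brackets, the predicate as a prefixed
    # name, and the object as a quoted literal, terminated by ' .'.
    return f'<{subject_uri}> {predicate_curie} "{literal}" .'

triple = turtle_triple(
    "http://data.bl.uk/jcpc/id/appeal/1895_12",  # hypothetical appeal URI
    "dc:title",
    "Example Appellant v Example Respondent",    # hypothetical title
)
print(triple)
```

Tools like LODRefine generate thousands of such statements automatically from the RDF skeleton, which is why the export step is so quick once the skeleton is defined.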

Publication

The Linked Data version of the JCPC dataset is not yet available online as we’re currently going through the process of ascertaining the appropriate licence to publish it under. Once this is confirmed, the dataset will be available to download from data.bl.uk in both RDF/XML and Turtle formats.

The next post in this series will look at what can be done with the JCPC data following its conversion to Linked Data.

This post is by Sarah Middle, a PhD placement student at the British Library researching the appeal cases heard by the Judicial Committee of the Privy Council (JCPC). Sarah is on Twitter as @digitalshrew.

01 February 2018

BL Labs 2017 Symposium: A large-scale comparison of world music corpora with computational tools, Research Award Winner

A large-scale comparison of world music corpora with computational tools.

By Maria Panteli, Emmanouil Benetos, and Simon Dixon from the Centre for Digital Music, Queen Mary University of London

The comparative analysis of world music cultures has been the focus of several ethnomusicological studies in the last century. With the advances of Music Information Retrieval and the increased accessibility of sound archives, large-scale analysis of world music with computational tools is today feasible. We combine music recordings from two archives, the Smithsonian Folkways Recordings and the British Library Sound Archive, to create one of the largest world music corpora studied so far (8,200 geographically balanced recordings sampled from a total of 70,000 recordings). This work won the Research category of the 2017 British Library Labs Awards.

Our aim is to explore relationships of music similarity between different parts of the world. The history of cultural exchange goes back many years and music, an essential cultural identifier, has travelled beyond country borders. But is this true for all countries? What if a country is geographically isolated or its society resisted external musical influence? Can we find such music examples whose characteristics stand out from other musics in the world? By comparing folk and traditional music from 137 countries we aim to identify geographical areas that have developed a unique musical character.

Maria Panteli fig 1

Methodology: Signal processing and machine learning methods are combined to extract meaningful music representations from the sound recordings. Data mining methods are applied to explore music similarity and identify outlier recordings.

We use digital signal processing tools to extract music descriptors from the sound recordings capturing aspects of rhythm, timbre, melody, and harmony. Machine learning methods are applied to learn high-level representations of the music and the outcome is a projection of world music recordings to a space respecting music similarity relations. We use data mining methods to explore this space and identify music recordings that are most distinct compared to the rest of our corpus. We refer to these recordings as ‘outliers’ and study their geographical patterns. More details on the methodology are provided here.
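The outlier step can be illustrated with a much simpler stand-in for the learned feature space described above: the sketch below flags recordings whose feature vectors sit unusually far from the corpus mean, using per-dimension z-scores (the feature values and the threshold are invented for illustration, not taken from the study):

```python
from statistics import mean, stdev

def outliers(recordings, threshold=1.5):
    """Flag recordings whose features lie, on average across dimensions,
    more than `threshold` standard deviations from the corpus mean."""
    names = list(recordings)
    dims = len(next(iter(recordings.values())))
    mus = [mean(recordings[n][d] for n in names) for d in range(dims)]
    sds = [stdev(recordings[n][d] for n in names) for d in range(dims)]
    flagged = []
    for n in names:
        z = mean(abs(recordings[n][d] - mus[d]) / sds[d] for d in range(dims))
        if z > threshold:
            flagged.append(n)
    return flagged

# Toy two-dimensional rhythm/timbre features (made up for illustration).
corpus = {
    "rec_a": [0.50, 0.40], "rec_b": [0.55, 0.45], "rec_c": [0.52, 0.41],
    "rec_d": [0.48, 0.44], "rec_e": [3.00, 2.50],  # clearly distinct
}
print(outliers(corpus))  # ['rec_e']
```

The study's actual pipeline learns its feature space with machine learning and uses more robust outlier detection over thousands of recordings, but the principle is the same: recordings far from the bulk of the corpus in feature space are the candidates for a distinct musical character.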

 

  Maria Panteli fig 2

 

Distribution of outliers per country: The colour scale corresponds to the normalised number of outliers per country, where 0% indicates that none of the recordings of the country were identified as outliers and 100% indicates that all of the recordings of the country are outliers.

We observed that, out of 137 countries, Botswana had the most outlier recordings compared to the rest of the corpus. Music from China, characterised by bright timbres, was also found to be relatively distinct compared to music from its neighbouring countries. Analysis with respect to different features revealed that African countries such as Benin and Botswana showed the largest number of rhythmic outliers, with recordings often featuring polyrhythms. Harmonic outliers originated mostly from South and Southeast Asian countries such as Pakistan and Indonesia, and African countries such as Benin and Gambia, with recordings often featuring inharmonic instruments such as the gong and bell. You can explore and listen to music outliers in this interactive visualisation. The datasets and code used in this project are included in this link.

Maria Panteli fig 3

Interactive visualisation to explore and listen to music outliers.

This line of research makes large-scale comparison of recorded music possible, a significant contribution to ethnomusicology, and one we believe will help us better understand the music cultures of the world.

Posted by British Library Labs.