Digital scholarship blog

5 posts from March 2015

31 March 2015

Digital Music Events at the British Library

On Friday 13 March, we hosted here at the British Library the final workshop for the Digital Music Lab (DML) project. Prof. Lorna Hughes, Chair in Digital Humanities at the School of Advanced Study, opened the event with a stimulating talk entitled Digital Humanities, Big Data and New Research Methods, in which she discussed the challenges faced by today's digital humanists when working with big data. The rest of the morning was dedicated to short presentations by the project members, who explained to the audience the aims of the DML project: developing methods and technologies to extract, visualise and analyse thousands of audio recordings from three major repositories (British Library Sounds, CHARM and I Like Music) in order to compare information on relationships between different musical genres; discover patterns within similar musical styles; and visualise changes in tonality, pitch and tempo across a variety of genres, as well as within a single piece recorded by different artists at different times and in different locations.

Dr. Tillman Weyde, Principal Investigator of the Digital Music Lab project, describing the project.

In the afternoon attendees had the chance to explore the DML Web interface that is being developed to visualise and compare audio collections from the datasets used by the project. We had useful feedback from participants and are now working on the improvement of the tool in response to the comments received. The day ended with a general debate in which attendees could share their comments not only on the project but also on specific requests and challenges in their own research fields while working with big data.

We will continue the discussion of Digital Music Lab and other similar projects at our next Digital Conversations event, which will take place at the BL on Thursday 21 May 2015, from 18.00 to 20.15. So if you missed the DML workshop, you still have a chance to catch up on the latest developments in big data analysis for musicological and performance research, and to share your ideas, experiences and opinions on the theme. Free tickets to attend the Digital Music Analysis evening can be booked at


Aquiles Alencar-Brayner

Digital Curator


25 March 2015

Enabling Complex Analysis of Large Scale Digital Collections

Jisc have announced the projects that have been funded through their Research Data Spring programme. One of those chosen is 'Enabling Complex Analysis of Large Scale Digital Collections', a project led by Melissa Terras (Professor of Digital Humanities, UCL) in collaboration with the British Library Digital Research team.

Research Data Spring aims to find new technical tools, software, and service solutions which will improve researchers' workflows and the use and management of their data. Following an invitational sandpit event in Birmingham last month aimed at encouraging co-design, 'Enabling Complex Analysis of Large Scale Digital Collections' was chosen from over 40 proposed projects to proceed to a three month development phase.

Our rationale for the project is that a great deal of money has been spent digitising heritage collections and that - as well as being objects that can be presented online for research and public use and reuse - digitised heritage collections are data. The problem, of course, is that non-computationally trained scholars often don't know what to ask of large quantities of data, commonly lack access to high performance computing facilities, and struggle to find the exemplar workflows they need. As a consequence, support from content providers for this category of work tends to be ad hoc, and substantial investment in it is difficult to justify.

'Enabling Complex Analysis of Large Scale Digital Collections' aims to address this fundamental problem by extending research data management processes in order to enable novel research and a deeper understanding of emerging research needs. In the initial three month pilot period we will index a collection of circa 60,000 public domain digitised books (see 'A Million First Steps') at UCL Research IT Services and work with a small number of researchers to turn their research questions into computational analyses. The outputs from each research scenario - including derived data, queries, documentation, and indicative visualisations - will be made available as citeable, CC-BY workflow packages suitable for teaching, self-learning, and reuse. Moreover, these workflows will deepen understanding of complex, poorly structured, and heterogeneous humanities data and the questions researchers could ask of that data, highlighting through use cases the potential for process and service development in the cultural sector. Details of the proposed work beyond the initial three month phase are in the Figshare document embedded above.

We are also delighted that two other projects with British Library involvement have been funded through the Research Data Spring. 'Unlocking the UK's thesis data through persistent identifiers' will investigate integrating ORCID personal identifiers and DataCite DOIs into our ever growing and unique UK thesis collection. 'Methods for Accessing Sensitive Data', otherwise known as AMASED, will adapt and implement DataSHIELD technology in order to (legally) circumvent key copyright, licensing, and privacy obstacles preventing analysis of digital datasets in the humanities and academic publishing. The British Library will supply the same circa 60,000 public domain digitised books to this project to test the extension of DataSHIELD to textual data.

James Baker

Curator, Digital Research




20 March 2015

Texcavator in Residence

This is a guest post by Melvin Wevers, Utrecht University

As part of a three-week research stay at the British Library, I looked at whether and how the British historical newspaper collection could be used within my own research project. I am a PhD candidate within the Translantis research program based at Utrecht University in the Netherlands. The program uses digital humanities tools to analyze how the United States has served as a cultural model for the Netherlands in the long twentieth century. A sister project, Asymmetrical Encounters, based in Utrecht, London, and Trier, looks at similar processes within a European context. Our main sources are Dutch newspapers held by the National Library of the Netherlands (KB).

My research at the British Library served two main goals. First, to investigate how the British newspaper data could be incorporated into my project's research tool, Texcavator. Second, to analyze the extent to which the newspapers can be used for historical research with computational methods such as full-text searching, Topic Modeling, and Named Entity Recognition.

Texcavator allows researchers to search newspaper archives using full-text search and more advanced search strategies such as wildcards and fuzzy searching. Secondly, the tool allows researchers to create timelines and word clouds, which can also be enriched with Named Entity annotators or sentiment mining modules. Thirdly, the tool has an export function that allows the researcher to create subcorpora for use in other tools and analytical software, such as Mallet or Voyant Tools. Texcavator uses Elasticsearch (ES) as its search and analytics engine. ES ingests JSON documents that need to follow a particular schema in order to work within Texcavator; this includes information on newspaper title, date of publication, article type, newspaper type, and spatial distribution.
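To give an impression of the shape of such a document, here is a minimal sketch; the field names are illustrative assumptions, not the actual Texcavator schema:

```python
import json

# A hypothetical newspaper-article document shaped for an
# Elasticsearch index, covering the metadata fields mentioned above.
article = {
    "paper_title": "The Pall Mall Gazette",    # newspaper title
    "date": "1887-06-21",                      # date of publication
    "article_type": "news",                    # e.g. news, advertisement
    "paper_type": "evening",                   # type of newspaper
    "distribution": "national",                # spatial distribution
    "text": "The full OCR-ed article text goes here...",
}

print(json.dumps(article, indent=2))
```

Once every article follows one consistent schema like this, ES can filter and aggregate on any of the metadata fields as well as search the full text.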

The newspaper data on the servers of the British Library uses an XML schema. In order to parse the XML files into JSON, I made use of a Perl parser that modified the schema to work with Texcavator and converted each record into a JSON document. A Python script then enabled me to batch index the files into an Elasticsearch index. Next, I installed Texcavator and configured it to communicate with the Elasticsearch index. This shows that it is fairly easy to load the BL newspaper data into an ES index that can be queried.
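The batch-indexing step can be sketched along these lines; this is an illustrative version, not the actual script, and the index name and directory layout are assumptions:

```python
import json
from pathlib import Path

def iter_actions(json_dir, index_name):
    """Yield one Elasticsearch bulk-index action per converted
    JSON article file in json_dir."""
    for path in sorted(Path(json_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            yield {"_index": index_name, "_source": json.load(f)}

# With a running Elasticsearch node, the actions can then be sent
# in one batch using the `elasticsearch` client library:
#
#   from elasticsearch import Elasticsearch
#   from elasticsearch.helpers import bulk
#   es = Elasticsearch("http://localhost:9200")
#   bulk(es, iter_actions("converted/", "bl_newspapers"))
```

Generating the actions lazily keeps memory use flat even when indexing hundreds of thousands of articles.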

After this, I set out to determine whether the historical newspapers are suited to historical analysis. One of the challenges of working with this archive is the poor quality of the digitized texts: many articles are barely legible, as they include a lot of gibberish produced by the optical character recognition (OCR). Using Texcavator, these newspapers still prove useful for historical research. As Texcavator can combine the OCR-ed texts with the images of the articles, the researcher can use keyword searches to find (some of the) articles that contain a word, which can then be read by looking at the images: the OCR facilitates the searching, and the images are used to close-read the articles. Smart queries using wildcards, as well as ES optimization, can improve the precision and recall of searching.
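Such wildcard and fuzzy searches can be expressed in the Elasticsearch query DSL; a minimal sketch, assuming a full-text field named "text" (the field name is an assumption here):

```python
# A wildcard query: matches OCR variants such as "colonv", "colonie"
# or "colonies", at the cost of lower precision.
wildcard_query = {
    "query": {"wildcard": {"text": "colon*"}}
}

# A fuzzy query: tolerates a small edit distance in the search term,
# so OCR errors like "emplre" can still match "empire".
fuzzy_query = {
    "query": {"fuzzy": {"text": {"value": "empire", "fuzziness": 2}}}
}
```

Fuzziness trades precision for recall, so on a noisy OCR corpus it is worth inspecting a sample of the matches (via the article images) before trusting the counts.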

The texts can also be cleaned up by, for instance, removing all stop words, or words that have a very low frequency (these often result from bad OCR). After cleaning up the texts, techniques such as Topic Modeling and Named Entity Recognition can still be applied. The OCR quality of some newspapers in the archive (such as the Pall Mall Gazette) has already been improved. Using the ES search index, I exported this specific newspaper into a subcorpus, which I tagged with location entities and ran through the Topic Modeling engine Mallet. I am using this corpus to analyze the international outlook between 1860 and 1900. For more on this, see my paper proposal "Reporting the Empire" for DH Benelux.
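The clean-up step described above can be sketched as follows; the stop-word list and frequency threshold are illustrative choices, not those used in the project:

```python
from collections import Counter

# A tiny illustrative stop-word list; a real one would be much longer.
STOPWORDS = {"the", "of", "and", "a", "in", "to"}

def clean_tokens(documents, min_freq=2):
    """Remove stop words and words below a corpus-wide frequency
    threshold from tokenised documents; very rare tokens in a large
    corpus are often OCR errors."""
    counts = Counter(tok for doc in documents for tok in doc)
    return [
        [tok for tok in doc
         if tok not in STOPWORDS and counts[tok] >= min_freq]
        for doc in documents
    ]
```

The filtered token lists can then be written out and fed to Mallet for topic modeling, or to a named entity tagger.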

One of the first steps in working with the data at the British Library is making the data available to researchers in an index that allows for the creation of derived datasets, based on either specific queries or metadata such as newspaper title or spatial distribution. The ability to create derived datasets is an essential step within the larger digital humanities workflow. The datasets can then be processed using freely available text mining tools and visualization libraries such as D3.js or Gephi. I further explained this particular approach to Digital Humanities in a talk I gave at the UCL DH seminar on the 25th of February. The slides of my presentation, "Doing Digital History", can be found on Slideshare.
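In essence, deriving such a dataset is a metadata filter over the indexed articles; a simple sketch, where the "paper_title" and "date" field names are hypothetical:

```python
def derive_subcorpus(documents, paper_title=None, years=None):
    """Build a derived dataset by filtering article documents on
    metadata. Documents are dicts with hypothetical 'paper_title'
    and 'date' ("YYYY-MM-DD") fields; years is an inclusive
    (start, end) tuple."""
    selected = []
    for doc in documents:
        if paper_title and doc.get("paper_title") != paper_title:
            continue
        if years:
            year = int(doc.get("date", "0000")[:4])
            if not (years[0] <= year <= years[1]):
                continue
        selected.append(doc)
    return selected
```

A subcorpus produced this way (for example, one newspaper over a chosen date range) can then be handed to external tools without those tools needing to know anything about the index itself.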

17 March 2015

BL Labs Competition and Awards Roadshow 2015

Mahendra Mahey, Manager of BL Labs
Closing date for Competition: Thursday 30th of April, 2015
Closing date for Award: Monday 14th of September 2015

The 2015 BL Labs Competition has been launched for the third time, and again we want researchers to submit their ideas for projects that highlight the Library's digital collections. Winners will work in residence with the Labs team to make their ideas real. Please help us spread the word!

Previous finalists of the BL Labs Competition have helped us learn more about:

We have seen an amazing range of creative and innovative ideas in entries in 2013 and 2014 and we look forward to seeing even more in 2015! Winners will be chosen by Friday 29th of May 2015.

In addition, we are launching the new 2015 BL Labs Awards for outstanding work that has already been completed using British Library digital content. We are looking for examples in the categories of Research, Creativity, and Entrepreneurship. Shortlisted candidates will be informed by Monday 12th October 2015.

Competition winners will showcase their work and Award winners will be announced at the third Labs Symposium on Monday 2 November, 2015 in the British Library Conference Centre.

We are organising a number of roadshows around the country to promote the competition; for more information and to register, please see:

Contact us at or visit


09 March 2015

Jisc Digital Festival 2015

Today and tomorrow I'm at this year's Jisc Digital Festival, networking with other digital folks and hearing about new technologies relevant to learning and research.

This morning's opening keynote by Simon Nelson, CEO of FutureLearn, got proceedings off to a great start. Talking about MOOCs, he explained how they widen participation in education and are of increasing importance as a recruitment method for universities. He also mentioned how universities can collaborate with other types of partners, such as cultural heritage institutions, to create richer, contextualised courses; he gave the upcoming Propaganda and Ideology in Everyday Life MOOC as an example of this, as it has been created by the University of Nottingham working with curators at the British Library.

The British Library also has a stall in the exhibition, with two of my lovely colleagues from Boston Spa telling people about EThOS and Document Delivery and Imaging Services (BLDSS). They also have postcards, pens, stress toys and even lip balms to give out. Though apparently the most popular item that punters are picking up from the stall is a printed copy of Living Knowledge: The British Library 2015 – 2023, which I found ironic, as we are at a digifest and the document is freely available online!

 I also picked up some freebies from the other exhibition stalls, including an amazing Sex Pistols inspired t-shirt from figshare, which I can't wait to wear.

Figshare t-shirt

I was also grateful for the #digifest travel mugs, as hotel mugs are never large enough for a proper cup of tea in my opinion. This reminds me of one of the quotes on the walls at work that I walk past every day: "You can never get a cup of tea large enough or a book long enough to suit me". Quite right, C.S. Lewis, I agree with you completely.

Jisc #digifest travel mug, next to my hotel mug

For those of you not currently here in Birmingham, there is live coverage online of the keynotes and selected sessions. There is also an active Twitter stream at #digifest15; and if you are here in person, please do say hello - I'll be wearing my new figshare t-shirt :-)


Stella Wisdom

Curator, Digital Research