THE BRITISH LIBRARY

Digital scholarship blog

07 December 2018

Introducing an experimental format for learning about content mining for digital scholarship

This post by the British Library’s Digital Curator for Western Heritage Collections, Dr Mia Ridge, reports on an experimental format designed to provide more flexible and timely training on fast-moving topics like text and data mining.

This post covers two topics – firstly, an update to the established format of sessions on our Digital Scholarship Training Programme (DSTP) to introduce ‘strands’ of related modules that cumulatively make up a ‘course’, and secondly, an overview of subjects we’ve covered related to content mining for digital scholarship with cultural heritage collections.

Introducing ‘strands’

The Digital Research team have been running the DSTP for some years now. It’s been very successful, but we know that it's hard for people to get away for a whole day, so we wanted to break courses that might previously have taken 5 or 6 hours into smaller modules. Shorter sessions (talks or hands-on workshops) of an hour, or at most two, seemed to fit more flexibly into busy diaries. We can also reach more people with talks than with hands-on workshops, which are limited by the number of training laptops and the need to offer more individual support.

A 'strand' is a new, flexible format for learning and maintaining skills, with training delivered through shorter modules that combine to build attendees’ knowledge of a particular topic over time. We can repeat individual modules – for example, a shorter ‘Introduction to’ session might run more often – or target more advanced sessions at people with some existing knowledge. I haven’t formally evaluated it, but I suspect that the ability to pick and choose sessions means that attendees for each module are more engaged, which makes for a better session for everyone. We've seen a lot of uptake – in some cases the 40 or so places available go almost immediately – so offering shorter sessions seems to be working.

Designing courses as individual modules makes it easier to update individual sections as technologies and platforms change. This format has several other advantages: staff find it easier to attend hour-long modules, and they can try out methods on their own collections between sessions. It takes time for attendees to collect and prepare their own data for processing with digital methods (not to mention preparation time and complexity for the instructor), so we've stayed away from this in traditional workshops.

New topics can be introduced on a 'just in time' basis as new tools and techniques emerge. This seemed to address lots of issues I was having in putting together a new course on content mining. It also makes it easier to tackle a new subject than the established 5-6 hour format, as I can pilot short sessions and use the lessons learnt in planning the next module.

The modular format also means we can invite international experts and collaborators to give talks on their specialisms with relatively low organisational overhead, as we regularly run ‘21st Century Curatorship’ talks for staff. We can link relevant staff talks, or our monthly ‘Hack and Yack’ and Digital Scholarship Reading Group sessions, to specific strands.

We originally planned to start each strand with an introductory module outlining key concepts and terms, but in reality we dived into the first one as we already had talks that'd fit lined up.

Content mining for digital scholarship with cultural heritage collections

[Image: Tom and Nora trying out AntConc]

From the course blurb: ‘Content mining (sometimes ‘text and data mining’) is a form of computational processing that uses automated analytical techniques to analyse text, images, audio-visual material, metadata and other forms of data for patterns, trends and other useful information. Content mining methods have been applied to digitised and digital historic, cultural and scientific collections to help scholars answer new research questions at scale, analysing hundreds or hundreds of thousands of items. In addition to supporting new forms of digital scholarship that apply content mining methods, methods like Named Entity Recognition or Topic Modelling can make collection items more discoverable. Content mining in cultural heritage draws on data science, 'distant reading' and other techniques to categorise items; identify concepts and entities such as people, places and events; apply sentiment analysis and analyse items at scale.’
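To give a flavour of what 'analysing text for patterns' can mean in practice, here is a minimal sketch in Python. It isn't taken from any of the sessions: the sample sentences, the function names and the simple tokeniser are all assumptions made up for illustration. It counts word frequencies across a few items and lists each occurrence of a keyword with its surrounding words, roughly the kind of concordance view that corpus analysis tools offer.

```python
import re
from collections import Counter

# Hypothetical sample texts standing in for digitised collection items.
SAMPLE_ITEMS = [
    "The ship arrived in London on Tuesday.",
    "A ship left London for Calcutta.",
    "The theatre in London opened its doors.",
]


def tokenise(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(items):
    """Count how often each word occurs across all items."""
    counts = Counter()
    for item in items:
        counts.update(tokenise(item))
    return counts


def keyword_in_context(items, keyword, window=3):
    """Return (left, keyword, right) tuples for each occurrence of keyword."""
    hits = []
    for item in items:
        tokens = tokenise(item)
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    # Show the most common words, then a simple concordance for one keyword.
    print(word_frequencies(SAMPLE_ITEMS).most_common(3))
    for left, kw, right in keyword_in_context(SAMPLE_ITEMS, "london"):
        print(f"{left:>25} | {kw} | {right}")
```

Real collections would need more careful tokenisation (and usually OCR clean-up) than this, but the basic pattern of counting and contextualising terms across many items stays the same.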

An easily updatable mixture of introductory talks, tutorial sessions, hands-on workshops and case studies from external experts fits perfectly into the modular format, and it's worked out well, with a range of topics and formats offered so far. Sessions have included: an Introduction to Machine Learning; Computational models for detecting semantic change in historical texts (Dr Barbara McGillivray, Alan Turing Institute); Computer vision tools with Dr Giles Bergel, from the University of Oxford's Visual Geometry Group; Jupyter Notebooks/Python for simple processing and visualisations of data from In the Spotlight; Listening to the Crowd: Data Science to Understand the British Museum's Visitors (Taha Yasseri, Turing/OII); Visualising cultural heritage collections (Olivia Fletcher Vane, Royal College of Art); An Introduction to Corpus Linguistics for the Humanities (Ruth Byrne, BL and Lancaster PhD student); and Corpus Analysis with AntConc.

What’s next?

My colleagues Nora McGregor, Stella Wisdom and Adi Keinan-Schoonbaert have some great ‘strands’ planned for the future, including Stella’s on ‘Emerging Formats’ and Adi’s on ‘Place’, so watch this space for updates!