Digital scholarship blog

6 posts from February 2013

22 February 2013

Anyone for a spot of TEI?

I met TEI master @jamescummings at the unparalleled Digital Humanities @Oxford Summer School last year, where he ran the Introduction to the Text Encoding Initiative strand, and knew it was imperative that we cajole him into coming to the Library as part of our Digital Scholarship Training Programme! Thankfully he took us up on our challenge to get curators on the path to TEI in the space of just one day, and we welcomed him to the Library last week.

The Jane Austen's Fiction Manuscripts: A Digital Edition project, in which the British Library is a partner, uses a customised TEI markup scheme to capture features such as the layout of the text on the page, abbreviations and corrections.
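To give a flavour of what such markup looks like, here is a minimal sketch (my own illustrative fragment, not taken from the Austen edition): standard TEI elements such as <del>/<add> for corrections and <choice>/<abbr>/<expan> for abbreviations, processed with Python's standard library to recover the corrected reading:

```python
import xml.etree.ElementTree as ET

# A tiny, illustrative TEI fragment (not from the Austen edition itself):
# <del>/<add> record a correction; <choice>/<abbr>/<expan> an abbreviation.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"
fragment = """
<p xmlns="http://www.tei-c.org/ns/1.0">
  She was <del>very</del><add>quite</add> determined to write to
  <choice><abbr>Mr</abbr><expan>Mister</expan></choice> Knightley.
</p>
"""

def final_reading(elem):
    """Recover the 'corrected' text: keep additions and expansions,
    drop deletions and abbreviated forms."""
    parts = [elem.text or ""]
    for child in elem:
        tag = child.tag.replace(TEI_NS, "")
        if tag == "del":
            pass  # deleted text is dropped
        elif tag == "choice":
            expan = child.find(TEI_NS + "expan")
            parts.append(expan.text or "")
        else:
            parts.append(final_reading(child))
        parts.append(child.tail or "")
    return "".join(parts)

text = " ".join(final_reading(ET.fromstring(fragment)).split())
print(text)
```

The same encoded fragment could equally be rendered to show the first, uncorrected reading instead, which is exactly why editors favour TEI: one file, many views.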

The result has been inspiring, with curators pleading for advanced practical sessions to be added to our curriculum to build our in-house skills across the collection areas. One group particularly interested in adopting TEI is the staff working on the new Qatar Project, which will see more than half a million pages from our archives digitised and made freely available online within the next three years, so stay tuned for that!

If you too are looking to get started in the world of TEI, have a look at his helpful blog post Self Study (part 2): Introduction to the Text Encoding Initiative Guidelines.

20 February 2013

Recognising speech


Clockwise from top left: Luis Carrasqueiro, Peter Robinson, Johan Oomen, Roeland Ordelman

We know that once we have an object in digital form, the opportunities for analysing it, visualising it, sharing it, mashing it up, deconstructing it, rebuilding it, and learning from it are great. We know that through its very structure it can be linked up with other kinds of digital objects, and that the way digital objects can be networked together is changing scholarship and research. But some digital objects are easier to work with than others.

Take speech recordings, for instance. The extent of recorded speech archives - whether audio or video - is probably beyond calculating, but they play only a small part in the online search experience. Type a search term into any search engine and you will bring up textual records, images, maps, videos and more, but you won't have searched across a whole vast tier of digital content: the speech embedded in audio and video records. You will have searched across associated descriptions, but not the words themselves. You are not getting the whole picture, and this leads to bias in what you find, what you may use, and what conclusions you come to in your research.

Coming to the rescue - just possibly - are speech-to-text technologies. In a nutshell these are technologies which use processes analogous to Optical Character Recognition (OCR) to convert speech audio into word-searchable text. The results aren't perfect transcriptions, with accuracy rates ranging from 30-90% depending on the kind of speech and the familiarity of the software with the speaker. Smartphones now come with a speech-to-text capability, because recognising the commands from a single, familiar voice is now relatively easy for our portable devices. Scaling this up to large archives of speech-based audio and video is rather more of a challenge.
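Accuracy figures of this kind are conventionally expressed as word error rate (WER): the word-level edit distance between a reference transcript and the recogniser's output, divided by the length of the reference. A minimal sketch of the calculation:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level edit distance (substitutions,
    insertions, deletions) divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") in a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"{wer:.2f}")
```

A "90% accurate" transcript in these terms still mis-renders one word in ten, which is why search, rather than verbatim transcription, is where these systems shine.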

The British Library has been looking at this issue over the past year through its Opening up Speech Archives project, funded by the Arts & Humanities Research Council. The aim of the project is not to identify the best systems out there - you quickly learn that it's not a case of what's best but rather what is most suitable for particular uses - nor is it about implementing any such system here (yet). Instead we've been looking at how speech-to-text will affect the research experience, and trying to learn from researchers how they can work with audiovisual media opened up through such tools - and how their needs can be fed back to developers.

One output of the research project was a conference, entitled Opening up Speech Archives, which was held at the Library on 8 February 2013. This event brought together product developers, service providers, archivists, curators, librarians, technicians and ordinary researchers from a variety of disciplines. We also had demonstrations of various systems and solutions for delegates to try out.


The conference room awaits

For the record, here's who spoke and what they had to say:

Luke McKernan (that's me) spoke on 'The needs of academic researchers using speech-to-text systems'. You can read my blog post on which this was based, or download a copy of my talk, which tried to look at the bigger picture, arguing that speech-to-text technologies will bring about a huge change in how we discover things.

John Coleman (Oxford University) spoke on 'Using speech recognition, transcription standards and media fragments protocols in large-scale spoken audio collections'. Find out more about his work on the British National Corpus (a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English) and the Mining a Year of Speech project.

Peter Robinson and Sergio Grau Puerto (also Oxford University) discussed 'Podcasting 2.0: the use of automatic speech-to-text technology to increase the discoverability of open lectures'. Find out more about their JISC-funded SPINDLE project, which used open source speech-to-text software to create keywords for searching the university's podcasts collection.
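The keyword-generation step at the heart of SPINDLE's approach can be sketched, in spirit, as a frequency-based extractor over a noisy automatic transcript. This toy version is my own illustration - the project's actual pipeline was more sophisticated than simple word counting:

```python
from collections import Counter

# Toy keyword extraction from an automatic transcript: count words,
# drop common stop words, keep the most frequent. (SPINDLE's real
# pipeline was more sophisticated; this just shows the idea.)
STOP = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "this", "how"}

def keywords(transcript, n=5):
    words = [w for w in transcript.lower().split()
             if w.isalpha() and w not in STOP]
    return [w for w, _ in Counter(words).most_common(n)]

print(keywords("the lecture covers speech recognition and speech "
               "archives and how recognition errors affect search", n=3))
```

The appeal for podcast discovery is that even an error-ridden transcript tends to get the repeated, topical words right, and those are precisely the ones worth indexing.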

Johan Oomen, Erwin Verbruggen and Roeland Ordelman (Netherlands Institute for Sound & Vision) spoke about the hugely impressive (and handsomely funded) work being undertaken in the Netherlands to open up audiovisual archives, with individual talks on 'Time-based media in the context of television broadcast and digital humanities', 'Creating smart links to contextual information' and 'The European dimension: EUscreenXL'. You can find their presentation on SlideShare.

Ian Mottashed (Cambridge Imaging Systems) talked about the complementary field of subtitle capture and search in 'Unlocking value in footage with time-based metadata'. If you are in a Higher Education institution which subscribes to the off-air TV archive BoB National you can see the results of their work - or else come to the BL reading rooms and try out our Broadcast News service, which uses the same underlying software with subtitle searching.

Theo Jones, Chris Lowis and Pete Warren (BBC R&D) gave us 'The BBC World Service Archive: how audiences and machines can help you to publish a large archive'. This was about the BBC World Service Prototype, which is using a mixture of catalogue data, machine indexing of speech files (using the open source CMU Sphinx) and crowdsourcing to categorise the digitised World Service radio archive. If you are keen to test this out, you can sign up for the prototype. It could be how all radio archives will operate one day.

Chris Hartley of information reprocessing company Autonomy talked about 'Practical examples of opening up speech (and video) archives', showing the extraordinary ways in which high-end systems can now analyse text, speech, images, music and video to generate rich data and rich data analyses. The main customers for such systems have been governments and the military - now education is starting to be the beneficiary, with JISC Historic Books using an Autonomy IDOL platform (though with no audio or video content, yet).

Rounding things off was Luís Carrasqueiro of the British Universities Film & Video Council, speaking on 'Widening the use of audiovisual materials in education', including mention of the BUFVC's forthcoming citation guidelines for AV media, which could play a major part in helping make sound and video integral to the research process. He also gave us the key message for the day: "Imperfection is OK - live with it". Wise words - too many of those considering speech-to-text systems dream of something that will create the word-perfect transcription. They can't, and it doesn't matter. It's more than enough to be able to search across the words that they can find, and to extract from these the keywords, location terms, names, dates and more, which can then be linked together with other digital objects. Which is where we came in.

We'll be publishing project papers and associated findings on a British Library web page in due course, and hope to make a demonstration service available in our Reading Rooms soon. Meanwhile, if you want to see some speech-to-text services in action, here are a few to try out:

  • ScienceCinema is a collection of word-searchable lecture videos from the U.S. Department of Energy and CERN, indexed using Microsoft's speech-to-text system MAVIS.
  • Voxalead is a multimedia news test site, searching across freely-available web news sites from around the world, bringing together programme descriptions, subtitles and speech-to-text transcripts and presenting the results in a variety of interesting ways. It's an off-shoot of French search engine Exalead.
  • DemocracyLive is a BBC News site that combines Autonomy-developed speech-to-text with Hansard to provide word-searching of video records of the Houses of Parliament, Select Committees, the national assemblies and more.


Audience, agog

18 February 2013

"Off The Map" Competition: Student Visit from De Montfort University Leicester

Some of you may recall that I recently blogged about the "Off The Map" competition launch, held at the British Library last month. As part of the competition, on Tuesday 12 February, Tom Harper, the Library's Curator of Antiquarian Mapping, and I were very pleased to welcome Game Art Design students from De Montfort University, Leicester, to look at the originals of the digital copies of historical maps which they will be transforming into virtual gaming environments.

Three of the students looking intently at the maps on display for them in the Library's Board Room

Of the three site choices set by the competition - Stonehenge, the Pyramids at Giza, and the Tower of London - De Montfort have decided on the Tower, but have broadened their remit significantly to include the area of the City with London Bridge, Pudding Lane, and the riverside wharves and markets.

They are not short of source material. Thousands of maps and prints from 1550 to 1850 showing every nook and cranny of the city are available, including digital versions. The originals, however, provide a level of detail and a sense of closeness to their historical periods that is difficult to replicate.

Engraved lines emphasise the texture of the stone and wooden buildings, the straw roofs. The scale of William Morgan's 1682 map of London (160 x 250 cm) is rather too large for a smartphone. The original maps were suitably pored over, as only maps can be.

Tom was interested to learn that not only are they looking to recreate the architecture, the layout and the fabric of the city, but they will also be including atmospheric effects, people and sounds. Bustling Billingsgate market? Tower Hill on execution day? These snapshots have been captured by maps and views. The engraved panorama of London by Visscher of 1616 was undoubtedly the most pored-over. In it, the bustling southern end of London Bridge is furnished with heads on spikes (their constituent bodies are shown lined up before the Tower) and market traders blocking the streets.

Visscher’s view is a truthful, rounded portrayal of London in 1616, produced using the most sophisticated visual, surveying and reproductive techniques. It is entirely appropriate, then, that it should inspire the Off the Map competitors, who will use similarly advanced techniques to create an equally vivid map.

Londinum florentissima Britanniae urbs; toto orbe celeberrimum emporiumque: C.J. Visscher Delineavit, 1616 (detail) [British Library shelfmark: Maps C.5.a.6.]

In addition to viewing the maps and asking Tom lots of questions, the students also had a guided tour of the building, and Steve van Dulken kindly showed the group around the Business Intellectual Property Centre, explaining the resources and services available there.

Steve van Dulken speaking to the group in the Business Intellectual Property Centre

The students got a lot out of their visit and said this about the day:

"We can’t thank Stella, Tom and the staff of the British Library enough for our unforgettable visit. We were given a fascinating tour of the building and some of its history, but best of all we had an opportunity to see some beautiful maps of London from the 1600's, up-close! We spent a great deal of time scanning and surveying the maps, gathering details about the scale, architecture and layout of the area we were interested in. Some of the maps offered us a new perspective (literally, in some cases!) and inspired some great ideas from each of us. We came away bubbling with ideas."

You can read more in their blog "Pudding Lane Productions".

Stella Wisdom, Digital Curator and Tom Harper, Curator of Antiquarian Mapping, The British Library

13 February 2013

Wikipedia and the British Library

I've been working as the Wikipedian in Residence at the British Library for the past nine months. This is a one-year project funded by the AHRC, which aims to study the ways in which academics and specialists can engage with Wikipedia and similar projects.

It builds on the work previously done by a number of other Wikipedians in Residence at institutions around the world (full list); usually, they've worked with galleries or museums to help improve content relating to the collections of those institutions. The benefits for everyone are clear - Wikipedia improves in quality and scope; the institutions engage communities interested in their material, and reach potentially much broader audiences.

We've tried something a bit different this time around. While we've worked on some content projects, we've focused on working with researchers and librarians to help build skills and give people the confidence to engage directly with these communities. Over the past months, I've talked to well over three hundred people, demonstrating tools and encouraging them to think about making a first step. There are three approaches we've been looking at here:

  • Contextualising research. Part of the perennial problem of academic projects is that they are often very specialised; it can be very difficult to explain the details of the work to a layperson. Wikipedia allows researchers to help improve the "background" material needed to put their work in context, indirectly supporting the public impact of their work. Working with the International Dunhuang Project, the BL hosted a series of workshops over a week; here, curators, Wikipedia contributors and students worked to write articles about Central Asian archaeology and exploration - see our report.
  • Capturing research. Wikipedia - a publicly-visible, constantly shifting draft awaiting further collaboration - is great for absorbing pieces of secondary research work that may never be formally published elsewhere. As a cataloguer, I used to spend time trying to chase down small details - who did this particular bookplate belong to? was this author the same as another under a pseudonym? what was the original title of this book, and was it first written in Russian or French? Many projects, especially those concentrating on historical networks or correspondence, produce many incidental biographies or summaries of events; Wikipedia can be a very efficient way to get this work out to a wider audience, rather than keeping it in a local silo. Next month, I'll be working with the Darwin Correspondence Project in Cambridge to look at using some of their biographical summaries as the nucleus of Wikipedia articles.
  • Digital content. Wikimedia is one of the largest open-content communities around, and is always keen to use new high-quality material. If your project is producing data or images (or anything else) under a free license, there may well be someone wanting to use it in an interesting and transformative way - and to expose it to new audiences. At the Library, we've been working to get high-quality imagery from our Royal Manuscripts collection (recently digitised) to supplement related articles - such as the beautiful image illustrating the history of the fleur-de-lys in seven languages, below:
Clovis recevant la fleur de lys (Clovis receiving the fleur-de-lys), 15th century
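For anyone wanting to experiment programmatically with any of the three approaches above, Wikipedia content is reachable through the MediaWiki API. This sketch just builds (without sending) a request URL for a plain-text article extract - the parameter names are the API's own, though the example title is mine:

```python
from urllib.parse import urlencode

# Build (but don't send) a MediaWiki API request for a plain-text
# article extract - the kind of programmatic access that makes
# Wikipedia content reusable alongside library metadata.
def extract_url(title, endpoint="https://en.wikipedia.org/w/api.php"):
    params = {
        "action": "query",
        "prop": "extracts",      # TextExtracts extension
        "explaintext": 1,        # plain text rather than HTML
        "titles": title,
        "format": "json",
    }
    return endpoint + "?" + urlencode(params)

url = extract_url("Fleur-de-lis")
print(url)
```

Fetching the resulting URL with any HTTP client returns the article's opening text as JSON, ready to link against catalogue records or image collections.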

If you're interested in what else we've done, you can see an outline presentation I gave to AHRC here.

I'm at the Library until the end of April - if you think you or a group you're working with would be interested to hear more, please get in touch!

11 February 2013

Mashups, Linked Data, APIs, Oh My!

Our internal Digital Scholarship Training Programme is running at full speed now, and we're nearly through the first semester, having held 9 of the 15 courses we've developed! Over 70 curators have participated in one or more of them, and I'm thrilled to have the opportunity to give a full report on this initiative in Nebraska this summer at #dh2013.

Last week we had Owen Stephens over to teach a full-day course to staff on the topic of Information Integration: Mashups, APIs and Linked Data from the library perspective. I initially took inspiration for the course from one held at USC's Information Sciences Institute; that course, you'll see, covers these topics over 16 weeks, so Owen did a masterful job covering them in a meaningful way in less than 8 hours!


Emma Goodliffe, International Dunhuang Project and Jennifer Howes, Curator of Visual Arts

The aim of ours was very specific: how Mashups, APIs and Linked Data are playing out in the cultural heritage realm, with a view to providing curators with the opportunity and inspiration for situating their own curatorial interests there. One of the best pieces of advice he gave on the day was to build things which we ourselves find useful. If APIs fulfil our own requirements - enabling new, more efficient workflows for providing digital content, for instance - they'll find the support and maintenance required to keep them alive and well. To put that into practice, Owen worked up a really nice exercise which allowed curators to access our very own British National Bibliography API. He's posted it here - do have a go!
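As a taste of what such an exercise involves: the BNB is published as Linked Data and queryable with SPARQL. The sketch below only constructs a query string - the Dublin Core and FOAF predicates are my illustrative assumptions, not necessarily the exact BNB data model, so treat it as the shape of a query rather than a copy-paste recipe:

```python
from textwrap import dedent

# Sketch of a SPARQL query for titles by a given creator, of the
# kind one might send to a bibliographic Linked Data endpoint.
# The predicates (Dublin Core terms, FOAF) are illustrative.
def titles_by_creator(creator_name, limit=10):
    return dedent(f"""
        PREFIX dct: <http://purl.org/dc/terms/>
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        SELECT ?book ?title WHERE {{
          ?book dct:title ?title ;
                dct:creator ?person .
          ?person foaf:name "{creator_name}" .
        }} LIMIT {limit}
    """).strip()

print(titles_by_creator("Jane Austen"))
```

The point of Owen's advice holds here: a query like this only earns its keep if it feeds a workflow a curator actually needs, such as pulling records into a finding aid.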

Nora McGregor
Digital Curator

06 February 2013

Digital Conversations series: Debating the Cloud

Digital Conversations is a series of seminars hosted by the Digital Research and Curator Team, aimed at engaging staff in lively discourse about digital innovation. So far the team has organised six Conversations, at which inspirational individuals and organisations were invited to give ten-minute, thought-provoking presentations on a topic relating to the digital environment. Annotation and sharing, profiling and privacy, and digital narratives are some of the topics covered in the series, which has featured speakers from The Guardian DataBlog, the Oxford Internet Institute, Mendeley, the BBC, Artfinder, Microsoft Research and TouchPress, to name but a few.

The latest Digital Conversation, held on 18 January, focused on the topic of Cloud Computing and considered the opportunities it presents for the British Library, Higher Education, public sector ICT provision, and audio social networks.


Panellists (from the left): Lance Patterson, Niels van Dijk, Peter Middleton, Simon Waddington and Andy Tattersall

Lewis Crawford from the Architecture Team at the Library spoke about the benefits of the Cloud for the Library, which currently uses a private cloud cluster for full-text indexing of content in its Digital Library System and for image conversion. In the longer term, however, it is hoped that private cloud storage will be used for all Digital Library content, with metadata and access services based around this cloud store.

Dr. Simon Waddington, Centre for e-Research, King’s College London, shared the results of the JISC-funded Kindura project, which used DuraCloud to build a prototype hybrid cloud-based repository for research data. Taking into account the diverse requirements of researchers from different disciplines, the system acts as a ‘broker’ for the management of research data, offering both cost optimisation and an important institutional rules engine.

Dr. Simon Waddington presenting the Kindura prototype

Audioboo's Lance Patterson offered a technical demonstration of this audio-based social network and explained how cloud-distributed transcoding and streaming helps manage high levels of demand and supports projects such as the BooKnows educational initiative, which uses audio to create and share knowledge among students and experts worldwide.

The other speakers included Peter Middleton, who introduced the UK Government G-Cloud Programme, which encourages the adoption of Cloud services across the public sector; Andy Tattersall, ScHARR, University of Sheffield, who shared his and his University's experience of implementing the Google Cloud for collaborative working, teaching and learning; and Niels van Dijk from the Dutch National Research and Education Network (NREN), who presented on its SURFconext service, which offers a collaboration infrastructure connecting systems, services, tools and people.

Niels van Dijk about to introduce SURFconext

There was a lively panel debate on the value of the cloud for on-demand, easily scalable, "no lock-in" services, data portability and interoperability, which also highlighted the security and legislative risks of storage in the Cloud.


Previous Conversations in the series can be viewed on the Digital Conversations playlist on the British Library YouTube Channel. The series has generated a lot of interest from external audiences, and the team is currently exploring the possibility of opening up the talks to people from outside the Library. Enjoy the videos and watch this space for more news!