Sound and vision blog

05 March 2012

Words into words



This British Library event on 'The Future of Text', held 22 September 2011, includes a talk by me on the opportunities provided by subtitle and speech-to-text searching (at 1:25:10, and you'll need to turn the volume up...)

The key term when considering what we need to do with moving images at the British Library is 'integration'. It turns up in every strategy document, every PowerPoint presentation, every funding application. We are not interested (primarily) in the medium for its own sake, but in how it supports research in other subjects. We want researchers to search for the topic that interests them and to be able to offer them, in the one place, books, journals, newspapers, photographs, maps, websites, sound recordings ... and moving images. There should be no hierarchy among the media, and the more varied and integrated an offering we can provide, the more chances there are of researchers finding something that surprises them, that takes their research into corners they hadn't considered.

To achieve this noble vision, we need to do two things. The first, of course, is to have the moving images. We have a growing collection of these (around 55,000 at the last count), many of them music-related since they were collected by our Sound Archive, though the collection is starting to increase in breadth. We hope to extend the number of moving image items we offer considerably through partnerships, more of which at another time.

The second is to have the tools to enable researchers to find these different media in the one place. The Library has already taken a big step forward here with its new Explore the British Library catalogue, which brings together a large part of our collection, including some of the moving image collection, in the one place. Searches can be filtered by medium, including moving image, and we'll be adding more film and video records to the catalogue over the next few months.

But having films, books, manuscripts etc. all in the one place doesn't necessarily make for equality of searching. Unless you have equally rich metadata, or catalogue records, for each medium, those media with more words will, simply put, get more attention. As the Library moves further into full-text searching, the moving image has to be there too, or it will be pushed to the sidelines once again.

We were aware of this need when we started our television and radio news recording programme, which is due to become a reading room service quite soon (more on that innovation in a later post). The service, which we are calling Broadcast News, captures subtitles from television news programmes where these are available, then converts them into word-searchable text (a considerable technical challenge, because the subtitles on your TV programmes are graphics, not text, and need to be read through a process not unlike OCR). So you can search across thousands of television news programmes by the words spoken on screen.
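To make 'word-searchable' a little more concrete, here is a minimal sketch of the kind of index that lets a search jump from a word to the moment it was spoken. It assumes the subtitles have already been converted from graphics to text and stored as cues (programme, start time, text); the `Cue` structure, the function names and the sample programmes are hypothetical illustrations, not the Broadcast News implementation.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Cue:
    """One subtitle cue after OCR-style conversion to text (hypothetical structure)."""
    programme: str
    start_seconds: float
    text: str


def tokenise(text: str) -> list[str]:
    """Lower-case the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(cues: list[Cue]) -> dict[str, list[tuple[str, float]]]:
    """Inverted index: map each word to the (programme, start time) pairs where it appears."""
    index: dict[str, list[tuple[str, float]]] = defaultdict(list)
    for cue in cues:
        for word in set(tokenise(cue.text)):  # record each cue once per word
            index[word].append((cue.programme, cue.start_seconds))
    return dict(index)


def search(index: dict[str, list[tuple[str, float]]], query: str) -> list[tuple[str, float]]:
    """Return the cues that contain every word in the query, sorted by programme and time."""
    words = tokenise(query)
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))  # keep only cues matching all words
    return sorted(hits)


# Hypothetical sample cues, invented for illustration only.
cues = [
    Cue("Programme A", 65.0, "New tax breaks for married couples"),
    Cue("Programme A", 72.5, "The Chancellor defended the plan."),
    Cue("Programme B", 12.0, "Married couples react to the tax changes"),
]
index = build_index(cues)
print(search(index, "married couples tax"))
```

Because each hit carries a start time, a researcher can be taken straight to the relevant point in a programme rather than watching it through.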

This is exciting, but not all television channels come with subtitles, particularly satellite channels. Other tools are required, and this is why we are looking at speech-to-text software. Voice recognition and speech-to-text are starting to become familiar. Mobile phone apps now offer voice command features and the ability to convert voice messages into text. Speech-to-text applications are used by medical services, legal services, and the military. The great challenge is to scale such technology up to the demands of large archives. The problems are considerable. Most voice recognition packages rely on recognising one voice: your own. They struggle with alien voices, multiple speakers, unfamiliar accents, and so on. Here at the Library we have television news programmes, radio broadcasts, oral history recordings and other speech-based archives, access to which would be revolutionised by an effective, and affordable, speech-to-text capability, enabling these media to be word-searchable in seconds rather than the hours it currently takes to get through some recordings.

The right solution is not going to become available overnight. Last year we successfully trialled Microsoft's MAVIS speech-to-text programme as part of our Growing Knowledge exhibition, indexing 1,000 hours of interview material and 100 hours of video news. We are now going to build on that initial experiment in a one-year research project funded by the Arts and Humanities Research Council under its Digital Transformations in Arts and Humanities theme.

The project is not about finding a technical solution per se (such solutions already exist). Although we hope to generate up to 6,000 hours of indexed, word-searchable content (3,000 hours of video news, 3,000 of radio), the chief aim of the project will be to determine its value to researchers. We will be asking three main questions:

  1. How useful are the results to researchers in the arts and humanities? Speech-to-text systems cannot deliver perfect transcripts, but they have now reached a level of accuracy where they can offer a reliable, indeed liberating, word-searching capability. The value of this will need to be explored with researchers themselves. We will establish user groups working with postgraduate students in radio studies and journalism studies, testing research scenarios that focus both on audio-visual media alone and on audio-visual media integrated with text-based media.
  2. What are the methodological and interpretative issues involved? Imperfect indexing by speech-to-text systems can lead to misleading results (for example, a television news programme with the words ‘new tax breaks for married couples’ was indexed by MAVIS as ‘no tax breaks for married couples’). The project will explore such pitfalls, consider how best to quote and cite such recordings, examine how to evaluate results from audio-visual media alongside text-based media (what is the correlation between a speech transcription and the text of a newspaper article?), and address other issues.
  3. How can speech-to-text technology be adopted in UK research in a form that is readily accessible and affordable? The project will look at the various systems available and provide guidelines as to usability, affordability and sustainability.
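The MAVIS example in the second question shows why accuracy figures alone can mislead. A standard way to score a transcript, assumed here rather than taken from the project plan, is word error rate (WER): the word-level edit distance between a reference transcript and the system's output, divided by the number of words in the reference. The sketch below is a plain Levenshtein implementation applied to that one sentence; the function name is our own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # dist[i][j] = edits needed to turn the first i reference words into the first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,  # deletion
                dist[i][j - 1] + 1,  # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)


wer = word_error_rate(
    "new tax breaks for married couples",
    "no tax breaks for married couples",
)
print(f"{wer:.3f}")  # 0.167: one substitution in six words
```

One wrong word in six gives an error rate of about 17%, yet the indexed sentence says the opposite of what was broadcast: a low error rate does not guarantee a trustworthy reading.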

So we are not just interested in our own needs, but in how such technologies can support research in the arts and humanities as a whole. We will be publishing and promoting our findings at the end of the project. We are keen to hear from anyone with an interest in this area, so if this is something that you know about, or have an opinion about, do get in touch. The email address is [email protected].


Hello Luke,

I would be most interested to learn the results of the project when they are published, so that I can tell my MLitt students who are studying film and sound archiving on the University of Dundee distance-learning course.

David Lee

Hello David,

Thanks for your interest. We'll be distributing information on our findings as widely as possible. We're at the early testing stage and the results we're getting are fascinating, and offer such promise for researchers in the future. We have over a million speech-based recordings here at the BL that need opening up ...
