The Newsroom blog

News about yesterday's news, and where news may be going

2 posts categorized "Catalogues and databases"

02 April 2014

Taming the news beast

Taming the News Beast was the striking title of a seminar held on April 1st by ISKO UK, the British branch of the International Society for Knowledge Organization. Subtitled "finding context and value in text and data", its aim was to explore the ways in which we can control the explosion of news information data and derive value from it. Much has been written about this explosion from the points of view of its producers and consumers, but less well known are the huge challenges it presents for those whose job it is to manage such data by working effectively with those who generate it. Few environments depend more on effective information management - while creating any number of problems for those trying to apply the rules - than the news industry today. Hence the seminar, which aimed "to share knowledge from the intersections of technology, semantics and product development".


Looking at the large lecture theatre at University College London filled to the brim with an enthusiastic audience of data developers, information scientists, journalism students and archivists, your blogger was moved to think that things were very different to when he spent his time at library college, many years ago now. Library and information studies, as they called it then, excited no one. Now, in the era of big data, it is where the big ideas are happening. Librarians (let's continue to give them their traditional name) are masters of the digital universe, or might aspire to be. Metadata is cool; ontologies are where it's at; semantics really means something.

The epitome of this excitement about information management - particularly news information - is the work coming out of BBC development projects such as BBC News Labs, which was introduced in a presentation by its Innovation Manager, Matt Shearer. News Labs has a small team of people looking at better ways in which to manage news information, both within and outside the BBC. Its work includes the Juicer API (for semantic prototyping), the #newsHACK days for testing of product development ideas, entity extraction (extracting key terms from a mass of unstructured text), linked data (the important principle of working with data based on terms produced for DBpedia which other institutions can share in to create linked-up knowledge) and the Storyline ontology. There is particular excitement in trying to extract searchable terms for audiovisual media, through such technologies as speech, image and music recognition. If there is a pattern, the machines can be trained to recognise it.

Shearer's enthusiastic and sometimes mind-spinning presentation was matched by his colleague Jeremy Tarling, data architect with News Labs, who introduced Storyline - an open data model for news. Storyline is a way of structuring news stories around themes, based on a linked data model. The linked data bit is the way of ensuring consistency and shareability (they are working with other news organisations on the project). The theme element is about a new way of presenting news online which joins up stories in a less linear, more intuitive fashion. If you type 'Edward Snowden' into a search engine you will get hundreds of stories - how do you sort these out, or tell what the overarching narrative is that connects them all? If you can bundle the Snowden stories that your news organisation has produced around the storylines that go to make up the Edward Snowden theme - for example, Snowden at Moscow airport, Snowden finds job in Russia - you start to impose more of a pattern, and to draw out more of a story - the storyline, that is.
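For the technically curious, the bundling idea can be sketched in a few lines of Python. This is only an illustration of grouping tagged stories by theme and then by storyline. It is not the BBC's Storyline ontology or any News Labs code, and the `Story` class, the `bundle` function and the sample headlines are all invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Story:
    """A hypothetical tagged news story (not the real Storyline model)."""
    headline: str
    theme: str       # the overarching subject, e.g. "Edward Snowden"
    storyline: str   # the narrative strand within that theme


def bundle(stories):
    """Group headlines by theme, then by storyline within each theme."""
    themes = defaultdict(lambda: defaultdict(list))
    for story in stories:
        themes[story.theme][story.storyline].append(story.headline)
    # Convert to plain dicts so the result is easy to inspect or serialise.
    return {theme: dict(lines) for theme, lines in themes.items()}


# Invented sample data: the headlines are made up for illustration only.
stories = [
    Story("Snowden remains in transit zone", "Edward Snowden", "Snowden at Moscow airport"),
    Story("Snowden takes technical support role", "Edward Snowden", "Snowden finds job in Russia"),
    Story("Snowden meets lawyers at airport", "Edward Snowden", "Snowden at Moscow airport"),
]

bundled = bundle(stories)
for storyline, headlines in bundled["Edward Snowden"].items():
    print(storyline, "->", len(headlines), "stories")
```

The point of the sketch is that everything depends on the tags: a story filed without a theme or storyline simply never appears in the bundle.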

The nuts and bolts of this are interesting, because it requires journalists to tag their stories correctly, and listening between the lines one could see that some journalists were more willing and able to do so than others. But this sort of data innovation is happening, and it will have a dramatic impact on how news sources such as the BBC News website look in the future.

The energy, resources and ingenuity put into such work by the BBC can leave the rest of us overwhelmed, not to say humbled, but the remaining speakers had equally interesting things to say. Rob Corrao, Chief Operating Officer of LAC Group, gave a dry, droll account of how his consultancy company had been brought in to enable ABC News in New York to get on top of the "endless torrent" of news information coming in every day. This was a different approach to the problem, more of an exercise in logistics than simple data management policies. They managed the people and the work-processes first, then everything else fell into place. A content strategy was essential to understanding how best to manage the news process, including such simple ideas as prioritising the digitisation of footage of people likely to feature before long in obituary pieces. The more you know what the news will be in advance, the easier it is to manage it.

Ian Roberts of the University of Sheffield introduced AnnoMarket, a European-funded project which will process your text documents for you, or conduct analyses of news and social media sources. As automated metadata extraction tools start to make more of an impact (that is, tools which extract useful information from digital sources), so businesses are popping up which will do the hard work for you. Send them a large bunch of documents in digital form, and they will analyse them for you. Essentially it's like handing them a book and they give you back an index.

Finally Pete Sowerbutts of the Press Association talked about how the news agency is applying semantic data management tools to its news archives, so that with a bit of basic information about a subject (e.g. name, age, occupation), place or organisation and some properly applied tagging, a linked-up catalogue starts to emerge. People, places and organisations are the subjects that all of the projects like to tackle, because they are easily defined. Themes - i.e. what news stories are actually about - are harder to pin down, semantically speaking.

Beneath all the jargon, much of this was about tackling age-old problems of how best to catalogue the world around us. Librarians in the room of a particular vintage looked like they had seen all of this before, and indeed they had. Librarians' role in life is to try to impose order on an impossibly chaotic world. Previously they came up with classification schemes and controlled vocabularies and tried to make real-life objects match these. Now we have automated systems which try to apply similar rules with reduced human intervention because of the sheer vastness of the data we are trying to manage, and because it is digital and digital lets you do this sort of thing. Yet real life continues to elude all of our attempts to describe it precisely. Sometimes the only way you are going to find out what a news publication is actually about is to pick it up and read it. But you still have to find it in the first place.

An unanswered question for me was whether what applies to news applies to news archives. News changes once it has been produced. It turns into a body of information about the past, where the stories that mattered when they were news may no longer matter, because researchers will approach the body of information with their own ideas in mind, looking across stories as much as they may look directly for them. Our finding tools for news archives must be practical, but they must not be too prescriptive. ABC News may hope to guess what the news will be in the future, but the news archivist can never be so presumptuous. It is you, the users, who will provide the storylines.



24 March 2014

Checking out the NSB

The NSB, or Newspaper Storage Building, is the British Library's new home for newspapers. Situated at our second site in Boston Spa, Yorkshire, it is not where users will be able to read our print newspapers - that will be in the Newsroom at St Pancras, when they become available once more in Autumn 2014 - but it is where they are starting to be stored, in optimum preservation conditions.


The Newspaper Storage Building (NSB)

The urgent need for moving the newspaper collection from its former North London home in Colindale to the NSB is made clear in a recent blog post by Sandy Ryan of our Collection Care team.

In 2001, as part of a three-year project to survey all of the Library’s collections on all of its sites, we surveyed the newspaper collections at Colindale using the PAS (Preservation Needs Assessment Survey) methodology. The results showed that the newspaper collection is the most vulnerable of all of the Library’s collections and gave us a statistically sound picture of the state of this national collection. Our results showed that 34% of the collection at Colindale was unstable – 19.4% in poor condition, 14.6% unusable.

It simply wasn't possible for us to continue with a third of the collection in an unstable condition and nearly 15% of it actually unusable. So it was that, thanks to £33M of government funding, we embarked on our Newspaper Programme, which has seen the closing down of the Newspaper Library at Colindale, the building of the NSB with the ongoing transfer of the newspaper collection to the new facility, the planned digitisation of 40 million newspaper pages through the British Newspaper Archive, and the opening of the Newsroom at our St Pancras site next month.

I visited the NSB for the first time last week. It is, to be honest, a large black lump - a Vogon spaceship of a building, landed in the middle of the Yorkshire countryside. But what it lacks in beauty on the outside it more than gains in purpose on the inside. It is essential for the long-term preservation of the print newspapers that they be kept in optimum temperature and humidity-controlled conditions, and in the dark. Inside the NSB the temperature is being maintained at 14°C, with relative humidity at 55%, and the oxygen level at 14-15%, eliminating any risk of fire. So it is great for newspapers, but not so great for humans. Instead the process of ingest, shelving and retrieval is all undertaken by fully-automated machinery - appropriately robotic for a spaceship.


Looking up from the inside of the NSB as one of the cranes descends, from BBC News

Disappointingly it means that visitors can't see inside the main storage area (there is no viewing gallery), but you can see what it looks like from this recent BBC News video, which also shows the conditions at Colindale - the contrast is dramatic. A push of a button from St Pancras will send a message to the Boston Spa robots to select the requested newspaper volume from its rack, carry it to the outside, and have it delivered to St Pancras within 48 hours.

Such a process requires more than robots - it requires minute attention to data. Every volume in the collection has been marked with a barcode, with these records matched to the appropriate catalogue record. This would have been a time-consuming but otherwise straightforward process were there a simple one-to-one relationship between catalogue and object, but sadly that is not the case. Many of the newspaper titles have been bound alongside other titles, and newspapers in any case are full of cataloguing complexities because they have a tendency to change their titles or frequency of publication. This has meant a huge job of matching complex records to the complexities of how the newspapers have been bound or boxed, in a form that makes sense to an electronic catalogue and ordering system.


Stacks of newspaper volumes ready for ingest into the NSB

You can see the results of this in operation at the one part of the NSB that is open to visitors, the ingest area, i.e. where the newspapers are delivered into the NSB. Each shrink-wrapped newspaper volume has its barcode, which is checked against the NSB's management system. Each pile of newspaper volumes, as shown in the photograph above, is also barcoded, because the stack of volumes then has to find its place within the NSB. Each stack - which is a maximum 400mm high - has a top and bottom board with straps tightly secured about them (a task performed by actual humans). Only when every volume that should be there has been checked to form a complete stack, and the barcode for the stack itself swiped and checked, can the complete set be whisked away on a conveyor belt, through an air lock, and off to its allotted rack within the darkened vastiness of the NSB.
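The logic of that final check is simple enough to sketch in Python. What follows is a guess at the shape of the rule, not the NSB's actual management system: the `Stack` class, the `check_stack` function and the barcode values are hypothetical, and only the 400mm limit comes from what I was told on the visit.

```python
from dataclasses import dataclass

MAX_STACK_HEIGHT_MM = 400  # maximum stack height, as described above


@dataclass(frozen=True)
class Stack:
    """A hypothetical record of what the management system expects."""
    barcode: str
    expected_volumes: frozenset
    height_mm: int


def check_stack(stack, scanned_volumes, scanned_stack_barcode):
    """Return a list of problems; an empty list means the stack may go in."""
    problems = []
    scanned = set(scanned_volumes)

    missing = stack.expected_volumes - scanned
    if missing:
        problems.append("missing volumes: " + ", ".join(sorted(missing)))

    unexpected = scanned - stack.expected_volumes
    if unexpected:
        problems.append("unexpected volumes: " + ", ".join(sorted(unexpected)))

    if scanned_stack_barcode != stack.barcode:
        problems.append("stack barcode does not match")

    if stack.height_mm > MAX_STACK_HEIGHT_MM:
        problems.append("stack is taller than 400mm")

    return problems


# Invented barcodes, for illustration only.
stack = Stack("STACK-001", frozenset({"VOL-A", "VOL-B", "VOL-C"}), 380)

print(check_stack(stack, ["VOL-A", "VOL-B", "VOL-C"], "STACK-001"))  # complete stack
print(check_stack(stack, ["VOL-A", "VOL-C"], "STACK-001"))           # one volume short
```

Returning a list of problems rather than a plain yes or no reflects what the humans at the ingest area need to know: not just that a stack has failed, but which volume to go looking for.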


Another delivery arrives from Colindale

All of this is currently going through a testing process, as we try to anticipate all of the different kinds of order enquiries that will be made, how they are identified, retrieved and delivered, while making sure that nothing gets lost or damaged along the way. Meanwhile the newspapers are now being delivered from Colindale to the NSB, a few lorries-worth at a time so far, but soon to be three lorries per day, every day, up to Autumn 2014 when every newspaper volume (c.280,000 of them) will be in place and the full print newspaper delivery service can be put into action.

The Newsroom opens on 7 April, when there will be no print newspapers available to begin with. The service we'll be providing between then and Autumn 2014 will be microfilm and digital copies only. As a third of the collection is accessible in microfilm form, we will be able to satisfy a great many research enquiries as things are, and in an ideal operational world we would want to deliver 100% microfilm and digital access, and never have to move the newspapers again. The NSB has been designed for their long-term preservation, so that our news heritage is safe for many generations to come.

The NSB has excited a lot of interest for its automated storage systems, but what intrigues me from the curatorial point of view is that relationship between the object and its description. Our analogue heritage fits clumsily into the digital age. We have bound newspapers into volumes for the convenience of storing them on shelves, but what was convenient for finding newspapers by humans able to walk up and down those shelves is not so convenient when we need to understand newspapers as they were issued, which is by issue.

Our newspaper catalogue is at title-level; that is, we can tell you that we have The Times, and that for any particular date or time period for that newspaper we can direct you to the relevant volume. We can't point you directly to that individual instance of a newspaper, unless you use the British Newspaper Archive, where that 1% of our collection that has been digitised is discoverable by article, page, issue or title. That's what digital should give you - logical disaggregation of the object into an intelligent, reusable, interoperable artefact. Newspapers can then be linked up with other newspapers, indeed other news forms such as television, radio, web or anything with a date to it.

We have more thinking to do about how best to make our newspapers available, and why.