23 May 2013
Does crowdsourcing capture the crowd at its wisest? An interview with Nick Hiley
How best to make the non-textual “stuff” (to borrow Tim Hitchcock’s nomenclature) we digitise discoverable for users is one of the real challenges facing digitization projects. Whereas making printed text available to mining is relatively straightforward, the same cannot be said for images. Cartoons, a medium which typically contains text alongside implied meaning, foreground these challenges. It seems appropriate then that the British Cartoon Archive was chosen, as a partner in the Going Digital programme (which I blogged about last week), to host a workshop which touched on issues of image creation, management, metadata and discoverability. I was excited to have the opportunity to chat to Dr Nick Hiley, the Head of the British Cartoon Archive (whose website was deemed by curators and experts from the British Library and participating libraries as one of the top 100 for future researchers), to discuss those challenges, their impact on digital scholarship, and locating the real wisdom of the crowd.
James: Welcome to the British Library and thanks for coming along. First up, how did you get involved in Going Digital?
Nick: I was asked if I’d like to participate! Very early on in fact, at the sort of time when you think this will never get off the ground and it doesn’t matter if I say yes! Then suddenly it took off, and I thought really early on that if we can’t do something interesting about digitising images then what can we present because that’s what we do and have done for a long time. There was some sudden worry, because I realised I’ve been doing this for so long without actually knowing the difference between gifs and tiffs and jpegs and whatever else, but then I realised again that you just have to know the point at which you leave the technicality to someone else and you continue doing what you are good at, which I think in our case - at the Archive - is content. And that is what I’ve tried to get over in the workshop: that you need to know that all these technicalities exist but you also need to know the point at which you say that is where I stop and I hand over to somebody else to understand the bit depths and colour balance and exactly how to store this material safely, put it on the web, and so on. So, I was invited.
James: I’m glad you were invited - not only because you could then invite me to come along - but because I see the British Cartoon Archive as having a really interesting collection - not only because of my own research interests - but because of the wealth of description that the dataset contains.
Nick: Indeed. The collection is big, and in a sense 5-10 years ago it was bigger, comparatively. When I arrived in 1999 there were very few databases of this sort of size - even though it only contained 30,000 images. The other thing that interests me is that we’ve been around long enough to remake our images. Whereas if somebody is starting a project now you must produce your archival master and generate images from that - you digitise once and use many times - well we’ve digitised once many times because what that image is needed for and what we’ve defined as an archival image has changed so much over 10-15 years. So the tiny little images that we put in our catalogue in the beginning, because we needed the catalogue to tell us what was downstairs in the archive for us to bring up, are quite different from the 100MB tiffs that we might produce now. A 100MB tiff delivered over a dial-up landline is no use to anybody!
James: And a 100 MB tiff made for a very different purpose: that is the archive in a way. Whilst places such as the British Cartoon Archive were previously cataloguing what was downstairs with their digital images, you are now doing something different. In many cases you don’t even have it downstairs. The archive is the server.
Nick: The strange thing is that because we’ve been around so long we still don’t think enough of the digital collection that we have. We have a big digital collection. Probably a million images, if you break it down into the different sizes of images we deliver on the web. For instance most of these still have a number which is the number given to the physical object. So we essentially have two sets of collections which have the same catalogue numbers. We need to do something about that because we need to accept that this is a separate collection and that it has to be separately conserved and looked after.
James: Given that change in the nature of the purpose of the images themselves how has the way they’ve been described changed over the life of the British Cartoon Archive?
Nick: I think that probably the ambitions have developed. Initially I don’t think it was an ambition to put quite so much contextual information with the image: in terms of notes about the background, cataloguing not only the people shown but also the people referred to, the implied text - so if it visually refers to Alice in Wonderland but not textually, we put that in. And it is that which interestingly now is changing again, because of the feeling that it is better to get as many images out there as possible than to describe them in great detail with added metadata. It’s an interesting point that we’re at, because you can see the rise in that ambition to describe and to index through the description, and you can perhaps see that beginning to be undermined again by the feeling that this is work that should be done by the users, work that will be done if you throw all this stuff out. So we’re either at a terribly interesting and ambitious stage, or we’re peering into the dark ages. My feeling is that it is the whole idea of 20/80 - which is a completely bogus set of statistics - but the idea that 20% of the material in an archive gets used 80% of the time. And the way to break that down is through metadata, and we can ensure that people don’t go only to what they know and we can make sure that 20/80 is broken down, that people use different images - that’s something I’ve seen over the years, I think largely because of the work that Jane Newton has done with cataloguing. You used to get people come along for images, cartoons they’ve seen in books, and then they’d do searches on our catalogue as the catalogue grew and they come with completely new images of Hitler which you’d never seen anybody use before and they’d want those for their books.
And I fear what is going to happen now is that we go back to a digital 20/80, and we get people following the routes that other people have followed through the digital collection, and finding 20% of it again and again and again. Architects talk about paths of desire, the foot routes that people take through buildings and across patches of ground. If you want to stop those developing and wearing away the grass - the digital grass - then I think you’ve got to do that with metadata, which is what makes me a sceptic when it comes to things like crowdsourcing.
James: Following on from that, from the concern that digital archives could become paths trodden over and over again, and adding the fact that the fewer ways there are to connect within a collection the less chance there is that people will go off those trails: if you introduce something like crowdsourcing into the mix, will the trails users are already following be the ones that get described? So perhaps the challenge with crowdsourcing is to gather detail on material that people wouldn’t expect to describe?
Nick: Much of my academic work has been on the history of audiences. I love audiences. They are very challenging to research because they don’t leave any records: they don’t have to. But they leave subtle records in the changes that they make in the media with which they interact. So look at the design of cinemas and you can tell something about what people do in them: it is very crude, like a river running through a landscape. You can see what audiences in the mass have done, so I’m very much on the side of the users and audiences. I just don’t have a great deal of faith in the present definition of crowdsourcing. I think if you look at crowdsourcing projects the results they produce are not specific to the digital media and they are not even characteristic of the digital media. They are very much like any small society that you might ever have been a member of: five people doing all the work, and every time the committee is re-elected the same people are unopposed for treasurer or chairman. However many members of the society you have, the work is done by a small number of people. And that it seems to me is what a lot of these crowdsourcing projects are. It’s a great ambition to capture what users are interested in, but what is happening in most of these projects is remote volunteers. And I think if you look at the characteristics of the web, even those websites we think of as being characterised by user creation - Flickr, YouTube - one user in five hundred creates material for those sites. The way that people use the web is they move freely, they don’t leave things behind naturally. And I think that that might be the best way to capture this extra sort of information, by capturing information about searches for instance. We’re in the British Library now. There might be a temptation to archive Amazon and show the range of products that are available in 2013.
But historically I’d far more like to have records of how people searched Amazon in 2013, what all those people are actually looking for, what they’re trying to find, what they’re wanting, the way that they move around. And I think that is the wisdom of the crowd, that is the extra dimension. It is not that they’ll sign up and do some editing for you in a remote crowdsourcing project, I think we may be looking to capture the wrong sorts of things.
James: Final question then, so in a hypothetical world if you managed to persuade some research funder to give you some money to create a usable interface that would provide rich information on the patterns of use on your website, how would you envisage deploying that? Would it be through an algorithm which took patterns of use to drive other users toward similar content? Because the problem with that approach, which is common in say Amazon, is that it would do precisely what you don’t want: if people come in on traditional routes finding what they expect, you don’t want to keep narrowing the focus of what people get to. So how might you envisage using that kind of information? Or is that the great unknown in need of a solution?
Nick: If I was given resources I’d hope that we could put money into people cataloguing and creating metadata! Though I understand there are great tensions between building this great fortress of knowledge with careful, organised cataloguing by experts and the fact that this fortress becomes a forbidding thing, and I’d be the first out there trying to knock it down! But somewhere between the two I think there will be natural processes, I don’t think we have to make it happen - if I’m wrong I’ll retire and somebody else will take my place. What I’m really worried about is that behind every crowdsourcing initiative I see a group of managers in finger mittens trying to warm themselves over a single candle, because what they are trying to do is cut costs. It is being seized by people who want to dethrone the expert because expertise is expensive. And I do think there is an argument that expertise can be liberating and it can mean that all this material we are throwing out during digitization can be used in new and different ways. Whenever expertise narrows things down and narrows vistas and possibilities, we should get rid of it. But I don’t think that’s happened just yet.
James: Thank you Nick for taking the time to speak to me today.