16 May 2013
On metadata and cartoons
I love cartoons. And few collections of cartoons excite me more than those held by the British Cartoon Archive. Thanks to some meticulous cataloguing, its digital archive is a pleasure to explore, so it seemed fitting to me that the BCA was chosen to host a 'Digitising the Image' workshop on 15 May as part of the AHRC-funded Going Digital doctoral training programme. This programme includes events at The Courtauld Institute of Art, Goldsmiths (University of London), the Open University, and the Universities of East Anglia, Essex, Kent, and Sussex, and runs until the end of July this year. I was invited along to this particular event to talk about how archives of digital images can be used in research, and I chose to focus on how metadata can provide novel opportunities for exploring large corpora of digital images - if perhaps through a less appealing door than by going directly to the cartoons themselves (slides here). The rest of the day covered creating images, file types, publishing images, copyright, and metadata, and provided an excellent opportunity to reflect on how these skills - perhaps even more importantly the knowledge of the possibility of acquiring these skills - can be brought to wider audiences. Going Digital is a good start to this process, but it is really only the first tentative step towards fully integrating 'the digital' into how budding historians, art historians and literary critics are trained in higher education.
So, back to metadata and cartoons. A few weeks before the event I asked the BCA to provide me with a dump of metadata. Quite wisely they came back with some sample .xml which - after some tests - I realised I could do something with at a technical level. I was also advised that the metadata was strongest for the 1960s and 1970s. This then became my focus and having received the full dataset I set about doing some quick and dirty transformations and visualisations for demonstration purposes (warning: quick and dirty are the operative words).
The content includes nearly 400,000 lines of data, with date, title, subject, author and various archival data. After doing a little cleaning of the 'Date' field - and where necessary some judicious removing - in Open Refine, I poked around the data for useful fields (I'll admit that plenty more cleaning could be done). By far the most interesting were the 'Title' field - which holds free text of any inscriptions within the cartoon - and the 'Subject' field - containing text entered by the BCA team in order to categorise the cartoon (so for a single cartoon the list of subjects might include 'backgardens', 'budgerigars', 'pigs', 'ballet', 'typewriters'). It is this latter field which makes the collection such a rich resource for researchers.
In order to force the data into Voyant - perhaps the easiest data discovery tool for newcomers to get to grips with - I had to sort the data by date and then remove the date column to create an artificial chronology: not ideal, but necessary as Voyant can only handle text, not text vs. date. A fudged solution also had to be found to get the data into Zotero for use in Paper Machines. I wanted to demonstrate topic modeling given recent discussions on the subject in the Journal of Digital Humanities, yet getting the data into an easy-to-use tool such as Paper Machines proved troublesome: converting the data to BibTeX made Zotero (on top of which Paper Machines sits) fall over, so instead I crudely chopped the textual data into annual text files for the years 1960 to 1979 and uploaded them for comparison. Again not ideal at all, but it got the point across: at the event I was able to demonstrate manipulating the data in these tools live. Risky perhaps, but if my object was for the audience to understand the power of the tools (which it was!) then static slides wouldn't do. And what more than justified the risk was the evident enthusiasm in the room for the tools and for the fresh discoveries this type of data-driven analysis can enable. More evidence then - if any were needed - that doing trumps reading/hearing/seeing when it comes to encouraging critical tool use.
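For anyone wanting to repeat the 'annual text files' fudge, here is a minimal sketch in Python of one way it could be done. It is not the process I actually used: it assumes the records have already been parsed out of the .xml into dictionaries keyed by 'Date', 'Title' and 'Subject', that the cleaned 'Date' value contains a four-digit year somewhere, and the function names and the cartoons_YYYY.txt file naming are made up for illustration.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: a cleaned 'Date' value contains a four-digit year somewhere.
YEAR_RE = re.compile(r"\b(19\d{2}|20\d{2})\b")


def texts_by_year(records, fields=("Title", "Subject"), start=1960, end=1979):
    """Group the chosen text fields into one block of text per year."""
    grouped = defaultdict(list)
    for record in records:
        match = YEAR_RE.search(record.get("Date", ""))
        if not match:
            continue  # skip rows whose date could not be cleaned
        year = int(match.group(1))
        if start <= year <= end:
            grouped[year].extend(
                record[field] for field in fields if record.get(field)
            )
    # Sorting by year gives the same artificial chronology used for Voyant.
    return {year: "\n".join(parts) for year, parts in sorted(grouped.items())}


def write_annual_files(texts, directory="."):
    """Write one plain-text file per year (hypothetical file naming)."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    for year, text in texts.items():
        (out_dir / f"cartoons_{year}.txt").write_text(text, encoding="utf-8")
```

Joining the values of texts_by_year in year order gives a single text with no date column, which is roughly the shape Voyant needed; writing them out separately gives the annual files for comparison.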
At this point you might be thinking, what did I actually discover in the data? In a sense I discovered what I expected to discover (and not for the first time). The themes of the cartoons in the corpus track the politics of the day, with for example clusters of words around 'Maggie' and 'Conservative' growing to a crescendo by the end of the 1970s. Equally expected, but nonetheless of interest, is the observation that textual content within cartoons during the same period tended toward natural language, with words such as 'british', 'harold', 'christmas' and 'strike' marginal (see below).
A more nuanced discovery, and one which I think suggests the potential both of the data and of the method, is revealed by comparing visualisations of the 'Title' field and of the 'Title' and 'Subject' fields combined. In the latter case, the subjects overwhelm the titles. This is to be expected: as the subjects are chosen by curators of the data at the point of digitisation we might expect these entries to form clusters and to reuse categories. Hence although the addition of the 'Subject' field to the 'Title' only increases the number of unique words from 30,621 to 33,178, it increases the total words from 660,981 to 1,208,082 and the most frequent word from 2,877 occurrences for "it's" to 12,000 for "party" (note: all counts correct after the application of standard stop words - with a few manual additions - to the data).
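The three figures compared above (unique words, total words, most frequent word) are simple to compute outside Voyant. What follows is a rough sketch rather than a reproduction of Voyant's method: the tiny stop list and the tokenising pattern are my own illustrative assumptions, so the numbers it produces for the real data would not match those quoted here.

```python
import re
from collections import Counter

# Illustrative stop list only; a standard list would be far longer.
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "it"})

# Lowercase words, keeping internal apostrophes so "it's" stays one token.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def corpus_stats(text, stop_words=STOP_WORDS):
    """Return total words, unique words and the most frequent word."""
    tokens = [t for t in TOKEN_RE.findall(text.lower()) if t not in stop_words]
    counts = Counter(tokens)
    return {
        "total_words": len(tokens),
        "unique_words": len(counts),
        "most_frequent": counts.most_common(1)[0] if counts else None,
    }


# A toy version of the Title vs Title+Subject comparison.
titles = "It's the party. It's a strike."
subjects = "party party strikes"
print(corpus_stats(titles))
print(corpus_stats(titles + " " + subjects))
```

Even in this toy case the pattern is the same as in the archive data: adding the subjects barely changes the number of unique words, but it swells the total and displaces "it's" as the most frequent word.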
This additional data also changes the trends within the corpus. So whilst comparing 'police', 'unions', and 'strikes' in the Subject+Title corpus shows 'police' and 'strikes' as occurring with relatively equal frequency over time (or across the length of the text), when we look at only the text within the cartoons 'police' occurs with far greater frequency across the period (see above). What is going on here demonstrates the value of capturing implied meaning in metadata as opposed to merely inscribed text. The word 'police' is simply more likely to appear in cartoons: think of stock phrases such as "Stop! Police!" (and derivations thereof) or the appearance of the words 'Police Station' above or around the door of a building. Words such as 'unions' and 'strikes' are more likely on the other hand to only appear in natural speech: "Who's still out? Any new strikes?", "We're not against pay strikes mate", "I dunno Denis - if these strikes go on". So whereas the word 'police' and the theme of policing might appear together, the theme of striking and unions is more likely to be implied within a cartoon and is then more available for this sort of corpus analysis when that implied meaning has been captured and translated into text.
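A trend 'across the length of the text' of this kind can be approximated by cutting the date-sorted text into equal segments and measuring how often each word occurs in each one. The sketch below is my own simplification, not Voyant's implementation; the function name, the default of ten segments and the assumption that the text has already been split into lowercase tokens are all illustrative.

```python
from collections import Counter


def term_trends(tokens, terms, segments=10):
    """Relative frequency of each term in consecutive, equal-sized segments.

    `tokens` is the date-sorted text as a list of lowercase words, so the
    segment order stands in for the passage of time.
    """
    size = max(1, -(-len(tokens) // segments))  # ceiling division
    trends = {term: [] for term in terms}
    for start in range(0, len(tokens), size):
        chunk = tokens[start:start + size]
        counts = Counter(chunk)
        for term in terms:
            trends[term].append(counts[term] / len(chunk))
    return trends
```

Running this once over the 'Title' tokens and once over the combined 'Title' and 'Subject' tokens, for the terms 'police', 'unions' and 'strikes', would give the two sets of lines being compared.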
In the case of the BCA 1960s and 1970s collections this capturing of implied meaning was undertaken by paid experts. Today some of this sort of work can be outsourced to a crowd of volunteers: our own Picturing Canada project is an excellent example of how this could work for digital images. In a future post I will discuss with Nick Hiley, Head of the British Cartoon Archive, the challenges of creating high-quality descriptive metadata in an era where crowdsourcing is so in vogue.