22 October 2014
Victorian Meme Machine - Extracting and Converting Jokes
Posted on behalf of Bob Nicholson.
The Victorian Meme Machine is a collaboration between the British Library Labs and Dr Bob Nicholson (Edge Hill University). The project will create an extensive database of Victorian jokes and then experiment with ways to recirculate them over social media. For an introduction to the project, take a look at this blog post or this video presentation.
In my previous blog post I wrote about the challenge of finding jokes in nineteenth-century books and newspapers. There’s still a lot of work to be done before we have a truly comprehensive strategy for identifying gags in digital archives, but our initial searches scooped up a lot of low-hanging fruit. Using a range of keywords and manual browsing methods we quickly managed to identify the locations of more than 100,000 gags. In truth, this was always going to be the easy bit. The real challenge lies in automatically extracting these jokes from their home-archives, importing them into our own database, and then converting them into a format that we can broadcast over social media.
Extracting joke columns from the 19th Century British Library Newspaper Archive – the primary source of our material – presents a range of technical and legal obstacles. On the plus side, the underlying structure of the archive is well-suited to our purposes. Newspaper pages have already been broken up into individual articles and columns, and the XML for each of these articles includes an ‘Article Title’ field. As a result, it should theoretically be possible to isolate every article with the title “Jokes of the Day” and then extract them from the rest of the database. When I pitched this project to the BL Labs, I naïvely thought that we’d be able to perform these extractions in a matter of minutes – unfortunately, it’s not that easy.
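To make the idea of title-based filtering concrete, here is a minimal sketch in Python. It is not the archive's real schema, which isn't reproduced in this post: the element names (issue, article, title, text), the sample content, and the function names are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout, invented for this sketch. The archive's real
# element names and structure are not shown in this post.
SAMPLE_XML = """
<issue>
  <article>
    <title>Jokes of the Day</title>
    <text>First gag.</text>
  </article>
  <article>
    <title>Shipping News</title>
    <text>Arrivals and departures.</text>
  </article>
  <article>
    <title>JOKES OF THE DAY.</title>
    <text>Second gag.</text>
  </article>
</issue>
"""


def normalise_title(title):
    """Lower-case a title and drop surrounding whitespace and trailing full stops."""
    return title.strip().rstrip(".").strip().lower()


def find_articles_by_title(xml_string, wanted_title):
    """Return the text of every <article> whose <title> matches wanted_title."""
    root = ET.fromstring(xml_string.strip())
    wanted = normalise_title(wanted_title)
    matches = []
    for article in root.iter("article"):
        title = article.findtext("title", default="")
        if normalise_title(title) == wanted:
            matches.append(article.findtext("text", default=""))
    return matches


joke_columns = find_articles_by_title(SAMPLE_XML, "Jokes of the Day")
print(len(joke_columns), "matching columns found")
```

Normalising the title before comparing it is a design choice rather than a requirement: it lets the sketch tolerate differences in capitalisation and trailing punctuation, which seem likely in nineteenth-century headings.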
The archive’s public-facing platform is owned and operated by the commercial publisher Gale Cengage, which sells subscriptions to universities and libraries around the world (UK universities currently get free access via JISC). Consequently, access to the archive’s underlying content is restricted when using this interface. While it’s easy to identify thousands of joke columns using the archive’s search tools, it isn’t possible to automatically extract all of the results. The interface does not provide access to the underlying XML files, and images can only be downloaded one-by-one using a web browser’s ‘save image as’ button. In other words, we can’t use the commercial interface to instantly grab the XML and TIFF files for every article with the phrase “Jokes of the Day” in its title.
The British Library keeps its own copies of these files, but they are currently housed in a form of digital deep-storage that researchers cannot access directly and in which it is extremely cumbersome to discover content. In order to move forward with the automatic extraction of jokes we will need to secure access to this data, transfer it onto a more accessible internal server, and custom build an index that allows us to search the full text of the articles and their titles. We could then extract all of the relevant text, along with the image files showing the areas of the newspaper scans from which that text was derived.
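For readers wondering what a custom-built index might involve, the sketch below shows one very simple approach: an inverted index that maps each word to the articles containing it. This is an illustration under stated assumptions, not the system the project has built or plans to build. The article identifiers, sample records, and function names are all hypothetical.

```python
import re
from collections import defaultdict


def tokenise(text):
    """Split text into lower-case word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(articles):
    """Map each word to the set of article ids whose title or text contains it.

    `articles` is a dict of article id -> (title, full text).
    """
    index = defaultdict(set)
    for article_id, (title, text) in articles.items():
        for word in tokenise(title + " " + text):
            index[word].add(article_id)
    return index


def search(index, query):
    """Return the ids of articles containing every word in the query."""
    words = tokenise(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Invented sample records; the ids and contents are placeholders.
articles = {
    "a1": ("Jokes of the Day", "A short comic conversation."),
    "a2": ("Shipping News", "Arrivals expected later in the day."),
}
index = build_index(articles)
print(search(index, "jokes day"))
```

A production index over millions of articles would need persistent storage and far more careful tokenisation, so this only conveys the basic shape of the lookup.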
All of this is technically possible, and I’m hopeful that we’ll find a way to do it in the next stage of the project. However, given the limited time available to us we decided to press ahead with a small sample of manually extracted columns and focus our attention on the next stages of the project. This manually created sample will be of great use in future, as we and other research groups can use it to train computer models, which should enable us to automatically classify text from other corpora as potentially containing jokes that we would not have been able to find otherwise.
For our sample we manually downloaded all of the ‘Jokes of the Day’ columns published by Lloyd’s Weekly News in 1891. Here’s a typical example:
These columns contain a mixture of joke formats – puns, conversations, comic stories, etc – and are formatted in a way that makes them broadly representative of the material found elsewhere in the database. If we can find a way to process 1,000 jokes from this source, we shouldn’t have too much difficulty scaling things up to deal with 100,000 similar gags from other newspapers.
Our sample of joke columns was downloaded as a set of JPEG images. In order to make them keyword searchable, transform them into ‘memes’, and send them out over social media, we first need to convert them into accurate, machine-readable text. We don’t have access to the existing OCR data, but even if it were available it wouldn’t be accurate enough for our purposes. Here’s an example of how one joke has been interpreted by OCR software:
Some gags have been rendered more successfully than this, but many are substantially worse. Joke columns often appeared at the edge of a page, which makes them susceptible to fading and page bending. They also make use of unusual punctuation, which tends to confuse the scanning software. Unlike newspaper archives, which remain functional even with relatively low-quality OCR, our project requires 100% accuracy (or something very close) in order to republish the jokes in new formats.
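One common way to put a number on how far an OCR rendering falls short of an accurate transcription is the character error rate: the edit distance between the two texts, divided by the length of the correct one. The sketch below is offered as an illustration only. It is not a measurement the project has reported, and the sample strings are invented rather than taken from the archive.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, reference):
    """Edit distance from the OCR text to the reference transcription,
    as a fraction of the reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


# Invented strings for illustration, not real OCR output.
reference = "Jokes of the Day"
ocr_text = "Jokcs of thc Day"
print(character_error_rate(ocr_text, reference))
```

A rate of zero means the OCR matches the transcription exactly, which is the standard the jokes would need to meet (or come very close to) before being republished.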
So, even if we had access to OCR data we’d need to correct and improve it manually. We experimented with this process using OCR data taken from the British Newspaper Archive, but the time it took to identify and correct errors turned out to be longer than transcribing the jokes from scratch. Our volunteers reported that the correction process required them to keep looking back and forth between the image and the OCR in order to correct errors one-by-one, whereas typing up a fresh transcription was apparently quick and straightforward. It seems a shame to abandon the OCR, and I’m hopeful that we’ll eventually find a way to make it usable. The imperfect data might work as a stop-gap to make jokes searchable before they are manually corrected. We may be able to improve it using new OCR software, or speed up the correction process by making use of interface improvements like TILT. However, for now, the most effective way to convert the jokes into an accurate, machine-readable format is simply to transcribe directly from the image.