22 August 2016
SherlockNet: tagging and captioning the British Library’s Flickr images
Finalists of the BL Labs Competition 2016, Karen Wang, Luda Zhao and Brian Do, inform us on the progress of their SherlockNet project:
This is an update on SherlockNet, our project to use machine learning and other computational techniques to dramatically increase the discoverability of the British Library’s Flickr images dataset. Below is some of our progress on tagging, captioning, and the web interface.
When we started this project, our goal was to classify every single image in the British Library's Flickr collection into one of 12 tags -- animals, people, decorations, miniatures, landscapes, nature, architecture, objects, diagrams, text, seals, and maps. Over the course of our work, we realised the following:
- We were achieving incredible accuracy (>80%) in our classification using our computational methods.
- If our algorithm assigned two tags to an image with approximately equal probability, there was a high chance the image had elements associated with both tags.
- However, these tags were in no way enough to expose all the information in the images.
- Luckily, each image is associated with text on the corresponding page.
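The second observation in the list above suggests a simple way to assign more than one tag to an image. The following is a minimal sketch, assuming the classifier returns one probability per tag; the function name `top_tags` and the `margin` value of 0.1 are illustrative assumptions, not the project's actual code or settings.

```python
def top_tags(probs, margin=0.1):
    """Return every tag whose probability is within `margin` of the best one."""
    best = max(probs.values())
    # Sort by descending probability so the strongest tag comes first.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return [tag for tag, p in ranked if best - p <= margin]


# Made-up probabilities for one image (hypothetical values).
example = {"animals": 0.46, "nature": 0.41, "maps": 0.13}
print(top_tags(example))  # ['animals', 'nature']
```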
We thus wondered whether we could use the surrounding text of each image to help expand the “universe” of possible tags. While the text around an image may or may not be directly related to the image, this strategy isn’t without precedent: Google Images uses text as its main method of annotating images! So we decided to dig in and see how this would go.
As a first step, we took all digitised text from the three pages surrounding each image (the page before, the page of, and the page after) and extracted all noun phrases. We figured that although important information may be captured in verbs and adjectives, the main things people will be searching for are nouns. Besides, at this point this is a proof of principle that we can easily extend later to a larger set of words. We then constructed a composite set of all words from all images, and only kept words present in between 5% and 80% of documents. This was to get rid of words that were too rare (often misspellings) or too common (words like ‘the’, ‘a’, ‘me’ -- called “stop words” in the natural language processing field).
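The frequency filter described above can be sketched in a few lines. This example assumes the noun phrases have already been extracted into one list of words per image; the function name `filter_vocabulary` and the toy corpus are hypothetical, and the toy thresholds differ from the 5% and 80% used in the project only so that the small example shows both cut-offs at work.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=0.05, max_df=0.80):
    """Keep words whose document frequency lies between min_df and max_df."""
    n_docs = len(docs)
    # Count each word once per document, however often it occurs there.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {w for w, c in doc_freq.items() if min_df <= c / n_docs <= max_df}


# Toy corpus: one list of (hypothetical) extracted nouns per image.
docs = [
    ["river", "bridge", "page"],
    ["river", "castle", "page"],
    ["horse", "page"],
    ["castle", "horse", "page"],
]
vocab = filter_vocabulary(docs, min_df=0.5, max_df=0.8)
print(sorted(vocab))  # ['castle', 'horse', 'river']
```

Here "bridge" is dropped as too rare (1 of 4 documents) and "page" as too common (4 of 4).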
With this data we were able to use a tool called Latent Dirichlet Allocation (LDA) to find “clusters” of images in an automatic way. We chose the original 12 tags after manually going through 1,000 images on our own and deciding which categories made the most sense based on what we saw; but what if there are categories we overlooked or were unable to discern by hand? LDA solves this by trying to find a minimal set of tags where each document is represented by a set of tags, and each tag is represented by a set of words. Obviously the algorithm can’t provide meaning to each tag, so we provide meaning to the tag by looking at the words that are present or absent in each tag. We ran LDA on a sample of 10,000 images and found tag clusters for men, women, nature, and animals. Not coincidentally, these are similar to our original tags and represent a majority of our images.
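To make the idea of documents as mixtures of tags and tags as mixtures of words more concrete, here is a toy LDA fitted by collapsed Gibbs sampling. It is a teaching sketch only: the function `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the four tiny documents are all assumptions made for illustration, and a real run on 10,000 images would normally rely on an established library rather than code like this.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; returns topic-word and doc-topic counts."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_words = len(vocab)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    doc_topic = [[0] * n_topics for _ in docs]
    # Start from a random topic assignment for every word occurrence.
    assign = []
    for d, doc in enumerate(docs):
        assign.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            assign[d].append(k)
            topic_word[k][w] += 1
            topic_total[k] += 1
            doc_topic[d][k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the current assignment before resampling it.
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                doc_topic[d][k] -= 1
                # P(topic) is proportional to (doc-topic) * (topic-word) terms.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * n_words)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                topic_word[k][w] += 1
                topic_total[k] += 1
                doc_topic[d][k] += 1
    return topic_word, doc_topic


# Hypothetical word lists for four images.
docs = [
    ["lion", "tiger", "lion"],
    ["tiger", "bear"],
    ["tower", "church"],
    ["church", "tower", "bridge"],
]
topic_word, doc_topic = lda_gibbs(docs, n_topics=2)
for k, counts in enumerate(topic_word):
    # The most frequent words in a topic are what a person reads to name it.
    print(k, sorted(counts, key=counts.get, reverse=True)[:3])
```

Note that the number of topics has to be supplied up front, and the topics come back as unnamed word lists that still need a human to interpret them.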
This doesn’t solve our quest for expanding our tag universe though. One strategy we thought about was to just use the set of words from each page as the tags for each image. We quickly found, however, that most of the words around each image are irrelevant to the image, and in fact sometimes there was no relation at all. To solve this problem, we used a voting system [1]. From our computational algorithm, we found the 20 images most similar to the image in question. We then looked for the words that were found most often in the pages around these 20 images. We then used these words to describe the image in question. This actually works quite well in practice! We’re now trying to combine this strategy (finding generalised tags for images) with the simpler strategy (unique words that describe images) to come up with tags that describe images at different “levels”.
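The voting step can be sketched as follows. The example assumes every image is represented by a numeric feature vector and that similarity is measured by Euclidean distance; both are assumptions for illustration, as are the names `nearest_neighbours` and `vote_for_tags` and the toy data.

```python
import math
from collections import Counter


def nearest_neighbours(features, query_id, k):
    """Return the ids of the k images closest to query_id (Euclidean distance)."""
    query = features[query_id]
    others = [i for i in features if i != query_id]
    return sorted(others, key=lambda i: math.dist(features[i], query))[:k]


def vote_for_tags(features, page_words, query_id, k=20, n_tags=5):
    """Pick the words that occur around the most neighbouring images."""
    votes = Counter()
    for neighbour in nearest_neighbours(features, query_id, k):
        # Each neighbour votes at most once per word.
        votes.update(set(page_words[neighbour]))
    return [word for word, _ in votes.most_common(n_tags)]


# Hypothetical feature vectors and page words for four images.
features = {"a": (0.0, 0.0), "b": (0.1, 0.0), "c": (0.0, 0.2), "d": (5.0, 5.0)}
page_words = {
    "a": ["chapter", "journey"],
    "b": ["elephant", "india"],
    "c": ["elephant", "tusk"],
    "d": ["ship", "harbour"],
}
print(vote_for_tags(features, page_words, "a", k=2, n_tags=1))  # ['elephant']
```

The words on the query image's own page play no part here: only words that recur across its visual neighbours win the vote, which is what filters out unrelated page text.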
We started with a very ambitious goal: given only the image input, can we give a machine-generated, natural-language description of the image with a reasonably high degree of accuracy and usefulness? Given the difficulty of the task and the constraints of our timeframe, we didn’t expect to get perfect results, but we hoped to come up with a first prototype to demonstrate some of the recent advances and techniques that we hope will be promising for research and application in the future.
We planned to look at two approaches to this problem:
- Similarity-based captioning. Images that are very similar to each other under a distance metric often share common objects, traits, and attributes that show up in the distribution of words in their captions. By pooling words together from a bag of captions of similar images, one can come up with a reasonable caption for the target image.
- Learning-based captioning. By utilising a CNN similar to what we used for tagging, we can capture higher-level features in images. We then attempt to learn the mappings between the higher-level features and their representations in words, using either another neural network or other methods.
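One possible reading of the similarity-based approach is to pick, from the captions of the most similar images, the one that agrees best with the rest. The sketch below does this with a simple word-overlap (Jaccard) score; the scoring rule, the function names and the example captions are assumptions for illustration, not a description of what the project implemented.

```python
def jaccard(a, b):
    """Word-set overlap between two captions, from 0 (disjoint) to 1 (identical)."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def consensus_caption(captions):
    """Return the caption that agrees most, on average, with the others."""
    def score(i):
        rest = [c for j, c in enumerate(captions) if j != i]
        return sum(jaccard(captions[i], c) for c in rest) / max(len(rest), 1)
    return captions[max(range(len(captions)), key=score)]


# Hypothetical captions of the images most similar to the target image.
neighbour_captions = [
    "a man riding a horse",
    "a man standing next to a horse",
    "a ship in a harbour",
]
print(consensus_caption(neighbour_captions))  # a man riding a horse
```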
We have made some promising forays into the second technique. As a first step, we used a pre-trained CNN-RNN architecture called NeuralTalk to caption our images. As the models are trained on the Microsoft COCO dataset, which consists of pictures and photographs that differ significantly from the British Library's Flickr dataset, we expected the transfer of knowledge to be difficult. Indeed, the resulting captions of some 1,000 test images show that weakness, with the exclusively black-and-white British Library illustrations and the more abstract nature of some illustrations being major roadblocks to caption quality. Many of the captions would comment on the “black and white” quality of the photo or “hallucinate” objects that did not exist in the images. However, there were some promising results that came back from the model. Below are some hand-picked examples. Note that these were generated with no other metadata; only the raw image was given.
From a rough manual pass, we estimate that around 1 in 4 captions are of useable quality, that is, accurate and containing interesting and useful data that would aid in search and discovery, cataloguing, etc., with occasional gems (like the elephant caption!). More work will be directed at increasing this metric.
We have been working on building the web interface to expose this novel tag data to users around the world.
One thing that’s awesome about making the British Library dataset available via Flickr is that Flickr provide an amazing API for developers. The API exposes, among other functions, the image website’s search logic via tags as well as free text search using the image title and description, and the capability to sort by a number of factors including relevance and “interestingness”. We’ve been working on using the Flickr API, along with AngularJS and Node.js, to build a wireframe site. You can check it out here.
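As an illustration of the kind of request involved, the sketch below builds a URL for Flickr's `flickr.photos.search` method, combining free text, tags and a sort order. It only constructs the URL and makes no network call; the helper name `search_url` and the placeholder values `YOUR_API_KEY` and `ACCOUNT_NSID` are hypothetical and would need to be replaced with real credentials and the account's identifier. It is written in Python for brevity and is not taken from the site's own Node.js code.

```python
from urllib.parse import urlencode

FLICKR_REST = "https://api.flickr.com/services/rest/"


def search_url(api_key, user_id, text=None, tags=None, sort="relevance"):
    """Build a flickr.photos.search request URL (no network call is made)."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "user_id": user_id,
        "sort": sort,  # e.g. "relevance" or "interestingness-desc"
        "format": "json",
        "nojsoncallback": 1,
    }
    if text:
        params["text"] = text  # free text search over title and description
    if tags:
        params["tags"] = ",".join(tags)  # comma-separated list of tags
    return FLICKR_REST + "?" + urlencode(params)


print(search_url("YOUR_API_KEY", "ACCOUNT_NSID", text="elephant"))
```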
If you look at the demo or the British Library's Flickr album, you’ll see that each image has a relatively sparse set of tags to query from. Thus, our next steps will be adding our own tags and captions to each image on Flickr. We will prepend these with a custom namespace to distinguish them from existing user-contributed and machine tags, and utilise them in queries to find better results.
Finally, we are interested in what users will use the site for. For example, we could track users’ queries and which images they click on or save. These images are presumably more relevant to these queries, so we could rank them higher in future queries. We also want to be able to track general analytics like the most popular queries over time. Thus incorporating user analytics will be the final step in building the web interface.
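Flickr machine tags follow the pattern `namespace:predicate=value`, so namespaced tags could be produced along these lines. The namespace `sherlocknet` and the predicate `tag` are placeholders chosen for this sketch, not names the project has settled on.

```python
def machine_tag(predicate, value, namespace="sherlocknet"):
    """Format a machine tag as namespace:predicate=value."""
    return f"{namespace}:{predicate}={value.strip().lower()}"


def in_namespace(tag, namespace="sherlocknet"):
    """True if the tag belongs to the given namespace."""
    return tag.startswith(namespace + ":")


tags = [machine_tag("tag", t) for t in ["Animals", "nature"]]
print(tags)  # ['sherlocknet:tag=animals', 'sherlocknet:tag=nature']

# Existing tags without the prefix are easy to tell apart.
print([t for t in tags + ["elephant"] if not in_namespace(t)])  # ['elephant']
```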
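A minimal sketch of how click tracking could feed back into ranking is shown below. The class `ClickLog` and its methods are hypothetical, and a production version would persist the counts in a database rather than in memory.

```python
from collections import Counter, defaultdict


class ClickLog:
    """Count clicks per (query, image) and use them to re-rank results."""

    def __init__(self):
        self.clicks = defaultdict(Counter)
        self.queries = Counter()

    def record_query(self, query):
        self.queries[query.lower()] += 1

    def record_click(self, query, image_id):
        self.clicks[query.lower()][image_id] += 1

    def rerank(self, query, results):
        counts = self.clicks[query.lower()]
        # sorted() is stable, so images with equal clicks keep their order.
        return sorted(results, key=lambda image_id: -counts[image_id])

    def popular_queries(self, n=10):
        return [q for q, _ in self.queries.most_common(n)]


log = ClickLog()
for query in ["elephant", "Elephant", "castle"]:
    log.record_query(query)
for image_id in ["img3", "img3", "img2"]:
    log.record_click("elephant", image_id)

print(log.rerank("elephant", ["img1", "img2", "img3"]))  # ['img3', 'img2', 'img1']
print(log.popular_queries(1))  # ['elephant']
```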
We welcome any feedback and questions you may have! Contact us at [email protected]
[1] Johnson J, Ballan L, Fei-Fei L. Love Thy Neighbors: Image Annotation by Exploiting Image Metadata. arXiv (2016)