03 November 2016
SherlockNet update - 10s of millions more tags and thousands of captions added to the BL Flickr Images!
SherlockNet are Brian Do, Karen Wang and Luda Zhao, finalists for the Labs Competition 2016.
We have some exciting updates regarding SherlockNet, our ongoing efforts to using machine learning techniques to radically improve the discoverability of the British Library Flickr Commons image dataset.
Over the past two months we’ve been working on expanding and refining the set of tags assigned to each image. Initially, we set out simply to assign the images to one of 11 categories, which worked surprisingly well with less than a 20% error rate. But we realised that people usually search from a much larger set of words, and we spent a lot of time thinking about how we would assign more descriptive tags to each image.
Eventually, we settled on a Google Images style approach, where we parse the text surrounding each image and use it to get a relevant set of tags. Luckily, the British Library digitised the text around all 1 million images back in 2007-8 using Optical Character Recognition (OCR), so we were able to grab this data. We explored computational tools such as Term Frequency – Inverse Document Frequency (Tf-idf) and Latent Dirichlet allocation (LDA), which try to assign the most “informative” words to each image, but found that images aren’t always associated with the words on the page.
To solve this problem, we decided to use a 'voting' system where we find the 20 images most similar to our image of interest, and have all images vote on the nouns that appear most commonly in their surrounding text. The most commonly appearing words will be the tags we assign to the image. Despite some computational hurdles selecting the 20 most similar images from a set of 1 million, we were able to achieve this goal. Along the way, we encountered several interesting problems.
- Spelling was a particularly difficult issue. The OCR algorithms that were state of the art back in 2007-2008 are now obsolete, so a sizable portion of our digitised text was misspelled / transcribed incorrectly. We used a pretty complicated decision tree to fix misspelled words. In a nutshell, it amounted to finding the word that a) is most common across British English literature and b) has the smallest edit distance relative to our misspelled word. Edit distance is the fewest number of edits (additions, deletions, substitutions) needed to transform one word into another.
- Words come in various forms (e.g. ‘interest’, ‘interested’, ‘interestingly’) and these forms have to be resolved into one “stem” (in this case, ‘interest’). Luckily, natural language toolkits have stemmers that do this for us. It doesn’t work all the time (e.g. ‘United States’ becomes ‘United St’ because ‘ates’ is a common suffix) but we can use various modes of spell-check trickery to fix these induced misspellings.
- About 5% of our books are in French, German, or Spanish. In this first iteration of the project we wanted to stick to English tags, so how do we detect if a word is English or not? We found that checking each misspelled (in English) word against all 3 foreign dictionaries would be extremely computationally intensive, so we decided to throw out all misspelled words for which the edit distance to the closest English word was greater than three. In other words, foreign words are very different from real English words, unlike misspelled words which are much closer.
- Several words appear very frequently in all 11 categories of images. These words were ‘great’, ‘time’, ‘large’, ‘part’, ‘good’, ‘small’, ‘long’, and ‘present’. We removed these words as they would be uninformative tags.
In the end, we ended up with between 10 and 20 tags for each image. We estimate that between 30% and 50% of the tags convey some information about the image, and the other ones are circumstantial. Even at this stage, it has been immensely helpful in some of the searches we’ve done already (check out “bird”, “dog”, “mine”, “circle”, and “arch” as examples). We are actively looking for suggestions to improve our tagging accuracy. Nevertheless, we’re extremely excited that images now have useful annotations attached to them!
For the past few weeks we’ve been working on the incorporation of ~20 million tags and related images and uploading them onto our website. Luckily, Amazon Web Services provides comprehensive computing resources to take care of storing and transferring our data into databases to be queried by the front-end.
In order to make searching easier we’ve also added functionality to automatically include synonyms in your search. For example, you can type in “lady”, click on Synonym Search, and it adds “gentlewoman”, “ma'am”, “madam”, “noblewoman”, and “peeress” to your search as well. This is particularly useful in a tag-based indexing approach as we are using.
As our data gets uploaded over the coming days, you should begin to see our generated tags and related images show up on the Flickr website. You can click on each image to view it in more detail, or on each tag to re-query the website for that particular tag. This way users can easily browse relevant images or tags to find what they are interested in.
Each image is currently captioned with a default description containing information on which source the image came from. As Luda finishes up his captioning, we will begin uploading his captions as well.
We will also be working on adding more advanced search capabilities via wrapper calls to the Flickr API. Proposed functionality will include logical AND and NOT operators, as well as better filtering by machine tags.
As mentioned in our previous post, we have been experimenting with techniques to automatically caption images with relevant natural language captions. Since an Artificial Intelligence (AI) is responsible for recognising, understanding, and learning proper language models for captions, we expected the task to be far harder than that of tagging, and although the final results we obtained may not be ready for a production-level archival purposes, we hope our work can help spark further research in this field.
Our last post left off with our usage of a pre-trained Convolutional Neural Networks - Recurrent Neural Networks (CNN-RNN) architecture to caption images. We showed that we were able to produce some interesting captions, albeit at low accuracy. The problem we pinpointed was in the training set of the model, which was derived from the Microsoft COCO dataset, consisting of photographs of modern day scenes, which differs significantly from the BL Flickr dataset.
Through collaboration with BL Labs, we were able to locate a dataset that was potentially better for our purposes: the British Museum prints and drawing online collection, consisting of over 200,000 print drawing, and illustrations, along with handwritten captions describing the image, which the British Museum has generously given us permission to use in this context. However, since the dataset is directly obtained from the public SPARQL endpoints, we needed to run some pre-processing to make it usable. For the images, we cropped them to standard 225 x 225 size and converted them to grayscale. For caption, pre-processing ranged from simple exclusion of dates and author information, to more sophisticated “normalization” procedures, aimed to lessen the size of the total vocabulary of the captions. For words that are exceeding rare (<8 occurrences), we replaced them with <UNK> (unknown) symbols denoting their rarity. We used the same neuraltalk architecture, using the features from a Very Deep Convolutional Networks for Large-Scale Visual Recognition (VGGNet) as intermediate input into the language model. As it turns out, even with aggressive filtering of words, the distribution of vocabulary in this dataset was still too diverse for the model. Despite our best efforts to tune hyperparameters, the model we trained was consistently over-sensitive to key phrases in the dataset, which results in the model converging on local minimums where the captions would stay the same and not show any variation. This seems to be a hard barrier to learning from this dataset. We will be publishing our code in the future, and we welcome anyone with any insight to continue on this research.
The British Museum dataset (Prints and Drawings from the 19th Century) however, does contain valuable contextual data, and due to our difficulty in using it to directly caption the dataset, we decided to use it in other ways. By parsing the caption and performing Part-Of-Speech (POS) tagging, we were able to extract nouns and proper nouns from each caption. We then compiled common nouns from all the images and filtered out the most common(>=500 images) as tags, resulting in over 1100 different tags. This essentially converts the British Museum dataset into a rich dataset of diverse tags, which we would be able to apply to our earlier work with tag classification. We trained a few models with some “fun” tags, such as “Napoleon”, “parrots” and “angels”, and we were able to get decent testing accuracies of over 75% on binary labels. We will be uploading a subset of these tags under the “sherlocknet:tags” prefix to the Flickr image set, as well as the previous COCO captions for a small subset of images(~100K).
You can access our interface here: bit.ly/sherlocknet or look for 'sherlocknet:tag=' and 'sherlocknet:category=' tags on the British Library Flickr Commons site, here is an example, and see the image below:
Please check it out and let us know if you have any feedback!
We are really excited that we will be there in London in a few days time to present our findings, why don't you come and join us at the British Library Labs Symposium, between 0930 - 1730 on Monday 7th of November, 2016?