THE BRITISH LIBRARY

Digital scholarship blog

17 posts categorized "Printed books"

27 July 2017

A workshop on Optical Character Recognition for Bangla


I was fortunate enough to travel to Kolkata recently along with other members of the Two Centuries of Indian Print team, where we ran a workshop on ‘Developments with Optical Character Recognition for Bangla’. The event, which took place at Jadavpur University, proved an excellent forum for sharing knowledge in this area of growing interest, which was reflected in the range of library professionals, academics and computer scientists who attended from ten institutions across Bengal and from the US.

Applying Optical Character Recognition (OCR) to printed texts is one of the key expectations of 21st century scholars and library users, who want to quickly find information online that accurately meets their research needs. Cultural institutions are gateways to millions of items containing knowledge that can transform modern research. The workshop looked at the developments, challenges and opportunities of OCR in opening up vast quantities of knowledge to digital researchers.

Dr. Naira Khan from the University of Dhaka’s Computational Linguistics department kicked off the workshop by introducing how OCR works, including ‘pre-processing’ steps such as binarisation, which reduces a scanned page of text to its binary form to remove background noise and isolate only the text on the page. Skew detection, another pre-processing technique, corrects scans with angled text, which can cause problems for OCR systems that require perfectly horizontal or vertical text. Dr. Khan moved on to explain how OCR systems segment pages into text and non-text regions, right down to pixel level, in order to recognise word boundaries. When it comes to recognising individual characters, Bangla script presents some unique challenges, as it contains a vast range of compound characters, vowel signs and ligatures, not to mention the distinctive top line connecting characters known as the ‘Matra’. Breaking the characters into their geometric features such as lines, arcs and circles enables combinations of features to be formed, classified as characters and expressed in digital form as OCR output.
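As a very rough illustration of those two pre-processing steps, here is a minimal Python sketch using OpenCV; the choice of library, Otsu thresholding and the minimum-area-rectangle skew estimate are all assumptions made for illustration, not anything prescribed at the workshop:

```python
import cv2
import numpy as np

def binarise_and_estimate_skew(path):
    """Illustrative pre-processing only: binarise a scanned page and
    estimate its skew angle in degrees."""
    grey = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # Binarisation: Otsu's method picks a global threshold separating dark
    # text pixels from the lighter page background (removing 'noise').
    _, binary = cv2.threshold(grey, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Skew estimate: the angle of a minimum-area rectangle fitted around
    # all text pixels approximates how far the printed lines are rotated.
    ys, xs = np.where(binary > 0)
    coords = np.column_stack((xs, ys)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:          # fold the reported angle into a small deviation
        angle -= 90         # either side of horizontal
    return binary, angle    # the page can then be rotated to compensate
```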


Dr. Khan introducing the concepts of OCR

After Dr. Khan’s inspiring talk, attendees learned about the British Library’s particular challenge in finding an OCR solution for our 19th century Bengali books currently being digitised, and the potential use of an OCR’d dataset for Digital Humanities researchers wanting to perform text and data mining. The books span an enormous range of genres, from works by religious missionaries to those covering food and science, as well as works of fiction. Obtaining OCR would therefore enable automated searching and analysis of the full text across hundreds of thousands of pages, which could lead to exciting research discoveries in South Asian studies.

The event concluded with a practical session during which attendees used different OCR software on a sample of the BL’s digitised Bengali books. They experimented with Tesseract, Google Drive, i2ocr and newOCR. The general consensus was that Google Drive proved the most accurate, although there are other tools we have only just begun to try out, such as Transkribus, that may also be useful.
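For anyone who wants to repeat the Tesseract part of the exercise, a minimal sketch looks something like this; it assumes Tesseract is installed together with its Bengali language pack and the pytesseract wrapper, and the filename is hypothetical:

```python
from PIL import Image
import pytesseract

# 'ben' is Tesseract's language code for Bengali/Bangla; the corresponding
# traineddata file must be installed alongside Tesseract itself.
page = Image.open("digitised_bengali_page.tif")   # hypothetical filename
text = pytesseract.image_to_string(page, lang="ben")
print(text)
```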

Workshop participants trying out various OCR tools

All in all, the workshop proved a really worthwhile exercise in widening knowledge among Indian institutions about the challenges and possible uses of OCR for Bangla. The work currently being undertaken by universities and technology centres using state-of-the-art machine learning techniques for text recognition will hopefully close the gap between Bangla (as well as other Indic scripts) and Latin scripts when it comes to efficient OCR tools.

 

This is a post by Tom Derrick, Digital Curator for the Two Centuries of Indian Print project.

17 July 2017

A Wonderland of Knowledge - Behind the Scenes of the British Library (Nadya Miryanova work experience)


Posted by Nadya Miryanova, BL Labs School Work Placement Student, currently studying at Lady Eleanor Holles, working with Mahendra Mahey, Manager of BL Labs.

British Library
Introduction to the British Library

Day 1

It was with a mixture of anticipation, curiosity and excitement that I opened the door to the staff entrance and started my two-week work placement in the world’s largest library. I have been placed with BL Labs in the Digital Scholarship department, where I am working with Mahendra Mahey (Project Manager of BL Labs) for the next two weeks. After the inescapable health and safety induction, I am now extremely well acquainted with the BL’s elaborate fire alarm system, and following lunch at the staff restaurant, Mahendra provided me with an introduction to the British Library and explained the work undertaken by BL Labs.

When most people hear the word ‘library’, conventional ideas typically spring to mind, including a copious number of books, and, of course, a disgruntled librarian ironically rather loudly encouraging silence every five minutes. I must admit that initially, my perspective was the same.

However, my viewpoint was soon to be completely turned around.

BL interior
British Library interior

An extraordinary institution, the British Library is indeed widely known for its remarkable collection of books; it is home to around 14 million. However, contrary to popular belief, these are only a small section of the Library’s vast collections. In fact, the British Library holds an extremely diverse range of items, ranging from patents to musical scores, and from ancient artifacts dating as far back as 1000 BC to this morning’s newspapers, altogether giving a grand figure of approximately 200 million documented items. I was also delighted to discover that the British Library has the world’s largest collection of stamps! It is estimated that if somebody looked at 5 items each day, it would take an astonishing 80,000 years to see the whole of the BL’s collections.

I learnt that the objective of BL Labs is to encourage scholars, innovators, artists, entrepreneurs and educators to work with the Library's digital collections, supporting its mission to ensure that the wealth and diversity of the Library’s intellectual digital heritage is available for the research, creativity and fulfilment of everyone. At BL Labs, anyone is invited to address an important research question or idea using the Library’s digital content and data, by entering the annual Awards, becoming involved in a collaborative project, or simply using the collections in whatever way they want.

Although I was initially a little nervous when entering this immense institution, my fears evaporated completely when, on my very first day of working here, I was brought into a friendly, welcoming atmosphere, promoted by the sincere kindness and interest I was met with from every member of the Library's staff.

Books Image
The George IV British Library book collection

Day 2

At precisely 9 o’clock in the morning, I found myself seated at my office desk, looking at the newly filled out Outlook calendar on my computer to see what new and exciting tasks I would be faced with that day and looking out for any upcoming events. My Tuesday consisted mostly of independent work at my desk, and after a quick catch-up with Mahendra at 9.30, where we discussed the working plan for the day and reviewed yesterday’s work, I sat down to start my second full day of work at the British Library.

BL labs symposium
British Library Labs leaflet

Between 2013 and 2016, British Library Labs held a competition that looked for transformative project ideas using the British Library’s digital collections and data in new and exciting ways. The BL Labs Awards recognise outstanding and innovative work that has been carried out using these collections. Mahendra had previously introduced me to the Competition and Awards pages of the BL Labs website, and my main objective was to update the ideas and project submissions on these pages, specifically adding the remaining 2016 Competition entries and reviewing the 2015 and 2014 entries to check that they were all complete, with no entries missing. The competition entries can be accessed on the website http://labs.bl.uk/Ideas+for+Labs.

This was an excellent opportunity for me to work on a new editing platform and further enhance my editing skills, which will doubtless prove very useful in everyday life as well as in the future. As I worked through editing and updating the pages, what struck me most was the incredible diversity and wide variety of ideas within the competition entries. From a project exploring Black Abolitionists and their presence in Britain, to the proposed creation of a Victorian meme machine, and from a planned political meetings mapper, to a suggested Alice in Wonderland bow tie design, each idea was unique and original, despite the fact that every entry adhered to the same brief. I was mesmerised by the amount of thought and careful planning evident in every submission; each one was intricately detailed and provided a careful and thorough plan of work.

Victorian Meme
An example of a Victorian meme

After finishing lunch relatively early, I found myself with half an hour of my allocated break still left and took the opportunity to explore the Library. I walked down to the visitors’ entrance and took a moment to admire the King’s Library, a majestic tower of books standing at the centre of the British Library. Stepping closer, I was able to read some of the inscriptions on the spines of the books and was delighted to see that one of them was a book of Catullus’ poetry, poetry that I had previously studied for Latin GCSE. The scope of knowledge that lies within this library is practically endless, and it led me to reflect on the importance of the work of BL Labs. I thought back to the competition entries; they prove that the possibilities for projects truly have no limit. BL Labs is able to give scholars, academics and students the opportunity to access some of these digital collections, such as books, very easily and from any part of the world. Without this access, many of the wonderful projects that the BL currently works on would not be possible.

With that thought fresh in my mind, I was brought back to reality, and returned to my desk to continue working, this time on my mini-project. My last task for the day involved brainstorming ideas for this project. A direct focus was soon established, and I decided to explore the Russian language titles in the 65,000 digitised 19th Century Microsoft books. Later on, I shall be writing a blog post detailing my experience of working on this project.

Day 3

As the Piccadilly line train arrived at St Pancras, I actually managed to step off and head in completely the right direction for the first time that week (needless to say, my sense of direction is not the best). Feeling rather proud of myself, I walked with a skip in my step, ready to immerse myself in whatever plan of work awaited me today.

I looked at the schedule for the day and my heart leapt: I was to attend my first ever proper staff meeting. It was a very technical meeting, opened by the Head of Digital Scholarship, Adam Farquhar, who talked about current activities taking place in the Digital Scholarship department. Everyone contributed to the general discussion, and Mahendra talked about the development of the BL Labs work and the progress made so far. It also provided me with an opportunity to talk about some of the things I was presently doing, and I found that everybody was very receptive and supportive. I found it very interesting to be introduced to people who work in this area with the British Library on a day-to-day basis, and I enjoyed hearing about all the different projects currently being undertaken.

SherlockNet web interface

I then began some YouTube transcription work on interviews with the winners of the 2016 BL Labs Competition, the first being SherlockNet. The SherlockNet team used convolutional neural networks to automatically tag and caption the British Library Flickr collection of digitised images taken largely from 19th Century books. If that doesn't sound impressive enough, consider the fact that this entry was submitted by three people, who were just 19 years old (undergraduate university students). My work involved listening carefully to each of the interviews and typing up exactly what Luda Zhao, Karen Wang and Brian Do were saying in a separate Word document. This document would then be used to make subtitles for the final film and would prove invaluable when creating a storyboard for the final cut-down interview.

BL poster
British Library Alice in Wonderland Poster

Day 4

As I turned the corner of Midland Road and stood facing the traffic lights, my gaze wandered over to the now familiar Alice in Wonderland poster with ‘British Library’ printed on it in block capitals. I smiled as I looked up at the Cheshire Cat perched neatly on top of the first 'I' in the words 'British Library', and the cat smiled back, revealing a wide toothy grin. Alice, likewise, was looking up at the Cheshire Cat, and in that moment her situation felt very real to me. She was surrounded by the entirely new world of Wonderland, and in a similar way I find myself in a parallel world of continuous acquisition of knowledge, learning something new each day, with the British Library as my Wonderland. A wonderful and well-known literary extract from Lewis Carroll came to mind:

"Would you tell me, please, which way I ought to go from here?" (Alice)

"That depends a good deal on where you want to get to," said the Cat.

"I don't much care where--" said Alice.

"Then it doesn't matter which way you go," said the Cat.

"--so long as I get somewhere," Alice added as an explanation.

"Oh, you're sure to do that," said the Cat, "if you only walk long enough."

With this in mind, I briskly walked over to the doors of the office.

The beginning of my day consisted mostly of working on my own project, further classifying a sub-collection of Russian titles from the digitised collection of 65,000 books, mostly from the 19th century. I worked on enhancing the organisation and categorisation of these books, establishing a clear, methodical approach that began with sorting the books into two categories: fiction and non-fiction. Curiously, the majority of the titles were actually non-fiction. After an email correspondence with Katya Rogatchevskaia, Lead Curator of East European Collections, I discovered that most of the books that were digitised had been acquired at the time they were published, so they were selected by Katya’s distant predecessors, a fact I found remarkable.

Nicholas II abdication in Russian
The Act of Abdication of Nicholas II and his brother Grand Duke Michael,
published as a placard that would be distributed
by hand or pasted to walls (shelfmark: HS.74/1870),
an example of a Russian language title that is now digitised

For the second half of the day, I focussed once more on the YouTube transcription work and managed to finish transcribing the interviews for SherlockNet. I then discussed with Mahendra how I would storyboard the interviews in preparation for the film editing process. First, I had to pick out the sections of each interview that were most suitable to use in the film, marking the exact timings of when the person started and finished speaking, and then place the series of timings in chronological order. I was also able to choose the music for the end product (possibly my favourite part!), basing my selection on the mood of the videos and my perception of the characters of the individuals. I concluded my day by finding a copyright-free YouTube music page and discovering an assortment of possible music tracks. I managed to narrow the selection down to four possible soundtracks, which included titles such as ‘Spring in my Step’ and ‘Good Starts’.

Day 5

As I swiped my staff pass across the reader that permits access to the building, I checked my phone to see the time. It was 8.30am, and at the same moment I caught sight of the date: Friday 14th July. I stopped in my tracks. Today marked the end of my first full working week at the British Library; I could hardly believe how quickly the time had gone! It forcibly reminded me of the inscription on my clock at home, ‘tempus fugit’ (time flees), because if there is one thing that has gone abnormally fast during my time at the BL, it is time.

Hebrew manuscript
Digitised Hebrew Manuscript available through the British Library

In the morning, I attended a meeting discussing an event Mahendra is planning around the digitised Hebrew manuscripts, and I was lucky enough to meet Ilana Tahan, the Lead Curator of Hebrew and Christian Orient Collections. The meeting included a telephone call to Eva Frojmovic, an academic at the Centre for Jewish Studies in the School of Fine Art, History of Art and Cultural Studies at the University of Leeds. The discussion centred mostly on an event at which the BL would talk about its collection of digitised Hebrew manuscripts in order to promote their free use by the general public. These very beautiful Hebrew manuscripts could have a very wide audience, perhaps reaching beyond the academic sphere and having the potential to be used in the creative and artistic space.

Contrary to popular belief, the collection of 1,302 digitised manuscripts can be used by anyone and everyone, leading to exciting possibilities and new projects. The amazing thing about the digital collections is that they make it possible for someone who does not live in London to access them, wherever they may be in the world; the manuscripts can be viewed digitally and used to enhance any learning experience, from seminars and lessons to PhD research projects. The physical manuscripts can also, of course, be consulted at the British Library. The structure and timings of the event were discussed, and dates were set for the next meeting and for the event itself. To finish the meeting, Mahendra offered an explanation of the handwriting recognition and transcription process for the manuscripts. There are 22 letters in the Hebrew alphabet, and each individual handwritten letter is recognised as a shape by the computer, though it is important that the computer has ground truth (i.e. examples of manuscripts transcribed by humans). Each letter and word is recognised and processed, converting the original handwritten Hebrew script into machine-readable Hebrew text. This then allows someone to search for words in the manuscript easily and quickly using a computerised search tool.

Ilana looking at manuscripts
Ilana Tahan, Lead Curator of Hebrew and Christian Orient Collections,
looking through Hebrew manuscripts

For the majority of the afternoon, I was floating between a variety of different projects, doing more work on the YouTube transcriptions and enhancing my mini-project, as well as creating a table of the outstanding blogs that still had to be published on the British Library's Digital Scholarship blog.

At the end of the day, I did a review of my first week, evaluating the progress that I had made with Mahendra. Throughout the week, I feel that I have enhanced and developed a number of invaluable skills, and have gained an incredible insight into the working world.

I will be writing about my second week, as well as my mini-project soon, so please come and visit this blog again if you are interested to find out more about some of the work being done at the British Library.

 

 

16 May 2017

Michael Takeo Magruder @ Gazelli Art House


Posted by Mahendra Mahey (Manager of BL Labs) on behalf of Michael Takeo Magruder (BL Labs Artist/Researcher in Residence).

Michael Takeo Magruder's Gazell.io works

Earlier this year I was invited by Gazelli Art House to be a digital artist-in-residence on their online platform Gazell.io. After a series of conversations with Gazelli’s director, Mila Askarova, we decided it would be a perfect opportunity to broker a partnership with British Library Labs and use the occasion to publish some of the work-in-progress ideas from my Imaginary Cities project at the British Library.

Given Gazelli’s growing interest in and reputation for exhibiting virtual reality (VR) art, we chose to launch my March showcase with A New Jerusalem since it was in many ways the inspiration for the Imaginary Cities concept.

A New Jerusalem by Michael Takeo Magruder

During the second half of my Gazell.io residency I began publishing various aesthetic-code studies that had been created for the Imaginary Cities project. I was also invited by Gazelli to hold a private sharing event at their London gallery in Mayfair to showcase some of the project’s physical experiments and outcomes. The evening was organised by Gazelli’s Artist Liaison, Victoria Al-Din, and brought together colleagues from the British Library, art curators from leading cultural institutions and academics connected to media art practice. It was a wonderful event, and it was incredibly useful to be able to present my ideas and the resulting artistic-technical prototypes to a group with such a deep and broad range of expertise. 


Sharing works in progress for the Imaginary Cities project at Gazelli Art House, London. 30th March 2017

09 March 2017

Archaeologies of reading: guest post from Matthew Symonds, Centre for Editing Lives and Letters


Digital Curator Mia Ridge: today we have a guest post by Matthew Symonds from the Centre for Editing Lives and Letters on the Archaeologies of reading project, based on a talk he did for our internal '21st century curatorship' seminar series. Over to Matt...

Some people get really itchy about the idea of making notes in books, and dare not defile the pristine printed page. Others leave their books a riot of exclamation marks, sarcastic incredulity and highlighter pen.

Historians – even historians disciplined by spending years in the BL’s Rare Books and Manuscripts rooms – would much prefer it if people did mark books, preferably in sentences like “I, Famous Historical Personage, have read this book and think the following having read it…”. It makes it that much easier to investigate how people engaged with the ideas and information they read.

Brilliantly for us historians, rare books collections are filled with this sort of material. The problem is it’s also difficult to catalogue and make discoverable (nota bene – it’s hard because no institutions could afford to employ and train sufficient cataloguers, not because librarians don’t realise this is an issue).

The Archaeology of Reading in Early Modern Europe (AOR) takes digital images of books owned and annotated by two renaissance readers, the professional reader Gabriel Harvey and the extraordinary polymath John Dee, transcribes and translates all the comments in the margins, marks up all traces of a reader’s intervention with the printed book, and puts the whole thing on the Internet in a way designed to be useful and accessible to researchers and the general public alike.

Screenshot, The Archaeology of Reading in Early Modern Europe

AOR is a digital humanities collaboration between the Centre for Editing Lives and Letters (CELL) at University College London, Johns Hopkins University and Princeton University, and is generously funded by the Andrew W. Mellon Foundation.

More importantly, it’s also a collaboration between academic researchers, librarians and software engineers. An absolutely vital consideration in how we planned AOR, how we work on it and how we plan to expand it was to identify a project that could offer common ground shared between these three interests, where each party would have something to gain.

As one of the researchers, it was really important to me to avoid forming some sort of “client-provider” relationship with the librarians who curate and know so much about my sources, and the software engineers who build the digital infrastructure.

But we do use an academic problem as a means of giving our project a focus. In 1990, Anthony Grafton and the late Lisa Jardine published their seminal article ‘“Studied for Action”: How Gabriel Harvey Read His Livy’ in the journal Past & Present.

One major insight of the article is that people read books in conjunction with one another, often for specific, pragmatic purposes. People didn’t pick up a book from their shelves, open at page one and proceed through to the finis, marking up as they went. They put other books next to them, books that explained, clarified, argued with one another.

By studying the marginalia, it’s possible to reconstruct these pathways across a library, recreating the strategies people used to manage the vast quantities of information they had at their disposal.

In order to produce this archaeology of reading, we’ve built a “digital bookwheel”, an attempt to recreate the revolving reading desk of the renaissance period which allowed its lucky owner to move back and forth between their books. From here, the user can call up the books we’ve digitised, read the transcriptions, and search for particular words and concepts.

Screenshot, The Archaeology of Reading in Early Modern Europe


It’s built out of open source materials, leveraging the International Image Interoperability Framework (IIIF) and the IIIF-compliant Mirador 2 Viewer. Interested parties can download the XML files of our transcriptions, as well as the data produced in the process.
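As a rough illustration of what IIIF compliance gives a researcher, any published IIIF Presentation (2.x) manifest can be pulled down and walked through with a few lines of Python; the manifest URL below is a placeholder rather than AOR's actual endpoint:

```python
import requests

# Placeholder: substitute the manifest URL of the digitised book you want.
MANIFEST_URL = "https://example.org/iiif/book1/manifest"

manifest = requests.get(MANIFEST_URL, timeout=30).json()
print(manifest.get("label"))

# A IIIF 2.x manifest lists sequences of canvases, each pointing at the
# image resource for one page opening.
for canvas in manifest["sequences"][0]["canvases"]:
    image_url = canvas["images"][0]["resource"]["@id"]
    print(canvas.get("label"), image_url)
```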

The exciting thing for us is that all the work on creating this digital infrastructure – which is very much a work in progress – has provided us with the raw materials for asking new research questions, questions that can only be asked by getting away from our computers and returning to the rare books room.

24 January 2017

Publication of Quarterly Lists: Catalogues of Indian Books


The Two Centuries of Indian Print project is pleased to announce the online availability of some wonderful catalogues held by the Library, generally known as the Quarterly Lists. They record books published in British India between 1867 and 1947, compiled quarterly and by province.

Digitised for the first time, the Quarterly Lists can now be accessed as searchable PDFs via the British Library's datasets portal, data.bl.uk. Researchers will be able to examine rich bibliographic data about books published throughout India, including the names and addresses of printers and publishers, publication prices and how many copies were sold.

 

SV_412_8_1875-78_0003

 

Our next steps will be to OCR the Quarterly Lists and create ALTO XML for every page, a format designed to capture an accurate representation of the content and its layout. This will allow researchers to apply computational tools and methods across all of the lists to answer their questions about book history. So if a researcher is interested in what the history of book publishing reveals about a particular time period and place, we would like to make that possible by giving them full access to this dataset.
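Because ALTO records every recognised word together with its coordinates on the page, researchers will be able to work with the layout as well as the text. A minimal sketch of pulling the word strings and their positions out of an ALTO file (element and attribute names follow the ALTO schema; the filename is hypothetical):

```python
import xml.etree.ElementTree as ET

def alto_words(path):
    """Yield (text, x, y) for every <String> element in an ALTO XML file."""
    root = ET.parse(path).getroot()
    # The namespace URI differs between ALTO versions, so read it from the root tag.
    ns = root.tag.split("}")[0].strip("{")
    for string in root.iter(f"{{{ns}}}String"):
        yield string.get("CONTENT"), string.get("HPOS"), string.get("VPOS")

# Hypothetical filename for one digitised Quarterly Lists page:
for word, x, y in alto_words("quarterly_list_page_0001.xml"):
    print(word, x, y)
```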

To get to this point, however, we will have to overcome the layout challenge that the Quarterly Lists present. Across the lists we have found several different layout styles which are rather tricky for OCR solutions to handle meaningfully. Note for instance how the list below compares to the one from the Calcutta Gazette above. Through the Digital Research strand of the project we will be seeking out innovative research groups willing to take a crack at improving the OCR quality and the accuracy of tabular text extraction from the Quarterly Lists.

The Quarterly Lists available on data.bl.uk are out of copyright and openly licensed for reuse. If you or anyone you know is interested in using the Quarterly Lists in your research, or simply wants to find out more about them, feel free to drop me an email at Tom.Derrick@bl.uk or follow the project @BL_IndianPrint.

You can read more about the history of the Quarterly Lists, in a previous blog I wrote last year.

10 November 2016

British Library Labs Symposium 2016 - Competition and Award Winners


The 4th annual British Library Labs Symposium took place on 7th November 2016 and was a resounding success! 

More than 220 people attended, and the event was a fantastic experience, showcasing and celebrating the field of Digital Scholarship and highlighting the work of BL Labs and their collaborators. The Symposium included a number of exciting announcements about the winners of the BL Labs Competition and BL Labs Awards, who are presented in this blog post. Separate posts will be published about the runners-up of the Competition and Awards, and posts written by all of the winners and runners-up about their work are also scheduled for the next few weeks – watch this space!

BL Labs Competition winner for 2016

Roly Keating, Chief Executive of the British Library, announced that the overall winner of the BL Labs Competition for 2016 was...

SherlockNet: Using Convolutional Neural Networks to automatically tag and caption the British Library Flickr collection
By Karen Wang and Luda Zhao, Masters students at Stanford University, and Brian Do, Harvard Medicine MD student

Machine learning can extract information and insights from data on a massive scale. The project developed and optimised Convolutional Neural Networks (CNNs), inspired by biological neural networks in the brain, in order to tag and caption the British Library’s Flickr Commons collection of one million images. In the first stage of the project, images were classified with general categorical tags (e.g. “people”, “maps”). This served as the basis for developing new ways to facilitate rapid online tagging with user-defined sets of tags. In the second stage, automatically generated descriptive natural-language captions were provided for images (e.g. “A man in a meadow on a horse”). This computationally guided approach has produced automatic pattern recognition that provides a more intuitive way for researchers to discover and use images. The tags and captions will be made accessible and searchable by the public through a web-based interface, and the text annotations will be used to analyse global trends in the Flickr collection over time.
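SherlockNet's own code is not reproduced in this post, but the general shape of CNN-based tagging (take a network pre-trained on a large image collection, drop its classification head, and train a small classifier over the resulting features for the categories above) can be sketched with torchvision. The choice of ResNet-50, the frozen backbone and the layer sizes are illustrative assumptions, not the team's actual architecture:

```python
import torch
import torch.nn as nn
from torchvision import models

# Pre-trained CNN used as a fixed feature extractor (transfer learning).
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()          # drop the ImageNet classification head
for param in backbone.parameters():
    param.requires_grad = False      # freeze the convolutional features

NUM_CATEGORIES = 11                  # e.g. "people", "maps", ... as above
classifier = nn.Linear(2048, NUM_CATEGORIES)   # ResNet-50 features are 2048-d

def predict_categories(images):
    """images: tensor of shape (N, 3, 224, 224), already normalised."""
    with torch.no_grad():
        features = backbone(images)
    return classifier(features).softmax(dim=1)   # per-category probabilities
```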

SherlockNet team presenting at the Symposium

Karen Wang is currently a senior studying Computer Science at Stanford University, California. She also has an Art Practice minor. Karen is very interested in the intersection of computer science and humanities research, so this project is near and dear to her heart! She will be continuing her studies next year at Stanford in CS, Artificial Intelligence track.

Luda Zhao is currently a Masters student studying Computer Science at Stanford University, living in Palo Alto, California. He is interested in using machine learning and data mining to tackle tough problems in a variety of real-life contexts, and he's excited to work with the British Library to make art more discoverable for people everywhere.

Brian Do grew up in sunny California and is a first-year MD/PhD student at Harvard Medical School. Previously he studied Computer Science and biology at Stanford. Brian loves using data visualisation and cutting edge tools to reveal unexpected things about sports, finance and even his own text message history.

SherlockNet recently posted an update of their work and you can try out their SherlockNet interface and tell us what you think.

BL Labs Awards winners for 2016

Research Award winner

Allan Sudlow, Head of Research Development at the British Library, announced that the winner of the Research Award was...

Scissors and Paste

By Melodee Beals, Lecturer in Digital History at Loughborough University and historian of migration and media

Melodee Beals presenting Scissors & Paste

Scissors and Paste utilises the digitised British Library Newspapers collection (1800-1900) to explore the possibilities of mining large-scale newspaper databases for reprinted and repurposed news content. The project has involved the development of a suite of tools and methodologies, created using both out-of-the-box and custom-made project-specific software, to efficiently identify reprint families of journalistic texts and then suggest both directionality and branching within these subsets. From these case studies, detailed analyses of additions, omissions and wholesale changes offer insights into the mechanics of reprinting that left behind few, if any, other traces in the historical record.

Melodee Beals joined the Department of Politics, History and International Relations at Loughborough University in September 2015. Previously, Melodee has worked as a pedagogical researcher for the History Subject Centre, a teaching fellow for the School of Comparative American Studies at the University of Warwick and a Principal Lecturer for Sheffield Hallam University, where she acted as Subject Group Leader for History. Melodee completed her PhD at the University of Glasgow.

Commercial Award winner

Isabel Oswell, Head of Business Audiences at the British Library, announced that the winner of the Commercial Award was...

Curating Digital Collections to Go Mobile

By Mitchell Davis, publishing and media entrepreneur

Mitchell Davis presenting Curating Digital Collections to Go Mobile

As a direct result of its collaborative work with the British Library, BiblioLabs has developed BiblioBoard, an award-winning e-Content delivery platform, and online curatorial and multimedia publishing tools to support it. These tools make it simple for subject area experts to create visually stunning multi-media exhibits for the web and mobile devices without any technical expertise. The curatorial output is almost instantly available via a fully responsive web site as well as through native apps for mobile devices. This unified digital library interface incorporates viewers for PDF, ePub, images, documents, video and audio files allowing users to immerse themselves in the content without having to link out to other sites to view disparate media formats.

Mitchell Davis founded BookSurge in 2000, the world’s first integrated global print-on-demand and publishing services company (sold to Amazon.com in 2005 and re-branded as CreateSpace). Since 2008, he has been founder and chief business officer of BiblioLabs, the creators of BiblioBoard. Mitchell is also an indie producer and publisher who has created several award-winning indie books and documentary films over the past decade through Organic Process Productions, a small philanthropic media company he founded with his wife Farrah Hoffmire in 2005.

Artistic Award winner

Jamie Andrews, Head of Culture and Learning at the British Library, announced that the winner of the Artistic Award was...

Hey There, Young Sailor

Written and directed by writer and filmmaker Ling Low, with visual art by Lyn Ong

Hey There, Young Sailor combines live action with animation, hand-drawn artwork and found archive images to tell a love story set at sea. Inspired by the works of early cinema pioneer Georges Méliès, the video draws on late 19th century and early 20th century images from the British Library's Flickr collection for its collages and tableaux. The video was commissioned by Malaysian indie folk band The Impatient Sisters and independently produced by a Malaysian and Indonesian team.

Ling Low receives her Award from Jamie Andrews

Ling Low is based between Malaysia and the UK and she has written and directed various short films and music videos. In her fiction and films, Ling is drawn to the complexities of human relationships and missed connections. By day, she works as a journalist and media consultant. Ling has edited a non-fiction anthology of human interest journalism, entitled Stories From The City: Rediscovering Kuala Lumpur, published in 2016. Her journalism has also been published widely, including in the Guardian, the Telegraph and Esquire Malaysia.

Teaching / Learning Award winner

Ria Bartlett, Lead Producer: Onsite Learning at the British Library, announced that the winner of the Teaching / Learning Award was...

Library Carpentry

Founded by James Baker, Lecturer at the Sussex Humanities Lab, who represented the global Library Carpentry Team (see below) at the Symposium

James Baker presenting Library Carpentry

Library Carpentry is software skills training aimed at the needs and requirements of library professionals. It takes the form of a series of modules that are available online for self-directed study or for adaptation and reuse by library professionals in face-to-face workshops. Library Carpentry is in the commons and for the commons: it is not tied to any institution or person. For more information on Library Carpentry see http://librarycarpentry.github.io/

James Baker is a Lecturer in Digital History and Archives at the School of History, Art History and Philosophy and at the Sussex Humanities Lab. He is a historian of the long eighteenth century and contemporary Britain. James is a Software Sustainability Institute Fellow and holds degrees from the University of Southampton and latterly the University of Kent. Prior to joining Sussex, James held positions as Digital Curator at the British Library and as a Postdoctoral Fellow with the Paul Mellon Centre for Studies in British Art. James is a convenor of the Institute of Historical Research Digital History seminar and a member of the History Lab Plus Advisory Board.

 The Library Carpentry Team is regularly accepting new members and currently also includes: 

Carpentry
The Library Carpentry Team

British Library Labs Staff Award winner

Phil Spence, Chief Operating Officer at the British Library, announced that the winner of the British Library Labs Staff Award was...

LibCrowds

Led by Alex Mendes, Software Developer at the British Library

LibCrowds is a crowdsourcing platform built by Alexander Mendes. It aims to create searchable catalogue records for some of the hundreds of thousands of items that can currently only be found in printed and card catalogues. By participating in the crowdsourcing projects, users will help researchers everywhere to access the British Library’s collections more easily in the future.

Nora McGregor presenting LibCrowds on behalf of Alex Mendes

The first project series, Convert-a-Card, experimented with a new method for transforming printed card catalogues into electronic records for inclusion in our online catalogue Explore, by asking volunteers to link scanned images of the cards with records retrieved from the WorldCat database. Additional projects have recently been launched that invite volunteers to transcribe cards requiring more specific language skills, such as the South Asian minor languages. Records matched, located, transcribed or translated as part of the crowdsourcing projects are uploaded to the British Library's Explore catalogue for anyone to search online. By participating, users can have a direct impact on the availability of research material to anyone interested in the diverse collections available at the British Library.

Alex Mendes has worked at the British Library for several years and recently completed a Bachelor’s degree in Computer Science with the Open University. Alex enjoys the consistent challenges encountered when attempting to find innovative new solutions to unusual problems in software development.

Alex Mendes

If you would like to find out more about BL Labs, our Competition or Awards please contact us at labs@bl.uk   

03 November 2016

SherlockNet update - 10s of millions more tags and thousands of captions added to the BL Flickr Images!


SherlockNet are Brian Do, Karen Wang and Luda Zhao, finalists for the Labs Competition 2016.

We have some exciting updates regarding SherlockNet, our ongoing effort to use machine learning techniques to radically improve the discoverability of the British Library Flickr Commons image dataset.

Tagging

Over the past two months we’ve been working on expanding and refining the set of tags assigned to each image. Initially, we set out simply to assign the images to one of 11 categories, which worked surprisingly well with less than a 20% error rate. But we realised that people usually search from a much larger set of words, and we spent a lot of time thinking about how we would assign more descriptive tags to each image.

Eventually, we settled on a Google Images style approach, where we parse the text surrounding each image and use it to get a relevant set of tags. Luckily, the British Library digitised the text around all 1 million images back in 2007-8 using Optical Character Recognition (OCR), so we were able to grab this data. We explored computational tools such as Term Frequency – Inverse Document Frequency (Tf-idf) and Latent Dirichlet allocation (LDA), which try to assign the most “informative” words to each image, but found that images aren’t always associated with the words on the page.

To solve this problem, we decided to use a 'voting' system in which we find the 20 images most similar to our image of interest and have those images vote on the nouns that appear most commonly in their surrounding text. The most commonly appearing words become the tags we assign to the image. Despite some computational hurdles in selecting the 20 most similar images from a set of 1 million, we were able to achieve this goal. Along the way, we encountered several interesting problems.
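A stripped-down version of that voting scheme, assuming each image already has a feature vector and a bag of nouns drawn from its surrounding OCR text (both assumptions made for the sketch; the production pipeline described above is more involved), might look like this:

```python
from collections import Counter
from sklearn.neighbors import NearestNeighbors

def vote_tags(features, nouns_per_image, query_index, k=20, n_tags=15):
    """features: (N, D) array of image feature vectors.
    nouns_per_image: list of N lists of nouns from each image's OCR text.
    Returns the nouns most common among the query's k nearest neighbours."""
    knn = NearestNeighbors(n_neighbors=k).fit(features)
    _, neighbour_idx = knn.kneighbors(features[query_index:query_index + 1])

    votes = Counter()
    for idx in neighbour_idx[0]:            # each similar image 'votes' with
        votes.update(nouns_per_image[idx])  # the nouns from its own text
    return [word for word, _ in votes.most_common(n_tags)]
```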

Similar images
For all images, similar images are displayed
  1. Spelling was a particularly difficult issue. The OCR algorithms that were state of the art back in 2007-2008 are now obsolete, so a sizeable portion of our digitised text was misspelled or transcribed incorrectly. We used a pretty complicated decision tree to fix misspelled words (a toy version of this fix, and of the stemming in point 2, is sketched after this list). In a nutshell, it amounted to finding the word that a) is most common across British English literature and b) has the smallest edit distance to our misspelled word. Edit distance is the fewest number of edits (additions, deletions, substitutions) needed to transform one word into another.
  2. Words come in various forms (e.g. ‘interest’, ‘interested’, ‘interestingly’) and these forms have to be resolved into one “stem” (in this case, ‘interest’). Luckily, natural language toolkits have stemmers that do this for us. It doesn’t work all the time (e.g. ‘United States’ becomes ‘United St’ because ‘ates’ is a common suffix) but we can use various modes of spell-check trickery to fix these induced misspellings.
  3. About 5% of our books are in French, German, or Spanish. In this first iteration of the project we wanted to stick to English tags, so how do we detect if a word is English or not? We found that checking each misspelled (in English) word against all 3 foreign dictionaries would be extremely computationally intensive, so we decided to throw out all misspelled words for which the edit distance to the closest English word was greater than three. In other words, foreign words are very different from real English words, unlike misspelled words which are much closer.
  4. Several words appear very frequently in all 11 categories of images. These words were ‘great’, ‘time’, ‘large’, ‘part’, ‘good’, ‘small’, ‘long’, and ‘present’. We removed these words as they would be uninformative tags.
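A toy version of the spell-fixing in point 1 and the stemming in point 2 can be sketched with NLTK (an illustrative approximation only; it is not the decision tree described above, and a brute-force scan of a whole word list would be far too slow at this scale):

```python
import nltk
from nltk.metrics.distance import edit_distance
from nltk.stem import PorterStemmer

nltk.download("words")                 # a plain English word list
from nltk.corpus import words
ENGLISH = set(w.lower() for w in words.words())
stemmer = PorterStemmer()

def correct(word, max_distance=3):
    """Return the nearest dictionary word, or None if nothing lies within
    max_distance edits (the cut-off used above to discard foreign words)."""
    word = word.lower()
    if word in ENGLISH:
        return word
    best = min(ENGLISH, key=lambda candidate: edit_distance(word, candidate))
    return best if edit_distance(word, best) <= max_distance else None

print(correct("slaverry"))             # -> 'slavery' (one edit away)
print(stemmer.stem("interested"))      # -> 'interest'
```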

We ended up with between 10 and 20 tags for each image. We estimate that between 30% and 50% of the tags convey some information about the image, and the others are circumstantial. Even at this stage, the tagging has been immensely helpful in some of the searches we’ve done already (check out “bird”, “dog”, “mine”, “circle”, and “arch” as examples). We are actively looking for suggestions to improve our tagging accuracy. Nevertheless, we’re extremely excited that images now have useful annotations attached to them!

SherlockNet Interface

SherlockNet Interface

For the past few weeks we’ve been working on incorporating the ~20 million tags and related images and uploading them to our website. Luckily, Amazon Web Services provides comprehensive computing resources to take care of storing our data and transferring it into databases to be queried by the front end.

In order to make searching easier, we’ve also added functionality to automatically include synonyms in your search. For example, you can type in “lady”, click on Synonym Search, and it adds “gentlewoman”, “ma'am”, “madam”, “noblewoman”, and “peeress” to your search as well. This is particularly useful in a tag-based indexing approach such as the one we are using.
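That kind of synonym expansion can be approximated with WordNet (an assumption about the mechanism; the site may use a different synonym source):

```python
import nltk
nltk.download("wordnet")
from nltk.corpus import wordnet

def expand_query(term):
    """Return the search term plus its WordNet synonyms."""
    synonyms = {term}
    for synset in wordnet.synsets(term):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace("_", " "))
    return sorted(synonyms)

print(expand_query("lady"))   # includes e.g. 'madam', 'noblewoman', 'peeress'
```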

As our data gets uploaded over the coming days, you should begin to see our generated tags and related images show up on the Flickr website. You can click on each image to view it in more detail, or on each tag to re-query the website for that particular tag. This way users can easily browse relevant images or tags to find what they are interested in.

Each image is currently captioned with a default description containing information on which source the image came from. As Luda finishes up his captioning, we will begin uploading his captions as well.

We will also be working on adding more advanced search capabilities via wrapper calls to the Flickr API. Proposed functionality will include logical AND and NOT operators, as well as better filtering by machine tags.

Captioning

As mentioned in our previous post, we have been experimenting with techniques to automatically caption images with relevant natural-language captions. Since an Artificial Intelligence (AI) is responsible for recognising, understanding and learning proper language models for captions, we expected this task to be far harder than tagging, and although the final results we obtained may not be ready for production-level archival purposes, we hope our work can help spark further research in this field.

Our last post left off with our use of a pre-trained Convolutional Neural Network – Recurrent Neural Network (CNN-RNN) architecture to caption images. We showed that we were able to produce some interesting captions, albeit at low accuracy. The problem we pinpointed was the training set of the model, which was derived from the Microsoft COCO dataset: it consists of photographs of modern-day scenes, which differ significantly from the BL Flickr dataset.

Through collaboration with BL Labs, we were able to locate a dataset that was potentially better suited to our purposes: the British Museum prints and drawings online collection, consisting of over 200,000 prints, drawings and illustrations, along with handwritten captions describing the images, which the British Museum has generously given us permission to use in this context. However, since the dataset is obtained directly from public SPARQL endpoints, we needed to run some pre-processing to make it usable. For the images, we cropped them to a standard 225 x 225 size and converted them to grayscale. For the captions, pre-processing ranged from simple exclusion of dates and author information to more sophisticated “normalisation” procedures aimed at reducing the total vocabulary of the captions. Words that are exceedingly rare (fewer than 8 occurrences) were replaced with <UNK> (unknown) symbols denoting their rarity. We used the same neuraltalk architecture, taking the features from a Very Deep Convolutional Network for Large-Scale Visual Recognition (VGGNet) as intermediate input into the language model. As it turns out, even with aggressive filtering of words, the distribution of vocabulary in this dataset was still too diverse for the model. Despite our best efforts to tune hyperparameters, the model we trained was consistently over-sensitive to key phrases in the dataset, which resulted in it converging on local minima where the captions would stay the same and not show any variation. This seems to be a hard barrier to learning from this dataset. We will be publishing our code in the future, and we welcome anyone with insight to continue this research.
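The vocabulary “normalisation” step described above (counting word frequencies and replacing very rare words with an <UNK> token) is simple to sketch; the threshold of 8 occurrences comes from the text, while everything else here is illustrative:

```python
from collections import Counter

def build_vocab(captions, min_occurrences=8):
    """captions: list of token lists. Keep words seen at least
    min_occurrences times (the threshold used above was 8)."""
    counts = Counter(token for caption in captions for token in caption)
    return {token for token, n in counts.items() if n >= min_occurrences}

def normalise(caption, vocab):
    # Rare words are replaced by an <UNK> symbol denoting their rarity.
    return [token if token in vocab else "<UNK>" for token in caption]

# Toy data, with a lower threshold so the effect is visible:
captions = [["a", "portrait", "of", "a", "gentleman"],
            ["a", "view", "of", "a", "harbour"]]
vocab = build_vocab(captions, min_occurrences=2)
print(normalise(["a", "daguerreotype", "of", "a", "harbour"], vocab))
# -> ['a', '<UNK>', 'of', 'a', '<UNK>']
```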

Captions
Although there were occasional images with delightfully detailed captions (left), our models couldn’t quite capture useful information for the vast majority of the images (right). More work is definitely needed in this area!

The British Museum dataset (Prints and Drawings from the 19th Century) does, however, contain valuable contextual data, and given our difficulty in using it to caption the Flickr images directly, we decided to use it in other ways. By parsing the captions and performing Part-of-Speech (POS) tagging, we were able to extract nouns and proper nouns from each caption. We then compiled the common nouns from all the images and kept the most common (those appearing in at least 500 images) as tags, resulting in over 1,100 different tags. This essentially converts the British Museum dataset into a rich dataset of diverse tags, which we can apply to our earlier work on tag classification. We trained a few models with some “fun” tags, such as “Napoleon”, “parrots” and “angels”, and were able to get decent testing accuracies of over 75% on binary labels. We will be uploading a subset of these tags under the “sherlocknet:tags” prefix to the Flickr image set, as well as the previous COCO captions for a small subset of images (~100K).
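Extracting nouns and proper nouns from a caption via POS tagging, roughly as described above, can be done with NLTK (an assumed toolkit; resource names vary slightly between NLTK versions):

```python
import nltk
nltk.download("averaged_perceptron_tagger")   # newer NLTK versions may also
                                              # need 'averaged_perceptron_tagger_eng'

def caption_nouns(caption):
    """Return the common and proper nouns (tags starting 'NN') in a caption."""
    tokens = caption.split()                  # simple whitespace tokenisation
    return [word for word, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]

print(caption_nouns("Napoleon on horseback surrounded by angels and parrots"))
# typically -> ['Napoleon', 'horseback', 'angels', 'parrots']
```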

You can access our interface here: bit.ly/sherlocknet or look for 'sherlocknet:tag=' and 'sherlocknet:category=' tags on the British Library Flickr Commons site, here is an example, and see the image below:

Example Tags on a Flickr Image generated by SherlockNet

Please check it out and let us know if you have any feedback!

We are really excited that we will be in London in a few days' time to present our findings – why don't you come and join us at the British Library Labs Symposium, between 0930 and 1730 on Monday 7th November 2016?

Black Abolitionist Performances and their Presence in Britain - An update!


Posted by Hannah-Rose Murray, finalist in the BL Labs Competition 2016.

Reflecting on an incredible and interesting journey, it is remarkable how quickly the last five months have flown by! In May, I was chosen as one of the finalists for the British Library Labs Competition 2016, and my project has focused on black abolitionist performances and their presence in Britain during the nineteenth century. Black men and women had an impact in nearly every part of Great Britain, and it is no surprise to learn that their lectures were held in famous meeting halls, taverns, the houses of wealthy patrons, theatres and churches across the country: we inevitably and unknowingly walk past sites with a rich history of Black Britain every day.

I was inspired to apply for this competition by last year’s winner, Katrina Navickas. Her project focused on the Chartist movement, and in particular on using the digitised nineteenth-century newspaper database to find the locations of Chartist meetings around the country. Katrina and the Labs team wrote code to identify these meetings in the Chartist newspaper, churning out hundreds of results that would have taken her years to find manually.

I wanted to do the same thing, but with black abolitionist speeches. However, there was an inherent problem: these abolitionists travelled around Britain between 1830 and 1900 and gave lectures in large cities and small towns; in other words, their lectures were covered in numerous city and provincial newspapers. The scale of the project was perhaps one of the most difficult things we have had to deal with.

When searching the newspapers, one of the first things we found was that the OCR (Optical Character Recognition) is patchy at best. OCR refers to scanned images that have been turned into machine-readable text, and the quality of the OCR depends on many factors – from the quality of the scan itself, to the quality of the paper the newspaper was printed on, to whether it has been damaged or ‘muddied’. If the OCR is unintelligible, the data will not be ‘read’ properly – hence there could be hundreds of references to Frederick Douglass that are not accessible or ‘readable’ to us through an electronic search (see the image below).

An excerpt from a newspaper article about a public meeting about slavery, from the Leamington Spa Courier, 20 February 1847

In order to sort the ‘muddied’ OCR from the ‘clean’ OCR, we need to teach the computer what is ‘positive text’ (i.e. language that uses words such as ‘abolitionist’, ‘black’, ‘fugitive’ and ‘negro’) and what is ‘negative text’ (language that does not relate to abolition). For example, the image above shows an advert for one of Frederick Douglass’s lectures (Leamington Spa Courier, 20 February 1847). The key words in this particular advert that are likely to appear in other adverts, reports and commentaries are ‘Frederick Douglass’, ‘fugitive’, ‘slave’, ‘American’ and ‘slavery’. I can search for this advert through the digitised database, but there are perhaps hundreds more waiting to be uncovered.
We found examples where the name ‘Frederick’ had been ‘read’ as F!e83hrick or something similar. The image below shows some OCR from the Aberdeen Journal, 5 February 1851, from an article about “three fugitive slaves.” The term ‘Fugitive Slaves’ as a heading is completely illegible, as is William’s name before ‘Crafts’. If I used a search engine to look for William Craft, it is unlikely this result would be returned because of the poor OCR.

OCR from the Aberdeen Journal, 5 February 1851, and an article about “three fugitive slaves.”

I have spent several years transcribing black abolitionist speeches, and most of this material will act as the ‘positive’ text. ‘Negative’ text can refer to other lectures with a similar structure that do not relate to abolition specifically, for example prison reform meetings or meetings about church finances. This will ensure the abolitionist language becomes easily recognisable. We can then test the classifier’s performance against some of the data we already have, and once the probabilities suggest we are on the right track, we can apply it to a larger data set.

All of this data is built into what is called a classifier, created by Ben O’Steen, Technical Lead of BL Labs. The classifier reads the OCR and collects newspaper references, but it works differently from a search engine because it weighs words by importance and frequency. It also relies on probability: for example, if an article mentions ‘fugitive’ and ‘slave’ in the same section, the classifier assigns a higher probability that the article is discussing someone like Frederick Douglass or William Craft. A search engine, on the other hand, might simply match the words ‘fugitive slave’ in unrelated articles on the same page of a newspaper.
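Ben O’Steen’s classifier is not published in this post, but the general approach it describes (weighing word frequencies in labelled ‘positive’ and ‘negative’ text and scoring new articles by probability) can be sketched with scikit-learn. The model choice, feature weighting and toy training texts below are all illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# 'Positive' text: transcribed abolitionist speeches; 'negative' text:
# similar-looking meetings that are not about abolition (toy examples).
train_texts = [
    "Frederick Douglass, a fugitive slave, lectured on American slavery",
    "the fugitive slaves William and Ellen Craft addressed the meeting",
    "a public meeting was held to discuss the church building fund",
    "the committee on prison reform reported on the state of the gaols",
]
train_labels = [1, 1, 0, 0]     # 1 = abolitionist speech, 0 = other meeting

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Score a new (possibly noisy OCR) article by the probability that it
# reports an abolitionist lecture, rather than by exact keyword matching.
article = "last evening a lecture on slavery was given by a fugitive from America"
print(model.predict_proba([article])[0][1])
```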

We’re currently processing the results of the classifier and adjusting it to try to reach a higher accuracy. This involves some degree of human effort, as I double-check the references to see whether the results actually contain an abolitionist speech. So far, we have had a few references to abolitionist speeches, but the classifier’s biggest difficulty is language. For example, there were hundreds of results from the 1830s and the 1860s – I instantly knew that many of these would be references to the Chartist movement, because the language the Chartists used included words like ‘slavery’ when describing labour conditions, and they frequently compared those conditions to ‘negro slavery’ in the US. The large number of references from the 1860s highlights the renewed interest in American slavery because of the American Civil War, and there are thousands of articles discussing the Union, the Confederacy, slavery and the position of black people as fugitives or soldiers. Several times, the results focused on fugitive slaves in America rather than in Britain.

Another result referred to a West Indian lion tamer in London! This is a fascinating story, and part of the hidden history we see as central to the project, but it is obviously not an abolitionist speech. We are currently working on restricting our date parameters – from 1845 to 1860 to start with – to avoid the numerous mentions of the Chartists and the Civil War. This is one way in which we have had to be flexible with the initial proposal for the project.

Aside from the work on the classifier, we have also been exploring numerous ways to improve the OCR – is it better to apply OCR-correction software, to completely re-OCR the collection, or perhaps a combination of both? We have sent some small samples to a company based in Canberra, Australia, called Overproof, who specialise in OCR correction and have provided promising results. Obviously the results are on a small scale, but it has been really interesting to see the improvements in today’s software compared to when some of these newspapers were originally scanned ten years ago. We have also sent the same sample to the IMPACT Centre of Competence in Digitisation, whose mission is to make the digitisation of historical printed text “better, faster, cheaper” and which provides tools, services and facilities to further advance the state of the art in document imaging, language technology and the processing of historical text. Preliminary results will be presented at the Labs Symposium.

Updated website

Before I started working with the Library, I had designed a website at http://www.frederickdouglassinbritain.com. The structure was rudimentary and slightly awkward, dwarfed by the numerous pages I kept adding to it. As the project progressed, I wanted to improve the website at the same time, and with the invaluable help of Dr Mike Gardner from the University of Nottingham, I re-launched it at the end of October. Initially, I had two maps: one showing the speaking locations of Frederick Douglass, and another showing speaking locations of other black abolitionists such as William and Ellen Craft, William Wells Brown and Moses Roper (shown below).

Left map showing the speaking locations of Frederick Douglass. Right map showing speaking locations by other black abolitionists such as William and Ellen Craft, William Wells Brown and Moses Roper.

After working with Mike, we not only improved the aesthetics of the website and the maps (making them more professional) but also used clustering to highlight the areas where these men and women spoke most often. This avoids the ‘busy’ appearance of the first maps, which had one pin per location, and allows visitors to explore individual places and lectures more efficiently. Furthermore, on the black abolitionist speaking locations map (below right), a user can choose an individual and see only their lectures, or choose two or three in order to look for patterns in who gave these lectures and where they travelled.
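The clustering idea (grouping nearby lecture pins until the user zooms in) can be reproduced in a few lines with folium's MarkerCluster plugin. This is only an illustrative Python stand-in with made-up example points, not the code behind the actual site:

```python
import folium
from folium.plugins import MarkerCluster

# A few illustrative speaking locations (label, latitude, longitude);
# the real dataset holds hundreds of lectures.
lectures = [
    ("Frederick Douglass, Leeds", 53.80, -1.55),
    ("Frederick Douglass, Newcastle", 54.98, -1.61),
    ("William Wells Brown, London", 51.51, -0.13),
]

speaking_map = folium.Map(location=[54.0, -2.0], zoom_start=6)
cluster = MarkerCluster().add_to(speaking_map)   # pins merge until zoomed in
for label, lat, lon in lectures:
    folium.Marker(location=[lat, lon], popup=label).add_to(cluster)

speaking_map.save("abolitionist_lectures.html")
```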

The new map interface for my website.

Events

I am very passionate about public engagement and regard it as an essential part of being an academic, since it is so important to engage with, share with and learn from the public. We have created two events. As part of Black History Month, on 6th October we held a performance here at the Library celebrating the lives of two formerly enslaved individuals, William and Ellen Craft. Joe Williams of Heritage Corner in Leeds – an actor and researcher who has performed as numerous figures, including Frederick Douglass and the black circus entertainer Pablo Fanque – had been writing a play about the Crafts, and because it fitted so well with the project, we invited Joe and the actress Martelle Edinborough, who played Ellen, to London for a performance. Both Joe and Martelle were incredible, and they really brought the Crafts’ story and the project to life. We had a Q&A afterwards, where everyone was very responsive and positive about the performance and the Crafts’ story of heroism and bravery.

(Left to Right) Martelle Edinborough, Hannah-Rose Murray and Joe Williams

The next event is a walking tour, taking place on Saturday 26 November. I’ve devised this tour around central London, highlighting six sites where black activists made an indelible mark on British society during the nineteenth century. It is a way of showing how we walk past these sites on a daily basis, and how we need to recognise the contributions of these individuals to British history.

Hopefully this project will inspire others to research and use digital scholarship to find more ‘hidden voices’ in the archive. In terms of black history specifically, people of colour were actors, sailors, boxers, students and authors as well as lecturers, and there is so much more to uncover about their contribution to British history. My personal journey with the Library and the Labs team has also been a rewarding experience. It has further convinced me that we need stronger networks of collaboration between scholars and computer scientists, and it has shown me the value of digital humanities in general. Academics could harness the power of technology to bring their research to life, an important and necessary tool for public engagement. I hope to continue working with the Labs team, fine-tuning some of the results and writing some pages about black abolitionists for the new website. I am very grateful to the Library and the Labs team for their support, patience and this amazing opportunity: I have learned so much about digital humanities, and I see this project – with its combination of manual and technological methods – as a model for how we should move forward in the future. The project will shape my career in new and exciting ways, and the opportunity to work with one of the best libraries in the world has been a really gratifying experience.

I am really excited that I will be in London in a few days' time to present my findings – why don't you come and join us at the British Library Labs Symposium, between 0930 and 1730 on Monday 7th November 2016?