THE BRITISH LIBRARY

Digital scholarship blog

62 posts categorized "Humanities"

12 April 2018

The 2018 BL Labs Awards: enter before midnight Thursday 11th October!

Add comment

With six months to go before the submission deadline, we would like to announce the 2018 British Library Labs Awards!

The BL Labs Awards are a way of formally recognising outstanding and innovative work that has been created using the British Library’s digital collections and data.

Have you been working on a project that uses digitised material from the British Library's collections? If so, we'd like to encourage you to enter that project for an award in one of our categories.

This year, the BL Labs Awards is commending work in four key areas:

  • Research - A project or activity which shows the development of new knowledge, research methods, or tools.
  • Commercial - An activity that delivers or develops commercial value in the context of new products, tools, or services that build on, incorporate, or enhance the Library's digital content.
  • Artistic - An artistic or creative endeavour which inspires, stimulates, amazes and provokes.
  • Teaching / Learning - Quality learning experiences created for learners of any age and ability that use the Library's digital content.

BLAwards2018
BL Labs Awards 2018 Winners (Top-Left- Research Award Winner – A large-scale comparison of world music corpora with computational tools , Top-Right (Commercial Award Winner – Movable Type: The Card Game), Bottom-Left(Artistic Award Winner – Imaginary Cities) and Bottom-Right (Teaching / Learning Award Winner – Vittoria’s World of Stories)

There is also a Staff award which recognises a project completed by a staff member or team, with the winner and runner up being announced at the Symposium along with the other award winners.

The closing date for entering your work for the 2018 round of BL Labs Awards is midnight BST on Thursday 11th October (2018)Please submit your entry and/or help us spread the word to all interested and relevant parties over the next few months. This will ensure we have another year of fantastic digital-based projects highlighted by the Awards!

The entries will be shortlisted after the submission deadline (11/10/2018) has passed, and selected shortlisted entrants will be notified via email by midnight BST on Friday 26th October 2018. 

A prize of £500 will be awarded to the winner and £100 to the runner up in each of the Awards categories at the BL Labs Symposium on 12th November 2018 at the British Library, St Pancras, London.

The talent of the BL Labs Awards winners and runners up from 2017, 2016 and 2015 has resulted in a remarkable and varied collection of innovative projects. You can read about some of the 2017 Awards winners and runners up in our other blogs, links below:

BLAwards2018-Staff
British Library Labs Staff Award Winner – Two Centuries of Indian Print


Research category Award (2017) winner: 'A large-scale comparison of world music corpora with computational tools', by Maria Panteli, Emmanouil Benetos and Simon Dixon. Centre for Digital Music, Queen Mary University of London

  • Research category Award (2017) runner up: 'Samtla' by Dr Martyn Harris, Prof Dan Levene, Prof Mark Levene and Dr Dell Zhang
  • Commercial Award (2017) winner: 'Movable Type: The Card Game' by Robin O'Keeffe
  • Artistic Award (2017) winner: 'Imaginary Cities' by Michael Takeo Magruder
  • Artistic Award (2017) runner up: 'Face Swap', by Tristan Roddis and Cogapp
  • Teaching and Learning (2017) winner: 'Vittoria's World of Stories' by the pupils and staff of Vittoria Primary School, Islington
  • Teaching and Learning (2017) runner up: 'Git Lit' by Jonathan Reeve
  • Staff Award (2017) winner: 'Two Centuries of Indian Print' by Layli Uddin, Priyanka Basu, Tom Derrick, Megan O’Looney, Alia Carter, Nur Sobers khan, Laurence Roger and Nora McGregor
  • Staff Award (2017) runner up: 'Putting Collection metadata on the map: Picturing Canada', by Philip Hatfield and Joan Francis

For any further information about BL Labs or our Awards, please contact us at labs@bl.uk.

23 March 2018

Shine a light on past entertainments with In the Spotlight

Add comment

In this post, Dr Mia Ridge and Alex Mendes provide an update on the Library's latest crowdsourcing project...

People who've explored In the Spotlight, our project helping make historic playbills more findable, might have noticed a line of text just above the 'Save and Continue' button: 'Seen something interesting? Add a note'.

Insights from your comments

Since the project began, we've received almost 700 comments [update - it's actually over 1900, across all projects]. Some of them simply tell us that an image is blank or upside-down, but many others share interesting findings. We love hearing from you, and we've been highlighting individual comments on Twitter (@LibCrowds) and on our forum.

Comments have pointed out spectacles including 'a Terrific Eruption of Mount Vesuvius, accompanied by TORRENTS OF BURNING LAVA' and a 'Serpent vomiting Fire'. New amenities mentioned include lighting ('600 wax lights and a new set of gold chandeliers' or new gas lighting) and the addition of backs to seats. Famous actors spotted include Sarah Siddons, Jenny Lind and Ira Aldridge, while Mr Kean has caused all kinds of trouble.

Lots of comments are about performances that aren't plays, from hornpipes to tableaux to ballets, songs, speeches, fireworks, scientific demonstrations, performing animals, panoramas, conjuring and juggling tricks, lists of scenery, gun tricks, pantomimes, acrobatics, excerpts from plays, and even the 'reenactment of the Coronation'! We're thinking hard about the best way to deal with them (and with playbills that don't include a year), and will post to the forum and twitter to ask for your ideas soon.

General updates

Since we first shared the link, there have been over 4,700 visitors from 91 countries. About 80% are primarily English-speakers, with Russian, German and French the next most popular languages.

We've had over 42,000 contributions from over 630 participants (with 1499 participants registered on the platform overall). Together, they've helped complete 34 projects by undertaking countless marking and transcription tasks to make genres, dates and play titles searchable.

Each project is based on a specific volume of playbills from a regional theatre or theatres. The fastest projects were 'Theatre Royal, Bristol 1819-1823 (Vol. 2)', completed in 8 days, 31 minutes, with 'Miscellaneous Plymouth theatres 1796-1882 (Vol. 1)' a close second at 8 days, 5 hours, 30 mins. We currently have playbills from theatres in Dublin, Hull, Nottingham - Oswestry or Plymouth - which will be completed first?

Recent blog posts include a wonderful story from PhD student and In the Spotlight participant Edward Mills tracing an ancient custom through the Library's digitised collections in The Flitch of Bacon: An Unexpected Journey Through the Collections of the British Library, and Christian Algar on the 'rich pageant' of historical playbills.

You might have noticed some small changes to the navigation and data pages as we updated the software this week. Most of the changes were behind the scenes, providing additional admin and analysis functions to ensure that data sent off to the catalogue is as accurate as possible.

image from http://s3.amazonaws.com/feather-files-aviary-prod-us-east-1/98739f1160a9458db215cec49fb033ee/2018-03-23/3bfdfe7285d54738a6f225032e20b995.png
Visitors have come from all over the world, but we'd love to reach more

 

Thank you!

We're grateful to everyone who's made a large or small contribution, but particular thanks to Barbara G, David Y, Dina S, Ervins S, Jo B, John L, Katharine S, Kathryn P-S, Lisa G, Maria Antonia V-S, Martin B, mistrec, Olga K, Raphael H, Rosie C, Sharon E, sylvmorris1, Tabitha M, thtrisdead, Tif D, Vijay V and various anonymous posters for your comments. Your comments are also helping us work out how to tweak some of the interfaces so people can let us know about a problem with a task by clicking a button, so expect more improvements in the future!

Step into the Spotlight

It's easy to try out In the Spotlight - you don't need to register, so you can start marking out the titles of plays or transcribing the titles, dates or genres of plays straight away. Give it a go and let us know what you find!

image from http://s3.amazonaws.com/feather-files-aviary-prod-us-east-1/98739f1160a9458db215cec49fb033ee/2018-03-23/63194392defb46a8bae006ea04dc7148.png
There are wonders galore waiting for the spotlight

12 March 2018

The Ground Truth: Transcribing historical Arabic Scientific Manuscripts for OCR research

Add comment

Announcing a collaborative transcription project to support state-of-the-art research in automatic handwritten text recognition for historical Arabic texts

Cultural heritage institutions around the world are digitising hundreds of thousands of pages of historical Arabic manuscript and archive collections. Making these fully text searchable has the potential to truly transform scholarship, opening up this rich content for discovery and enabling large-scale analysis.

Computer scientists and scholars are working on this challenge, building systems which can automatically transcribe images of handwritten text, but for historical Arabic script a solution remains just out of reach.

Our aim is to contribute to continued research in this area by building an open image and ground truth dataset of historical handwritten Arabic texts, ensuring historical Arabic collections benefit from state-of-the-art developments in handwritten text recognition.

What is Ground Truth?

Optical Character Recognition (OCR) systems essentially turn a picture of text into text itself—in other words, producing something like a .TXT or .DOC file from a scanned .JPG of a printed or handwritten page. Most OCR systems require ground truth, a set of files which represent the truthful record of elements of an image, for training and evaluation purposes.

The ground truth of an image’s text content, for instance, is the complete and accurate record of every character and word in the image.

By knowing what the system is supposed to recognise on a page of handwritten text, researchers can both train their system to recognise the characters as well as test how well the system does once trained.

Transcription
 

  
View more transcriptions in progress from this manuscript (Or 3366) on the platform 

A collaborative approach

This project is a proof of concept exploring whether the creation of such a dataset can be done collaboratively at scale, using the collective expertise of volunteers around the world. At the heart of this approach is the Library’s enduring commitment to creating new and interesting ways to connect diverse communities of interest and expertise, be it scholars, the general public, computer scientists, students, and curators, around our collections. For this we are utilising a free and open-source platform, From the Page, which allows anyone with an interest in historical Arabic manuscripts to experience them up close, many for the first time, to discuss, learn and share expertise in their transcription.

Helping transform research

The Digital Scholarship Department was able to fund the development of this open source platform to support Right-to-Left transcription, a feature which will benefit any scholar wishing to use the software for their own transcription needs. Any transcriptions produced in this pilot will be transformed into ground truth resources, hosted by the British Library and made freely available, without rights restriction, for anyone wishing to advance the state-of-the-art in optical character recognition technology. Specifically, resources created will be contributed to ground-breaking projects already underway such as Transkribus, the Open Islamic Texts Initiative, the IMPACT Centre of Competence Image and Ground Truth Resources and more!

Visit the new Arabic Scientific Manuscripts of the British Library transcription platform and download our Getting Started Guide for more detail (an Arabic version will be available shortly). 

  

Posted by Nora McGregor, Digital Curator, British Library

 

28 February 2018

Announcing the BL Labs roadshows locations and dates for 2018!

Add comment

The @BL_Labs Roadshows: dates and locations for 2018

Do you want to learn more about the British Library’s digital collections? Are you interested in discovering how other researchers have used our digitised material in creative and innovative ways? Would you like to give us feedback on the kinds of services we are providing and would like to provide for digital scholars? Come and meet Library staff and gain an insight into some of the opportunities and challenges of working with our digital content. Get advice, pick up tips, and consider entering the digital project you have been working on for one of the BL Labs Awards (deadline Thursday 11th October 2018).

Our @BL_Labs Roadshows will be held at university departments across the UK between March and June 2018. Events will include presentations from the British Library and host institutions, practical hands-on workshops, a chance to explore and discuss what you may do with some of the Library's data and for you to speak to and get feedback from experts. We’re also keen to hear your views on some of the long-term services the British Library is hoping to develop for those who want to work with our digital collections and data.

Register for one of the roadshows! They are FREE to attend and OPEN TO ALL (unless otherwise stated). For further details about locations we are visiting this year, see below: 

Scanned British Isles with places JPEG correctetd
BL Labs Roadshow locations for 2018

March

  • Monday 26 March 2018 (10:00 – 13:00) - BL Labs Roadshow at CityLIS (City University of London Department of Library and Information Science), London (internal event)

April

May

June

  • Tuesday 5 June 2018 (12:00 – 16:00) - BL Labs Roadshow at the University of Leeds, Leeds
  • Wednesday 27 June 2018 (09:00 – 13:00) - BL Labs Roadshow at the University of Birmingham, Birmingham

You will be able to view the full programme details for each of the roadshows, and book your place via Eventbrite. Links will be live shortly or visit our events page.

For any further questions, please contact us at labs@bl.uk.

The British Library Labs project is funded by the Andrew W. Mellon Foundation and the British Library.

Posted by BL Labs

21 February 2018

BL Labs 2017 Symposium: Opening up the British Library’s Early Indian Printed Books Collection (Staff Award Winner)

Add comment

Making the British Library’s valuable collection of early Bengali books more accessible to researchers and the general public around the world rests heavily on the collaborative work undertaken across different teams of the library and partners in the UK and abroad. The commitment and passion of the project team has relied on the contribution and expertise of collaborators, as well as the forward thinking vision of the library, partners and fundraisers.

Receiving the BL Labs Staff Award 2017 is a great opportunity to thank everyone involved. 

Members of the Two Centuries of Indian Print team receiving the British Library Labs award at the Symposium on 30th October.
Members of the Two Centuries of Indian Print team receiving the British Library Labs award at the Symposium on 30th October 2017
 
Tom Derrick (Digital Curator) was in India at the same time the team received their Award.
Tom Derrick (Digital Curator) was in India at the same time the team received their Award

The Two Centuries of Indian Print project is a partnership between the British Library, the School of Cultural Texts and Records (SCTR) at Jadavpur University, Srishti Institute of Art, Design and Technology, and the Library at SOAS University of London, among others. It has also involved collaborations with the National Library of India, and other institutions in India.

The AHRC Newton-Bhabha Fund and the Department for Business, Energy and Industrial Strategy have generously funded the work undertaken so far by the project, focusing on early printed Bengali books. Many are unavailable in other library collections or are extremely difficult to locate and access. The project has undertaken a variety of initiatives from the digitisation of books and enhancement of the catalogue records in English and Bengali, to stimulating the use of digital humanities tools and techniques, running a programme of digital skills sharing and capacity building workshops, and hosting the South Asia Series seminars. All of these initiatives greatly contribute to the discovery and study of the collection. The project is also conducting ground breaking work in finding a solution to Optical Character Recognition (OCR) in Bangla script. OCR is not available for South Asian languages currently and harnessing viable Optical Character Recognition technology would enable full text search of the books, paving the way for researchers to use natural language processing techniques to perform large scale analysis across a large corpus of text covering a diverse range of topics relating to Indian society, religion, and politics to name but a few. Doing so will increase the possibilities for new discoveries in this academic field. 

However, despite its status as one of the most widely spoken languages in the world, Bangla script has been greatly underserved by providers of OCR solutions. This is due in part to the orthographical and typographical variances that have taken place in recent centuries that make building a dictionary and character ‘classifier’ more challenging. Due to the wide date range of the books we are digitising, these issues affect the quality of OCR. The physical condition of our historical books, including faded text, presents additional difficulties for creating machine readable versions of the books. 

To overcome these obstacles, the project team has been advancing the development of OCR for Bangla through the organisation of an international competition which reviewed the state-of-the-art in commercial and open source text recognition tools. The results of the competition will be announced at the ICDAR 2017 conference in Kyoto later this month. Watch this space! The competition dataset has been made openly available for download and reuse for any researchers or institutions who would like to experiment with OCR for Bengali.

A page from the Animal Biographies, VT 1712 showing its transcription produced for the ICDAR 2017 competition
A page from the Animal Biographies, VT 1712 showing its transcription produced for the ICDAR 2017 competition

The project has organised two Skills Exchange Programmes, hosting mid-career Library professionals from the the National Library of India at the British Library for a week, providing a packed programme of tours and talks from all areas of the Library. The project has also conducted digital skills sharing and capacity building workshops for library professionals and archivists from cultural heritage institutions in India. The first workshop took place at Jadavpur University, Kolkata, in December 2016. Library and information professionals from cultural heritage institutions in Bengal took part in a one-day event to learn more about how information technology is transforming humanities research today and in turn Library services, as well as the methods for interrogating humanities-related datasets.

Afterthe success of this first workshop another event was held in July 2017, at which more than 30 library professionals discussed OCR developments for Bangla, trying out different tools and discussing digital scholarship techniques and projects. Most recently, the project’s digital curator facilitated a workshop around Digitisation Standards at the International Conference of Asian Libraries in Delhi. The workshops continue in earnest in the new year with another digital humanities skills workshop planned for January 2018 to be held in partnership with the Srishti Institute of Art, Design, and Technology.

Attendees of the workshop held at Jadavpur University in December 2016 taking part in a group activity to discuss the application of digital humanities methods to library collections
Attendees of the workshop held at Jadavpur University in December 2016 taking part in a group activity to discuss the application of digital humanities methods to library collections

The Project Team also held a two day Academic Symposium on South Asian book history at Jadavpur University in the summer, with 17 speakers from India, wider South Asia, and the UK. Attendance was between 50-70 people a day and feedback was very good.  We plan to have a publication arising from this Symposium, and to upload a video to our project webspace. The project also hosts a popular series of talks based around the Two Centuries of Indian Print project and the British Library’s South Asia collections. The seminars take place fortnightly at the British Library. So far we have hosted a range of academics and researchers, from PhD students to senior academics from the UK and abroad, who share cutting-edge research with discussion chaired by curators and specialists in the field. The seminars have been a great success attracting large attendances and speakers from around the world. We also host a number of show and tells of our material to raise awareness for our collection and to engage in community outreach.

Everyone on the project is thrilled to have won this award and we will be working hard in 2018 to continue bringing the Two Centuries of Indian Print project to the attention and use of researchers and the general public.

Submit a project for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library.

Posted by BL Labs on behalf of The Two Centuries of Indian Print team.

15 February 2018

BL Labs 2017 Symposium: Git Lit, Learning & Teaching Award Runner Up

Add comment

Applications of Distributed Version Control Technologies Toward the Creation of 50,000 Digital Scholarly Editions

The British Library maintains a collection of roughly 50,000 digital texts, scanned from public-domain books, most of which were originally published in the 19th century. As scanned books, their text format is Analyzed Layout and Text Object (ALTO) Extensible Markup Language (XML), a verbose markup format created by Optical Character Recognition (OCR) software, and one which is only marginally human-readable. Our project, Git-Lit, converts each text to the plain text format Markdown, creates version-controlled repositories for each using the distributed version control system Git, and posts the repositories to the project management platform GitHub, where they can be edited by anyone. Along the way, websites for each text, optimized for readability, are automatically generated via GitHub Pages. These websites integrate with the annotation platform Hypothes.is, enabling them to be annotated. In this way, Git-Lit aims to make this collection of British Library electronic texts discoverable, readable, editable, annotatable, and downloadable.

A Screenshot of the Website Automatically Generated from the British Library Electronic Text
A Screenshot of the Website Automatically Generated from the British Library Electronic Text


The biggest advantage of using a distributed version control system like Git is that it leverages the kinds of decentralized collaboration workflows that have long been in use in software development. Open-source software and web development, for which Git and GitHub were originally designed, is a much-studied methodology, long proven to be more effective than closed-source methods. Rather than maintain a central silo for serving code and electronic texts, the decentralized approach ensures a plurality of textual versions. Since anyone may copy ("fork") a project, modify it, and create their own version, there is no one central, canonical text, but many. Each version may freely borrow ("pull") from others, request that others integrate their changes ("pull request"), and discuss potential changes ("issues") using the project management subsystems of GitHub. This workflow streamlines collaboration, and encourages external contributions. Furthermore, since each change ("commit") requires a description of the commit, and reasons for it, the Git platform enforces the kind of editorial documentation necessary for scholarly editing. We like to think of git-based editing, therefore, as scholarly editing, and GitHub-based collaboration as a democratization of scholarly editing.

Furthermore, since GitHub allows instant editing of texts in the web browser, it is a simple and intuitive method of crowdsourcing the text cleanup process. Since OCRd texts are often full of errors, GitHub allows any reader to correct an obvious OCR error she or he finds. The analogous process of reporting errors to centralized text repositories like Project Gutenberg has been known to take several years. On GitHub, however, it is instantaneous.

Not the least advantage of this setup is the automated creation of websites from the plain text sources. Not only does this transform the markdown to a clean, readable edition of the text, but it provides integration with the annotation platform Hypothes.is. Hypothes.is allows for social annotation of a text, making it ideal for classroom use. Professors may assign a British Library text as a course reading, and may require their students annotate it, an activity which can generate discussions in the limitless virtual margins of this electronic textual space.

The Git-Lit project has so far posted around 50 texts to GitHub, as prototypes, with the full corpus of roughly 50,000 texts soon to come. After the full corpus is processed in this way, we'll begin enhancing some of the metadata. So far, we have developed techniques for probabilistically inferring the language of each text, and using Ben Schmidt's document vectorization method, Stable Random Projection, we have been able to probabilistically infer Library of Congress classifications, as well. This enables the automatic generation of sub-corpora like PR (British Literature), or PZ (American Literature).

In the coming year, we hope to integrate the Git-Lit transformed British Library texts into a structured database, further enhancing the discoverability of its texts. We have just received a micro-grant from NYC-DH to help launch Corpus-DB, a project also aiming to produce textual corpora, and through Corpus-DB, we will soon create a SQL database containing the metadata, our enhanced and inferred metadata, and other aggregated book data gleaned from public APIs. This will soon allow readers and computational text analysts the ability to download groups of British Library electronic texts. Users interested in downloading, say, all novels set in London, will be able to get a complete full-text dump of all public-domain novels in this category by visiting a URL such as api.corpus-db.org/novels/setting/London. We expect that this will greatly streamline the corpus creation process that takes up so much of the time of a computational text analysis.

Both Git-Lit and Corpus-DB are open-source projects, open to contributions from anyone, regardless of skill. If you'd like to contribute to our project in some way, get in contact with us, and we'll tell you how you can help.

Jonathan Reeve
Jonathan Reeve

Jonathan Reeve is a third-year graduate student in the Department of English and Comparative Literature at Columbia University, where he specializes in computational literary analysis. Find his recent experiments at jonreeve.com.

If this blog post has stimulated your interest in working with the British Library's digital collections, start a project and enter it for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library to find out who wins.

Posted by BL Labs on behalf of Jonathan Reeve

14 February 2018

BL Labs 2017 Symposium: Movable Type, Commercial Award Winner

Add comment

Movable Type is a tabletop word game, and something of a love letter to classic books and authors, made completely of custom playing cards. While the game’s appearance might remind you of Scrabble, it has some tricks up its sleeve to give it a much more modern and dynamic feel.

Movable Type Card Game
Figure 1: Movable Type Card Game


The initial idea for Movable Type was born around two years ago. I had been making games for some years and knew I wanted to do something with a word game. My main objective was to have a game that was very interactive, easy to grasp, and tactical, while also being very quick to play. As much as I love word games, some of them have a tendency to outstay their welcome. 

The central mechanism in Movable Type is called card-drafting. This method allows players to pick their letter cards each round – this does away with the large amounts of luck you find in many classic word games, and instead shifts attention onto the tactical decisions of the players. It also means that rules are kept very simple and that players can take their turns simultaneously, creating a much more dynamic play environment.

Movable Type - The Cards
Movable Type - The Cards


The prototype for Movable Type was only a few weeks old when I settled on the art style I was going to use in the final product. I’ve been a long-time fan of the British Library Flickr account, which lets users browse through images from public domain books. Once I had spotted the large collection of initial capitals, I was sold!

Movable Type - Illustrated Letters
Movable Type - Illustrated Letters


My wife, Tiffany Moon, is a graphic designer by trade. She helped clean up the images and present these beautiful pieces of art in a colourful new fashion, appropriate for a retail product. I also wanted portraits of some of my favourite authors to be in the game, so I commissioned Alisdair Wood to create woodcut-style images of ten classic, influential and diverse literary figures. Without those initial capital images taken from the British Library collection and used to direct the game’s overall style, Movable Type would likely not look half as impressive and definitely wouldn’t resonate with me and many players like the current style does.

Movable Type - Illustrated Famous People
Movable Type - Illustrated Famous People


I launched Movable Type on the crowdfunding platform, Kickstarter, last year. Upon its release, it won the Imirt Irish Game Award for Best Analog Game and second runner-up for Game of the Year. It received good reception at several public events and sold out of its initial print run, so I decided that a second edition of the game was in order. That bigger and better second edition is funding on Kickstarter and should be in some select retail stores by mid to late 2018 (fingers crossed!).

Movable Type receiving the Commercial Category BL Lab Award, was a huge boon for the reputation of the game and myself as a game designer. Furthermore, it was a genuine honour to be at the British Library for this event and able to share my product with the audience there.

If this blog post has stimulated your interest in working with the British Library's digital collections, start a project and enter it for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library.

Posted by BL Labs on behalf of Robin David O’Keeffe

13 February 2018

BL Labs 2017 Symposium: Samtla, Research Award Runner Up

Add comment

Samtla (Search And Mining Tools for Labelling Archives) was developed to address a need in the humanities for research tools that help to search, browse, compare, and annotate documents stored in digital archives. The system was designed in collaboration with researchers at Southampton University, whose research involved locating shared vocabulary and phrases across an archive of Aramaic Magic Texts from Late Antiquity. The archive contained texts written in Aramaic, Mandaic, Syriac, and Hebrew languages. Due to the morphological complexity of these languages, where morphemes are attached to a root morpheme to mark gender and number, standard approaches and off-the-shelf software were not flexible enough for the task, as they tended to be designed to work with a specific archive or user group. 

Figure1
Figure 1: Samtla supports tolerant search allowing queries to be matched exactly and approximately. (Click to enlarge image)

  Samtla is designed to extract the same or similar information that may be expressed by authors in different ways, whether it is in the choice of vocabulary or the grammar. Traditionally search and text mining tools have been based on words, which limits their use to corpora containing languages were 'words' can be easily identified and extracted from text, e.g. languages with a whitespace character like English, French, German, etc. Word models tend to fail when the language is morphologically complex, like Aramaic, and Hebrew. Samtla addresses these issues by adopting a character-level approach stored in a statistical language model. This means that rather than extracting words, we extract character-sequences representing the morphology of the language, which we then use to match the search terms of the query and rank the documents according to the statistics of the language. Character-based models are language independent as there is no need to preprocess the document, and we can locate words and phrases with a lot of flexibility. As a result Samtla compensates for the variability in language use, spelling errors made by users when they search, and errors in the document as a result of the digitisation process (e.g. OCR errors). 

Figure2
Figure 2: Samtla's document comparison tool displaying a semantically similar passage between two Bibles from different periods. (Click to enlarge image)

 The British Library have been very supportive of the work by openly providing access to their digital archives. The archives ranged in domain, topic, language, and scale, which enabled us to test Samtla’s flexibility to its limits. One of the biggest challenges we faced was indexing larger-scale archives of several gigabytes. Some archives also contained a scan of the original document together with metadata about the structure of the text. This provided a basis for developing new tools that brought researchers closer to the original object, which included highlighting the named entities over both the raw text, and the scanned image.

Currently we are focusing on developing approaches for leveraging the semantics underlying text data in order to help researchers find semantically related information. Semantic annotation is also useful for labelling text data with named entities, and sentiments. Our current aim is to develop approaches for annotating text data in any language or domain, which is challenging due to the fact that languages encode the semantics of a text in different ways.

As a first step we are offering labelled data to researchers, as part of a trial service, in order to help speed up the research process, or provide tagged data for machine learning approaches. If you are interested in participating in this trial, then more information can be found at www.samtla.com.

Figure3
Figure 3: Samtla's annotation tools label the texts with named entities to provide faceted browsing and data layers over the original image. (Click to enlarge image)

 If this blog post has stimulated your interest in working with the British Library's digital collections, start a project and enter it for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library.


Posted by BL Labs on behalf of Dr Martyn Harris, Prof Dan Levene, Prof Mark Levene and Dr Dell Zhang.