THE BRITISH LIBRARY

Digital scholarship blog

68 posts categorized "Tools"

23 March 2018

Shine a light on past entertainments with In the Spotlight

In this post, Dr Mia Ridge and Alex Mendes provide an update on the Library's latest crowdsourcing project...

People who've explored In the Spotlight, our project helping make historic playbills more findable, might have noticed a line of text just above the 'Save and Continue' button: 'Seen something interesting? Add a note'.

Insights from your comments

Since the project began, we've received almost 700 comments [update - it's actually over 1900, across all projects]. Some of them simply tell us that an image is blank or upside-down, but many others share interesting findings. We love hearing from you, and we've been highlighting individual comments on Twitter (@LibCrowds) and on our forum.

Comments have pointed out spectacles including 'a Terrific Eruption of Mount Vesuvius, accompanied by TORRENTS OF BURNING LAVA' and a 'Serpent vomiting Fire'. New amenities mentioned include lighting ('600 wax lights and a new set of gold chandeliers' or new gas lighting) and the addition of backs to seats. Famous actors spotted include Sarah Siddons, Jenny Lind and Ira Aldridge, while Mr Kean has caused all kinds of trouble.

Lots of comments are about performances that aren't plays, from hornpipes to tableaux to ballets, songs, speeches, fireworks, scientific demonstrations, performing animals, panoramas, conjuring and juggling tricks, lists of scenery, gun tricks, pantomimes, acrobatics, excerpts from plays, and even the 'reenactment of the Coronation'! We're thinking hard about the best way to deal with them (and with playbills that don't include a year), and will post to the forum and Twitter to ask for your ideas soon.

General updates

Since we first shared the link, there have been over 4,700 visitors from 91 countries. About 80% are primarily English-speakers, with Russian, German and French the next most popular languages.

We've had over 42,000 contributions from over 630 participants (with 1499 participants registered on the platform overall). Together, they've helped complete 34 projects by undertaking countless marking and transcription tasks to make genres, dates and play titles searchable.

Each project is based on a specific volume of playbills from a regional theatre or theatres. The fastest projects were 'Theatre Royal, Bristol 1819-1823 (Vol. 2)', completed in 8 days, 31 minutes, with 'Miscellaneous Plymouth theatres 1796-1882 (Vol. 1)' a close second at 8 days, 5 hours, 30 minutes. We currently have playbills from theatres in Dublin, Hull, Nottingham, Oswestry and Plymouth. Which will be completed first?

Recent blog posts include a wonderful story from PhD student and In the Spotlight participant Edward Mills tracing an ancient custom through the Library's digitised collections in The Flitch of Bacon: An Unexpected Journey Through the Collections of the British Library, and Christian Algar on the 'rich pageant' of historical playbills.

You might have noticed some small changes to the navigation and data pages as we updated the software this week. Most of the changes were behind the scenes, providing additional admin and analysis functions to ensure that data sent off to the catalogue is as accurate as possible.

Visitors have come from all over the world, but we'd love to reach more

 

Thank you!

We're grateful to everyone who's made a large or small contribution, but particular thanks to Barbara G, David Y, Dina S, Ervins S, Jo B, John L, Katharine S, Kathryn P-S, Lisa G, Maria Antonia V-S, Martin B, mistrec, Olga K, Raphael H, Rosie C, Sharon E, sylvmorris1, Tabitha M, thtrisdead, Tif D, Vijay V and various anonymous posters for your comments. Your comments are also helping us work out how to tweak some of the interfaces so people can let us know about a problem with a task by clicking a button, so expect more improvements in the future!

Step into the Spotlight

It's easy to try out In the Spotlight - you don't need to register, so you can start marking out the titles of plays or transcribing the titles, dates or genres of plays straight away. Give it a go and let us know what you find!

There are wonders galore waiting for the spotlight

14 March 2018

Working with BL Labs in search of Sir Jagadis Chandra Bose

The 19th Century British Library Newspapers Database offers a rich mine of material for a comprehensive view of British life in the nineteenth and early twentieth centuries. The online archive comprises 101 full-text titles of local, regional, and national newspapers across the UK and Ireland, and thanks to optical character recognition, they are all fully searchable. This allows for extensive data mining across several million newspaper pages. It’s like going through the proverbial haystack looking for the equally proverbial needle, but with a magnet in hand.

For my current research project on the role of the radio during the British Raj, I wanted to find out more about Sir Jagadis Chandra Bose (1858–1937), whose contributions to the invention of wireless telegraphy were hardly acknowledged during his lifetime and all but forgotten during the twentieth century.

Jagadish Chandra Bose in Royal Institution, London
(Image from Wikimedia Commons)

The person who is generally credited with having invented the radio is Guglielmo Marconi (1874–1937). In 1909, he and Karl Ferdinand Braun (1850–1918) were awarded the Nobel Prize in Physics “in recognition of their contributions to the development of wireless telegraphy”. What is generally not known is that almost ten years before that, Bose invented a coherer that would prove to be crucial for Marconi’s successful attempt at wireless telegraphy across the Atlantic in 1901. Bose never patented his invention, and Marconi reaped all the glory.

In his book Jagadis Chandra Bose and the Indian Response to Western Science, Subrata Dasgupta gives four reasons why Bose’s contributions to radiotelegraphy were largely forgotten in the West throughout the twentieth century. The first reason, according to Dasgupta, is that Bose changed his research interests around 1900. Instead of continuing to focus on wireless telegraphy, Bose became interested in the physiology of plants and the similarities between inorganic and living matter in their responses to external stimuli. Bose’s name thus lost currency in his former field of study.

A second reason that contributed to the erasure of Bose’s name is that he did not leave a legacy in the form of students. He did not, as Dasgupta puts it, “found a school of radio research” that could promote his name despite his personal absence from the field. Thirdly, Bose sought no monetary gain from his inventions and patented only one of them. Had he patented more, chances are that his name would have echoed loudly through the century, just as Marconi’s has done.

“Finally”, Dasgupta writes, “one cannot ignore the ‘Indian factor’”. Dasgupta wonders how seriously the scientific western elite really took Bose, who was the “outsider”, the “marginal man”, the “lone Indian in the hurly-burly of western scientific technology”. And he wonders how this affected “the seriousness with which others who came later would judge his significance in the annals of wireless telegraphy”.

And this is where the BL’s online archive of nineteenth-century newspapers comes in. Newspaper coverage of Bose in the British press at the time suggests that his contributions to wireless telegraphy were all but forgotten within his own lifetime. When Bose died in 1937, Reuters Calcutta put out a press release that was reprinted in several British newspapers. As an example, the following notice was published in the Derby Evening Telegraph of November 23rd, 1937, on Bose’s death:

Newspaper clipping announcing death of JC Bose
Notice in the Derby Evening Telegraph of November 23rd, 1937

This notice is as short as it is telling in what it says and does not say about Bose and his achievements: he is remembered as the man “who discovered a heart beat in trees”. He is not remembered as the man who almost invented the radio. He is remembered for the Western honours that were bestowed upon him (the Knighthood and his Fellowship of the Royal Society), and he is remembered as the founder of the Bose Research Institute. He is not remembered for his career as a researcher and inventor, a career that spanned five decades and saw him travel extensively in India, Europe and the United States.

The Derby Evening Telegraph is not alone in this act of partial remembrance. Similar articles appeared in Dundee’s Evening Telegraph and Post and The Gloucestershire Echo on the same day. The Aberdeen Press and Journal published a slightly extended version of the Reuters press release on November 24th that includes a brief account of a lecture by Bose in Whitehall in 1929, during which Bose demonstrated “that plants shudder when struck, writhe in the agonies of death, get drunk, and are revived by medicine”. However, there is again no mention of Bose’s work as a physicist or of his contributions to wireless telegraphy. The same is true for obituaries published in The Nottingham Evening Post on November 23rd, The Western Daily Press and Bristol Mirror on November 24th, another article published in the Aberdeen Press and Journal on November 26th, and two articles published in The Manchester Guardian on November 24th.

The exception to the rule is the obituary published in The Times on November 24th. Granted, with a total of 1116 words it is significantly longer than the Reuters press release, but this is also partly the point, as it allows for a much more comprehensive account of Bose’s life and achievements. But even if we only take the first two sentences of The Times obituary, which roughly add up to the word count of the Reuters press release, we are already presented with a different account altogether:

“Our Calcutta Correspondent telegraphs that Sir Jagadis Chandra Bose, F.R.S., died at Giridih, Bengal, yesterday, having nearly reached the age of 79. The reputation he won by persistent investigation and experiment as a physicist was extended to the general public in the Western world, which he frequently visited, by his remarkable gifts as a lecturer, and by the popular appeal of many of his demonstrations.”

We know that he was a physicist; the focus is on his skills as a researcher and on his talents as a lecturer rather than on his Western titles and honours, which are mentioned in passing as titles to his name; and we immediately get a sense of the significance of his work within the scientific community and for the general public. And later on in the article, it is finally acknowledged that Bose “designed an instrument identical in principle with the 'coherer' subsequently used in all systems of wireless communication. Another early invention was an instrument for verifying the laws of refraction, reflection, and polarization of electric waves. These instruments were demonstrated on the occasion of his first appearance before the British Association at the 1896 meeting at Liverpool”.

Posted by BL Labs on behalf of Dr Christin Hoene, a BL Labs Researcher in Residence at the British Library. Dr Hoene is a Leverhulme Early Career Fellow in English Literature at the University of Kent. 

If you are interested in working with the British Library's digital collections, why not come along to one of our events that we are holding at universities around the UK this year? We will be holding a roadshow at the University of Kent on 25 April 2018. You can see a programme for the day and book your place through this Eventbrite page. 

12 March 2018

The Ground Truth: Transcribing historical Arabic Scientific Manuscripts for OCR research

Announcing a collaborative transcription project to support state-of-the-art research in automatic handwritten text recognition for historical Arabic texts

Cultural heritage institutions around the world are digitising hundreds of thousands of pages of historical Arabic manuscript and archive collections. Making these fully text searchable has the potential to truly transform scholarship, opening up this rich content for discovery and enabling large-scale analysis.

Computer scientists and scholars are working on this challenge, building systems which can automatically transcribe images of handwritten text, but for historical Arabic script a solution remains just out of reach.

Our aim is to contribute to continued research in this area by building an open image and ground truth dataset of historical handwritten Arabic texts, ensuring historical Arabic collections benefit from state-of-the-art developments in handwritten text recognition.

What is Ground Truth?

Optical Character Recognition (OCR) systems essentially turn a picture of text into text itself—in other words, producing something like a .TXT or .DOC file from a scanned .JPG of a printed or handwritten page. Most OCR systems require ground truth, a set of files which represent the truthful record of elements of an image, for training and evaluation purposes.

The ground truth of an image’s text content, for instance, is the complete and accurate record of every character and word in the image.

By knowing what the system is supposed to recognise on a page of handwritten text, researchers can both train their system to recognise the characters as well as test how well the system does once trained.
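To make the evaluation half of this concrete, here is a minimal sketch of one common way to score OCR output against ground truth: the character error rate, i.e. the edit distance between the two texts divided by the length of the ground truth. The function names are illustrative and not taken from any of the tools mentioned in this post, and real evaluations usually normalise the text (whitespace, diacritics) first, which this sketch does not attempt.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR output
                current[j - 1] + 1,      # extra character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_output):
    """Edit distance divided by the number of characters in the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)
```

For a four-character ground-truth word where the system misses one character, `character_error_rate` returns 0.25. The same measure works on whole pages of transcription.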

View more transcriptions in progress from this manuscript (Or 3366) on the platform 

A collaborative approach

This project is a proof of concept exploring whether the creation of such a dataset can be done collaboratively at scale, using the collective expertise of volunteers around the world. At the heart of this approach is the Library’s enduring commitment to creating new and interesting ways to connect diverse communities of interest and expertise, be they scholars, the general public, computer scientists, students or curators, around our collections. For this we are utilising a free and open-source platform, From the Page, which allows anyone with an interest in historical Arabic manuscripts to experience them up close, many for the first time, and to discuss, learn and share expertise in their transcription.

Helping transform research

The Digital Scholarship Department was able to fund the development of this open source platform to support Right-to-Left transcription, a feature which will benefit any scholar wishing to use the software for their own transcription needs. Any transcriptions produced in this pilot will be transformed into ground truth resources, hosted by the British Library and made freely available, without rights restriction, for anyone wishing to advance the state-of-the-art in optical character recognition technology. Specifically, resources created will be contributed to ground-breaking projects already underway such as Transkribus, the Open Islamic Texts Initiative, the IMPACT Centre of Competence Image and Ground Truth Resources and more!

Visit the new Arabic Scientific Manuscripts of the British Library transcription platform and download our Getting Started Guide for more detail (an Arabic version will be available shortly). 

  

Posted by Nora McGregor, Digital Curator, British Library

 

28 February 2018

Announcing the BL Labs roadshows locations and dates for 2018!

The @BL_Labs Roadshows: dates and locations for 2018

Do you want to learn more about the British Library’s digital collections? Are you interested in discovering how other researchers have used our digitised material in creative and innovative ways? Would you like to give us feedback on the kinds of services we are providing and would like to provide for digital scholars? Come and meet Library staff and gain an insight into some of the opportunities and challenges of working with our digital content. Get advice, pick up tips, and consider entering the digital project you have been working on for one of the BL Labs Awards (deadline Thursday 11th October 2018).

Our @BL_Labs Roadshows will be held at university departments across the UK between March and June 2018. Events will include presentations from the British Library and host institutions, practical hands-on workshops, and a chance to explore and discuss what you might do with some of the Library's data and to get feedback from experts. We’re also keen to hear your views on some of the long-term services the British Library is hoping to develop for those who want to work with our digital collections and data.

Register for one of the roadshows! They are FREE to attend and OPEN TO ALL (unless otherwise stated). For further details about locations we are visiting this year, see below: 

BL Labs Roadshow locations for 2018

March

  • Monday 26 March 2018 (10:00 – 13:00) - BL Labs Roadshow at CityLIS (City University of London Department of Library and Information Science), London (internal event)

April

May

June

  • Tuesday 5 June 2018 (12:00 – 16:00) - BL Labs Roadshow at the University of Leeds, Leeds
  • Wednesday 27 June 2018 (09:00 – 13:00) - BL Labs Roadshow at the University of Birmingham, Birmingham

You will be able to view the full programme details for each of the roadshows, and book your place via Eventbrite. Links will be live shortly or visit our events page.

For any further questions, please contact us at labs@bl.uk.

The British Library Labs project is funded by the Andrew W. Mellon Foundation and the British Library.

Posted by BL Labs

22 February 2018

BL Labs 2017 Symposium: Picturing Canada and Interactive Map (Staff Award Runner Up)

Putting collection metadata on the map: Picturing Canada

The Picturing Canada project began in 2012 as a British Library, Eccles Centre and Wikimedia UK collaboration to digitise a collection and experiment with releasing high quality reproductions of collection items into the public domain. At its heart the project sought to open up an under-used collection of photographs, connecting them with new audiences and uses outside of the walls of the British Library. It also provided a template for the Library’s subsequent public domain releases and has provided many people with an insight into the depth of the Library’s Canadian collections.

Before the collection could be released it needed to be digitised and robust metadata created. Fortunately the Library had a good working batch of metadata created off the back of work done by researchers from Dalhousie University in the 1980s. The initial use of this to the project was clear, but in digitising the images and putting them and the metadata online something else became apparent: most images had some sort of information (be it a title or a photographer’s studio address) that could be used to determine a geographical location for the images.

At the time, this realisation was parked for future investigation but the 2015 exhibition, ‘Canada Through the Lens’, drawing on the same digitised collection, opened up an opportunity to try to use this information to map the collection and generate new insights into its contents. Much of the coordinate determination and mapping was done by Joan Francis, co-awardee of the BL Labs runner-up prize, who worked to find and add coordinates for the photographs. This was a relatively simple but time-consuming process involving finding locations in the metadata image title or, in the case of a photographer’s studio address, on the photograph itself. These text-based locations were then converted into co-ordinates compatible with Google Fusion Tables (there’s an excellent tutorial here) and added to records for each image.
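The project did this work by hand, but the basic step of turning a place name found in an image title into coordinates can be sketched in a few lines. Everything below is an illustrative assumption rather than the project's actual workflow: the tiny gazetteer (with approximate coordinates), the sample titles, the column names, and the choice to write a single "latitude,longitude" text column.

```python
import csv
import io

# Hypothetical mini-gazetteer; coordinates are approximate (latitude, longitude).
GAZETTEER = {
    "Toronto": (43.65, -79.38),
    "Montreal": (45.50, -73.57),
    "Quebec": (46.81, -71.21),
}


def locate(title):
    """Return (place, (lat, lon)) for the first gazetteer place named in the title."""
    lowered = title.lower()
    for place, coords in GAZETTEER.items():
        if place.lower() in lowered:
            return place, coords
    return None  # no known place name: leave the record for manual checking


def to_csv(titles):
    """Write one CSV row per title that could be located."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "place", "location"])
    for title in titles:
        match = locate(title)
        if match is None:
            continue
        place, (lat, lon) = match
        writer.writerow([title, place, f"{lat},{lon}"])
    return buffer.getvalue()
```

Simple substring matching like this will miss misspellings and confuse places that share a name, which is one reason the manual checking described above was time-consuming.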

 

The result of this is the map that you see above, a series of points which can be clicked on to see a partial metadata record for the item as well as a link to the photograph itself on Wikimedia Commons. As the work is time-consuming and fraught with potential error, we have so far produced a robust mapping of only about four fifths of the collection, and this is the work you see here. Interestingly, the map is not just a useful finding aid, although it performs this function very well.

Mapping the collection also provides insight into the geography of photographic production in Canada during the period this collection was created (1895 – 1923). It is clear, for instance, how significant the eastern metropolitan areas of Toronto, Montreal and Quebec are to Canada’s photographic production in this period. Similarly, the corridors of production seen running close to the Canada-US border and occasionally spurring north also suggest the significance of the railroad to Canada’s photographic economy. So the map helps users to find images but also offers more questions; an exciting prospect for continued work.

Posted by BL Labs on behalf of Philip Hatfield and Joan Francis

Submit a project for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library.

21 February 2018

BL Labs 2017 Symposium: Opening up the British Library’s Early Indian Printed Books Collection (Staff Award Winner)

Making the British Library’s valuable collection of early Bengali books more accessible to researchers and the general public around the world rests heavily on the collaborative work undertaken across different teams of the library and partners in the UK and abroad. The commitment and passion of the project team has relied on the contribution and expertise of collaborators, as well as the forward thinking vision of the library, partners and fundraisers.

Receiving the BL Labs Staff Award 2017 is a great opportunity to thank everyone involved. 

Members of the Two Centuries of Indian Print team receiving the British Library Labs award at the Symposium on 30th October 2017
 
Tom Derrick (Digital Curator) was in India at the same time the team received their Award

The Two Centuries of Indian Print project is a partnership between the British Library, the School of Cultural Texts and Records (SCTR) at Jadavpur University, Srishti Institute of Art, Design and Technology, and the Library at SOAS University of London, among others. It has also involved collaborations with the National Library of India, and other institutions in India.

The AHRC Newton-Bhabha Fund and the Department for Business, Energy and Industrial Strategy have generously funded the work undertaken so far by the project, focusing on early printed Bengali books. Many are unavailable in other library collections or are extremely difficult to locate and access. The project has undertaken a variety of initiatives from the digitisation of books and enhancement of the catalogue records in English and Bengali, to stimulating the use of digital humanities tools and techniques, running a programme of digital skills sharing and capacity building workshops, and hosting the South Asia Series seminars. All of these initiatives greatly contribute to the discovery and study of the collection. The project is also conducting ground breaking work in finding a solution to Optical Character Recognition (OCR) in Bangla script. OCR is not available for South Asian languages currently and harnessing viable Optical Character Recognition technology would enable full text search of the books, paving the way for researchers to use natural language processing techniques to perform large scale analysis across a large corpus of text covering a diverse range of topics relating to Indian society, religion, and politics to name but a few. Doing so will increase the possibilities for new discoveries in this academic field. 

However, despite its status as one of the most widely spoken languages in the world, Bangla script has been greatly underserved by providers of OCR solutions. This is due in part to the orthographical and typographical variances that have taken place in recent centuries that make building a dictionary and character ‘classifier’ more challenging. Due to the wide date range of the books we are digitising, these issues affect the quality of OCR. The physical condition of our historical books, including faded text, presents additional difficulties for creating machine readable versions of the books. 

To overcome these obstacles, the project team has been advancing the development of OCR for Bangla through the organisation of an international competition which reviewed the state-of-the-art in commercial and open source text recognition tools. The results of the competition will be announced at the ICDAR 2017 conference in Kyoto later this month. Watch this space! The competition dataset has been made openly available for download and reuse for any researchers or institutions who would like to experiment with OCR for Bengali.

A page from the Animal Biographies, VT 1712 showing its transcription produced for the ICDAR 2017 competition

The project has organised two Skills Exchange Programmes, hosting mid-career Library professionals from the National Library of India at the British Library for a week, providing a packed programme of tours and talks from all areas of the Library. The project has also conducted digital skills sharing and capacity building workshops for library professionals and archivists from cultural heritage institutions in India. The first workshop took place at Jadavpur University, Kolkata, in December 2016. Library and information professionals from cultural heritage institutions in Bengal took part in a one-day event to learn more about how information technology is transforming humanities research today and in turn Library services, as well as the methods for interrogating humanities-related datasets.

After the success of this first workshop another event was held in July 2017, at which more than 30 library professionals discussed OCR developments for Bangla, trying out different tools and discussing digital scholarship techniques and projects. Most recently, the project’s digital curator facilitated a workshop around Digitisation Standards at the International Conference of Asian Libraries in Delhi. The workshops continue in earnest in the new year with another digital humanities skills workshop planned for January 2018 to be held in partnership with the Srishti Institute of Art, Design, and Technology.

Attendees of the workshop held at Jadavpur University in December 2016 taking part in a group activity to discuss the application of digital humanities methods to library collections

The Project Team also held a two day Academic Symposium on South Asian book history at Jadavpur University in the summer, with 17 speakers from India, wider South Asia, and the UK. Attendance was between 50 and 70 people a day and feedback was very good. We plan to have a publication arising from this Symposium, and to upload a video to our project webspace. The project also hosts a popular series of talks based around the Two Centuries of Indian Print project and the British Library’s South Asia collections. The seminars take place fortnightly at the British Library. So far we have hosted a range of academics and researchers, from PhD students to senior academics from the UK and abroad, who share cutting-edge research with discussion chaired by curators and specialists in the field. The seminars have been a great success, attracting large attendances and speakers from around the world. We also host a number of show and tells of our material to raise awareness of our collection and to engage in community outreach.

Everyone on the project is thrilled to have won this award and we will be working hard in 2018 to continue bringing the Two Centuries of Indian Print project to the attention and use of researchers and the general public.

Submit a project for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library.

Posted by BL Labs on behalf of The Two Centuries of Indian Print team.

15 February 2018

BL Labs 2017 Symposium: Git Lit, Learning & Teaching Award Runner Up

Applications of Distributed Version Control Technologies Toward the Creation of 50,000 Digital Scholarly Editions

The British Library maintains a collection of roughly 50,000 digital texts, scanned from public-domain books, most of which were originally published in the 19th century. As scanned books, their text format is Analyzed Layout and Text Object (ALTO) Extensible Markup Language (XML), a verbose markup format created by Optical Character Recognition (OCR) software, and one which is only marginally human-readable. Our project, Git-Lit, converts each text to the plain text format Markdown, creates version-controlled repositories for each using the distributed version control system Git, and posts the repositories to the project management platform GitHub, where they can be edited by anyone. Along the way, websites for each text, optimized for readability, are automatically generated via GitHub Pages. These websites integrate with the annotation platform Hypothes.is, enabling them to be annotated. In this way, Git-Lit aims to make this collection of British Library electronic texts discoverable, readable, editable, annotatable, and downloadable.
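As a rough illustration of the conversion step, the sketch below pulls the recognised words out of an ALTO document and joins them into paragraphs of plain text. It is not Git-Lit's actual code: the function names and the sample XML are made up for this example, it assumes the usual ALTO nesting of `TextBlock`, `TextLine` and `String` elements with the word held in a `CONTENT` attribute, and it ignores hyphenation, headings and page structure.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_to_markdown(alto_xml):
    """Return the text of an ALTO document, one paragraph per TextBlock."""
    root = ET.fromstring(alto_xml)
    paragraphs = []
    for block in root.iter():
        if local_name(block.tag) != "TextBlock":
            continue
        lines = []
        for line in block.iter():
            if local_name(line.tag) != "TextLine":
                continue
            # Each String element carries one recognised word in CONTENT.
            words = [
                el.attrib["CONTENT"]
                for el in line
                if local_name(el.tag) == "String" and "CONTENT" in el.attrib
            ]
            if words:
                lines.append(" ".join(words))
        if lines:
            paragraphs.append(" ".join(lines))
    # Blank lines between paragraphs are all that plain Markdown needs here.
    return "\n\n".join(paragraphs) + "\n"


# Made-up sample, much shorter than a real scanned page.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1">
      <TextLine><String CONTENT="It"/><SP/><String CONTENT="was"/></TextLine>
      <TextLine><String CONTENT="a"/><SP/><String CONTENT="dark"/><SP/><String CONTENT="night."/></TextLine>
    </TextBlock>
    <TextBlock ID="b2">
      <TextLine><String CONTENT="Chapter"/><SP/><String CONTENT="II"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""
```

Calling `alto_to_markdown(SAMPLE)` yields two short paragraphs separated by a blank line. The contrast between the markup and the result gives a sense of how much of the ALTO file is layout information rather than text.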

A Screenshot of the Website Automatically Generated from the British Library Electronic Text


The biggest advantage of using a distributed version control system like Git is that it leverages the kinds of decentralized collaboration workflows that have long been in use in software development. Open-source software and web development, for which Git and GitHub were originally designed, is a much-studied methodology, long proven to be more effective than closed-source methods. Rather than maintain a central silo for serving code and electronic texts, the decentralized approach ensures a plurality of textual versions. Since anyone may copy ("fork") a project, modify it, and create their own version, there is no one central, canonical text, but many. Each version may freely borrow ("pull") from others, request that others integrate their changes ("pull request"), and discuss potential changes ("issues") using the project management subsystems of GitHub. This workflow streamlines collaboration, and encourages external contributions. Furthermore, since each change ("commit") requires a description of the commit, and reasons for it, the Git platform enforces the kind of editorial documentation necessary for scholarly editing. We like to think of git-based editing, therefore, as scholarly editing, and GitHub-based collaboration as a democratization of scholarly editing.

Furthermore, since GitHub allows instant editing of texts in the web browser, it is a simple and intuitive method of crowdsourcing the text cleanup process. Since OCR'd texts are often full of errors, GitHub allows any reader to correct an obvious OCR error she or he finds. The analogous process of reporting errors to centralized text repositories like Project Gutenberg has been known to take several years. On GitHub, however, it is instantaneous.

Not the least advantage of this setup is the automated creation of websites from the plain text sources. Not only does this transform the Markdown into a clean, readable edition of the text, but it also provides integration with the annotation platform Hypothes.is. Hypothes.is allows for social annotation of a text, making it ideal for classroom use. Professors may assign a British Library text as a course reading, and may require their students to annotate it, an activity which can generate discussions in the limitless virtual margins of this electronic textual space.

The Git-Lit project has so far posted around 50 texts to GitHub, as prototypes, with the full corpus of roughly 50,000 texts soon to come. After the full corpus is processed in this way, we'll begin enhancing some of the metadata. So far, we have developed techniques for probabilistically inferring the language of each text, and using Ben Schmidt's document vectorization method, Stable Random Projection, we have been able to probabilistically infer Library of Congress classifications, as well. This enables the automatic generation of sub-corpora like PR (English literature) or PS (American literature).
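For readers curious about how a random projection can turn a text into a comparable vector, the following Python sketch shows the general idea. It is a simplified illustration, not Ben Schmidt's implementation and not the code used in Git-Lit. The choice of SHA-1, the 160 dimensions, the log weighting, and all function names are assumptions of this sketch. Each word is hashed to a fixed pattern of +1 and -1 values, and a document's vector is the weighted sum of its words' patterns, so texts with similar vocabulary end up with similar vectors.

```python
import hashlib
import math
import re
from collections import Counter


def word_signs(word, dims=160):
    """Derive a fixed +1/-1 pattern for a word from repeated SHA-1 hashes."""
    signs = []
    block = 0
    while len(signs) < dims:
        digest = hashlib.sha1(f"{word}_{block}".encode("utf-8")).digest()
        for byte in digest:
            for bit in range(8):
                signs.append(1 if (byte >> bit) & 1 else -1)
        block += 1
    return signs[:dims]


def project(text, dims=160):
    """Sum the sign patterns of all words, weighted by log word frequency."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    vector = [0.0] * dims
    for word, count in counts.items():
        weight = math.log(1 + count)
        for i, sign in enumerate(word_signs(word, dims)):
            vector[i] += weight * sign
    return vector


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


a = project("the quick brown fox jumps over the lazy dog")
b = project("the quick brown fox leaps over the lazy dog")
c = project("stock markets fell sharply amid inflation fears")
print(cosine(a, b) > cosine(a, c))
```

A classification could then be guessed for an unlabelled text by comparing its vector with the vectors of texts whose classification is already known, for instance by taking the label of the nearest neighbours.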

In the coming year, we hope to integrate the Git-Lit transformed British Library texts into a structured database, further enhancing the discoverability of its texts. We have just received a micro-grant from NYC-DH to help launch Corpus-DB, a project also aiming to produce textual corpora. Through Corpus-DB, we will soon create a SQL database containing the metadata, our enhanced and inferred metadata, and other aggregated book data gleaned from public APIs. This will soon give readers and computational text analysts the ability to download groups of British Library electronic texts. Users interested in downloading, say, all novels set in London, will be able to get a complete full-text dump of all public-domain novels in this category by visiting a URL such as api.corpus-db.org/novels/setting/London. We expect that this will greatly streamline the corpus creation process that takes up so much time in computational text analysis.
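As a rough picture of how such a metadata database could serve sub-corpora, here is a small Python sketch using SQLite. Everything in it is hypothetical: the table name, the columns, the identifiers and the placeholder rows are invented for illustration, since the Corpus-DB schema is not described here. The query plays the role that a request like the example URL above might play.

```python
import sqlite3

# In-memory database with a hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE books (
           id TEXT PRIMARY KEY,   -- made-up identifier
           title TEXT,
           lcc_class TEXT,        -- inferred Library of Congress class
           setting TEXT           -- place where the narrative is set
       )"""
)
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?, ?)",
    [
        ("000000001", "Placeholder Novel A", "PR", "London"),
        ("000000002", "Placeholder Novel B", "PR", "Edinburgh"),
        ("000000003", "Placeholder Novel C", "PR", "London"),
        ("000000004", "Placeholder Travel Book", "DA", "London"),
    ],
)


def ids_for(conn, lcc_class, setting):
    """Return the identifiers of all books in a class with a given setting."""
    rows = conn.execute(
        "SELECT id FROM books WHERE lcc_class = ? AND setting = ? ORDER BY id",
        (lcc_class, setting),
    )
    return [row[0] for row in rows]


print(ids_for(conn, "PR", "London"))
```

The returned identifiers could then be used to fetch the corresponding full texts, for example from their repositories.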

Both Git-Lit and Corpus-DB are open-source projects, open to contributions from anyone, regardless of skill. If you'd like to contribute to our project in some way, get in contact with us, and we'll tell you how you can help.

Jonathan Reeve

Jonathan Reeve is a third-year graduate student in the Department of English and Comparative Literature at Columbia University, where he specializes in computational literary analysis. Find his recent experiments at jonreeve.com.

If this blog post has stimulated your interest in working with the British Library's digital collections, start a project and enter it for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library to find out who wins.

Posted by BL Labs on behalf of Jonathan Reeve

13 February 2018

BL Labs 2017 Symposium: Samtla, Research Award Runner Up

Add comment

Samtla (Search And Mining Tools for Labelling Archives) was developed to address a need in the humanities for research tools that help to search, browse, compare, and annotate documents stored in digital archives. The system was designed in collaboration with researchers at Southampton University, whose research involved locating shared vocabulary and phrases across an archive of Aramaic Magic Texts from Late Antiquity. The archive contained texts written in Aramaic, Mandaic, Syriac, and Hebrew languages. Due to the morphological complexity of these languages, where morphemes are attached to a root morpheme to mark gender and number, standard approaches and off-the-shelf software were not flexible enough for the task, as they tended to be designed to work with a specific archive or user group. 

Figure 1: Samtla supports tolerant search allowing queries to be matched exactly and approximately.

Samtla is designed to extract the same or similar information that may be expressed by authors in different ways, whether it is in the choice of vocabulary or the grammar. Traditionally, search and text mining tools have been based on words, which limits their use to corpora containing languages where 'words' can be easily identified and extracted from text, e.g. languages that separate words with whitespace, like English, French, and German. Word models tend to fail when the language is morphologically complex, like Aramaic and Hebrew. Samtla addresses these issues by adopting a character-level approach stored in a statistical language model. This means that rather than extracting words, we extract character sequences representing the morphology of the language, which we then use to match the search terms of the query and rank the documents according to the statistics of the language. Character-based models are language independent as there is no need to preprocess the document, and we can locate words and phrases with a lot of flexibility. As a result, Samtla compensates for the variability in language use, spelling errors made by users when they search, and errors in the document as a result of the digitisation process (e.g. OCR errors).
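The idea of matching on character sequences rather than words can be illustrated with a short Python sketch. This is not Samtla's statistical language model. It is a much simpler stand-in that ranks documents by the share of the query's character trigrams they contain, and the function names and sample documents are made up for the example. It does show why a misspelt or variant query can still find the right document.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count the overlapping character sequences of length n in a text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def score(query, document, n=3):
    """Fraction of the query's character n-grams that occur in the document."""
    query_grams = char_ngrams(query, n)
    doc_grams = char_ngrams(document, n)
    total = sum(query_grams.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, doc_grams[gram])
                  for gram, count in query_grams.items())
    return matched / total


def search(query, documents, n=3):
    """Return the documents ordered from best to worst match."""
    return sorted(documents, key=lambda doc: score(query, doc, n), reverse=True)


documents = [
    "The digitisation of archives",
    "A Terrific Eruption of Mount Vesuvius",
    "Performing animals and conjuring tricks",
]

# The query uses a different spelling from the document.
print(search("digitization", documents)[0])
```

No tokenisation or language-specific preprocessing is needed here, which is the property that makes character-level approaches attractive for languages without clear word boundaries.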

Figure 2: Samtla's document comparison tool displaying a semantically similar passage between two Bibles from different periods.

The British Library have been very supportive of the work by openly providing access to their digital archives. The archives ranged in domain, topic, language, and scale, which enabled us to test Samtla’s flexibility to its limits. One of the biggest challenges we faced was indexing larger-scale archives of several gigabytes. Some archives also contained a scan of the original document together with metadata about the structure of the text. This provided a basis for developing new tools that brought researchers closer to the original object, which included highlighting the named entities over both the raw text and the scanned image.

Currently we are focusing on developing approaches for leveraging the semantics underlying text data in order to help researchers find semantically related information. Semantic annotation is also useful for labelling text data with named entities and sentiments. Our current aim is to develop approaches for annotating text data in any language or domain, which is challenging because languages encode the semantics of a text in different ways.

As a first step we are offering labelled data to researchers, as part of a trial service, in order to help speed up the research process, or provide tagged data for machine learning approaches. If you are interested in participating in this trial, then more information can be found at www.samtla.com.

Figure 3: Samtla's annotation tools label the texts with named entities to provide faceted browsing and data layers over the original image.

If this blog post has stimulated your interest in working with the British Library's digital collections, start a project and enter it for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library.


Posted by BL Labs on behalf of Dr Martyn Harris, Prof Dan Levene, Prof Mark Levene and Dr Dell Zhang.