THE BRITISH LIBRARY

Digital scholarship blog

73 posts categorized "Tools"

30 July 2018

British Library Labs Staff Awards 2018: Looking for entries now!

Nominate a British Library staff member or a team that has done something exciting, innovative and cool with the British Library’s digital collections or data.

The 2018 British Library Labs Staff Award, now in its third year, gives recognition to current British Library staff who have created something brilliant using the Library’s digital collections or data.

Perhaps you know of a project that developed new forms of knowledge, or an activity that delivered commercial value to the library. Did the person or team create an artistic work that inspired, stimulated, amazed and provoked? Do you know of a project developed by the Library where quality learning experiences were generated using the Library’s digital content? 

You may nominate a current member of British Library staff, a team, or yourself, for the Staff Award using this form.

The deadline for submission is 12:00 (BST), Friday 12 October 2018.

Nominees will be highlighted on Monday 12 November 2018 at the British Library Labs Annual Symposium where some (winners and runners-up) will also be asked to talk about their projects.

The Staff Award complements the British Library Labs Awards, introduced in 2015, which recognise outstanding work that has been done in the broader community. Last year’s winners drew attention to artistic, research, and entrepreneurial activities that used our digital collections.

British Library Labs is a project within the Digital Scholarship department at the British Library that supports and inspires the use of the Library's digital collections and data in exciting and innovative ways. It is funded by the Andrew W. Mellon Foundation.

If you have any questions, please contact us at labs@bl.uk.

@bl_labs #bldigital @bl_digischol

16 July 2018

Crowdsourcing comedy: date and genre results from In the Spotlight

Beatrice Ashton-Lelliott is a PhD researcher at the University of Portsmouth studying the presentation of nineteenth-century magicians in biographies, literature, and the popular press. She is currently a research placement student on the British Library’s In the Spotlight project, cleaning and contextualising the crowdsourced playbills data. She can be found on Twitter at @beeashlell and you can join the In the Spotlight project at playbills.libcrowds.com.

In this blog post I discuss the data created so far by In the Spotlight volunteers via crowdsourcing – which has already thrown out quite a few surprises along the way! All of the data which I discuss was cleaned using Open Refine, with some manual intervention by me to group categories such as genre. My first post below highlights the most notable results to come out of the date and genre tasks so far, and a second post will present similar findings for play titles and playwrights.

Dates

I started off by analysing the dates generated by the projects as, to be honest, it seemed easiest! One of the problems we’ve encountered with the date tasks, however, is that a number of the playbills do not show a full date.  This is notable in itself but unsurprising – why would a playbill in the eighteenth or nineteenth century need a full date when they weren’t expected to last two hundred years into the future? With that in mind, this is by no means an exhaustive data set.

After creating a simple graph of the most popular dates, it became clear that we had a huge spike in the number of performances in 1825. Was something relevant to theatre history happening during this year, or were the sources of the playbill collections just unusually pro-active in 1825 after taking some time off? Was the paper stock quality better, so more playbills have lasted? The outside influence of the original collector or owner of these playbills is also something to consider: maybe they were more interested in one type of performance than others, or had more time to collect playbills in certain years or in certain places. A final potential factor is that this data comes only from the volumes added to the site projects so far, and so isn’t indicative of the Library’s playbills as a whole.

Aside from source or collector influence, some other possible explanations do present themselves. Britain in general was growing exponentially, with London in particular becoming one of the biggest cities in the world, and this era also saw the birth of railways and the extravagant influence of figures such as George IV. As this is coming off the back of what seems to be a very slow year in 1824, however, perhaps it is best just to chalk this up to the activity of the collectors. We also have another noticeable spike in 1829, but by no means as dramatic as that of 1825. I’ve spent a bit of time comparing the number of performances seen in the volumes with other online performance date tools, such as UMass's Adelphi calendar and Godwin’s Diary, but would love to hear any further insights into this!
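For anyone who would like to try a similar tally on their own export of the date results, here is a minimal Python sketch. It is not the method used for this post (the data here was cleaned in Open Refine); the function names and the sample strings are hypothetical, and it assumes that a transcribed date, where it includes a year at all, contains it as four digits starting with 17 or 18. Entries without a year are skipped rather than guessed.

```python
import re
from collections import Counter

# Assumption: a usable year appears as four digits beginning 17 or 18.
YEAR_PATTERN = re.compile(r"\b(1[78]\d{2})\b")


def extract_year(raw_date):
    """Return the year found in a transcribed date string, or None if there is none."""
    match = YEAR_PATTERN.search(raw_date)
    return int(match.group(1)) if match else None


def count_by_year(raw_dates):
    """Count performances per year, ignoring dates that do not state a year."""
    years = (extract_year(raw) for raw in raw_dates)
    return Counter(year for year in years if year is not None)


# Hypothetical sample values, not real project data.
sample = ["1825-03-14", "12 May 1825", "Monday, June 6", "1824-11-02"]
counts = count_by_year(sample)
print(counts.most_common())  # [(1825, 2), (1824, 1)]
```

Skipping the year-less entries keeps the counts honest, but it also means any spike should be read against how many playbills were dropped for each volume.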

A graph showing the most popular performance dates

Genre

The main issue I faced in working with the genre data was the wide variety of descriptors used on the playbills themselves. For instance, I encountered burlesque, burletta and burlesque burletta – which of the first two categories would the last one go under? When I went back to the playbills themselves, it was also clear that many of the ‘genres’ generated were more like comments from theatre managers or just descriptions e.g. ‘an amusing sketch’. With this in mind, genre was the dataset which I ‘interfered’ with the most from a cleaning point of view.

Some of the calls I made were to group anything cited as ‘dramatic ___’ with drama more widely, unless it had a notable second qualifier, such as pantomime, Romance or sketch. I also grouped anything mentioning ‘historical’ together, as from a research point of view this is probably the most prominent aspect, grouped harlequinades with pantomimes (although I know this might be controversial!) and grouped anything which involved a large organisation, such as military, Masonic or national performances, under ‘organisational’. Some were difficult to separate – I did wonder about grouping variety and vaudeville together, but as there were so few of each it seemed better to leave them be.
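The grouping calls described above could be written down as explicit rules, which makes them easier to review and revisit. The following Python sketch is an illustration only: the actual cleaning was done in Open Refine with manual intervention, the function and constant names are hypothetical, and both the order in which the rules are applied and the choice to file a ‘dramatic ___’ entry under its second qualifier are assumptions rather than documented decisions.

```python
# Terms taken from the grouping decisions described in the post.
ORGANISATIONAL_TERMS = ("military", "masonic", "national")
SECOND_QUALIFIERS = ("pantomime", "romance", "sketch")


def group_genre(raw_genre):
    """Map a transcribed genre label to a broader group (rule order is an assumption)."""
    label = " ".join(raw_genre.lower().split())  # normalise case and whitespace
    if "historical" in label:
        return "historical"
    if any(term in label for term in ORGANISATIONAL_TERMS):
        return "organisational"
    if "harlequinade" in label:
        return "pantomime"
    if label.startswith("dramatic"):
        # 'dramatic ___' joins drama unless it has a notable second qualifier.
        for qualifier in SECOND_QUALIFIERS:
            if qualifier in label:
                return qualifier
        return "drama"
    # Everything else, including farce, comedy, variety and vaudeville, is left as is.
    return label


print(group_genre("Dramatic Piece"))      # drama
print(group_genre("Dramatic Romance"))    # romance
print(group_genre("Harlequinade"))        # pantomime
```

Labels that match no rule are returned unchanged, so ambiguous cases such as ‘burlesque burletta’ still need a human decision.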

With these qualifications in mind, by far the most popular genre in the collections was farce, which I kept distinct from comedy, clocking up 537 performances from the projects. This was closely followed by comedy more generally with 527 performances, with the drama (197), melodrama (150) and tragedy (135) trailing afterwards. Once again, it could purely be that the original collectors of these volumes had more of a taste for comedy than drama, but there is such a wide gap in popularity from the volumes so far that it seems fair to conclude that the regional theatre-going public of the late eighteenth and early nineteenth centuries preferred to be cheered rather than saddened by their entertainment.

A graph showing the most popular genres in records transcribed to date

You can contribute to this research

The more contributions we receive, the more accurate the titles, genre and dates results will be, so whether you’re looking out for your local theatre or interested in the more unusual performances which crop up, get involved with the project today at playbills.libcrowds.com. In the Spotlight is well on the way to hitting 100,000 contributions – make sure that you’re one of them!

15 June 2018

Team @BL_DigiSchol join @thecarpentries at #CarpentryCon2018 in Dublin

CarpentryCon 2018 conference opening
UCD campus, Dublin

Members of the Digital Scholarship team, Alex, Rossitza and Stella, attended the inaugural conference of The Carpentries community, held on the relaxing campus of University College Dublin from 29 May to 1 June. The atmosphere at the event was energising thanks to the enthusiasm of the community members who volunteer to teach computational, coding and data science skills to researchers worldwide.

The theme of the event “Building Locally, Connecting Globally” permeated the rich programme of talks and interactive sessions that focused on sharing knowledge, networking and developing new content and strategies for strengthening and growing The Carpentries. A report on the conference has been published by Belinda Weaver and this blog post by Raniere Silva summarises well some of the key messages.

Our team exhibited a poster on the Digital Scholarship staff training programme that creates opportunities for staff at The British Library to develop the necessary skills and knowledge to support emerging areas of modern scholarship.

Digital Scholarship poster
Rossitza, Stella and Alex with the team poster

Particularly relevant for us, therefore, were the sessions led by Belinda Weaver and Chris Erdmann about growing the software and data skills training provision for library professionals. We engaged in a conversation with members of The Library Carpentry community about how best to review and create new curricula and resources, as well as how the needs of the broader community of cultural heritage professionals may vary. There are opportunities to work with university departments, professional bodies and regional consortia to get library and other GLAM professionals involved with The Library Carpentry. Watch this space for our team's involvement with The Carpentries and for further updates follow The Library Carpentry blog and Twitter feed, and The Carpentry Clippings newsletter.

Below are just a few highlights from the sessions we took part in:

@frameshiftlic : Diversity and inclusion go hand in hand. Much more needs to be done to increase diversity and inclusivity in the technology sector.

The Carpentries community uses GitHub to maintain training materials, and good guidance was provided on how to clone and fork repositories and submit pull requests. A great teaching resource, Happy Git and GitHub for the useR, is being developed for Software Carpentry by Jennifer Bryan.

Greg Wilson offered advice on how to keep refreshing teaching methods and content for both the learners’ and instructors’ benefit. His reading list for engaging learners includes The discussion book: 50 great ways to get people talking and Understanding how we learn: A Visual Guide.

Tracy Teal talked about the funding model, operations and infrastructure of The Carpentries, which has updated its website, logo, handbook and Code of Conduct. Curriculum development, equality and inclusion, and building local capacity for training remain high priorities for the community.

Most engaging was the interactive breakout session on developing a new software carpentry lesson on High Performance Computing (HPC). The session leader Alan O’Cais used the classroom engagement platform Socrative to gather attendees’ feedback on existing lessons, appropriate content and the learner profile.

Other great sessions covered best approaches to teaching live coding at university, post-workshop community development strategies, and how organisations, such as The Software Sustainability Institute and ELIXIR, have been supporting The Carpentries community initiatives.

#CarpentryCon 2018 delegates. Image by Bérénice Batut available at https://flic.kr/p/252fVid under CC-BY-SA 2.0



14 May 2018

Seeing British Library collections through a digital lens

Digital Curator Mia Ridge writes: in this guest post, Dr Giles Bergel describes some experiments with the Library's digitised images...

The University of Oxford’s Visual Geometry Group has been working with a number of British Library curators to apply computer vision technology to their collections. On April 5 of this year I was invited by BL Digital Curator Dr. Mia Ridge to St. Pancras to showcase some of this work and to give curators the opportunity to try the tools out for themselves.  

Visual Geometry’s VISE tool matching two identical images from separate books digitised for the British Library’s Two Centuries of Indian Print project.

Computer vision - the extraction of meaning from images - has made considerable strides in recent years, particularly through the application of so-called ‘deep learning’ to large datasets. Cultural collections provide some of the most interesting test-cases for computer vision researchers, due to their complexity, to the intensity of interest that researchers bring to them, and to their importance for human well-being. Can computers see collections as humans do? Computer vision is perhaps better regarded as a powerful lens rather than as a substitute for human curation. A computer can search a large collection of images far more quickly than can a single picture researcher: while it will not bring the same contextual understanding to bear on an image, it has the advantage of speed and comprehensiveness. Sometimes, a computer vision system can surprise the researcher by suggesting similarities that weren’t readily apparent.

As a relatively new technology, computer vision attracts legitimate concerns about privacy, ethics and fairness. By making its state of the art tools freely available, Visual Geometry hope to encourage experimentation and responsible use, and to enlist users to help determine what they can and cannot do. Cultural collections provide a searching test-case for the state of the art, due to their diversity as media (prints, paintings, stamped images, photographs, film and more), each of which invites different responses. One BL curator made a telling point by searching the BBC News collection with the term 'football': the system was presented with images previously tagged with that word that related to American, Gaelic, Rugby and Association football. Although inconclusive due to lack of sufficiently specific training data, the test asked whether a computer could (or should) pick the most popular instances; attempt to generalise across multiple meanings; or discern separate usages. Despite advances in processing power and in software methods, computers' ability to generalise; to extract semantic meaning from images or texts; and to cope with overlapping or ambiguous concepts remains very basic.

Other tests with BL images have been more immediately successful. Visual Geometry's Traherne tool, developed originally to detect differences in typesetting in early printed books, worked well with many materials that exhibit small differences, such as postage stamps or doctored photographs. Visual Geometry's Image Search Engine (VISE) has shown itself capable of retrieving matching illustrations in books digitised for the Library's Indian Print project, as well as certain bookbinding features, or popular printed ballads. Some years ago Visual Geometry produced a search interface for the Library's 1 Million Images release. A collaboration between the Library's Endangered Archives programme and Oxford researcher David Zeitlyn on the archive of Cameroonian studio photographer Jacques Toussele employed facial recognition as well as pattern detection. VGG's facial recognition software works on video (BBC News, for example) as well as still photographs and art, and is soon to be freely released to join other tools under the banner of the Seebibyte Project.    

I'll be returning to the Library in June to help curators explore using the tools with their own images. For more information on the work of Visual Geometry on cultural collections, subscribe to the project's Google Group or contact Giles Bergel.      

Dr. Giles Bergel is a digital humanist based in the Visual Geometry Group in the Department of Engineering Science at the University of Oxford.  

The event was supported by the Seebibyte project under an EPSRC Programme Grant EP/M013774/1.

 

21 April 2018

On the Road (Again)

Image from the British Library’s Million Images on Flickr, found on p 198 of 'The Cruise of the Land Yacht “Wanderer”; or, thirteen hundred miles in my caravan, etc' by William Gordon Stables, 1886.

Now that British Summer Time has officially arrived, and with it some warmer weather, British Library Labs are hitting the road again with a series of events in universities around the UK. The aim of these half-day roadshows is to inspire people to think about using the library's digitised collections and datasets in their research, art works, sound installations, apps, businesses... you name it!

A digitised copy of a manuscript is a very convenient medium to work on, especially if you are unable to visit the library in person and order an original item up to a reading room. But there are so many other uses for digitised items! Come along to one of the BL Labs Roadshows at a university department near you and find out more about the methods used by researchers in Digital Scholarship, from data-mining and crowdsourcing to optical character recognition for transcribing the words from an imaged page into searchable text.

At each of the roadshow events, there will be speakers from the host institution describing some of the research projects they have already completed using digitised materials, as well as members of the British Library who will be able to talk with you about proposed research plans involving digitised resources. 

The locations of this year's roadshows are: 

Mon 26th March - BL Labs Roadshow 2018 (CityLIS) - internal event

Mon 9th April - BL Labs Roadshow 2018 (Open University) - internal event

Thu 12th April - BL Labs Roadshow 2018 (University of Bristol & Cardiff Digital Cultures Network)

Tue 24th April - BL Labs Roadshow 2018 (UCL)

Wed 25th April - BL Labs Roadshow 2018 (University of Kent)

Wed 2nd May - BL Labs Roadshow 2018 (University of Edinburgh)

Tue 15th May - BL Labs Roadshow 2018 (University of Wolverhampton)

Wed 16th May - BL Labs Roadshow 2018 (University of Lincoln)

Tue 5th June - BL Labs Roadshow 2018 (University of Leeds)

  BL Labs Roadshows 2018
See a full programme and book your place using the Eventbrite page for each event.

If you want to discover more about the digital collections and Digital Scholarship at the British Library, follow us on Twitter @BL_Labs, read our blog posts, and get in touch with BL Labs if you have some burning research questions!

23 March 2018

Shine a light on past entertainments with In the Spotlight

In this post, Dr Mia Ridge and Alex Mendes provide an update on the Library's latest crowdsourcing project...

People who've explored In the Spotlight, our project helping make historic playbills more findable, might have noticed a line of text just above the 'Save and Continue' button: 'Seen something interesting? Add a note'.

Insights from your comments

Since the project began, we've received almost 700 comments [update - it's actually over 1900, across all projects]. Some of them simply tell us that an image is blank or upside-down, but many others share interesting findings. We love hearing from you, and we've been highlighting individual comments on Twitter (@LibCrowds) and on our forum.

Comments have pointed out spectacles including 'a Terrific Eruption of Mount Vesuvius, accompanied by TORRENTS OF BURNING LAVA' and a 'Serpent vomiting Fire'. New amenities mentioned include lighting ('600 wax lights and a new set of gold chandeliers' or new gas lighting) and the addition of backs to seats. Famous actors spotted include Sarah Siddons, Jenny Lind and Ira Aldridge, while Mr Kean has caused all kinds of trouble.

Lots of comments are about performances that aren't plays, from hornpipes to tableaux to ballets, songs, speeches, fireworks, scientific demonstrations, performing animals, panoramas, conjuring and juggling tricks, lists of scenery, gun tricks, pantomimes, acrobatics, excerpts from plays, and even the 'reenactment of the Coronation'! We're thinking hard about the best way to deal with them (and with playbills that don't include a year), and will post to the forum and Twitter to ask for your ideas soon.

General updates

Since we first shared the link, there have been over 4,700 visitors from 91 countries. About 80% are primarily English-speakers, with Russian, German and French the next most popular languages.

We've had over 42,000 contributions from over 630 participants (with 1499 participants registered on the platform overall). Together, they've helped complete 34 projects by undertaking countless marking and transcription tasks to make genres, dates and play titles searchable.

Each project is based on a specific volume of playbills from a regional theatre or theatres. The fastest projects were 'Theatre Royal, Bristol 1819-1823 (Vol. 2)', completed in 8 days, 31 minutes, with 'Miscellaneous Plymouth theatres 1796-1882 (Vol. 1)' a close second at 8 days, 5 hours, 30 mins. We currently have playbills from theatres in Dublin, Hull, Nottingham, Oswestry and Plymouth - which will be completed first?

Recent blog posts include a wonderful story from PhD student and In the Spotlight participant Edward Mills tracing an ancient custom through the Library's digitised collections in The Flitch of Bacon: An Unexpected Journey Through the Collections of the British Library, and Christian Algar on the 'rich pageant' of historical playbills.

You might have noticed some small changes to the navigation and data pages as we updated the software this week. Most of the changes were behind the scenes, providing additional admin and analysis functions to ensure that data sent off to the catalogue is as accurate as possible.

Visitors have come from all over the world, but we'd love to reach more

 

Thank you!

We're grateful to everyone who's made a large or small contribution, but particular thanks to Barbara G, David Y, Dina S, Ervins S, Jo B, John L, Katharine S, Kathryn P-S, Lisa G, Maria Antonia V-S, Martin B, mistrec, Olga K, Raphael H, Rosie C, Sharon E, sylvmorris1, Tabitha M, thtrisdead, Tif D, Vijay V and various anonymous posters for your comments. Your comments are also helping us work out how to tweak some of the interfaces so people can let us know about a problem with a task by clicking a button, so expect more improvements in the future!

Step into the Spotlight

It's easy to try out In the Spotlight - you don't need to register, so you can start marking out the titles of plays or transcribing the titles, dates or genres of plays straight away. Give it a go and let us know what you find!

There are wonders galore waiting for the spotlight

14 March 2018

Working with BL Labs in search of Sir Jagadis Chandra Bose

The 19th Century British Library Newspapers Database offers a rich mine of material to be sourced for a comprehensive view of British life in the nineteenth and early twentieth centuries. The online archive comprises 101 full-text titles of local, regional, and national newspapers across the UK and Ireland, and thanks to optical character recognition, they are all fully searchable. This allows for extensive data mining across several million newspaper pages. It’s like going through the proverbial haystack looking for the equally proverbial needle, but with a magnet in hand.

For my current research project on the role of the radio during the British Raj, I wanted to find out more about Sir Jagadis Chandra Bose (1858–1937), whose contributions to the invention of wireless telegraphy were hardly acknowledged during his lifetime and all but forgotten during the twentieth century.

Jagadish Chandra Bose at the Royal Institution, London
(Image from Wikimedia Commons)

The person who is generally credited with having invented the radio is Guglielmo Marconi (1874–1937). In 1909, he and Karl Ferdinand Braun (1850–1918) were awarded the Nobel Prize in Physics “in recognition of their contributions to the development of wireless telegraphy”. What is generally not known is that almost ten years before that, Bose invented a coherer that would prove to be crucial for Marconi’s successful attempt at wireless telegraphy across the Atlantic in 1901. Bose never patented his invention, and Marconi reaped all the glory.

In his book Jagadis Chandra Bose and the Indian Response to Western Science, Subrata Dasgupta gives us four reasons as to why Bose’s contributions to radiotelegraphy have been largely forgotten in the West throughout the twentieth century. The first reason, according to Dasgupta, is that Bose changed his research interests around 1900. Instead of continuing and focusing his work on wireless telegraphy, Bose became interested in the physiology of plants and the similarities between inorganic and living matter in their responses to external stimuli. Bose’s name thus lost currency in his former field of study.

A second reason that contributed to the erasure of Bose’s name is that he did not leave a legacy in the form of students. He did not, as Dasgupta puts it, “found a school of radio research” that could promote his name despite his personal absence from the field. Also, and thirdly, Bose sought no monetary gain from his inventions and patented only one of them. Had he patented more, chances are that his name would have echoed loudly through the century, just as Marconi’s has done.

“Finally”, Dasgupta writes, “one cannot ignore the ‘Indian factor’”. Dasgupta wonders how seriously the scientific western elite really took Bose, who was the “outsider”, the “marginal man”, the “lone Indian in the hurly-burly of western scientific technology”. And he wonders how this affected “the seriousness with which others who came later would judge his significance in the annals of wireless telegraphy”.

And this is where the BL’s online archive of nineteenth-century newspapers comes in. Looking at newspaper coverage of Bose in the British press at the time suggests that Bose’s contributions to wireless telegraphy were all but forgotten by the end of his lifetime. When Bose died in 1937, Reuters Calcutta put out a press release that was reprinted in several British newspapers. As an example, the following notice was published in the Derby Evening Telegraph of November 23rd, 1937, on Bose’s death:

Notice in the Derby Evening Telegraph of November 23rd, 1937

This notice is as short as it is telling in what it says and does not say about Bose and his achievements: he is remembered as the man “who discovered a heart beat in trees”. He is not remembered as the man who almost invented the radio. He is remembered for the Western honours that were bestowed upon him (the Knighthood and his Fellowship of the Royal Society), and he is remembered as the founder of the Bose Research Institute. He is not remembered for his career as a researcher and inventor; a career that spanned five decades and saw him travel extensively in India, Europe and the United States.

The Derby Evening Telegraph is not alone in this act of partial remembrance. Similar articles appeared in Dundee’s Evening Telegraph and Post and The Gloucestershire Echo on the same day. The Aberdeen Press and Journal published a slightly extended version of the Reuters press release on November 24th that includes a brief account of a lecture by Bose in Whitehall in 1929, during which Bose demonstrated “that plants shudder when struck, writhe in the agonies of death, get drunk, and are revived by medicine”. However, there is again no mention of Bose’s work as a physicist or of his contributions to wireless telegraphy. The same is true for obituaries published in The Nottingham Evening Post on November 23rd, The Western Daily Press and Bristol Mirror on November 24th, another article published in the Aberdeen Press and Journal on November 26th, and two articles published in The Manchester Guardian on November 24th.

The exception to the rule is the obituary published in The Times on November 24th. Granted, with a total of 1116 words it is significantly longer than the Reuters press release, but this is also partly the point, as it allows for a much more comprehensive account of Bose’s life and achievements. But even if we only take the first two sentences of The Times obituary, which roughly add up to the word count of the Reuters press release, we are already presented with a different account altogether:

“Our Calcutta Correspondent telegraphs that Sir Jagadis Chandra Bose, F.R.S., died at Giridih, Bengal, yesterday, having nearly reached the age of 79. The reputation he won by persistent investigation and experiment as a physicist was extended to the general public in the Western world, which he frequently visited, by his remarkable gifts as a lecturer, and by the popular appeal of many of his demonstrations.”

We know that he was a physicist; the focus is on his skills as a researcher and on his talents as a lecturer rather than on his Western titles and honours, which are mentioned in passing as titles to his name; and we immediately get a sense of the significance of his work within the scientific community and for the general public. And later on in the article, it is finally acknowledged that Bose “designed an instrument identical in principle with the 'coherer' subsequently used in all systems of wireless communication. Another early invention was an instrument for verifying the laws of refraction, reflection, and polarization of electric waves. These instruments were demonstrated on the occasion of his first appearance before the British Association at the 1896 meeting at Liverpool”.

Posted by BL Labs on behalf of Dr Christin Hoene, a BL Labs Researcher in Residence at the British Library. Dr Hoene is a Leverhulme Early Career Fellow in English Literature at the University of Kent. 

If you are interested in working with the British Library's digital collections, why not come along to one of our events that we are holding at universities around the UK this year? We will be holding a roadshow at the University of Kent on 25 April 2018. You can see a programme for the day and book your place through this Eventbrite page. 

12 March 2018

The Ground Truth: Transcribing historical Arabic Scientific Manuscripts for OCR research

Announcing a collaborative transcription project to support state-of-the-art research in automatic handwritten text recognition for historical Arabic texts

Cultural heritage institutions around the world are digitising hundreds of thousands of pages of historical Arabic manuscript and archive collections. Making these fully text searchable has the potential to truly transform scholarship, opening up this rich content for discovery and enabling large-scale analysis.

Computer scientists and scholars are working on this challenge, building systems which can automatically transcribe images of handwritten text, but for historical Arabic script a solution remains just out of reach.

Our aim is to contribute to continued research in this area by building an open image and ground truth dataset of historical handwritten Arabic texts, ensuring historical Arabic collections benefit from state-of-the-art developments in handwritten text recognition.

What is Ground Truth?

Optical Character Recognition (OCR) systems essentially turn a picture of text into text itself—in other words, producing something like a .TXT or .DOC file from a scanned .JPG of a printed or handwritten page. Most OCR systems require ground truth, a set of files which represent the truthful record of elements of an image, for training and evaluation purposes.

The ground truth of an image’s text content, for instance, is the complete and accurate record of every character and word in the image.

By knowing what the system is supposed to recognise on a page of handwritten text, researchers can both train their system to recognise the characters and test how well the system does once trained.
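To make the testing half of that concrete, one widely used measure is the character error rate: the number of single-character insertions, deletions and substitutions needed to turn the system's output into the ground truth, divided by the length of the ground truth. The Python sketch below is a minimal illustration under that assumption; it is not taken from any of the projects mentioned here, and the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_output):
    """Edit distance divided by the length of the ground truth transcription."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)


# One missing letter in a four-letter word gives an error rate of 0.25.
print(character_error_rate("كتاب", "كتب"))  # 0.25
```

Python compares strings one Unicode code point at a time, so in this sketch an Arabic diacritic written as a separate combining mark counts as its own character. Whether that is the right unit for a given evaluation is a choice each project would need to make.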

View more transcriptions in progress from this manuscript (Or 3366) on the platform 

A collaborative approach

This project is a proof of concept exploring whether the creation of such a dataset can be done collaboratively at scale, using the collective expertise of volunteers around the world. At the heart of this approach is the Library’s enduring commitment to creating new and interesting ways to connect diverse communities of interest and expertise, be they scholars, members of the general public, computer scientists, students or curators, around our collections. For this we are utilising a free and open-source platform, From the Page, which allows anyone with an interest in historical Arabic manuscripts to experience them up close, many for the first time, to discuss, learn and share expertise in their transcription.

Helping transform research

The Digital Scholarship Department was able to fund the development of this open source platform to support Right-to-Left transcription, a feature which will benefit any scholar wishing to use the software for their own transcription needs. Any transcriptions produced in this pilot will be transformed into ground truth resources, hosted by the British Library and made freely available, without rights restriction, for anyone wishing to advance the state-of-the-art in optical character recognition technology. Specifically, resources created will be contributed to ground-breaking projects already underway such as Transkribus, the Open Islamic Texts Initiative, the IMPACT Centre of Competence Image and Ground Truth Resources and more!

Visit the new Arabic Scientific Manuscripts of the British Library transcription platform and download our Getting Started Guide for more detail (an Arabic version will be available shortly). 

  

Posted by Nora McGregor, Digital Curator, British Library