THE BRITISH LIBRARY

Digital scholarship blog

18 posts categorized "South Asia"

26 February 2019

Competition to automate text recognition for printed Bangla books

You may have seen the exciting news last week that the British Library has launched a competition on recognition of historical Arabic scientific manuscripts that will run as part of ICDAR2019. We thought it only fair to cover printed material too! So we’re running another competition, also at ICDAR, for automated text recognition of rare and unique printed books written in Bangla that have been digitised through the Library's Two Centuries of Indian Print project.

Some of you may remember the Bangla printed books competition at ICDAR2017, which generated significant interest among academic institutions and technology providers both in India and across the world. The 2017 competition set the challenge of finding an optimal solution for automating recognition of Bangla printed text and resulted in Google’s method performing best for both text detection and layout analysis.

Fast forward to 2019 and, thanks to Jadavpur University in Kolkata, we have added more ground truth transcriptions for competition entrants to train their OCR systems with. We hope that the competition encourages submissions again from cutting-edge OCR methods leading to a solution that can truly open up these historic books, dating between 1713 and 1914, for text mining, enabling scholars of South Asian studies to explore hundreds of thousands of pages on a scale that has not been possible until now.

Image showing a transcribed page from one of the Bengali books featured in the ICDAR2019 competition

As with the Arabic competition, we are collaborating with PRImA (Pattern Recognition & Image Analysis Research Lab) who will provide expert and objective evaluation of OCR results produced through the competition. The final results will be revealed at the ICDAR2019 conference in Sydney in September.

So if you missed out last time but are interested in testing your OCR systems on our books, the competition is now open! For instructions on how to apply and more about the competition, please visit https://www.primaresearch.org/REID2019/

 

This post is by Tom Derrick, Digital Curator for Two Centuries of Indian Print, British Library. He is on Twitter as @TommyID83 and Two Centuries of Indian Print tweet from @BL_IndianPrint

 

05 February 2019

BL Labs 2018 Research Award Honourable Mention: 'Doctoral theses as alternative forms of knowledge: Surfacing "Southern" perspectives on student engagement with internationalisation'

This guest blog is by Professor Catherine Montgomery, recipient of one of two Honourable Mentions in the 2018 BL Labs Awards Research category for her work with the British Library's EThOS collection.

 ‘Contemporary universities are powerful institutions, interlinked on a global scale; but they embed a narrow knowledge system that reflects and reproduces social inequalities on a global scale’ (Connell, 2017).

Having worked with doctoral students for many years and learned much in the process, I found my curiosity sparked by the EThOS collection at the British Library. EThOS houses a large proportion of UK doctoral theses completed in British universities and comprises a digital repository of around 500,000 theses. Doctoral students use this repository regularly but mostly as a means of exploring examples of doctorates in their chosen area of research. In my experience, doctoral students are often looking at formats or methodologies when they consult EThOS rather than exploring the knowledge provided in the theses.

So when I began to think about the EThOS collection as a whole, I came to the conclusion that it is a vastly under-used but incredibly powerful resource. Doctoral knowledge is not often thought of as a coherent body of knowledge, although individual doctoral theses are sometimes quoted and consulted by academics and other doctoral students. It is also important to remember that of 84,630 Postgraduate Research students studying full time in the UK in 2016/17, half of them, 42,325, were non-UK students, with 29,875 students being from beyond the EU. So in this sense, the knowledge represented in the EThOS collection is an important international body of knowledge.

So I began to explore the EThOS collection with some help from a group of PhD students (Gihan Ismail, Luyao Li and Yanru Xu, all doctoral candidates at the Department of Education at the University of Bath) and the EThOS library team. I wanted to interrogate the collection for a particular field of knowledge and because my research field is internationalisation of higher education, I carried out a search in EThOS for theses written in the decade 2008 to 2018 focusing on student engagement with internationalisation. This generated an initial data set of 380 doctoral theses which we downloaded into the software package NVivo. We then worked on refining the data set, excluding theses irrelevant to the topic (I was focusing on higher education so, for example, internationalisation at school-level topics were excluded) coming up with a final data set of 94 theses around the chosen topic. The EThOS team at the British Library helped at this point and carried out a separate search, coming up with a set of 78 theses using a specific adjacent word search and they downloaded these into a spreadsheet for us. The two data sets were consistent with each other which was really useful triangulation in our exploration of the use of the EThOS repository.

This description makes it sound very straightforward but there were all sorts of challenges, many of them technology related, including the fact that we were working with very large amounts of text as each of the 380 theses was around 100,000 words long or more and this slowed down the NVivo software and sometimes made it crash. There were also challenges in the search process as some earlier theses in the collection were in different formats; some were scanned and therefore not searchable.

The outcomes of the work with the EThOS collection were fascinating. Various patterns emerged from the analysis of the doctoral theses and the most prominent of these were insights into the geographies of student engagement with internationalisation; issues of methodologies and theory; and different constructions of internationalisation in higher education.

The theses were written by students from 38 different countries of the globe and examined internationalisation of higher education in African countries, the Americas and Australia, across the Asian continent and Europe. Despite this diversity amongst the students, most of the theses investigated internationalisation in the UK or international students in the UK. The international students also often carried out research on their own countries’ higher education systems and there was some limited comparative research but all of these compared their own higher education systems with one or (rarely) two others. There was only a minority of students who researched the higher education systems of international contexts different from their own national context.

A similar picture emerged when I considered the sorts of theories and ideas students were using to frame their research. There was a predominance of Western theory used by the international students to cast light on their non-western educational contexts, with many theses relying on concepts commonly associated with Western theory such as social capital, global citizenship or communities of practice. The ways in which the doctoral theses constructed ideas of internationalisation also appeared in many cases to be following a well-worn track and explored familiar concepts of internationalisation including challenges of pedagogy, intercultural interaction and the student experience. Having said this, there were also some innovative, creative and critical insights into students engaging with internationalisation, showing that alternative perspectives and different ways of thinking were generated by the theses of the EThOS collection.

Raewyn Connell, an educationalist I used in the analysis of this project, tells us that in an unequal society we need ‘the view-from-below’ to challenge dominant ways of thought. I would argue that we should think about doctoral knowledge as ‘the view-from-below’, and doctoral theses can offer us alternative perspectives and challenges to the previous narratives of issues such as internationalisation. However, it may be that the academy will need to make space for these alternative or ‘Southern’ perspectives to come in and this will rely on the capacity of the participants, both supervisors and students, to be open to negotiation in theories and ideas, something which another great scholar, Boaventura De Sousa Santos, describes as intercultural translation of knowledge.

I am very grateful indeed to the British Library and the EThOS team for developing this incredible source of digital scholarship and for their support in this project. I was delighted to be given an honourable mention in the British Library Research Lab awards and I am intending to take this work forward and explore the EThOS repository further. I was fascinated and excited to find that a growing number of countries are also developing and improving access to their doctoral research repositories (Australia, Canada, China, South Africa and USA to name but a few). This represents a huge comparative and open access data set which could be used to explore alternative perspectives on ‘taken-for-granted’ knowledge. Where better to start than with doctoral theses?

More information on the project can be found in this published article:

Montgomery, C. (2018). Surfacing ‘Southern’ perspectives on student engagement with internationalisation: doctoral theses as alternative forms of knowledge. Journal of Studies in International Education, 23(1), 123-138. https://doi.org/10.1177/1028315318803743

Watch Professor Montgomery receiving her award and talking about her project on our YouTube channel (clip runs from 6.57 to 10.39):

Find out more about Digital Scholarship and BL Labs. If you have a project which uses British Library digital content in innovative and interesting ways, consider applying for an award this year! The 2019 BL Labs Symposium will take place on Monday 11 November at the British Library.

29 October 2018

Using Transkribus for automated text recognition of historical Bengali Books

In this post Tom Derrick, Digital Curator, Two Centuries of Indian Print, explains the Library's recent use of Transkribus for automated text recognition of Bengali printed books.

Are you working with digitised printed collections that you want to 'unlock' for keyword search and text mining? Maybe you have already heard about Transkribus but thought it could only be used for automated recognition of handwritten texts. If so you might be surprised to hear it also does a pretty good job with printed texts too. You might be even more surprised to hear it does an impressive job with printed texts in Indian scripts! At least that is what we have found from recent testing with a batch of 19th century printed books written in Bengali script that have been digitised through the British Library’s Two Centuries of Indian Print project.

Transkribus is a READ project and available as a free tool for users who want to automate recognition of historical documents. The British Library has already had some success using Transkribus on manuscripts from our India Office collection, and it was that which inspired me to see how it would perform on the Bengali texts, which provide an altogether different type of challenge.

For a start, most text recognition solutions either do not support Indian scripts, or do not reach close to the same level of recognition as they do with documents written in English or other Latin scripts. In part this is down to supply and demand. Mainstream providers of tools have prioritised Western customers, yet there is also the relative lack of digitised Indian texts that can be used to train text recognition engines.

These text recognition engines have also been well trained on modern dictionaries, and a collection of historical texts like the Bengali books will often contain words which are no longer in use. Their aged physicality also brings with it the delights of faded print, blotchy paper and other paper-based gremlins that keep conservationists in work yet disrupt automated text recognition. Throw in an extensive alphabet that contains more diverse and complicated character forms than English and you can start to piece together how difficult it can be to train recognition engines to achieve comparable results with Bengali texts.

So it was with more hope than expectation that I approached Transkribus. We began by selecting 50 pages from the Bengali books representing, as far as possible, the variety of typographical and layout styles within the wider collection of c. 500,000 pages. Not an easy task! We uploaded these to Transkribus, manually segmenting paragraphs into text regions and automating line recognition. We then manually transcribed the texts to create a ground truth which, together with the scanned page images, was used to train the recurrent neural network within Transkribus to create a model for the 5,700 transcribed words.

View of a segmented page from one of the British Library's Bengali books along with its transcription, within the Transkribus viewer.

The model was tested on a few pages from the wider collection, and the results are shown in the graph below. The model achieved an average character error rate (CER) of 21.9%, which is comparable to the best results we have seen from other text recognition services. Word accuracy of 61% was based on the number of words that were misspelled in the automated transcription compared to the ground truth. Eventually we would like to use automated transcriptions to support keyword searching of the Bengali books online, and the higher the word accuracy, the greater the chances of users pulling back all relevant hits from their keyword search. We noticed the results often missed the upper zone of certain Bengali characters, i.e. the part of the character or glyph which resides above the matra line that connects characters in Bengali words. Further training focused on recognition of these characters may improve the results.

Graph showing the learning curve of the Bengali model using the Transkribus HTR tool.
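For readers who want to reproduce this kind of evaluation on their own transcriptions, here is a minimal sketch of how a character error rate and a word accuracy figure can be computed. The post does not say exactly how Transkribus calculates either number, so the definitions below are assumptions: CER is taken as the Levenshtein edit distance between ground truth and automated transcription divided by the length of the ground truth, and word accuracy as one minus the equivalent word-level rate. The function names are illustrative and are not part of Transkribus.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from the reference
                            curr[j - 1] + 1,      # insertion into the reference
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Assumed definition: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def word_accuracy(reference, hypothesis):
    """Assumed definition: 1 minus the word-level error rate, floored at 0."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcription must not be empty")
    errors = edit_distance(ref_words, hypothesis.split())
    return max(0.0, 1 - errors / len(ref_words))


if __name__ == "__main__":
    ground_truth = "the old book"
    automated = "the odd bock"
    print(f"CER: {character_error_rate(ground_truth, automated):.1%}")
    print(f"Word accuracy: {word_accuracy(ground_truth, automated):.1%}")
```

The toy example shows why the two figures diverge: two wrong characters out of twelve give a modest CER, but they fall in two of the three words, so word accuracy drops sharply. Note that Python compares strings by Unicode code point, so for Bengali text each vowel sign or other combining mark counts as its own character under this sketch.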

Our training set of 50 pages is very small compared to other projects using Transkribus and so we think the accuracy could be vastly improved by creating more transcriptions and re-training the model. However, we're happy with these initial results and would encourage others in a similar position to give Transkribus a try.

 

 

24 July 2018

Workshop for South Asian Archivists and Librarians

Members of the Two Centuries of Indian Print team have just returned from a fascinating trip to Delhi where we took part in a packed programme of activities organised as part of the Association for Asian Studies conference.

We spent most of the week with a group of archivists brought together from a variety of academic and cultural institutions across India and as far away as Cambodia and Australia. What united us was a shared passion for preserving South Asian heritage. As part of the programme we led a workshop on Digitisation Standards as practised by the British Library, which also considered the key challenges organisations face when digitising cultural heritage material, including everything from selecting material and scanning, through to post-processing, online display and user engagement. The workshop also featured a paper on the IFLA guidelines for digitisation and what we hope was a fun activity in which archivists were presented with different case studies of archival collections and asked to consider a digitisation strategy. It certainly sparked a lot of conversation! See photo below.

 

Workshop participants taking part in a group activity

 

Undeterred by the inhospitable weather occupying Delhi, we ventured out and were fortunate enough to receive some very thorough and illuminating tours of the Archives and Research Centre for Ethnomusicology, Centre for Art and Archaeology, The National Archives, Indira Gandhi National Centre for the Arts, and Sangeet Natak Akademi where we learned about their respective collections, conservation facilities and digitisation projects.

 

Taking part in a tour of the audiovisual lab at the Archives and Research Centre for Ethnomusicology 

 

This marked the end of a trip which has connected us with inspiring professionals with whom we hope to collaborate on more events in the near future.

Our thanks go out to the organisers of what turned out to be a very engaging week of activities, to the American Institute of Indian Studies, to Ashoka University, and to the hosts of our workshop, the India International Centre.

 

01 May 2018

New Digital Curator in the Digital Scholarship Team

Hello all! My name is Adi Keinan-Schoonbaert, and I’m the new Digital Curator for Asian and African collections at the British Library. One of the core remits of the Digital Scholarship team is to enable and encourage the reuse of the Library’s digital collections. When it comes to Asian and African collections, there are always interesting projects and initiatives going on. One is the Two Centuries of Indian Print project, which just started a second phase in March 2018 – a project with a strong Digital Humanities strand led by Digital Curator Tom Derrick. Another example is a collaborative transcription project, supporting the transcription of handwritten historical Arabic scientific works for Handwritten Text Recognition (HTR) research with the help of volunteers.

To give a bit of a background about myself and how I got to the Library: I’m an archaeologist and heritage professional by education and practice, with a PhD in Heritage Studies from University College London (2013). As a field archaeologist I used to record large quantities of excavation-related data – all manually, on paper. This was probably the first time I saw the potential of applying digital tools and technologies to record, manage and share archaeological data.

My first meaningful engagement with archaeological data and digital technologies started in 2005, when I joined the Israeli-Palestinian Archaeology Working Group (IPAWG) to create a database of all archaeological sites surveyed or excavated by Israel in the West Bank since its occupation in 1967, and to link it with a Geographic Information System (GIS), enabling the spatial visualisation and querying of this data for the first time. The research potential of this GIS-linked database proved so great that I decided to explore it further in a PhD dissertation. My dissertation focused on archaeological databases covering the occupied West Bank, and I was especially interested in the nature of archaeological records and the way they reflect particular research interests and heritage management priorities, as well as variability in data quality, coverage, accuracy and reliability.

Following my PhD I stayed at UCL Institute of Archaeology as a post-doctoral research associate, and participated in a project called MicroPasts, a UCL-British Museum collaboration. This project used web-based, crowdsourcing methods to allow traditional academics and other communities in archaeology to co-produce innovative open datasets. The MicroPasts crowdsourcing platform provided a great variety of projects through which people could contribute – from transcribing British Museum card catalogues, through tagging videos on the Roman Empire, to photomasking images in preparation for 3D modelling of museum objects.

With the main phase of the MicroPasts project coming to an end, I joined the British Library as Digital Curator (Polonsky Fellow) for the Hebrew Manuscripts Digitisation Project. This role allowed me to create and implement a digital strategy for engaging, accessing and promoting a specific digitised collection, working closely with curators and the Digital Scholarship team. My work included making the collection digitally accessible (on data.bl.uk, working with British Library Labs) and encouraging open licensing, creating a website, promoting the collection in different ways, researching available digital methods to explore and exploit collections in novel ways, and implementing tools such as an online catalogue records viewer (TEI XML), OpenRefine, and 3D modelling.

A six-month backpacking trip to Asia unexpectedly prepared me for my new role at the Library. I was delighted to join – or re-join – the Library’s Digital Research team, this time as Digital Curator for Asian and African Collections. I find these collections especially intriguing due to their diversity, richness and uniqueness. These include mostly manuscripts, printed books, periodicals, newspapers, photographs and e-resources from Africa, the Middle East (including Qatar Digital Library), Central Asia, East Asia (including the International Dunhuang Project), South Asia, SE Asia – as well as the Visual Arts materials.

I’m very excited to join the Library’s Digital Research team, to work alongside Neil Fitzgerald, Nora McGregor, Mia Ridge and Stella Wisdom, and to learn from their rich experience. Feel free to get in touch with us via digitalresearch@bl.uk or Twitter - @BL_AdiKS for me, or @BL_DigiSchol for the Digital Scholarship team.

14 March 2018

Working with BL Labs in search of Sir Jagadis Chandra Bose

The 19th Century British Library Newspapers Database offers a rich mine of material to be sourced for a comprehensive view of British life in the nineteenth and early twentieth century. The online archive comprises 101 full-text titles of local, regional, and national newspapers across the UK and Ireland, and thanks to optical character recognition, they are all fully searchable. This allows for extensive data mining across several million newspaper pages. It’s like going through the proverbial haystack looking for the equally proverbial needle, but with a magnet in hand.

For my current research project on the role of the radio during the British Raj, I wanted to find out more about Sir Jagadis Chandra Bose (1858–1937), whose contributions to the invention of wireless telegraphy were hardly acknowledged during his lifetime and all but forgotten during the twentieth century.

Jagadish Chandra Bose in Royal Institution, London
(Image from Wikimedia Commons)

The person who is generally credited with having invented the radio is Guglielmo Marconi (1874–1937). In 1909, he and Karl Ferdinand Braun (1850–1918) were awarded the Nobel Prize in Physics “in recognition of their contributions to the development of wireless telegraphy”. What is generally not known is that almost ten years before that, Bose invented a coherer that would prove to be crucial for Marconi’s successful attempt at wireless telegraphy across the Atlantic in 1901. Bose never patented his invention, and Marconi reaped all the glory.

In his book Jagadis Chandra Bose and the Indian Response to Western Science, Subrata Dasgupta gives us four reasons as to why Bose’s contributions to radiotelegraphy have been largely forgotten in the West throughout the twentieth century. The first reason, according to Dasgupta, is that Bose changed his research interests around 1900. Instead of continuing and focusing his work on wireless telegraphy, Bose became interested in the physiology of plants and the similarities between inorganic and living matter in their responses to external stimuli. Bose’s name thus lost currency in his former field of study.

A second reason that contributed to the erasure of Bose’s name is that he did not leave a legacy in the form of students. He did not, as Dasgupta puts it, “found a school of radio research” that could promote his name despite his personal absence from the field. Also, and thirdly, Bose sought no monetary gain from his inventions and patented only one of them. Had he done otherwise, chances are that his name would have echoed loudly through the century, just as Marconi’s has done.

“Finally”, Dasgupta writes, “one cannot ignore the ‘Indian factor’”. Dasgupta wonders how seriously the scientific western elite really took Bose, who was the “outsider”, the “marginal man”, the “lone Indian in the hurly-burly of western scientific technology”. And he wonders how this affected “the seriousness with which others who came later would judge his significance in the annals of wireless telegraphy”.

And this is where the BL’s online archive of nineteenth-century newspapers comes in. Looking at newspaper coverage about Bose in the British press at the time suggests that Bose’s contributions to wireless telegraphy were soon to be all but forgotten during his lifetime. When Bose died in 1937, Reuters Calcutta put out a press release that was reprinted in several British newspapers. As an example, the following notice was published in the Derby Evening Telegraph of November 23rd, 1937, on Bose’s death:

Notice in the Derby Evening Telegraph of November 23rd, 1937

This notice is as short as it is telling in what it says and does not say about Bose and his achievements: he is remembered as the man “who discovered a heart beat in trees”. He is not remembered as the man who almost invented the radio. He is remembered for the Western honours that were bestowed upon him (the Knighthood and his Fellowship of the Royal Society), and he is remembered as the founder of the Bose Research Institute. He is not remembered for his career as a researcher and inventor; a career that spanned five decades and saw him travel extensively in India, Europe and the United States.

The Derby Evening Telegraph is not alone in this act of partial remembrance. Similar articles appeared in Dundee’s Evening Telegraph and Post and The Gloucestershire Echo on the same day. The Aberdeen Press and Journal published a slightly extended version of the Reuters press release on November 24th that includes a brief account of a lecture by Bose in Whitehall in 1929, during which Bose demonstrated “that plants shudder when struck, writhe in the agonies of death, get drunk, and are revived by medicine”. However, there is again no mention of Bose’s work as a physicist or of his contributions to wireless telegraphy. The same is true for obituaries published in The Nottingham Evening Post on November 23rd, The Western Daily Press and Bristol Mirror on November 24th, another article published in the Aberdeen Press and Journal on November 26th, and two articles published in The Manchester Guardian on November 24th.

The exception to the rule is the obituary published in The Times on November 24th. Granted, with a total of 1116 words it is significantly longer than the Reuters press release, but this is also partly the point, as it allows for a much more comprehensive account of Bose’s life and achievements. But even if we only take the first two sentences of The Times obituary, which roughly add up to the word count of the Reuters press release, we are already presented with a different account altogether:

“Our Calcutta Correspondent telegraphs that Sir Jagadis Chandra Bose, F.R.S., died at Giridih, Bengal, yesterday, having nearly reached the age of 79. The reputation he won by persistent investigation and experiment as a physicist was extended to the general public in the Western world, which he frequently visited, by his remarkable gifts as a lecturer, and by the popular appeal of many of his demonstrations.”

We know that he was a physicist; the focus is on his skills as a researcher and on his talents as a lecturer rather than on his Western titles and honours, which are mentioned in passing as titles to his name; and we immediately get a sense of the significance of his work within the scientific community and for the general public. And later on in the article, it is finally acknowledged that Bose “designed an instrument identical in principle with the 'coherer' subsequently used in all systems of wireless communication. Another early invention was an instrument for verifying the laws of refraction, reflection, and polarization of electric waves. These instruments were demonstrated on the occasion of his first appearance before the British Association at the 1896 meeting at Liverpool”.

Posted by BL Labs on behalf of Dr Christin Hoene, a BL Labs Researcher in Residence at the British Library. Dr Hoene is a Leverhulme Early Career Fellow in English Literature at the University of Kent. 

If you are interested in working with the British Library's digital collections, why not come along to one of our events that we are holding at universities around the UK this year? We will be holding a roadshow at the University of Kent on 25 April 2018. You can see a programme for the day and book your place through this Eventbrite page. 

21 February 2018

BL Labs 2017 Symposium: Opening up the British Library’s Early Indian Printed Books Collection (Staff Award Winner)

Making the British Library’s valuable collection of early Bengali books more accessible to researchers and the general public around the world rests heavily on the collaborative work undertaken across different teams of the library and partners in the UK and abroad. The commitment and passion of the project team has relied on the contribution and expertise of collaborators, as well as the forward thinking vision of the library, partners and fundraisers.

Receiving the BL Labs Staff Award 2017 is a great opportunity to thank everyone involved. 

Members of the Two Centuries of Indian Print team receiving the British Library Labs award at the Symposium on 30th October 2017
 
Tom Derrick (Digital Curator) was in India at the same time the team received their Award.

The Two Centuries of Indian Print project is a partnership between the British Library, the School of Cultural Texts and Records (SCTR) at Jadavpur University, Srishti Institute of Art, Design and Technology, and the Library at SOAS University of London, among others. It has also involved collaborations with the National Library of India, and other institutions in India.

The AHRC Newton-Bhabha Fund and the Department for Business, Energy and Industrial Strategy have generously funded the work undertaken so far by the project, focusing on early printed Bengali books. Many are unavailable in other library collections or are extremely difficult to locate and access. The project has undertaken a variety of initiatives from the digitisation of books and enhancement of the catalogue records in English and Bengali, to stimulating the use of digital humanities tools and techniques, running a programme of digital skills sharing and capacity building workshops, and hosting the South Asia Series seminars. All of these initiatives greatly contribute to the discovery and study of the collection. The project is also conducting groundbreaking work in finding a solution to Optical Character Recognition (OCR) in Bangla script. OCR is not available for South Asian languages currently, and harnessing viable OCR technology would enable full text search of the books, paving the way for researchers to use natural language processing techniques to perform large scale analysis across a large corpus of text covering a diverse range of topics relating to Indian society, religion, and politics, to name but a few. Doing so will increase the possibilities for new discoveries in this academic field.

However, despite its status as one of the most widely spoken languages in the world, Bangla script has been greatly underserved by providers of OCR solutions. This is due in part to the orthographical and typographical variances that have taken place in recent centuries that make building a dictionary and character ‘classifier’ more challenging. Due to the wide date range of the books we are digitising, these issues affect the quality of OCR. The physical condition of our historical books, including faded text, presents additional difficulties for creating machine readable versions of the books. 

To overcome these obstacles, the project team has been advancing the development of OCR for Bangla through the organisation of an international competition which reviewed the state-of-the-art in commercial and open source text recognition tools. The results of the competition will be announced at the ICDAR 2017 conference in Kyoto later this month. Watch this space! The competition dataset has been made openly available for download and reuse for any researchers or institutions who would like to experiment with OCR for Bengali.

A page from the Animal Biographies, VT 1712 showing its transcription produced for the ICDAR 2017 competition

The project has organised two Skills Exchange Programmes, hosting mid-career library professionals from the National Library of India at the British Library for a week and providing a packed programme of tours and talks from all areas of the Library. The project has also conducted digital skills sharing and capacity building workshops for library professionals and archivists from cultural heritage institutions in India. The first workshop took place at Jadavpur University, Kolkata, in December 2016. Library and information professionals from cultural heritage institutions in Bengal took part in a one-day event to learn more about how information technology is transforming humanities research today, and in turn library services, as well as the methods for interrogating humanities-related datasets.

After the success of this first workshop, another event was held in July 2017, at which more than 30 library professionals discussed OCR developments for Bangla, tried out different tools and explored digital scholarship techniques and projects. Most recently, the project’s digital curator facilitated a workshop on Digitisation Standards at the International Conference of Asian Libraries in Delhi. The workshops continue in earnest in the new year, with another digital humanities skills workshop planned for January 2018 in partnership with the Srishti Institute of Art, Design, and Technology.

Attendees of the workshop held at Jadavpur University in December 2016 taking part in a group activity to discuss the application of digital humanities methods to library collections

The Project Team also held a two-day Academic Symposium on South Asian book history at Jadavpur University in the summer, with 17 speakers from India, wider South Asia, and the UK. Attendance was between 50 and 70 people a day and feedback was very good. We plan to have a publication arising from this Symposium, and to upload a video to our project webspace. The project also hosts a popular series of talks based around the Two Centuries of Indian Print project and the British Library’s South Asia collections. The seminars take place fortnightly at the British Library. So far we have hosted a range of speakers, from PhD students to senior academics from the UK and abroad, who share cutting-edge research, with discussion chaired by curators and specialists in the field. The seminars have been a great success, attracting large attendances and speakers from around the world. We also host a number of show and tells of our material to raise awareness of our collection and to engage in community outreach.

Everyone on the project is thrilled to have won this award and we will be working hard in 2018 to continue bringing the Two Centuries of Indian Print project to the attention and use of researchers and the general public.

Submit a project for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library.

Posted by BL Labs on behalf of The Two Centuries of Indian Print team.

27 July 2017

A workshop on Optical Character Recognition for Bangla

Add comment

I was fortunate enough to travel to Kolkata recently along with other members of the Two Centuries of Indian Print team, where we ran a workshop on ‘Developments with Optical Character Recognition for Bangla’. The event, which took place at Jadavpur University, proved an excellent forum for sharing knowledge in this area of growing interest, as reflected in the range of library professionals, academics and computer scientists who attended from ten institutions across Bengal and from the US.

Applying Optical Character Recognition (OCR) to printed texts is one of the key expectations of 21st century scholars and library users, who want to quickly find information online that accurately meets their research needs. Cultural institutions are gateways to millions of items containing knowledge that can transform modern research. The workshop built on our recently launched OCR Competition for Rare Indian Books and looked at the developments, challenges and opportunities of OCR in opening up vast quantities of knowledge to digital researchers.

Dr. Naira Khan from the University of Dhaka’s Computational Linguistics department kicked off the workshop by introducing the key processes behind OCR, including ‘pre-processing’ steps such as binarisation, which reduces a scanned page of text to its binary form to remove background noise, isolating only the text on the page. Skew detection, another pre-processing technique, corrects scans with angled text that can cause problems for OCR systems that require perfectly horizontal or vertical text. Dr. Khan moved on to explain how OCR systems segment pages into text and non-text regions, right down to pixel-level detection to recognise word boundaries. When it comes to recognising individual characters, Bangla script presents some unique challenges, as it contains a vast range of compound characters, vowel signs and ligatures, not to mention the distinctive top line connecting characters, known as the ‘Matra’. Breaking the characters into their geometric features such as lines, arcs and circles enables combinations of features to be formed, classified as characters and expressed in digital form as OCR output.


Dr. Khan introducing the concepts of OCR
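For readers who would like to see the two pre-processing steps from Dr. Khan's talk in concrete form, here is a minimal sketch in plain Python. It is not the method presented at the workshop or used by any particular OCR engine: it assumes Otsu's method for binarisation and a simple projection-profile search for skew, and the function names (`otsu_threshold`, `binarise`, `estimate_skew`) are our own illustrative choices. A page is represented as a list of rows of grey values from 0 (black) to 255 (white).

```python
import math


def otsu_threshold(pixels):
    """Pick the grey level that best separates dark ink from light paper.

    Assumption: Otsu's method, which maximises the variance between the
    'dark' and 'light' groups of pixels. Other binarisation methods exist.
    """
    histogram = [0] * 256
    for row in pixels:
        for value in row:
            histogram[value] += 1
    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))

    best_level, best_variance = 0, -1.0
    dark_count, dark_sum = 0, 0
    for level, count in enumerate(histogram):
        dark_count += count
        dark_sum += level * count
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue  # one group is empty, nothing to separate yet
        dark_mean = dark_sum / dark_count
        light_mean = (weighted_total - dark_sum) / light_count
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarise(pixels):
    """Reduce a grey page to 1 (ink) and 0 (background)."""
    threshold = otsu_threshold(pixels)
    return [[1 if value <= threshold else 0 for value in row] for row in pixels]


def estimate_skew(binary, candidate_angles):
    """Return the candidate angle (in degrees) that best aligns the text lines.

    Convention used here: a positive angle means lines drift downwards
    towards the right of the image. For each candidate we shear every ink
    pixel back along that slope and count ink per row; correctly aligned
    text gives a few very full rows, so the sum of squared counts peaks.
    """
    def sharpness(angle):
        slope = math.tan(math.radians(angle))
        profile = {}
        for y, row in enumerate(binary):
            for x, ink in enumerate(row):
                if ink:
                    key = round(y - x * slope)
                    profile[key] = profile.get(key, 0) + 1
        return sum(count * count for count in profile.values())

    return max(candidate_angles, key=sharpness)
```

Calling `estimate_skew(binarise(page), range(-5, 6))` on a scanned page would try whole-degree angles from -5 to 5. A real system would search more finely and then rotate the image by the detected angle before segmentation.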

After Dr. Khan’s inspiring talk, attendees learned of the British Library’s particular challenge in searching for an OCR solution for our 19th century Bengali books currently being digitised, and the potential use of an OCR’d dataset for Digital Humanities researchers wanting to perform text and data mining. The books span an enormous range of genres, from works by religious missionaries to those covering food, science and works of fiction. Obtaining OCR would enable automated searching and analysis of the full text across hundreds of thousands of pages, which could lead to exciting research discoveries in South Asian studies.

The event concluded with a practical session during which attendees used different OCR software on a sample of the BL’s digitised Bengali books. They experimented with Tesseract, Google Drive, i2ocr and newOCR. The general consensus was that Google Drive proved to be the most accurate! There are, however, other tools we have only just begun to try out, such as Transkribus, that may be useful.

Workshop participants trying out various OCR tools
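Accuracy in the practical session was judged informally, by consensus. If you wanted to put a number on a comparison like this, one common measure is the character error rate: the number of single-character edits needed to turn a tool's output into a hand-checked transcription, divided by the length of that transcription. The sketch below is an illustration under that assumption, not the method used at the workshop, and the function names are hypothetical. Note that it counts Unicode code points, so a Bangla vowel sign counts as a character in its own right.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                 # character missing from OCR output
                current[j - 1] + 1,              # extra character in OCR output
                previous[j - 1] + substitution,  # match or wrong character
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edits needed per reference character; lower is better."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Running `character_error_rate(ground_truth, ocr_output)` on the same pages for each tool would give directly comparable scores, provided the same ground truth transcription is used throughout.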

All in all, the workshop proved a really worthwhile exercise in widening knowledge among Indian institutions about the challenges and possible uses of OCR for Bangla. The work currently being undertaken by universities and technology centres using state-of-the-art machine learning techniques to perform text recognition will hopefully close the gap between Bangla (as well as other Indic scripts) and Latin scripts when it comes to efficient OCR tools.

 

This is a post by Tom Derrick, Digital Curator for the Two Centuries of Indian Print project.