Digital scholarship blog

Enabling innovative research with British Library digital collections

32 posts categorized "South Asia"

05 February 2019

BL Labs 2018 Research Award Honourable Mention: 'Doctoral theses as alternative forms of knowledge: Surfacing "Southern" perspectives on student engagement with internationalisation'

This guest blog is by Professor Catherine Montgomery, recipient of one of two Honourable Mentions in the 2018 BL Labs Awards Research category for her work with the British Library's EThOS collection.

 ‘Contemporary universities are powerful institutions, interlinked on a global scale; but they embed a narrow knowledge system that reflects and reproduces social inequalities on a global scale’ (Connell, 2017).

Having worked with doctoral students for many years, and learned much in the process, I found my curiosity sparked by the EThOS collection at the British Library. EThOS houses a large proportion of the doctoral theses completed at UK universities and comprises a digital repository of around 500,000 theses. Doctoral students use this repository regularly, but mostly as a means of exploring examples of doctorates in their chosen area of research. In my experience, doctoral students are often looking at formats or methodologies when they consult EThOS rather than exploring the knowledge provided in the theses.

So when I began to think about the EThOS collection as a whole, I came to the conclusion that it is a vastly under-used but incredibly powerful resource. Doctoral knowledge is not often thought of as a coherent body of knowledge, although individual doctoral theses are sometimes quoted and consulted by academics and other doctoral students. It is also important to remember that of 84,630 Postgraduate Research students studying full time in the UK in 2016/17, half of them, 42,325, were non-UK students, with 29,875 students being from beyond the EU. So in this sense, the knowledge represented in the EThOS collection is an important international body of knowledge.

So I began to explore the EThOS collection with some help from a group of PhD students (Gihan Ismail, Luyao Li and Yanru Xu, all doctoral candidates at the Department of Education at the University of Bath) and the EThOS library team. I wanted to interrogate the collection for a particular field of knowledge and, because my research field is internationalisation of higher education, I carried out a search in EThOS for theses written in the decade 2008 to 2018 focusing on student engagement with internationalisation. This generated an initial data set of 380 doctoral theses, which we downloaded into the software package NVivo. We then refined the data set, excluding theses irrelevant to the topic (I was focusing on higher education so, for example, theses on internationalisation at school level were excluded), and arrived at a final data set of 94 theses on the chosen topic. The EThOS team at the British Library helped at this point by carrying out a separate search using a specific adjacent-word search, which produced a set of 78 theses that they downloaded into a spreadsheet for us. The two data sets were consistent with each other, which provided really useful triangulation in our exploration of the EThOS repository.

This description makes it sound very straightforward, but there were all sorts of challenges, many of them technology-related. We were working with very large amounts of text, as each of the 380 theses was around 100,000 words long or more, and this slowed down the NVivo software and sometimes made it crash. There were also challenges in the search process, as some earlier theses in the collection were in different formats; some were scanned and therefore not searchable.

The outcomes of the work with the EThOS collection were fascinating. Various patterns emerged from the analysis of the doctoral theses and the most prominent of these were insights into the geographies of student engagement with internationalisation; issues of methodologies and theory; and different constructions of internationalisation in higher education.

The theses were written by students from 38 different countries and examined internationalisation of higher education in African countries, the Americas and Australia, across the Asian continent and Europe. Despite this diversity amongst the students, most of the theses investigated internationalisation in the UK or international students in the UK. The international students also often carried out research on their own countries’ higher education systems. There was some limited comparative research, but all of these studies compared the student's own higher education system with one or (rarely) two others. Only a minority of students researched the higher education systems of international contexts different from their own national context.

A similar picture emerged when I considered the sorts of theories and ideas students were using to frame their research. There was a predominance of Western theory used by the international students to cast light on their non-western educational contexts, with many theses relying on concepts commonly associated with Western theory such as social capital, global citizenship or communities of practice. The ways in which the doctoral theses constructed ideas of internationalisation also appeared in many cases to be following a well-worn track and explored familiar concepts of internationalisation including challenges of pedagogy, intercultural interaction and the student experience. Having said this, there were also some innovative, creative and critical insights into students engaging with internationalisation, showing that alternative perspectives and different ways of thinking were generated by the theses of the EThOS collection.

Raewyn Connell, an educationalist whose work I drew on in the analysis for this project, tells us that in an unequal society we need ‘the view-from-below’ to challenge dominant ways of thought. I would argue that we should think about doctoral knowledge as ‘the view-from-below’, and that doctoral theses can offer us alternative perspectives and challenges to the previous narratives of issues such as internationalisation. However, it may be that the academy will need to make space for these alternative or ‘Southern’ perspectives to come in. This will rely on the capacity of the participants, both supervisors and students, to be open to negotiation in theories and ideas, something which another great scholar, Boaventura De Sousa Santos, describes as intercultural translation of knowledge.

I am very grateful indeed to the British Library and the EThOS team for developing this incredible source of digital scholarship and for their support in this project. I was delighted to be given an Honourable Mention in the BL Labs Awards and I intend to take this work forward and explore the EThOS repository further. I was fascinated and excited to find that a growing number of countries are also developing and improving access to their doctoral research repositories (Australia, Canada, China, South Africa and the USA, to name but a few). This represents a huge comparative and open access data set which could be used to explore alternative perspectives on ‘taken-for-granted’ knowledge. Where better to start than with doctoral theses?

More information on the project can be found in this published article:

Montgomery, C. (2018). Surfacing ‘Southern’ perspectives on student engagement with internationalisation: doctoral theses as alternative forms of knowledge. Journal of Studies in International Education, 23(1), 123-138. https://doi.org/10.1177/1028315318803743

Watch Professor Montgomery receiving her award and talking about her project on our YouTube channel (clip runs from 6.57 to 10.39):

Find out more about Digital Scholarship and BL Labs. If you have a project which uses British Library digital content in innovative and interesting ways, consider applying for an award this year! The 2019 BL Labs Symposium will take place on Monday 11 November at the British Library.

24 July 2018

Workshop for South Asian Archivists and Librarians

Members of the Two Centuries of Indian Print team have just returned from a fascinating trip to Delhi where we took part in a packed programme of activities organised as part of the Association for Asian Studies conference.

We spent most of the week with a group of archivists brought together from a variety of academic and cultural institutions across India and as far away as Cambodia and Australia. What united us was a shared passion for preserving South Asian heritage. As part of the programme we led a workshop on Digitisation Standards as practised by the British Library, which also considered the key challenges organisations face when digitising cultural heritage material, including everything from selecting material and scanning, through to post-processing, online display and user engagement. The workshop also featured a paper on the IFLA guidelines for digitisation and what we hope was a fun activity, in which archivists were presented with different case studies of archival collections and asked to consider a digitisation strategy. It certainly sparked a lot of conversation! See photo below.

 

Workshop participants taking part in a group activity

 

Undeterred by Delhi's inhospitable weather, we ventured out and were fortunate enough to receive some very thorough and illuminating tours of the Archives and Research Centre for Ethnomusicology, Centre for Art and Archaeology, The National Archives, Indira Gandhi National Centre for the Arts, and Sangeet Natak Akademi, where we learned about their respective collections, conservation facilities and digitisation projects.

 

Taking part in a tour of the audiovisual lab at the Archives and Research Centre for Ethnomusicology 

 

This marked the end of a trip which has connected us with inspiring professionals with whom we hope to collaborate on more events in the near future.

Our thanks go out to the organisers of what turned out to be a very engaging week of activities, to the American Institute of Indian Studies, to Ashoka University, and to the hosts of our workshop, the India International Centre.

 

01 May 2018

New Digital Curator in the Digital Scholarship Team

Hello all! My name is Adi Keinan-Schoonbaert, and I’m the new Digital Curator for Asian and African collections at the British Library. One of the core remits of the Digital Scholarship team is to enable and encourage the reuse of the Library’s digital collections. When it comes to Asian and African collections, there are always interesting projects and initiatives going on. One is the Two Centuries of Indian Print project, which just started a second phase in March 2018 – a project with a strong Digital Humanities strand led by Digital Curator Tom Derrick. Another example is a collaborative transcription project, supporting the transcription of handwritten historical Arabic scientific works for Handwritten Text Recognition (HTR) research with the help of volunteers.

To give a bit of a background about myself and how I got to the Library: I’m an archaeologist and heritage professional by education and practice, with a PhD in Heritage Studies from University College London (2013). As a field archaeologist I used to record large quantities of excavation-related data – all manually, on paper. This was probably the first time I saw the potential of applying digital tools and technologies to record, manage and share archaeological data.

My first meaningful engagement with archaeological data and digital technologies started in 2005, when I joined the Israeli-Palestinian Archaeology Working Group (IPAWG) to create a database of all archaeological sites surveyed or excavated by Israel in the West Bank since its occupation in 1967, and to link it with a Geographic Information System (GIS), enabling the spatial visualisation and querying of this data for the first time. The research potential of this GIS-linked database proved so great that I decided to explore it further in a PhD dissertation. My dissertation focused on archaeological databases covering the occupied West Bank, and I was especially interested in the nature of archaeological records and the way they reflect particular research interests and heritage management priorities, as well as variability in data quality, coverage, accuracy and reliability.

Following my PhD I stayed at UCL Institute of Archaeology as a post-doctoral research associate, and participated in a project called MicroPasts, a UCL-British Museum collaboration. This project used web-based, crowdsourcing methods to allow traditional academics and other communities in archaeology to co-produce innovative open datasets. The MicroPasts crowdsourcing platform provided a great variety of projects through which people could contribute – from transcribing British Museum card catalogues, through tagging videos on the Roman Empire, to photomasking images in preparation for 3D modelling of museum objects.

With the main phase of the MicroPasts project coming to an end, I joined the British Library as Digital Curator (Polonsky Fellow) for the Hebrew Manuscripts Digitisation Project. This role allowed me to create and implement a digital strategy for engaging, accessing and promoting a specific digitised collection, working closely with curators and the Digital Scholarship team. My work included making the collection digitally accessible (on data.bl.uk, working with British Library Labs) and encouraging open licensing, creating a website, promoting the collection in different ways, researching available digital methods to explore and exploit collections in novel ways, and implementing tools such as an online catalogue records viewer (TEI XML), OpenRefine, and 3D modelling.

A six-month backpacking trip to Asia unexpectedly prepared me for my new role at the Library. I was delighted to join – or re-join – the Library’s Digital Research team, this time as Digital Curator for Asian and African Collections. I find these collections especially intriguing due to their diversity, richness and uniqueness. They include mostly manuscripts, printed books, periodicals, newspapers, photographs and e-resources from Africa, the Middle East (including Qatar Digital Library), Central Asia, East Asia (including the International Dunhuang Project), South Asia and SE Asia, as well as the Visual Arts materials.

I’m very excited to join the Library’s Digital Research team, to work alongside Neil Fitzgerald, Nora McGregor, Mia Ridge and Stella Wisdom, and to learn from their rich experience. Feel free to get in touch with us via [email protected] or Twitter - @BL_AdiKS for me, or @BL_DigiSchol for the Digital Scholarship team.

14 March 2018

Working with BL Labs in search of Sir Jagadis Chandra Bose

The 19th Century British Library Newspapers Database offers a rich mine of material to be sourced for a comprehensive view of British life in the nineteenth and early twentieth centuries. The online archive comprises 101 full-text titles of local, regional, and national newspapers across the UK and Ireland, and thanks to optical character recognition, they are all fully searchable. This allows for extensive data mining across several million newspaper pages. It’s like going through the proverbial haystack looking for the equally proverbial needle, but with a magnet in hand.

For my current research project on the role of the radio during the British Raj, I wanted to find out more about Sir Jagadis Chandra Bose (1858–1937), whose contributions to the invention of wireless telegraphy were hardly acknowledged during his lifetime and all but forgotten during the twentieth century.

Jagadish Chandra Bose in Royal Institution, London
(Image from Wikimedia Commons)

The person who is generally credited with having invented the radio is Guglielmo Marconi (1874–1937). In 1909, he and Karl Ferdinand Braun (1850–1918) were awarded the Nobel Prize in Physics “in recognition of their contributions to the development of wireless telegraphy”. What is generally not known is that almost ten years before that, Bose invented a coherer that would prove to be crucial for Marconi’s successful attempt at wireless telegraphy across the Atlantic in 1901. Bose never patented his invention, and Marconi reaped all the glory.

In his book Jagadis Chandra Bose and the Indian Response to Western Science, Subrata Dasgupta gives us four reasons as to why Bose’s contributions to radiotelegraphy have been largely forgotten in the West throughout the twentieth century. The first reason, according to Dasgupta, is that Bose changed research interest around 1900. Instead of continuing and focusing his work on wireless telegraphy, Bose became interested in the physiology of plants and the similarities between inorganic and living matter in their responses to external stimuli. Bose’s name thus lost currency in his former field of study.

A second reason that contributed to the erasure of Bose’s name is that he did not leave a legacy in the form of students. He did not, as Dasgupta puts it, “found a school of radio research” that could promote his name despite his personal absence from the field. Thirdly, Bose sought no monetary gain from his inventions and patented only one of them. Had he pursued patents more widely, chances are that his name would have echoed loudly through the century, just as Marconi’s has done.

“Finally”, Dasgupta writes, “one cannot ignore the ‘Indian factor’”. Dasgupta wonders how seriously the scientific western elite really took Bose, who was the “outsider”, the “marginal man”, the “lone Indian in the hurly-burly of western scientific technology”. And he wonders how this affected “the seriousness with which others who came later would judge his significance in the annals of wireless telegraphy”.

And this is where the BL’s online archive of nineteenth-century newspapers comes in. Newspaper coverage of Bose in the British press at the time suggests that his contributions to wireless telegraphy were all but forgotten within his own lifetime. When Bose died in 1937, Reuters Calcutta put out a press release that was reprinted in several British newspapers. As an example, the following notice was published in the Derby Evening Telegraph of November 23rd, 1937, on Bose’s death:

Notice in the Derby Evening Telegraph of November 23rd, 1937

This notice is as short as it is telling in what it says and does not say about Bose and his achievements: he is remembered as the man “who discovered a heart beat in trees”. He is not remembered as the man who almost invented the radio. He is remembered for the Western honours that were bestowed upon him (the Knighthood and his Fellowship of the Royal Society), and he is remembered as the founder of the Bose Research Institute. He is not remembered for his career as a researcher and inventor; a career that spanned five decades and saw him travel extensively in India, Europe and the United States.

The Derby Evening Telegraph is not alone in this act of partial remembrance. Similar articles appeared in Dundee’s Evening Telegraph and Post and The Gloucestershire Echo on the same day. The Aberdeen Press and Journal published a slightly extended version of the Reuters press release on November 24th that includes a brief account of a lecture by Bose in Whitehall in 1929, during which Bose demonstrated “that plants shudder when struck, writhe in the agonies of death, get drunk, and are revived by medicine”. However, there is again no mention of Bose’s work as a physicist or of his contributions to wireless telegraphy. The same is true for obituaries published in The Nottingham Evening Post on November 23rd, The Western Daily Press and Bristol Mirror on November 24th, another article published in the Aberdeen Press and Journal on November 26th, and two articles published in The Manchester Guardian on November 24th.

The exception to the rule is the obituary published in The Times on November 24th. Granted, with a total of 1116 words it is significantly longer than the Reuters press release, but this is also partly the point, as it allows for a much more comprehensive account of Bose’s life and achievements. But even if we only take the first two sentences of The Times obituary, which roughly add up to the word count of the Reuters press release, we are already presented with a different account altogether:

“Our Calcutta Correspondent telegraphs that Sir Jagadis Chandra Bose, F.R.S., died at Giridih, Bengal, yesterday, having nearly reached the age of 79. The reputation he won by persistent investigation and experiment as a physicist was extended to the general public in the Western world, which he frequently visited, by his remarkable gifts as a lecturer, and by the popular appeal of many of his demonstrations.”

We know that he was a physicist; the focus is on his skills as a researcher and on his talents as a lecturer rather than on his Western titles and honours, which are mentioned in passing as titles to his name; and we immediately get a sense of the significance of his work within the scientific community and for the general public. And later on in the article, it is finally acknowledged that Bose “designed an instrument identical in principle with the 'coherer' subsequently used in all systems of wireless communication. Another early invention was an instrument for verifying the laws of refraction, reflection, and polarization of electric waves. These instruments were demonstrated on the occasion of his first appearance before the British Association at the 1896 meeting at Liverpool”.

Posted by BL Labs on behalf of Dr Christin Hoene, a BL Labs Researcher in Residence at the British Library. Dr Hoene is a Leverhulme Early Career Fellow in English Literature at the University of Kent. 

If you are interested in working with the British Library's digital collections, why not come along to one of our events that we are holding at universities around the UK this year? We will be holding a roadshow at the University of Kent on 25 April 2018. You can see a programme for the day and book your place through this Eventbrite page. 

21 February 2018

BL Labs 2017 Symposium: Opening up the British Library’s Early Indian Printed Books Collection (Staff Award Winner)

Making the British Library’s valuable collection of early Bengali books more accessible to researchers and the general public around the world rests heavily on the collaborative work undertaken across different teams of the library and partners in the UK and abroad. The commitment and passion of the project team has relied on the contribution and expertise of collaborators, as well as the forward thinking vision of the library, partners and fundraisers.

Receiving the BL Labs Staff Award 2017 is a great opportunity to thank everyone involved. 

Members of the Two Centuries of Indian Print team receiving the British Library Labs award at the Symposium on 30th October 2017
 
Tom Derrick (Digital Curator) was in India at the same time the team received their Award

The Two Centuries of Indian Print project is a partnership between the British Library, the School of Cultural Texts and Records (SCTR) at Jadavpur University, Srishti Institute of Art, Design and Technology, and the Library at SOAS University of London, among others. It has also involved collaborations with the National Library of India, and other institutions in India.

The AHRC Newton-Bhabha Fund and the Department for Business, Energy and Industrial Strategy have generously funded the work undertaken so far by the project, focusing on early printed Bengali books. Many are unavailable in other library collections or are extremely difficult to locate and access. The project has undertaken a variety of initiatives, from digitising books and enhancing catalogue records in English and Bengali, to stimulating the use of digital humanities tools and techniques, running a programme of digital skills sharing and capacity building workshops, and hosting the South Asia Series seminars. All of these initiatives greatly contribute to the discovery and study of the collection. The project is also conducting groundbreaking work in finding a solution to Optical Character Recognition (OCR) in Bangla script. OCR is not available for South Asian languages currently, and harnessing viable OCR technology would enable full text search of the books. This would pave the way for researchers to use natural language processing techniques to perform large scale analysis across a large corpus of text covering a diverse range of topics relating to Indian society, religion, and politics, to name but a few. Doing so will increase the possibilities for new discoveries in this academic field.

However, despite its status as one of the most widely spoken languages in the world, Bangla script has been greatly underserved by providers of OCR solutions. This is due in part to the orthographical and typographical variances that have taken place in recent centuries that make building a dictionary and character ‘classifier’ more challenging. Due to the wide date range of the books we are digitising, these issues affect the quality of OCR. The physical condition of our historical books, including faded text, presents additional difficulties for creating machine readable versions of the books. 

To overcome these obstacles, the project team has been advancing the development of OCR for Bangla through the organisation of an international competition which reviewed the state-of-the-art in commercial and open source text recognition tools. The results of the competition will be announced at the ICDAR 2017 conference in Kyoto later this month. Watch this space! The competition dataset has been made openly available for download and reuse for any researchers or institutions who would like to experiment with OCR for Bengali.

A page from the Animal Biographies, VT 1712 showing its transcription produced for the ICDAR 2017 competition

The project has organised two Skills Exchange Programmes, hosting mid-career Library professionals from the National Library of India at the British Library for a week, providing a packed programme of tours and talks from all areas of the Library. The project has also conducted digital skills sharing and capacity building workshops for library professionals and archivists from cultural heritage institutions in India. The first workshop took place at Jadavpur University, Kolkata, in December 2016. Library and information professionals from cultural heritage institutions in Bengal took part in a one-day event to learn more about how information technology is transforming humanities research today and in turn Library services, as well as the methods for interrogating humanities-related datasets.

After the success of this first workshop another event was held in July 2017, at which more than 30 library professionals discussed OCR developments for Bangla, trying out different tools and discussing digital scholarship techniques and projects. Most recently, the project’s digital curator facilitated a workshop around Digitisation Standards at the International Conference of Asian Libraries in Delhi. The workshops continue in earnest in the new year with another digital humanities skills workshop planned for January 2018 to be held in partnership with the Srishti Institute of Art, Design, and Technology.

Attendees of the workshop held at Jadavpur University in December 2016 taking part in a group activity to discuss the application of digital humanities methods to library collections

The Project Team also held a two day Academic Symposium on South Asian book history at Jadavpur University in the summer, with 17 speakers from India, wider South Asia, and the UK. Attendance was between 50 and 70 people a day and feedback was very good. We plan to have a publication arising from this Symposium, and to upload a video to our project webspace. The project also hosts a popular series of talks based around the Two Centuries of Indian Print project and the British Library’s South Asia collections. The seminars take place fortnightly at the British Library. So far we have hosted a range of academics and researchers, from PhD students to senior academics from the UK and abroad, who share cutting-edge research with discussion chaired by curators and specialists in the field. The seminars have been a great success, attracting large attendances and speakers from around the world. We also host a number of show and tells of our material to raise awareness of our collection and to engage in community outreach.

Everyone on the project is thrilled to have won this award and we will be working hard in 2018 to continue bringing the Two Centuries of Indian Print project to the attention and use of researchers and the general public.

Submit a project for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library.

Posted by BL Labs on behalf of The Two Centuries of Indian Print team.

27 July 2017

A workshop on Optical Character Recognition for Bangla

I was fortunate enough to travel to Kolkata recently along with other members of the Two Centuries of Indian Print team, where we ran a workshop on ‘Developments with Optical Character Recognition for Bangla’. The event, which took place at Jadavpur University, proved an excellent forum for sharing knowledge in this area of growing interest, as reflected in the range of library professionals, academics and computer scientists who attended from ten institutions across Bengal and from the US.

Applying Optical Character Recognition (OCR) to printed texts is one of the key expectations of 21st century scholars and library users, who want to quickly find information online that accurately meets their research needs. Cultural institutions are gateways to millions of items containing knowledge that can transform modern research. The workshop built on our recently launched OCR Competition for Rare Indian Books and looked at the developments, challenges and opportunities of OCR in opening up vast quantities of knowledge to digital researchers.

Dr. Naira Khan from the University of Dhaka’s Computational Linguistics department kicked off the workshop by introducing the key process of how OCR works, including ‘pre-processing’ steps such as binarisation, which reduces a scanned page to a two-tone (black and white) image so that background noise is removed and only the text remains. Skew detection, another pre-processing technique, corrects scans in which the text sits at an angle, which can cause problems for OCR systems that expect perfectly horizontal or vertical text. Dr. Khan moved on to explain how OCR systems segment pages into text and non-text regions, right down to pixel detection to recognise word boundaries. When it comes to recognising individual characters, Bangla script presents some unique challenges, as it contains a vast range of compound characters, vowel signs and ligatures, not to mention the distinctive top line connecting characters known as the ‘Matra’. Breaking the characters into their geometric features such as lines, arcs and circles enables combinations of features to be formed, classified as characters and expressed in digital form as OCR output.

Dr. Khan introducing the concepts of OCR

After Dr. Khan’s inspiring talk, attendees learned of the British Library’s particular challenge in searching for an OCR solution for our 19th century Bengali books, which are currently being digitised, and of the potential use of an OCR’d dataset for Digital Humanities researchers wanting to perform text and data mining. The books span an enormous range of genres, from works by religious missionaries to those covering food, science and fiction. Obtaining OCR text would therefore enable automated searching and analysis of the full text across hundreds of thousands of pages, which could lead to exciting research discoveries in South Asian studies.

The event concluded with a practical session during which attendees used different OCR software on a sample of the BL’s digitised Bengali books. They experimented with Tesseract, Google Drive, i2ocr and newOCR. The general consensus was that Google Drive proved to be the most accurate! That said, there are other tools, such as Transkribus, that we have only just begun to try out and that may prove useful.
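The comparison in the practical session was informal, but for anyone wanting to put a number on ‘most accurate’, one common measure is the character error rate: the number of single-character edits needed to turn the OCR output into a hand-checked transcription, divided by the length of that transcription. The sketch below is an assumption about how such a comparison could be done, not a record of how the workshop judged the tools. It counts Unicode code points, so a Bangla vowel sign counts as a character in its own right.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_output):
    """Edits needed per character of the hand-checked transcription."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)


# Invented example: the OCR output has lost the final vowel sign of the word.
truth = "বাংলা"
guess = "বাংল"
print(character_error_rate(truth, guess))
```

A lower rate is better, and a rate of 0 means the OCR output matches the transcription exactly. Running the same pages through each tool and comparing the rates gives a like-for-like comparison.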

Workshop participants trying out various OCR tools

All in all, the workshop proved a really worthwhile exercise in widening knowledge among Indian institutions about the challenges and possible uses of OCR for Bangla. The work currently being undertaken by universities and technology centres using state-of-the-art machine learning techniques to perform text recognition will hopefully close the gap between Bangla (as well as other Indic scripts) and Latin scripts when it comes to efficient OCR tools.

This is a post by Tom Derrick, Digital Curator for the Two Centuries of Indian Print project.

17 July 2017

A Wonderland of Knowledge - Behind the Scenes of the British Library (Nadya Miryanova work experience)

Posted by Nadya Miryanova BL Labs School Work Placement Student, currently studying at Lady Eleanor Holles, working with Mahendra Mahey, Manager of BL Labs.

British Library
Introduction to the British Library

Day 1

It was with a mixture of anticipation, curiosity and excitement that I opened the door to the staff entrance and started my two-week work placement in the world’s largest library. I have been placed with BL Labs in the Digital Scholarship department, where I am working with Mahendra Mahey (Project Manager of BL Labs). After the inescapable health and safety induction, I am now extremely well acquainted with the BL’s elaborate fire alarm system, and following lunch at the staff restaurant, Mahendra provided me with an introduction to the British Library and explained the work undertaken by BL Labs.

When most people hear the word ‘library’, conventional ideas typically spring to mind, including a copious number of books, and, of course, a disgruntled librarian ironically rather loudly encouraging silence every five minutes. I must admit that initially, my perspective was the same.

However, my viewpoint was soon to be completely turned around.

BL interior
British Library interior

An extraordinary institution, the British Library is indeed widely known for its remarkable collection of books: it is home to around 14 million. However, contrary to popular belief, these are only a small part of the Library’s vast collections. In fact, the British Library has an extremely diverse range of items, from patents to musical scores, and from ancient artefacts dating as far back as 1000 BC to this morning’s newspapers, altogether giving a grand figure of approximately 200 million documented items. I was also delighted to discover that the British Library has the world’s largest collection of stamps! It is estimated that if somebody looked at 5 items each day, it would take an astonishing 80,000 years to see the whole of the BL’s collections.

I learnt that the objective of BL Labs is to encourage scholars, innovators, artists, entrepreneurs and educators to work with the Library's digital collections, supporting its mission to ensure that the wealth and diversity of the Library’s intellectual digital heritage is available for the research, creativity and fulfilment of everyone. At BL Labs, anyone is invited to address important research questions or ideas that use the Library’s digital content and data, whether by entering the annual Awards, becoming involved in a collaborative project or simply using the collections in whatever way they want.

Although initially a little nervous when entering this immense institution, my fears evaporated completely, when on my very first day of working here, I was brought immediately into a friendly, welcoming atmosphere, promoted by the sincere kindness and interest that I was met with from each member of the Library's staff. 

Books Image
The George IV British Library book collection

Day 2

At precisely 9 o’clock in the morning, I found myself seated at my office desk, looking at the newly filled out Outlook calendar on my computer to see what new and exciting tasks I would be faced with that day and looking out for any upcoming events. My Tuesday consisted mostly of independent work at my desk, and after a quick catch-up with Mahendra at 9.30, where we discussed the working plan for the day and reviewed yesterday’s work, I sat down to start my second full day of work at the British Library.

BL labs symposium
British Library Labs leaflet

Between 2013 and 2016, British Library Labs held a competition which looked for transformative project ideas that used the British Library’s digital collections and data in new and exciting ways. The BL Labs Awards recognise outstanding and innovative work that has been carried out using these collections. Mahendra had previously introduced me to the Labs Competition and Awards pages of the BL Labs website, and my main objective was to update the ideas and project submissions on these pages, specifically adding the remaining Competition 2016 entries, reviewing the 2015 and 2014 entries and checking that they were all complete, with no entries missing. The competition entries can be accessed via the online archive.

This was an excellent opportunity for me to work on a new editing platform and further enhance my editing skills, which will doubtless prove very useful in everyday life as well as in the future. As I worked through editing and updating the pages, what struck me most was the incredible diversity of ideas within the competition entries. From a project exploring Black Abolitionists and their presence in Britain to the proposed creation of a Victorian meme machine, and from a planned political meetings mapper to a suggested Alice in Wonderland bow tie design, each idea was entirely unique and original, despite the fact that every entry adhered to the same brief. I was mesmerised by the amount of thought and careful planning that was evident in every submission; each one was intricately detailed and provided a careful and thorough plan of work.

Victorian Meme
An example of a Victorian meme

After finishing lunch relatively early, I found myself with half an hour of my allocated break still left, and took the opportunity to explore the library. I walked down to the visitors’ entrance, and took a moment to admire the King’s Library, a majestic tower of books standing in the British Library's centre. Stepping closer, I was able to read some of the inscriptions on the spines of the books, and was delighted to see that one of them was a book of Catullus’ poetry, which I had previously studied for Latin GCSE. The scope of knowledge that lies within this library is practically endless, and it led me to reflect on the importance of the work of BL Labs. I thought back to the competition entries, which prove that the possibilities for projects truly have no limit. BL Labs is able to give scholars, academics and students the opportunity to access some of these digital collections, such as books, very easily and from any part of the world. Without this access, many of the wonderful projects that the BL currently works on would not be possible.

With that thought fresh in my mind, I was brought back to reality, and returned to my desk to continue working, this time on my mini-project. My last task for the day involved brainstorming ideas for this project. A direct focus was soon established, and I decided to explore the Russian language titles among the 65,000 19th century books digitised in partnership with Microsoft. Later on, I shall be writing a blog post detailing my experience of working on this project.

Day 3

As the Piccadilly line train arrived at St Pancras, I actually managed to step off and head in completely the right direction for the first time that week (needless to say, my sense of direction is not the best). Feeling rather proud of myself, I walked with a skip in my step, ready to immerse myself in whatever plan of work awaited me that day.

I looked at the schedule for the day and my heart leapt: I was to attend my first ever proper staff meeting. It was a very technical meeting, started off by the Head of Digital Scholarship, Adam Farquhar, who talked about current activities taking place in the Digital Scholarship department. Everyone contributed to the general discussion, and Mahendra talked about the development of the BL Labs work and the progress made so far. The meeting also gave me an opportunity to talk about some of the things I was doing, and I found that everybody was very receptive and supportive. It was very interesting to be introduced to people who work in the same area on a day-to-day basis at the British Library, and I enjoyed hearing about all the different projects currently being undertaken.

SherlockNet web interface

I then began working on some YouTube transcription work on the winners of the 2016 BL Labs competition, the first one being SherlockNet. The SherlockNet team used convolutional neural networks to automatically tag and caption the British Library Flickr collection of digitised images, taken largely from 19th century books. If that doesn't sound impressive enough, consider the fact that this entry was submitted by three people who were just 19 years old (undergraduate university students). My work involved listening carefully to each one of the interviews, and typing in a separate Word document exactly what Luda Zhao, Karen Wang and Brian Do were saying. This Word document would then be used to make subtitles for the final film and would prove invaluable when creating a storyboard for the final cut-down interview.

BL poster
British Library Alice in Wonderland Poster

Day 4

As I turned the corner of Midland Road and stood to face the traffic lights, my gaze wandered over to the now familiar Alice in Wonderland poster that had ‘British Library’ printed on it in block capitals. I smiled as I looked up at the Cheshire cat that was perched neatly on top of the first 'I' in the words 'British Library' and the cat smiled back, revealing a wide toothy grin. Alice, likewise, was looking up at the Cheshire cat, and in that moment, her situation was made very credible to me. She was surrounded by this entirely new world of Wonderland, and in a similar way, I find myself in a parallel world of continuous acquisition of knowledge, as each day I am learning something new, with the British Library being the Wonderland. A wonderful and well-known literary extract from Lewis Carroll came to mind:

“‘Would you tell me, please, which way I ought to go from here?’ (Alice)

‘That depends a good deal on where you want to get to,’ said the Cat.

‘I don’t much care where--’ said Alice.

‘Then it doesn’t matter which way you go,’ said the Cat.

‘--so long as I get somewhere,’ Alice added as an explanation.

‘Oh, you’re sure to do that,’ said the Cat, ‘if you only walk long enough.’”

With this in mind, I briskly walked over to the doors of the office.

The beginning of my day consisted mostly of working on my own project, further classifying a sub-collection of Russian titles from the digitised collection of 65,000 books, mostly from the 19th century. I worked on further enhancing the organisation and categorisation of these books, establishing a clear methodical approach that began with sorting the books into two categories: fiction and non-fiction. Curiously, the majority of the titles were actually non-fiction. After an e-mail correspondence with Katya Rogatchevskaia, Lead Curator East European Collections, I discovered that most of the books that were part of the digitisation were acquired at the time when they were published, so they were selected by Katya’s distant predecessors, a fact I found remarkable.

Nicholas II abdication in Russian
The Act of Abdication of Nicholas II and his brother Grand Duke Michael,
published as a placard that would be distributed
by hand or pasted to walls (shelfmark: HS.74/1870),
an example of a Russian language title that is now digitised

For the second half of the day, I focussed once more on the YouTube transcription work and managed to finish transcribing the interviews for SherlockNet. I then discussed with Mahendra how I would storyboard the interviews in preparation for the film editing process. First, I would have to pick out the specific sections of the interview that were most suitable to use in the film, marking the exact timings from when the person started speaking to when they finished, and then place the series of timings in chronological order. I was also able to choose the music for the end product (possibly my favourite part!), and I based my selection on the mood of the videos and my perception of the characters of the individuals. I concluded my day by finding a no-copyright YouTube music page, where I discovered an assortment of possible music tracks. I managed to narrow down the selection to four possible soundtracks, which included titles such as ‘Spring in my Step’ and ‘Good Starts’.

Day 5

As I swiped my staff pass across the reader which permits access into the building, I checked my phone to see what the time was. It was 8.30am and, at the same moment, I caught sight of the date: Friday 14th July. I stopped in my tracks. Today marked my first full working week at the British Library, and I could hardly believe how quickly the time had gone! It forcibly reminded me of the inscription on my clock at home, ‘tempus fugit’ (time flees), because if there’s one thing that has gone abnormally fast during my time at the BL, it’s time.

Hebrew manuscript
Digitised Hebrew Manuscript available through the British Library

In the morning, I attended a meeting discussing an event Mahendra is planning around the digitised Hebrew manuscripts, and I was lucky enough to meet Ilana Tahan, the Lead Curator of Hebrew and Christian Orient Collections. The meeting included a telephone call to Eva Frojmovic, an academic at the Centre for Jewish Studies in the School of Fine Art, History of Art and Cultural Studies at the University of Leeds. The discussion centred mostly on an event at which the BL would talk about its collection of digitised Hebrew manuscripts in order to promote their free use to the general public. The very beautiful Hebrew manuscripts could actually have a very wide target audience, perhaps reaching beyond the academic sphere and having the potential to be used in the creative and artistic space.

Contrary to popular belief, the collection of 1302 digitised manuscripts can be used by anyone and everyone, leading to exciting possibilities and new projects. The amazing thing about the digital collections is that they can be accessed by someone who does not live in London, wherever they may be in the world, and can be used to enhance any learning experience, from seminars or lessons to PhD research projects. The hard copies of the manuscripts can also, of course, be accessed in the British Library. The structure and timings of the event were discussed, and a date was set for the next meeting and for the event. To finish the meeting, Mahendra offered an explanation of the handwriting recognition transcription process for the manuscripts. There are 22 letters in the Hebrew alphabet, and each individual handwritten letter is recognised as a shape by the computer, though it's important that the computer has ground truth (i.e. examples of human-transcribed manuscripts) to learn from. Each letter and word is recognised and processed, and the software very cleverly converts the original handwritten Hebrew script into computerised Hebrew script. This would then allow someone to search for words in the manuscript easily and quickly using a computerised search tool.

Ilana looking at manuscripts
Ilana Tahan, Lead Curator of Hebrew and Christian Orient Collections,
looking through Hebrew manuscripts

For the majority of the afternoon, I was floating between a variety of different projects, doing more work on the YouTube transcriptions and enhancing my mini-project, as well as creating a table of the outstanding blogs that still had to be published on the British Library's Digital Scholarship blog.

At the end of the day, I reviewed my first week with Mahendra, evaluating the progress that I had made. Throughout the week, I feel that I have enhanced and developed a number of invaluable skills, and have gained an incredible insight into the working world.

I will be writing about my second week, as well as my mini-project soon, so please come and visit this blog again if you are interested to find out more about some of the work being done at the British Library.

 

 

22 March 2017

British Library Launches OCR Competition for Rare Indian Books

Calling all transcription enthusiasts! We’ve launched a competition to find an accurate and automated transcription solution for our rare Indian books and printed catalogue records, currently being digitised through the Two Centuries of Indian Print project. 

The competition, in partnership with the University of Salford’s PRIMA Research Lab, is part of the International Conference on Document Analysis and Recognition, taking place in Kyoto, Japan this November. The winners will be announced at a special event during the conference.

Digitised images of the books will be made openly available through the library’s website and we hope this competition will produce transcriptions that enable full text search and discovery of this rich material. Sharing XML transcriptions will also give researchers the foundation to apply computational tools and methods such as text mining that may lead to new insights into book and publishing history in India.   

The competition is split into two challenges, and those wishing to participate can enter either or both.

The first challenge is to find an automated transcription for the 19th century printed books written in Bengali script. Optical Character Recognition of many non-Latin scripts is a developing area, but still presents a considerable barrier for libraries and other cultural institutions hoping to open up their material for scholarly research.

Above: A page from 'Animal Biography', one of the Bengali books being digitised as part of Two Centuries of Indian Print (VT 1712)

Challenge number two involves our printed catalogue records, known as ‘Quarterly Lists’. These describe books published in India between 1867 and 1967. The lists are arranged in tables, and therefore accurately representing the layout of the data is important if researchers are to be able to use computational methods to identify chunks of information such as the place of publication and cost of the book.
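To illustrate why preserving the table layout matters, here is a minimal sketch in Python of the kind of analysis it makes possible. Everything specific in it is an assumption made for illustration: it imagines that a page has been transcribed and exported as tab-separated text, and the column names and sample rows are invented placeholders rather than real Quarterly List entries.

```python
import csv
import io
from collections import Counter

# Invented sample rows: column names and values are placeholders,
# not real entries from the Quarterly Lists.
SAMPLE_TSV = (
    "title\tplace_of_publication\tprice\n"
    "Example title A\tCalcutta\t4 annas\n"
    "Example title B\tDacca\t1 rupee\n"
    "Example title C\tCalcutta\t8 annas\n"
)


def count_by_place(tsv_text):
    """Count transcribed records per place of publication."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row["place_of_publication"].strip() for row in reader)


counts = count_by_place(SAMPLE_TSV)
for place, number in counts.most_common():
    print(place, number)
```

If the columns were lost in transcription and each row became a single run of text, pulling out the place of publication would mean guessing where one field ends and the next begins, which is exactly the problem an accurate layout avoids.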

Above: A typical double page from the Quarterly Lists (SV 412/8)

With the competition now open, we’ve already gone some way to helping participants by manually transcribing a few pages to create ‘ground truth’ using PRIMA's editing tool, Aletheia.  You can watch a video introducing the competition. So if you or anyone you know would like to enter, do please register and you could be contributing to this landmark project, and picking up an award for your troubles!   
