THE BRITISH LIBRARY

Digital scholarship blog

15 posts categorized "South Asia"

24 July 2018

Workshop for South Asian Archivists and Librarians


Members of the Two Centuries of Indian Print team have just returned from a fascinating trip to Delhi where we took part in a packed programme of activities organised as part of the Association for Asian Studies conference.

We spent most of the week with a group of archivists brought together from a variety of academic and cultural institutions across India and from as far away as Cambodia and Australia. What united us was a shared passion for preserving South Asian heritage. As part of the programme we led a workshop on digitisation standards as practised by the British Library, which also considered the key challenges organisations face when digitising cultural heritage material, from selecting and scanning material through to post-processing, online display and user engagement. The workshop also featured a paper on the IFLA guidelines for digitisation and what we hope was a fun activity, in which archivists were presented with different case studies of archival collections and asked to consider a digitisation strategy. It certainly sparked a lot of conversation! See the photo below.

 


Workshop participants taking part in a group activity

 

Undeterred by the inhospitable weather occupying Delhi, we ventured out and were fortunate enough to receive some very thorough and illuminating tours of the Archives and Research Centre for Ethnomusicology, Centre for Art and Archaeology, The National Archives, Indira Gandhi National Centre for the Arts, and Sangeet Natak Akademi where we learned about their respective collections, conservation facilities and digitisation projects.

 

Taking part in a tour of the audiovisual lab at the Archives and Research Centre for Ethnomusicology 

 

This marked the end of a trip which has connected us with inspiring professionals with whom we hope to collaborate on more events in the near future.

Our thanks go out to the organisers of what turned out to be a very engaging week of activities, to the American Institute of Indian Studies, to Ashoka University, and to the hosts of our workshop, the India International Centre.

 

01 May 2018

New Digital Curator in the Digital Scholarship Team


Hello all! My name is Adi Keinan-Schoonbaert, and I’m the new Digital Curator for Asian and African collections at the British Library. One of the core remits of the Digital Scholarship team is to enable and encourage the reuse of the Library’s digital collections. When it comes to Asian and African collections, there are always interesting projects and initiatives going on. One is the Two Centuries of Indian Print project, which just started a second phase in March 2018 – a project with a strong Digital Humanities strand led by Digital Curator Tom Derrick. Another example is a collaborative transcription project, supporting the transcription of handwritten historical Arabic scientific works for Handwritten Text Recognition (HTR) research with the help of volunteers.

To give a bit of a background about myself and how I got to the Library: I’m an archaeologist and heritage professional by education and practice, with a PhD in Heritage Studies from University College London (2013). As a field archaeologist I used to record large quantities of excavation-related data – all manually, on paper. This was probably the first time I saw the potential of applying digital tools and technologies to record, manage and share archaeological data.

My first meaningful engagement with archaeological data and digital technologies started in 2005, when I joined the Israeli-Palestinian Archaeology Working Group (IPAWG) to create a database of all archaeological sites surveyed or excavated by Israel in the West Bank since its occupation in 1967, and to link it with a Geographic Information System (GIS), enabling the spatial visualisation and querying of this data for the first time. The research potential of this GIS-linked database proved so great that I decided to explore it further in a PhD dissertation. My dissertation focused on archaeological databases covering the occupied West Bank, and I was especially interested in the nature of archaeological records and the way they reflect particular research interests and heritage management priorities, as well as variability in data quality, coverage, accuracy and reliability.

Following my PhD I stayed at UCL Institute of Archaeology as a post-doctoral research associate, and participated in a project called MicroPasts, a UCL-British Museum collaboration. This project used web-based, crowdsourcing methods to allow traditional academics and other communities in archaeology to co-produce innovative open datasets. The MicroPasts crowdsourcing platform provided a great variety of projects through which people could contribute – from transcribing British Museum card catalogues, through tagging videos on the Roman Empire, to photomasking images in preparation for 3D modelling of museum objects.

With the main phase of the MicroPasts project coming to an end, I joined the British Library as Digital Curator (Polonsky Fellow) for the Hebrew Manuscripts Digitisation Project. This role allowed me to create and implement a digital strategy for engaging, accessing and promoting a specific digitised collection, working closely with curators and the Digital Scholarship team. My work included making the collection digitally accessible (on data.bl.uk, working with British Library Labs) and encouraging open licensing, creating a website, promoting the collection in different ways, researching available digital methods to explore and exploit collections in novel ways, and implementing tools such as an online catalogue records viewer (TEI XML), OpenRefine, and 3D modelling.

A six-month backpacking trip to Asia unexpectedly prepared me for my new role at the Library. I was delighted to join – or re-join – the Library’s Digital Research team, this time as Digital Curator for Asian and African Collections. I find these collections especially intriguing due to their diversity, richness and uniqueness. They include mostly manuscripts, printed books, periodicals, newspapers, photographs and e-resources from Africa, the Middle East (including Qatar Digital Library), Central Asia, East Asia (including the International Dunhuang Project), South Asia and South East Asia – as well as the Visual Arts materials.

I’m very excited to join the Library’s Digital Research team, to work alongside Neil Fitzgerald, Nora McGregor, Mia Ridge and Stella Wisdom, and to learn from their rich experience. Feel free to get in touch with us via digitalresearch@bl.uk or Twitter - @BL_AdiKS for me, or @BL_DigiSchol for the Digital Scholarship team.

14 March 2018

Working with BL Labs in search of Sir Jagadis Chandra Bose


The 19th Century British Library Newspapers Database offers a rich mine of material to be sourced for a comprehensive view of British life in the nineteenth and early twentieth centuries. The online archive comprises 101 full-text titles of local, regional, and national newspapers across the UK and Ireland, and thanks to optical character recognition, they are all fully searchable. This allows for extensive data mining across several million newspaper pages. It’s like going through the proverbial haystack looking for the equally proverbial needle, but with a magnet in hand.

For my current research project on the role of the radio during the British Raj, I wanted to find out more about Sir Jagadis Chandra Bose (1858–1937), whose contributions to the invention of wireless telegraphy were hardly acknowledged during his lifetime and all but forgotten during the twentieth century.

Jagadis Chandra Bose at the Royal Institution, London
(Image from Wikimedia Commons)

The person who is generally credited with having invented the radio is Guglielmo Marconi (1874–1937). In 1909, he and Karl Ferdinand Braun (1850–1918) were awarded the Nobel Prize in Physics “in recognition of their contributions to the development of wireless telegraphy”. What is generally not known is that almost ten years before that, Bose invented a coherer that would prove to be crucial for Marconi’s successful attempt at wireless telegraphy across the Atlantic in 1901. Bose never patented his invention, and Marconi reaped all the glory.

In his book Jagadis Chandra Bose and the Indian Response to Western Science, Subrata Dasgupta gives us four reasons why Bose’s contributions to radiotelegraphy have been largely forgotten in the West throughout the twentieth century. The first reason, according to Dasgupta, is that Bose changed his research interests around 1900. Instead of continuing to focus his work on wireless telegraphy, Bose became interested in the physiology of plants and the similarities between inorganic and living matter in their responses to external stimuli. Bose’s name thus lost currency in his former field of study.

A second reason that contributed to the erasure of Bose’s name is that he did not leave a legacy in the form of students. He did not, as Dasgupta puts it, “found a school of radio research” that could promote his name despite his personal absence from the field. Thirdly, Bose sought no monetary gain from his inventions and patented only one of them. Had he patented more, chances are that his name would have echoed loudly through the century, just as Marconi’s has done.

“Finally”, Dasgupta writes, “one cannot ignore the ‘Indian factor’”. Dasgupta wonders how seriously the scientific western elite really took Bose, who was the “outsider”, the “marginal man”, the “lone Indian in the hurly-burly of western scientific technology”. And he wonders how this affected “the seriousness with which others who came later would judge his significance in the annals of wireless telegraphy”.

And this is where the BL’s online archive of nineteenth-century newspapers comes in. Newspaper coverage of Bose in the British press at the time suggests that his contributions to wireless telegraphy were all but forgotten within his own lifetime. When Bose died in 1937, Reuters Calcutta put out a press release that was reprinted in several British newspapers. As an example, the following notice was published in the Derby Evening Telegraph of November 23rd, 1937, on Bose’s death:

Notice in the Derby Evening Telegraph of November 23rd, 1937

This notice is as short as it is telling in what it says and does not say about Bose and his achievements: he is remembered as the man “who discovered a heart beat in trees”. He is not remembered as the man who almost invented the radio. He is remembered for the Western honours that were bestowed upon him (the Knighthood and his Fellowship of the Royal Society), and he is remembered as the founder of the Bose Research Institute. He is not remembered for his career as a researcher and inventor; a career that spanned five decades and saw him travel extensively in India, Europe and the United States.

The Derby Evening Telegraph is not alone in this act of partial remembrance. Similar articles appeared in Dundee’s Evening Telegraph and Post and The Gloucestershire Echo on the same day. The Aberdeen Press and Journal published a slightly extended version of the Reuters press release on November 24th that includes a brief account of a lecture by Bose in Whitehall in 1929, during which Bose demonstrated “that plants shudder when struck, writhe in the agonies of death, get drunk, and are revived by medicine”. However, there is again no mention of Bose’s work as a physicist or of his contributions to wireless telegraphy. The same is true for obituaries published in The Nottingham Evening Post on November 23rd, The Western Daily Press and Bristol Mirror on November 24th, another article published in the Aberdeen Press and Journal on November 26th, and two articles published in The Manchester Guardian on November 24th.

The exception to the rule is the obituary published in The Times on November 24th. Granted, with a total of 1116 words it is significantly longer than the Reuters press release, but this is also partly the point, as it allows for a much more comprehensive account of Bose’s life and achievements. But even if we only take the first two sentences of The Times obituary, which roughly add up to the word count of the Reuters press release, we are already presented with a different account altogether:

“Our Calcutta Correspondent telegraphs that Sir Jagadis Chandra Bose, F.R.S., died at Giridih, Bengal, yesterday, having nearly reached the age of 79. The reputation he won by persistent investigation and experiment as a physicist was extended to the general public in the Western world, which he frequently visited, by his remarkable gifts as a lecturer, and by the popular appeal of many of his demonstrations.”

We know that he was a physicist; the focus is on his skills as a researcher and on his talents as a lecturer rather than on his Western titles and honours, which are mentioned in passing as titles to his name; and we immediately get a sense of the significance of his work within the scientific community and for the general public. And later on in the article, it is finally acknowledged that Bose “designed an instrument identical in principle with the 'coherer' subsequently used in all systems of wireless communication. Another early invention was an instrument for verifying the laws of refraction, reflection, and polarization of electric waves. These instruments were demonstrated on the occasion of his first appearance before the British Association at the 1896 meeting at Liverpool”.

Posted by BL Labs on behalf of Dr Christin Hoene, a BL Labs Researcher in Residence at the British Library. Dr Hoene is a Leverhulme Early Career Fellow in English Literature at the University of Kent. 

If you are interested in working with the British Library's digital collections, why not come along to one of our events that we are holding at universities around the UK this year? We will be holding a roadshow at the University of Kent on 25 April 2018. You can see a programme for the day and book your place through this Eventbrite page. 

21 February 2018

BL Labs 2017 Symposium: Opening up the British Library’s Early Indian Printed Books Collection (Staff Award Winner)


Making the British Library’s valuable collection of early Bengali books more accessible to researchers and the general public around the world rests heavily on the collaborative work undertaken across different teams of the Library and partners in the UK and abroad. The commitment and passion of the project team has relied on the contribution and expertise of collaborators, as well as the forward-thinking vision of the Library, partners and fundraisers.

Receiving the BL Labs Staff Award 2017 is a great opportunity to thank everyone involved. 

Members of the Two Centuries of Indian Print team receiving the British Library Labs award at the Symposium on 30th October 2017
 
Tom Derrick (Digital Curator) was in India at the same time the team received their Award

The Two Centuries of Indian Print project is a partnership between the British Library, the School of Cultural Texts and Records (SCTR) at Jadavpur University, Srishti Institute of Art, Design and Technology, and the Library at SOAS University of London, among others. It has also involved collaborations with the National Library of India, and other institutions in India.

The AHRC Newton-Bhabha Fund and the Department for Business, Energy and Industrial Strategy have generously funded the work undertaken so far by the project, focusing on early printed Bengali books. Many are unavailable in other library collections or are extremely difficult to locate and access. The project has undertaken a variety of initiatives, from the digitisation of books and enhancement of the catalogue records in English and Bengali, to stimulating the use of digital humanities tools and techniques, running a programme of digital skills sharing and capacity building workshops, and hosting the South Asia Series seminars. All of these initiatives greatly contribute to the discovery and study of the collection. The project is also conducting groundbreaking work in finding a solution to Optical Character Recognition (OCR) in Bangla script. OCR is not currently available for South Asian languages. Harnessing viable OCR technology would enable full text search of the books, paving the way for researchers to use natural language processing techniques to perform large scale analysis across a large corpus of text covering a diverse range of topics relating to Indian society, religion and politics, to name but a few. Doing so will increase the possibilities for new discoveries in this academic field.

However, despite its status as one of the most widely spoken languages in the world, Bangla script has been greatly underserved by providers of OCR solutions. This is due in part to the orthographical and typographical variances that have taken place in recent centuries that make building a dictionary and character ‘classifier’ more challenging. Due to the wide date range of the books we are digitising, these issues affect the quality of OCR. The physical condition of our historical books, including faded text, presents additional difficulties for creating machine readable versions of the books. 

To overcome these obstacles, the project team has been advancing the development of OCR for Bangla through the organisation of an international competition which reviewed the state-of-the-art in commercial and open source text recognition tools. The results of the competition will be announced at the ICDAR 2017 conference in Kyoto later this month. Watch this space! The competition dataset has been made openly available for download and reuse for any researchers or institutions who would like to experiment with OCR for Bengali.

A page from the Animal Biographies, VT 1712 showing its transcription produced for the ICDAR 2017 competition

The project has organised two Skills Exchange Programmes, hosting mid-career Library professionals from the National Library of India at the British Library for a week, providing a packed programme of tours and talks from all areas of the Library. The project has also conducted digital skills sharing and capacity building workshops for library professionals and archivists from cultural heritage institutions in India. The first workshop took place at Jadavpur University, Kolkata, in December 2016. Library and information professionals from cultural heritage institutions in Bengal took part in a one-day event to learn more about how information technology is transforming humanities research today and in turn Library services, as well as the methods for interrogating humanities-related datasets.

After the success of this first workshop another event was held in July 2017, at which more than 30 library professionals discussed OCR developments for Bangla, tried out different tools and discussed digital scholarship techniques and projects. Most recently, the project’s digital curator facilitated a workshop around Digitisation Standards at the International Conference of Asian Libraries in Delhi. The workshops continue in earnest in the new year with another digital humanities skills workshop planned for January 2018 to be held in partnership with the Srishti Institute of Art, Design, and Technology.

Attendees of the workshop held at Jadavpur University in December 2016 taking part in a group activity to discuss the application of digital humanities methods to library collections

The Project Team also held a two-day Academic Symposium on South Asian book history at Jadavpur University in the summer, with 17 speakers from India, wider South Asia, and the UK. Attendance was between 50 and 70 people a day and feedback was very good. We plan to have a publication arising from this Symposium, and to upload a video to our project webspace. The project also hosts a popular series of talks based around the Two Centuries of Indian Print project and the British Library’s South Asia collections. The seminars take place fortnightly at the British Library. So far we have hosted a range of academics and researchers, from PhD students to senior academics from the UK and abroad, who share cutting-edge research with discussion chaired by curators and specialists in the field. The seminars have been a great success, attracting large attendances and speakers from around the world. We also host a number of show and tells of our material to raise awareness of our collection and to engage in community outreach.

Everyone on the project is thrilled to have won this award and we will be working hard in 2018 to continue bringing the Two Centuries of Indian Print project to the attention and use of researchers and the general public.

Submit a project for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library.

Posted by BL Labs on behalf of The Two Centuries of Indian Print team.

27 July 2017

A workshop on Optical Character Recognition for Bangla


I was fortunate enough to travel to Kolkata recently along with other members of the Two Centuries of Indian Print team, where we ran a workshop on ‘Developments with Optical Character Recognition for Bangla’. The event, which took place at Jadavpur University, proved an excellent forum to share knowledge in this area of growing interest, as reflected in the range of library professionals, academics and computer scientists who attended from ten institutions across Bengal and from the US.

Applying Optical Character Recognition (OCR) to printed texts is one of the key expectations of 21st century scholars and library users, who want to quickly find information online that accurately meets their research needs. Cultural institutions are gateways to millions of items containing knowledge that can transform modern research. The workshop built on our recently launched OCR Competition for Rare Indian Books and looked at the developments, challenges and opportunities of OCR in opening up vast quantities of knowledge to digital researchers.

Dr. Naira Khan from the University of Dhaka’s Computational Linguistics department kicked off the workshop by introducing the key processes behind OCR, including ‘pre-processing’ steps such as binarisation, which reduces a scanned page of text to its binary form to remove background noise, isolating only the text on the page. Skew detection, another pre-processing technique, corrects scans with angular text that can cause problems for OCR systems that require perfectly horizontal or vertical text. Dr. Khan moved on to explain how OCR systems segment pages into text and non-text regions, right down to pixel detection to recognise word boundaries. When it comes to recognising individual characters, Bangla script presents some unique challenges, with its vast range of compound characters, vowel signs and ligatures, not to mention the distinctive top line connecting characters known as the ‘Matra’. Breaking the characters into their geometric features such as lines, arcs and circles enables combinations of features to be formed, classified as characters and expressed in digital form as OCR output.
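
To make the binarisation step a little more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions rather than the method used by any particular OCR system: it assumes the scan is already a greyscale grid of values from 0 (black) to 255 (white), it picks a single global threshold using Otsu's method (a technique chosen purely for this example), and the function names otsu_threshold and binarise are hypothetical.

```python
def otsu_threshold(pixels):
    """Return a grey level (0-255) separating dark ink from light paper.

    Otsu's method: try every level and keep the one that maximises the
    variance between the 'dark' and 'light' groups of pixels.
    """
    histogram = [0] * 256
    for row in pixels:
        for value in row:
            histogram[value] += 1

    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        if weight_dark == 0:
            continue  # no pixels this dark yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark group
        sum_dark += level * histogram[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarise(pixels):
    """Turn a greyscale page into a binary one: 1 for ink, 0 for background."""
    threshold = otsu_threshold(pixels)
    return [[1 if value <= threshold else 0 for value in row] for row in pixels]


# A made-up 3x4 'scan': light paper (values near 250) with a dark stroke.
page = [
    [250, 245, 30, 240],
    [248, 20, 25, 251],
    [252, 249, 35, 247],
]

for row in binarise(page):
    print(row)
```

A single global threshold like this tends to struggle with unevenly faded or stained pages, which is one reason historical books are harder to process than clean modern print.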


Dr. Khan introducing the concepts of OCR

After Dr. Khan’s inspiring talk, attendees learned of the British Library’s particular challenge in searching for an OCR solution for our 19th century Bengali books currently being digitised, and the potential use of an OCR’d dataset for Digital Humanities researchers wanting to perform text and data mining. The books span an enormous range of genres, from works by religious missionaries to those covering food, science and works of fiction. Obtaining OCR would therefore enable automated searching and analysis of the full text across hundreds of thousands of pages, which could lead to exciting research discoveries in South Asian studies.

The event concluded with a practical session during which attendees used different OCR software on a sample of the BL’s digitised Bengali books. They experimented with Tesseract, Google Drive, i2ocr and newOCR. The general consensus was that Google Drive proved to be the most accurate! There are, however, other tools, such as Transkribus, that we have only just begun to try out and that may prove useful.
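
The accuracy judgements in the practical session were made by eye. For anyone wanting to put a number on a comparison like this, one common measure is the character error rate: the number of single-character edits needed to turn the OCR output into a hand-checked transcription, divided by the length of that transcription. The sketch below is only an illustration. The function names, the sample phrase and the two "tool" outputs are all made up for the example and are not results from the workshop.

```python
import unicodedata


def edit_distance(reference, hypothesis):
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a character
                current[j - 1] + 1,      # insert a character
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edits needed per reference character; 0.0 means a perfect match."""
    # Normalise first so that equivalent Unicode sequences compare equal.
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical data: a hand-checked phrase and two invented OCR outputs.
ground_truth = "বাংলা ভাষা"
ocr_outputs = {
    "tool_a": "বাংলা ভাষা",
    "tool_b": "বাংলা ভাসা",
}

for name, text in ocr_outputs.items():
    print(name, character_error_rate(ground_truth, text))
```

Note that Python counts Unicode code points, so a Bangla compound character made of several code points counts as several characters here; a comparison tuned to the script might group them differently.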

Workshop participants trying out various OCR tools

All in all, the workshop proved a really worthwhile exercise in widening knowledge among Indian institutions about the challenges and possible uses of OCR for Bangla. The work currently being undertaken by universities and technology centres using state-of-the-art machine learning techniques to perform text recognition will hopefully close the gap between Bangla (as well as other Indic scripts) and Latin scripts when it comes to efficient OCR tools.

 

This is a post by Tom Derrick, Digital Curator for the Two Centuries of Indian Print project.

17 July 2017

A Wonderland of Knowledge - Behind the Scenes of the British Library (Nadya Miryanova work experience)


Posted by Nadya Miryanova, BL Labs School Work Placement Student, currently studying at Lady Eleanor Holles, working with Mahendra Mahey, Manager of BL Labs.

Introduction to the British Library

Day 1

It was with a mixture of anticipation, curiosity and excitement that I opened the door to the staff entrance and started my two-week work placement in the world’s largest library. I have been placed with BL Labs in the Digital Scholarship department, where I am working with Mahendra Mahey (Project Manager of BL Labs) for the following two weeks. After the inescapable health and safety induction, I am now extremely well acquainted with the BL’s elaborate fire alarm system. Following lunch at the staff restaurant, Mahendra provided me with an introduction to the British Library and explained the work undertaken by BL Labs.

When most people hear the word ‘library’, conventional ideas typically spring to mind, including a copious number of books, and, of course, a disgruntled librarian ironically rather loudly encouraging silence every five minutes. I must admit that initially, my perspective was the same.

However, my viewpoint was soon to be completely turned around.

British Library interior

An extraordinary institution, the British Library is indeed widely known for its remarkable collection of books; it is home to around 14 million. However, contrary to popular belief, these are only a small part of the Library’s vast collections. In fact, the British Library has an extremely diverse range of items, from patents to musical scores, and from ancient artefacts dating as far back as 1000 BC to this morning’s newspapers, altogether giving a grand figure of approximately 200 million documented items. I was also delighted to discover that the British Library has the world’s largest collection of stamps! It is estimated that if somebody looked at 5 items each day, it would take an astonishing 80,000 years to see the whole of the BL’s collections.

I learnt that the objective of BL Labs is to encourage scholars, innovators, artists, entrepreneurs and educators to work with the Library's digital collections, supporting its mission to try to ensure that the wealth and diversity of the Library’s intellectual digital heritage is available for the research, creativity and fulfilment of everyone. At BL Labs, anyone is invited to address important research questions or ideas that use the Library’s digital content and data, by entering the annual Awards, becoming involved in a collaborative project, or even just using the collections in whatever way they want.

Although initially a little nervous when entering this immense institution, my fears evaporated completely, when on my very first day of working here, I was brought immediately into a friendly, welcoming atmosphere, promoted by the sincere kindness and interest that I was met with from each member of the Library's staff. 

The George the IV British Library book collection

Day 2

At precisely 9 o’clock in the morning, I found myself seated at my office desk, looking at the newly filled out Outlook calendar on my computer to see what new and exciting tasks I would be faced with that day and looking out for any upcoming events. My Tuesday consisted mostly of independent work at my desk, and after a quick catch-up with Mahendra at 9.30, where we discussed the working plan for the day and reviewed yesterday’s work, I sat down to start my second full day of work at the British Library.

British Library Labs leaflet

Between 2013 and 2016, British Library Labs held a competition, which looked for transformative project ideas that used the British Library’s digital collections and data in new and exciting ways. The BL Labs Awards recognise outstanding and innovative work that has been carried out using these collections. Mahendra had previously introduced me to the Labs Competition and Awards pages of the BL Labs website, and my main objective was to update the ideas and project submissions on these pages, specifically adding the remaining Competition 2016 entries, reviewing the 2015 and 2014 entries and checking that they were all complete with no entries missing. The competition entries can be accessed via the online archive.

This was an excellent opportunity for me to work on a new editing platform and further enhance my editing skills, which will doubtless prove very useful in everyday life as well as in the future. As I worked through editing and updating the pages, what struck me most was the incredible diversity of ideas within the competition entries. From a project exploring Black Abolitionists and their presence in Britain, to the proposed creation of a Victorian meme machine, and from a planned political meetings mapper to a suggested Alice in Wonderland bow tie design, each idea was entirely unique and original, despite the fact that each entry was adhering to the same brief. I was mesmerised by the amount of thought and careful planning that was evident in every submission; each one was intricately detailed and provided a thorough plan of work.

An example of a Victorian meme

After finishing lunch relatively early, I found myself with half an hour of my allocated break still left, and took the opportunity to explore the library. I walked down to the visitors’ entrance, and took a moment to admire the King’s Library, a majestic tower of books standing in the British Library's centre. Stepping closer, I was able to read some of the inscriptions on the spines of the books, and was delighted to see that one of them was a book of Catullus’ poetry, poetry that I had previously studied for Latin GCSE. The scope of knowledge that lies within this library is practically endless, and it led me to reflect on the importance of the work of BL Labs. I thought back to the competition entries; they prove that the possibilities for projects truly have no limit. BL Labs is able to give scholars, academics and students the opportunity to access some of these digital collections, such as books, very easily and in any part of the world. Without this access, many of the wonderful projects that the BL currently works on would not be possible.

With that thought fresh in my mind, I was brought back to reality, and returned to my desk to continue working, this time on my mini-project. My last task for the day involved brainstorming ideas for this project. A direct focus was soon established, and I decided to explore the Russian language titles in the 65,000 digitised 19th Century Microsoft books. Later on, I shall be writing a blog post detailing my experience of working on this project.

Day 3

As the Piccadilly line train arrived at St Pancras, I actually managed to step off and head in completely the right direction for the first time that week (needless to say, my sense of direction is not the best). Feeling rather proud of myself, I walked with a skip in my step, ready to immerse myself in whatever plan of work awaited today.

I looked at the schedule for the day and my heart leapt: I was to attend my first ever proper staff meeting. It was a very technical meeting, started off by the Head of Digital Scholarship, Adam Farquhar, who talked about current activities taking place in the Digital Scholarship department. Everyone made contributions to the general discussion in the meeting and Mahendra talked about the development of the BL Labs work and the progress made so far. It also provided me with an opportunity to talk about some of the things I was presently doing and I found that everybody was very receptive and supportive. I found it very interesting to be introduced to people who work in the same area on a day-to-day basis with the British Library and enjoyed hearing about all the different projects currently being undertaken.

SherlockNet web interface

I then began working on some YouTube transcription work on the winners of the 2016 BL Labs competition, the first one being SherlockNet. The SherlockNet team used convolutional neural networks to automatically tag and caption the British Library Flickr collection of digitised images taken largely from 19th Century books. If that doesn't sound impressive enough, consider the fact that this entry was submitted by three people, who were just 19 years old (undergraduate university students). My work involved listening carefully to each one of the interviews, and typing in a separate Word document exactly what Luda Zhao, Karen Wang and Brian Do were talking about. This Word document would then be used to make subtitles for the final film and would prove invaluable when creating a storyboard for the final cut-down interview.

BL poster
British Library Alice in Wonderland Poster

Day 4

As I turned the corner of Midland Road and stood to face the traffic lights, my gaze wandered over to the now familiar Alice in Wonderland poster that had ‘British Library’ printed on it in block capitals. I smiled as I looked up at the Cheshire cat that was perched neatly on top of the first 'I' in the words 'British Library' and the cat smiled back, revealing a wide toothy grin. Alice, likewise, was looking up at the Cheshire cat, and in that moment, her situation was made very credible to me. She was surrounded by this entirely new world of Wonderland, and in a similar way, I found myself in a parallel world of continuous acquisition of knowledge, as each day I was learning something new, with the British Library being the Wonderland. A wonderful and well-known literary extract from Lewis Carroll came to mind:

 “`Would you tell me, please, which way I ought to go from here?' (Alice)

`That depends a good deal on where you want to get to,' said the Cat.

`I don't much care where--' said Alice.

`Then it doesn't matter which way you go,' said the Cat.

`--so long as I get somewhere,' Alice added as an explanation.

`Oh, you're sure to do that,' said the Cat, `if you only walk long enough.'”

With this in mind, I briskly walked over to the doors of the office.

The beginning of my day consisted mostly of working on my own project, further classifying a sub-collection of Russian titles from the digitised collection of 65,000 books, mostly from the 19th century. I worked on further enhancing the organisation and categorisation of these books, establishing a clear methodical approach that began with sorting the books into two categories: fiction and non-fiction. Curiously, the majority of the titles were actually non-fiction. After an e-mail correspondence with Katya Rogatchevskaia, Lead Curator East European Collections, I discovered that most of the books that were part of the digitisation were acquired at the time when they were published, so they were selected by Katya’s distant predecessors, a fact I found remarkable.

Nicholas II abdication in Russian
The Act of Abdication of Nicholas II and his brother Grand Duke Michael,
published as a placard that would be distributed
by hand or pasted to walls (shelfmark: HS.74/1870),
an example of a Russian language title that is now digitised

For the second half of the day, I focussed once more on the YouTube transcription work and managed to finish transcribing the interviews for SherlockNet. I then discussed with Mahendra how I would storyboard the interviews in preparation for the film editing process. First, I would have to pick out the specific sections of the interview that were most suitable to use in the film, marking the exact timings from when the person started speaking to when they finished, and then place the series of timings in chronological order. I was also able to choose the music for the end product (possibly my favourite part!), and I based my selection on the mood of the videos and my perception of the characters of the individuals. I concluded my day by finding a no-copyright YouTube music page and discovered an assortment of possible music tracks. I managed to narrow down the selection to four possible soundtracks, which included titles such as ‘Spring in my Step’ and ‘Good Starts’.

Day 5

As I swiped my staff pass across the reader which permits access into the building, I checked my phone to see what the time was. It was 8.30am and, at the same moment, I caught sight of the date, Friday 14th July. I stopped in my tracks. Today marked my first full working week at the British Library; I could hardly believe how quickly the time had gone! It forcibly reminded me of the inscription on my clock at home, ‘tempus fugit’ (time flees), because if there’s one thing that has gone abnormally fast during my time at the BL, it’s time.

Hebrew manuscript
Digitised Hebrew Manuscript available through the British Library

In the morning, I attended a meeting discussing an event Mahendra is planning around the digitised Hebrew manuscripts, and I was lucky enough to meet Ilana Tahan, the Lead Curator of Hebrew and Christian Orient Collections. The meeting included a telephone call to Eva Frojmovic, an academic at the Centre for Jewish Studies in the School of Fine Art, History of Art and Cultural Studies at the University of Leeds. The discussion centred mostly on an event at which the BL would talk about its collection of digitised Hebrew manuscripts in order to promote their free use to the general public. The very beautiful Hebrew manuscripts could actually have a very wide target audience, perhaps reaching outside the academic learning sphere and having the potential to be used in the creative/artistic space.

Contrary to popular belief, the collection of 1,302 digitised manuscripts can be used by anyone and everyone, leading to exciting possibilities and new projects. The amazing thing about the digital collections is that they make it possible for someone who does not live in London to access them, wherever they may be in the world; the manuscripts can be viewed digitally and used to enhance any learning experience, ranging from seminars or lessons to PhD research projects. The actual hard copies of the manuscripts can also, of course, be accessed in the British Library. The structure and timings of the event were discussed, and a date was set for the next meeting and for the event. To finish the meeting, Mahendra offered an explanation of the handwriting recognition transcription process for the manuscripts. There are 22 letters in the Hebrew alphabet, and each individual handwritten letter is recognised as a shape by the computer, though it's important that the computer has ground truth (i.e. examples of human-transcribed manuscripts). Each letter and word is recognised and processed, and the software very cleverly converts the original handwritten Hebrew script into computerised Hebrew script. This would then allow someone to search for words in the manuscript easily and quickly using a computerised search tool.
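To make the idea of "recognising a letter as a shape" a little more concrete, here is a deliberately tiny sketch in Python. It is not the software Mahendra described: the 3x3 "glyphs", the three letters chosen and the nearest-match rule are all invented for illustration, and real handwriting recognition relies on far richer models trained on many transcribed pages.

```python
# Toy illustration only: the bitmaps and the matching rule are assumptions
# made up for this example, not the British Library's actual process.

def hamming(a, b):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))

# "Ground truth": shapes a person has already transcribed (hypothetical data).
GROUND_TRUTH = {
    "\u05d0": "101010101",  # alef
    "\u05d1": "111001111",  # bet
    "\u05d2": "010010011",  # gimel
}

def recognise(glyph):
    """Return the ground-truth letter whose shape is closest to `glyph`."""
    return min(GROUND_TRUTH, key=lambda letter: hamming(GROUND_TRUTH[letter], glyph))

def transcribe(glyphs):
    """Turn a sequence of glyph bitmaps into a searchable text string."""
    return "".join(recognise(g) for g in glyphs)

# A "handwritten" word: each glyph is a slightly noisy copy of a known shape.
page = ["101010111", "111001011", "010010011"]
text = transcribe(page)

# Once the page is plain text, an ordinary string search works on it.
found = "\u05d1" in text
print(text, found)
```

The point of the sketch is the last step: as soon as the shapes have been mapped to computerised letters, searching the manuscript becomes an ordinary text search.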

Ilana looking at manuscripts
Ilana Tahan, Lead Curator of Hebrew and Christian Orient Collections,
looking through Hebrew manuscripts

For the majority of the afternoon, I was floating between a variety of different projects, doing more work on the YouTube transcriptions and enhancing my mini-project, as well as creating a table of the outstanding blogs that still had to be published on the British Library's Digital Scholarship blog.

At the end of the day, I reviewed my first week with Mahendra, evaluating the progress that I had made. Throughout the week, I feel that I have enhanced and developed a number of invaluable skills, and have gained an incredible insight into the working world.

I will be writing about my second week, as well as my mini-project soon, so please come and visit this blog again if you are interested to find out more about some of the work being done at the British Library.

 

 

22 March 2017

British Library Launches OCR Competition for Rare Indian Books

Calling all transcription enthusiasts! We’ve launched a competition to find an accurate and automated transcription solution for our rare Indian books and printed catalogue records, currently being digitised through the Two Centuries of Indian Print project. 

The competition, in partnership with the University of Salford’s PRIMA Research Lab, is part of the International Conference on Document Analysis and Recognition, taking place in Kyoto, Japan this November. The winners will be announced at a special event during the conference.

Digitised images of the books will be made openly available through the library’s website and we hope this competition will produce transcriptions that enable full text search and discovery of this rich material. Sharing XML transcriptions will also give researchers the foundation to apply computational tools and methods such as text mining that may lead to new insights into book and publishing history in India.   

The competition is split into two challenges, and those wishing to participate can enter either or both.

The first challenge is to find an automated transcription for the 19th century printed books written in Bengali script. Optical Character Recognition of many non-Latin scripts is a developing area, but still presents a considerable barrier for libraries and other cultural institutions hoping to open up their material for scholarly research.

Above: A page from 'Animal Biography', one of the Bengali books being digitised as part of Two Centuries of Indian Print (VT 1712)

 

Challenge number two involves our printed catalogue records, known as ‘Quarterly Lists’. These describe books published in India between 1867 and 1967. The lists are arranged in tables, and therefore accurately representing the layout of the data is important if researchers are to be able to use computational methods to identify chunks of information such as the place of publication and cost of the book.

 Above: A typical double page from the Quarterly Lists (SV 412/8)

 

With the competition now open, we’ve already gone some way to helping participants by manually transcribing a few pages to create ‘ground truth’ using PRIMA's editing tool, Aletheia.  You can watch a video introducing the competition. So if you or anyone you know would like to enter, do please register and you could be contributing to this landmark project, and picking up an award for your troubles!   

24 January 2017

Publication of Quarterly Lists: Catalogues of Indian Books

The Two Centuries of Indian Print project is pleased to announce the online availability of some wonderful catalogues held by the library, generally known as the Quarterly Lists. They record books published quarterly and by province of British India between 1867 and 1947.

Digitised for the first time, the Quarterly Lists can now be accessed as searchable PDFs via the British Library's datasets portal, data.bl.uk. Researchers will be able to examine rich bibliographic data about books published throughout India, including the names and addresses of printers and publishers, publication price and how many copies were sold.

 

SV_412_8_1875-78_0003

 

Our next steps will be to OCR the Quarterly Lists to create ALTO XML for every page, which is designed to show accurate representations of the content layout. This will allow researchers to apply computational tools and methods to look across all of the lists to answer their questions about book history. So if a researcher is interested in what the history of book publishing reveals about a particular time period and place, we would like to make that possible by giving them full access to this dataset.

To get to this point however, we will have to overcome the layout challenge that the Quarterly Lists present. Across all of the lists we have found a few different layout styles which are rather tricky for OCR solutions to handle meaningfully. Note for instance how the list below compares to the one from the Calcutta Gazette above. Through the Digital Research strand of the project we will be seeking out innovative research groups willing to take a crack at improving the OCR quality and accuracy of tabular text extraction from the Quarterly Lists. 
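As a rough illustration of why layout information matters for tabular material, the Python sketch below reads a small, invented ALTO fragment and uses the horizontal position of each word to decide which column it belongs to. The ALTO v2 namespace, the sample place names and years, and the single column boundary at 400 are assumptions chosen for the example; the real Quarterly Lists files may use a different ALTO version and will certainly need more careful column detection.

```python
import xml.etree.ElementTree as ET

# Invented sample data in ALTO-style markup (not taken from the Quarterly Lists).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1">
      <TextLine>
        <String CONTENT="Calcutta" HPOS="120" VPOS="300" WIDTH="90" HEIGHT="20"/>
        <String CONTENT="1875" HPOS="620" VPOS="300" WIDTH="50" HEIGHT="20"/>
      </TextLine>
      <TextLine>
        <String CONTENT="Bombay" HPOS="118" VPOS="340" WIDTH="80" HEIGHT="20"/>
        <String CONTENT="1876" HPOS="622" VPOS="340" WIDTH="50" HEIGHT="20"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""

# Assumed namespace: ALTO v2. Adjust if the files use another version.
NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

def rows_from_alto(xml_text, column_edges):
    """Group the words of each text line into columns by horizontal position.

    `column_edges` lists the x-coordinates where a new column starts;
    these would have to be worked out per layout style.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for line in root.iterfind(".//alto:TextLine", NS):
        cells = [[] for _ in range(len(column_edges) + 1)]
        for word in line.iterfind("alto:String", NS):
            hpos = float(word.get("HPOS"))
            # The column index is the number of edges the word lies at or beyond.
            column = sum(hpos >= edge for edge in column_edges)
            cells[column].append(word.get("CONTENT"))
        rows.append([" ".join(cell) for cell in cells])
    return rows

table = rows_from_alto(SAMPLE_ALTO, column_edges=[400])
for row in table:
    print(row)
```

With plain text alone the place and the year would simply run together on one line; it is the coordinates recorded for each word that let a researcher pull them apart into separate fields.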

The Quarterly Lists available on data.bl.uk are out of copyright and openly licensed for reuse. If you or anyone you know is interested in using the Quarterly Lists in your research or simply wants to find out more about them, feel free to drop me an email at Tom.Derrick@bl.uk, or follow the project @BL_IndianPrint

You can read more about the history of the Quarterly Lists in a previous blog I wrote last year.