THE BRITISH LIBRARY

Digital scholarship blog

12 posts categorized "Rare books"

18 February 2019

Updated Eighteenth-Century Collections Online


The traditional, somewhat stereotypical image of the researcher of things past has not changed much in recent times. There is nothing easier than to imagine a scholar sitting at a scarcely illuminated wooden desk, surrounded by piles of old hardbound volumes, spending hours on end rummaging through the sheets in search of a clue.

In the field of eighteenth-century studies, this is certainly still the case. Scholars often go on a pilgrimage to prestigious repositories such as the British Library. However, in the last fifteen years or so, technology has started to offer attractive alternatives to the pleasure of travelling to London. Powered by Gale-Cengage, the Eighteenth-Century Collections Online (commonly referred to as ECCO) is a well-known resource that provides access to English-language and foreign-language publications printed in Britain, Ireland and the American colonies during the eighteenth century. This extensive collection contains over 180,000 titles (200,000 volumes) and allows full-text searching of some 32 million pages. These are digital editions based on the Eighteenth Century microfilming that started in 1981 and the English Short Title Catalogue.

New ECCO home page

Moving away from its classic web-1.0 design, the Gale-Cengage team recently decided to revamp the layout of ECCO – indeed, of their entire portfolio of archive products, which include among others the Seventeenth- and Eighteenth-Century Burney Newspapers Collection. The aim is to make the Gale Primary Sources experience more consistent and intuitive for the user. At the head of this delicate operation are product managers Doran Steele and Megan Sullivan, who lead a nine-person team of software developers, content engineers, researchers and designers. Not quite the IT-only type of personnel, Doran and Megan are scholars themselves, respectively holding degrees in History and Information Science and a remarkable passion for all things past. They are responsible for the maintenance of the existing ECCO interface, as well as the development of the upcoming design refresh.

During a recent interview they gave to the authors of this post, Doran and Megan declared their objective of evolving ECCO in line ‘with user expectations of modern online research experiences’. Their driving force was stated very clearly as a bottom-up process. ‘This redesign’, they explained, ‘is informed by user feedback and market research’. A beta version of the new site has been available since the second half of 2018 to enable the Gale-Cengage team to gather feedback about the new design. The product managers specified that the final transition to the ‘new’ ECCO will only be completed once they feel confident that the new experience ‘successfully meets the needs of our users’. The final goal is a better user experience, ‘one that is faster and more intuitive’. To achieve this, a range of new features has been included, such as more filters on search results; results more relevant to the search queries; data visualization tools; improved subject indexing; more options for adjusting the image; and the ability to download the OCR (optical character recognition) version of a volume in a text format. This last feature will be a particularly welcome innovation for scholars who often need to look up the occurrences of a single word or cut and paste long chunks of text.

New ECCO search results screen

The options for adjusting the page view are another significant novelty. The beta version boasts new settings to quickly select the preferred zoom level, as well as sliders to increase or decrease the brightness and contrast of the page. These improvements are particularly welcome considering that the quality of the scans remains unchanged. The page quality is not directly related to ECCO. The portal simply allows the consultation of the digitised microfilms included in the first collection (also known as ECCO 1, comprising over 154,000 texts) and the digitisation of a second, smaller collection of books (ECCO 2, over 52,000 titles). This raises an important issue. A plethora of relatively unknown, yet precious eighteenth-century material remains difficult to consult because, on top of the uneven quality in the texts that came out of eighteenth-century printing presses, the original microfilming technology that was employed for the first collection yielded relatively low-resolution results. This causes some hiccups with OCR, thus discouraging the use of quantitative methodologies. But the issue is all the more salient when the category of eighteenth-century visuals is taken into account. At a time when British engravers multiplied in numbers to illustrate the newly-discovered wonders of the natural world or the archaeological remains of Roman cities in England, illustrations became an essential aspect of the eighteenth-century book market and reading experience. While for essential texts such as William Stukeley’s Itinerarium curiosum (1724) or Eleazar Albin and William Derham’s A Natural History of Birds (1734) more refined scans can be found elsewhere, a large number of texts are digitally available only through ECCO 1.

Scholars interested in images must either focus on well-known texts that have been digitised by other providers – with serious consequences in terms of canonicity – or plan a visit to major libraries to consult the relevant volumes in person, somewhat defeating the very idea of digital reading. Either way, the study of visual culture is inhibited. Nevertheless, the ‘new’ ECCO promises to enhance the user experience and to offer even more opportunities to engage with outstanding repositories of primary material. If you have already had a chance to use the new version, we encourage you to get in touch with Doran and Megan, as your feedback and suggestions can improve ECCO even further.

New ECCO image viewer screen

This post is by Alessio Mattana, Teaching Assistant in Eighteenth-Century Literature at the University of Leeds (on Twitter as @mattanaless), and Dr Giacomo Savani, Teaching Assistant in Ancient History at the University of Leeds (on Twitter as @GiacomoSavani).

29 October 2018

Using Transkribus for automated text recognition of historical Bengali Books


In this post Tom Derrick, Digital Curator, Two Centuries of Indian Print, explains the Library's recent use of Transkribus for automated text recognition of Bengali printed books.

Are you working with digitised printed collections that you want to 'unlock' for keyword search and text mining? Maybe you have already heard about Transkribus but thought it could only be used for automated recognition of handwritten texts. If so, you might be surprised to hear it also does a pretty good job with printed texts. You might be even more surprised to hear it does an impressive job with printed texts in Indian scripts! At least that is what we have found from recent testing with a batch of 19th century printed books written in Bengali script that have been digitised through the British Library’s Two Centuries of Indian Print project.

Transkribus is a READ project and available as a free tool for users who want to automate recognition of historical documents. The British Library has already had some success using Transkribus on manuscripts from our India Office collection, and it was that which inspired me to see how it would perform on the Bengali texts, which provide an altogether different type of challenge.

For a start, most text recognition solutions either do not support Indian scripts, or do not reach close to the same level of recognition as they do with documents written in English or other Latin scripts. In part this is down to supply and demand. Mainstream providers of tools have prioritised Western customers, yet there is also the relative lack of digitised Indian texts that can be used to train text recognition engines.

These text recognition engines have also been well trained on modern dictionaries, and a collection of historical texts like the Bengali books will often contain words which are no longer in use. Their aged physicality also brings with it the delights of faded print, blotchy paper and other paper-based gremlins that keep conservationists in work yet disrupt automated text recognition. Throw in an extensive alphabet that contains more diverse and complicated character forms than English and you can start to piece together how difficult it can be to train recognition engines to achieve comparable results with Bengali texts.

So it was with more hope than expectation that I approached Transkribus. We began by selecting 50 pages from the Bengali books, representing as far as possible the variety of typographical and layout styles within the wider collection of c. 500,000 pages. Not an easy task! We uploaded these to Transkribus, manually segmenting paragraphs into text regions and automating line recognition. We then manually transcribed the texts to create a ground truth which, together with the scanned page images, was used to train the recurrent neural network within Transkribus to create a model from the 5,700 transcribed words.

View of a segmented page from one of the British Library's Bengali books along with its transcription, within the Transkribus viewer.

The model was tested on a few pages from the wider collection and the results are shown in the graph below. The model achieved an average character error rate (CER) of 21.9%, which is comparable to the best results we have seen from other text recognition services. Word accuracy of 61% was based on the number of words that were misspelled in the automated transcription compared to the ground truth. Eventually we would like to use automated transcriptions to support keyword searching of the Bengali books online, and the higher the word accuracy, the greater the chances of users retrieving all relevant hits from their keyword search. We noticed the results often missed the upper zone of certain Bengali characters, i.e. the part of the character or glyph which resides above the matra line that connects characters in Bengali words. Further training focused on recognition of these characters may improve the results.

Graph showing the learning curve of the Bengali model using the Transkribus HTR tool.
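
For readers who want to produce figures of this kind for their own transcriptions, here is a minimal Python sketch of how a character error rate and a word accuracy can be computed from a ground truth and an automated transcription. It is not the calculation Transkribus itself performs, which this post does not describe. It assumes the common definition based on edit distance, and the function names are purely illustrative.

```python
def levenshtein(reference, candidate):
    """Minimum number of insertions, deletions and substitutions needed
    to turn `reference` into `candidate` (strings or lists of words)."""
    previous = list(range(len(candidate) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, cand_item in enumerate(candidate, start=1):
            cost = 0 if ref_item == cand_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, transcription):
    """Edit distance between the two texts divided by the ground truth length.

    Note: Python compares strings code point by code point, so a Bengali
    conjunct or vowel sign counts as several 'characters' here.
    """
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, transcription) / len(ground_truth)


def word_accuracy(ground_truth, transcription):
    """Share of ground truth words recovered, using word-level edit distance.

    Assumption: this is one common definition; other tools may simply count
    the words that differ.
    """
    truth_words = ground_truth.split()
    if not truth_words:
        raise ValueError("ground truth must contain at least one word")
    errors = levenshtein(truth_words, transcription.split())
    return max(0.0, 1 - errors / len(truth_words))
```

Calling character_error_rate(ground_truth_text, ocr_text) on a page you have transcribed by hand gives a value between 0 and (occasionally) more than 1, which can be multiplied by 100 to compare with the percentages quoted in this post.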

Our training set of 50 pages is very small compared to other projects using Transkribus and so we think the accuracy could be vastly improved by creating more transcriptions and re-training the model. However, we're happy with these initial results and would encourage others in a similar position to give Transkribus a try.

 

 

08 May 2018

The Italian Academies database – now available in XML


Dr Mia Ridge writes: in 2017, we made XML and image files from a four-year, AHRC-funded project, The Italian Academies 1525-1700, available through the Library's open data portal. The original data structure was quite complex, so we would be curious to hear feedback from anyone reusing the converted form for research or visualisations.

In this post, Dr Lisa Sampson, Reader in Early Modern Italian Studies at UCL, and Dr Jane Everson, Emeritus Professor of Italian literature, RHUL, provide further information about the project...

New research opportunities for students of Renaissance and Baroque culture! The Italian Academies database is now available for download. It's in a format called XML which represents the original structure of the database.
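
As a possible first step with a download like this, the Python sketch below counts how often each element name occurs in an XML file, which is a quick way to get an overview of an unfamiliar structure before writing anything more specific. It deliberately assumes nothing about the real schema of the Italian Academies files. The small sample and its element names (records, academy, name, city) are invented for illustration only.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(xml_source):
    """Count element names in an XML file (path or binary file object).

    iterparse streams the file, so large exports do not have to be
    loaded into memory in one go.
    """
    counts = Counter()
    for _event, element in ET.iterparse(xml_source, events=("end",)):
        # Drop any "{namespace}" prefix so only the local name is counted.
        counts[element.tag.rsplit("}", 1)[-1]] += 1
        element.clear()  # release the element's content once counted
    return counts


# Hypothetical sample; NOT the real structure of the Italian Academies XML.
SAMPLE = b"""<records>
  <academy><name>Intronati</name><city>Siena</city></academy>
  <academy><name>Incogniti</name><city>Naples</city></academy>
</records>"""

if __name__ == "__main__":
    for tag, number in count_tags(io.BytesIO(SAMPLE)).most_common():
        print(tag, number)
```

To try it on a real file, pass the path of the downloaded XML to count_tags instead of the in-memory sample.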

This dedicated database results from an eight-year project, funded by the Arts and Humanities Research Council UK, and provides a wealth of information on the Italian learned academies. Around 800 such institutions flourished across the peninsula over the sixteenth and seventeenth centuries, making major contributions to the cultural and scientific debates and innovations of the period, as well as forming intellectual networks across Europe. This database lists a total of 587 Academies from Venice, Padua, Ferrara, Bologna, Siena, Rome, Naples, and towns and cities in southern Italy and Sicily active in the period 1525-1700. Also listed are more than 7,000 members of one or more academies (including major figures like Galileo, as well as women and artists), and almost 1,000 printed works connected with academies held in the British Library. The database therefore provides an essential starting point for research into early modern culture in Italy and beyond. It is also an invitation to further scholarship and data collection, as these totals constitute only a fraction of the data relating to the Academies.

Laura Terracina, nicknamed Febea, of the Accademia degli Incogniti, Naples

The database is designed to permit searches from many different perspectives and to allow easy searching across categories. In addition to the three principal fields – Academies, People, Books – searches can be conducted by title keyword, printer, illustrator, dedicatee, censor, language, gender and nationality, among others. The database also lists and illustrates the mottoes and emblems of the Academies (where known) and similarly of individual academy members. Illustrations from the books entered in the database include frontispieces, colophons, and images from within texts.

Emblem of the Accademia degli Intronati, Siena


The database thus aims to promote research on the Italian Academies in disciplines ranging from literature and history, through art, science, astronomy, mathematics, printing and publishing, censorship and politics, to religion and philosophy.

The Italian Academies project, which created this database, began in 2006 as a collaboration between the British Library and Royal Holloway University of London, funded by the Arts and Humanities Research Council and led by Jane Everson. The objective was the creation of a dedicated resource on the publications and membership of the Italian learned Academies active in the period between 1525 and 1700. The software for the database was designed in-house by the British Library and the first tranche of data was completed in 2009, listing information for academies in four cities (Naples, Siena, Bologna and Padua). A second phase, listing information for many more cities, including in southern Italy and Sicily, developed the database further between 2010 and 2014, with a major research grant from the AHRC and collaboration with the University of Reading.

The exciting possibilities now opened up by the British Library’s digital data strategy look set to stimulate new research and collaborations by making the records even more widely available, and easily downloadable, in line with Open Access goals. The Italian Academies team is now working to develop the project further with the addition of new data, and the incorporation into a hub of similar resources.

The Italian Academies project team members welcome feedback on the records and on the adoption of the database for new research (contact: www.italianacademies.org).

The original database remains accessible at http://www.bl.uk/catalogues/ItalianAcademies/Default.aspx 

An Introduction to the database, its aims, contents and objectives is available both at this site and at the new digital data site: https://data.bl.uk/iad/

Jane E. Everson, Royal Holloway University of London

Lisa Sampson, University College, London

22 March 2017

British Library Launches OCR Competition for Rare Indian Books


Calling all transcription enthusiasts! We’ve launched a competition to find an accurate and automated transcription solution for our rare Indian books and printed catalogue records, currently being digitised through the Two Centuries of Indian Print project. 

The competition, in partnership with the University of Salford’s PRIMA Research Lab, is part of the International Conference on Document Analysis and Recognition, taking place in Kyoto, Japan this November. The winners will be announced at a special event during the conference.

Digitised images of the books will be made openly available through the library’s website and we hope this competition will produce transcriptions that enable full text search and discovery of this rich material. Sharing XML transcriptions will also give researchers the foundation to apply computational tools and methods such as text mining that may lead to new insights into book and publishing history in India.   

The competition is split into two challenges, and participants can enter either or both.

The first challenge is to find an automated transcription for the 19th century printed books written in Bengali script. Optical Character Recognition of many non-Latin scripts is a developing area, but still presents a considerable barrier for libraries and other cultural institutions hoping to open up their material for scholarly research.


Above: A page from 'Animal Biography', one of the Bengali books being digitised as part of Two Centuries of Indian Print (VT 1712)

 

Challenge number two involves our printed catalogue records, known as ‘Quarterly Lists’. These describe books published in India between 1867 and 1967. The lists are arranged in tables, so accurately representing the layout of the data is important if researchers are to be able to use computational methods to identify chunks of information such as the place of publication and cost of the book.


 Above: A typical double page from the Quarterly Lists (SV 412/8)

 

With the competition now open, we’ve already gone some way to helping participants by manually transcribing a few pages to create ‘ground truth’ using PRIMA's editing tool, Aletheia.  You can watch a video introducing the competition. So if you or anyone you know would like to enter, do please register and you could be contributing to this landmark project, and picking up an award for your troubles!   

09 March 2017

Archaeologies of reading: guest post from Matthew Symonds, Centre for Editing Lives and Letters


Digital Curator Mia Ridge: today we have a guest post by Matthew Symonds from the Centre for Editing Lives and Letters on the Archaeologies of reading project, based on a talk he did for our internal '21st century curatorship' seminar series. Over to Matt...

Some people get really itchy about the idea of making notes in books, and dare not defile the pristine printed page. Others leave their books a riot of exclamation marks, sarcastic incredulity and highlighter pen.

Historians – even historians disciplined by spending years in the BL’s Rare Books and Manuscripts rooms – would much prefer it if people did mark books, preferably in sentences like “I, Famous Historical Personage, have read this book and think the following having read it…”. It makes it that much easier to investigate how people engaged with the ideas and information they read.

Brilliantly for us historians, rare books collections are filled with this sort of material. The problem is it’s also difficult to catalogue and make discoverable (nota bene – it’s hard because no institutions could afford to employ and train sufficient cataloguers, not because librarians don’t realise this is an issue).

The Archaeology of Reading in Early Modern Europe (AOR) takes digital images of books owned and annotated by two renaissance readers, the professional reader Gabriel Harvey and the extraordinary polymath John Dee, transcribes and translates all the comments in the margin, and marks up all traces of a reader’s intervention with the printed book and puts the whole thing on the Internet in a way designed to be useful and accessible to researchers and the general public alike.

Screenshot, The Archaeology of Reading in Early Modern Europe

AOR is a digital humanities collaboration between the Centre for Editing Lives and Letters (CELL) at University College London, Johns Hopkins University and Princeton University, and generously funded by the Andrew W. Mellon Foundation.

More importantly, it’s also a collaboration between academic researchers, librarians and software engineers. An absolutely vital consideration of how we planned AOR, how we work on it, how we’re planning to expand it, was to identify a project that could offer a common ground to be shared between these three interests, where each party would have something to gain from it.

As one of the researchers, it was really important to me to avoid forming some sort of “client-provider” relationship with the librarians who curate and know so much about my sources, and the software engineers who build the digital infrastructure.

But we do use an academic problem as a means of giving our project a focus. In 1990, Antony Grafton and the late Lisa Jardine published their seminal article ‘“Studied for Action”: How Gabriel Harvey Read His Livy’ in the journal Past & Present.

One major insight of the article is that people read books in conjunction with one another, often for specific, pragmatic purposes. People didn’t pick up a book from their shelves, open at page one and proceed through to the finis, marking up as they went. They put other books next to them, books that explained, clarified, argued with one another.

By studying the marginalia, it’s possible to reconstruct these pathways across a library, recreating the strategies people used to manage the vast quantities of information they had at their disposal.

In order to produce this archaeology of reading, we’ve built a “digital bookwheel”, an attempt to recreate the revolving reading desk of the renaissance period which allowed the lucky owner to manoeuvre back and forth between their books. From here, the user can call up the books we’ve digitised, read the transcriptions, and search for particular words and concepts.

Screenshot, The Archaeology of Reading in Early Modern Europe


It’s built out of open source materials, leveraging the International Image Interoperability Framework (IIIF) and the IIIF-compliant Mirador 2 Viewer. Interested parties can download the XML files of our transcriptions, as well as the data produced in the process.

The exciting thing for us is that all the work on creating this digital infrastructure – which is very much a work in progress – has provided us with the raw materials for asking new research questions, questions that can only be asked by getting away from our computers and returning to the rare books room.

24 January 2017

Publication of Quarterly Lists: Catalogues of Indian Books


The Two Centuries of Indian Print project is pleased to announce the online availability of some wonderful catalogues held by the library, generally known as the Quarterly Lists. They record books published quarterly and by province of British India between 1867 and 1947.

Digitised for the first time, the Quarterly Lists can now be accessed as searchable PDFs via the British Library's datasets portal, data.bl.uk. Researchers will be able to examine rich bibliographic data about books published throughout India, including the names and addresses of printers and publishers, publication price and how many copies were sold.

 

SV_412_8_1875-78_0003

 

Our next steps will be to OCR the Quarterly Lists to create ALTO XML for every page, which is designed to show accurate representations of the content layout. This will allow researchers to apply computational tools and methods to look across all of the lists to answer their questions about book history. So if a researcher is interested in what the history of book publishing reveals about a particular time period and place, we would like to make that possible by giving them full access to this dataset.
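
As a rough illustration of what ALTO XML makes possible, the Python sketch below reads the words of each text line together with their horizontal positions, then splits a line into table cells wherever there is a wide gap between words. The element and attribute names used (TextLine, String, CONTENT, HPOS, WIDTH) come from the ALTO standard. Everything else is an assumption made for the example: the sample page content is invented, and the gap threshold would need tuning for real Quarterly Lists pages.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any "{namespace}" prefix, so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_source):
    """Return one list per TextLine, holding (hpos, width, text) for each word."""
    lines = []
    for element in ET.parse(xml_source).iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            (float(child.get("HPOS", 0)), float(child.get("WIDTH", 0)),
             child.get("CONTENT", ""))
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(sorted(words))  # left-to-right order
    return lines


def split_into_cells(words, min_gap):
    """Group the words of one line into cells, starting a new cell whenever
    the horizontal gap to the previous word exceeds `min_gap`."""
    cells, current, right_edge = [], [], None
    for hpos, width, text in sorted(words):
        if right_edge is not None and hpos - right_edge > min_gap:
            cells.append(" ".join(current))
            current = []
        current.append(text)
        right_edge = hpos + width
    if current:
        cells.append(" ".join(current))
    return cells


# Invented sample line; not taken from the Quarterly Lists.
SAMPLE_ALTO = b"""<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
 <Layout><Page><PrintSpace><TextBlock><TextLine>
  <String HPOS="100" WIDTH="80" CONTENT="Example"/>
  <String HPOS="190" WIDTH="120" CONTENT="Title"/>
  <String HPOS="600" WIDTH="90" CONTENT="Calcutta"/>
  <String HPOS="1000" WIDTH="20" CONTENT="8"/>
  <String HPOS="1030" WIDTH="60" CONTENT="annas"/>
 </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    for line in alto_lines(io.BytesIO(SAMPLE_ALTO)):
        print(split_into_cells(line, min_gap=100))
```

A fixed gap threshold is the simplest possible rule. Pages with ruled columns or wrapped cell text would probably need something more robust, such as clustering word positions across the whole page.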

To get to this point however, we will have to overcome the layout challenge that the Quarterly Lists present. Across all of the lists we have found a few different layout styles which are rather tricky for OCR solutions to handle meaningfully. Note for instance how the list below compares to the one from the Calcutta Gazette above. Through the Digital Research strand of the project we will be seeking out innovative research groups willing to take a crack at improving the OCR quality and accuracy of tabular text extraction from the Quarterly Lists. 

The Quarterly Lists available on data.bl.uk are out of copyright and openly licensed for reuse. If you or anyone you know is interested in using the Quarterly Lists in your research or simply wants to find out more about them, feel free to drop me an email at Tom.Derrick@bl.uk or follow the project @BL_IndianPrint

You can read more about the history of the Quarterly Lists in a previous blog post I wrote last year.

03 November 2016

Quarterly Lists: Digitally Researching Catalogues of Indian Books


As well as digitising rare early printed Indian books, the Two Centuries of Indian Print project is making available online some wonderful catalogues held by the library, generally known as the Quarterly Lists, recording all books published quarterly and by province of British India between 1867 and 1947.

The catalogues will complement the Bengali printed books and I’d like to use this blog to share a bit more about what the Quarterly Lists are and what we are doing to make them as accessible as possible for researchers of book history who want to apply digital research methods to explore their rich contents.

Firstly, a little more about the origins of these catalogues. With the passing of The (Indian) Press and Registration of Books Act, 1867 it became mandatory for all books published in provinces of British India to be sent to the provincial secretariat library for registration. Both the India Office Library and the British Museum Library in London, later to be united in the British Library’s collection, were separately given the privilege of requesting books from these lists free of charge in what amounted to a colonial legal deposit arrangement. The act was passed with the aim of recording the ever growing number of publications originating from the various printing presses throughout India, its purpose political as well as archival. Not all works that issued from the presses were recorded in the lists and only a small percentage were actually deposited in the London collections. The library curators in London selected only those works which they thought were important or interesting. The Quarterly Lists were originally published as appendices in the official provincial newspapers, such as the Calcutta Gazette (below).

  SV_412_8_1875-78_0003

 

SV_412_8_1875-78_0004

 

Although Independence brought an end to the arrangement for depositing publications with the India Office Library and British Museum Library, the practice of publishing catalogues of registered printed books continued until the late 1960s.

Now digitised for the first time, the Quarterly Lists will be made available as searchable PDFs via the British Library's new datasets portal, data.bl.uk, in November. Researchers will be able to examine rich bibliographic data about books published throughout India, including the names and addresses of printers and publishers. If you are interested in accessing this collection please contact Tom.Derrick@bl.uk

Our next steps will be to OCR the Quarterly Lists to create ALTO XML for every page, which is designed to show accurate representations of the content layout. This will allow researchers to apply computational tools and methods to look across all of the lists to answer their questions about book history. So if a researcher is interested in what the history of book publishing reveals about a particular time period and place, we would like to make that possible by giving them full access to this dataset.

To get to this point however, we will have to overcome the layout challenge that the Quarterly Lists present. Across all of the lists we have found a few different layout styles which are rather tricky for OCR solutions to handle meaningfully. Note for instance how the list below compares to the one from the Calcutta Gazette above. Through the Digital Research strand of the project we will be seeking out innovative research groups willing to take a crack at improving the OCR quality and accuracy of tabular text extraction from the Quarterly Lists. 

  SV_412_8_1935_0016

If you or anyone you know is interested in using the Quarterly Lists in your research or simply wants to find out more about them, feel free to drop me an email at Tom.Derrick@bl.uk or follow the project @BL_IndianPrint

 

04 July 2016

Two Centuries of Indian Print: Enhancing Scholarly Research


Tom Derrick will be working as a Digital Curator within the Digital Research Team at the British Library on a project titled ‘Two Centuries of Indian Print’. This project will digitise rare Bengali printed books and provide opportunities for innovative research at the intersection of Digital Humanities and South Asian studies. He tweets @tommyid83, and can also be contacted by email at Tom.Derrick@bl.uk.

 

Only a week into my new role I can already see the benefits of the work that the digital research team delivers. I attended a fascinating presentation of the two latest BL Lab award-winning projects. I was impressed to see how young researchers are collaborating with the digital research team here to find innovative methods to open up new avenues for their own research as well as for other academics and the general public.      

I have joined the British Library from a digital publisher of historical primary sources and am excited to use my experience engaging with researchers to facilitate academic interrogation of the Two Centuries of Indian Print project data. This two-year pilot will make digitised Bengali books, drawn from the extensive South Asian printed book collection at the British Library along with a selection from SOAS, freely available online. The books digitised as part of the pilot will span 1801-1867, the bulk of which are religious tracts. It is part of a wider initiative by the British Library to catalogue and make available printed Indian books in 22 South Asian languages, covering 1714-1914.

Ab haval, a poetical account in Gujarati on the disastrous floods at Ahmadabad, 1875

 

Over the course of the next two years, I'll be engaging with researchers, particularly in the fields of South Asian studies and Digital Humanities, to explore the opportunities and challenges involved in applying digital research methods and tools to this newly digitised collection. A key area I'll be looking at is how to ensure the metadata and digitised text produced will cater to the needs and interests of an academic community interested in performing large-scale data analysis. This will involve finding an optimal solution to making the Bengali script machine readable so the full text can be searched and ‘mined’ by researchers. We'll also be developing a series of workshops to help academics and professionals from Indian institutions, particularly in the GLAM (Galleries, Libraries, Archives and Museums) sector, gain new skills to support digital research.

Illustration from an early printed edition of the Adityahṛdayam, a devotional hymn in Sanskrit to the Sun God, seen here on his chariot drawn by seven horses, Bombay, 1862

 

It is a privilege to be here working for the British Library, an institution I have always admired for its mission and core values and I am proud to support that continued effort through stimulating an international community of researchers to access what will prove to be a fascinating collection. We’ll be posting further blogs describing the progress of the project, so watch this space! If you have any questions about the project or ideas relating to innovative use of the collection, please do email me at Tom.Derrick@bl.uk