THE BRITISH LIBRARY

Digital scholarship blog


19 March 2019

BL Labs 2018 Commercial Award Runner Up: 'The Seder Oneg Shabbos Bentsher'


This guest blog was written by David Zvi Kalman on behalf of the team that received the runner-up award in the 2018 BL Labs Commercial category.


The bentsher is a strange book, both invisible and highly visible. It is not among the better-known Jewish books, like the prayerbook, Hebrew Bible, or haggadah. You would be hard-pressed to find a general-interest bookstore selling a copy. Still, enter the house of a traditional Jew and you’d likely find at least a few, possibly a few dozen. In Orthodox communities, the bentsher is arguably the most visible book of all.

Bentshers are handbooks containing the songs and blessings, including the Grace after Meals, that are most useful for Sabbath and holiday meals, as well as larger gatherings. They are, as a rule, quite small. These days, bentshers are commonly given out as party favors at Jewish weddings and bar/bat mitzvahs, since meals at those events require them anyway. Many bentshers today have personalized covers relating the events at which they were given.

Bentshers have never gone out of print. By this I mean that printing began with the invention of the printing press and has never stopped. They are small, but they have always been useful. Seder Oneg Shabbos, the version which I designed, was released 500 years after the first bentsher was published. It is, in a sense, a Half Millennium Anniversary Special Edition.


Bentshers, like other Jewish books, could be quite ornate; some were written and illustrated by hand. Over the years, however, bentshers have become less and less interesting, largely in order to lower the unit cost. In order to make it feasible for wedding planners to order hundreds at a time, all images were stripped from the books, the books themselves became very small, and any interest in elegant typography was quickly eliminated. My grandfather, who designed custom covers for wedding bentshers, simply called the book, “the insert.” Custom prayerbooks were no different from custom matchbooks.

This particular bentsher was created with the goal of bucking this trend; I attempted to give the book the feel of some of the Jewish books and manuscripts of the past, using the research I was able to gather as a graduate student in the field of Jewish history. Doing this required a great deal of image research; for this, the British Library’s online resources were incredibly valuable. Of the more than one hundred images in the book, a plurality are from the British Library’s collections.

https://data.bl.uk/hebrewmanuscripts/

https://www.bl.uk/hebrew-manuscripts


In addition to its visual element, this bentsher differs from others in two important ways. First, it contains ritual language that is inclusive of those in the LGBTQ community, especially those conducting same-sex weddings. Second, the book contains songs not just in Hebrew but in Yiddish as well; this was a homage to two Yiddishists who aided in creating the bentsher’s content. The bentsher was first used at their wedding.


More here: https://shabb.es/sederonegshabbos/

Watch David accepting the runner-up award and talking about the Seder Oneg Shabbos Bentsher on our YouTube channel (the clip runs from 5.33 to 7.26).

David Zvi Kalman was responsible for the book’s design, including the choice of images. He is a doctoral candidate at the University of Pennsylvania, where he focuses on the relationship between Jewish history and the history of technology. Sarah Wolf is a specialist in rabbinics and is an assistant professor at the Jewish Theological Seminary of America. Joshua Schwartz is a doctoral student at New York University, where he studies Jewish mysticism. Sarah and Joshua were responsible for most of the book’s translations and transliterations. Yocheved and Yudis Retig are Yiddishists and were responsible for the book’s Yiddish content and translations.

Find out more about Digital Scholarship and BL Labs. If you have a project which uses British Library digital content in innovative and interesting ways, consider applying for an award this year! The 2019 BL Labs Symposium will take place on Monday 11 November at the British Library.

28 February 2019

The World Wide Lab: Building Library Labs - Part II



We're setting sail for Denmark! Along with colleagues from the UK, Austria, Belgium, Egypt, Finland, Germany, Ireland, Latvia, Luxembourg, the Netherlands, Qatar, Spain, Sweden and the USA, we will be mooring at Copenhagen's Black Diamond, waterfront home to Denmark's Royal Library, for the second International Building Library Labs event: 4-5 March 2019.


For some time now, leading national, state, university and public libraries around the world have been creating 'digital lab type environments'. The purpose of these 'laboratories' is to afford access to their institutions' digital content - the digitised and 'born digital' collections as well as data - and to provide a space where users can experiment and work with that content in creative, innovative and inspiring ways. Our shared ethos is to open up our collections for everyone: digital researchers, artists, entrepreneurs, educators, and everyone in between.

BL Labs has been running in such a capacity for six years. In September 2018, we hosted a 2-day workshop at the British Library in London for invited participants from national, state and university libraries - the first event of its kind in the world. It was a resounding success, and it was decided that we should organise a second event, this time in collaboration with our colleagues in Copenhagen.

Next week's participants, from over 30 institutions, will be sharing lessons learned, talking about innovative projects and services that have used their digital collections and data in clever ways, and continuing to establish the foundations for an international network of Library Labs. We aim to work together in the spirit of collaboration so that we can continue to build even better Library Labs for our users in the future.

Our packed programme is available to view on Eventbrite or as a Google Doc. We still have a few spaces left, so if you are interested in coming along, you can still book here. As well as presentations and plenary debates, we will have eight lightning talks, with topics ranging from how to handle big data to how to run a data visualisation lab. To accommodate our many delegates, with their own interests and specialisms, we will break out into 12 parallel discussion groups focusing on subjects such as how to set up a lab; how to get access to data; moving from 'project' lab to 'business as usual'; data curation; how to deal with large datasets; and using Labs & Makerspaces for data-driven research and innovation in the creative industries.

We will blog again after the event, and provide links to some of the presentations and outputs. Watch this space! 


Danish-themed images trawled from our British Library Flickr Images set: pages 37, 126, and 15 of Copenhagen, the Capital of Denmark, published by the Danish Tourist Society, 1898. Find the original book here.

Posted by Eleanor Cooper on behalf of BL Labs

26 February 2019

Competition to automate text recognition for printed Bangla books


You may have seen the exciting news last week that the British Library has launched a competition on recognition of historical Arabic scientific manuscripts that will run as part of ICDAR2019. We thought it only fair to cover printed material too! So we’re running another competition, also at ICDAR, for automated text recognition of rare and unique printed books written in Bangla that have been digitised through the Library's Two Centuries of Indian Print project.

Some of you may remember the Bangla printed books competition at ICDAR2017, which generated significant interest among academic institutions and technology providers both in India and across the world. The 2017 competition set the challenge of finding an optimal solution for automating recognition of Bangla printed text, and resulted in Google’s method performing best for both text detection and layout analysis.

Fast forward to 2019 and, thanks to Jadavpur University in Kolkata, we have added more ground truth transcriptions for competition entrants to train their OCR systems with. We hope the competition again encourages submissions of cutting-edge OCR methods, leading to a solution that can truly open up these historic books, dating from between 1713 and 1914, for text mining, enabling scholars of South Asian studies to explore hundreds of thousands of pages on a scale that has not been possible until now.

Image showing a transcribed page from one of the Bengali books featured in the ICDAR2019 competition

As with the Arabic competition, we are collaborating with PRImA (Pattern Recognition & Image Analysis Research Lab) who will provide expert and objective evaluation of OCR results produced through the competition. The final results will be revealed at the ICDAR2019 conference in Sydney in September.

So if you missed out last time but are interested in testing your OCR systems on our books, the competition is now open! For instructions on how to apply and more about the competition, please visit https://www.primaresearch.org/REID2019/

 

This post is by Tom Derrick, Digital Curator for Two Centuries of Indian Print, British Library. He is on Twitter as @TommyID83 and Two Centuries of Indian Print tweets from @BL_IndianPrint.

 

18 February 2019

Updated Eighteenth-Century Collections Online


The traditional, somewhat stereotypical image of the researcher of things past has not changed much in recent times. There is nothing easier than to imagine a scholar sitting at a scarcely illuminated wooden desk, surrounded by piles of old hardbound volumes, spending hours on end rummaging through the sheets in search of a clue.

In the field of eighteenth-century studies, this is certainly still the case. Scholars often go on a pilgrimage to prestigious repositories such as the British Library. However, in the last fifteen years or so, technology has started to offer attractive alternatives to the pleasure of travelling to London. Powered by Gale-Cengage, the Eighteenth-Century Collections Online (commonly referred to as ECCO) is a well-known resource that provides access to English-language and foreign-language publications printed in Britain, Ireland and the American colonies during the eighteenth century. This extensive collection contains over 180,000 titles (200,000 volumes) and allows full-text searching of some 32 million pages. These are digital editions based on the Eighteenth Century microfilming that started in 1981 and the English Short Title Catalogue.

New ECCO home page

Moving away from its classic web-1.0 design, the Gale-Cengage team recently decided to revamp the layout of ECCO – indeed, of their entire portfolio of archive products, which includes among others the Seventeenth- and Eighteenth-Century Burney Newspapers Collection. The aim is to make the Gale Primary Sources experience more consistent and intuitive for the user. At the head of this delicate operation are product managers Doran Steele and Megan Sullivan, who lead a nine-person team of software developers, content engineers, researchers and designers. Not IT-only personnel, Doran and Megan are scholars themselves, holding degrees in History and Information Science respectively, with a remarkable passion for all things past. They are responsible for the maintenance of the existing ECCO interface, as well as the development of the upcoming design refresh.

During a recent interview they gave to the authors of this post, Doran and Megan declared their objective of evolving ECCO in line ‘with user expectations of modern online research experiences’. Their driving force was stated very clearly as a bottom-up process. ‘This redesign’, they explained, ‘is informed by user feedback and market research’. A beta version of the new site has been available since the second half of 2018 to enable the Gale-Cengage team to gather feedback about the new design. The product managers specified that the final transition to the ‘new’ ECCO will only be completed once they feel confident that the new experience ‘successfully meets the needs of our users’. The final goal is a better user experience, ‘one that is faster and more intuitive’. To achieve this, a range of new features has been included, such as more filters on search results; results more relevant to the search queries; data visualisation tools; improved subject indexing; more options for adjusting the image; and the ability to download the OCR (optical character recognition) version of a volume in text format. The latter feature will be a particularly welcome innovation for scholars who often need to look up the occurrences of a single word or cut and paste long chunks of text.
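To make the appeal of the downloadable OCR text concrete, here is a minimal sketch, in Python, of looking up the occurrences of a single word in a downloaded volume. The filename is hypothetical, and real OCR output would need more cleaning (line-break hyphenation, the long s, and so on) than is shown here.

    # Minimal sketch: counting occurrences of a word in a downloaded
    # OCR text. "ecco_volume.txt" is a hypothetical filename; real OCR
    # output will need extra cleaning (hyphenation, the long s, etc.).
    import re
    from collections import Counter

    with open("ecco_volume.txt", encoding="utf-8") as f:
        text = f.read().lower()

    tokens = re.findall(r"[a-z]+", text)
    counts = Counter(tokens)

    print(counts["antiquities"])  # occurrences of a single word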

New ECCO search results screen

The options for adjusting the page view are another significant novelty. The beta version boasts new settings to quickly select the preferred zoom level, as well as sliders to increase or decrease the brightness and contrast of the page. These improvements are particularly welcome considering that the quality of the scans remains unchanged. The page quality is not directly related to ECCO: the portal simply allows the consultation of the digitised microfilms included in the first collection (also known as ECCO 1, comprising over 154,000 texts) and the digitisation of a second, smaller collection of books (ECCO 2, over 52,000 titles).

This raises an important issue. A plethora of relatively unknown yet precious eighteenth-century material remains difficult to consult because, on top of the uneven quality of the texts that came out of eighteenth-century printing presses, the original microfilming technology employed for the first collection yielded relatively low-resolution results. This causes some hiccups with OCR, thus discouraging the use of quantitative methodologies. The issue is all the more salient when the category of eighteenth-century visuals is taken into account. At a time when British engravers multiplied in numbers to illustrate the newly discovered wonders of the natural world or the archaeological remains of Roman cities in England, illustrations became an essential aspect of the eighteenth-century book market and reading experience. While for essential texts such as William Stukeley’s Itinerarium curiosum (1724) or Eleazar Albin and William Derham’s A Natural History of Birds (1734) more refined scans can be found elsewhere, a large number of texts are digitally available only through ECCO 1. Scholars interested in images must either focus on well-known texts that have been digitised by other providers – with serious consequences in terms of canonicity – or plan a visit to major libraries to consult the relevant volumes in person, somewhat defeating the very idea of digital reading. Either way, the study of visual culture is somewhat inhibited.

Nevertheless, the ‘new’ ECCO promises to enhance the user experience and to offer even more opportunities to engage with outstanding repositories of primary material. If you have already had a chance to use the new version, we encourage you to get in touch with Doran and Megan, as your feedback and suggestions can improve ECCO even further.

New ECCO image viewer screen

This post is by Alessio Mattana, Teaching Assistant in Eighteenth-Century Literature at the University of Leeds (on Twitter as @mattanaless), and Dr Giacomo Savani, Teaching Assistant in Ancient History at the University of Leeds (on Twitter as @GiacomoSavani).

30 January 2019

Reading 35,000 Books: The UCD Contagion Project and the British Library Digital Corpus - Workshop & Roundtable


A guest post by Gerardine Meaney, Professor of Cultural Theory in the School of English, Drama and Film, and Derek Greene, Assistant Professor at the School of Computer Science, both at University College Dublin, who are organising a FREE workshop and roundtable together with the BL Labs team on Thursday 20 February 2019 at the British Library in London.

How do you set about finding specific references and thematic associations in the massive digital resource represented by the British Library Nineteenth Century Book Corpus, originally digitised through a collaboration with Microsoft?

The Contagion, Biopolitics and Cultural Memory project at UCD set out to illuminate culturally and historically specific understandings of disease and contagion that appear within the fiction in the corpus. In order to do so, the project team extracted over 35,000 unique volumes in English out of a total of 65,000, and built a searchable interface of 12.3 million individual pages of text, which can be filtered and sorted using the corpus metadata (author, title, year, etc.). The interface incorporates an index of the topical catalogue of volumes used by the British Library from 1823 to 1985 (the Alston index). Using a combination of OCR text recognition and manual annotation, we have extracted data from the top two levels of the index, covering over 98% of the English-language texts in the corpus. So, for the first time, it is possible to reliably identify and extract fiction, drama, history, topography, etc., from the corpus.
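To give a flavour of the kind of filtering and sorting this makes possible, here is a minimal sketch in Python using pandas over a hypothetical metadata export; the filename, the column names and the 'Fiction' genre label are illustrative assumptions, not the project's actual schema.

    # Minimal sketch of filtering a corpus index by metadata. The CSV
    # name, the columns (author, title, year, genre) and the "Fiction"
    # label are assumptions standing in for the project's real schema.
    import pandas as pd

    meta = pd.read_csv("corpus_metadata.csv")

    # Keep fiction from a chosen date range, then sort chronologically.
    fiction = meta[(meta["genre"] == "Fiction")
                   & meta["year"].between(1840, 1870)]
    print(fiction.sort_values("year")[["author", "title", "year"]].head())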

Extracting data from 35,000 digitised books

To allow researchers to further filter the corpus to identify texts from niche topic areas, the interface supports the semi-automatic creation of word lexicons, built upon modern “word embedding” natural language processing methods. By combining the resulting lexicons with existing corpus metadata and the data extracted from the digitised version of the Alston index, researchers can efficiently create and export small topical sub-corpora for subsequent close reading.
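As a rough sketch of what 'semi-automatic' lexicon building with word embeddings can look like, the snippet below trains a word2vec model with gensim and expands a seed list of terms by embedding similarity. The toy corpus and seed terms are purely illustrative; the project's own interface and training corpus are of course far larger.

    # Rough sketch of semi-automatic lexicon expansion with embeddings.
    # A toy corpus stands in for the tokenised pages of a sub-corpus.
    from gensim.models import Word2Vec

    sentences = [
        ["the", "fever", "spread", "through", "the", "crowded", "city"],
        ["contagion", "and", "infection", "followed", "the", "epidemic"],
        ["cholera", "raged", "in", "the", "poorest", "districts"],
    ]
    model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, seed=1)

    seed_terms = ["contagion", "fever"]
    for word, score in model.wv.most_similar(positive=seed_terms, topn=5):
        print(f"{word}\t{score:.2f}")
    # A researcher then accepts or rejects each candidate by hand -- the
    # "semi-automatic" step -- before exporting the finished lexicon.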

The Contagion project team is currently using information retrieval and word embeddings to identify texts for close reading. This combination allows us to track key trends pertaining to illness and contagion in the corpus, and interpret these findings with particular reference to current and historical debates surrounding biopolitics, medical culture and migration. Clusters of associations between contagion, poverty and morality are identifiable within the corpus. However, to date our research indicates that Victorians were more worried about religious contamination from migrants and minorities than they were about contagious diseases.

A key feature of the project is the intersection of methodologies and concepts from English literature, automated text mining, and medical humanities. This involves using data analytics as a mode of interpretation, not a substitute for it: a way of engaging with the extent and complexity of cultural production in the nineteenth century. Cultural data resists giving definitive yes or no answers to the questions put to it by researchers, but the more cultural data we analyse, the better we can map the processes of cultural change and continuity in all their complexity. The process of tracking themes, topics, and associations enabled by the new interface offers an opportunity to work with and far beyond the existing canon of nineteenth-century fiction, itself radically expanded by the last 20 years of scholarship. The identification within the corpus of a very large collection of three-volume novels indicates that the popular novel is very well represented, for example, while the ability to identify and extract ‘Collected Works’ indicates which writers their contemporaries expected to remain central to the tradition of fiction.

On 20 February 2019, the FREE ‘Reading 35,000 Books’ workshop and roundtable will present the project’s work to date, and will also include discussion, by scholars of nineteenth-century literature and British Library Labs, of the future development and use of the new searchable interface, including exporting topical sub-corpora for further research.

The event is supported by the Irish Research Council.

 

29 October 2018

Using Transkribus for automated text recognition of historical Bengali Books


In this post Tom Derrick, Digital Curator, Two Centuries of Indian Print, explains the Library's recent use of Transkribus for automated text recognition of Bengali printed books.

Are you working with digitised printed collections that you want to 'unlock' for keyword search and text mining? Maybe you have already heard about Transkribus but thought it could only be used for automated recognition of handwritten texts. If so you might be surprised to hear it also does a pretty good job with printed texts too. You might be even more surprised to hear it does an impressive job with printed texts in Indian scripts! At least that is what we have found from recent testing with a batch of 19th century printed books written in Bengali script that have been digitised through the British Library’s Two Centuries of Indian Print project.

Transkribus is part of the READ project and is available as a free tool for users who want to automate recognition of historical documents. The British Library has already had some success using Transkribus on manuscripts from our India Office collection, and it was that which inspired me to see how it would perform on the Bengali texts, which provide an altogether different type of challenge.

For a start, most text recognition solutions either do not support Indian scripts, or do not come close to the same level of recognition as they achieve with documents written in English or other Latin scripts. In part this is down to supply and demand: mainstream providers of tools have prioritised Western customers, but there is also a relative lack of digitised Indian texts that can be used to train text recognition engines.

These text recognition engines have also been well trained on modern dictionaries, and a collection of historical texts like the Bengali books will often contain words which are no longer in use. Their aged physicality also brings with it the delights of faded print, blotchy paper and other paper-based gremlins that keep conservationists in work yet disrupt automated text recognition. Throw in an extensive alphabet that contains more diverse and complicated character forms than English and you can start to piece together how difficult it can be to train recognition engines to achieve comparable results with Bengali texts.

So it was with more hope than expectation that I approached Transkribus. We began by selecting 50 pages from the Bengali books, representing as much as possible the variety of typographical and layout styles within the wider collection of c. 500,000 pages. Not an easy task! We uploaded these to Transkribus, manually segmenting paragraphs into text regions and automating line recognition. We then manually transcribed the texts to create a ground truth which, together with the scanned page images, was used to train the recurrent neural network within Transkribus to create a model for the 5,700 transcribed words.

View of a segmented page from one of the British Library's Bengali books, along with its transcription, within the Transkribus viewer.

The model was tested on a few pages from the wider collection, with the results clearly communicated by the graph below. The model achieved an average character error rate (CER) of 21.9%, which is comparable to the best results we have seen from other text recognition services. Word accuracy of 61% was based on the number of words that were misspelled in the automated transcription compared to the ground truth. Eventually we would like to use automated transcriptions to support keyword searching of the Bengali books online, and the higher the word accuracy, the greater the chance of users pulling back all relevant hits from their keyword search. We noticed the results often missed the upper zone of certain Bengali characters, i.e. the part of the character or glyph which resides above the matra line that connects characters in Bengali words. Further training focused on recognition of these characters may improve the results.
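For readers unfamiliar with these measures, the sketch below shows one plain-Python way of computing a character error rate and a naive word accuracy for a ground-truth/hypothesis pair. PRImA-style evaluation tooling is considerably more sophisticated (it aligns words properly, for a start), so treat this as an illustration only.

    # Illustrative sketch of the two measures quoted above; the word
    # accuracy here uses a naive position-by-position alignment.
    def levenshtein(a: str, b: str) -> int:
        """Minimum number of single-character edits turning a into b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def cer(truth: str, hypothesis: str) -> float:
        """Character error rate: edit distance over ground-truth length."""
        return levenshtein(truth, hypothesis) / len(truth)

    def word_accuracy(truth: str, hypothesis: str) -> float:
        """Share of word positions where transcription and truth match."""
        t, h = truth.split(), hypothesis.split()
        return sum(x == y for x, y in zip(t, h)) / len(t)

    print(cer("ঘরে বাইরে", "ঘরে বাইরা"))  # one substituted character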

Graph showing the learning curve of the Bengali model using the Transkribus HTR tool.

Our training set of 50 pages is very small compared to other projects using Transkribus and so we think the accuracy could be vastly improved by creating more transcriptions and re-training the model. However, we're happy with these initial results and would encourage others in a similar position to give Transkribus a try.

 

 

24 July 2018

Workshop for South Asian Archivists and Librarians


Members of the Two Centuries of Indian Print team have just returned from a fascinating trip to Delhi where we took part in a packed programme of activities organised as part of the Association for Asian Studies conference.

We spent most of the week with a group of archivists brought together from a variety of academic and cultural institutions across India and as far away as Cambodia and Australia. What united us was a shared passion for preserving South Asian heritage. As part of the programme we led a workshop on Digitisation Standards as practised by the British Library, which also considered the key challenges organisations face when digitising cultural heritage material, including everything from selecting material and scanning through to post-processing, online display and user engagement. The workshop also featured a paper on the IFLA guidelines for digitisation and (what we hope was) a fun activity in which archivists were presented with different case studies of archival collections and asked to consider a digitisation strategy. It certainly sparked a lot of conversation! See the photo below.

 

Workshop participants taking part in a group activity

 

Undeterred by the inhospitable weather occupying Delhi, we ventured out and were fortunate enough to receive some very thorough and illuminating tours of the Archives and Research Centre for Ethnomusicology, Centre for Art and Archaeology, The National Archives, Indira Gandhi National Centre for the Arts, and Sangeet Natak Akademi where we learned about their respective collections, conservation facilities and digitisation projects.

 

Taking part in a tour of the audiovisual lab at the Archives and Research Centre for Ethnomusicology

 

This marked the end of a trip which has connected us with inspiring professionals with whom we hope to collaborate on more events in the near future.

Our thanks go out to the organisers of what turned out to be a very engaging week of activities, to the American Institute of Indian Studies, to Ashoka University, and to the hosts of our workshop, the India International Centre.

 

08 May 2018

The Italian Academies database – now available in XML


Dr Mia Ridge writes: in 2017, we made XML and image files from a four-year, AHRC-funded project, The Italian Academies 1525-1700, available through the Library's open data portal. The original data structure was quite complex, so we would be curious to hear feedback from anyone reusing the converted form for research or visualisations.

In this post, Dr Lisa Sampson, Reader in Early Modern Italian Studies at UCL, and Dr Jane Everson, Emeritus Professor of Italian literature, RHUL, provide further information about the project...

New research opportunities for students of Renaissance and Baroque culture! The Italian Academies database is now available for download. It's in a format called XML, which represents the original structure of the database.

This dedicated database results from an eight-year project, funded by the Arts and Humanities Research Council UK, and provides a wealth of information on the Italian learned academies. Around 800 such institutions flourished across the peninsula over the sixteenth and seventeenth centuries, making major contributions to the cultural and scientific debates and innovations of the period, as well as forming intellectual networks across Europe. This database lists a total of 587 Academies from Venice, Padua, Ferrara, Bologna, Siena, Rome, Naples, and towns and cities in southern Italy and Sicily active in the period 1525-1700. Also listed are more than 7,000 members of one or more academies (including major figures like Galileo, as well as women and artists), and almost 1,000 printed works connected with academies held in the British Library. The database therefore provides an essential starting point for research into early modern culture in Italy and beyond. It is also an invitation to further scholarship and data collection, as these totals constitute only a fraction of the data relating to the Academies.
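For anyone downloading the XML, a hedged sketch of reading records with Python's standard library follows. The filename and the element names (Academy, Name, City) are guesses for illustration only; check the schema that accompanies the data.bl.uk download before adapting it.

    # Hedged sketch of reading academy records from the downloaded XML.
    # The filename and element names are assumptions; consult the actual
    # schema shipped with the download before relying on them.
    import xml.etree.ElementTree as ET

    tree = ET.parse("italian_academies.xml")
    root = tree.getroot()

    for academy in root.iter("Academy"):      # assumed element name
        name = academy.findtext("Name", default="?")
        city = academy.findtext("City", default="?")
        print(f"{name} ({city})")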

Laura Terracina, nicknamed Febea, of the Accademia degli Incogniti, Naples

The database is designed to permit searches from many different perspectives and to allow easy searching across categories. In addition to the three principal fields – Academies, People, Books – searches can be conducted by title keyword, printer, illustrator, dedicatee, censor, language, gender, nationality among others. The database also lists and illustrates the mottoes and emblems of the Academies (where known) and similarly of individual academy members. Illustrations from the books entered in the database include frontispieces, colophons, and images from within texts.

Emblem of the Accademia degli Intronati, Siena


The database thus aims to promote research on the Italian Academies in disciplines ranging from literature and history, through art, science, astronomy, mathematics, printing and publishing, censorship, politics, religion and philosophy.

The Italian Academies project which created this database began in 2006 as a collaboration between the British Library and Royal Holloway, University of London, funded by the Arts and Humanities Research Council and led by Jane Everson. The objective was the creation of a dedicated resource on the publications and membership of the Italian learned academies active in the period between 1525 and 1700. The software for the database was designed in-house by the British Library, and the first tranche of data was completed in 2009, listing information for academies in four cities (Naples, Siena, Bologna and Padua). A second phase, listing information for many more cities, including in southern Italy and Sicily, developed the database further between 2010 and 2014, with a major research grant from the AHRC and collaboration with the University of Reading.

The exciting possibilities now opened up by the British Library’s digital data strategy look set to stimulate new research and collaborations by making the records even more widely available, and easily downloadable, in line with Open Access goals. The Italian Academies team is now working to develop the project further with the addition of new data, and the incorporation into a hub of similar resources.

The Italian Academies project team members welcome feedback on the records and on the adoption of the database for new research (contact: www.italianacademies.org).

The original database remains accessible at http://www.bl.uk/catalogues/ItalianAcademies/Default.aspx 

An Introduction to the database, its aims, contents and objectives is available both at this site and at the new digital data site: https://data.bl.uk/iad/

Jane E. Everson, Royal Holloway University of London

Lisa Sampson, University College, London