20 October 2020
The Botish Library: developing a poetry printing machine with Python
This is a guest post by Giulia Carla Rossi, Curator of Digital Publications at the British Library. You can find her @giugimonogatari.
In June 2020 the Office for Students announced a campaign to fill 2,500 new places on artificial intelligence and data science conversion courses in universities across the UK. While I’m not planning to retrain in cyber, I was lucky enough to be in the cohort for the trial run of one of these courses: Birkbeck’s Postgraduate Certificate in Applied Data Science. The course started as a collaborative project between The British Library, The National Archives and Birkbeck University to develop a computing course aimed at professionals working in the cultural heritage sector. The trial run has now ended and the course is set to start in full from January 2021.
The course is designed for graduates who are new to computer science – which was perfect for me, as I had no previous coding knowledge besides some very basic HTML and CSS. It was a very steep learning curve, starting from scratch and ending with developing my own piece of software, but it was great to see how code could be applied to everyday issues to facilitate and automate parts of our workload. The fact that it was targeted at information professionals and that we could use existing datasets to learn from real life examples made it easier to integrate study with work. After a while, I started to look at the everyday tasks in my to-do list and wonder “Can this be solved with Python?”
After a taught module (Demystifying Computing with Python), students had to work on an individual project module and develop a piece of software based on their work (to solve an issue, facilitate a task, re-use and analyse existing resources). I had an idea of the themes I wanted to explore – as Curator of Digital Publications, I’m interested in new media and platforms used to deliver content, and how text and stories are shaped by these tools. When I read about French company Short Édition and the short story vending machine in Canary Wharf I knew I had found my project.
My project is to build a stand-alone printer that prints random poems from a dataset of out-of-copyright texts. A little portable Bot-ish (sic!) Library to showcase the British Library collections and fill the world with more poetry.
For my project, I decided to use the British Library’s “Digitised printed books (18th-19th century)” collection. This comprises over 60,000 volumes of 18th and 19th century texts, digitised in partnership with Microsoft and made available under Public Domain Mark. My work focused on the metadata dataset and the dataset of OCR derived text (shout out to the Digital Research team for kindly providing me with this dataset, as its size far exceeded what my computer is able to download).
The British Library actively encourages researchers to use its “digital collection and data in exciting and innovative ways” and projects with similar goals to mine had been undertaken before. In 2017, Dr Jennifer Batt worked with staff at the British Library on a data mining project: her goal was to identify poetry within a dataset of 18th-century digitised newspapers from the British Library’s Burney Collection. In her research, Batt argued that searching for a set of recurring words didn’t help her find poetry within the dataset, as only very few of the poems included key terms like ‘stanza’ and ‘line’ – and none included the word ‘poem’. In my case, I chose to work with the metadata dataset first, as a way of filtering books based on their title. While, as Batt showed, it’s unlikely that a poem itself includes a term defining its poetry style, I was quite confident that such terms might appear in the title of a poetry collection.
My first step then was to identify books containing poetry, by searching through the metadata dataset using keywords associated with poetry. My goal was not to find all the poetry in the dataset, but to identify books containing some form of poetry that could be reused to create my printer dataset. I used the Poetry Foundation’s online “Glossary of Poetic Terms - Forms & Types of Poems” to identify key terms to use, eliminating the anachronisms (no poetry slam in the 19th century, I'm afraid) and ambiguous terms (“romance” returned too many results that weren’t relevant to my research). The result was 4,580 book titles containing one or more poetry-related words.
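A title filter of this kind can be sketched in a few lines of Python. This is an illustration rather than the project's actual code: the term list is a shortened, hypothetical sample, the record layout (`id` and `title` keys) is an assumption, and the second sample title is made up to show why whole-word matching matters.

```python
import re

# Hypothetical, shortened list of poetry terms (the real list was drawn
# from the Poetry Foundation glossary and was longer).
POETRY_TERMS = ["poem", "poems", "poetry", "sonnet", "sonnets",
                "ode", "odes", "ballad", "ballads", "elegy", "verses"]

# Word boundaries (\b) stop "ode" from matching inside "Model" or "Episode".
TERM_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in POETRY_TERMS) + r")\b",
    re.IGNORECASE,
)

def find_poetry_titles(records):
    """Return the records whose title contains at least one poetry term."""
    return [r for r in records if TERM_PATTERN.search(r.get("title", ""))]

# Assumed record layout: one dict per book with "id" and "title" keys.
sample = [
    {"id": "001", "title": "Legend of the Death of Antar, an eastern romance. "
                           "Also lyrical poems, songs, and sonnets."},
    {"id": "002", "title": "A Model Episode in the History of Steam Navigation"},
    {"id": "003", "title": "Old Year Leaves Being old verses revived."},
]

matches = find_poetry_titles(sample)
print([r["id"] for r in matches])
```

Matching on whole words keeps short terms such as "ode" usable. Ambiguous terms like "romance" still have to be left out of the list by hand.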
Creating verses: when coding meets grammar
I then wanted to extract individual poems from my dataset. The variety of book structures and poetry styles made it impossible to find a blanket rule that could be applied to all books. I chose to test my code out on books that I knew had one poem per page, so that I could extract pages and easily get my poems. Because of its relatively simple structure - and possibly because of some nostalgia for my secondary school Italian class - I started my experiments with Giacomo Pincherle’s 1865 translation of Dante’s sonnets, “In Omaggio a Dante. Dante's Memorial. [Containing five sonnets from Dante, Petrarch and Metastasio, with English versions by G. Pincherle, and five original sonnets in English by G. Pincherle.]”
Once I solved the problem of extracting single poems, the issue was ‘reshaping’ the text to match the print edition. Line breaks are essential to the meaning of a poem and the OCR text was just one continuous string of text that completely disregarded the metre and rhythm of the original work. The rationale behind my choice of book was also that sonnets present a fairly regular structure, which I was hoping could be of use when reshaping the text. The idea of using the poem’s metre as a tool to determine line length seemed the most effective choice: by knowing the metre or verse form used (iambic pentameter, terza rima, etc.) it’s possible to anticipate the number of syllables for each line and where line breaks should occur.
So I created a function to count how many syllables a word has following English grammar rules. As is often the case with coding, someone has likely already encountered the same problem as you and, if you’re lucky, they have found a solution: I used a function found online as my base (thank you, StackOverflow), building on it in order to cover as many grammar rules (and exceptions) as I was aware of. I used the same model and adapted it to Italian grammar rules, in order to account for the Italian sonnets in the book as well. I then decided to combine the syllable count with the use of capitalisation at the beginning of a line. This increased the chances of a successful result in case the syllable count returned a wrong result (which might happen whenever typos appear in the OCR text).
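To make the idea concrete, here is a rough sketch of how a syllable count and a capitalisation check can work together. It is not the function used in the project: `count_syllables` is a deliberately simple heuristic (vowel groups minus a silent final "e"), `reshape` and its `target` and `tolerance` parameters are hypothetical names, and the opening of Shakespeare's Sonnet 18 stands in for the OCR text.

```python
import re

def count_syllables(word):
    """Rough English syllable count: vowel groups, minus a silent final 'e'."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    count = len(re.findall(r"[aeiouy]+", word))
    # A final 'e' is usually silent ("love"), but not in "-le" ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def reshape(text, target=10, tolerance=0):
    """Split continuous text into verse lines.

    A line break is placed before a capitalised word only once the running
    syllable count has reached (target - tolerance), so a capital letter in
    the middle of a line (such as "I") does not trigger a break.
    """
    lines, current, syllables = [], [], 0
    for word in text.split():
        near_target = syllables >= target - tolerance
        if current and near_target and word[0].isupper():
            lines.append(" ".join(current))
            current, syllables = [], 0
        current.append(word)
        syllables += count_syllables(word)
    if current:
        lines.append(" ".join(current))
    return lines

# Stand-in for a continuous OCR string (iambic pentameter, 10 syllables).
text = ("Shall I compare thee to a summer's day? Thou art more lovely and "
        "more temperate: Rough winds do shake the darling buds of May,")

lines = reshape(text)
for line in lines:
    print(line)
```

The heuristic miscounts some words ("lovely" comes out as three syllables), which is why the capital letter acts as a second signal. Raising `tolerance` lets slightly undercounted lines still break, at the cost of more false breaks before capitalised words near the end of a line.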
It was very helpful that all books in the datasets were digitised and are available to access remotely (you can search for them on the British Library catalogue by using the search term “blmsd”), so I could check and compare my results to the print editions from home even during lockdown. I also tested my functions on sonnets from Henry Thomas Mackenzie Bell’s “Old Year Leaves Being old verses revived. [With the addition of two sonnets.]” and Welbore Saint Clair Baddeley’s “Legend of the Death of Antar, an eastern romance. Also lyrical poems, songs, and sonnets.”
Main challenges and gaps in research
- Typos in the OCR text: Errors and typos were introduced when the books in the collection were first digitised, which translated into exceptions to the rules I devised for identifying and restructuring poems. In order to ensure the text of every poem has been correctly captured and that typos have been fixed, some degree of manual intervention might be required.
- Scalability: The variety of poetry styles and book structures, paired with the lack of tagging around verse text, makes it impossible to find a single formula that can be applied to all cases. What I created is quite dependent on a book having one poem per page, and using capitalisation in a certain way.
- Time constraint: the time limit we had to deliver the project - and my very-recently-acquired-and-still-very-much-developing skill set - meant I had to focus on a limited number of books and had to prioritise writing the software over building the printer itself.
One of the outputs of this project is a JSON file containing a dictionary of poetry books. After searching for poetry terms, I paired the poetry titles and their related metadata with their pages from the OCR dataset, so the resulting file combines useful data from the two original datasets (book IDs, titles, authors’ names and the OCR text of each book). It’s also slightly easier to navigate compared to the OCR dataset, as books can be retrieved by ID and each page is an item in a list that can be easily called. One of the next steps will be to upload this onto the British Library data repository, in the hope that people might be encouraged to use it and conduct further research around this data collection.
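A file organised this way could be read roughly as follows. The key names (`title`, `author`, `pages`), the book ID and the sample content are all assumptions made for illustration, since the actual layout of the file is not described in detail here.

```python
import json
import random

# Hypothetical layout: book ID -> metadata plus a list of page texts.
books_json = """
{
  "000000001": {
    "title": "Example sonnets",
    "author": "Example Author",
    "pages": ["Title page text", "First sonnet text", "Second sonnet text"]
  }
}
"""

books = json.loads(books_json)

# Retrieve a book by ID, then a page by its position in the list.
first_sonnet = books["000000001"]["pages"][1]

def random_page(books, rng=random):
    """Pick a random book, then a random page from it."""
    book_id = rng.choice(sorted(books))
    pages = books[book_id]["pages"]
    page_number = rng.randrange(len(pages))
    return book_id, page_number, pages[page_number]

book_id, page_number, page_text = random_page(books)
print(books[book_id]["title"], "- page", page_number)
print(page_text)
```

With one poem per page, picking a random page is enough to pick a random poem. Books with other layouts would need an extra step to isolate each poem.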
Another, very obvious, next step is: building the printer! The individual components have already been purchased (Adafruit IoT Pi Printer Project Pack and Raspberry Pi 3). I will then have to build the thermal printer with Raspberry Pi and connect it to my poetry dataset. It’s interesting to note that other higher education institutions and libraries have been experimenting with similar ideas - like the University of Idaho Library’s Vandal Poem of the Day Bot and the University of British Columbia’s randomised book recommendations printer for libraries.
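One detail to plan for before the hardware is assembled is the narrow paper. The sketch below wraps each verse line to a fixed width and indents the overflow, so that a wrapped line is not mistaken for a new verse line. It involves no printer calls; the 32-character width is an assumption about the printer's default font, and `format_for_receipt` is a hypothetical helper name.

```python
import textwrap

def format_for_receipt(poem_lines, width=32, indent="    "):
    """Wrap verse lines to the paper width, indenting any overflow."""
    output = []
    for line in poem_lines:
        if not line.strip():
            # Keep blank lines, which separate stanzas.
            output.append("")
            continue
        output.extend(
            textwrap.wrap(line, width=width, subsequent_indent=indent)
        )
    return output

poem = ["Shall I compare thee to a summer's day?", "", "Thou art"]
receipt = format_for_receipt(poem)
for row in receipt:
    print(row)
```

The indent preserves the original line breaks visually, which matters because the breaks are part of the poem.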
My aim when working on this project was for the printer to be used to showcase British Library collections; the idea was for it to be located in a public area in the Library, to reach new audiences that might not necessarily be there for research purposes. The printer could also be reprogrammed to print different genres and be customised for different occasions (e.g. exhibitions, anniversary celebrations, etc.). All of this was planned before Covid-19 happened, so it might be necessary to slightly adapt things now - and any suggestions on this are very welcome! :)
Finally, none of this would have been possible without Nora McGregor, Stelios Sotiriadis, Peter Wood, the Digital Scholarship and BL Labs teams, and the support of my line manager and my team.