Digital scholarship blog

Enabling innovative research with British Library digital collections

Introduction

Tracking exciting developments at the intersection of libraries, scholarship and technology.

27 January 2020

How historians can communicate their research online

This blog post is by Jonathan Blaney (Institute of Historical Research), Frances Madden (British Library), Francesca Morselli (DANS) and Jane Winters (School of Advanced Study, University of London).

This blog post will be published in several other locations, including the FREYA blog and the IHR blog.

Large satellite receiver
Source: Joshua Hoehne, Unsplash

On 4 December 2019, the FREYA project, in collaboration with the UCL Centre for Digital Humanities, the Institute of Historical Research, the British Library and DARIAH-EU, organized a workshop in London on identifiers in research. Aimed mainly at historians and humanities scholars, the workshop focused on ways in which they can build and manage an online profile as researchers, using tools such as ORCID IDs. It also covered best practices and methods for citing digital resources, to make humanities researchers' work connected and discoverable to others. The workshop had 20 attendees, mainly PhD students from the London area but also curators and independent researchers.

Presentations

Frances Madden from the British Library introduced the day, which was supported by the FREYA project, funded under the EU’s Horizon 2020 programme. FREYA aims to increase the use of persistent identifiers (PIDs) across the research landscape by building up services and infrastructure. The British Library is leading on the humanities and social sciences aspect of this work.

Frances described how PIDs are central to scholarly communication becoming effective and easy online. We will need PIDs not just for publications but for grey literature, for data, for blog posts, presentations and more. This is clearly a challenge for historians to learn about and use, and the workshop is a contribution to that effort.

PIDs: some historical context

Jonathan Blaney from the Institute of Historical Research said that citation has a historical context, and that persistent identifiers have long since grown up around traditional forms of print citation. These are almost invisible to us because they are so deeply familiar. He gave an example of a reference to the gospel story of the woman taken in adultery:

John 7:53-8:11

There are three conventions here: the name ‘John’ (attached to this gospel since about the 2nd century), the chapter divisions (medieval, and ascribed to the English bishop Stephen Langton) and the verse divisions (from the middle of the 16th century).

When learning new forms of referencing, such as the ones under discussion at the workshop, Jonathan suggested that historians should remember that their implicit knowledge of older conventions also had to be learned. He finished with an anecdote about Harry Belafonte, retold in Anthony Grafton’s The Footnote: A Curious History. As a young sailor Belafonte wanted to follow up on references in a book he had read. The next time he was on shore leave he went to a library and told the librarian:

“Just give me everything you’ve got by Ibid.”

People in conference room watching a presentation

Demonstrating the benefits

Prof Jane Winters from the School of Advanced Study introduced what she claimed was her most egotistical presentation by explaining her own choices in curating her online presence, and also what was beyond her control. She showed the different results of web searches for herself using Google and DuckDuckGo, and pointed out how things she had almost forgotten about can still feature prominently in results.

Jane described her own use of Twitter, and highlighted both the benefits and challenges of using social media to communicate research and build an online profile. It was the relatively rigid format of her institutional staff profile that led her to create her own website. Although Jane has an ORCID ID and a page on Humanities Commons, for example, there are many online services she has chosen not to use, such as academia.edu.

This is all very much a matter of personal choice, dependent upon people’s own tastes and willingness to engage with a particular service.

How to use what’s available

Francesca Morselli from DANS gave a presentation aiming to provide useful resources about identifiers for researchers as well as explaining in a simple yet exhaustive way how they "work" and the rationale behind them.

Most importantly PIDs ensure:

  1. Citability and discoverability (both for humans and machines)
  2. Disambiguation (between similar objects)
  3. Linking to related resources
  4. Long-term archiving and findability
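As a small illustration of the disambiguation point, an ORCID iD ends in a check character, so a mistyped identifier can be caught before it is attached to the wrong person. The sketch below assumes the published ORCID checksum scheme (ISO 7064 MOD 11-2); the function names are hypothetical and are not part of any ORCID library.

```python
def orcid_check_character(base_digits: str) -> str:
    """Compute the final character of an ORCID iD from its first 15 digits."""
    total = 0
    for char in base_digits:
        total = (total + int(char)) * 2
    result = (12 - total % 11) % 11
    # A result of 10 is written as the letter X.
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the length, the digits and the check character of an ORCID iD."""
    compact = orcid.replace("-", "").upper()
    base = compact[:15]
    if len(compact) != 16 or not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_character(base) == compact[15]


if __name__ == "__main__":
    # 0000-0002-1825-0097 is the sample iD used in ORCID's own documentation.
    print(is_valid_orcid("0000-0002-1825-0097"))
    # Changing the last character breaks the checksum.
    print(is_valid_orcid("0000-0002-1825-0098"))
```

A check like this only catches typing errors; it does not confirm that an iD has actually been registered.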

Francesca then introduced the support provided by projects and infrastructures: FREYA, DARIAH-EU and ORCID. Among the FREYA project pillars (PID graph, PID Commons, PID Forum), the PID Forum is open to anyone interested in identifiers.

The DARIAH-EU infrastructure for Arts and Humanities has recently launched the DARIAH Campus platform, which includes useful resources on PIDs and managing research data (i.e. all materials which are used in supporting research). In 2018 DARIAH also organized a winter school on Open Data Citation, whose resources have been archived online.

Dariah

 

A Publisher’s Perspective

Kath Burton from Routledge Journals emphasised how much use publishers make of digital tools to harvest content, including social media crawlers, data harvesters and third-party feeds.

Kath explained the importance of maximising your impact online when publishing, both before publication (filling in the metadata, giving a meaningful title) and afterwards (linking to the article from social media and websites), as well as how publishers can help support this.

Kath went on to give an example of Taylor & Francis’s interest in the possibilities of online scholarly communication by describing its commitment to publishing 3D models of research objects, which it does via its Sketchfab page.

Breakout Groups

After the presentations and a coffee break there were group discussions about what everyone had just heard. During the first part, the groups were asked what was new to them in the presentations. It was clear from discussions around the room that attendees had heard much which was new to them. For example, some attendees had ORCID IDs but many were surprised at the range of things for which they could be used, such as in journal articles and logging into systems. They were also struck by the range of things in which publishers were interested such as research data. Many were really interested in the use of personal websites to manage their profile.

When asked what tallied with their experiences, it became clear that attendees were keen to engage with these systems, setting up ORCID IDs and Humanities Commons profiles, but that they felt they were too early on in their careers to have anything to contribute to these platforms, which they felt were designed for established researchers. Jane Winters stressed that one could adopt a broad approach to the term ‘publications’, including posters, presentations and blog posts, and encouraged all to share what they had.

Lastly, discussion turned to how the group cites digital resources. This led to an interesting conversation around the citation of archived web pages and how to cite webpages which might change over time, with tools such as the Internet Archive being mentioned. There was also discussion about whether one can cite resources such as Wikipedia, and it was clear that this was not something which had been encouraged. Jonathan, who has researched this subject, mentioned that he had found established academics are happier to cite Wikipedia than those earlier in their career.

Conclusions

The workshop effectively demonstrated the sheer range of online tools, social media forums and publishing venues (both formal and informal) through which historians can communicate their research online. This is both an opportunity and a problem. It is a challenge to develop an online presence - to decide which methods are most appropriate for different kinds of research and different personalities - but that is just the first step. For research communication to be truly valuable, it is necessary to focus your effort, manage your online activities and take control of how you appear to others in digital spaces. PIDs are invaluable in achieving this, and in helping you to establish a personal research profile that stays with you as you move through your career. At the start of the day, the majority of those who attended the workshop did not know very much about PIDs and how you can put them to use, but we hope that they came away with an enhanced understanding of the issues and possibilities, the awareness that it does not take much effort or skill to make a real difference to how you are perceived online, and some practical advice about next steps.

It was apparent that, with some admirable exceptions, neither higher education institutions nor PID organisations are successfully communicating the value and importance of PIDs to early career researchers. Workshop attendees particularly welcomed the opportunity to hear from a publisher and senior academic about how PIDs are used to structure, present and disseminate academic work. The clear link between communicating research online and public engagement also emerged during the course of the day, and there is obvious potential for collaboration between PID organisations and those involved with training focused on impact and public engagement. We ended the day with lots of ideas for further advocacy and training, and a shared appreciation for the value of PIDs for helping historians to reach out to a range of different audiences online.

20 January 2020

Using Transkribus for Arabic Handwritten Text Recognition

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Twitter as @BL_AdiKS.

 

In the last couple of years we’ve teamed up with PRImA Research Lab in Salford to run competitions for automating the transcription of Arabic manuscripts (RASM2018 and RASM2019), in an ongoing effort to identify good solutions for Arabic Handwritten Text Recognition (HTR).

I’ve been curious to test our Arabic materials with Transkribus – one of the leading tools for automating the recognition of historical documents. We’ve already tried it out on items from the Library’s India Office collection as well as early Bengali printed books, and we were pleased with the results. Several months ago the British Library joined the READ-COOP – the cooperative taking up the development of Transkribus – as a founding member.

As with other HTR tools, Transkribus’ HTR+ engine cannot start automatic transcription straight away; it first needs to be trained on a specific type of script and handwriting. This is achieved by creating a training dataset: a transcription of the text on each page, as accurate as possible, and a segmentation of the page into text areas and lines, demarcating the exact location of the text. Training sets therefore comprise a set of images and an equivalent set of XML files containing the location and transcription of the text.
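To make the pairing of location and transcription concrete, here is a minimal sketch of reading one such XML file. It assumes the files follow the PAGE XML format used by PRImA's tools; the namespace version, the file name and the tiny sample document are assumptions made for illustration and are not taken from the dataset.

```python
import xml.etree.ElementTree as ET

# Assumed schema version; real ground-truth files may declare a different one.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Hand-written stand-in for a single ground-truth file (hypothetical content).
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <Coords points="100,100 1900,100 1900,400 100,400"/>
      <TextLine id="r1l1">
        <Coords points="120,120 1880,120 1880,250 120,250"/>
        <TextEquiv><Unicode>first line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="120,260 1880,260 1880,390 120,390"/>
        <TextEquiv><Unicode>second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def parse_points(points: str) -> list:
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_lines(xml_text: str) -> list:
    """Return the id, outline polygon and transcription of every text line."""
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.iterfind(".//pc:TextLine", NS):
        coords = line.find("pc:Coords", NS)
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", NS)
        lines.append({
            "id": line.get("id"),
            "polygon": parse_points(coords.get("points")) if coords is not None else [],
            "text": (unicode_el.text or "") if unicode_el is not None else "",
        })
    return lines


if __name__ == "__main__":
    for entry in read_lines(SAMPLE):
        print(entry["id"], entry["polygon"][0], entry["text"])
```

Each line of text is thus stored together with the polygon that outlines it on the page image, which is what allows an engine to learn from image and transcription in tandem.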

A screenshot from Transkribus, showing the segmentation and transcription of a page from Add MS 7474.

 

This process can be done in Transkribus, but in this case I already had a training set created using PRImA’s software Aletheia. I used the dataset created for the competitions mentioned above: 120 transcribed and ground-truthed pages from eight manuscripts digitised and made available through QDL. This dataset is now freely accessible through the British Library’s Research Repository.

Transkribus recommends creating a training set of at least 75 pages (between 5,000 and 15,000 words); however, I was interested to find out a few things. First, the methods submitted for the RASM2019 competition worked on a training set of 20 pages, with an evaluation set of 100 pages, so I wanted to see how Transkribus’ HTR+ engine dealt with the same scenario. It should be noted that the RASM2019 methods were evaluated using PRImA’s evaluation methods, whereas the results shown here come from Transkribus’ own evaluation method. The two sets of results are therefore not directly comparable, but they give some idea of how Transkribus performed on the same training set.

I created four different models to see how Transkribus’ recognition algorithms deal with a growing training set. The models were created as follows:

  • Training set of 20 pages, and evaluation set of 100 pages
  • Training set of 50 pages, and evaluation set of 70 pages
  • Training set of 75 pages, and evaluation set of 45 pages
  • Training set of 100 pages, and evaluation set of 20 pages
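All four splits partition the same 120-page dataset, so the evaluation set is simply whatever is left once the training pages are taken out. A minimal sketch of that arithmetic follows; the page identifiers and the rule of taking the first pages for training are assumptions made for illustration, since the post does not say which pages went into which set.

```python
TOTAL_PAGES = 120
TRAINING_SIZES = (20, 50, 75, 100)


def split_pages(page_ids: list, training_size: int) -> tuple:
    """Return (training, evaluation) lists that together cover every page once."""
    # Assumption: the first `training_size` pages are used for training.
    return page_ids[:training_size], page_ids[training_size:]


# Hypothetical page identifiers standing in for the 120 ground-truthed pages.
page_ids = [f"page_{n:03d}" for n in range(1, TOTAL_PAGES + 1)]
splits = {size: split_pages(page_ids, size) for size in TRAINING_SIZES}

if __name__ == "__main__":
    for size, (training, evaluation) in splits.items():
        print(f"training: {len(training):3d} pages, evaluation: {len(evaluation):3d} pages")
```

Because the evaluation set shrinks as the training set grows, the four error rates are measured on different pages, which is worth bearing in mind when comparing them.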

The graphs below show each of the four iterations, from top to bottom:

CER of 26.80% for a training set of 20 pages

CER of 19.27% for a training set of 50 pages

CER of 15.10% for a training set of 75 pages

CER of 13.57% for a training set of 100 pages

The results can be summed up in a table:

| Training Set (pp.) | Evaluation Set (pp.) | Character Error Rate (CER) | Character Accuracy |
|---|---|---|---|
| 20 | 100 | 26.80% | 73.20% |
| 50 | 70 | 19.27% | 80.73% |
| 75 | 45 | 15.10% | 84.90% |
| 100 | 20 | 13.57% | 86.43% |

 

Indeed, the accuracy improved with each iteration of training: the more training data the neural networks in Transkribus’ HTR+ engine have, the better the results. With a training set of 100 pages, Transkribus managed to automatically transcribe the remaining 20 pages with an accuracy rate of 86.43%, which is pretty good for historical handwritten Arabic script.
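For readers unfamiliar with the metric, character error rate is commonly defined as the edit distance between the automatic transcription and the ground truth, divided by the length of the ground truth, and the character accuracy reported above is 100% minus the CER. The sketch below follows that common definition; it is not Transkribus' own evaluation code, which may differ in details such as text normalisation.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of characters in the ground truth."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def character_accuracy(cer_percent: float) -> float:
    """Accuracy in percent, given a CER in percent."""
    return round(100.0 - cer_percent, 2)


if __name__ == "__main__":
    # CER values reported in the post, one per training-set size.
    for cer in (26.80, 19.27, 15.10, 13.57):
        print(f"CER {cer:.2f}% -> accuracy {character_accuracy(cer):.2f}%")
```

Because insertions count as errors, a CER computed this way can exceed 100% when the automatic transcription is much longer than the ground truth.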

As a next step, we could consider (1) adding more ground-truthed pages from our manuscripts to increase the size of the training set, and by that improve HTR accuracy; (2) adding other open ground truth datasets of handwritten Arabic to the existing training set, and checking whether this improves HTR accuracy; and (3) running a few manuscripts from QDL through Transkribus to see how its HTR+ engine transcribes them. If accuracy is satisfactory, we could see how to scale this up and make those transcriptions openly available and easily accessible.

In the meantime, I’m looking forward to participating in the OpenITI AOCP workshop entitled “OCR and Digital Text Production: Learning from the Past, Fostering Collaboration and Coordination for the Future,” taking place at the University of Maryland next week, and catching up with colleagues on all things Arabic OCR/HTR!

 

13 December 2019

Do you want to see my butterfly collection?

Posted on behalf of Sara Lucas Agutoli, artist, associate professor at the Accademia di Belle Arti di Bologna, BL Labs Artist in Residence and runner-up in the BL Labs Artistic Award 2019.

Sara Lucas Agutoli
Artist: Sara Lucas Agutoli
(Copyright: Ilenia Arosio)

Sara Lucas Agutoli lives and works between London and Bologna.  Her academic research focuses on the concepts of true and false in art, in particular in photography. In her art S. L. Agutoli merges popular themes with a learned and symbolic system of citations. Working with different media, she reflects on the idea of ongoing transformation – of the spaces, of the body, as well as of aesthetics – and creates personal architectures drawing on her inner experiences, knowledge and visions.

When occupied with my full-time job, I often spend time wandering the net, looking for pictures that spark my interest, either because they are odd and curious or because they are aesthetically pleasing and elegant.

Since 2011 I’ve enjoyed calling myself a cyber-flâneur¹. Unlike the Parisian strollers described by Baudelaire, I walked through cyber avenues, getting lost amid different digital archives. I glimpsed through collections of images instead of windows, and stared at close-ups of manuscripts instead of sunsets on rivers. The net was my city and I just followed my nose walking through it. I wanted to make my curiosity an aesthetic operation. In doing so I’ve come to believe that online archives are my personal church of Saint-Julien-le-Pauvre, the chosen venue for my cyber-dadaist performances (see: https://www.moma.org/collection/works/184056).

For years my working activity followed a pattern: a few months of research, during which I spent hours and hours on Flickr Commons browsing the online archives of museums and institutions and saving selected images to my hard disk, followed by months in the studio working creatively with the pictures I had accumulated.

I accumulated images and emotions, from advertising to family album pictures. I wanted to explore how photography was used in different parts of the world, in different eras and in different economic contexts.

In 2011, while in Montreal for my first art residency, I analysed the different uses of vernacular photography in the 50s in North America and Italy. To do so, I used the open archives of many North American libraries and institutions (New York Public Library, Congregation of Sisters of St. Joseph in Canada, California Historical Society and many others) and a private physical archive located in a tin box in my grandmother’s house.

This led to a series of pictures inspired by this contrast. The series was exhibited in a solo show called Fermez les yeux.

Sara Mickey: Fermez les yeux

The vastness and the richness of topics of the images I accumulated constantly triggered my creativity and my sense of humour. They often made me ask myself “why do those pictures exist?”

The images, especially the more vernacular, random and unforeseen ones, became the objets trouvés I could rework using my imagination and reality.

During this dadaist-inspired net-surfing, the most fertile encounter of recent years has been with the collections of two major London institutions: the British Library and the Wellcome Collection digital archives.

I was about to move from Italy to London and so my artistic research was about to change, inspired by this encounter.

I started to become interested in the aesthetics of the Victorian era and in the concept of the museum as an extension of a wunderkammer.

I started collecting images of naturalia² and decided to transform them into artificialia in my studio. And so I did, creatively merging and morphing these images. In 2013 I produced a digital collage of a scientific illustration of a butterfly and a medical lithograph of a vulva, which was exhibited in public space in Bologna during the CHEAP poster festival.

Cheap Poster Festival
Posters as part of the CHEAP poster festival

This collage of images from the British Library and the Wellcome Collection became the first piece of the larger project Il muro delle meraviglie – the wall of wonders – for which I chose to use the wall of my living room in my home/atelier in NW London.

Il muro delle meraviglie started as a joke to mock the colonialist aesthetic of Victorian museum collections and it became a work of art. Among the wonders I subsequently added, you can find that first collage of the butterfly and the vulva, which I decided to call “Do you want to see my butterfly collection?” to make my queer/feminist perspective encounter the delicacy of naturalistic illustrations of butterflies.

The title, in Italian, refers to an apparently naïve question which carries an explicit sexual allusion.

The person who says “come and see my butterfly collection” might be suggesting it in order to obtain something more, as the butterfly is used as a metaphor for the female sex.

Sara Deep Thrash
Installation at DEEP THRASH

This work criticises the male chauvinist obsession with cataloguing, understood as an activity aimed more at showing off than at simply showing.

It represents a feminist critique and re-appropriation of such images.

Here the butterflies become proper “c*nts” and give visibility to the female genitalia.

It was first exhibited in 2013 on the streets of Bologna (IT) during the CHEAP festival, and at a queer demonstration thanks to C*ntemporary.

If I hadn’t had access to the BL and the Wellcome digital archives, none of this would have been possible.

Finally, I would like to thank BL Labs for the support I have received, and I am excited about the new experiments and projects waiting for me around the corner.

Footnotes

  1. Flâneur: a French term meaning ‘stroller’ or ‘loafer’, used by the nineteenth-century French poet Charles Baudelaire to identify an observer of modern urban life. Dada raised the tradition of flânerie to the level of an aesthetic operation. The Parisian walk described by Walter Benjamin in the 1920s is utilized as an art form that inscribes itself directly in real space and time, rather than on a medium.
  2. Naturalia: creatures and natural objects, with a particular interest in monsters.