Digital scholarship blog

27 January 2020

How historians can communicate their research online

This blog post is by Jonathan Blaney (Institute of Historical Research), Frances Madden (British Library), Francesca Morselli (DANS) and Jane Winters (School of Advanced Study, University of London)

This blog post will also be published in several other locations, including the FREYA blog and the IHR blog

Large satellite receiver
Source: Joshua Hoehne, Unsplash

On 4 December 2019, the FREYA project, in collaboration with the UCL Centre for Digital Humanities, the Institute of Historical Research, the British Library and DARIAH-EU, organized a workshop in London on identifiers in research. Aimed mainly at historians and other humanities scholars, the workshop focused on ways in which researchers can build and manage an online profile, using tools such as ORCID IDs. It also covered best practices and methods for citing digital resources, to make humanities researchers' work connected and discoverable to others. The workshop had 20 attendees, mainly PhD students from the London area but also curators and independent researchers.

Presentations

Frances Madden from the British Library introduced the day, which was supported by the FREYA project, funded under the EU’s Horizon 2020 programme. FREYA aims to increase the use of persistent identifiers (PIDs) across the research landscape by building up services and infrastructure. The British Library is leading on the humanities and social sciences aspect of this work.

Frances described how PIDs are central to making scholarly communication effective and easy online. We will need PIDs not just for publications but for grey literature, data, blog posts, presentations and more. Learning about and using PIDs is clearly a challenge for historians, and the workshop was a contribution to that effort.

PIDs: some historical context

Jonathan Blaney from the Institute of Historical Research observed that citation has a history of its own: persistent identifiers have grown up around traditional forms of print citation, and they are almost invisible to us because they are deeply familiar. He gave an example of a reference to the gospel story of the woman taken in adultery:

John 7:53-8:11

There are three conventions here: the name ‘John’ (attached to this gospel since about the 2nd century), the chapter divisions (medieval, and ascribed to the English bishop Stephen Langton) and the verse divisions (from the middle of the 16th century).

When learning new forms of referencing, such as the ones under discussion at the workshop, Jonathan suggested that historians should remember that their implicit knowledge was itself learned. He finished with an anecdote about Harry Belafonte, retold in Anthony Grafton’s The Footnote: A Curious History. As a young sailor Belafonte wanted to follow up on references in a book he had read. The next time he was on shore leave he went to a library and told the librarian:

“Just give me everything you’ve got by Ibid.”

People in conference room watching a presentation

Demonstrating the benefits

Prof Jane Winters from the School of Advanced Study introduced what she claimed was her most egotistical presentation by explaining her own choices in curating her online presence, and also what was beyond her control. She showed the different results of web searches for herself using Google and DuckDuckGo and pointed out how things she had almost forgotten about can still feature prominently in results.

Jane described her own use of Twitter, and highlighted both the benefits and challenges of using social media to communicate research and build an online profile. It was the relatively rigid format of her institutional staff profile that led her to create her own website. Although Jane has an ORCID ID and a page on Humanities Commons, for example, there are many online services she has chosen not to use, such as academia.edu.

This is all very much a matter of personal choice, dependent upon people’s own tastes and willingness to engage with a particular service.

How to use what’s available

Francesca Morselli from DANS gave a presentation providing useful resources about identifiers for researchers, as well as explaining in a simple yet thorough way how they "work" and the rationale behind them.

Most importantly PIDs ensure:

  1. Citability and discoverability (both for humans and machines)
  2. Disambiguation (between similar objects)
  3. Linking to related resources
  4. Long-term archiving and findability
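Identifiers are also designed to be robust against transcription mistakes. As an illustrative aside, an ORCID iD's final character is a checksum calculated with the ISO 7064 MOD 11-2 algorithm, so a mistyped iD can usually be detected without even contacting the registry. A minimal Python sketch (the sample iD is the well-known example from ORCID's own documentation):

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check digit for the first 15 digits of an ORCID iD."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid: str) -> bool:
    """Validate an ORCID iD written in the usual 0000-0002-1825-0097 form."""
    compact = orcid.replace("-", "")
    if len(compact) != 16:
        return False
    return orcid_check_digit(compact[:15]) == compact[15]

print(is_valid_orcid("0000-0002-1825-0097"))  # True
print(is_valid_orcid("0000-0002-1825-0098"))  # False: wrong check digit
```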

Francesca then introduced the support provided by projects and infrastructures: FREYA, DARIAH-EU and ORCID. Among the FREYA project pillars (PID graph, PID Commons, PID Forum), the latter is available for anyone interested in identifiers.

The DARIAH-EU infrastructure for Arts and Humanities has recently launched the DARIAH Campus platform which includes useful resources on PIDs and managing research data (i.e. all materials which are used in supporting research). In 2018 DARIAH also organized a winter school on Open Data Citation, whose resources are archived here.

Dariah

 

A Publisher’s Perspective

Kath Burton from Routledge Journals emphasised how much use publishers make of digital tools to harvest content, including social media crawlers, data harvesters and third-party feeds.

Kath explained the importance of maximising your impact online when publishing, both before publication (filling in the metadata, giving a meaningful title) and afterwards (linking to the article from social media and websites), as well as how publishers can help support this.

Kath went on to give an example of Taylor & Francis’s interest in the possibilities of online scholarly communication by describing its commitment to publishing 3D models of research objects, which it does via its Sketchfab page.

Breakout Groups

After the presentations and a coffee break there were group discussions about what everyone had just heard. During the first part, the groups were asked what was new to them in the presentations. It was clear from discussions around the room that attendees had heard much that was new to them. For example, some attendees had ORCID IDs, but many were surprised at the range of things for which they could be used, from journal articles to logging into systems. They were also struck by the range of things in which publishers were interested, such as research data. Many were really interested in the use of personal websites to manage their profiles.

When asked what tallied with their experiences, it became clear that attendees were keen to engage with these systems, setting up ORCID IDs and Humanities Commons profiles, but that they felt they were too early in their careers to have anything to contribute, and that the platforms were designed for established researchers. Jane Winters stressed that one could take a broad approach to the term ‘publications’, including posters, presentations and blog posts, and encouraged all to share what they had.

Lastly, discussion turned to how the group cites digital resources. This led to an interesting conversation about the citation of archived web pages and how to cite webpages which might change over time, with tools such as the Internet Archive being mentioned. There was also discussion about whether one can cite resources such as Wikipedia, and it was clear that this was not something which had been encouraged. Jonathan, who has researched this subject, mentioned that he had found established academics are happier to cite Wikipedia than those earlier in their careers.

Conclusions

The workshop effectively demonstrated the sheer range of online tools, social media forums and publishing venues (both formal and informal) through which historians can communicate their research online. This is both an opportunity and a problem. It is a challenge to develop an online presence - to decide which methods are most appropriate for different kinds of research and different personalities - but that is just the first step. For research communication to be truly valuable, it is necessary to focus your effort, manage your online activities and take control of how you appear to others in digital spaces. PIDs are invaluable in achieving this, and in helping you to establish a personal research profile that stays with you as you move through your career. At the start of the day, the majority of those who attended the workshop did not know very much about PIDs and how you can put them to use, but we hope that they came away with an enhanced understanding of the issues and possibilities, the awareness that it does not take much effort or skill to make a real difference to how you are perceived online, and some practical advice about next steps.

It was apparent that, with some admirable exceptions, neither higher education institutions nor PID organisations are successfully communicating the value and importance of PIDs to early career researchers. Workshop attendees particularly welcomed the opportunity to hear from a publisher and senior academic about how PIDs are used to structure, present and disseminate academic work. The clear link between communicating research online and public engagement also emerged during the course of the day, and there is obvious potential for collaboration between PID organisations and those involved with training focused on impact and public engagement. We ended the day with lots of ideas for further advocacy and training, and a shared appreciation for the value of PIDs for helping historians to reach out to a range of different audiences online.

20 January 2020

Using Transkribus for Arabic Handwritten Text Recognition

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Twitter as @BL_AdiKS.

 

In the last couple of years we’ve teamed up with PRImA Research Lab in Salford to run competitions for automating the transcription of Arabic manuscripts (RASM2018 and RASM2019), in an ongoing effort to identify good solutions for Arabic Handwritten Text Recognition (HTR).

I’ve been curious to test our Arabic materials with Transkribus – one of the leading tools for automating the recognition of historical documents. We’ve already tried it out on items from the Library’s India Office collection as well as early Bengali printed books, and we were pleased with the results. Several months ago the British Library joined the READ-COOP – the cooperative taking up the development of Transkribus – as a founding member.

As with other HTR tools, Transkribus’ HTR+ engine cannot start automatic transcription straight away, but first needs to be trained on a specific type of script and handwriting. This is achieved by creating a training dataset – a transcription of the text on each page, as accurate as possible, and a segmentation of the page into text areas and lines, demarcating the exact location of the text. Training sets therefore comprise a set of images and an equivalent set of XML files containing the location and transcription of the text.
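To make the structure of these XML files concrete, here is a minimal sketch of reading one. It assumes the PAGE XML format (2013-07-15 schema) produced by tools such as Aletheia; the sample document, its filename and its single line of text are invented for illustration:

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical PAGE XML document: one region containing one text line,
# with pixel coordinates and a ground-truth transcription.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <Coords points="100,100 1900,100 1900,400 100,400"/>
      <TextLine id="r1l1">
        <Coords points="100,100 1900,100 1900,200 100,200"/>
        <TextEquiv><Unicode>بسم الله الرحمن الرحيم</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def extract_lines(xml_text):
    """Return (line id, polygon coordinates, transcription) for each TextLine."""
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.iterfind(".//pc:TextLine", NS):
        coords = line.find("pc:Coords", NS).get("points")
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", NS)
        text = unicode_el.text if unicode_el is not None else ""
        lines.append((line.get("id"), coords, text))
    return lines

for line_id, coords, text in extract_lines(SAMPLE):
    print(line_id, coords, text)
```

The pairing of coordinates and text is what lets an HTR engine learn both where writing sits on the page and what it says.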

A screenshot from Transkribus, showing the segmentation and transcription of a page from Add MS 7474

 

This process can be done in Transkribus, but in this case I already had a training set created using PRImA’s software Aletheia. I used the dataset created for the competitions mentioned above: 120 transcribed and ground-truthed pages from eight manuscripts digitised and made available through the Qatar Digital Library (QDL). This dataset is now freely accessible through the British Library’s Research Repository.

Transkribus recommends creating a training set of at least 75 pages (between 5,000 and 15,000 words); however, I was interested to find out a few things. First, the methods submitted for the RASM2019 competition worked on a training set of 20 pages, with an evaluation set of 100 pages, so I wanted to see how Transkribus’ HTR+ engine dealt with the same scenario. It should be noted that the RASM2019 methods were scored using PRImA’s evaluation methods, whereas Transkribus uses its own evaluation method – the results shown here are therefore not strictly comparable, but they give some idea of how Transkribus performed on the same training set.

I created four different models to see how Transkribus’ recognition algorithms deal with a growing training set. The models were created as follows:

  • Training model of 20 pages, and evaluation set of 100 pages
  • Training model of 50 pages, and evaluation set of 70 pages
  • Training model of 75 pages, and evaluation set of 45 pages
  • Training model of 100 pages, and evaluation set of 20 pages
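The results below are reported as character error rate (CER): the character-level edit distance between the ground-truth transcription and the HTR output, divided by the length of the ground truth. As an illustrative sketch (not Transkribus' actual implementation), CER can be computed like this:

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edit distance normalised by reference length."""
    return levenshtein(reference, hypothesis) / len(reference)

# One deleted character in a four-character reference gives a CER of 25%.
print(cer("كتاب", "كتب"))  # 0.25
```

Character accuracy, as reported in the table further down, is simply 100% minus the CER.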

The graphs below show each of the four iterations, from top to bottom:

  • CER of 26.80% for a training set of 20 pages
  • CER of 19.27% for a training set of 50 pages
  • CER of 15.10% for a training set of 75 pages
  • CER of 13.57% for a training set of 100 pages

The results can be summed up in a table:

| Training Set (pp.) | Evaluation Set (pp.) | Character Error Rate (CER) | Character Accuracy |
| ------------------ | -------------------- | -------------------------- | ------------------ |
| 20                 | 100                  | 26.80%                     | 73.20%             |
| 50                 | 70                   | 19.27%                     | 80.73%             |
| 75                 | 45                   | 15.10%                     | 84.90%             |
| 100                | 20                   | 13.57%                     | 86.43%             |

 

Indeed, the accuracy improved with each iteration of training – the more training data the neural networks in Transkribus’ HTR+ engine have, the better the results. With a training set of 100 pages, Transkribus managed to transcribe the remaining 20 pages automatically with an accuracy rate of 86.43% – which is pretty good for historical handwritten Arabic script.

As a next step, we could consider (1) adding more ground-truthed pages from our manuscripts to increase the size of the training set, and by that improve HTR accuracy; (2) adding other open ground truth datasets of handwritten Arabic to the existing training set, and checking whether this improves HTR accuracy; and (3) running a few manuscripts from QDL through Transkribus to see how its HTR+ engine transcribes them. If accuracy is satisfactory, we could see how to scale this up and make those transcriptions openly available and easily accessible.

In the meantime, I’m looking forward to participating at the OpenITI AOCP workshop entitled “OCR and Digital Text Production: Learning from the Past, Fostering Collaboration and Coordination for the Future,” taking place at the University of Maryland next week, and catching up with colleagues on all things Arabic OCR/HTR!