Our highlights from Digital Humanities 2019: Rossitza and Daniel
We've put together a series of posts about our experiences at the Digital Humanities conference in Utrecht this month. In this post, Digital Curator Dr Rossitza Atanassova and Daniel Van Strien from the British Library / Alan Turing Institute's Living with Machines project share their impressions. See also Mia and Yann's post, and Nora and Giorgia's post.
Rossitza Atanassova
I loved the variety of topics and formats in the conference programme, and I have tweeted about some of the most interesting talks I attended. I have to say that moving between sessions was a bit complicated by the proliferation of stairs and escalators in the venue, which otherwise offered great views of Utrecht and comfy cushions to relax on during lunch! Like Mia and Nora I was inspired by the @LibsDH meetup, whilst my most surprising encounter was with the winning skeleton-poster.
Of particular interest to me were the sessions on digitised newspapers and the related conversations between researchers and collection-holding institutions. Back in the office I will reflect on some of the discussions and will continue to engage with the ‘Researchers & Libraries working together on improving digitised newspapers’ and the Digital Historical Periodica Groups. Many of the talks illustrated the importance of semantic annotations for the synoptic examination of historical periodicals, and I hope to apply at work what I learned in the excellent pre-conference workshop on Named Entity Processing delivered by @ImpressoProject.
I also really enjoyed the panel on Exploring AV Corpora in the Humanities, in particular the presentation on the Distant Viewing Toolkit (DVT) for the Cultural Analysis of Moving Images. Outside the conference, I had fun walking along the artistic light-themed route to explore Utrecht city centre. I enjoyed the conference so much that I have submitted a DH2020 reviewer self-nomination!
Daniel Van Strien
I thought I would focus on a couple of sessions relating to OCR at the conference that I would like to follow up on as part of the Living with Machines project. In particular, I am keen to explore two OCR tools further: Transkribus and Kraken.
Transkribus was discussed in the context of doing OCR on newspapers as part of the Impresso project in the paper ‘Improving OCR of Black Letter in Historical Newspapers: The Unreasonable Effectiveness of HTR Models on Low-Resolution Images’. Although I had previously heard about the tool, I was particularly interested to hear how it was being used with newspapers, as I had primarily heard about its use in handwritten text recognition. The paper also gave an initial idea of how much ground truth data might need to be generated before training a new OCR engine for newspaper text. As part of the Impresso project, 167 pages of ground truth data were created: not trivial by any means, but much less than might be expected. With this amount of data the project was able to achieve a substantial improvement in OCR quality over various versions of ABBYY software.
The second tool was Kraken, which was introduced in the paper ‘Kraken - an Universal Text Recognizer for the Humanities’. I was particularly interested to hear how easily this tool could be trained with new annotations to recognise new types and languages. For the most part Living with Machines will rely on previously generated OCR, but there may be occasions when it is worth investing time to try to produce more accurate OCR. For these occasions, testing Kraken further would be a good starting point, particularly because of how easily it can be trained on ground truth annotated at the line rather than the word level. This makes annotating the ground truth data (a little) less painful and time-consuming.