THE BRITISH LIBRARY

Digital scholarship blog

206 posts categorized "Data"

24 March 2021

Welcome to the British Library’s new Wikimedian in Residence

Hello, I’m Dr Lucy Hinnie and I’ve just joined the Digital Scholarship team as the new Wikimedian-in-Residence, in conjunction with Wikimedia UK and the Eccles Centre. My role is to work with the Library to develop and support colleagues with projects using Wikidata, Wikibase and Wikisource.

Bringing underrepresented people and marginalised communities to the fore is a huge part of this remit, and I am looking to be as innovative in our partnerships as we can be, with a view to furthering the movement towards decolonisation. I’m going to be working with curators and members of staff throughout the Library to identify and progress opportunities to accelerate this work.

I have recently returned from a two-year stay in Canada, where I lived and worked on Treaty Six territory and the homeland of the Métis. Working and living in Saskatchewan was a hugely formative experience for me, and highlighted the absolute necessity of forward-thinking, reconciliatory work in decolonisation.

Picture of two black bear sculptures in the snow at Wanuskewin Heritage Park
Wanuskewin Heritage Park, Saskatoon, December 2020

2020 was my year of immersion in Wikimedia – I participated in a number of events, including outreach work by Dr Erin O’Neil at the University of Alberta, Women in Red edit-a-thons with Ewan McAndrew at the University of Edinburgh and the Unfinished Business edit-a-thon run by Leeds Libraries and the British Library. In December 2020 I coordinated and ran my own Wikithon in conjunction with the National Library of Scotland, as part of my postdoctoral project ‘Digitising the Bannatyne MS’.

Page from the Bannatyne Manuscript, stating 'heir begynnys ane ballat buik [writtin] in the yeir of god 1568'
Front page of the Bannatyne MS, National Library of Scotland, Adv MS 1.1.6. (CC BY 4.0)

Since coming into post at the start of this March I have worked hard to make connections with organisations such as IFLA, Code the City and Art+Feminism. I’ve also been creating introductory materials to engage audiences with Wikidata, and thinking about how best to utilise the coming months.

Andrew Gray took up post as the first British Library Wikipedian in Residence nearly ten years ago; you can read more about this earlier residency here and here. Much has changed since then, but reflection on the legacy of Wikimedia activity is a crucial part of ensuring that the work we do is useful, engaging, vibrant and important. I want to use creative thinking to produce output that opens up BL digital collections in relevant, culturally sensitive and engaging ways.

I am excited to get started! I’ll be blogging here regularly about my residency, so please do subscribe to this blog to follow my progress.

This post is by Wikimedian in Residence Lucy Hinnie (@BL_Wikimedian)

19 February 2021

AURA Research Network Second Workshop Write-up

Keen followers of this blog may remember a post from last December, which shared details of a virtual workshop about AI and Archives: Current Challenges and Prospects of Digital and Born-digital archives. This topic was one of three workshop themes identified by the Archives in the UK/Republic of Ireland & AI (AURA) network, which is a forum for promoting discussion of how Artificial Intelligence (AI) can be applied to cultural heritage archives, and for exploring issues with providing access to born-digital and hybrid digital/physical collections.

The first AURA workshop, on Open Data versus Privacy, organised by Annalina Caputo from Dublin City University, took place on 16-17 November 2020. Rachel MacGregor provides a great write-up of this event here.

Here at the British Library, we teamed up with our friends at The National Archives to curate the second AURA workshop, exploring the current challenges and prospects of born-digital archives, which took place online on 28-29 January 2021. The first day of the workshop, held on 28 January, was organised by The National Archives; you can read more about this day here. The following day, 29 January, was organised by the BL; videos and slides for this can be found on the AURA blog, and I've included them in this post.

AURA

The format for both days of the second AURA workshop comprised four short presentations, two interactive breakout room sessions and a wider round-table discussion. The aim was for the event to generate dialogue around key challenges that professionals across all sectors are grappling with, with a view to identifying possible solutions.

The first day covered issues of access from both infrastructural and users’ perspectives, plus the ethical implications of the use of AI and advanced computational approaches to archival practices and research. The second day discussed challenges of access to email archives, and also issues relating to web archives and emerging format collections, including web-based interactive narratives. A round-up of the second day is below, including recorded videos of the presentations for anyone unable to attend on the day.

Kicking off day two, a warm welcome to the workshop attendees was given by Rachel Foss, Head of Contemporary Archives and Manuscripts at the British Library, Larry Stapleton, Senior academic and international consultant from the Waterford Institute of Technology, and Mathieu d’Aquin, Professor of Informatics at the National University of Ireland Galway.

The morning session on Email Archives: challenges of access and collaborative initiatives was chaired by David Kirsch, Associate Professor, Robert H. Smith School of Business, University of Maryland. This featured two presentations:

The first of these was about Working with ePADD: processes, challenges and collaborative solutions in working with email archives, by Callum McKean, Curator for Contemporary Literary and Creative Archives, British Library and Jessica Smith, Creative Arts Archivist, John Rylands Library, University of Manchester. Their slides can be viewed here and here. Apologies that the recording of Callum's talk is clipped; this was due to connectivity issues on the day.

The second presentation was Finding Light in Dark Archives: Using AI to connect context and content in email collections by Stephanie Decker, Professor of History and Strategy, University of Bristol and Santhilata Venkata, Digital Preservation Specialist & Researcher at The National Archives in the UK.

After their talks, the speakers proposed questions and challenges that attendees could discuss in smaller break-out rooms. Questions given by speakers of the morning session were:

  1. Are there any other appraisal or collaborative considerations that might improve our practices and offer ways forward?
  2. What do we lose by emphasizing usability for researchers?
  3. Should we start with how researchers want to use email archives now and in the future, rather than just on preservation?
  4. Potentialities of email archives as organizational, not just individual?

These questions led to discussions about file formats, collection sizes, metadata standards and ways to interpret large data sets. There was interest in how email archives might allow researchers to reconstruct corporate archives, e.g. to understand the social dynamics of the office and decision-making processes. It was felt that there is a need to understand the extent to which email represents organisation-level context. More questions were raised, including:

  • To what extent is it part of the organisational records and how should it be treated?
  • How do you manage the relationship between constant organisational functions and structure (a CEO) and changing individuals?
  • Who will be looking at organisational email in the future and how?

It was mentioned that there is a need to distinguish between email as data and email as an artifact, as the use-cases and preservation needs may be markedly different.

The duties of care that exist between depositors, tool designers, archivists and researchers were discussed, and a question was asked about how we balance these:

  • Managing human burden
  • Differing levels of embargo
  • Institutional frameworks

There was discussion of the research potential for comparing email and social media collections, e.g. tweet archives, and also of the difficulties researchers face in getting access to data sets. The monetary value of email archives was also raised, and it was mentioned that perceived value hasn’t been translated into monetary value.

Researcher needs and metadata were another topic brought up by attendees. It was suggested that the information about collections in online catalogues needs to be descriptive enough for researchers to decide whether they wish to visit an institution to view digital collections on a dedicated terminal. It was also suggested that archives and libraries need to make access restrictions, and the reasoning for these, very clear to users. This would help to manage expectations, so that researchers will know when to visit on-site because remote access is not possible. It was mentioned that it is challenging to identify use cases, but it was noted that without a deeper understanding of researcher needs, it can be hard to make decisions about access provision.

It was acknowledged that the demands on human processing are still high for born-digital archives, and that the relationship between tools and professionals is still emergent. So there was a question about whether researchers could be involved in collaborations more, and to what extent the onus will fall on them for responsibilities and liabilities in relation to the usage of born-digital archives.

Lots of food for thought before the break for lunch!

The afternoon session, chaired by Nicole Basaraba, Postdoctoral Researcher, Studio Europa, Maastricht University, discussed Emerging Formats, Interactive Narratives and Socio-Cultural Questions in AI.

The first afternoon presentation Collecting Emerging Formats: Capturing Interactive Narratives in the UK Web Archive was given by Lynda Clark, Post-doctoral research fellow in Narrative and Play at InGAME: Innovation for Games and Media Enterprise, University of Dundee, and Giulia Carla Rossi, Curator for Digital Publications, British Library. Their slides can be viewed here.  

The second afternoon presentation was Women Reclaiming AI: a collectively designed AI Voice Assistant by Coral Manton, Lecturer in Creative Computing, Bath Spa University. Her slides can be seen here.

Following the same format as in the morning, after these presentations, the speakers proposed questions and challenges that attendees could discuss in smaller break-out rooms. Questions given by speakers of the afternoon session were:

  1. Should we be collecting examples of AIs, as well as using AI to preserve collections? What are the implications of this?
  2. How do we get more people to feel that they can ask questions about AI?
  3. How do we use AI to think about the complexity of what identity is and how do we engineer it so that technologies work for the benefit of everyone?

There was a general consensus that AI is becoming a significant and pervasive part of our lives. However, it was felt that there are many aspects we don't fully understand. In the breakout groups, workshop participants raised more questions, including:

  • Where would AI-based items sit in collections?
  • Why do we want it?
  • How to collect?
  • What do we want to collect? User interactions? The underlying technology? Many are patented technologies owned by corporations, so this makes it challenging. 
  • What would make AI more accessible?
  • Some research outputs may be AI-based - do we need to collect all the code, or just the end experience produced? If the latter, could this be similar to documenting evidence, e.g. video/sound recordings or transcripts?
  • Could or should we use AI to collect? Who’s behind the AI? Who gets to decide what to archive and how? Who’s responsible for mistakes/misrepresentations made by the AI?

There was debate about how to define AI in terms of a publication/collection item. It was felt that an understanding of this would help to decide what archives and libraries should be collecting, and to understand what is not being collected currently. It was mentioned that user input is a critical factor in answering questions like this. A number of challenges of collecting using AI were raised in the group discussions, including:

  • Lack of standardisation in formats and metadata
  • Questions of authorship and copyright
  • Ethical considerations
  • Engagement with creators/developers

It was suggested that full scale automation is not completely desirable and some kind of human element is required for specialist collections. However, AI might be useful for speeding up manual human work.

There was discussion of problems of bias in data, that existing prejudices are baked into datasets and algorithms. This led to more questions about:

  • Is there a role for curators in defining and designing unbiased and more representative data sets to more fairly reflect society?
  • Should archives collect training data, to understand underlying biases?
  • Who is the author of AI-created text and dialogue? Who is the legally responsible person/organisation?
  • What opportunities are there for libraries and archives to teach people about digital safety through understanding datasets and how they are used?

Participants also questioned:

  • Why do we humanise AI?
  • Why do we give AI a gender?
  • Is society ready for a genderless AI?
  • Could the next progress in AI be a combination of human/AI? A biological advancement? Human with AI “components” - would that make us think of AIs as fallible?

With so many questions and a lack of answers, it was felt that fiction may also help us to better understand some of these issues, and Rachel Foss ended the roundtable discussion by saying that she is looking forward to reading Kazuo Ishiguro’s new novel Klara and the Sun, about an artificial being called Klara who longs to find a human owner, which is due to be published next month by Faber.

Thanks to everyone who spoke at and participated in this AURA workshop, to make it a lively and productive event. Extra special thanks to Deirdre Sullivan for helping to run the online event smoothly. Looking ahead, the third workshop on Artificial Intelligence and Archives: What comes next? is being organised by the University of Edinburgh in partnership with the AURA project team, and is scheduled to take place on Tuesday 16 March 2021. Please do join the AURA mailing list and follow #AURA_network on social media to be part of the network's ongoing discussions.

This post is by Digital Curator Stella Wisdom (@miss_wisdom)

11 February 2021

Investigating Instances of Arabic Verb Form X in the BL/QFP Translation Memory

The Arabic language has a root+pattern morphology where words are formed by casting a (usually 3-letter) root into a morphological template of affixed letters in the beginning, middle and/or end of the word. While most of the meaning comes from the root, the template itself adds a layer of meaning. For our latest Hack Day, I investigated uses of Arabic Verb Form X (istafʿal) in the BL/QFP Translation Memory.

I chose this verb form because it conveys the meaning of seeking or acquiring something for oneself, possibly by force. It is a transitive verb form where the subject may be imposing something on the object and can therefore convey subtle power dynamics. For example, it is the form used to translate words such as ‘colonise’ (yastaʿmir) and ‘enslave’ (yastaʿbid). I wanted to get a sense of whether this form could reflect unconscious biases in our translations – an extension of our work in the BLQFP team to address problematic language in cataloguing and translation.

The other reason I chose this verb form is that it is achieved by affixing three consonants to the beginning of the word, which made it possible to search for in our Translation Memory (TM). The TM is a bilingual corpus, stretching back to 2014, of the catalogue descriptions we translate for the digitised India Office Records and Arabic scientific manuscripts on the QDL. We access the TM through our translation management system (memoQ), which offers some basic search functionalities. This includes a ‘wild card’ option where the results list all the words that begin with the three Form X consonants under investigation (است* and يست*).

Figure 1: Snippet of results in memoQ using the wildcard search function.

 

My initial search using these two 3-letter combinations returned 2,140 results. I noticed that there were some recurring false positives such as certain place names and the Arabic calque of ‘strategy’ (istrātījiyyah). The most recurring false positive (699 counts), however, was the Arabic verb for ‘receive’ (istalam) – which is unsurprising given frequent references to correspondences being sent and received in catalogue descriptions of IOR files. What makes this verb a false positive is that the ‘s’ is in fact a root consonant, and therefore the verb actually belongs to Form VIII (iftaʿal). 
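For readers who want to try a similar prefix search outside memoQ, here is a minimal sketch in Python. It is not the memoQ search itself: the function name, the exclusion list and the sample segments are all assumptions made for illustration, and it assumes undiacritised Arabic text split into runs of Arabic letters.

```python
import re
from collections import Counter

# The two prefixes searched in the post: است* and يست*
FORM_X_PREFIXES = ("است", "يست")

# Hypothetical exclusion list for recurring false positives, e.g. the verb
# 'receive' (istalam), where the 's' is a root consonant, and 'strategy'.
FALSE_POSITIVE_STEMS = ("استلم", "يستلم", "استراتيج")

# Runs of Arabic letters (U+0621 to U+064A); diacritics would split a token,
# so this sketch assumes unvocalised text.
TOKEN_RE = re.compile(r"[\u0621-\u064A]+")


def form_x_candidates(segments):
    """Count tokens that begin with a Form X prefix, minus known false positives."""
    counts = Counter()
    for segment in segments:
        for token in TOKEN_RE.findall(segment):
            if not token.startswith(FORM_X_PREFIXES):
                continue  # not a wildcard match
            if token.startswith(FALSE_POSITIVE_STEMS):
                continue  # matched the wildcard but is not Form X
            counts[token] += 1
    return counts


# Invented sample segments, for illustration only.
sample = [
    "استلم الوكيل الرسالة",
    "استولى القراصنة على السفينة",
    "يستولي على",
]
candidates = form_x_candidates(sample)
```

Like the wildcard search, this only catches the prefix at the very start of a token, so a verb with a conjunction or other particle attached in front would be missed.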

After eliminating these false positives, I ended up with 1,349 matches. From these, I was able to identify 55 unique verbs used in relation to IOR files. I then conducted a more targeted search of three cases of each verb: the perfective (past) istafʿal, the imperfective (present) yastafʿil, and the verbal noun (istifʿāl). I used the wild card function again to capture variations of these cases with suffixes attached (e.g. pronoun or plural suffixes). Although these would have been useful too, I did not look for the active (mustafʿil) and passive (mustafʿal) participles because the single short vowel that differentiates them is rarely represented in Arabic writing. Close scrutiny of the context of each result would have been needed in order to assign them correctly, and I did not have enough time for that in a single day.

Figure 2: List of the Form X verbs found in the TM and their frequency (excluding six verbs that only occur once)

 

I made a note of the original English term(s) that the Form X verb was used to translate. I then identified seven potentially problematic verbs that required further investigation. These verbs typically convey an action that is being either forcefully or wrongfully imposed.

Figure 3: Seven potentially problematic verbs that take Form X in the TM

 

My next step was to investigate the use of these verbs in context more closely. I looked at the most frequent of these verbs (istawlá/yastawlī/istīlaʾ) in our TM, first using the source + target view, and then the three-column concordance view of the target text. The first view allowed me to scrutinise how we have been employing this verb vis-à-vis the original term used in the English catalogue description. It struck me that, in some cases, more neutral verbs such as ‘take’ and ‘take possession of’ were used on the English side, meaning that bias was introduced during translation.

Figure 4: Source + target view of concordance results for the verb istawlá

 

The second view makes it possible to see the text immediately preceding and succeeding the verb, typically displaying the assigned subject and object of the verb. It therefore shows who is doing what to whom more clearly, even though the script direction goes a bit awry for Arabic. Here, I noticed that the subjects were disproportionately non-British: it is overwhelmingly native rulers and populations, ‘pirates’, and rival countries who were doing the forceful or wrongful taking in the results. This may indicate an unconscious bias that has travelled from the primary sources to the catalogue descriptions and is something that requires further investigation.
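The three-column layout described here is essentially a keyword-in-context (KWIC) display. As a rough illustration of the idea, and not of how memoQ implements it, a KWIC view can be sketched in a few lines of Python. The function name, the window size and the English sample sentence are assumptions; the same logic applies to tokenised Arabic text.

```python
def kwic(tokens, keyword, window=4):
    """Return (left context, keyword, right context) for every hit of keyword."""
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
    return rows


# Invented sample sentence, for illustration only.
tokens = "the pirates seized the vessel near the coast".split()
for left, hit, right in kwic(tokens, "seized", window=2):
    print(f"{left:>20} | {hit} | {right}")
```

With the subject usually in one context column and the object in the other, scanning down the rows gives the "who is doing what to whom" overview described above.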

Figure 5: Three-column view of concordance results for the verb istawlá

 

My hack day investigation was conducted in the spirit of continuous reflection on and improvement of our translation process. Using a verb form rather than specific words as a starting point provided an aggregate view of our practices, which is useful in trying to tease out how the descriptions on the QDL may collectively convey an overall stance or attitude. My investigation also demonstrates the value of our TM, not only for facilitating and maintaining consistency in translation, but as a research tool with countless possibilities. My findings from the hack day are naturally rough-and-ready, but they provide the seed for further conversations about problematic language and unconscious bias among translators and cataloguers.

This is a guest post by linguist and translator Dr Mariam Aboelezz (@MariamAboelezz), Translation Support Officer, BL/QFP Project

02 February 2021

Legacies of Catalogue Descriptions and Curatorial Voice: training materials and next steps

Over the past year British Library staff have contributed to the AHRC-funded project "Legacies of Catalogue Descriptions and Curatorial Voice: Opportunities for Digital Scholarship". Led by James Baker, University of Sussex, the project set out to demonstrate the value of corpus linguistic methods and computational tools to the study of legacies of catalogues and cultural institutions’ identities. In a previous blogpost James explained the historical research that shaped this project and outlined the original work plan which was somewhat disrupted by the pandemic.

As we approach the end of the first phase of this AHRC project, we want to share the news that the training module on Computational Analysis of Catalogue Data has been completed as part of the project. The materials take into account the interests and needs of the community for which they are intended. In July James and I delivered a couple of training sessions over Zoom for a group of GLAM professionals, some of whom had previously shown interest in our approach to catalogue data by attending Baker’s British Academy-funded “Curatorial Voice” project.

Screenshot from the December session on Zoom showing a task within the training module.

 

In response to feedback from these sessions we updated the lessons to query data derived from descriptions of photographs held at the British Library. This dataset reflects better the diversity of catalogue records created by different cataloguers and curators over time. British Library staff then took part in a Hack-and-Yack session which demonstrated the use of AntConc and approaches from computational linguistics for the purposes of examining the Library’s catalogue data and how this could enable catalogue related work. This was welcomed by curators, cataloguers and other collections management staff who saw value in trying this out with their own catalogue data for the purpose of improving its quality, identifying patterns and ultimately making it more accessible to users. In December, the near-finished module was presented over Zoom to a wider group of GLAM professionals from the UK, US and Turkey.

Screenshot from the December session demonstrating how to use the concordance tool in AntConc.




We hope that the training module will be widely used and further developed by the community and are delighted that it has already been referenced in a resource for researchers in the Humanities and Social Sciences at the University of Edinburgh. In terms of next steps, the AHRC has granted an extension for holding some partnership development activities with our partners at Yale University and delivering the end-of-project workshop which will hopefully lead to future collaborations in this space.

Screenshot showing James Baker delivering the December training session on Zoom with participants' appreciative comments in the chat
James Baker delivering the December training session on Zoom which participants found really useful.




Personally, I gained a lot from this fruitful collaboration with James and Andrew Salway as it gave me first-hand experience of developing a “Carpentries-style” lesson, understanding how AntConc works, and applying corpus linguistic methods. I want to thank British Library staff who took part in the training sessions and in particular those colleagues who supplied catalogue data and shared curatorial expertise: Alan Danskin, Victoria Morris, Nicolas Moretto, Malini Roy, Cam Sharp Jones and Adrian Edwards.

This post is by Digital Curator Rossitza Atanassova (@RossiAtanassova)

29 January 2021

Hacking the BL from home

BL/QFP Project and BL BAME Network Hack Day: 13th January, 2021

This is a guest post by the British Library Qatar Foundation Partnership, compiled by Laura Parsons. You can follow the British Library Qatar Foundation Partnership on Twitter at @BLQatar.

We may be unable to visit the British Library in person, or see our colleagues except for on our computer screens, but on Wednesday 13th January we proved that lockdown is no barrier to a Hack Day. For the first time our Hack Day was opened up to British Library staff from outside the BL/QFP Project, as we invited members of the BL BAME Network to join us. It was exciting to have a wide variety of people with different roles and Hack Day experience, which was reflected in the diverse ideas and results displayed on the day. There was no particular subject or theme for this Hack Day. The only objectives were to try or learn something new, meet some people from around the Library and have a bit of fun along the way.

It felt slightly weird holding our Hack Day online via Microsoft Teams, rather than gathered in the BL/QFP Project’s office on the 6th floor of the Library. However, with various types of technology and online platforms, including the Teams breakout function and a shared Google doc, we still managed to work collaboratively whilst working from home. Throughout the Teams rooms, it was great to see and hear amazing ideas, helpful team work, interesting discussions, valuable sharing of skills and knowledge, and laughter.

We hope you enjoy reading about our hacks as much as we enjoyed the process of making them together.

 

Exquisite Corpses

Contributors: Morgane Lirette (Conservator (Books), Conservation), Tan Wang-Ward (Project Manager, Lotus Sutra Manuscripts Digitisation), Matthew Lee (Imaging Support Technician, BL/QFP Project), Darran Murray (Digitisation Studio Manager, BL/QFP Project), Noemi Ortega-Raventos (Content Specialist, Archivist, BL/QFP Project)

Our project for this Hack Day collaboration was centered on the idea of the Exquisite Corpse – a fun and creative game popularised by the Surrealists as a tool to create bizarre and wonderful compositions.

The result was a cross-collaborative effort, involving staff from the International Dunhuang Project, Conservation and the BL/QFP Project, that created a series of visual collages using material from the Library's digital collections, Flickr and Instagram accounts as well as the Qatar Digital Library (QDL). We created five exquisite corpses in total.

The biggest takeaway from the day was how easy, fun and creative this process was, both in facilitating cross-Library networking and collaboration and as a tool for invention and exploration of the Library’s diverse collections.

 

Exquisite Corpse image created by collaging material from different images together.
Figure 1: Exquisite Corpse 1: Head part 1 (QDL), Head part 2 (QDL), Head part 3 (QDL), Head part 4 (QDL), Head part 5 (QDL), torso (Flickr), legs (Flickr), feet (Instagram)

 

Exquisite Corpse image 2 - collage
Figure 2: Exquisite Corpse 2: Head (Flickr), torso (BL Catalogue), legs (Instagram), feet (QDL)

 

Exquisite Corpse image 3 - collage
Figure 3: Exquisite Corpse 3: Head (BL Catalogue), torso (Flickr), legs (BL Catalogue), feet (BL Catalogue)

 

Exquisite Corpse image 4 - collage
Figure 4: Exquisite Corpse 4: Head (Flickr), torso (Instagram), legs (QDL), foot 1 (Flickr), foot 2 (Flickr)

 

Exquisite Corpse image 5 - collage
Figure 5: Exquisite Corpse 5: Head (BL Catalogue), torso (QDL), arm (QDL), legs (Flickr), foot 1 (BL Catalogue), foot 2 (BL Catalogue)

 

OCR Text Analysis

Contributors: David Woodbridge (Cataloguer, Gulf History, BL/QFP Project) & Sotirios Alpanis (Head of Digital Operations, BL/QFP Project)

This hack aimed to extend work undertaken as part of the Addressing Problematic Terms Project to explore the BL/QFP’s Optical Character Recognition (OCR) data.

Inspiration for the Hack was drawn from Olivia Vane’s excellent OCR visualisation tool, Steptext. OCR is an automated process employed during the BL/QFP’s digitisation process that ‘reads’ the images captured and turns them into searchable text.

Initially the team came up with a list of terms to search the OCR text for. Then we wrote a Python script to search the OCR files for each term, and output three graphs, built using Bokeh.
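The script itself is not reproduced in this post, but the counting step behind the three graphs can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the record layout of (year, shelfmark, OCR text) tuples, the function names and the sample data are all hypothetical, and the Bokeh plotting step is left out.

```python
import re
from collections import Counter, defaultdict


def count_term(text, term):
    """Count whole-word, case-insensitive matches of term in text."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return len(pattern.findall(text))


def matches_by_year(records, term):
    """Tally matches per year, and per shelfmark within each year.

    records is an iterable of (year, shelfmark, ocr_text) tuples; this
    layout is an assumption made for the sketch.
    """
    per_year = Counter()
    per_shelfmark = defaultdict(Counter)
    for year, shelfmark, text in records:
        n = count_term(text, term)
        if n:
            per_year[year] += n
            per_shelfmark[year][shelfmark] += n
    return per_year, per_shelfmark


def snippet(text, term, width=30):
    """Return the first match with some surrounding context, e.g. for a hover label."""
    match = re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
    if match is None:
        return None
    return text[max(0, match.start() - width):match.end() + width]


# Invented records, for illustration only.
records = [
    (1900, "shelf-A", "Bread was issued. The bread ran out."),
    (1900, "shelf-B", "No rations today."),
    (1905, "shelf-C", "bread"),
]
per_year, per_shelfmark = matches_by_year(records, "bread")
busiest_year = per_year.most_common(1)[0][0]
```

The per-year tally would feed a graph like the first one, the per-shelfmark tally for the busiest year a bar chart like the second, and the snippet helper the kind of in-context text shown on hover.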

Graph displays the number of matches for the term against the year the archive material was created.
Figure 6: This graph displays the number of matches for the term against the year the archive material was created. Click on the image to open an interactive version in a new window.

 

Using the year with the most occurrences of the term, bar chart displays break down of the frequency per shelfmark.
Figure 7: Using the year with the most occurrences of the term, this bar chart displays the breakdown of the frequency per shelfmark. Click on the image to open an interactive version in a new window.

 

Figure 8: Using the shelfmark with the most matches, this graph displays how often the term occurs in each image capture. Using Bokeh’s inbuilt Hover tool, the graph displays a snippet of the term in context with the rest of the OCR data. Click on the image to open an interactive version in a new window.

 

The results show how it is possible both to identify where specific terms are used in the records and to analyse how they are used over time. This will be of great help as we seek to take the project to the next stage.

 

OCR Exquisite Corpses

Contributor: Sotirios Alpanis (Head of Digital Operations, BL/QFP Project)

Taking inspiration from the Exquisite Corpse Hack project, the code for the OCR text analysis was re-factored to produce OCR Exquisite Corpses. Here is the process:

  1. Taking an initial search term, a shelfmark was picked at random and the term was searched for; this process was repeated until a match was found.
  2. Once a match was made the subsequent four words were selected, completing the first sentence of an exquisite corpse.
  3. The final word of the sentence was then used to begin the process again, creating a link between the two sentences.
  4. This was repeated four times to create a surreal nonsense poem.
  5. Finally, using Google Translate’s text to speech service, an mp3 file was created for each poem.
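Steps 1 to 4 above can be sketched along the following lines. This is an illustrative reconstruction from the description, not the team's code: the function names, the mapping of shelfmark to OCR text and the sample texts are hypothetical, shelfmarks are shuffled rather than drawn repeatedly so the search always terminates, and the text-to-speech step is omitted.

```python
import random

PUNCTUATION = ".,;:"


def next_sentence(term, texts, rng, extra_words=4):
    """Try shelfmarks in random order; return the term plus the words that follow it."""
    shelfmarks = list(texts)
    rng.shuffle(shelfmarks)  # random order, without revisiting a shelfmark
    for shelfmark in shelfmarks:
        words = texts[shelfmark].split()
        for i, word in enumerate(words):
            if word.strip(PUNCTUATION).lower() == term.lower():
                following = words[i + 1:i + 1 + extra_words]
                if following:  # skip matches with nothing after them
                    return [word] + following
    return None


def exquisite_corpse(term, texts, repeats=4, seed=None):
    """Chain sentences: the last word of each one seeds the search for the next."""
    rng = random.Random(seed)
    poem = []
    for _ in range(1 + repeats):  # first sentence, then the repeats
        sentence = next_sentence(term, texts, rng)
        if sentence is None:
            break  # no shelfmark contains the current term
        poem.append(" ".join(sentence))
        term = sentence[-1].strip(PUNCTUATION)
    return poem


# Invented OCR texts keyed by hypothetical shelfmark, for illustration only.
texts = {
    "shelf-A": "bread and wine for journey",
    "shelf-B": "journey to the coast long",
    "shelf-C": "long delays at port again",
}
poem = exquisite_corpse("bread", texts, repeats=2, seed=0)
```

Each entry in the resulting list is one sentence of the poem, and the chain stops early if the linking word cannot be found anywhere.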

The Hack team nominated some everyday words to generate OCR Exquisite Corpses. Here are some highlights:

  • BREAD and wine: he THEN he in his, POSSESSION of the enemy's ENTRENCHED camp at Brasjoon, ABOUT 80 per cent

Bread OCR Exquisite Corpse

  • BLUE and gold lackered, WORK fur r North & THE 15th November, 1933, WITH ENCLOSURES FOREIGN: Immediate

Blue OCR Exquisite Corpse

  • MUTINY had been prevented BY wandering tribes, small TRIBUTARY to Persia; AND has the honour TO deal with the

Mutiny OCR Exquisite Corpse

 

Investigating Instances of Arabic Verb Form X in the BLQFP Translation Memory

Contributor: Mariam Aboelezz

I investigated uses of Arabic Verb Form X (istafʿala) in the BLQFP Translation Memory using our translation software, memoQ. I chose this verb form because it conveys the meaning of seeking or acquiring something for oneself, possibly by force, and could therefore elicit unconscious bias in our translations. I identified 55 unique verbs that take this form, six of which were potentially problematic. A closer look at the most frequent verb (istawlá; to take forcefully or wrongfully) suggests that some unconscious bias may have travelled from the primary sources to the catalogue descriptions or been introduced during translation. The results provide a prompt for further discussions about problematic language among translators and cataloguers.
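The search itself was carried out in memoQ, but the idea can be approximated in a few lines of Python. The sketch below is only a rough, assumption-laden stand-in: it treats any word beginning with the letters alif, sin, ta (the "ista-" prefix of the perfect tense) as a Form X candidate, so it will miss imperfect forms and words with attached conjunctions, and it can produce false positives that would need checking by a translator. The sample segments are invented for illustration.

```python
from collections import Counter

# Assumed search pattern: the "ista-" prefix (alif, sin, ta) of the perfect tense.
FORM_X_PREFIX = "است"


def form_x_candidates(segments):
    """Count words that start with the Form X prefix across text segments."""
    counts = Counter()
    for segment in segments:
        for word in segment.split():
            if word.startswith(FORM_X_PREFIX) and len(word) > len(FORM_X_PREFIX):
                counts[word] += 1
    return counts


# Invented example segments, not taken from the Translation Memory.
sample_segments = [
    "استولى على المدينة",                 # istawlá: "he seized the city"
    "استخدم القوة",                       # istakhdama: "he used force"
    "كتب الرسالة ثم استولى على الميناء",  # contains istawlá once more
]

counts = form_x_candidates(sample_segments)
```

Sorting `counts.most_common()` gives the kind of frequency ranking shown in the bar charts that follow.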

Figure 9: Search results from the BLQFP Translation Memory in memoQ for Arabic Verb Form X (istafʿala)

 

Figure 10: Bar chart displaying the 55 unique verbs identified and their frequency.

 

Figure 11: Bar chart displaying the six potentially problematic verbs.

 

Birds of the QDL team

Contributors: Anne Courtney (Cataloguer, Gulf History, BL/QFP Project), Sara Hale (Digitisation Officer, Heritage Made Digital/Asian and African Collections), Francis Owtram (Content Specialist, Gulf History, BL/QFP Project), Annie Ward (Digitisation Workflow Administrator, BL/QFP Project)

The Birds of the QDL team set out to explore how birds appear in the digital records. Sara and Annie used manuscript paintings of bird species as inspiration, creating an animated GIF of a hoopoe and data visualisations of the search results for different birds. Anne tracked bird sightings in one of the IOR ship’s logs by combining quotes from the log with sound recordings and images to help bring the record to life. Francis investigated the Socotra cormorant, British guano extraction and the resistance of the islanders. We enjoyed experimenting with different formats to highlight some of the regional birds and the contexts in which they appear.

Figure 12: Animated GIF using an image of a hoopoe bird. Image from: Tarjumah-ʼi ʻAjā’ib al-makhlūqāt ترجمۀ عجائب المخلوقات Anonymous translator [‎397r] (812/958), British Library: Oriental Manuscripts, Or 1621, in Qatar Digital Library <https://www.qdl.qa/archive/81055/vdc_100069559270.0x00000d> and quote from: 'IRAQ AND THE PERSIAN GULF' [‎144v] (293/862), British Library: India Office Records and Private Papers, IOR/L/MIL/17/15/64, in Qatar Digital Library <https://www.qdl.qa/archive/81055/vdc_100037366479.0x00005e>

 

Figure 13: Bar chart displaying the number of search results by bird name on the Qatar Digital Library and decorated with bird images from a manuscript (Tarjumah-ʼi ʻAjā’ib al-makhlūqāt ترجمۀ عجائب المخلوقات Anonymous translator, British Library: Oriental Manuscripts, Or 1621, in Qatar Digital Library <https://www.qdl.qa/archive/81055/vdc_100035587342.0x000001>).

 

Figure 14: Image of the ocean with text reading: “This day we see no birds”. Image from: ‘Sea Song and River Rhyme from Chaucer to Tennyson’ (1887), ed. E D Adams and quote from: Blenheim : Journal [‎16v] (38/209), British Library: India Office Records and Private Papers, IOR/L/MAR/B/697A, in Qatar Digital Library <https://www.qdl.qa/archive/81055/vdc_100085281813.0x000027>

 

Figure 15: Map of the island of Socotra from: ‘A Trigonometrical Survey of Socotra by Lieut.ts S.B. Haines and I.R. Wellsted assisted by Lieut. I.P. Sanders and Mess.rs Rennie Cruttenden & Fleming Mids.n, Indian Navy. Engraved by R. Bateman, 72 Long Acre’ [‎8r] (1/2), British Library: Map Collections, IOR/X/3630/13, in Qatar Digital Library <https://www.qdl.qa/archive/81055/vdc_100023868004.0x000010>

 

Story-Mapping: The Shater’s Journey

Contributors: Jenny Norton-Wright (Arabic Scientific Manuscripts Curator, BL/QFP Project) & Ula Zeir (Content Specialist, Arabic Language, BL/QFP Project)

Our Hack project aimed to create an interactive map tracing the footsteps of a shater [shāṭir, foot-courier] who made a 700-mile return journey between Gombroon and Shiraz in 1761 bearing an important letter, as recounted in one of the Gombroon Diaries (IOR/G/29/13).

First, we collected background information on the journey and on the term shater, and transcribed the relevant diary entries. We then used the Esri ArcGIS StoryMap Tour platform to visualise and map the events. The Tour function integrates text boxes, captions, and associated images with a background map tracking the points of the journey, and supports hyperlinking to the IOR materials on the QDL.
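For readers curious about the geography, here is a small Python sketch that estimates the straight-line distance between the two ends of the journey using the haversine formula. It is not part of the StoryMap itself: the coordinates are approximate modern values for Bandar-e Abbas and Shiraz that I have assumed for illustration, and a great-circle distance will necessarily be shorter than the route a courier actually walked.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# Approximate modern coordinates (assumed, not taken from the diary).
gombroon = (27.18, 56.27)  # Bandar-e Abbas
shiraz = (29.59, 52.58)

one_way_km = haversine_km(*gombroon, *shiraz)
return_miles = 2 * one_way_km / 1.609344  # kilometres to statute miles
```

Comparing `return_miles` with the 700-mile figure from the diary gives a feel for how much the overland route deviated from a straight line.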

Figure 16: Image from the start of the story map introducing the Shater journey.

 

Figure 17: Image from the story map continuing the Shater journey.

 

Figure 18: Image from the story map continuing the Shater journey: a reply is received.

 

For more information about the Gombroon Diaries:

Diary and Consultations of Mr Alexander Douglas, Agent of the East India Company at Gombroon [Bandar-e ʻAbbās] in the Persian Gulf, commencing 2 October 1760 and ending 30 December 1761, British Library: India Office Records and Private Papers, IOR/G/29/13, in Qatar Digital Library <https://www.qdl.qa/archive/81055/vdc_100000001251.0x00036a>

 

British Library mosaic

Contributor: Laura Parsons (Digitisation Workflow Administrator, BL/QFP Project)

This project involved learning how to create mosaics using images from the Library and QDL collections. It was inspired by a presentation by Pardaad Chamsaz (Curator Germanic Collections, BL European Studies) about the Decolonising the BL working group of the BL BAME Network. He said that we should remember that the Library is made up of many different people. I decided to use Mosaically to combine multiple images into an image of the British Library, to show that it takes many parts to make a whole. This also highlights the Library’s vast collections. I then repeated this with images from the QDL to create an image of the QDL homepage.
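How Mosaically builds its mosaics is not described here, so the following Python sketch only shows the general idea behind a photomosaic, under my own assumptions: the target picture is divided into cells, and each cell is replaced by whichever tile image has the closest average colour. Images are represented as plain lists of RGB tuples and the tile names are hypothetical.

```python
def average_colour(pixels):
    """Mean (R, G, B) of a list of RGB tuples."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def closest_tile(target, tile_colours):
    """Name of the tile whose average colour is nearest to the target colour."""
    def distance(name):
        # Squared Euclidean distance in RGB space.
        return sum((a - b) ** 2 for a, b in zip(target, tile_colours[name]))
    return min(tile_colours, key=distance)


def build_mosaic(cells, tile_colours):
    """Map a grid of cells (each a list of RGB pixels) to a grid of tile names."""
    return [
        [closest_tile(average_colour(cell), tile_colours) for cell in row]
        for row in cells
    ]
```

A real tool would also resize and paste the chosen tiles; this sketch stops at deciding which tile goes where.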

Figure 19: Mosaic of the British Library using images from the British Library Flickr account.

 

Figure 20: Mosaic of the Qatar Digital Library homepage using images from the Qatar Digital Library (https://www.qdl.qa/en).

 

You can also read about the previous Hack Days in the blog posts below:

27 January 2021

Identify yourself!


On Friday, 22 January, the Digital Scholarship Team at the British Library held their first 21st Century Curatorship talk of 2021, "Identify Yourself: (Almost) everything you ever wanted to know about persistent identifiers but were afraid to ask".

This series of professional development talks and seminars is part of the Digital Scholarship Staff Training Programme. The talks are open to all British Library staff, providing a forum for them to keep up with new developments and emerging technologies in scholarship, libraries and cultural heritage. Usually 21st Century Curatorship talks are given by external guests, but this one involved six speakers from around the Library who work with persistent identifiers (PIDs) in various ways. This talk was also scheduled to coincide with PIDapalooza, the annual festival of persistent identifiers, which is taking place over 24 hours this week.

There were many speakers for a one-hour talk, but everyone gave a whistle-stop tour around their particular area. Frances Madden began with an introduction to PIDs generally and then gave an overview of a couple of PID-related projects that the Library is a partner in or leading, including FREYA and PIDs as IRO Infrastructure. (Side note: PIDs as IRO Infrastructure will feature at PIDapalooza, on Thursday at 09:30 UTC.) Frances also explained that you can have persistent identifiers for many types of entities, including articles, datasets, people and organisations. These can all be connected together through the persistent identifier metadata. PIDs are so important because they are reliably unique and persistent over time, which matters in a library!

Next up, Erin Burnand and Emma Rogoz gave an overview of ISNI. The International Standard Name Identifier is an ISO standard used to identify the public identities of parties, persons and organisations associated with creative works. Each ISNI is a sixteen-digit string and is accessible via a persistent URI of the form https://isni.org/isni/[isni]. Erin gave an overview of the extensive quality assurance processes ISNI use to ensure very high quality metadata and the work they do with other organisations to provide training and support, as well as consultation with OCLC and ISNI committees and interest groups. ISNI’s use has expanded since its launch in 2010 and it now serves various communities: YouTube and Spotify are both registration agencies for the music industry.
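To make the identifier format a little more concrete, here is a hedged Python sketch that validates the final check character of an ISNI and builds the persistent URI in the form given above. The check character follows ISO 7064 MOD 11-2, which means the sixteenth character can be "X" as well as a digit. The function names are my own, and the identifiers used to exercise the code are test values in the same format (the first is the example identifier from ORCID's documentation, which uses the same check scheme); they are not taken from the talk.

```python
def isni_check_character(first15):
    """ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in first15:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def normalise_isni(raw):
    """Strip spaces and hyphens and upper-case the check character."""
    return raw.replace(" ", "").replace("-", "").upper()


def is_valid_isni(raw):
    """True if raw has 15 digits followed by the correct check character."""
    isni = normalise_isni(raw)
    if len(isni) != 16 or not isni[:15].isdigit():
        return False
    return isni_check_character(isni[:15]) == isni[15]


def isni_uri(raw):
    """Build a URI in the https://isni.org/isni/[isni] form."""
    return "https://isni.org/isni/" + normalise_isni(raw)
```

For example, `is_valid_isni("0000 0002 1825 0097")` passes the check, while changing the last digit makes it fail. A passing check only shows the string is well formed, not that a record exists for it.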

Emma described the ways in which the Library is working to embed ISNI into its cataloguing workflows by adding them into the LC/NACO file, which is a collaboration between the Library of Congress and the PCC Network. There is also ongoing work to embed them in legacy bibliographic data through matching algorithms and processes. Through the UK Publishers Interest Group, they are working to match authors in publishers’ databases with ISNI and integrate them into their data, which publishers share with the Library. This work has been very successful with high match rates. The Library is also working on a portal so that end users can add information to their own records or request a record be created. Because of the high quality of metadata in the ISNI database, end users will not be able to change or delete any information without liaising directly with the ISNI team.

Figure 1: A screenshot of the ISNI portal

Jez Cope described how digital object identifiers work and the role the Library has in assigning them. A DOI is a digital identifier for an object rather than an identifier for a digital object. DOIs are generally assigned to digital objects such as journal articles and datasets but they have been used to identify Roman coins and other physical items too. DOIs are designed primarily to identify objects for the purposes of citation. Jez went on to explain that DOIs are assigned by registration agencies which have members. Unlike ISNI, the metadata control is not centralised and is overseen by the members. The British Library leads a UK consortium of 100+ DataCite members. Jez also mentioned that the machine readability of a DOI and the metadata associated with it can be integrated into the PID Graph, developed in the FREYA project. This allows you to use PID metadata to answer complex queries and understand relationships which are two steps away from each other, e.g. which British Library authors have received funding from a particular funding agency. Of course all this information depends on the information being present in the metadata.

Figure 2: Example PID Graphs
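The "two steps away" idea can be shown with a toy graph in Python. Everything in this sketch is a made-up placeholder: the identifiers, the link structure and the function names are assumptions for illustration, not the PID Graph's actual data model or query interface.

```python
# Hypothetical PID-to-PID links; every identifier is an invented placeholder.
links = {
    "person:author-1": ["doi:work-a", "doi:work-b"],
    "person:author-2": ["doi:work-c"],
    "doi:work-a": ["funder:agency-x"],
    "doi:work-b": ["funder:agency-y"],
    "doi:work-c": ["funder:agency-x"],
}


def two_steps_away(start, links):
    """Return the set of nodes reachable from start in exactly two hops."""
    reached = set()
    for middle in links.get(start, []):
        reached.update(links.get(middle, []))
    return reached


def authors_funded_by(funder, links):
    """People linked to at least one work that is itself linked to the funder."""
    return sorted(
        node for node in links
        if node.startswith("person:") and funder in two_steps_away(node, links)
    )
```

Here the author-to-funder relationship is never stored directly; it only emerges by following author-to-work and work-to-funder links, which is why the answer depends on both kinds of metadata being present.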

Finally we heard from two projects at different stages of completion which are using DOI metadata within the Library. Simon Moffatt described how the Library is using DOIs from journal articles to improve the links from records which have been acquired through different routes. This new service, known as BLDOI, improves the experience of end users using the catalogue but also has the potential to be rolled out to other libraries and users. The solution is a lookup table comparing ARKs (the Library’s internal identifiers) and DOIs, which is exposed via an API that feeds into the catalogue.

Figure 3: A screenshot of the new search results, displayed on Reading Room PCs, explaining how the new look-up service works.
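A lookup table of this kind can be pictured with a short Python sketch. This is not the BLDOI implementation: the class name, the DOI and the ARK below are hypothetical, and the only fact relied on is that DOIs are case-insensitive, so they are normalised before being stored or looked up.

```python
def normalise_doi(doi):
    """Lower-case a DOI and drop a leading resolver or 'doi:' prefix."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


class DoiArkLookup:
    """In-memory table mapping DOIs to internal ARK identifiers."""

    def __init__(self):
        self._ark_by_doi = {}

    def add(self, doi, ark):
        self._ark_by_doi[normalise_doi(doi)] = ark

    def find_ark(self, doi):
        """Return the ARK recorded for this DOI, or None if there is no match."""
        return self._ark_by_doi.get(normalise_doi(doi))


# Hypothetical identifiers for illustration only.
table = DoiArkLookup()
table.add("10.1234/Example.1", "ark:/00000/example1")
```

An API in front of such a table would take the DOI from a catalogue record and return the matching ARK, or nothing when the Library holds no copy.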

Sharon Johnson closed the session by describing a project in its early stages of using Crossref DOI metadata for journal articles to identify where the Library is missing articles which it should have collected via Legal Deposit legislation. This could apply where the Library is missing articles from issues of journals it already collects but also journals which it should collect but does not at this point.
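The comparison Sharon described boils down to a set difference, which the following Python sketch illustrates under simplifying assumptions. The record shape, a (DOI, journal title) pair, is hypothetical rather than Crossref's actual metadata schema, and the DOIs are placeholders. The function separates the two cases mentioned above: journals with some articles missing, and journals with nothing held at all.

```python
from collections import defaultdict


def find_gaps(crossref_records, held_dois):
    """Split missing articles into partial gaps and wholly uncollected journals.

    crossref_records: iterable of (doi, journal) pairs.
    held_dois: iterable of DOIs already in the collection.
    """
    held = {doi.lower() for doi in held_dois}
    by_journal = defaultdict(list)
    for doi, journal in crossref_records:
        by_journal[journal].append(doi.lower())

    partial, uncollected = {}, {}
    for journal, dois in by_journal.items():
        missing = sorted(d for d in dois if d not in held)
        if not missing:
            continue  # everything published is already held
        if len(missing) == len(dois):
            uncollected[journal] = missing
        else:
            partial[journal] = missing
    return partial, uncollected
```

Whether a missing article actually falls under Legal Deposit would still need checking against the legislation; the sketch only finds candidates.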

Miraculously, this jam-packed session was completed within an hour and there was even some time for questions at the end. The aim of the session was to provide an overview of the services the Library has related to identifiers and to illustrate their breadth and diversity as well as the number of different teams involved. The fact that we had so many speakers and teams represented illustrates this. Hopefully we will be able to hold more detailed sessions on individual topics in the future.

This post is by Frances Madden (@maddenfc), Research Associate (PIDs as IRO Infrastructure) about a recent seminar for British Library staff.

15 January 2021

Happy 20th Birthday Wikipedia


Today Wikipedia, the world’s collaborative, online, free encyclopedia, is marking its twentieth birthday. Many celebrations are underway for this, including a #WikiLovesCakes online bake off competition organised by Wikimedia UK, which will be judged by Sandi Toksvig and Nick Poole.

Alas I am lacking in baking skills (though I am excellent at cake eating!), so I’m marking #Wikipedia20 with a reflection on how the British Library has collaborated with Wikimedia and contributed to Wikipedia over the last few years.

WMUK Wikipedia 20th Birthday image with number 20, a birthday cake, the Wikimedia globe and Big Ben

I am also delighted to announce that a memorandum of understanding has been signed this month between the British Library and Wikimedia UK for a new Wikimedian-in-Residence. My colleague Richard Davies who signed this agreement on behalf of the Library said:

“The Library has learnt a great deal both from and since our first Wikipedian-in-Residence in 2012-2013, Andrew Gray. Through this new residency we will be able to build on this hugely successful work with Wikipedia, across all our collection areas. It will also enable the Library to contribute more to the GLAM-Wiki Community in a coordinated and sustainable way, with particular emphasis on increasing the visibility of our digital collections, data and research materials from underrepresented people and marginalised communities through the development of innovative partnership projects.”

We are really looking forward to hosting this new residency, so watch this blog for future updates on this project. Fortunately this residency will be building upon existing experience, as British Library colleagues from many departments have actively engaged with Wikipedia and the Wikimedia family of platforms over several years. I will do my best to give summaries of some of these below:

BL Labs has collaborated with Wikimedia Commons in a number of ways, including:

BL Labs have also supported the excellent Wikipedia project Wiki-Food and (mostly) Women. This is an ongoing partnership with the Oxford Symposium of Food & Cookery (OSFC), which was initiated in 2015 by experienced Wikipedia editor and trainer Roberta Wedge, former OSFC Trustee Bee Wilson, OSFC Director Ursula Heinzelmann and the British Library’s Polly Russell. This project has held regular Wikipedia edit-a-thons at the British Library and in Oxford, providing training and support for Wikipedia editing with the aim of increasing and improving the articles about food, especially ones about women’s contributions to food and cooking culture. When this project started, 90% of Wikipedia editors were men and this gender bias was reflected in Wikipedia coverage. There is still a bias, but thanks to the efforts of Wikipedia and many wonderful projects worldwide this gender imbalance is being addressed. Their plans for edit-a-thon events in 2020 were curtailed by Covid-19, but they did run some online training sessions and surgeries with Roberta Wedge at the OSFC virtual conference in 2020.

Another collaboration addressing gender balance issues was a recent Wikithon: Women in Leeds event, which took place on 22nd November 2020, to create and improve Wikipedia articles about some of the amazing women of Leeds, past and present. This was part of the British Library's cultural programme in Yorkshire, working with other GLAM organisations in the region. It was co-organised by Kenn Taylor from the British Library, in partnership with Rhian Isaac of Leeds Libraries and Lucy Moore of Leeds Museums & Galleries, for the season of events accompanying the British Library’s exhibition, "Unfinished Business: The Fight for Women’s Rights".

Hope Miyoba, Wikimedian in Residence for the Science Museum Group, who is based at the National Science and Media Museum in Bradford, gave an excellent training session on how to edit Wikipedia. The event produced new articles for Catherine Mary Buckton, the first woman elected to public office in Leeds; sharpshooter and circus performer Florence Shufflebottom; and philanthropist Marjorie Ziff, who is notable for her contributions to the Jewish community in Leeds and whose article was further improved by the Women in Red editing community. This event also inspired me to create a new Wikipedia article for writer Rosie Garland, who is also a singer in Leeds goth band The March Violets.

Positive feedback was received from participants at this event, with comments such as ‘my 9 year old daughter says she wants to do this forever’, ‘just finished Uni and missing researching things, so this is definitely a good lockdown activity to get into!’ and ‘I’m thinking about how to incorporate women and Wikipedia entries into my teaching!’.

In addition to editing Wikipedia and adding images to Wikimedia Commons, a number of British Library staff have been editing Wikidata. In 2020 Eleanor Casson from the Contemporary Literary and Creative Archives team updated seventy Wikipedia articles and seventy-two Wikidata entries with information about their collections (see the entry for the Society of Authors example in the image above). Graham Jevon from the Endangered Archives Programme has been using the Wikidata reconciliation service to validate and create authority records. This work enabled him to create more than three hundred authority records for people identified in a digitised collection of photographs from South America, which will be published online soon. Graham says:

"Wikidata has proved particularly helpful for continued productivity and collaboration while working from home during lockdown. It has enabled a colleague without access to internal cataloguing systems to create and edit authority records in Wikidata, which I can then extract to update the BL’s systems. This is a win-win. It helps us update our own catalogue records while simultaneously enhancing the shared Wikidata resource."
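For anyone wondering what a reconciliation request looks like, here is a small Python sketch that builds a batch of queries in the general shape used by reconciliation services: a JSON object keyed "q0", "q1" and so on, sent in a form field called `queries`. The names and helper functions are hypothetical, no endpoint is contacted, and the exact fields a given service accepts should be checked against its own documentation. The type "Q5" is the Wikidata item for "human".

```python
import json


def build_reconciliation_queries(names, type_id="Q5", limit=3):
    """Build a batch of reconciliation queries keyed q0, q1, ..."""
    return {
        f"q{i}": {"query": name, "type": type_id, "limit": limit}
        for i, name in enumerate(names)
    }


def encode_queries(queries):
    """Serialise the batch as the JSON string sent in the `queries` form field."""
    return json.dumps(queries, ensure_ascii=False)


# Placeholder names, not taken from the photograph collection.
batch = build_reconciliation_queries(["Example Person One", "Example Person Two"])
payload = encode_queries(batch)
```

The service's response would list candidate Wikidata items for each key, which a cataloguer can then accept, reject or use as the basis for a new authority record.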

Before I end this post, I also want to flag up the excellent work done by the global GLAM–Wiki community (galleries, libraries, archives, and museums, also including botanic gardens and zoos), which advises and supports cultural institutions to share their resources with the world through collaborative projects with experienced Wikipedia editors.

Also worth flagging is the awesome #1Lib1Ref campaign (an abbreviation of "one librarian, one reference"), which invites librarians around the world, and anyone who has a passion for free knowledge, to add missing references to articles on Wikipedia, with the aim of reducing Wikipedia's backlog of "citation needed" notices.

Please do add some references and eat some cake to celebrate Wikipedia's 20th birthday this year; I know I will be. You may also like to listen to BBC Radio 4’s The World At One programme from earlier today (15/01/2021), where David Gerard and I discuss Wikipedia and libraries. You can hear this section from 37 minutes 55 seconds into the recording.

This post is by Digital Curator Stella Wisdom (@miss_wisdom)

31 December 2020

Highlights from crowdsourcing projects at the British Library


In this post, Dr Mia Ridge and others celebrate our award-winning contributors and share progress reports from a range of crowdsourcing projects at the British Library.

Despite significant challenges, 2020 was a year of remarkable achievements for crowdsourcing at the British Library. Read on for some highlights.

A quarter of a million contributions on LibCrowds

The LibCrowds platform, which hosts our In the Spotlight project and previously hosted Convert-a-Card, reached an incredible milestone in mid-December - a quarter of a million contributions! Our heartfelt thanks to the nearly 3000 registered volunteers, and countless anonymous others who contributed to this fantastic achievement via our projects.

The official launch - and completion! - of crowdsourcing tasks on Living with Machines

Building on the lessons learnt from earlier experiments, in early December we launched two new crowdsourcing projects with data scientists from the Living with Machines project. These projects aimed to integrate linguistic research questions with tasks that encouraged volunteers to engage with social and technological history in the pages of 19th century newspapers. We learnt a lot and tweaked the project after the feedback from Zooniverse volunteers, and were delighted to be recognised as an official Zooniverse project.

Thanks to the mighty power of Zooniverse volunteers, the tasks were completed within a few days. Analysing the results will keep us busy in the first few months of 2021.

In the Spotlight and Georeferencer contributors are award-winning!

Earlier this year, digital volunteers on the British Library's In the Spotlight and Georeferencer projects were nominated in the Community category of the British Library Labs awards. You can watch the 30 second videos about the nominations for In the Spotlight and Georeferencer on YouTube. Award winners are decided by BL Labs and other Library staff with the BL Labs Advisory Board, and we're delighted to say that both projects won with a joint award for first place!

Congratulations to all our contributors for this recognition of your work with our crowdsourcing tasks, and for discussing our collections and sharing your insights with us and others. 

In the Spotlight

In addition to the 255,000+ contributions above, volunteers have completed tasks on 148 volumes of historical playbills. We continue to work with our Metadata Services team to integrate these transcriptions into British Library systems. The project has a remarkable international reach, with visitors to the project from 1736 cities in 104 countries. Whether you're from Accra, Hanoi, London, Moscow, San Antonio or Zagreb - thank you!

Georeferencer

Dr Gethin Rees, Lead Curator for Digital Map Collections, writes:

In 2014 the British Library released over 50,000 images of maps onto the Georeferencer that had been extracted from the millions of Flickr images from 17th-, 18th- and 19th-century books with the help of volunteers. Ever since then the volunteers have been hard at work adding coordinate data on the Georeferencer platform and I am delighted to announce that the collection has now been effectively completed. The upgraded Georeferencer and the time we have all had to spend indoors over the last months appear to have provided the project with a new impetus. Well done to all!

The work of Georeferencer volunteers on this Flickr collection of maps has been invaluable to the Library; the addition of coordinate data from the Flickr collection to the British Library's Aleph catalogue has offered a new metadata perspective for our collections. The Flickr maps can be browsed using an interactive web map allowing the public to easily discover maps of areas where they live or in which they are interested. We intend to make the georeferenced maps available as GeoTIFFs on the British Library's Research Repository. A huge thank you to maurice, Janet H, Nigel Slack, Martin Whitton, Benjamin G, John Herridge, Singout, H Barber, Jheald and Michael Ammon and all the Georeferencer community for their amazing work on the platform and feedback over the years.

Find out more: Flickr Maps on the Georeferencer Finished!

Following the completion of the Flickr work, we released just under 8000 images from the K. Top collection onto the BL's Georeferencer. The maps are part of a larger collection of 18,000 digital images of historic maps, views and texts from the Topographical Collection of King George III that have been released into the public domain. The collection has been digitised as part of a seven-year project to catalogue, conserve and digitise the collection which was presented to the Nation in 1823 by King George IV.  The images are made available on the image sharing site Flickr, which links to fully searchable catalogue records on Explore the British Library. The Georeferencers have been making short work of these maps: they were added back in early October and 54% have already been completed. This initial 8000 is the first of two planned Georeferencer releases. 

Find out more: The K.Top: 18,000 digitised maps and views released

Endangered Archives Programme

Dr Graham Jevon, Cataloguer, Endangered Archives Programme, writes:

EAP's Siberian photographs project is close to moving to the next phase. Thanks to the amazing work of all our contributors, one task has been completed and the second task is almost complete.

But we still need your help to tag the last remaining photographs. You don't need any expert knowledge. And like hot mince pies, once they're gone, they're gone. So get tagging before someone else beats you to all the best photos!

In 2021, we are looking forward to processing the results in order to enhance the online catalogue and also to begin an exciting new research project based on the tags you have created - we hope to be able to share more news on this in the coming months!

Meanwhile, Russian curators Katya Rogatchevskaia and Katie McElvanney have been working hard behind the scenes on this project. One of the fruits of this work has been the translation of the Zooniverse platform terms into Russian. This will help enable any future crowdsourcing projects to publish their projects on Zooniverse in Russian as well as English.

Nominate a case study for the 'Collective Wisdom' project

This AHRC-funded project led by Dr Mia Ridge aims to foster an international community of practice and set a research agenda for crowdsourcing in cultural heritage. In March 2021 we'll collaboratively write a book on the state of the art in crowdsourcing in cultural heritage through two intensive week-long 'book sprint' sessions. We'd like to include case studies from a range of projects that include crowdsourcing, online volunteering or digital participation - please get in touch if you'd like to find out more or would like to suggest a project for inclusion.

Find out more: Collective Wisdom Project website.