Digital scholarship blog

10 June 2019

Collaborative Digital Scholarship in Action: A Case Study in Designing Impactful Student Learning Partnerships

The Arts and Sciences (BASc) department at University College London (UCL) has been at the forefront of a renaissance of liberal arts and sciences degrees in the UK. As part of its Core modules offering, students select an interdisciplinary elective in Year 2 of their academic programme, from a range of modules specially designed for the department by UCL academics and researchers.

When creating my own module – Information Through the Ages (BASC0033) – as part of this elective set, I was keen to ensure that the student learning experience was supported and developed in tandem with professional practices and standards. Enabling students to take the skills developed on the module beyond its own assignments would, I knew, aid them in their own unique degree programmes and also provide substantial evidence of their skills base to future employers. Partnering with the British Library to design a data science and data curation project as one of the module’s core assignments therefore seemed an excellent opportunity: it provided a research-based educative framework for students as well as a valuable chance for them to engage in a real-world collaboration. Giving students external industry partners to work with can be an important fillip to their motivation and to the learning experience overall, as they see their assessed work move beyond the confines of the academy and have an impact in the wider world.

Through discussions with my British Library co-collaborators, Mahendra Mahey and Stella Wisdom, we alighted on the Microsoft Books/BL 19th Century collection dataset as offering excellent potential for the student groups' data curation projects. With its 60,000 public domain volumes, associated metadata and more than one million extracted images, it presented exciting, undiscovered territory across which our student groups might roam and rove, with the results of their work having the potential to benefit future British Library researchers.

We therefore felt that structuring the group project around wrangling a subset of this data – discovering, researching, cleaning and refining it, with each group producing a curated version of the original dataset as its output – offered a number of significant benefits. Students were able to explore and develop technical skills such as data curation, software knowledge, archival research, report writing, project development and collaborative working practices, alongside gaining a real-world digital scholarship learning experience – with the outcomes in turn supporting the British Library’s Digital Scholarship remit to enable innovative research based on the Library’s digital collections.

Students observed that “working with the data did give me more practical insight to the field of work involved with digitisation work, and it was an enriching experience”, including how they “appreciated how involved and hands-on the projects were, as this is something that I particularly enjoy”. Data curation training was provided on site at the British Library, with the session focused on the use of OpenRefine, “a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.”[1] Student feedback also told us that we could have provided further software training, and more guided dataset exploration/navigation resources, with groups keen to learn more nuanced data curation techniques – something we will aim to respond to in future iterations of the module – but overall, as one student succinctly noted, “I had no idea of the digitalization process and I learned a lot about data science. The training was very useful and I acquired new skills about data cleaning.”
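To give a flavour of what this kind of cleaning involves, here is a minimal sketch in Python (pandas) of the sort of tidying that a tool like OpenRefine supports: trimming whitespace, normalising dates and filtering records by keyword. The file and column names are invented for illustration and are not the actual collection metadata or the students' workflow.

```python
import pandas as pd

# Hypothetical extract of the BL 19th Century collection metadata;
# the file and column names are illustrative only.
df = pd.read_csv("bl_19th_century_sample.csv")

# Trim stray whitespace from the title and author fields.
for col in ["title", "author"]:
    df[col] = df[col].fillna("").str.strip()

# Keep only records with a usable four-digit publication year.
df["year"] = pd.to_numeric(
    df["date_of_publication"].str.extract(r"(\d{4})")[0], errors="coerce")
df = df.dropna(subset=["year"])

# Filter to a thematic subset, e.g. travel-related titles.
travel = df[df["title"].str.contains(r"travel|voyage|tour", case=False, regex=True)]
travel.to_csv("travel_subset.csv", index=False)
```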

Overall, we had five student groups wrangling the BL 19th Century collection, producing final data subsets in the following areas: Christian and Christian-related texts; Queens of Britain, 1510-1946; female authors, 1800-1900 (here's a heatmap this student group produced of the spread of published titles by female authors in the 19th century); Shakespearean works, other authors’ adaptations of those works, and commentary on Shakespeare or his writing; and travel-related books.
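As an illustration of how a visualisation like that heatmap might be produced from a curated subset, the sketch below counts titles per year and arranges the counts by decade with pandas and matplotlib. The input file and its columns are hypothetical, not the student group's actual data or code.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical curated subset with one row per title and a 'year' column.
df = pd.read_csv("female_authors_subset.csv")
df = df[(df["year"] >= 1800) & (df["year"] <= 1900)]

# Arrange counts as a decade x year-within-decade grid.
df["decade"] = (df["year"] // 10) * 10
df["offset"] = df["year"] % 10
grid = df.pivot_table(index="decade", columns="offset",
                      values="title", aggfunc="count").fillna(0)

plt.imshow(grid, cmap="viridis", aspect="auto")
plt.yticks(range(len(grid.index)), grid.index)
plt.xticks(range(len(grid.columns)), grid.columns)
plt.xlabel("Year within decade")
plt.ylabel("Decade")
plt.colorbar(label="Published titles")
plt.title("Titles by female authors per year, 1800-1900")
plt.show()
```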

In particular, it was excellent to see students fully engaging with the research process around their chosen data subset – exploring its cultural and institutional contexts, as well as navigating metadata/data schemas, requirements and standards.

For example, the Christian texts group considered the issue of different languages as part of their data subset of texts, following this up with textual content analysis to enable accurate record querying and selection. In their project report they noted that “[u]sing our dataset and visualisations as aids, we hope that researchers studying the Bible and Christianity can discover insights into the geographical and temporal spread of Christian-related texts. Furthermore, we hope that they can also glean new information regarding the people behind the translations of Bibles as well as those who wrote about Christianity.”

Similarly, the student group focused on travel-related texts discussed in their team project summary that “[t]he particular value of this curated dataset is that future researchers may be able to use it in the analysis of international points of view. In these works, many cities and nations are being written about from an outside perspective. This perspective is one that can be valuable in understanding historical relations and frames of reference between groups around the world: for instance, the work “Travels in France and Italy, in 1817 and 1818”, published in New York, likely provides an American perspective of Europe, while “Four Months in Persia, and a Visit to Trans-Caspia”, published in London, might detail an extended visit of a European in Persia, both revealing unique perspectives about different groups of people. A comparable work, that may have utilized or benefitted from such a collection, is Hahner’s (1998) “Women Through Women’s Eyes: Latin American Women in Nineteenth Century Travel Accounts.” In it, Hahner explores nineteenth century literature written to unearth the perspectives on Latin American women, specifically noting that the primarily European author’s writings should be understood in the context of their Eurocentric view, entrenched in “patriarchy” and “colonialism” (Hahner, 1998:21). Authors and researchers with a similar intent may use [our] curated British Library dataset comparably – that is, to locate such works.”

Data visualisations by the travel books group

Over the ten weeks of the module, alongside their group data curation projects, students covered lecture topics as varied as Is a Star a Document?, "Truthiness" and Truth in a Post-Truth World, Organising Information: Classification, Taxonomies and Beyond!, and Information & Power. They also worked on an individual archival GIF project, which drew on an institutional archival collection to create (and publish on social media) an animated GIF, and spent time in classroom discussions considering questions such as: What happens when information is used for dis-informing or mis-informing purposes? How do the technologies available to us in the 21st century potentially impact on the (data) collection process and its outputs and outcomes? How might ideas about collections and collecting be transformed in a digital context? What exactly do we mean by the concepts of Data and Information? How we choose to classify or group something first requires a series of "rules" or instructions which determine the grouping process – but who decides on what the rules are, and how might such decisions in fact influence our very understandings of the information the system is supposedly designed to facilitate access to? These dialogues were all situated within the context of both "traditional" collections systems and atypical sites of information storage and collection. The module aimed to enable students to gain an in-depth knowledge, understanding and critical appreciation of the concept of information, from historical antecedents to digital scientific and cultural heritage forms, in the context of libraries, archives, galleries and museums (including alternative, atypical and emergent sources), and of how technological, social, cultural and other changes fundamentally affect our concept of “information.”

“I think this module was particularly helpful in making me look at things in an interdisciplinary light”, one student observed in module evaluation feedback, with others going on to note that “I think the different formats of work we had to do was engaging and made the coursework much more interesting than just papers or just a project … the collaboration with the British Library deeply enriched the experience by providing a direct and visible outlet for any energies expended on the module. It made the material seem more applicable and the coursework more enjoyable … I loved that this module offered different ways of assessment. Having papers, projects, presentations, and creative multimedia work made this course engaging.”

I hope that situating the module’s assessments within such contexts encouraged students to understand the critical, interdisciplinary focus of the field of information studies, in particular the use of information in the context of empire-making and consolidation, and how histories of information, knowledge and power intersect. Combined with a collaborative, interdisciplinary approach to curriculum design, which encouraged and supported students to gain technical abilities and navigate teamwork practices, we hope this module can point to some useful ways forward in creating and developing engaging learning experiences that have real-world impact.

This blog post is by Sara Wingate-Gray (UCL Senior Teaching Fellow & BASC0033 module leader), Mahendra Mahey (BL Labs Manager) and Stella Wisdom (BL Digital Curator for Contemporary British Collections).

07 February 2019

BL Labs 2018 Research Award Honourable Mention: 'HerStories: Sites of Suffragette Protest and Sabotage'

At our symposium in November 2018, BL Labs awarded two Honourable Mentions in the Research category for projects using the British Library's digital collections. This guest blog is by the recipients of one of these - a collaborative project by Professor Krista Cowman at the University of Lincoln and Tamsin Silvey, Rachel Williams, Ben Ellwood and Rosie Ryder at Historic England. 

HerStories: Sites of Suffragette Protest and Sabotage

The project marked the commemoration of the centenaries of some British women winning the Parliamentary vote in February 1918, of women gaining the right to stand as MPs in November 1918, and of the first election in which women voted in December 1918. The centenary year caught the public imagination and resulted in numerous commemorative events. Our project added to these by focussing on the suffragette connections of England’s historic buildings. Its aim was to uncover the suffragette stories hidden in the bricks and mortar of England’s historic buildings and to highlight the role that the historic built environment played in the militant suffrage movement. The Women’s Social and Political Union (WSPU) co-ordinated a national campaign of militant activities across the country in the decade before the First World War. Buildings were integral to this. The Union rented shops and offices in larger towns and cities. It held large public meetings in the streets and inside meeting halls.

Suffragettes also identified buildings as legitimate targets for political sabotage. The WSPU’s leader, Emmeline Pankhurst, famously urged her followers to strike at the enemy through property, and suffragettes broke windows, set fires and placed bombs as part of their campaign to force the government to give votes to women.

The project used the newly-digitised resources of Votes for Women and The Suffragette to identify historic buildings connected with the militant suffrage campaign.  Local reports in both papers were consulted to compile a database of sites connected to the WSPU across England.

HerStories image 1

This revealed a huge diversity in locations and activities. Over 5,000 entries from more than 300 geographical locations were logged. Some were obscure and mundane, such as 6 Bronte Street in Keighley, the contact address for the local WSPU branch in 1908. Others were much more high-profile, including St Paul’s Cathedral, where a number of services were disrupted by suffragettes and a bomb was planted. All of the sites on the database were then compared with the National Heritage List, the official record of England’s protected historic buildings compiled and maintained by Historic England (https://historicengland.org.uk/listing/the-list/).

This provided a new dataset of over a hundred locations whose historic significance had already been recognised through listing but whose suffragette connections had gone unrecorded.

These sites were further researched using the British Library’s collection of historic local newspapers to retrieve more detail about their suffragette connections, including their contemporary reception. This revealed previously unknown details, including an attempted attack on the old Grammar School, King’s Norton, where the Nottingham Evening Post reported how suffragettes who broke in did no damage but left a message on the blackboard saying that they had refrained from damaging its ‘olde worlde’ rooms.

HerStories image 2

The team selected 41 sites and updated their entries on The List to include their newly-uncovered suffragette connections. 

The amended entries can be seen in more detail on Historic England’s searchable map at https://historicengland.org.uk/whats-new/news/suffragette-protest-and-sabotage-sites 

The results provided a significant addition to the suffragette centenary commemorations by marking the important connections between the suffragettes’ fight for the vote and England’s listed historic buildings.

Watch Krista Cowman and Tamsin Silvey receiving their Honourable Mention award on behalf of their team, and talking about their project on our YouTube channel (clip runs from 10.45 to 13.33): 

Find out more about Digital Scholarship and BL Labs. If you have a project which uses British Library digital content in innovative and interesting ways, consider applying for an award this year! The 2019 BL Labs Symposium will take place on Monday 11 November at the British Library.

15 January 2019

The BL Labs Symposium, 2018

On Monday 12th November, 2018, the British Library hosted the sixth annual BL Labs Symposium, celebrating all things digital at the BL. This was our biggest ever symposium with the conference centre at full capacity - proof, if any were needed, of the importance of using British Library digital collections and technologies for innovative projects in the heritage sector.

The delegates were welcomed by our Chief Executive, Roly Keating, and there followed a brilliant keynote by Daniel Pett, Head of Digital and IT at the Fitzwilliam Museum, Cambridge. In his talk, Dan reflected on his 3D modelling projects at the British Museum and the Fitzwilliam, and talked about the importance of experimenting with, re-imagining, and re-mixing cultural heritage digital collections in Galleries, Libraries, Archives and Museums (GLAMs).

This year’s symposium had quite a focus on 3D, with a series of fascinating talks and demonstrations throughout the day by visual artists, digital curators, and pioneers of 3D photogrammetry and data visualisation technologies. The full programme is still viewable on the Eventbrite page, and videos and slides of the presentations will be uploaded in due course.

Composite image of the BL Labs 2018 award recipients

Each year, BL Labs recognises excellent work that has used the Library's digital content in five categories. The 2018 winners, runners up and honourable mentions were announced at the symposium and presented with their awards throughout the day. This year’s Award recipients were:

Research Award:

Winner: The Delius Catalogue of Works by Joanna Bullivant, Daniel Grimley, David Lewis and Kevin Page at the University of Oxford

Honourable Mention: Doctoral theses as alternative forms of knowledge: Surfacing ‘Southern’ perspectives on student engagement with internationalisation by Catherine Montgomery and a team of researchers at the University of Bath

Honourable Mention: HerStories: Sites of Suffragette Protest and Sabotage by Krista Cowman at the University of Lincoln and Rachel Williams, Tamsin Silvey, Ben Ellwood and Rosie Ryder of Historic England

Artistic Award:

Winner: Another Intelligence Sings by Amanda Baum, Rose Leahy and Rob Walker

Runner Up: Nomad by independent researcher Abira Hussein, and Sophie Dixon and Edward Silverton of Mnemoscene

Teaching & Learning Award:

Winner: Pocket Miscellanies by Jonah Coman

Runner Up: Pocahontas and After by Michael Walling, Lucy Dunkerley and John Cobb of Border Crossings

Commercial Award:

Winner: The Library Collection: Fashion Presentation at London Fashion Week, SS19 by Nabil Nayal in association with Colette Taylor of Vega Associates

Runner Up: The Seder Oneg Shabbos Bentsher by David Zvi Kalman, Print-O-Craft Press

Staff Award:

Winner: The Polonsky Foundation England and France Project: Manuscripts from the British Library and the Bibliothèque nationale de France, 700-1200 by Tuija Ainonen, Clarck Drieshen, Cristian Ispir, Alison Ray and Kate Thomas

Runner Up: The Digital Documents Harvesting and Processing Tool by Andrew Jackson, Sally Halper, Jennie Grimshaw and Nicola Bingham

The judging process is always a difficult one as there is such diversity in the kinds of projects that are up for consideration! So we wanted to also thank all the other entrants for their high quality submissions, and to encourage anyone out there who might be considering applying for a 2019 award!

We will be posting guest blogs by the award recipients over the coming months, so tune in to read more about their projects.

And finally, save the date for this year's symposium, which will be held at the British Library on Monday 11th November, 2019.

25 April 2018

Some challenges and opportunities for digital scholarship in 2018

In this post, Digital Curator Dr Mia Ridge shares her presentation notes for a talk on 'challenges and opportunities for digital scholarship' at the British Library's first Research Collaboration 'Open House'.

I'm part of a team that supports the creation and innovative use of the British Library's digital collections. Our working definition of digital scholarship is 'using computational methods to answer existing research questions or challenge existing theoretical paradigms'. In this post/talk, my perspective is informed by my knowledge of the internal processes necessary to support digital scholarship and of the issues that some scholars face when using digital/digitised collections, so I'm not by any means claiming this is a complete list.

Opportunities in digital scholarship

  • Scale: you can explore a bigger body of material computationally - 'reading' thousands, or hundreds of thousands, of volumes of text, images or media files - while retaining the ability to examine individual items closely as research questions arise from that distant reading
  • Perspective: you can see trends, patterns and relationships not apparent from close reading of individual items, or gain a broad overview of a topic
  • Speed: you can test an idea or hypothesis on a large dataset; prototype new interfaces; generate classification data about people, places, concepts; transcribe content

Together, these opportunities enable new research questions.

Sample digital scholarship tools and methods

Some of these processes help get data ready for analysis (e.g. turning images of items into transcribed and annotated texts), while others support the analysis of large collections at scale, improve discoverability or enable public engagement.

  • OCR, HTR - optical character recognition, handwritten text recognition
  • Data visualisation for analysis or publication
  • Text and data mining - applying classifications to or analysing texts, images or media. Key terms include natural language processing, corpus linguistics, sentiment analysis, applied machine learning. Examples include: Voyant tools, Clarifai image classification.
  • Mapping and GIS - assigning coordinates to quantitative or qualitative data
  • Public participation and learning including crowdsourcing, citizen science/history. Examples include In the Spotlight, transcribing information from historical playbills.
  • Creative and emerging formats including games
An experiment with image classification with Clarifai

Putting it all together, we have case studies like Political Meetings Mapper by Dr Katrina Navickas, BL Labs Winner 2015. This project, based on digitised 19th century newspapers, used Python scripts to calculate meeting dates and to extract and geocode their locations, creating a map of Chartist meetings.
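As a rough sketch of that kind of pipeline (not Dr Navickas's actual code), the example below pulls a meeting date and venue out of an OCR'd notice with regular expressions and geocodes the venue with the geopy library; the sample text and patterns are invented for illustration.

```python
import re
from geopy.geocoders import Nominatim

# Invented example of an OCR'd Chartist meeting notice.
notice = ("A public meeting of the Chartists will be held at the "
          "Carpenters' Hall, Manchester, on Monday the 14th of October, 1839.")

# Deliberately simple patterns; real newspaper OCR needs far more robust handling.
date_match = re.search(r"on \w+ the (\d{1,2})\w* of (\w+), (\d{4})", notice)
venue_match = re.search(r"held at the ([^,]+, [A-Z]\w+)", notice)

if date_match and venue_match:
    day, month, year = date_match.groups()
    venue = venue_match.group(1)
    print(f"Meeting on {day} {month} {year} at {venue}")

    # Geocode the venue so the meeting can be plotted on a map.
    geocoder = Nominatim(user_agent="political-meetings-example")
    location = geocoder.geocode(venue)
    if location:
        print(venue, "->", (location.latitude, location.longitude))
```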

The Library has created a data portal, data.bl.uk, containing openly licensed datasets. We aim to describe collections in terms of their data format (images, full text, metadata, etc.), licences, temporal and geographic scope, originating purpose (e.g. specific digitisation projects or exhibitions) and collection, and related subjects or themes. Other datasets may be available by request, or digitised via funded partnerships.

We're aware that, currently, it can be hard to use the datasets from data.bl.uk as they can be too large to easily download, store and manipulate. This leads me neatly onto...

Challenges in digital scholarship

  • Digitisation and cataloguing backlog - the material you want mightn't be available without a special digitisation project
  • Providing access to assets for individual items - between copyright and technology, scholars don't always have the ability to download OCR/HTR text, or download all digitised media about an item
  • Providing access to collections as datasets - moving more material into the 'sweet spot' of material that's nicely digitised in suitable formats, usable sizes, and with open licences allowing for re-use is an on-going (and expensive, time-consuming) process
  • 'Cleaning' historical data and dealing with gaps in both tools provision and source collections - none of these processes are straightforward
  • Providing access to platforms or suites of tools - how much should the Library take on for researchers, and how much should other institutions or individuals provide?
  • Skills - where will researchers learn digital scholarship methods?
  • Peer review - what if your discipline lacks DS-skilled peers? How can peers judge a website or database if they've only had experience with monographs or articles? How can scholars overcome prejudice about the 'digital'?
  • Versioning datasets as annotations or classifications change, software tools improve over time, transcriptions are corrected, etc - some of these changes may affect the argument you're making

Overall, I hope the opportunities outweigh the challenges, and it's certainly possible to start with small projects with existing tools and digital sources to explore the potential of a larger project.

If you've used BL data, you can enter the BL Labs awards - they don't close until October so you have time to start an experimental project now! You can also ask the Labs team to reality check your digital scholarship idea based on Library collections and data.

Digital scholarship is constantly shifting so on another date I might have come up with different opportunities and challenges. Let me know if you have challenges or opportunities that you think could be included in this very brief overview!

12 April 2018

The 2018 BL Labs Awards: enter before midnight Thursday 11th October!

With six months to go before the submission deadline, we would like to announce the 2018 British Library Labs Awards!

The BL Labs Awards are a way of formally recognising outstanding and innovative work that has been created using the British Library’s digital collections and data.

Have you been working on a project that uses digitised material from the British Library's collections? If so, we'd like to encourage you to enter that project for an award in one of our categories.

This year, BL Labs is awarding prizes for a winner and a runner up in four key areas:

  • Research - A project or activity which shows the development of new knowledge, research methods, or tools.
  • Commercial - An activity that delivers or develops commercial value in the context of new products, tools, or services that build on, incorporate, or enhance the Library's digital content.
  • Artistic - An artistic or creative endeavour which inspires, stimulates, amazes and provokes.
  • Teaching / Learning - Quality learning experiences created for learners of any age and ability that use the Library's digital content.

BL Labs Awards 2017 winners (top left: Research Award winner – A large-scale comparison of world music corpora with computational tools; top right: Commercial Award winner – Movable Type: The Card Game; bottom left: Artistic Award winner – Imaginary Cities; bottom right: Teaching / Learning Award winner – Vittoria’s World of Stories)

There is also a Staff award which recognises a project completed by a staff member or team, with the winner and runner up being announced at the Symposium along with the other award winners.

The closing date for entering your work for the 2018 round of BL Labs Awards is midnight BST on Thursday 11th October (2018). Please submit your entry and/or help us spread the word to all interested and relevant parties over the next few months. This will ensure we have another year of fantastic digital-based projects highlighted by the Awards!

Read more about the Awards (FAQs, Terms & Conditions etc), practice your application with this text version, and then submit your entry online!

The entries will be shortlisted after the submission deadline (11/10/2018) has passed, and selected shortlisted entrants will be notified via email by midnight BST on Friday 26th October 2018. 

A prize of £500 will be awarded to the winner and £100 to the runner up in each of the Awards categories at the BL Labs Symposium on 12th November 2018 at the British Library, St Pancras, London.

The talent of the BL Labs Awards winners and runners up from the last three years has resulted in a remarkable and varied collection of innovative projects. You can read about some of last year's Awards winners and runners up in our other blogs, links below:

British Library Labs Staff Award Winner – Two Centuries of Indian Print

To act as a source of inspiration for future awards entrants, all entries submitted for awards in previous years can be browsed in our online Awards archive.

For any further information about BL Labs or our Awards, please contact us at labs@bl.uk.

14 March 2018

Working with BL Labs in search of Sir Jagadis Chandra Bose

The 19th Century British Library Newspapers Database offers a rich mine of material for a comprehensive view of British life in the nineteenth and early twentieth centuries. The online archive comprises 101 full-text titles of local, regional, and national newspapers across the UK and Ireland, and thanks to optical character recognition they are all fully searchable. This allows for extensive data mining across several million newspaper pages. It’s like going through the proverbial haystack looking for the equally proverbial needle, but with a magnet in hand.

For my current research project on the role of the radio during the British Raj, I wanted to find out more about Sir Jagadis Chandra Bose (1858–1937), whose contributions to the invention of wireless telegraphy were hardly acknowledged during his lifetime and all but forgotten during the twentieth century.

Jagadish Chandra Bose at the Royal Institution, London
(Image from Wikimedia Commons)

The person who is generally credited with having invented the radio is Guglielmo Marconi (1874–1937). In 1909, he and Karl Ferdinand Braun (1850–1918) were awarded the Nobel Prize in Physics “in recognition of their contributions to the development of wireless telegraphy”. What is generally not known is that almost ten years before that, Bose invented a coherer that would prove to be crucial for Marconi’s successful attempt at wireless telegraphy across the Atlantic in 1901. Bose never patented his invention, and Marconi reaped all the glory.

In his book Jagadis Chandra Bose and the Indian Response to Western Science, Subrata Dasgupta gives us four reasons why Bose’s contributions to radiotelegraphy have been largely forgotten in the West throughout the twentieth century. The first reason, according to Dasgupta, is that Bose changed his research interests around 1900. Instead of continuing to focus his work on wireless telegraphy, Bose became interested in the physiology of plants and the similarities between inorganic and living matter in their responses to external stimuli. Bose’s name thus lost currency in his former field of study.

A second reason that contributed to the erasure of Bose’s name is that he did not leave a legacy in the form of students. He did not, as Dasgupta puts it, “found a school of radio research” that could promote his name despite his personal absence from the field. Also, and thirdly, Bose sought no monetary gain from his inventions and only patented one of his several inventions. Had he done so, chances are that his name would have echoed loudly through the century, just as Marconi’s has done.

“Finally”, Dasgupta writes, “one cannot ignore the ‘Indian factor’”. Dasgupta wonders how seriously the scientific western elite really took Bose, who was the “outsider”, the “marginal man”, the “lone Indian in the hurly-burly of western scientific technology”. And he wonders how this affected “the seriousness with which others who came later would judge his significance in the annals of wireless telegraphy”.

And this is where the BL’s online archive of nineteenth-century newspapers comes in. Looking at coverage of Bose in the British press at the time suggests that his contributions to wireless telegraphy were all but forgotten even during his lifetime. When Bose died in 1937, Reuters Calcutta put out a press release that was reprinted in several British newspapers. As an example, the following notice on Bose’s death was published in the Derby Evening Telegraph of November 23rd, 1937:

Newspaper clipping announcing death of JC Bose
Notice in the Derby Evening Telegraph of November 23rd, 1937

This notice is as short as it is telling in what it says and does not say about Bose and his achievements: he is remembered as the man “who discovered a heart beat in trees”. He is not remembered as the man who almost invented the radio. He is remembered for the Western honours bestowed upon him (the Knighthood and his Fellowship of the Royal Society), and he is remembered as the founder of the Bose Research Institute. He is not remembered for his career as a researcher and inventor; a career that spanned five decades and saw him travel extensively in India, Europe and the United States.

The Derby Evening Telegraph is not alone in this act of partial remembrance. Similar articles appeared in Dundee’s Evening Telegraph and Post and The Gloucestershire Echo on the same day. The Aberdeen Press and Journal published a slightly extended version of the Reuters press release on November 24th that includes a brief account of a lecture by Bose in Whitehall in 1929, during which Bose demonstrated “that plants shudder when struck, writhe in the agonies of death, get drunk, and are revived by medicine”. However, there is again no mention of Bose’s work as a physicist or of his contributions to wireless telegraphy. The same is true for obituaries published in The Nottingham Evening Post on November 23rd, The Western Daily Press and Bristol Mirror on November 24th, another article published in the Aberdeen Press and Journal on November 26th, and two articles published in The Manchester Guardian on November 24th.

The exception to the rule is the obituary published in The Times on November 24th. Granted, with a total of 1116 words it is significantly longer than the Reuters press release, but this is also partly the point, as it allows for a much more comprehensive account of Bose’s life and achievements. But even if we only take the first two sentences of The Times obituary, which roughly add up to the word count of the Reuters press release, we are already presented with a different account altogether:

“Our Calcutta Correspondent telegraphs that Sir Jagadis Chandra Bose, F.R.S., died at Giridih, Bengal, yesterday, having nearly reached the age of 79. The reputation he won by persistent investigation and experiment as a physicist was extended to the general public in the Western world, which he frequently visited, by his remarkable gifts as a lecturer, and by the popular appeal of many of his demonstrations.”

We know that he was a physicist; the focus is on his skills as a researcher and on his talents as a lecturer rather than on his Western titles and honours, which are mentioned in passing as titles to his name; and we immediately get a sense of the significance of his work within the scientific community and for the general public. And later on in the article, it is finally acknowledged that Bose “designed an instrument identical in principle with the 'coherer' subsequently used in all systems of wireless communication. Another early invention was an instrument for verifying the laws of refraction, reflection, and polarization of electric waves. These instruments were demonstrated on the occasion of his first appearance before the British Association at the 1896 meeting at Liverpool”.

Posted by BL Labs on behalf of Dr Christin Hoene, a BL Labs Researcher in Residence at the British Library. Dr Hoene is a Leverhulme Early Career Fellow in English Literature at the University of Kent. 

If you are interested in working with the British Library's digital collections, why not come along to one of our events that we are holding at universities around the UK this year? We will be holding a roadshow at the University of Kent on 25 April 2018. You can see a programme for the day and book your place through this Eventbrite page. 

13 February 2018

BL Labs 2017 Symposium: Samtla, Research Award Runner Up

Samtla (Search And Mining Tools for Labelling Archives) was developed to address a need in the humanities for research tools that help to search, browse, compare, and annotate documents stored in digital archives. The system was designed in collaboration with researchers at Southampton University, whose research involved locating shared vocabulary and phrases across an archive of Aramaic Magic Texts from Late Antiquity. The archive contained texts written in Aramaic, Mandaic, Syriac, and Hebrew languages. Due to the morphological complexity of these languages, where morphemes are attached to a root morpheme to mark gender and number, standard approaches and off-the-shelf software were not flexible enough for the task, as they tended to be designed to work with a specific archive or user group. 

Figure1
Figure 1: Samtla supports tolerant search allowing queries to be matched exactly and approximately. (Click to enlarge image)

Samtla is designed to extract the same or similar information that may be expressed by authors in different ways, whether in the choice of vocabulary or the grammar. Traditionally, search and text mining tools have been based on words, which limits their use to corpora in languages where 'words' can be easily identified and extracted from text, e.g. languages with a whitespace character like English, French, German, etc. Word models tend to fail when the language is morphologically complex, like Aramaic and Hebrew. Samtla addresses these issues by adopting a character-level approach stored in a statistical language model. This means that rather than extracting words, we extract character sequences representing the morphology of the language, which we then use to match the search terms of the query and rank the documents according to the statistics of the language. Character-based models are language independent, as there is no need to preprocess the document, and we can locate words and phrases with a lot of flexibility. As a result, Samtla compensates for the variability in language use, spelling errors made by users when they search, and errors in the document resulting from the digitisation process (e.g. OCR errors).
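A toy sketch of the character-level idea, in Python, is below. It is not the Samtla implementation (which uses a statistical language model); it simply indexes documents by overlapping character n-grams and ranks them by how many n-grams they share with the query, which already tolerates spelling variants and affixes better than whole-word matching.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Overlapping character n-grams, ignoring case."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_index(documents, n=3):
    """Map each document id to a bag of its character n-grams."""
    return {doc_id: Counter(char_ngrams(text, n)) for doc_id, text in documents.items()}

def rank(query, index, n=3):
    """Score documents by n-gram overlap with the query (a crude ranking)."""
    q = Counter(char_ngrams(query, n))
    scores = {doc_id: sum(min(q[g], grams[g]) for g in q)
              for doc_id, grams in index.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented miniature 'archive' to show the idea.
docs = {
    "doc1": "The healing of the bowl and the binding of the demon",
    "doc2": "A charm for the protection of the house and its inhabitants",
}
index = build_index(docs)
print(rank("healling of the bowle", index))  # tolerant of spelling variation
```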

Figure2
Figure 2: Samtla's document comparison tool displaying a semantically similar passage between two Bibles from different periods. (Click to enlarge image)

 The British Library have been very supportive of the work by openly providing access to their digital archives. The archives ranged in domain, topic, language, and scale, which enabled us to test Samtla’s flexibility to its limits. One of the biggest challenges we faced was indexing larger-scale archives of several gigabytes. Some archives also contained a scan of the original document together with metadata about the structure of the text. This provided a basis for developing new tools that brought researchers closer to the original object, which included highlighting the named entities over both the raw text, and the scanned image.

Currently we are focusing on developing approaches for leveraging the semantics underlying text data in order to help researchers find semantically related information. Semantic annotation is also useful for labelling text data with named entities, and sentiments. Our current aim is to develop approaches for annotating text data in any language or domain, which is challenging due to the fact that languages encode the semantics of a text in different ways.

As a first step we are offering labelled data to researchers, as part of a trial service, in order to help speed up the research process, or provide tagged data for machine learning approaches. If you are interested in participating in this trial, then more information can be found at www.samtla.com.

Figure3
Figure 3: Samtla's annotation tools label the texts with named entities to provide faceted browsing and data layers over the original image. (Click to enlarge image)

 If this blog post has stimulated your interest in working with the British Library's digital collections, start a project and enter it for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library.


Posted by BL Labs on behalf of Dr Martyn Harris, Prof Dan Levene, Prof Mark Levene and Dr Dell Zhang.

02 February 2018

Converting Privy Council Appeals Metadata to Linked Data

To continue the series of posts on metadata about appeals to the Judicial Committee of the Privy Council, this post describes the process of converting this data to Linked Data. In the previous post, I briefly explained the concept of Linked Data and outlined the potential benefits of applying this approach to the JCPC dataset. An earlier post explained how cleaning the data enabled me to produce some initial visualisations; a post on the Social Science blog provides some historical context about the JCPC itself.

Data Model

In my previous post, I included the following diagram to show how the Linked JCPC Data might be structured.

JCPCDataModelHumanReadable_V1_20180104

To convert the dataset to Linked Data using this model, each entity represented by a blue node, and each class and property represented by the purple and green nodes, needs a unique identifier known as a Uniform Resource Identifier (URI). For the entities, I generated these URIs myself based on guidelines provided by the British Library, using the following structure:

  • http://data.bl.uk/jcpc/id/appeal/...
  • http://data.bl.uk/jcpc/id/judgment/...
  • http://data.bl.uk/jcpc/id/location/...

In the above URIs, the ‘...’ is replaced by a unique reference to a particular appeal, judgment, or location, e.g. a combination of the judgment number and year.
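As a small illustration, URIs following this pattern could be minted programmatically along these lines; the slug rules and example values are assumptions for the sketch rather than the project's actual convention.

```python
BASE = "http://data.bl.uk/jcpc/id"

def mint_uri(entity_type, *parts):
    """Build a URI such as http://data.bl.uk/jcpc/id/judgment/23_1894 from
    an entity type and its identifying parts (illustrative rules only)."""
    slug = "_".join(str(p).strip().lower().replace(" ", "-") for p in parts)
    return f"{BASE}/{entity_type}/{slug}"

# Hypothetical judgment identified by judgment number and year, and a location.
print(mint_uri("judgment", 23, 1894))          # .../judgment/23_1894
print(mint_uri("location", "British Guiana"))  # .../location/british-guiana
```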

To ensure that the data can easily be understood by a computer and linked to other datasets, the classes and properties should be represented by existing URIs from established ontologies. An ontology is a controlled vocabulary (like a thesaurus) that not only defines terms relating to a subject area, but also defines the relationships between those terms. Generic properties and classes, such as titles, dates, names and locations, can be represented by established ontologies like Dublin Core, Friend of a Friend (FOAF) and vCard.

After considerable searching I was unable to find any online ontologies that precisely represent the legal concepts in the JCPC dataset. Instead, I decided to use relevant terms from Wikidata, where available, and to create terms in a new JCPC ontology for those entities and concepts not defined elsewhere. Taking this approach allowed me to concentrate my efforts on the process of conversion, but the possibility remains to align these terms with appropriate legal ontologies in future.

An updated version of the data model shows the ontology terms used for classes and properties (purple and green boxes):

JCPCDataModel_V9_20180104

Rather than include the full URI for each property or class, the first part of the URI is represented by a prefix, e.g. ‘foaf’, which is followed by the specific term, e.g. ‘name’, separated by a colon.

More Data Cleaning

The data model diagram also helped identify fields in the spreadsheet that required further cleaning before conversion could take place. This cleaning largely involved editing the Appellant and Respondent fields to separate multiple parties that originally appeared in the same cell and to move descriptive information to the Appellant/Respondent Description column. For those parties whose names were identical, I additionally checked the details of the case to determine whether they were in fact the same person appearing in multiple appeals/judgments.
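The general shape of this kind of splitting can be sketched in Python as below; the delimiters, column name and file are assumptions for illustration, since the real cleaning was carried out on the project spreadsheet itself.

```python
import csv

def split_parties(cell):
    """Split a cell like "John Smith and Mary Jones (merchants)" into a list
    of party names and a trailing description (illustrative rules only)."""
    description = ""
    if "(" in cell and cell.endswith(")"):
        cell, description = cell[:-1].rsplit("(", 1)
    parties = [p.strip() for p in cell.replace(" and ", ";").split(";") if p.strip()]
    return parties, description.strip()

with open("jcpc_appeals.csv", newline="", encoding="utf-8") as f:  # hypothetical file
    for row in csv.DictReader(f):
        parties, description = split_parties(row["Appellant"])
        print(parties, "|", description)
```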

Reconciliation

Reconciliation is the process of aligning identifiers for entities in one dataset with the identifiers for those entities in another dataset. If these entities are connected using Linked Data, this process implicitly links all the information about the entity in one dataset to the entity in the other dataset. For example, one of the people in the JCPC dataset is H. G. Wells – if we link the JCPC instance of H. G. Wells to his Wikidata identifier, this will then facilitate access to further information about H. G. Wells from Wikidata:

ReconciliationExample_V1_20180115

 Rather than look up each of these entities manually, I used a reconciliation service provided by OpenRefine, a piece of software I used previously for cleaning the JCPC data. The reconciliation service automatically looks up each value in a particular column from an external source (e.g. an authority file) specified by the user. For each value, it either provides a definite match or a selection of possible matches to choose from. Consultant and OpenRefine guru Owen Stephens has put together a couple of really helpful screencasts on reconciliation.

While reconciliation is very clever, it still requires some human intervention to ensure accuracy. The reconciliation service will match entities with similar names, but they might not necessarily refer to exactly the same thing. As we know, many people have the same name, and the same place names appear in multiple locations all over the world. I therefore had to check all matches that OpenRefine said were ‘definite’, and discard those that matched the name but referred to an incorrect entity.
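The same kind of candidate lookup can be scripted directly against Wikidata's public search API, as in the minimal sketch below. This is not OpenRefine's reconciliation service (which handles whole columns and match scoring for you), just a simple stand-in to show the idea; as noted above, candidate matches still need a human check.

```python
import requests

def wikidata_candidates(name, limit=5):
    """Return candidate Wikidata matches for a name via the public
    wbsearchentities API."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": "en",
        "format": "json",
        "limit": limit,
    }
    r = requests.get("https://www.wikidata.org/w/api.php", params=params, timeout=10)
    r.raise_for_status()
    return [(hit["id"], hit.get("label", ""), hit.get("description", ""))
            for hit in r.json().get("search", [])]

# Many people share a name, so each candidate needs checking by hand.
for qid, label, description in wikidata_candidates("H. G. Wells"):
    print(qid, label, "-", description)
```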

Locations

I initially looked for a suitable gazetteer or authority file to which I could link the various case locations. My first port of call was Geonames, the standard authority file for linking location data. This was encouraging, as it does include alternative and historical place names for modern places. However, it doesn't contain any additional information about the dates for which each name was valid, or the geographical boundaries of the place at different times (the historical/political nature of the geography of this period was highlighted in a previous post). I additionally looked for openly-available digital gazetteers for the relevant historical period (1860-1998), but unfortunately none yet seem to exist. However, I have recently become aware of the University of Pittsburgh’s World Historical Gazetteer project, and will watch its progress with interest. For now, Geonames seems like the best option, while being aware of its limitations.
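For anyone wanting to experiment with Geonames outside OpenRefine, its search API can be queried directly, as in the sketch below. A free GeoNames account username is required; the one shown is a placeholder, not a working credential.

```python
import requests

def geonames_search(place_name, username, max_rows=3):
    """Return candidate GeoNames matches (id, name, country) for a place name."""
    params = {"q": place_name, "maxRows": max_rows, "username": username}
    r = requests.get("http://api.geonames.org/searchJSON", params=params, timeout=10)
    r.raise_for_status()
    return [(g["geonameId"], g["name"], g.get("countryName", ""))
            for g in r.json().get("geonames", [])]

print(geonames_search("British Guiana", username="your_geonames_username"))
```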

Courts

Although there have been attempts to create standard URIs for courts, there doesn’t yet seem to be a suitable authority file to which I could reconcile the JCPC data. Instead, I decided to use the Virtual International Authority File (VIAF), which combines authority files from libraries all over the world. Matches were found for most of the courts contained in the dataset.

Parties

For the parties involved in the cases, I initially also used VIAF, which resulted in few definite matches. I therefore additionally decided to reconcile Appellant, Respondent, Intervenant and Third Party data to Wikidata. This was far more successful than VIAF, resulting in a combined total of about 200 matches. As a result, I was able to identify cases involving H. G. Wells, Bob Marley, and Frederick Deeming, one of the prime suspects for the Jack the Ripper murders. Due to time constraints, I was only able to check those matches identified as ‘definite’; more could potentially be found by looking at each party individually and selecting any appropriate matches from the list of possible options.

Conversion

Once the entities were separated from each other and reconciled to external sources (where possible), the data was ready to convert to Linked Data. I did this using LODRefine, a version of OpenRefine packaged with plugins for producing Linked Data. LODRefine converts an OpenRefine project to Linked Data based on an ‘RDF skeleton’ specified by the user. RDF stands for Resource Description Framework, and is the standard by which Linked Data is represented. It describes each relationship in the dataset as a triple, comprising a subject, predicate and object. The subject is the entity you’re describing, the object is either a piece of information about that entity or another entity, and the predicate is the relationship between the two. For example, in the data model diagram we have the following relationship:

  AppealTitleTriple_V1_20180108

This is a triple, where the URI for the Appeal is the subject, the URI dc:title (the property ‘title’ in the Dublin Core terms vocabulary) is the predicate, and the value of the Appeal Title column is the object. I expressed each of the relationships in the data model as a triple like this one in LODRefine’s RDF skeleton. Once this was complete, it was simply a case of clicking LODRefine’s ‘Export’ button and selecting one of the available RDF formats. Having previously spent considerable time writing code to convert data to RDF, I was surprised and delighted by how quick and simple this process was.
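For readers who prefer code to the LODRefine interface, the same triple can be built with Python's rdflib library, as in the sketch below; the appeal URI and title value are hypothetical.

```python
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import DC

g = Graph()

# Hypothetical appeal URI following the project's URI pattern.
appeal = URIRef("http://data.bl.uk/jcpc/id/appeal/23_1894")

# Subject - predicate - object: the appeal has a dc:title.
g.add((appeal, DC.title, Literal("Smith v. Jones (hypothetical title)")))

print(g.serialize(format="turtle"))
```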

Publication

The Linked Data version of the JCPC dataset is not yet available online as we’re currently going through the process of ascertaining the appropriate licence to publish it under. Once this is confirmed, the dataset will be available to download from data.bl.uk in both RDF/XML and Turtle formats.

The next post in this series will look at what can be done with the JCPC data following its conversion to Linked Data.

This post is by Sarah Middle, a PhD placement student at the British Library researching the appeal cases heard by the Judicial Committee of the Privy Council (JCPC).  Sarah is on twitter as @digitalshrew.