Digital scholarship blog


13 February 2018

BL Labs 2017 Symposium: Samtla, Research Award Runner Up


Samtla (Search And Mining Tools for Labelling Archives) was developed to address a need in the humanities for research tools that help to search, browse, compare, and annotate documents stored in digital archives. The system was designed in collaboration with researchers at Southampton University, whose research involved locating shared vocabulary and phrases across an archive of Aramaic Magic Texts from Late Antiquity. The archive contained texts written in Aramaic, Mandaic, Syriac, and Hebrew languages. Due to the morphological complexity of these languages, where morphemes are attached to a root morpheme to mark gender and number, standard approaches and off-the-shelf software were not flexible enough for the task, as they tended to be designed to work with a specific archive or user group. 

Figure 1: Samtla supports tolerant search, allowing queries to be matched exactly and approximately.

Samtla is designed to extract the same or similar information that may be expressed by authors in different ways, whether in the choice of vocabulary or the grammar. Traditionally, search and text mining tools have been based on words, which limits their use to corpora in languages where 'words' can be easily identified and extracted from the text, e.g. languages with a whitespace character like English, French and German. Word models tend to fail when the language is morphologically complex, like Aramaic and Hebrew. Samtla addresses these issues by adopting a character-level approach based on a statistical language model. Rather than extracting words, we extract character sequences representing the morphology of the language, which we then use to match the terms of the query and rank the documents according to the statistics of the language. Character-based models are language independent, as there is no need to preprocess the document, and they allow us to locate words and phrases with considerable flexibility. As a result, Samtla compensates for variability in language use, spelling errors made by users when they search, and errors in the document introduced by the digitisation process (e.g. OCR errors).
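The intuition behind character-level matching can be sketched in a few lines. The snippet below is a hypothetical illustration, not Samtla's actual implementation: it scores a document against a query by counting shared overlapping character n-grams, so a morphological variant or misspelling still matches most of the query's n-grams even though no word segmentation is performed.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Extract overlapping character n-grams, ignoring word boundaries."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(query, document, n=3):
    """Score a document against a query by shared n-gram counts (0..1)."""
    q, d = char_ngrams(query, n), char_ngrams(document, n)
    shared = sum((q & d).values())
    return shared / max(sum(q.values()), 1)

# A truncated or variant form still shares its n-grams with the target,
# so it is matched approximately rather than missed entirely.
print(similarity("archiv", "the digital archive holds many texts"))  # 1.0
```

In a real system the n-gram statistics would come from a language model built over the whole corpus and would feed a ranking function, but the core idea — matching on character sequences rather than whitespace-delimited words — is the same.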

Figure 2: Samtla's document comparison tool displaying a semantically similar passage between two Bibles from different periods.

The British Library has been very supportive of this work, openly providing access to its digital archives. The archives ranged in domain, topic, language, and scale, which enabled us to test Samtla's flexibility to its limits. One of the biggest challenges we faced was indexing larger-scale archives of several gigabytes. Some archives also contained a scan of the original document together with metadata about the structure of the text. This provided a basis for developing new tools that bring researchers closer to the original object, including highlighting named entities over both the raw text and the scanned image.

Currently we are focusing on developing approaches for leveraging the semantics underlying text data in order to help researchers find semantically related information. Semantic annotation is also useful for labelling text data with named entities, and sentiments. Our current aim is to develop approaches for annotating text data in any language or domain, which is challenging due to the fact that languages encode the semantics of a text in different ways.

As a first step we are offering labelled data to researchers, as part of a trial service, in order to help speed up the research process, or provide tagged data for machine learning approaches. If you are interested in participating in this trial, then more information can be found at

Figure 3: Samtla's annotation tools label the texts with named entities to provide faceted browsing and data layers over the original image.

 If this blog post has stimulated your interest in working with the British Library's digital collections, start a project and enter it for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library.

Posted by BL Labs on behalf of Dr Martyn Harris, Prof Dan Levene, Prof Mark Levene and Dr Dell Zhang.

02 February 2018

Converting Privy Council Appeals Metadata to Linked Data


To continue the series of posts on metadata about appeals to the Judicial Committee of the Privy Council, this post describes the process of converting this data to Linked Data. In the previous post, I briefly explained the concept of Linked Data and outlined the potential benefits of applying this approach to the JCPC dataset. An earlier post explained how cleaning the data enabled me to produce some initial visualisations; a post on the Social Science blog provides some historical context about the JCPC itself.

Data Model

In my previous post, I included the following diagram to show how the Linked JCPC Data might be structured.


To convert the dataset to Linked Data using this model, each entity represented by a blue node, as well as each class and property represented by the purple and green nodes, needs a unique identifier known as a Uniform Resource Identifier (URI). For the entities, I generated these URIs myself based on guidelines provided by the British Library, using the following structure:


In the above URIs, the ‘...’ is replaced by a unique reference to a particular appeal, judgment, or location, e.g. a combination of the judgment number and year.

To ensure that the data can easily be understood by a computer and linked to other datasets, the classes and properties should be represented by existing URIs from established ontologies. An ontology is a controlled vocabulary (like a thesaurus) that not only defines terms relating to a subject area, but also defines the relationships between those terms. Generic properties and classes, such as titles, dates, names and locations, can be represented by established ontologies like Dublin Core, Friend of a Friend (FOAF) and vCard.

After considerable searching I was unable to find any online ontologies that precisely represent the legal concepts in the JCPC dataset. Instead, I decided to use relevant terms from Wikidata, where available, and to create terms in a new JCPC ontology for those entities and concepts not defined elsewhere. Taking this approach allowed me to concentrate my efforts on the process of conversion, but the possibility remains to align these terms with appropriate legal ontologies in future.

An updated version of the data model shows the ontology terms used for classes and properties (purple and green boxes):


Rather than include the full URI for each property or class, the first part of the URI is represented by a prefix, e.g. ‘foaf’, which is followed by the specific term, e.g. ‘name’, separated by a colon.
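The prefix mechanism is just a string abbreviation. As a minimal sketch, the helper below expands a prefixed name like `foaf:name` into its full URI — the namespace URIs for Dublin Core terms, FOAF and vCard are the ontologies' real base URIs, while the helper function itself is purely illustrative:

```python
# Base URIs of the ontologies mentioned above (these are their real namespaces).
PREFIXES = {
    "dc": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "vcard": "http://www.w3.org/2006/vcard/ns#",
}

def expand(prefixed_name):
    """Expand a prefixed name like 'foaf:name' to its full URI."""
    prefix, _, term = prefixed_name.partition(":")
    return PREFIXES[prefix] + term

print(expand("foaf:name"))  # http://xmlns.com/foaf/0.1/name
```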

More Data Cleaning

The data model diagram also helped identify fields in the spreadsheet that required further cleaning before conversion could take place. This cleaning largely involved editing the Appellant and Respondent fields to separate multiple parties that originally appeared in the same cell and to move descriptive information to the Appellant/Respondent Description column. For those parties whose names were identical, I additionally checked the details of the case to determine whether they were in fact the same person appearing in multiple appeals/judgments.


Reconciliation is the process of aligning identifiers for entities in one dataset with the identifiers for those entities in another dataset. If these entities are connected using Linked Data, this process implicitly links all the information about the entity in one dataset to the entity in the other dataset. For example, one of the people in the JCPC dataset is H. G. Wells – if we link the JCPC instance of H. G. Wells to his Wikidata identifier, this will then facilitate access to further information about H. G. Wells from Wikidata:


 Rather than look up each of these entities manually, I used a reconciliation service provided by OpenRefine, a piece of software I used previously for cleaning the JCPC data. The reconciliation service automatically looks up each value in a particular column from an external source (e.g. an authority file) specified by the user. For each value, it either provides a definite match or a selection of possible matches to choose from. Consultant and OpenRefine guru Owen Stephens has put together a couple of really helpful screencasts on reconciliation.

While reconciliation is very clever, it still requires some human intervention to ensure accuracy. The reconciliation service will match entities with similar names, but these might not necessarily refer to the same thing: many people share a name, and the same place names appear in multiple locations all over the world. I therefore had to check all matches that OpenRefine said were 'definite', and discard those that matched the name but referred to an incorrect entity.
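The underlying problem is easy to demonstrate. The sketch below is a hypothetical stand-in for a reconciliation score (OpenRefine's actual scoring is more sophisticated), using simple string similarity: an exact name match scores 1.0 regardless of which real-world person the record actually denotes, which is exactly why 'definite' matches on name alone still need a human to check the entity behind them.

```python
from difflib import SequenceMatcher

def name_score(a, b):
    """Crude string similarity in [0, 1] — a stand-in for a reconciliation score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two records naming "H. G. Wells" score identically on the string,
# whether or not they refer to the same individual.
print(name_score("H. G. Wells", "H. G. Wells"))  # 1.0
print(name_score("H. G. Wells", "Herbert George Wells"))  # high, but below 1.0
```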


I initially looked for a suitable gazetteer or authority file to which I could link the various case locations. My first port of call was Geonames, the standard authority file for linking location data. This was encouraging, as it does include alternative and historical place names for modern places. However, it doesn't contain any additional information about the dates for which each name was valid, or the geographical boundaries of the place at different times (the historical/political nature of the geography of this period was highlighted in a previous post). I additionally looked for openly-available digital gazetteers for the relevant historical period (1860-1998), but unfortunately none yet seem to exist. However, I have recently become aware of the University of Pittsburgh’s World Historical Gazetteer project, and will watch its progress with interest. For now, Geonames seems like the best option, while being aware of its limitations.


Although there have been attempts to create standard URIs for courts, there doesn’t yet seem to be a suitable authority file to which I could reconcile the JCPC data. Instead, I decided to use the Virtual International Authority File (VIAF), which combines authority files from libraries all over the world. Matches were found for most of the courts contained in the dataset.


For the parties involved in the cases, I initially also used VIAF, which resulted in few definite matches. I therefore additionally decided to reconcile Appellant, Respondent, Intervenant and Third Party data to Wikidata. This was far more successful than VIAF, resulting in a combined total of about 200 matches. As a result, I was able to identify cases involving H. G. Wells, Bob Marley, and Frederick Deeming, one of the prime suspects for the Jack the Ripper murders. Due to time constraints, I was only able to check those matches identified as ‘definite’; more could potentially be found by looking at each party individually and selecting any appropriate matches from the list of possible options.


Once the entities were separated from each other and reconciled to external sources (where possible), the data was ready to convert to Linked Data. I did this using LODRefine, a version of OpenRefine packaged with plugins for producing Linked Data. LODRefine converts an OpenRefine project to Linked Data based on an ‘RDF skeleton’ specified by the user. RDF stands for Resource Description Framework, and is the standard by which Linked Data is represented. It describes each relationship in the dataset as a triple, comprising a subject, predicate and object. The subject is the entity you’re describing, the object is either a piece of information about that entity or another entity, and the predicate is the relationship between the two. For example, in the data model diagram we have the following relationship:


This is a triple, where the URI for the Appeal is the subject, the URI dc:title (the property ‘title’ in the Dublin Core terms vocabulary) is the predicate, and the value of the Appeal Title column is the object. I expressed each of the relationships in the data model as a triple like this one in LODRefine’s RDF skeleton. Once this was complete, it was simply a case of clicking LODRefine’s ‘Export’ button and selecting one of the available RDF formats. Having previously spent considerable time writing code to convert data to RDF, I was surprised and delighted by how quick and simple this process was.
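A triple of this kind can be serialised in a single line of Turtle. The sketch below is illustrative only — the `example.org` base URI and the appeal identifier are hypothetical placeholders, though `dc:title` is the real Dublin Core property used above:

```python
def turtle_triple(subject, predicate, obj):
    """Serialise one subject–predicate–object statement as a Turtle line."""
    return f"{subject} {predicate} {obj} ."

# Hypothetical appeal URI built from judgment year and number.
appeal = "<http://example.org/jcpc/appeal/1894_32>"
print(turtle_triple(appeal, "dc:title", '"Smith v. Jones"'))
```

LODRefine generates statements like this for every relationship in the RDF skeleton, which is why the export step is so quick once the skeleton is defined.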


The Linked Data version of the JCPC dataset is not yet available online, as we're currently going through the process of ascertaining the appropriate licence to publish it under. Once this is confirmed, the dataset will be available to download in both RDF/XML and Turtle formats.

The next post in this series will look at what can be done with the JCPC data following its conversion to Linked Data.

This post is by Sarah Middle, a PhD placement student at the British Library researching the appeal cases heard by the Judicial Committee of the Privy Council (JCPC).  Sarah is on twitter as @digitalshrew.   

01 February 2018

BL Labs 2017 Symposium: A large-scale comparison of world music corpora with computational tools, Research Award Winner


A large-scale comparison of world music corpora with computational tools.

By Maria Panteli, Emmanouil Benetos, and Simon Dixon from the Centre for Digital Music, Queen Mary University of London

The comparative analysis of world music cultures has been the focus of several ethnomusicological studies in the last century. With the advances of Music Information Retrieval and the increased accessibility of sound archives, large-scale analysis of world music with computational tools is today feasible. We combine music recordings from two archives, the Smithsonian Folkways Recordings and British Library Sound Archive, to create one of the largest world music corpora studied so far (8,200 geographically balanced recordings sampled from a total of 70,000 recordings). This work was submitted for the 2017 British Library Labs Awards - Research category.

Our aim is to explore relationships of music similarity between different parts of the world. The history of cultural exchange goes back many years and music, an essential cultural identifier, has travelled beyond country borders. But is this true for all countries? What if a country is geographically isolated or its society resisted external musical influence? Can we find such music examples whose characteristics stand out from other musics in the world? By comparing folk and traditional music from 137 countries we aim to identify geographical areas that have developed a unique musical character.

Maria Panteli fig 1

Methodology: Signal processing and machine learning methods are combined to extract meaningful music representations from the sound recordings. Data mining methods are applied to explore music similarity and identify outlier recordings.

We use digital signal processing tools to extract music descriptors from the sound recordings capturing aspects of rhythm, timbre, melody, and harmony. Machine learning methods are applied to learn high-level representations of the music and the outcome is a projection of world music recordings to a space respecting music similarity relations. We use data mining methods to explore this space and identify music recordings that are most distinct compared to the rest of our corpus. We refer to these recordings as ‘outliers’ and study their geographical patterns. More details on the methodology are provided here.
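The outlier step can be illustrated with a deliberately simplified sketch. This is not the authors' actual pipeline (which learns representations with machine learning before mining): it just treats each recording as a small feature vector and flags those whose distance from the corpus mean is unusually large.

```python
from math import sqrt
from statistics import mean, stdev

def outliers(features, threshold=2.0):
    """Flag recordings whose feature vector lies far from the corpus mean.

    `features` maps a recording id to a list of descriptor values.
    A recording is an outlier if its Euclidean distance to the mean
    vector exceeds the mean distance by `threshold` standard deviations.
    """
    ids = list(features)
    dims = len(features[ids[0]])
    centre = [mean(features[i][d] for i in ids) for d in range(dims)]
    dist = {i: sqrt(sum((v - c) ** 2 for v, c in zip(features[i], centre)))
            for i in ids}
    cutoff = mean(dist.values()) + threshold * stdev(dist.values())
    return [i for i in ids if dist[i] > cutoff]

# Hypothetical 2-dimensional descriptors for four recordings.
recordings = {
    "rec_a": [0.1, 0.2], "rec_b": [0.12, 0.19],
    "rec_c": [0.11, 0.21], "rec_d": [0.9, 0.95],  # far from the rest
}
print(outliers(recordings, threshold=1.0))  # ['rec_d']
```

In the actual study the feature space captures rhythm, timbre, melody and harmony, and distinctiveness is assessed per country, but the principle — distance-based detection of recordings that stand apart from the corpus — is the same.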


  Maria Panteli fig 2


Distribution of outliers per country: The colour scale corresponds to the normalised number of outliers per country, where 0% indicates that none of the recordings of the country were identified as outliers and 100% indicates that all of the recordings of the country are outliers.

We observed that out of 137 countries, Botswana had the most outlier recordings compared to the rest of the corpus. Music from China, characterised by bright timbres, was also found to be relatively distinct compared to music from its neighbouring countries. Analysis with respect to different features revealed that African countries such as Benin and Botswana had the largest number of rhythmic outliers, with recordings often featuring polyrhythms. Harmonic outliers originated mostly from South and Southeast Asian countries such as Pakistan and Indonesia, and African countries such as Benin and Gambia, with recordings often featuring inharmonic instruments such as the gong and bell. You can explore and listen to music outliers in this interactive visualisation. The datasets and code used in this project are included in this link.

Maria Panteli fig 3

Interactive visualisation to explore and listen to music outliers.

This line of research makes a large-scale comparison of recorded music possible, a significant contribution to ethnomusicology, and one we believe will help us better understand the music cultures of the world.

Posted by British Library Labs.


31 January 2018

Linking Privy Council Appeals Data


This post continues a series of blog posts relating to a PhD placement project that seeks to make data about appeals heard by the Judicial Committee of the Privy Council (JCPC) available in new formats, to enhance discoverability, and to increase the potential for new historical and socio-legal research questions. Previous posts looked at the historical context of the JCPC and related online resources, as well as the process of cleaning the data and producing some initial visualisations.

When looking at the metadata about JCPC judgments between 1860 and 1998, it became clear to me that what was in fact being represented here was a network of appeals, judgments, courts, people, organisations and places. Holding this information in a spreadsheet can be extremely useful, as demonstrated by the visualisations created previously; however, this format does not accurately capture the sometimes complex relationships underlying these cases. As such, I felt that a network might be a more representative way of structuring the data, based on a Linked Data model.

Linked Data was first introduced by Tim Berners-Lee in 2006. It comprises a set of tools and techniques for connecting datasets based on features they have in common in a format that can be understood by computers. Structuring data in this way can have huge benefits for Humanities research, and has already been used in many projects – examples include linking ancient and historical texts based on the places mentioned within them (Pelagios) and bringing together information about people’s experiences of listening to music (Listening Experience Database). I decided to convert the JCPC data to Linked Data to make relationships between the entities contained within the dataset more apparent, as well as link to external sources, where available, to provide additional context to the judgment documents.

The image below shows how the fields from the JCPC spreadsheet might relate to each other in a Linked Data structure.


In this diagram:

  • Blue nodes represent distinct entities (specific instances of e.g. Judgment, Appellant, Location)
  • Purple nodes represent the classes that define these entities, i.e. what type of entity each blue node is (terms that represent the concepts of e.g. Judgment, Appellant, Location)
  • Green nodes represent properties that describe those entities (e.g. ‘is’, ‘has title’, ‘has date’)
  • Orange nodes represent the values of those properties (e.g. Appellant Name, Judgment Date, City)
  • Red nodes represent links to external sources that describe that entity

Using this network structure, I converted the JCPC data to Linked Data; the conversion process is outlined in detail in the next blog post in this series.

A major advantage of converting the JCPC data to Linked Data is the potential it provides for integration with other sources. This means that search queries can be conducted and visualisations can be produced that use the JCPC data in combination with one or more other datasets, such as those relating to a similar historical period, geographical area(s), or subject. Rather than these datasets existing in isolation from each other, connecting them could fill in gaps in the information and highlight new relationships involving appeals, judgments, locations or the parties involved. This could open up the possibilities for new research questions in legal history and beyond.

Linking the JCPC data will also allow new types of visualisation to be created, either by connecting it to other datasets, or on its own. One option is network visualisations, where the data is filtered based on various search criteria (e.g. by location, time period or names of people/organisations) and the results are displayed using the network structure shown above. Looking at the data as a network can demonstrate at a glance how the different components relate to each other, and could indicate interesting avenues for future research. In a later post in this series, I’ll look at some network visualisations created from the linked JCPC data, as well as what we can (and can’t) learn from them.
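As a minimal sketch of that network view, the snippet below builds an adjacency structure from a couple of hypothetical appeal records (the field names and values are illustrative, not the actual JCPC data) and shows how filtering by location simply restricts which records contribute edges:

```python
from collections import defaultdict

# Hypothetical appeal records with fields like those in the data model.
appeals = [
    {"id": "1894_32", "appellant": "Smith", "respondent": "Jones",
     "location": "Bombay"},
    {"id": "1901_07", "appellant": "Smith", "respondent": "Brown",
     "location": "Lagos"},
]

def build_network(records, location=None):
    """Link each appeal node to its party and location nodes."""
    edges = defaultdict(set)
    for r in records:
        if location and r["location"] != location:
            continue
        for field in ("appellant", "respondent", "location"):
            edges[r["id"]].add(r[field])
            edges[r[field]].add(r["id"])
    return edges

net = build_network(appeals)
print(sorted(net["Smith"]))  # ['1894_32', '1901_07'] — a shared appellant node
```

Even at this toy scale, the shared "Smith" node makes the connection between the two appeals visible in a way a spreadsheet row does not, which is the point of the network representation.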

This post is by Sarah Middle, a PhD placement student at the British Library researching the appeal cases heard by the Judicial Committee of the Privy Council (JCPC).  Sarah is on twitter as @digitalshrew.    

21 December 2017

Cleaning and Visualising Privy Council Appeals Data


This blog post continues a recent post on the Social Sciences blog about the historical context of the Judicial Committee of the Privy Council (JCPC), useful collections to support research and online resources that facilitate discovery of JCPC appeal cases.

I am currently undertaking a three-month PhD student placement at the British Library, which aims to enhance the discoverability of the JCPC collection of case papers and to explore the potential of Digital Humanities methods for investigating questions about the court's caseload and its actors. Two methods I'll be using are creating visualisations to represent data about these judgments and converting this data to Linked Data. In today's post, I'll focus on the process of cleaning the data and creating some initial visualisations; information about the Linked Data conversion will appear in a later post.

The data I’m using refers to appeal cases that took place between 1860 and 1998. When I received the data, it was held in a spreadsheet where information such as ‘Judgment No.’, ‘Appellant’, ‘Respondent’, ‘Country of Origin’, ‘Judgment Date’ had been input from Word documents containing judgment metadata. This had been enhanced by generating a ‘Unique Identifier’ for each case by combining the judgment year and number, adding the ‘Appeal No.’ and ‘Appeal Date’ (where available) by consulting the judgment documents, and finding the ‘Longitude’ and ‘Latitude’ for each ‘Country of Origin’. The first few rows looked like this:


Data cleaning with OpenRefine

Before visualising or converting the data, some data cleaning had to take place. Data cleaning involves ensuring that consistent formatting is used across the dataset, there are no errors, and that the correct data is in the correct fields. To make it easier to clean the JCPC data, visualise potential issues more immediately, and ensure that any changes I make are consistent across the dataset, I'm using OpenRefine. This is free software that works in your web browser (but doesn't require a connection to the internet), which allows you to filter and facet your data based on values in particular columns, and batch edit multiple cells. Although it can be less efficient for mathematical functions than spreadsheet software, it is definitely more powerful for cleaning large datasets that mostly consist of text fields, like the JCPC spreadsheet.

Geographic challenges

Before visualising judgments on a map, I first looked at the 'Country of Origin' column. This column should more accurately be referred to as 'Location', as many of the entries were actually regions, cities or courts, instead of countries. To make this information more meaningful, and to allow comparison across countries e.g. where previously only the city was included, I created additional columns for 'Region', 'City' and 'Court', and populated the data accordingly:


An important factor to bear in mind here is that place names relate to their judgment date, as well as geographical area. Many of the locations previously formed part of British colonies that have since become independent, with the result that names and boundaries have changed over time. Therefore, I had to be sensitive to each location's historical and political context and ensure that I was inputting e.g. the region and country that a city was in on each specific judgment date.

In addition to the ‘Country of Origin’ field, the spreadsheet included latitude and longitude coordinates for each location. Following an excellent and very straightforward tutorial, I used these coordinates to create a map of all cases using Google Fusion Tables:

While this map shows the geographic distribution of JCPC cases, there are some issues. Firstly, multiple judgments (sometimes hundreds or thousands) originated from the same court, and therefore have the same latitude and longitude coordinates. This means that on the map they appear exactly on top of each other and it's only possible to view the details of the top 'pin', no matter how far you zoom in. As noted in a previous blog post, a map like this is already used by the Institute of Advanced Legal Studies (IALS); however, as it is being used here to display a curated subset of judgments, the issue of multiple judgments per location does not apply. Secondly, it only includes modern place names, which it does not seem to be possible to remove.

I then tried using Tableau Public to see if it could be used to visualise the data in a more accurate way. After following a tutorial, I produced a map that used the updated ‘Country’ field (with the latitude and longitude detected by Tableau) to show each country where judgments originated. These are colour coded in a ‘heatmap’ style, where ‘hotter’ colours like red represent a higher number of cases than ‘colder’ colours such as blue.

This map is a good indicator of the relative number of judgments that originated in each country. However, Tableau (understandably and unsurprisingly) uses the modern coordinates for these countries, and therefore does not accurately reflect their geographical extent when the judgments took place (e.g. the geographical area represented by ‘India’ in much of the dataset was considerably larger than the geographical area we know as India today). Additionally, much of the nuance in the colour coding is lost because the number of judgments originating from India (3,604, or 41.4%) is far greater than the number from any other country. This is illustrated by a pie chart created using Google Fusion Tables:

Using Tableau again, I thought it would also be helpful to go to the level of detail provided by the latitude and longitude already included in the dataset. This produced a map that is more attractive and informative than the Google Fusion Tables example, in terms of the number of judgments from each set of coordinates.

The main issue with this map is that it still doesn't provide a way into the data. There are 'info boxes' that appear when you hover over a dot, but these can be misleading as they contain combined information from multiple cases, e.g. if one of the cases includes a court, this court is included in the info box as if it applies to all the cases at that point. Ideally what I'd like here would be for each info box to link to a list of cases that originated at the relevant location, including their judgment number and year, to facilitate ordering and retrieval of the physical copy at the British Library. Additionally, each judgment would link to the digitised documents for that case held by the British and Irish Legal Information Institute (BAILII). However, this is unlikely to be the kind of functionality Tableau was designed for - it seems to be more for overarching visualisations than for use as a discovery tool.

The above maps are interesting and provide a strong visual overview that cannot be gained from looking at a spreadsheet. However, they would not assist users in accessing further information about the judgments, and do not accurately reflect the changing nature of the geography during this period.

Dealing with dates

Another potentially interesting aspect to visualise was case duration. It was already known prior to the start of the placement that some cases were disputed for years, or even decades; however, there was no information about how representative these cases were of the collection as a whole, or how duration might relate to other factors, such as location (e.g. is there a correlation between case duration and distance from the JCPC headquarters in London? Might duration also correlate with the size and complexity of the printed record of proceedings contained in the volumes of case papers?).

The dataset includes a Judgment Date for each judgment, with some cases additionally including an Appeal Date (which started to be recorded consistently in the underlying spreadsheet from 1913). Although the Judgment Date shows the exact day of the judgment, the Appeal Date only gives the year of the appeal. This means that we can calculate the case duration to an approximate number of years by subtracting the year of appeal from the year of judgment.

Again, some data cleaning was required before making this calculation or visualising the information. Dates had previously been recorded in the spreadsheet in a variety of formats, and I used OpenRefine to ensure that all dates appeared in the form YYYY-MM-DD:



It was then relatively easy to copy the year from each date to a new ‘Judgment Year’ column, and subtract the ‘Appeal Year’ to give the approximate case duration. Performing this calculation was quite helpful in itself, because it highlighted errors in some of the dates that were not found through format checking. Where the case duration seemed surprisingly long, or had a negative value, I looked up the original documents for the case and amended the date(s) accordingly.
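The calculation and the sanity check it enabled can be sketched as follows — a stdlib-only illustration of the logic described above, with hypothetical values rather than real JCPC rows; a negative duration is returned as `None` to flag the row for manual checking against the original documents:

```python
from datetime import date

def case_duration(appeal_year, judgment_date):
    """Approximate duration in years: judgment year minus appeal year.

    Returns None when the result is negative, which usually indicates
    a data-entry error in one of the dates.
    """
    judgment_year = date.fromisoformat(judgment_date).year
    duration = judgment_year - appeal_year
    return duration if duration >= 0 else None

print(case_duration(1913, "1915-06-22"))  # 2
print(case_duration(1920, "1915-06-22"))  # None — flag for manual checking
```

Because the appeal date is recorded only to the year, the result is approximate: a case registered in December and decided the following January counts as one year.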

Once the above tasks were complete, I created a bar chart in Google Fusion Tables to visualise case duration – the horizontal axis represents the approximate number of years between the appeal and judgment dates (e.g. if the value is 0, the appeal was decided in the same year that it was registered in the JCPC), and the vertical axis represents the number of cases:


This chart clearly shows that the vast majority of cases were up to two years in length, although this will also potentially include appeals of a short duration registered at the end of one year and concluded at the start of the next. A few took much longer, but are difficult to see due to the scale necessary to accommodate the longest bars. While this is a useful way to find particularly long cases, the information is incomplete and approximate, and so the maps would potentially be more helpful to a wider audience.

Experimenting with different visualisations and tools has given me a better understanding of what makes a visualisation helpful, as well as considerations that must be made when visualising the JCPC data. I hope to build on this work by trying out some more tools, such as the Google Maps API, but my next post will focus on another aspect of my placement – conversion of the JCPC data to Linked Data.

This post is by Sarah Middle, a PhD placement student at the British Library researching the appeal cases heard by the Judicial Committee of the Privy Council (JCPC).  Sarah is on twitter as @digitalshrew.    

06 June 2017

Digital Conversations @BL - Web Archives: truth, lies and politics

Add comment

Next week we are spoiled for choice here at the British Library, with two topical and fascinating evening events about data and digital technology. On Monday 12 June there is the first public Data Debate, delivered in collaboration with the Alan Turing Institute, about the complex issue of data in healthcare; for more details check out this blog post. Then on Wednesday 14 June there is a Digital Conversation event on Web Archives: truth, lies and politics in the 21st century, where a panel of scholars and experts in web archiving and digital studies will discuss the role of web and social media archives in helping us, as digital citizens, to navigate a complex and changing information landscape.

Web archiving began in 1996 with the Internet Archive, and these days many university and national libraries around the world have web archiving initiatives. The British Library started web archiving in 2004, and since 2013 we have collected an annual snapshot of all UK websites. As a result, there are rich web archive collections documenting political and social movements at international and local levels, including the Library of Congress collections on the Arab Spring and the UK Web Archive collections on past General Elections.

The Digital Conversation will be chaired by Eliane Glaser, author of Get Real: How to See Through the Hype, Spin and Lies of Modern Life. The panel includes Jane Winters, Chair of Digital Humanities, School of Advanced Study, University of London; Valérie Schafer, Historian at the French National Center for Scientific Research (Institute for Communication Sciences, CNRS); Jefferson Bailey, Director of Web Archiving Programs at the Internet Archive; and Andrew Jackson, Web Archiving Technical Lead at the British Library.

For more information and to book tickets go here. Hope to see you there!

Image credit: Grow the real economy by ijclark, depicting the Occupy London protest camp in 2011, CC BY 2.0

This Digital Conversations event is part of the Web Archiving Week 12-16 June co-hosted by the British Library and the School of Advanced Study, University of London. This is a week of conferences, hackathons and talks in London to discuss recent advances in web archiving and research on the archived web. You can follow tweets from the conferences and the Digital Conversation on Twitter, using the hashtag #WAweek2017.

This post is by Digital Curator Stella Wisdom, on twitter as @miss_wisdom.

20 July 2016

Dealing with Optical Character Recognition errors in Victorian newspapers

Add comment

This second (of two) posts featuring speakers at an internal seminar on spatial humanities is by Amelia Joulain-Jay of Lancaster University. Let's hear from Amelia...

Have you browsed through the British Library’s Collection of Nineteenth Century Newspapers? If so, you have probably searched for a word in an article, only to find that some instances of that word were highlighted and not others. In the following article, for example (from the 24 August 1833 edition of the Leeds Mercury), searching for ‘Magistrates’ (without 'fuzzy search') highlights one instance in the second paragraph but misses the instance in the first paragraph.

Figure 1. Image snap of “COUNTY RATE”, Leeds Mercury, 24 Aug. 1833, British Library Newspapers (login may be required). [Last accessed 13 Jul. 2016]

That’s because what you see is a picture of the original source, and you (as a human) are able to read it. But the search engine is searching through OCR output – text generated by Optical Character Recognition (OCR) software which tries to guess what characters are represented on an image. The OCR output for the passage above actually looks like this:

COUNTY RATE tvtaN s s fl s Loud complaintst have been madc and we believe jstly of the unequal pressure of the County Rate ripon the differenrt townships and parishes of and it has In consequence been deter inmosl to make a general survey and to establisB a new scale of ment To this the trading and tnanufacturing interests of the Riding do not object tiorgfl tile effect will doubtless be to advance their assessmcnts in coparlison with those of the agricultural parhitras But we confess that it wa with setrprise we heard that any of the Mogistrates in holding their Courts for the assessment of the respective townships had reated them into secret tribunals and that they lad excluded from their sittings thoso wlto are mainly interested in ascertaining the principles which goreen the raluation of propertt and the full and fair develtpmemnt of which can alone rcuider the decislons of their Courts either satisfactory or permaneent The frank and manly example set by tire township of Leeds dorg h0onour to tbe parish officers and we must say wIthout wishling to give offence to those for swhoimt we feel nothing but respect that the line of conduct r sued by ithe Magistrates at Bradford on Btoaday last in excludintgi a parist officer from their Court swhen they knew that he was tire organ of tie towvnship hltich contributes most targely to this impost il the ltole Riding and when lie lasi explained to them in latigniagr srfaitiently courteous anid respectful that lie sotght only rltv crlsis of public jusrice requires a anuch ittore satisfnectory explanation than toas either given on Lhat tccasion or than ee apprehendl con be give n for adopting one of the roost objectionrble characteristics of the Court of the Holy lrquisition

Figure 2. OCR data for “COUNTY RATE”, Leeds Mercury, 24 Aug. 1833, British Library Newspapers.

You can read a lot of it, but there are errors, including the first occurrence of ‘Magistrates’ which is spelt ‘Mogistrates’.

Guessing what characters are in an image is not an easy task for computers, especially when the images are of historical newspapers which can be in varying states of conservation, and often contain complex layouts with columns, illustrations and different font types and sizes all on the same page.

So, how much of a problem is this, and can the errors be corrected?

This is what I have been investigating for my PhD project, as part of the Spatial Humanities project and in association with the Centre for Corpus Approaches to the Social Sciences.

In a nutshell: it’s not very easy to correct OCR errors automatically, because errors can be very dissimilar to their correct form – in the example above, the phrase ‘language sufficiently courteous’ has become ‘latigniagr srfaitiently courteous’ in the OCR output. Normalization software (such as spell-checkers) often assumes that errors and their corrections will have many letters in common (as if they were playing a game of anagrams), but as this example shows, that assumption is often incorrect. So how can OCR errors be corrected? One state-of-the-art commercial software package I tested, Overproof, uses a technique its designers call ‘reverse OCR’: essentially, they compare images of correct words to the image of the source! A simple-sounding idea which turns out to work well; you can read more about it in 'Correcting noisy OCR: context beats confusion' (login may be required).
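The point about the anagram-style assumption can be made concrete with edit distance. Below is a minimal Levenshtein implementation (a standard textbook algorithm, not anything used by the software discussed here), applied to word pairs from the article above:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# One edit separates the OCR error from the real word: easy for a spell-checker
print(levenshtein("Mogistrates", "Magistrates"))  # -> 1

# Wholesale garbling is much further away, well beyond a spell-checker's reach
print(levenshtein("latigniagr", "language"))
```

A spell-checker can bridge the first pair, but the second shows why heavily garbled OCR output defeats tools that assume the error and the correction share most of their letters.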

And how much of a problem are the errors? Well, it depends what you are using the texts for. Leaving aside the question of search engines and their 'traps for the unwary', if you are interested in analysing patterns of discourse in texts, the main problem you will face is that the errors are not distributed evenly throughout the texts. This makes it difficult to predict how the errors might affect the retrieval of a particular word or phrase you are interested in. But if you follow some common-sense advice, you can stay on safe ground:

  1. Don’t over-interpret absences. (In OCR’ed texts, something which is missing may simply be something which is irretrievable because it is affected by OCR errors.)
  2. Focus on patterns for which you can find many different examples: ‘real-word errors’ (errors which happen to coincide with a word which actually exists, such as ‘Prussia’ which becomes ‘Russia’ when the OCR misses out the ‘P’) do exist, but they do not normally occur very often. Keep an eye out for them, but if you form a hypothesis on the basis of many examples, you are on safe ground!
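One practical mitigation when searching OCR’d text is approximate matching. Python’s standard difflib module can retrieve near-miss spellings, as this sketch shows (it is an illustration of the general idea, not the technique used by the newspaper interface itself; the token list is drawn from the OCR output above):

```python
import difflib

# A few tokens as they appear in the OCR output shown earlier
ocr_tokens = ["Mogistrates", "townships", "parishes", "srfaitiently"]

# Retrieve close matches for the intended query term despite OCR damage
matches = difflib.get_close_matches("Magistrates", ocr_tokens, n=3, cutoff=0.8)
print(matches)  # -> ['Mogistrates']
```

Note the trade-off: lowering `cutoff` retrieves more damaged variants but also more false positives, which is exactly the kind of fuzziness the ‘fuzzy search’ option mentioned earlier is balancing.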

In conclusion, digitized historical texts may suffer from OCR errors. It is important to be aware of the issue, but do not let this hold you back from using such sources in your research – following some simple rules of thumb (such as not placing too much emphasis on absences and focussing on patterns for which there are many different examples) can keep you on safe ground.

08 February 2016

Cambridge @BL_Labs Roadshow Mon 15 Feb (9.30am - 12.30pm) and (1.30pm - 4.30pm)

Add comment Comments (0)

The @BL_Labs roadshow moves on to Cambridge, and we still have a few places available for our FREE and open-to-all afternoon showcase event on Monday 15 February between 1.30pm - 4.30pm (booking essential). The event is kindly hosted by the Digital Humanities Network, a group of researchers at the University of Cambridge interested in how the use of digital tools is transforming scholarship in the humanities and social sciences.

@BL_Labs Roadshow in Cambridge - Mon 15 Feb (0930 - 1230 and 1330 - 1630), hosted by the Digital Humanities Network at the University of Cambridge.

Building a search engine that works for you (9.30am - 12.30pm).

Building a search engine that works for you, Cambridge - Mon 15 Feb (9.30am - 12.30pm).

Led by British Library Labs Technical Lead Ben O'Steen, a special workshop in the morning (9.30am - 12.30pm) will get under the 'hood' of search engines. Attendees will load texts from the British Library's largely 19th-century digitised book collection into a search engine to explore the problems, opportunities and assumptions involved in creating such a service. The session will use Elasticsearch, Python, Git and Notepad++.

The aim is to step people through the challenges and compromises required to build something as simple as a Google-style search service, and to explore a few ways to tailor it to specific needs. This involves dealing with XML and the quality of real-world data, and using Python code to put data into Elasticsearch and query it. This 3-hour workshop will give participants an understanding of how search engines work from the inside. No technical knowledge is required as a prerequisite, but spaces are strictly limited and the focus will be on practical application of the ideas. University of Cambridge researchers and students have priority for bookings, but you can now book here. However, please contact Anne Alexander to see if there have been any last-minute cancellations, especially if you are from outside the University and would like to attend.
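The core idea the workshop demonstrates – an inverted index mapping each term to the documents that contain it – can be sketched in a few lines of plain Python. This is a toy stand-in for what Elasticsearch does internally, and the document texts are invented for illustration:

```python
from collections import defaultdict

docs = {
    1: "loud complaints have been made about the county rate",
    2: "the magistrates held their courts at bradford",
}

# Build the inverted index: term -> set of ids of documents containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """Return ids of documents containing every query term (AND search)."""
    sets = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*sets) if sets else set()

print(search("county rate"))  # -> {1}
print(search("the"))          # -> {1, 2}
```

Real search engines add tokenisation rules, ranking, and index compression on top of this structure, but the term-to-documents lookup is the heart of it.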

Labs and Digital Research Showcase with an 'Ideas Lab' (1.30pm-4.30pm).

The showcase in the afternoon (1.30pm-4.30pm) will provide participants with an opportunity to:

  • Understand what Digital Research activity is being carried out at the British Library.
  • Discover the digital collections the British Library has, understand some of the challenges of using them and even take some away with you.
  • Learn how researchers found and revived forgotten Victorian jokes and political meetings from our digital archives.
  • Understand how special games and computer code have been developed to help tag un-described images and make new art.
  • Find out about a tool that links digitised handwritten manuscripts to transcribed texts and one that creates statistically representative samples from the British Library’s book collections.
  • Consider how the intuitions of a DJ could be used to mix and perform the Library's digital collections.
  • Talk to Library staff about how you might use some of the Library's digital content innovatively.
  • Get advice, pick up tips and feedback on your ideas and projects for the 2016 BL Labs Competition (deadline 11 April) and Awards (deadline 5 September).

For more information about the afternoon session, a detailed programme and to book your place, visit the Labs & Digital Research Showcase with an 'Ideas Lab' event page.

Posted by Mahendra Mahey, Manager of BL Labs.

The BL Labs project is funded by the Andrew W. Mellon Foundation.