20 March 2015
Texcavator in Residence
This is a guest post by Melvin Wevers, Utrecht University
As part of a three-week research stay at the British Library, I looked into whether and how the British historical newspaper collection could be incorporated into my own research project. I am a PhD candidate in the Translantis research program based at Utrecht University in the Netherlands. The program uses digital humanities tools to analyze how the United States has served as a cultural model for the Netherlands in the long twentieth century. A sister project, Asymmetrical Encounters, based in Utrecht, London, and Trier, looks at similar processes within a European context. Our main sources are Dutch newspapers held by the National Library of the Netherlands (KB).
My research stay at the British Library served two main goals: first, to investigate how the British newspaper data could be incorporated into my project's research tool, Texcavator; second, to analyze to what extent the newspapers can be used for historical research with computational methods such as full-text searching, topic modeling, and named entity recognition.
Texcavator allows researchers to search newspaper archives using full-text search and more advanced search strategies such as wildcards and fuzzy searching. It also lets researchers create timelines and word clouds, which can be enriched with named entity annotators or sentiment mining modules. In addition, an export function allows the researcher to create subcorpora for use in other tools and analytical software such as Mallet or Voyant Tools. Texcavator uses Elasticsearch (ES) as its search and analytics engine. ES stores documents as JSON, and these need to follow a particular schema in order to work with Texcavator, with fields for newspaper title, date of publication, article type, newspaper type, and spatial distribution.
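To make this concrete, a document in such a schema might look roughly like the sketch below; the field names are illustrative assumptions rather than the exact Texcavator mapping.

```python
# Illustrative Elasticsearch document in a Texcavator-style schema.
# Field names are assumptions, not the exact Texcavator mapping.
article_doc = {
    "paper_title": "Pall Mall Gazette",   # newspaper title
    "date": "1885-07-06",                 # date of publication
    "article_type": "news",               # e.g. news, advertisement
    "newspaper_type": "daily",            # type of newspaper
    "spatial": "London",                  # spatial distribution
    "text": "Full OCR'd text of the article ...",
}
```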
The newspaper data on the British Library's servers uses an XML schema. To parse the XML files into JSON, I used a Perl parser that adapted the schema to work with Texcavator and converted the records into JSON files. A Python script then enabled me to batch-index the files into an Elasticsearch index. Next, I installed Texcavator and configured it to communicate with that index. This shows that it is fairly easy to load the BL newspaper data into an ES index that can be queried.
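The Perl conversion step is not reproduced here, but the batch-indexing step could look roughly like this sketch, which assumes the converted articles sit in a directory of JSON files and uses the official elasticsearch Python client; the index name and directory are hypothetical.

```python
import json
from pathlib import Path

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def load_actions(json_dir, index_name="bl_newspapers"):
    """Yield one bulk-index action per converted JSON article."""
    for path in Path(json_dir).glob("*.json"):
        with open(path, encoding="utf-8") as fh:
            doc = json.load(fh)
        yield {"_index": index_name, "_source": doc}

# Batch-index all converted articles into the Elasticsearch index.
helpers.bulk(es, load_actions("converted_articles"))
```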
After this, I set out to determine whether the historical newspapers are suitable for historical analysis. One of the challenges of working with this archive is the poor quality of the digitized texts: many articles are barely legible because they contain a great deal of gibberish produced by optical character recognition (OCR). Using Texcavator, these newspapers nevertheless prove useful for historical research. Because Texcavator combines the OCR'd texts with the images of the articles, the researcher can use keyword searches to find (some of the) articles that contain a given word, and then read those articles by looking at the images: the OCR facilitates the searching, and the images are used to close-read the articles. Smart queries using wildcards, as well as ES optimization, can improve the precision and recall of searching.
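As an illustration of such a query, the sketch below combines a fuzzy match (to catch OCR variants of a keyword) with a wildcard for related forms; the index and field names follow the hypothetical schema sketched above.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Fuzzy matching catches OCR variants of "empire" (e.g. "emplre"),
# while the wildcard clause also picks up "imperial", "imperialism", etc.
query = {
    "query": {
        "bool": {
            "should": [
                {"match": {"text": {"query": "empire", "fuzziness": "AUTO"}}},
                {"wildcard": {"text": "imperial*"}},
            ]
        }
    }
}

results = es.search(index="bl_newspapers", body=query)
for hit in results["hits"]["hits"]:
    print(hit["_source"]["paper_title"], hit["_source"]["date"])
```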
The texts can also be cleaned up, for instance by removing all stop words or words that occur with a very low frequency (the latter are often OCR errors). After cleaning, techniques such as topic modeling and named entity recognition can still be applied. The OCR quality of some newspapers in the archive (such as the Pall Mall Gazette) has already been improved. Using the ES index, I exported this specific newspaper as a subcorpus, tagged it with location entities, and ran it through the topic modeling toolkit Mallet. I am using this corpus to analyze the international outlook between 1860 and 1900. For more on this, see my paper proposal "Reporting the Empire" for DH Benelux.
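A minimal cleaning step might look like the sketch below; the stop word list and frequency threshold are placeholders that would need tuning for the actual corpus, and the output is the one-document-per-line layout that Mallet's import-file command expects.

```python
from collections import Counter

# Placeholder stop word list and frequency threshold; both would need
# tuning for the actual newspaper corpus.
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "that", "is"}
MIN_FREQ = 5

def clean_corpus(documents):
    """Remove stop words and very rare tokens (often OCR errors)."""
    tokenised = [doc.lower().split() for doc in documents]
    freqs = Counter(token for doc in tokenised for token in doc)
    return [
        [t for t in doc if t not in STOP_WORDS and freqs[t] >= MIN_FREQ]
        for doc in tokenised
    ]

def write_for_mallet(cleaned, path="subcorpus.txt"):
    """Write one document per line: instance name, label, text."""
    with open(path, "w", encoding="utf-8") as fh:
        for i, doc in enumerate(cleaned):
            fh.write(f"doc{i}\ten\t{' '.join(doc)}\n")
```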
One of the first steps in working with the data at the British Library is making it available to researchers in an index that allows for the creation of derived datasets based on either specific queries or metadata such as newspaper title or spatial distribution (a minimal sketch of such an export follows below). The ability to create derived datasets is a necessary step within the larger digital humanities workflow. These datasets can then be processed using freely available text mining tools and visualization libraries such as D3.js or Gephi. I further explained this particular approach to Digital Humanities in a talk I gave at the UCL DH seminar on 25 February. The slides of my presentation "Doing Digital History" can be found on Slideshare.
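As a sketch of what creating such a derived dataset might look like in practice, the snippet below pulls every article from one newspaper within a date range out of the hypothetical index used above and writes it to a JSON Lines file for downstream tools.

```python
import json

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Select every article from one newspaper within a date range; the index
# and field names follow the hypothetical schema sketched earlier.
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"paper_title": "Pall Mall Gazette"}},
                {"range": {"date": {"gte": "1860-01-01", "lte": "1900-12-31"}}},
            ]
        }
    }
}

# Stream all matching documents and save them as a derived dataset.
with open("pall_mall_subcorpus.jsonl", "w", encoding="utf-8") as fh:
    for hit in helpers.scan(es, index="bl_newspapers", query=query):
        fh.write(json.dumps(hit["_source"]) + "\n")
```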