THE BRITISH LIBRARY

Digital scholarship blog

3 posts from September 2014

26 September 2014

Applying Forensics to Preserving the Past: Current Activities and Future Possibilities

First Digital Lives Research Workshop 2014 at the British Library
 


[Image: Digital Lives Research Workshop 2014]
 

 

With more and more libraries, archives and museums adopting forensic approaches and tools for handling and processing born-digital objects, both in the UK and overseas, it seemed a good time to take stock. Archivists and curators were invited (via professional email listservs) to submit a short paper for an inclusive and interactive workshop stretching over two days in London.

Institutions are applying digital forensics across the entire lifecycle, from appraisal through to content analysis, and have begun to establish workflows that embrace forensic techniques: the use of write blockers during the creation of disk images, the extraction of metadata, and the searching, filtering and interpretation of digital data, notably the appropriate management of sensitive information.
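To make the metadata-extraction step concrete, here is a minimal sketch in Python (standard library only) that walks a read-only copy of captured material, recording a cryptographic hash, size and timestamp for each file. It illustrates the general technique rather than any institution's actual workflow, and the mount point and output filename are hypothetical.

```python
# A minimal sketch of the metadata-extraction step: walk a (read-only)
# copy of captured material and record size, modified date and a SHA-256
# hash for each file. Paths here are hypothetical.
import csv
import hashlib
import os
from datetime import datetime, timezone

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large objects don't exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root, out_csv):
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "bytes", "modified_utc", "sha256"])
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                info = os.stat(path)
                modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
                writer.writerow([path, info.st_size,
                                 modified.isoformat(), sha256_of(path)])

build_manifest("/mnt/capture", "manifest.csv")  # hypothetical mount point
```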

There are two sides to digital forensics: it begins with the protection of digital evidence and concludes with the retrospective analysis of past events and objects. Papers reflecting both aspects were submitted for the workshop (download DLRW 2014 Outline).

The workshop provided participants with opportunities to report on current activities, to highlight gaps, constraints and possibilities, and to discuss and agree collective steps and actions.

 

[Image: DLRW 2014 delegates]

 

As the following list demonstrates, delegates came from a diverse range of institutions: universities, libraries, galleries and archives, and the private sector.

Matthew Addis, Arkivum

Fran Baker, John Rylands Library, University of Manchester 

Thom Carter, London School of Economics Library

Dianne Dietrich, Cornell University Library

Rachel Foss, British Library

Claus Jensen, Royal Library of Denmark and Copenhagen University Library

Jeremy Leighton John, British Library

Svenja Kunze, Bodleian Library, University of Oxford

John Langdon, Tate Gallery

Cal Lee, University of North Carolina at Chapel Hill

Caroline Martin, John Rylands Library, University of Manchester (contributor to paper)

Helen Melody, British Library

Stephen Rigden, National Library of Scotland

Elinor Robinson, London School of Economics Library

Susan Thomas, Bodleian Library, University of Oxford 

Dorothy Waugh, Emory University

 

[Image: workshop discussion tables]

 

I gave an introduction to the original Digital Lives Research project and a brief overview of the ensuing internal projects at the British Library (Personal Digital Manuscripts and Personal Digital Archives), while Aquiles Alencar-Brayner gave an introduction to Digital Scholarship at the British Library, including the award-winning BL Labs project.

Short talks presented overviews of current activities at the National Library of Scotland, the University of Manchester and the London School of Economics, and of the establishment of forensic and digital archiving at these institutions: the value of a secure and dedicated workspace, the use of a forensic tool for examining large numbers of emails, the integration of forensic techniques within existing working environments and practices, and the importance of tailored training.

Other talks were directed at specific applications of forensic tools: the preservation of complex digital objects in the Rose Goldsen Archive of New Media at Cornell University Library, the capture of computer games at the Royal Library of Denmark, and the challenges of capturing the floppy disks of poet and author Lucille Clifton at Emory University, these disks having been produced on a Magnavox Videowriter.

 

[Image: Hanif Kureishi Archive screenshot]

 

My colleagues Rachel Foss and Helen Melody and I presented a paper on the Hanif Kureishi Archive, a collection of paper and digital materials recently acquired by the British Library’s literary curators, outlining specifically the use of digital forensics for appraisal and textual analysis.

Prior to acquisition, Rachel and I previewed the archive using fuzzy hashing (a technique for quickly identifying similar files).
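For readers unfamiliar with the technique, here is a minimal sketch of the idea using the Python bindings for ssdeep (the fuzzy hashing tool mentioned below); it illustrates the concept rather than the procedure we actually ran, and the filenames are hypothetical.

```python
# Fuzzy hashing with ssdeep: similar files yield similar digests, and
# compare() returns a 0-100 match score. Requires the `ssdeep` Python
# bindings (pip install ssdeep); the filenames are hypothetical.
import ssdeep

h1 = ssdeep.hash_from_file("draft_01.doc")
h2 = ssdeep.hash_from_file("draft_02.doc")

score = ssdeep.compare(h1, h2)  # 0 = no similarity, 100 = near-identical
print(f"similarity: {score}")
```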

  

[Image: Hanif Kureishi Archive database screenshot]
 

After the archive was obtained and forensically captured, metadata were extracted from the digital objects and made available along with curatorial versions of the text documents, and Helen catalogued them using the British Library’s Integrated Archives and Manuscripts System.

 

[Images: Hanif Kureishi Archive catalogue screenshots]

One of the most exciting aspects of the archive is a set of 53 drafts of Hanif Kureishi’s novel Something To Tell You, which Rachel, Helen and I decided to explore as an example for the workshop. 

 

 


Figure 1. Logical file size plotted against last modified date: an editing history
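A plot like Figure 1 needs nothing more exotic than the extracted metadata. Here is a minimal matplotlib sketch, assuming a CSV manifest of the kind sketched earlier in this post; the filename and column names are hypothetical.

```python
# A minimal sketch of a Figure 1-style plot: logical file size (log scale)
# against last-modified date, read from a hypothetical manifest.csv with
# "modified_utc" and "bytes" columns.
import csv
from datetime import datetime
import matplotlib.pyplot as plt

dates, sizes = [], []
with open("manifest.csv", newline="") as f:
    for row in csv.DictReader(f):
        dates.append(datetime.fromisoformat(row["modified_utc"]))
        sizes.append(int(row["bytes"]))

fig, ax = plt.subplots()
ax.scatter(dates, sizes)
ax.set_yscale("log")
ax.set_xlabel("Last modified date")
ax.set_ylabel("Logical file size (bytes, log scale)")
fig.autofmt_xdate()  # tilt the date labels so they don't overlap
plt.show()
```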

 

We used the sdhash tool (produced by Vassil Roussev of the University of New Orleans and incorporated within the BitCurator framework). Like the ssdeep fuzzy hashing tool (which has been incorporated into Forensic Toolkit, FTK), it identifies similarities among files but uses a distinct approach.

[Image: sdhash output for the Something To Tell You drafts]

With BitCurator it is possible to direct sdhash at a set of files and ask the tool first to create the similarity digests and then to make pairwise comparisons across the similarity digests for all files, each pair of files being assigned a similarity score.

 


 

Figure 2. Similarity score (sdhash) plotted against the absolute difference in indicated dates (days) between files (each point represents a pair of draft files): broadly speaking, the greater the number of days between the files of a pair, the lower the similarity score.

 

This is a preliminary analysis, and readers of this blog entry who are familiar with statistical methods may recognise that it might be better to use partial regression or a similar statistical approach. A further small point: as Dr Roussev has emphasised, a similarity score of 100 does not mean that the files are identical; cryptographic hashes can serve this purpose and are to be incorporated in future versions of the sdhash tool, which is still under active development.
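For readers who would like to experiment with something similar, here is a rough sketch of how the data behind Figure 2 might be assembled and the apparent trend given a simple first check. The pipe-delimited "fileA|fileB|score" output format and the filename pairwise.txt are assumptions on my part, so do check what your version of sdhash actually emits.

```python
# A rough sketch: join sdhash's pairwise similarity scores (assumed to be
# saved as lines of the form "fileA|fileB|score") with the files' modified
# dates, then check the trend seen in Figure 2 with a rank correlation.
import os
from scipy.stats import spearmanr

day_diffs, scores = [], []
with open("pairwise.txt") as f:  # hypothetical saved comparison output
    for line in f:
        file_a, file_b, score = line.strip().split("|")
        days = abs(os.path.getmtime(file_a) - os.path.getmtime(file_b)) / 86400
        day_diffs.append(days)
        scores.append(int(score))

# A negative rank correlation would support "further apart in time,
# less similar" - a crude stand-in for a fuller statistical treatment.
rho, p_value = spearmanr(day_diffs, scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3g})")
```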

 

Following the more formal talks we began an open discussion with the aim of identifying priority topics, and subsequently divided into three groups addressing metadata, access and sensitivity respectively, which concluded the first day. On the second day we focussed the conversation further, with two groups addressing cataloguing and metadata on the one hand, and tools and workflows on the other.

Steps were taken towards specific conclusions and recommended actions, in preparation for publication and dissemination.

The desire to continue and extend the collaboration was strongly expressed, and fittingly Cal Lee concluded the workshop by updating us on developments of the BitCurator platform and the launch of the BitCurator Consortium, an important invitation for institutions to participate and for individuals to collaborate. 

BitCurator is going from strength to strength: receiving an extension of the project, formally launching the BitCurator Consortium, and releasing Version 1.0 of the BitCurator software.  

 

Many congratulations to Fran and Caroline on their email project becoming a finalist for the Digital Preservation Awards 2014: the University of Manchester Library’s Carcanet Press Archive project, which among many other things explored the use of the forensic tool Email Examiner along with Aid4Mail (which, incidentally, has a forensic version).

 


 

The workshop was jointly organised by me, Cal Lee (University of North Carolina at Chapel Hill) and Susan Thomas (Bodleian Library, University of Oxford).  

Very many thanks to the delegates for all of their participation over the two days. 

Jeremy Leighton John, Curator of eMSS 

@emsscurator

15 September 2014

Finding Jokes - The Victorian Meme Machine

Posted on behalf of Bob Nicholson.

The Victorian Meme Machine is a collaboration between the British Library Labs and Dr Bob Nicholson (Edge Hill University). The project will create an extensive database of Victorian jokes and then experiment with ways to recirculate them over social media. For an introduction to the project, take a look at this blog post or this video presentation.

Stage One: Finding Jokes

Whenever I tell people that I’m working with the British Library to develop an archive of nineteenth-century jokes, they often look a bit confused. “I didn’t think the Victorians had a sense of humour”, somebody told me recently. This is a common misconception. We’re all used to thinking of the Victorians as dour and humourless; as a people who were, famously, ‘not amused’. But this couldn’t be further from the truth. In fact, jokes circulated at all levels of Victorian culture. While most of them have now been lost to history, a significant number have survived in the pages of books, periodicals, newspapers, playbills, adverts, diaries, songbooks, and other pieces of printed ephemera. There are probably millions of Victorian jokes sitting in libraries and archives just waiting to be rediscovered – the challenge lies in finding them.   

In truth, we don’t know how many Victorian gags have been preserved in the British Library’s digital collections. Type the word ‘jokes’ into the British Newspaper Archive or the JISC Historical Texts collection and you’ll find a handful of them fairly quickly. But this is just the tip of the iceberg. There are many more jests hidden deeper in these archives. Unfortunately, they aren’t easy to uncover. Some appear under peculiar titles, others are scattered around as unmarked column fillers, and many have aged so poorly that they no longer look like jokes at all. Figuring out an effective way to find and isolate these scattered fragments of Victorian humour is one of the main aims of our project. Here’s how we’re approaching it.

Firstly, we’ve decided to focus our attention on two main sources: books and newspapers. While it’s certainly possible to find jokes elsewhere, these sources provide the largest concentrations of material. A dedicated joke book, such as this Book of Humour, Wit and Wisdom, contains hundreds of viable jokes in a single package. Similarly, many Victorian newspapers carried weekly joke columns containing around 30 gags at a time – over the course of a year, a regularly printed column yields more than 1,500 jests. If we can develop an efficient way to extract jokes from these texts then we’ll have a good chance of meeting our target of 1 million gags.


Our initial searches have focused on two digital collections:

1) The 19th Century British Library Newspapers Database.

2) A collection of nineteenth-century books digitised by Microsoft.

In order to interrogate these databases we’ve compiled a continually-expanding list of search terms. Obvious keywords like ‘jokes’ and ‘jests’ have proven to be effective, but we’ve also found material using words like ‘quips’, ‘cranks’, ‘wit’, ‘fun’, ‘jingles’, ‘humour’, ‘laugh’, ‘comic’, ‘snaps’, and ‘siftings’. However, while these general search terms are useful, they don’t catch everything. Consider these peculiarly-named columns from the Hampshire Telegraph:

[Image: column headings from the Hampshire Telegraph]

At first glance, they look like recipes for buckwheat cakes – in fact, they’re columns of imported American jokes named after what was evidently considered to be a characteristically Yankee delicacy. I would never have found these columns using conventional keyword searches. Uncovering material like this is much more laborious, and requires us to manually look for peculiarly-named books and joke columns.

In the case of newspapers, this requires a bit of educated guesswork. Most joke columns appeared in popular weekly papers, or in the weekend editions of mass-market dailies, so weighty morning broadsheets like the London Times are unlikely to yield many gags. Similarly, while the placement of joke columns varied from paper to paper (and sometimes from issue to issue), they were typically placed at the back of the paper alongside children’s columns, fashion advice, recipes, and other miscellaneous tit-bits of entertainment. Finally, once a newspaper has been proven to contain one set of joke columns, the likelihood is that more will be found under other names. For example, initial keyword searches seem to suggest that the Newcastle Weekly Courant discontinued its long-running ‘American Humour’ column in 1888. In fact, the column was simply renamed ‘Yankee Snacks’ and continued to appear under this title for another eight years.

Tracking a single change of identity like this is fairly straightforward; once the new title has been identified we simply need to add it to our list of search terms. Unfortunately, the editorial whims of some newspapers are harder to follow. For example, the Hampshire Telegraph often scattered multiple joke columns throughout a single issue. To make things even more complicated, they tended to rename and reposition these columns every couple of weeks. Here’s a sample of the paper’s American humour columns, all drawn from the first 6 months of 1892:

[Image: American humour column titles from the Hampshire Telegraph, January to June 1892]
For papers like this, the only option is to manually locate jokes columns one at a time. In other words, while our initial set of core keywords should enable us to find and extract thousands of joke columns fairly quickly, more nuanced (and more laborious) methods will be required in order to get the rest.
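To give a flavour of that first, keyword-led pass, here is a minimal sketch that scans plain-text OCR pages for known column titles. The directory layout and the (much abridged) term list are illustrative rather than our production setup.

```python
# A minimal sketch of the keyword-led pass: scan plain-text OCR pages for
# known joke-column titles. The corpus directory and the abridged term
# list are illustrative, not our production setup.
import os
import re

COLUMN_TITLES = [
    "jokes", "jests", "quips", "wit", "humour",
    "buckwheat cakes", "yankee snacks", "american humour",
]
pattern = re.compile("|".join(re.escape(t) for t in COLUMN_TITLES),
                     re.IGNORECASE)

for dirpath, _dirnames, filenames in os.walk("ocr_pages"):  # hypothetical
    for name in filenames:
        path = os.path.join(dirpath, name)
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line_number, line in enumerate(f, start=1):
                if pattern.search(line):
                    print(f"{path}:{line_number}: {line.strip()}")
```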

It’s important to stress that jokes were not always printed in organised collections. Some newspapers mixed humour with other pieces of entertaining miscellany under titles such as ‘Varieties’ or ‘Our Carpet Bag’. The same is true of books, which often combined jokes with short stories, comic songs, and material for parlour games. While it’s fairly easy to find these collections, recognising and filtering out the jokes is more problematic. As our project develops, we’d like to experiment with some kind of joke-detection tool that picks out content with similar formatting and linguistic characteristics to the jokes we’ve already found. For example, conversational jokes usually have capitalised names (or pronouns) followed by a colon and, in some cases, include a descriptive phrase enclosed in brackets. So, if a text includes strings of characters like “Jack (…):” or “She (…):” then there’s a good chance that it might be a joke. Similarly, many jokes begin with a capitalised title followed by a full-stop and a hyphen, and end with an italicised attribution. Here’s a characteristic example of all three trends in action:

[Image: a joke showing the three formatting conventions described above]

Unfortunately, conventional search interfaces aren’t designed to recognise nuances in punctuation, so we’ll need to build something ourselves. For now, we’ve chosen to focus our efforts on harvesting the low-hanging fruit found in clearly defined collections of jokes.
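Still, to give a sense of what such a joke-detector might eventually look like, here is a first, very rough sketch that scores a snippet on the three typographic cues described above. The regular expressions, weights and sample snippet are illustrative guesses, not a tuned model.

```python
# A first gesture towards a joke-detector: count how many of the three
# typographic cues fire on a snippet. The regexes and the invented sample
# are illustrative guesses, not a tuned production model.
import re

CUES = [
    # 'Jack (wearily):' or 'She:' - a capitalised speaker, an optional
    # bracketed stage direction, then a colon.
    re.compile(r"\b[A-Z][a-z]+\s*(?:\([^)]{0,60}\))?\s*:"),
    # 'A QUIET REBUKE.-' - a capitalised title, a full stop, a hyphen.
    re.compile(r"\b[A-Z][A-Z' ]{2,}\.\s*[-—]"),
    # '-Punch.' - a trailing attribution (italics are lost in OCR, but
    # the dash-plus-source-name shape often survives).
    re.compile(r"[-—]\s*[A-Z][A-Za-z.' ]{1,30}\.?\s*$"),
]

def joke_score(snippet: str) -> int:
    """Count how many of the three cues fire on a snippet (0-3)."""
    return sum(1 for cue in CUES if cue.search(snippet))

sample = 'A FAIR EXCHANGE.- Jack (gloomily): "I have lost my umbrella." -Punch.'
print(joke_score(sample))  # 3: all three cues fire on this invented example
```

A real detector would of course need tuning against OCR noise and testing against jokes we have already verified by hand.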

The project is still in the pilot stage, but we’ve already identified the locations of more than 100,000 jokes. This is more than enough for our current purposes, but I hope we’ll be able to push onwards towards a million as the project expands. The most effective way to do this may well be to harness the power of crowdsourcing and invite users of the database to help us uncover new sources. It’s clear from our initial efforts that a fully-automated approach won’t be effective. Finding and extracting large quantities of jokes – or, indeed, any specific type of content – from among the millions of pages of books and newspapers held in the library’s collection requires a combination of computer-based searching and human intervention. If we can bring more people on board we’ll be able to find and process the jokes much faster.

Finding gags is just the first step. In the next blog post I’ll explain how we’re extracting joke columns from the library’s digital collections, importing them into our own database, and transcribing their contents. Stay tuned!

 

01 September 2014

Wikimania and UK Wikimedian of the Year 2014 Awards

This year it was very exciting that Wikimania 2014, the official annual conference of the Wikimedia Foundation, was held in the UK for the first time. It was a wonderful opportunity to catch up with old friends, such as the Library’s previous Wikipedian-in-Residence Andrew Gray. I also met interesting new folk, many from other libraries and cultural heritage institutions around the world, as there was a whole strand of the programme devoted to the GLAM sector.

Wikimedia UK used the conference closing ceremony for Jimmy Wales to announce the winners of the UK Wikimedian of the Year 2014 awards. The main award went to Ed Saperia for his hard work in organising Wikimania 2014. GLAM of the Year went to our friends (and my old colleagues) at the National Library of Scotland, and Educational Institution of the Year to the University of Portsmouth, with Professor Humphrey Southall, who collaborates with the British Library on the successful Georeferencer project, collecting their award. It was also very pleasing that an Honourable Mention was given to Andy Mabbett, also known as Pigsonthewing, who started the Voice Intro Project; and, last but definitely not least, the British Library received an Honourable Mention for the Mechanical Curator and Flickr Commons image release. Ben O’Steen from British Library Labs, who created the Mechanical Curator, received the award on behalf of the Library and had the privilege of shaking Jimmy Wales’ hand on stage.


Wikimedia UK 2014 Award Winners, including Ben O’Steen from the British Library

Plans are now under way for next year’s Wikimania in Mexico City, which will take place 15-19 July 2015, for the first time in a library: la Biblioteca Vasconcelos (Vasconcelos Library), also known as the Megabiblioteca (“megalibrary”). From looking at photos I can see why it has this nickname! The building also houses a huge whale sculpture by Gabriel Orozco in its centre; for more info on how this was created and assembled, check out this Tate blog post.

 

Stella Wisdom

Curator, Digital Research

@miss_wisdom