08 April 2020

Born-digital Literary Archives — How We’re Capturing the Future

by Callum McKean, Curator of Contemporary Literary and Creative Archives. For more information about the challenges and opportunities posed by born-digital material elsewhere in the Library, see the Digital Scholarship Blog and the extensive work of the Library's Digital Preservation Department.

I. Capturing a Moving Target

If you’re adjusting to working from home, look around your working area: maybe you have a home office, or you’ve repurposed the dining table, or you're out in the garden. Did anyone in your team, when making the move, scramble to transport the filing cabinets or the stacks of unsorted paper that had accumulated on their desks to their new workspace? If not, this is sufficient evidence that the way a lot of us work has radically changed, a platitude that doesn’t get less true with repetition. Your computer (and, more often than not, the network) has at least partially relegated or replaced the paper in your professional life with ‘Digital Objects’. This is a useful but deceptively complex archival term, defined by the Society of American Archivists as “a unit of information that includes properties (attributes or characteristics of the object) and may also include methods (means of performing operations on the object)”. If the word ‘object’ seems ontologically insufficient to carry such a definition, with its emphasis on process, relationship, and contingency, then this is part of the problem we’re facing. (Archivists, and traditional archival methodologies, have a clear and often justified tendency to fetishise permanence and fixity.)

This shift towards the ‘digital’ is no less dramatic in the personal archives of the novelists, poets and playwrights collected by the Contemporary Archives and Manuscripts department at the Library, whose historical remit (c.1950-) traces the rapidly evolving landscape of personal computing in the latter half of the twentieth century and the explosion of the internet and social media in the twenty-first, which is by no means complete. This shifting landscape means that, more often than not, we collect ‘hybrid’ archives comprising traditional paper material and, depending on the donor’s enthusiasm for new forms of technology, a variety of digital formats, including floppy disks, CD-ROMs, spinning hard drives, USB sticks, and even laptop computers. Creative writers are, much more so than institutions, academics and scientists, given to superstition and mysticism regarding the tools of their trade. Most are methodologically conservative and eager to link their ability to produce work to their idiosyncratic habits and tools. (On this blog, Chris Beckett’s discussion of Will Self’s use of post-it notes and typewriters in one of our most significant hybrid archives is an excellent case study of how complex these interrelationships can become.)


Photograph of Amsoft branded floppy disk from the Archive of Wendy Cope

A double-sided Amsoft branded Compact floppy disk from the Archive of poet Wendy Cope, dated 1989.

What becomes apparent when attempting to capture, preserve, arrange and fix these ‘Digital Objects’ is that an undeniable materiality, often partially erased by the term ‘digital’, is fundamental to their structure. The history of computing is a history of design miracles, both technical and aesthetic. Recovering a long-forgotten WordPerfect file from an Amstrad floppy disk is an archaeological task, demanding attention to the structure and format of the data, its physical housing, and the software-codex able to make sense of it. Like a dig, it requires sensitive excavation equipment capable of moving the object without altering or destroying it. Similarly, capturing a hard drive demands knowledge of how it reads, writes and stores data mechanically, so that when we act upon it we capture everything (including, interestingly, apparently empty space) and disturb as little as possible. (In a strange turn, archivists have learned to use software and hardware first developed by law enforcement for these and other tasks.)


Photograph of KryoFlux machine used for magnetically reading legacy floppy disks

A KryoFlux machine reads a 3.5” floppy disk by sampling the magnetic flux transitions recorded on its surface, aiming for a complete capture where possible and often helping us to recover partially corrupted legacy data.

II. Representing exchange

Next to draft material, correspondence is another major component of the traditional literary archive. The movement from paper to digital has been just as pronounced in this area, with e-mail becoming the dominant mode of communication for the vast majority of our donors. Unsurprisingly, the collection of e-mail archives presents its own challenges, both technical and curatorial. In much the same way that a letter might come to us within an envelope, an e-mail message is held within a machine-readable envelope, from which it is possible to glean similar kinds of data about the sender, the receiver, and the path along which the message travelled on its journey from one to the other. All of this data must be preserved in order to retain the integrity of the archival collection, but much of it must also be withheld from public access for a significant period of time in order to comply with legal restrictions relating to the use of personal data.

Photograph of letter addressed to BS Johnson from Samuel Beckett

Screenshot of e-mail metadata

A side-by-side metadata comparison of a letter and an e-mail. The envelope sent by Samuel Beckett to B.S. Johnson contains critical metadata about dates and receiver, as well as about the French post offices through which the letter travelled before reaching Johnson. The e-mail metadata contains much of the same information (highlighted) in a machine-readable format.
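The machine-readable envelope described in this comparison can be read with very little code. The following is a minimal sketch using Python's standard `email` package, not a description of the Library's own tooling; the sample message, its addresses and its host names are all invented for illustration.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical raw message: every name, address and host below is invented.
RAW_MESSAGE = """\
Received: from mail.example.org (mail.example.org [192.0.2.10])
    by mx.example.net with ESMTP; Tue, 03 Mar 2009 10:15:02 +0000
Received: from [192.0.2.55] (client.example.org [192.0.2.55])
    by mail.example.org with ESMTP; Tue, 03 Mar 2009 10:15:00 +0000
From: Example Poet <poet@example.org>
To: Example Editor <editor@example.net>
Date: Tue, 03 Mar 2009 10:14:58 +0000
Subject: Draft revisions
Message-ID: <0001@example.org>

Please find my revisions attached.
"""


def envelope_metadata(raw):
    """Return sender, recipients, sent date and route from a raw message."""
    msg = message_from_string(raw)
    sender = getaddresses(msg.get_all("From", []))
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    sent = parsedate_to_datetime(msg["Date"])
    # Each server adds its Received header at the top of the message, so
    # reversing the list gives the hops in the order the message travelled.
    # Folded header lines are collapsed onto one line for readability.
    route = [" ".join(hop.split()) for hop in reversed(msg.get_all("Received", []))]
    return {"sender": sender, "recipients": recipients, "sent": sent, "route": route}


metadata = envelope_metadata(RAW_MESSAGE)
print(metadata["sender"], metadata["sent"].isoformat())
for hop in metadata["route"]:
    print(hop)
```

The `Received` lines play the role of the postmarks on the Beckett envelope: each one records a server the message passed through and when.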

As well as these technical challenges, the preservation of and access provision for e-mail archives must take into account e-mail's threaded nature: it's a conversation, and so is not particularly amenable to the archival logic of ‘deliverable units’ which guides our approach to paper manuscripts. Additionally, any robust archival process must consider e-mail's increased tendency to include rich media, including attachments such as word-processing files, images and sometimes even audiovisual material. The scale of the challenge for collecting institutions is huge. The largest e-mail archive held at the Library, that of the poet Wendy Cope, comprises around 25,000 individual messages and contains everything from family correspondence and professional booking requests to draft revisions and shopping lists. Making sure that this material complies with data protection regulation in the UK before it is released is obviously a considerable task. Fortunately, software tools like ePADD, an e-mail archiving tool developed at Stanford University, exist to alleviate some of the issues, allowing us to filter and process messages more efficiently.

Screenshot taken from Stanford University's ePADD E-mail Archiving Project

ePADD’s user-friendly interface allows curators to filter messages by correspondent and attachment, and to assign user-generated labels.
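To give a sense of what filtering by correspondent and attachment involves, here is a small sketch using only Python's standard library. It does not reproduce how ePADD works internally; the helper names (`make_message`, `group_by_correspondent`, `has_attachment`) and all of the sample messages are hypothetical.

```python
from collections import defaultdict
from email.message import EmailMessage
from email.utils import parseaddr


def make_message(sender, subject, attachment=None):
    """Build an invented sample message, optionally with one attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = "donor@example.org"
    msg["Subject"] = subject
    msg.set_content("Body text.")
    if attachment is not None:
        msg.add_attachment(
            b"placeholder bytes",
            maintype="application",
            subtype="octet-stream",
            filename=attachment,
        )
    return msg


def group_by_correspondent(messages):
    """Map each sender address (lower-cased) to the messages it sent."""
    groups = defaultdict(list)
    for msg in messages:
        _, address = parseaddr(str(msg["From"]))
        groups[address.lower()].append(msg)
    return dict(groups)


def has_attachment(msg):
    """Report whether the message carries at least one attachment."""
    return any(True for _ in msg.iter_attachments())


sample = [
    make_message("Family Member <family@example.org>", "Sunday lunch"),
    make_message("Festival Office <bookings@example.com>", "Reading request",
                 attachment="contract.doc"),
    make_message("Family Member <FAMILY@example.org>", "Shopping list"),
]
for address, msgs in group_by_correspondent(sample).items():
    print(address, len(msgs), [has_attachment(m) for m in msgs])
```

Grouping on the lower-cased address rather than the display name is a deliberate simplification: real correspondents often write from several addresses, which is one reason a curated tool is preferable to a script at this scale.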

III. Managing Scale

Scale is a double-edged sword for born-digital literary archives. The growing size of these collections undoubtedly renders some established archival cataloguing techniques inadequate. Equally, as the kinds of media stored on consumer-level storage devices become more complex, traditional techniques for information organisation and control become either too labour-intensive or impossible to adapt to this new context. Nevertheless, the scale of structured metadata available for these new kinds of collection items allows us to explore new techniques for data visualisation and ‘enhanced curation’ in ways that would be impossible for more traditional archival collections.


Bar chart showing file type distribution in a born digital archive

Screenshot showing enhanced metadata for a born digital archive

Examples of how structured metadata can allow us to visualise and compile data in interesting ways using programming languages such as Python. The bar graph shows a time distribution for files in the Virago archive, with the 24-hour clock on the x-axis and number of files on the y-axis. The text describes some statistics and metrics for the same archive.
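A time distribution of this kind can be compiled from file timestamps alone. The sketch below is an illustration under stated assumptions rather than the script used for the Virago archive: the function names are hypothetical, hours are counted in UTC, and the optional `scan` helper reads modification times from a directory on disk, which may not match the timestamps recorded in an original disk image.

```python
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path


def hour_distribution(timestamps):
    """Count POSIX timestamps by hour of day (0 to 23), in UTC."""
    counts = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in timestamps
    )
    return [counts.get(hour, 0) for hour in range(24)]


def extension_distribution(names):
    """Count file names by lower-cased extension."""
    return Counter(Path(name).suffix.lower() or "(none)" for name in names)


def text_bar_chart(counts):
    """Render one line per hour: the hour, a bar of '#' marks, the count."""
    return "\n".join(
        f"{hour:02d} {'#' * n} {n}" for hour, n in enumerate(counts)
    )


def scan(root):
    """Walk a directory and return both distributions for its files."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    hours = hour_distribution(p.stat().st_mtime for p in files)
    types = extension_distribution(p.name for p in files)
    return hours, types


# Invented timestamps: one file at midnight, two at 09:00, one at 23:00.
example = [0, 9 * 3600, 9 * 3600 + 120, 23 * 3600]
print(text_bar_chart(hour_distribution(example)))
print(extension_distribution(["draft.DOC", "notes.txt", "README", "cover.doc"]))
```

The same counts could be passed to a plotting library to produce a bar graph; a text chart is used here only to keep the example free of dependencies.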


What next?

Processing, preserving and providing access to born-digital literary archives is very much still an open field. The future is uncertain, but consequently still very exciting. Although there are many challenges ahead, if we are willing and able to leverage the technology, there are innumerable new discoveries to be made about the collections we hold, some of which would have been unthinkable just a short time ago. In this way, our driving motivation for born-digital material is no different from what it is for paper: to preserve, interpret and provide access to our collections for the inspiration and enjoyment of everyone.