THE BRITISH LIBRARY

Digital scholarship blog

4 posts from June 2013

26 June 2013

Why we are big on (big)data

Big data. Yes, that again. It is tempting to think we in the humanities are a bit behind the times in getting excited right now about using large datasets to answer research questions. After all, our colleagues in the sciences have been using big(ger) data for ages, and economic-minded social historians were doing quantitative work with lots of stuff in the 60s and 70s. Yet I think the key is that we humanists mean something different by big data: collections of stuff too big to deal with either on our own (that is, without computers for both processing and memorising) or without changing the ways in which we do what we do. Our big data, then, is not the same size as the big data produced by the LHC, but I suspect [citation needed!] that the shift in scale in each field is equivalently big.

So big data – or, to paraquote Torsten Reimer, "thinking big with data" – has lurched to the forefront of humanistic inquiry. There are dissenters. Some just don't like the buzziness of the term. Others see big data digital humanities (DH) as a by-product not of good research but of the managerial humanities, of a chase for money as opposed to a chase for answers. The latter I find especially troubling. Arts and humanities funders have latched onto big data DH not (IMO) because it offers results, outputs, stuff, but because it offers a genuine opportunity to do things with stuff differently, to not just ask a slightly different set of questions but to consider a different way of formulating those questions in the first place.

Of course such semantic tricks are nothing new either. Novelty will always claim the earth (especially when there are juicy funding grants involved). But believe it or not big data may actually be different, for it promises like nothing before a genuine cross-disciplinary endeavour. There is a simple reason for this: we can't be at the same time everyday historians, historians of the big picture, historians of the small picture, quantitative historians, qualitative historians, programming historians, statisticians with an eye for history, and programmers with a sense of how to interrogate past phenomena (I’ll spare you the full list of possible permutations, skillsets, et cetera). But we can build teams of people who as a sum of their parts approach just that (or even a sum of collective parts we have yet to imagine). And with genuine teams comes a potential for work with genuine novelty.

A scientist friend said to me recently that 'what you guys consider a thank you in a footnote we call a third author'. Those 'thank yous' often go to people we consider part of our team: colleagues, PhD students, RAs, postdocs. But those people are rarely considered co-authors. Big data changes that. If I am unable to work alone to 'read' the data, neither can I work alone to interpret the data nor to publish my results from that data. And as my co-authors will have to be those from the fringes of my discipline and beyond in order for us to get a handle on the big data, the decision to use big data will in turn disrupt what it means to be a historian, literary scholar et al, and what it means to produce outputs in those fields.

Having evangelised, you won't be surprised to learn that I write this whilst returning from a workshop on big data in the arts and humanities (our slides on British Library digital content here). I don't think it is letting the cat out of the bag too much to say that at that event the organisers, the AHRC, confirmed that next week they will be announcing a major funding call centring on big data. From the discussions today with researchers and colleagues in the sector, I can't wait to see how this investment changes humanistic inquiry.

@j_w_baker

21 June 2013

Seeking trends in article titles

Metadata can offer an interesting perspective on what has been published. And as you might expect, at the British Library we hold plenty of metadata on work published over the last 40-50 years. Included within this is metadata for journal articles.

Inspired by Ben Schmidt’s recent work on trends within academia (in this case, theses), I set about looking for trends within a set of metadata for Paleontology journal articles (circa 20,000 between 1991 and 2011) which we share openly (see link above). As Franco Moretti argues (Critical Inquiry, 2009), titles of works are powerful things: offering (often simultaneously) summaries, puffs and descriptions of the content within. This holds true for journal article titles, the standard function of which is to entice the reader, to demonstrate to the reader why they might wish to read the content within. (As a related aside, I wonder whether [citation needed!] article titles from a pre-search term era display different trends to those in a post-search term era: for I’d expect that in an age of Google, discoverability via search is a key component in the construction of article titles.)

I decided to see if article titles differed between the journals they were published in, and began this by making a list of top keywords in those article titles as a means of rationalising them. The top 25 words (filtered for stop words), with the number of titles' worth of occurrences for each, are as follows:

new: 2,354    late: 2,275    early: 1,508    formation: 1,325

upper: 1,226    basin: 1,213    cretaceous: 1,185    china: 1,136

middle: 1,131    lower: 1,087    southern: 1,060    implications: 954

central: 855    south: 829    northern: 825    miocene: 799

jurassic: 793    western: 737    triassic: 732    evolution: 717

evidence: 683    record: 677    fossil: 675    species: 666

ordovician: 657

From this list I decided to exclude non-specific adjectives relating to time and geography (perhaps in doing so showing my ignorance toward my data, more on which later), and so ended up with the following top 10 list:

formation    basin    cretaceous

china    implications    miocene

jurassic    triassic    evolution

evidence
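For anyone wanting to repeat the exercise, the ranking and filtering that produced the two lists above can be sketched roughly as follows. This is a minimal illustration rather than the script actually used: the sample titles and the tiny stop word list are hypothetical, and the set of excluded adjectives is inferred by comparing the top 25 list with the top 10 list.

```python
import re
from collections import Counter

# Hypothetical sample titles; the real dataset holds roughly 20,000.
titles = [
    "A new species from the Late Cretaceous of southern China",
    "New evidence for the evolution of Early Jurassic ammonites",
    "The Late Cretaceous basin record of northern China",
]

# Tiny illustrative stop word list; a real run would need a fuller one.
STOP_WORDS = {"a", "the", "of", "from", "for", "and", "in", "on"}

# Time and geography adjectives dropped by hand (inferred from the
# difference between the top 25 and top 10 lists in the post).
EXCLUDED = {
    "new", "late", "early", "upper", "middle", "lower",
    "southern", "central", "south", "northern", "western",
}


def tokenize(title):
    """Lowercase a title and split it into alphabetic words."""
    return re.findall(r"[a-z]+", title.lower())


def top_keywords(titles, n, exclude=frozenset()):
    """Return the n most frequent words, skipping stop words and exclusions."""
    counts = Counter(
        word
        for title in titles
        for word in tokenize(title)
        if word not in STOP_WORDS and word not in exclude
    )
    return counts.most_common(n)


# Top words before and after the manual exclusions.
print(top_keywords(titles, 4))
print(top_keywords(titles, 2, EXCLUDED))
```

Keeping the exclusions in a separate set from the stop words makes the manual judgement call visible, and easy to revisit once the data is better understood.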

At this point I could have counted the occurrences of these words in the article titles published in each journal. Instead I chose to process the data in Gephi. Though aware of the reservations around the readability of network graphs, having had some success in the past with using Gephi to get a sense of data I gave it a go.

To get the data into Gephi, I had to convert it into ‘nodes’ and ‘edges’ (data: nodes, edges). Both the article title keywords and the journal titles were mapped as nodes (the round blobs in the image below), with an edge (a line connecting two nodes) encoded for every occurrence of a keyword within a journal (with the force binding the network directed from the keyword to the journal). In short, the bigger the node the more occurrences, the bigger the edge the more connections. I then applied a Force Atlas algorithm to the data, processed the data for groupings, and pressed go. After some manual adjustments for legibility (the Gephi project can be downloaded here if you are interested in the nuances of my logic), I ended up with this: (full size png)

 

Paleontology top ten article keywords against journal titles (1992-2011), Force Atlas network built using Gephi (CC BY 2.0 UK)
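The conversion of keyword occurrences into nodes and edges described above might look something like the sketch below. It is an illustration under assumptions, not the process actually followed: the sample records are hypothetical, the "Kind" column is an invented extra attribute, and the Id/Label and Source/Target/Type/Weight headers reflect my understanding of what Gephi's spreadsheet import expects, so check them against your version of Gephi.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical records: (journal title, article title).
articles = [
    ("Lethaia", "Evidence for evolution in a Jurassic basin"),
    ("Lethaia", "A Triassic formation revisited"),
    ("Lethaia", "Formation of a Miocene basin"),
    ("Palaeontology", "The Cretaceous formation of northern China"),
]

# The top 10 keywords listed in the post.
KEYWORDS = {
    "formation", "basin", "cretaceous", "china", "implications",
    "miocene", "jurassic", "triassic", "evolution", "evidence",
}


def build_edges(articles, keywords):
    """Count keyword occurrences per journal: (keyword, journal) -> weight."""
    edges = Counter()
    for journal, title in articles:
        for word in re.findall(r"[a-z]+", title.lower()):
            if word in keywords:
                edges[(word, journal)] += 1
    return edges


def build_nodes(edges):
    """Every keyword and journal that appears in an edge becomes a node."""
    nodes = {}
    for keyword, journal in edges:
        nodes[keyword] = "keyword"
        nodes[journal] = "journal"
    return nodes


def to_csv(header, rows):
    """Render a header and rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


edges = build_edges(articles, KEYWORDS)
nodes = build_nodes(edges)

# Edges run from keyword to journal, weighted by number of occurrences.
nodes_csv = to_csv(
    ["Id", "Label", "Kind"],
    [(name, name, kind) for name, kind in nodes.items()],
)
edges_csv = to_csv(
    ["Source", "Target", "Type", "Weight"],
    [(kw, journal, "Directed", weight) for (kw, journal), weight in edges.items()],
)
print(nodes_csv)
print(edges_csv)
```

The edge weights here are raw counts, which is why well-represented journals would be pulled toward the centre of a force-directed layout; dividing each weight by the number of titles in the journal would be one way to correct for that.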


What does it tell us? First of all, and unsurprisingly, the top keyword - ‘formation’ - is found at the heart of the network. Similarly, with occurrences put into the network as raw data (as opposed to as a percentage of all titles in the journal), those journals better represented in the data are closer to the centre. Finally, the graph shows that the top 10 keywords are strongly related.

Frankly, I really don’t know what to make of it: I certainly don’t want to make any bold claims (such as ‘China and evidence are unrelated!’ or ‘Lethaia doesn’t accept articles about China!’). What I will say is that I do know that if I knew more about the data (Know Your Big Data!), had used a larger number of keywords and had introduced time into the proceedings (change over time being, of course, the historian’s bread and butter) I might be able to start seeing some interesting trends.

And that is precisely what I am going to do. Shortly, I will be taking delivery of a similar dataset for journals published in the field of History (which I know well) and will be repeating the exercise (with the tweaks mentioned above). If you can think of anything in particular you’d like me to look out for or want to comment on what I’ve done thus far, please let me know.

James Baker

@j_w_baker

13 June 2013

First World War Workshop – 29 June 2013

As part of our digitisation programme in partnership with Europeana Collections 1914 – 1918, the British Library is organising a workshop on Saturday, 29 June using content digitised for the programme, most of which is not yet available online. The collection comprises a myriad of items including maps, manuscripts, music sheets, photographs, official publications, trench journals, pamphlets, ephemera and a large amount of rare and out-of-print material. Metadata for all items in the collection will be available for download on the day.

The aim of the workshop is to bring together a variety of users, including historians, genealogists, hackers and others, who are interested in exploring how the content can be used and presented in innovative ways. This includes using it in connection with existing open tools, such as geo-referencing platforms and OCR scripts, to extract information from the content and link the material to other similar resources. More information on the workshop, including the description of the collection and programme for the day, can be accessed via http://bit.ly/11fs9RX

The workshop will run from 10:30 to 16:30 at the British Library Foyle Suite. Those who want to attend the day need to register at http://www.eventbrite.co.uk/event/7052374843#

We look forward to seeing you there.

Girdwood Collection, Territorials of the Seaforth Highlanders in a trench, 5 August (Photo_24-248)

12 June 2013

A page, but not as we know it

It is commonplace to describe something new in relation to something that is known: think 'motion picture', 'spaceship', 'email' or 'smartphone'. The word 'webpage' is no different. And indeed in a sense many webpages are similar to the pages found in books or newspapers: they hold static media (text, image); core elements of them read from top to bottom; their headers, footers, cut-aways and advertisements orientate, guide and entice the reader; and in URLs they possess a (relatively) unique system of identifiers. It is hard to think of another name these digital objects could have been given.

It is also commonplace for the new thing to - linguistically speaking - replace the old thing: think 'motion picture' and 'the pictures', 'spaceship' and 'ship', 'email' and 'mail', or 'smartphone' and 'phone'. The same goes for 'webpage' and 'page'. Here by virtue of this act of redefinition, the 'page' absorbs features of the webpage not (or less) possible in book or newspaper pages: features such as dynamic content, user interaction, and direct links to other pages (or, more precisely, other pages that are not part of a sequence defined by the author whose work is the main content held by the page).

Aaron Swartz memorial at Internet Archive in San Francisco photograph courtesy of Flickr user Steve Rhodes / Creative Commons Licensed

All of this makes the webpage-cum-page appear both familiar and unsettling, conservative and disruptive, old and new. These elements of lineage are crucial, for they have allowed us (among other things) to think of preserving the webpage as akin to preserving the page. Yes the challenges of novelty and disruption are discussed and debated (on which I'm not qualified to comment), but at the most basic level the webpage stuff that is being collected by Internet Archive or the UK Web Archive is page level stuff. (This is not to say I don't think page level stuff should be archived. Far from it, the fragility of webpages is well known (see Rosenzweig, 2003) and without these efforts valuable data on our society would be lost.)

But what are these pages and how can historians use them? A seminar jointly hosted by the Digital History seminar and the Archives and Society seminar at the Institute of Historical Research sought last night to tackle this very problem, asking quite simply 'Is this a new class of primary source for historians?'. After a presentation on the UK Web Archive and the Analytical Access to the Dark Domain Archive project both the speakers and the audience were largely in agreement that yes, the web archive is a new class of primary source, of historical stuff.

Does this make our nomenclature for what this stuff is problematic? For to call a webpage a page is potentially to place it into a category for which it is ill-suited, and to put the techniques for investigating that category under huge strain. Take a normal news article from the Guardian website as an example. The page contains a story, framing, context and advertisements: all very page like. But those adverts are dynamic as opposed to static, their content quite possibly targeted depending on the IP address accessing the URL and different each time the page is refreshed. The page also contains moderated comments, ranked as default by oldest first but malleable to user preferences. In short, when you visit the website it is unlikely to be the same as when I visit the website, so an archived version can only be one possible version of a webpage at a particular historical moment. Not very page like behaviour. Of course we might (quite rightly for the most part) say that the 'core' of the page, the textual content that historians are likely to be interested in, will remain the same regardless of these peripheral changes. And yet as the growth of mainstream live blogs demonstrates (such as those covering the Taksim Square protests), the web is moving toward dynamic content over static content as default: embedded video, maps and text content streams are now commonplace, and are likely to become more so as the web develops.

The webpage then is a rapidly evolving beast whose capacity to change whilst still being called a 'page' complicates how we do research using webpages and how we preserve the internet. It is a page but not a page as we knew it, a semantic shift worth keeping in mind as we prepare for an era of born-digital historical scholarship.

@j_w_baker