Digital scholarship blog

21 June 2013

Seeking trends in article titles

Metadata can offer an interesting perspective on what has been published. And as you might expect, at the British Library we hold plenty of metadata on work published over the last 40-50 years. Included within this is metadata for journal articles.

Inspired by Ben Schmidt’s recent work on trends within academia (in this case, theses), I set about looking for trends within a set of metadata for Paleontology journal articles (circa 20,000 between 1991 and 2011) which we share openly (see link above). As Franco Moretti argues (Critical Inquiry, 2009), titles of works are powerful things: offering (often simultaneously) summaries, puffs and descriptions of the content within. This holds true for journal article titles, the standard function of which is to entice the reader, to demonstrate to the reader why they might wish to read the content within. (As a related aside, I wonder whether [citation needed!] article titles from a pre-search term era display different trends to those in a post-search term era: I’d expect that in an age of Google, discoverability via search is a key component in the construction of article titles.)

I decided to see if article titles differed between the journals they were published in, and began this by making a list of top keywords in those article titles as a means of rationalising them. The top 25 words (filtered for stop words) are as follows:

new (2,354)    late (2,275)    early (1,508)    formation (1,325)
upper (1,226)    basin (1,213)    cretaceous (1,185)    china (1,136)
middle (1,131)    lower (1,087)    southern (1,060)    implications (954)
central (855)    south (829)    northern (825)    miocene (799)
jurassic (793)    western (737)    triassic (732)    evolution (717)
evidence (683)    record (677)    fossil (675)    species (666)
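For anyone wanting to try a similar count, here is a minimal sketch in Python. It is not the script used for this post: the stop-word list is a deliberately tiny, hypothetical one, the sample titles are invented, and the tokenising rule (lower-case, letters only) is an assumption.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "from", "in", "of", "on", "the", "with"}


def top_keywords(titles, n=25, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop words across all titles."""
    counts = Counter()
    for title in titles:
        # Lower-case the title and keep runs of letters only.
        words = re.findall(r"[a-z]+", title.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts.most_common(n)


# Invented titles, for illustration only.
sample_titles = [
    "A new species from the Late Cretaceous of China",
    "New evidence from the Upper Cretaceous Formation",
    "Evolution of a new fossil record in the Late Jurassic",
]

for word, count in top_keywords(sample_titles, n=3):
    print(word, count)
```

Note that a letters-only tokeniser like this one would split hyphenated terms and drop accented characters, so a real dataset would probably need a more careful rule.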


From this list I decided to exclude non-specific adjectives relating to time and geography (perhaps in doing so showing my ignorance toward my data, more on which later), and so ended up with the following top 10 list:

formation    basin    cretaceous

china    implications    miocene

jurassic    triassic    evolution


At this point I could have counted the occurrences of these words in the article titles published in each journal. Instead I chose to process the data in Gephi. Though aware of the reservations around the readability of network graphs, I have had some success in the past using Gephi to get a sense of data, so I gave it a go.
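The plain counting route mentioned above could be sketched roughly as follows. This is an illustration under assumptions rather than anything run for this post: the records are invented (journal, title) pairs, "Journal A" and "Journal B" are placeholder names, and the keyword set is simply the list given above.

```python
import re
from collections import Counter

# Keywords taken from the list above.
KEYWORDS = {
    "formation", "basin", "cretaceous", "china", "implications",
    "miocene", "jurassic", "triassic", "evolution",
}


def keyword_counts_by_journal(records, keywords=KEYWORDS):
    """Count keyword occurrences in titles, keyed by (keyword, journal)."""
    counts = Counter()
    for journal, title in records:
        for word in re.findall(r"[a-z]+", title.lower()):
            if word in keywords:
                counts[(word, journal)] += 1
    return counts


# Invented (journal, title) records, for illustration only.
sample_records = [
    ("Journal A", "Evolution of a Cretaceous basin in China"),
    ("Journal A", "Implications of the Example Formation, China"),
    ("Journal B", "A Triassic formation and its implications"),
]

counts = keyword_counts_by_journal(sample_records)
for (keyword, journal), n in sorted(counts.items()):
    print(f"{journal}\t{keyword}\t{n}")
```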

To get the data into Gephi, I had to convert it into ‘nodes’ and ‘edges’ (data: nodes, edges). Both the article title keywords and the journal titles were mapped as nodes (the round blobs in the image below), with an edge (the line connecting two nodes) encoded for every occurrence of a keyword within a journal (with the force binding the network directed from the keyword to the journal). In short, the bigger the node the more occurrences, the bigger the edge the more connections. I then applied a Force Atlas algorithm to the data, processed the data for groupings, and pressed go. After some manual adjustments for legibility (the Gephi project can be downloaded here if you are interested in the nuances of my logic), what I ended up with is this: (full size png)


Paleontology top ten article keywords against journal titles (1992-2011), Force Atlas network built using Gephi (CC BY 2.0 UK)
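The conversion into nodes and edges described above might be sketched like this in Python. It is a guess at one workable shape, not the files used for this graph: it assumes per-journal keyword counts are already to hand, it collapses repeated occurrences into a single weighted edge rather than writing one edge per occurrence, and the `Kind` column is a hypothetical extra attribute. The `Id`/`Label` and `Source`/`Target`/`Type`/`Weight` headers follow the column names Gephi's spreadsheet import looks for.

```python
import csv
import io


def build_gephi_tables(counts):
    """Turn {(keyword, journal): n} into Gephi-style node and edge rows."""
    keywords = sorted({keyword for keyword, _ in counts})
    journals = sorted({journal for _, journal in counts})

    # One node per keyword and one per journal; "Kind" is an illustrative extra column.
    nodes = [("Id", "Label", "Kind")]
    nodes += [(k, k, "keyword") for k in keywords]
    nodes += [(j, j, "journal") for j in journals]

    # One directed, weighted edge from each keyword to each journal it appears in.
    edges = [("Source", "Target", "Type", "Weight")]
    edges += [(k, j, "Directed", n) for (k, j), n in sorted(counts.items())]
    return nodes, edges


def to_csv(rows):
    """Serialise rows to CSV text (write this string to a file to import it)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Invented counts with placeholder journal names, for illustration only.
sample_counts = {
    ("china", "Journal A"): 2,
    ("formation", "Journal A"): 1,
    ("formation", "Journal B"): 1,
}

nodes, edges = build_gephi_tables(sample_counts)
print(to_csv(nodes))
print(to_csv(edges))
```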

What does it tell us? First of all, and unsurprisingly, the top keyword - ‘formation’ - is found at the heart of the network. Similarly, with occurrences put into the network as raw data (as opposed to as a percentage of all titles in the journal), those journals better represented in the data are closer to the centre. Finally, the graph shows that the top 10 keywords are strongly related.
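The alternative to raw counts, a percentage of all titles in each journal, could be computed along these lines. Again this is only a sketch: the numbers are invented, the journal names are placeholders, and it divides keyword occurrences by title counts, which slightly overstates the share whenever a keyword appears twice in one title.

```python
def share_of_titles(keyword_counts, titles_per_journal):
    """Convert raw (keyword, journal) counts to a percentage of each journal's titles."""
    return {
        (keyword, journal): 100.0 * n / titles_per_journal[journal]
        for (keyword, journal), n in keyword_counts.items()
    }


# Invented figures: the same raw count in a large and a small journal.
raw_counts = {("formation", "Journal A"): 30, ("formation", "Journal B"): 30}
titles_per_journal = {"Journal A": 1000, "Journal B": 100}

shares = share_of_titles(raw_counts, titles_per_journal)
for (keyword, journal), pct in sorted(shares.items()):
    print(f"{journal}: {keyword} in {pct:.1f}% of titles")
```

With weights like these, a small journal that leans heavily on a keyword would no longer be outweighed by a large journal that merely publishes more.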

Frankly, I really don’t know what to make of it: I certainly don’t want to make any bold claims (such as ‘China and evidence are unrelated!’ or ‘Lethaia doesn’t accept articles about China!’). What I will say is that if I knew more about the data (Know Your Big Data!), had used a larger number of keywords and had introduced time into the proceedings (change over time being, of course, the historian’s bread and butter), I might be able to start seeing some interesting trends.

And that is precisely what I am going to do. Shortly, I will be taking delivery of a similar dataset for journals published in the field of History (which I know well) and will be repeating the exercise (with the tweaks mentioned above). If you can think of anything in particular you’d like me to look out for or want to comment on what I’ve done thus far, please let me know.

James Baker


