21 June 2013
Seeking trends in article titles
Metadata can offer an interesting perspective on what has been published. And as you might expect, at the British Library we hold plenty of metadata on work published over the last 40-50 years. Included within this is metadata for journal articles.
Inspired by Ben Schmidt’s recent work on trends within academia (in this case, theses), I set about looking for trends within a set of metadata for Paleontology journal articles (circa 20,000 between 1991 and 2011) which we share openly (see link above). As Franco Moretti argues (Critical Inquiry, 2009), titles of works are powerful things: offering (often simultaneously) summaries, puffs and descriptions of the content within. This holds true for journal article titles, the standard function of which is to entice the reader, to demonstrate to the reader why they might wish to read the content within. (As a related aside, I wonder whether [citation needed!] article titles from a pre-search term era display different trends to those in a post-search term era: for I’d expect that in an age of Google, discoverability via search is a key component in the construction of article titles.)
I decided to see if article titles differed between the journals they were published in, and began this by making a list of top keywords in those article titles as a means of rationalising them. The top 25 words (filtered for stop words) are as follows:
new: 2,354 | late: 2,275 | early: 1,508 | formation: 1,325
upper: 1,226 | basin: 1,213 | cretaceous: 1,185 | china: 1,136
middle: 1,131 | lower: 1,087 | southern: 1,060 | implications: 954
central: 855 | south: 829 | northern: 825 | miocene: 799
jurassic: 793 | western: 737 | triassic: 732 | evolution: 717
evidence: 683 | record: 677 | fossil: 675 | species: 666
From this list I decided to exclude non-specific adjectives relating to time and geography (perhaps in doing so showing my ignorance toward my data, more on which later), and so ended up with the following top 10 list:
formation basin cretaceous
china implications miocene
jurassic triassic evolution
evidence
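For anyone wanting to try a similar count, here is a minimal Python sketch of the two steps above: tallying title words with stop words removed, then dropping the non-specific adjectives. It is not the script used for this post. The stop-word list is a tiny stand-in, the `EXCLUDED` set is my reading of which words were dropped from the top 25, the function name `top_keywords` is hypothetical, and the sample titles are invented purely for illustration.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "from", "on", "for", "to", "with"}

# Assumption: the non-specific time and geography adjectives dropped by hand.
EXCLUDED = {"new", "late", "early", "upper", "middle", "lower",
            "southern", "central", "south", "northern", "western"}


def top_keywords(titles, n, exclude=frozenset()):
    """Return the n most frequent title words, skipping stop words and `exclude`."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z]+", title.lower())
        counts.update(w for w in words if w not in STOP_WORDS and w not in exclude)
    return counts.most_common(n)


# Invented titles, for illustration only.
sample_titles = [
    "A new theropod from the Late Cretaceous of China",
    "Evolution of the Cretaceous basin in southern China",
    "New evidence from the Jurassic formation",
]

if __name__ == "__main__":
    print(top_keywords(sample_titles, 3))            # stop words removed only
    print(top_keywords(sample_titles, 3, EXCLUDED))  # adjectives removed as well
```

Keeping the exclusion list separate from the stop words makes it easy to rerun the count with and without the hand-picked exclusions and see how much the ranking shifts.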
At this point I could have counted the occurrences of these words in the article titles published in each journal. Instead I chose to process the data in Gephi. Though aware of the reservations around the readability of network graphs, having had some success in the past with using Gephi to get a sense of data I gave it a go.
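The plain count I passed over would look something like the sketch below. It assumes the metadata has already been reduced to (journal, title) pairs; the function name `keyword_counts_by_journal`, the journal names and the titles are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def keyword_counts_by_journal(records, keywords):
    """Count keyword occurrences in titles, grouped by journal.

    records: iterable of (journal, title) pairs.
    A keyword appearing twice in one title is counted twice.
    """
    table = defaultdict(Counter)
    for journal, title in records:
        for word in re.findall(r"[a-z]+", title.lower()):
            if word in keywords:
                table[journal][word] += 1
    return table


# Invented records, for illustration only.
sample_records = [
    ("Journal A", "Evolution of a Jurassic basin"),
    ("Journal A", "A Jurassic formation"),
    ("Journal B", "Cretaceous formation of China"),
]
sample_keywords = {"formation", "basin", "cretaceous", "china", "jurassic", "evolution"}

if __name__ == "__main__":
    for journal, counts in keyword_counts_by_journal(sample_records, sample_keywords).items():
        print(journal, dict(counts))
```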
To get the data into Gephi, I had to convert it into ‘nodes’ and ‘edges’ (data: nodes, edges). Both the article title keywords and the journal titles were mapped as nodes (the round blobs in the image below), with an edge (the lines connecting the nodes) encoded for every occurrence of a keyword within a journal (with the force binding the network directed from the keyword to the journal). In short, the bigger the node the more occurrences, the bigger the edge the more connections. I then applied a Force Atlas algorithm to the data, processed the data for groupings, and pressed go. After some manual adjustments for legibility (the Gephi project can be downloaded here if you are interested in the nuances of my logic), I ended up with this: (full size png)
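As a rough sketch of that conversion step, the Python below turns per-journal keyword counts into node and edge tables using the column names Gephi's spreadsheet import recognises (Id and Label for nodes; Source, Target, Type and Weight for edges). It is an illustration rather than the process behind the files linked above: it folds repeated occurrences into a single weighted edge instead of writing one edge per occurrence, the `Kind` column and both function names are my own hypothetical additions, and it assumes no journal shares a name with a keyword.

```python
import csv
import io


def build_graph_rows(counts):
    """counts: {journal: {keyword: occurrences}} -> (node_rows, edge_rows)."""
    kinds = {}
    edge_rows = []
    for journal, keyword_counts in counts.items():
        kinds[journal] = "journal"
        for keyword, n in keyword_counts.items():
            kinds[keyword] = "keyword"
            # Edges run from the keyword to the journal, as described above.
            edge_rows.append({"Source": keyword, "Target": journal,
                              "Type": "Directed", "Weight": n})
    node_rows = [{"Id": name, "Label": name, "Kind": kind}
                 for name, kind in kinds.items()]
    return node_rows, edge_rows


def to_csv(rows):
    """Serialise a non-empty list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented counts, for illustration only.
sample_counts = {"Journal A": {"jurassic": 2, "basin": 1},
                 "Journal B": {"jurassic": 1}}

if __name__ == "__main__":
    nodes, edges = build_graph_rows(sample_counts)
    print(to_csv(nodes))
    print(to_csv(edges))
```

Saving the two strings as separate files gives a nodes table and an edges table that can be imported one after the other.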
What does it tell us? First of all, and unsurprisingly, the top keyword - ‘formation’ - is found at the heart of the network. Similarly, with occurrences put into the network as raw data (as opposed to as a percentage of all titles in the journal), those journals better represented in the data are closer to the centre. Finally, the graph shows that the top 10 keywords are strongly related.
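The alternative mentioned in passing, expressing occurrences relative to the number of titles in each journal, might be sketched as follows. The function name `keyword_share` and the figures are hypothetical, and strictly speaking the result is occurrences per 100 titles, which only equals a percentage of titles when a keyword appears at most once per title.

```python
def keyword_share(counts, titles_per_journal):
    """Rescale raw keyword counts to occurrences per 100 titles in each journal.

    counts: {journal: {keyword: occurrences}}
    titles_per_journal: {journal: number of article titles in that journal}
    """
    shares = {}
    for journal, keyword_counts in counts.items():
        total = titles_per_journal[journal]
        shares[journal] = {keyword: 100.0 * n / total
                           for keyword, n in keyword_counts.items()}
    return shares


# Invented figures, for illustration only.
sample_counts = {"Journal A": {"formation": 50}, "Journal B": {"formation": 50}}
sample_totals = {"Journal A": 200, "Journal B": 1000}

if __name__ == "__main__":
    print(keyword_share(sample_counts, sample_totals))
```

With these made-up numbers both journals have the same raw count, but the rescaled values differ by a factor of five, which is the kind of effect raw counts hide for well-represented journals.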
Frankly, I really don’t know what to make of it: I certainly don’t want to make any bold claims (such as ‘China and evidence are unrelated!’ or ‘Lethaia doesn’t accept articles about China!’). What I will say is that I do know that if I knew more about the data (Know Your Big Data!), had used a larger number of keywords and had introduced time into the proceedings (change over time being, of course, the historian’s bread and butter) I might be able to start seeing some interesting trends.
And that is precisely what I am going to do. Shortly, I will be taking delivery of a similar dataset for journals published in the field of History (which I know well) and will be repeating the exercise (with the tweaks mentioned above). If you can think of anything in particular you’d like me to look out for or want to comment on what I’ve done thus far, please let me know.