Web Archives: A Tool for Geographical Research?
By Emmanouil Tranos and Christoph Stich, University of Birmingham
If you are a quantitative social scientist, there are few things more fascinating than free, under-utilised, quirky and easy-to-download data that also fit the 'big data' narrative well.
Combine these characteristics with the potential to help researchers answer interesting research questions and you will make a researcher happy! This is exactly what the JISC UK Web Domain Dataset, held by the UK Web Archive, is all about.
A detailed description of the data can be found here, but briefly this is a subset of the Internet Archive that includes all the archived webpages under the .uk Top Level Domain (TLD), together with their archival timestamps, for the period January 1996 to March 2013. The UK Web Archive partnered with the Internet Archive and JISC to create this unique data set, which gives researchers easy access to what is probably the largest national archive of webpages.
The UK web space has several unique characteristics
Apart from the fact that the UK was an early adopter of internet technologies and applications, its web space also includes some widely recognisable second level domain names such as .co.uk and .ac.uk. While the former (mainly) denotes commercial activities based in the UK, similar to the .com top level domain, the latter is used by UK universities. Moreover, the English language makes the UK web space more accessible to the rest of the world.
How is this dataset useful?
The JISC UK Web Domain Dataset is an easy way to access the Internet Archive data. It is, in essence, a long list of strings (i.e. groups of characters) that include the archival timestamp and the original URL of the archived webpages.
For instance, the first numerical part of the line below indicates when the contact page of the uk.eurogate.co.uk website was archived (9 May 2008 at 16:21:38).
20080509162138/http://uk.eurogate.co.uk/contact_us IG8 8HD
With the use of these strings a researcher can retrieve the HTML documents of the archived webpages from the Internet Archive API. The UK Web Archive further processed this data and created a subset of the archived UK webpages that includes all the .uk webpages that contain a UK postcode.
In the above example, the last element indicates that this specific webpage contains the postcode IG8 8HD.
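To make the structure of such a line concrete, here is a minimal Python sketch that splits it into its three parts and builds a Wayback Machine address for the archived page. It assumes that whitespace separates the URL from the postcode, as in the example line above; the names GeoindexRecord, parse_geoindex_line and wayback_url are our own illustrative choices, and the web.archive.org address pattern is the public Wayback Machine one rather than a documented part of the dataset.

```python
from datetime import datetime
from typing import NamedTuple


class GeoindexRecord(NamedTuple):
    """One parsed line: when the page was archived, its URL, its postcode."""
    archived_at: datetime
    url: str
    postcode: str


def parse_geoindex_line(line: str) -> GeoindexRecord:
    """Split '<timestamp>/<url> <postcode>' into its three parts."""
    # Everything before the first slash is the timestamp (YYYYMMDDhhmmss).
    timestamp, rest = line.strip().split("/", 1)
    archived_at = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    # Assumption: the URL contains no whitespace, so the first run of
    # whitespace separates it from the postcode.
    url, postcode = rest.split(None, 1)
    return GeoindexRecord(archived_at, url, postcode.strip())


def wayback_url(record: GeoindexRecord) -> str:
    """Build the Wayback Machine address of the archived capture."""
    stamp = record.archived_at.strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{stamp}/{record.url}"


record = parse_geoindex_line(
    "20080509162138/http://uk.eurogate.co.uk/contact_us IG8 8HD"
)
print(record.archived_at.isoformat())
print(record.url)
print(record.postcode)
print(wayback_url(record))
```

The real files may use a different delimiter (a tab, for example), in which case only the split on whitespace needs adjusting.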
This dataset, which is known as the Geoindex and can be downloaded from here, is probably one of the largest open data sets of georeferenced digital content.
There are, however, a number of technical and conceptual challenges attached to the usage of these data. For instance, there is a debate in the literature regarding how much of the web is currently archived (e.g. Hale et al. 2017). Although there is some critique regarding the depth of the archival process (i.e. how many webpages from each website are archived), the Internet Archive is the most extensive digital archive (Holzmann et al., 2016; Ainsworth et al., 2011).
Moreover, the volume of the data requires some upfront investment in data analysis skills, but the analysis is still doable with standard off-the-shelf tools and libraries (e.g. in Python or R).
After filtering out invalid postcodes, we are left with a dataset that contains about 5.8 million pairs of British postcodes and domain names.
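As an illustration of this kind of filtering, the sketch below keeps only records whose postcode has a plausible shape and reduces them to unique postcode and host pairs. It is not the procedure we actually used: the regular expression is a simplified format check that does not confirm a postcode exists, the host name is only a rough stand-in for the registered domain, and the function names and the second and third sample records are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Simplified shape check (assumption): outward code, optional space,
# then a digit and two letters. It does not verify the postcode exists.
POSTCODE_SHAPE = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}$")


def normalise_postcode(raw):
    """Return the postcode as 'OUTWARD INWARD', or None if malformed."""
    candidate = " ".join(raw.upper().split())
    if POSTCODE_SHAPE.match(candidate) is None:
        return None
    compact = candidate.replace(" ", "")
    # The inward code is always the last three characters.
    return f"{compact[:-3]} {compact[-3:]}"


def postcode_domain_pairs(records):
    """Collect unique (postcode, host) pairs from (url, postcode) records."""
    pairs = set()
    for url, raw_postcode in records:
        postcode = normalise_postcode(raw_postcode)
        host = urlsplit(url).hostname  # rough proxy for the domain name
        if postcode is not None and host is not None:
            pairs.add((postcode, host))
    return pairs


records = [
    ("http://uk.eurogate.co.uk/contact_us", "IG8 8HD"),
    ("http://uk.eurogate.co.uk/about", "ig88hd"),   # hypothetical record
    ("http://example.co.uk/", "12345"),             # hypothetical, invalid
]
pairs = postcode_domain_pairs(records)
print(pairs)
```

A stricter filter would match each candidate against an official list of postcodes rather than relying on its shape alone.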
As one can see in the plot, the number of domains that reference a postcode grows relatively rapidly in the decade between 1995 and 2005 before growth levels off. The distribution of domains also more or less aligns with the population density of the UK. This is a good indicator that the collected data capture actual activity in the UK.
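One plausible way to derive such a yearly series is to count the distinct hosts observed with a postcode in each calendar year. The sketch below does this for a handful of made-up observations; the function name, the input layout and every sample record except the uk.eurogate.co.uk timestamp are assumptions for illustration, and the plot itself may have been produced differently.

```python
from collections import defaultdict


def domains_per_year(observations):
    """Count distinct hosts seen with a postcode in each calendar year."""
    hosts_by_year = defaultdict(set)
    for timestamp, host in observations:
        # The first four characters of the archival timestamp are the year.
        year = int(timestamp[:4])
        hosts_by_year[year].add(host)
    return {year: len(hosts) for year, hosts in sorted(hosts_by_year.items())}


observations = [
    ("20080509162138", "uk.eurogate.co.uk"),
    ("20081101000000", "uk.eurogate.co.uk"),  # hypothetical repeat capture
    ("20090101120000", "uk.eurogate.co.uk"),  # hypothetical
    ("20090315093000", "example.co.uk"),      # hypothetical
]
yearly_counts = domains_per_year(observations)
print(yearly_counts)
```

Using a set per year means that repeated captures of the same host within a year are counted once.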
Unsurprisingly the data also reveal a difference between London and the rest of the country. The number of domains that reference a postcode per inhabitant grew faster in London than in other places, but eventually the rest of the country caught up with London. There are, however, quite significant differences in how the domains are distributed within London as well.
So, what research questions can these data help us answer? Utilising funding from the ESRC and the Consumer Data Research Centre (CDRC) we employed this data to explore the evolution of the digital economy in the UK. Firstly, we are utilising this data in order to understand whether the availability of online content attracts individuals online. We do that by employing unique survey data available from CDRC.
Our underlying hypothesis is that the availability of internet content of local interest can attract people online in order to access and take advantage of the potential on-line opportunities such as accessing local products and services. The first results seem to support our hypothesis.
Secondly, we are using this data to explore the economic activities (e.g. products and services offered by firms) that take place in some of the UK digital clusters. By filtering the data to only focus on archived web pages from specific clusters in the UK and by utilising the textual data available from the archived HTML documents, we are building topic models to reveal what type of economic activities exist in these clusters and how these activities have evolved over time.
We are testing how this archived web data can help us learn more about economic activities and how they have evolved over time. We are also comparing the outputs of this analysis with official industrial classifications from various sources, including data freely available from CDRC.
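Before any topic model can be fitted, the archived HTML has to be turned into plain words. The sketch below shows one simple way to do this preprocessing step with Python's standard library; it is not our actual pipeline, the class and function names are illustrative, the sample page is invented, and the topic model itself is not shown.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def tokens_from_html(html):
    """Return lower-case words of three or more letters from the page text."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return re.findall(r"[a-z]{3,}", text)


# Invented page, standing in for an archived HTML document.
sample_html = (
    "<html><head><title>Contact us</title>"
    "<style>p {color: red}</style></head>"
    "<body><p>Freight and logistics services</p>"
    "<script>var x = 1;</script></body></html>"
)
tokens = tokens_from_html(sample_html)
bag_of_words = Counter(tokens)  # word counts, a typical topic model input
print(tokens)
```

In practice one would also remove stop words such as "and" and handle non-English or malformed pages before passing the word counts to a topic modelling library.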
Lastly, together with colleagues from City-REDI, we are using the archived web data as a proxy to understand the early adoption of web technologies in the UK. Building upon arguments developed in evolutionary economics, we argue that the early adoption of web technologies may signify innovative regions which developed 'digital capacity' early enough, something which may affect their future growth trajectories. The first results indicate that the early adoption of web technologies is indeed related to positive future growth trajectories.
To close, we believe that our on-going research, apart from answering substantive geographical research questions, will also illustrate the value of archived web data for geographical research. It is one of the few available data sources that can provide longitudinal georeferenced data, which also includes a wealth of unstructured textual data.
The latter can also reveal patterns and activities that other more 'conventional' data sources would not have been able to uncover.
Ainsworth, S. G., Alsum, A., SalahEldeen, H., Weigle, M. C., & Nelson, M. L. (2011). How much of the web is archived? Paper presented at the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries.
Hale, S. A., Blank, G., & Alexander, V. D. (2017). Live versus archive: Comparing a web archive to a population of web pages. In N. Brügger & R. Schroeder (Eds.), Web as History: Using Web Archives to Understand the Past and the Present (pp. 45-61). London: UCL Press.
Holzmann, H., Nejdl, W., & Anand, A. (2016). The dawn of today's popular domains: A study of the archived German Web over 18 years. Paper presented at the 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL).