UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites


2 posts from September 2018

28 September 2018

Sports Collections in the UK Web Archive

By Helena Byrne, Web Archive Curator, The British Library

The 30th September is National Sporting Heritage Day in the UK and to celebrate the event in 2018 we will give you a quick overview of our sports collections. 

Sport studies give us a real insight into the popular culture and political issues of the time. However, it is a subject area that has often been underrepresented in many traditional libraries and archives. The UK Web Archive works across the six UK Legal Deposit Libraries and with other external partners to try to bridge gaps in our subject expertise.

UKWA Sports Collections
We currently have three collections that focus on sport:

  1. Sport: Football
  2. Sports Collection
  3. Sports: International Events

[Figure: Football trend graph on SHINE]

Sport: Football
Football in all its varieties is probably the most popular sport in the UK, which is why there is a collection dedicated exclusively to football and related activities. There are many subsections to the Rugby and Soccer strands of the collection, which can be viewed by clicking on the information box.


Sports Collection
The general collection on sports has been broken down into subsections based on the type of sport rather than a specific sport title like tennis or snooker. These subject headings were based on the Universal Decimal Classification page about sport (from PD 1000 – 2003 UDC Abridged Edition). We used this general taxonomy of sports so that the collection can easily adapt to new sporting trends that emerge in the future. The Ball Sports section excludes football as there is already a dedicated collection on this subject. Ball sports is probably the most versatile section and this has an additional five subsections:

  1. By Hand
  2. On a Table
  3. With Club
  4. With Racket (Racquet)
  5. With a Stick


Sports: International Events
Our third main collection covers international sporting events. Currently there are six subsections in this collection:

  1. Olympic & Paralympic Games 2012
  2. Commonwealth Games Glasgow 2014
  3. Tour De France (Yorkshire)
  4. Winter Olympics Sochi 2014
  5. Rugby World Cup 2015
  6. Rio Olympics 2016

The decision to build collections on international sporting events is dependent on staff resource and their subject knowledge of these events. Going forward, we would like to build collections around the major sporting events hosted in the UK, but this is not always easy or possible. A major challenge around collecting on international events is that many of the web publishers are not based in the UK and do not always set up a UK website for the event. We archive content under the Non-Print Legal Deposit Regulations 2013, which means we are not able to automatically scope in content published outside the UK.

Access and Reuse
Under the Non-Print Legal Deposit Regulations 2013, access to archived content is restricted to a UK Legal Deposit Library reading room. However, if we have permission from the website owner, we can make the archived version of their content open access, along with government publications released under the Open Government Licence. This is why, if you browse through the collections on the Beta version of our website, most of the links to archived content will direct you to one of the UK Legal Deposit Libraries for access, while some of the content can be viewed from your personal device.

The UK Web Archive can be used just like many other primary resources whether it be a magazine or a newsletter and the same copyright regulations apply. The web has been in use for nearly 30 years and the publication The Web as History gives an outline of how researchers from different disciplines interact with web and web archive content. Some of the datasets used in this publication are available for reuse from:

International Internet Preservation Consortium (IIPC)
As individual institutions, the British Library and the National Library of Scotland are members of the International Internet Preservation Consortium (IIPC) and have worked on building collaborative collections covering international events such as the Summer and Winter Olympic and Paralympic Games. Since the formation of the IIPC Content Development Group (CDG) in 2015, there has been a consolidated effort to build collections both on and off the playing field. The British Library took the lead curatorial role for the 2016 Summer Olympic and Paralympic Games and the 2018 Winter Olympic and Paralympic Games. All of the IIPC collections are open access.

Get Involved
The UK Web Archive aims to archive, preserve and give access to the entire UK web space. 

If you see content that should be included in one of our sports collections, then please fill in our online nomination form.
Alternatively, if you would like to get more hands-on with curating a collection, then get in touch.


27 September 2018

Web Archives: A Tool for Geographical Research?

By Emmanouil Tranos and Christoph Stich, University of Birmingham

If you are a quantitative social scientist, there are few things more fascinating than free, under-utilised, quirky and easy-to-download data that also fits the narrative of 'big data' well.

Combine the above characteristics with data that have the potential to support researchers in answering interesting research questions and you will make a researcher happy! And this is exactly what the JISC UK Web Domain Dataset held by the UK Web Archive is all about.

A detailed description of the data can be found here, but briefly this is a subset of the Internet Archive that includes all the archived webpages under the .UK Top Level Domain (TLD) as well as the archival timestamp for the period January 1996 to March 2013. The UK Web Archive partnered with the Internet Archive and JISC to create this unique data set, which enables researchers to easily access probably the largest national archive of webpages.

The UK web space has several unique characteristics
Apart from the fact that the UK was an early adopter of internet technologies and applications, its web space also includes some widely recognisable second-level domain names such as .co.uk and .ac.uk. While the first one (mainly) denotes commercial activities based in the UK, similar to the .com top-level domain, the latter is used for UK universities. Moreover, the English language makes the UK web space more accessible to the rest of the world.

How is this dataset useful?
The JISC UK Web Domain Dataset is an easy way to access the Internet Archive data. It is, in essence, a long list of strings (i.e. groups of characters) that include the archival timestamp and the original URL of the archived webpages.

For instance, the first numerical part of the line below is a timestamp in the form YYYYMMDDhhmmss, indicating when the contact page of the website was archived (9 May 2008 at 16:21:38).

20080509162138/ IG8 8HD

With the use of these strings a researcher can retrieve the HTML documents of the archived webpages from the Internet Archive API. The UK Web Archive further processed this data and created a subset of the archived UK webpages that includes all the .uk webpages that contain a UK postcode.

In the above example, the last element indicates that this specific webpage contains the postcode IG8 8HD.
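As a minimal sketch of how such a string might be handled in Python, the snippet below splits a line into its timestamp, URL and postcode, and builds a Wayback Machine address from the first two. The field layout is an assumption here (the timestamp, a slash, the URL, then whitespace and the postcode), the function names are hypothetical, and `www.example.co.uk` is a placeholder rather than the site from the example above.

```python
from datetime import datetime
from typing import NamedTuple


class GeoindexRecord(NamedTuple):
    archived_at: datetime
    url: str
    postcode: str


def parse_line(line: str) -> GeoindexRecord:
    """Parse one line, assuming the layout '<14-digit timestamp>/<url> <postcode>'."""
    stamp, rest = line.strip().split("/", 1)
    # Assumption: the URL contains no spaces, so the first whitespace ends it.
    url, postcode = rest.split(None, 1)
    archived_at = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return GeoindexRecord(archived_at, url, postcode.strip())


def wayback_url(record: GeoindexRecord) -> str:
    """Build a Wayback Machine address for the archived copy of the page."""
    stamp = record.archived_at.strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{stamp}/{record.url}"


record = parse_line("20080509162138/http://www.example.co.uk/contact IG8 8HD")
print(record.archived_at)  # 2008-05-09 16:21:38
print(record.postcode)     # IG8 8HD
print(wayback_url(record))
```

The resulting address can then be fetched with any HTTP client to obtain the archived HTML document.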

This dataset, which is known as the Geoindex and can be downloaded from here, is probably one of the largest open data sets of georeferenced digital content.

There are, however, a number of technical and conceptual challenges attached to the usage of these data. For instance, there is a debate in the literature regarding how much of the web is currently archived (e.g. Hale et al., 2017). Although there is some critique regarding the depth of the archival process (i.e. how many webpages from each website are archived), the Internet Archive is the most extensive digital archive (Holzmann et al., 2016; Ainsworth et al., 2011).

Moreover, the volume of the data requires some upfront investment in data analysis skills, but working with it is still doable using standard off-the-shelf libraries and tools (e.g. in Python or R).

After filtering out invalid postcodes, we are left with a dataset that contains about 5.8 million pairs of British postcodes and domain names.
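The post does not describe how invalid postcodes were identified. One simple way to approach this step, sketched below in Python, is a format check with a regular expression followed by de-duplication of the (domain, postcode) pairs. The pattern is a deliberately simplified assumption: it tests only the general shape of a UK postcode, not whether the postcode actually exists, and the helper names and sample domains are hypothetical.

```python
import re

# Simplified format check: outward code (e.g. "IG8", "SW1A"), optional space,
# inward code (one digit plus two letters). It does not confirm that the
# postcode is actually in use.
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}$")


def is_plausible_postcode(text: str) -> bool:
    """Return True if the text has the general shape of a UK postcode."""
    return POSTCODE_RE.match(text.strip().upper()) is not None


def valid_pairs(pairs):
    """Keep unique (domain, postcode) pairs whose postcode looks well formed."""
    return sorted(
        {(domain.lower(), postcode.strip().upper())
         for domain, postcode in pairs
         if is_plausible_postcode(postcode)}
    )


sample = [
    ("example.co.uk", "IG8 8HD"),
    ("example.co.uk", "ig8 8hd"),   # duplicate once normalised
    ("example.ac.uk", "SW1A 1AA"),
    ("example.org.uk", "12345"),    # not in UK postcode format
]
print(valid_pairs(sample))
```

A stricter approach would check each candidate against an authoritative postcode list, which also catches strings that match the pattern but were never issued.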

As one can see in the plot, the number of domains that reference a postcode grows relatively rapidly in the decade between 1995 and 2005 before growth levels off. The distribution of domains also more or less aligns with the population density of the UK. This is a good indicator that the collected data captures actual activity in the UK.
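A yearly count of this kind could be produced along the following lines. This is only an illustrative Python sketch, not the procedure used for the plot: it assumes records are available as (timestamp, URL) pairs, uses the full hostname as a stand-in for the domain, and the function name and sample records are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domains_per_year(records):
    """Count distinct hostnames seen with a postcode in each archival year.

    `records` is an iterable of (timestamp, url) pairs, where the timestamp
    is a 14-digit string in the form YYYYMMDDhhmmss.
    """
    seen = defaultdict(set)
    for stamp, url in records:
        host = urlsplit(url).hostname
        if host:
            # The first four characters of the timestamp are the year.
            seen[int(stamp[:4])].add(host)
    return {year: len(hosts) for year, hosts in sorted(seen.items())}


records = [
    ("19990101120000", "http://www.example.co.uk/"),
    ("19990301120000", "http://www.example.co.uk/contact"),
    ("20080509162138", "http://www.example.co.uk/contact"),
    ("20080601000000", "http://www.example.ac.uk/about"),
]
print(domains_per_year(records))  # {1999: 1, 2008: 2}
```

Grouping by postcode area instead of by year would give the kind of spatial distribution that can be compared with population density.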


Unsurprisingly the data also reveal a difference between London and the rest of the country. The number of domains that reference a postcode per inhabitant grew faster in London than in other places, but eventually the rest of the country caught up with London. There are, however, quite significant differences in how the domains are distributed within London as well.



So, what research questions can these data help us answer? Utilising funding from the ESRC and the Consumer Data Research Centre (CDRC) we employed this data to explore the evolution of the digital economy in the UK. Firstly, we are utilising this data in order to understand whether the availability of online content attracts individuals online. We do that by employing unique survey data available from CDRC.

Our underlying hypothesis is that the availability of internet content of local interest can attract people online in order to access and take advantage of the potential on-line opportunities such as accessing local products and services. The first results seem to support our hypothesis.

Secondly, we are using this data to explore the economic activities (e.g. products and services offered by firms) that take place in some of the UK digital clusters. By filtering the data to only focus on archived web pages from specific clusters in the UK and by utilising the textual data available from the archived HTML documents, we are building topic models to reveal what type of economic activities exist in these clusters and how these activities have evolved over time.

We are testing how this archived web data can help us learn more about economic activities and how they have evolved over time. We are also comparing the outputs of this analysis with official industrial classifications from various sources, including such data freely available from the CDRC.

Lastly, together with colleagues from City-REDI, we are using the archived web data as a proxy to understand the early adoption of web technologies in the UK. Building upon arguments developed in evolutionary economics, the early adoption of web technologies may signify innovative regions which developed 'digital capacity' early enough, something which may affect their future growth trajectories. The first results indicate that indeed the early adoption of web technologies is related to positive future growth trajectories.

To close, we believe that our on-going research, apart from answering substantive geographical research questions, will also illustrate the value of archived web data for geographical research. It is one of the few available data sources that can provide longitudinal georeferenced data, which also includes a wealth of unstructured textual data.

The latter can also reveal patterns and activities that other more 'conventional' data sources would not have been able to uncover.

Ainsworth, S. G., Alsum, A., SalahEldeen, H., Weigle, M. C., & Nelson, M. L. (2011). How much of the web is archived? Paper presented at the Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries.

Hale, S. A., Blank, G., & Alexander, V. D. (2017). Live versus archive: Comparing a web archive to a population of web pages. In N. Brügger & R. Schroeder (Eds.), Web as History: Using Web Archives to Understand the Past and the Present (pp. 45-61). London: UCL Press.

Holzmann, H., Nejdl, W., & Anand, A. (2016). The Dawn of today's popular domains: A study of the archived German Web over 18 years. Paper presented at the Digital Libraries (JCDL), 2016 IEEE/ACM Joint Conference.