THE BRITISH LIBRARY

UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

28 September 2018

Sports Collections in the UK Web Archive

By Helena Byrne, Web Archive Curator, The British Library

The 30th September is National Sporting Heritage Day in the UK and to celebrate the event in 2018 we will give you a quick overview of our sports collections. 

Introduction
Sport studies give us a real insight into popular culture and political issues of the time. However, it is a subject area that has often been underrepresented in many traditional libraries and archives. The UK Web Archive works across the six UK Legal Deposit Libraries and with other external partners to try and bridge gaps in our subject expertise.

UKWA Sports Collections
We currently have three collections that focus on sport:

  1. Sport: Football
  2. Sports Collection
  3. Sports: International Events

Trend graph on SHINE

Sport: Football
Football in all its varieties is probably the most popular sport in the UK, which is why there is a collection dedicated exclusively to football and related activities. There are many subsections to the Rugby and Soccer strand of the collection which can be viewed by clicking on the information box.

Sports Collection
The general collection on sports has been broken down into subsections based on the type of sport rather than a specific sport title like tennis or snooker. These subject headings were based on the Universal Decimal Classification page about sport (from PD 1000 – 2003 UDC Abridged Edition). We used this general taxonomy of sports so that the collection can easily adapt to new sporting trends that emerge in the future. The Ball Sports section excludes football as there is already a dedicated collection on this subject. Ball sports is probably the most versatile section and this has an additional five subsections:

  1. By Hand
  2. On a Table
  3. With Club
  4. With Racket (Racquet)
  5. With a Stick

Sports: International Events
Our third main collection covers international sporting events. Currently there are six subsections in this collection:

  1. Olympic & Paralympic Games 2012
  2. Commonwealth Games Glasgow 2014
  3. Tour De France (Yorkshire)
  4. Winter Olympics Sochi 2014
  5. Rugby World Cup 2015
  6. Rio Olympics 2016


The decision to build collections on international sporting events is dependent on staff resource and their subject knowledge of these events. Going forward we would like to build collections around the major sporting events hosted in the UK but this is not always easy or possible. A major challenge around collecting on international events is that many of the web publishers are not based in the UK and do not always set up a UK website for the event. We archive content under the Non-Print Legal Deposit Regulations 2013, which means we are not able to automatically scope in content published outside the UK.

Access and Reuse
Under the Non-Print Legal Deposit Regulations 2013, access to archived content is restricted to a UK Legal Deposit library reading room. However, if we have permission from the website owner, we can make the archived version of their content open access, along with government publications under the Open Government Licence. This is why, if you browse through the collections on the Beta version of our website, most of the links to archived content will direct you to one of the UK Legal Deposit Libraries for access, while some of the content can be viewed from your personal device.

The UK Web Archive can be used just like many other primary resources whether it be a magazine or a newsletter and the same copyright regulations apply. The web has been in use for nearly 30 years and the publication The Web as History gives an outline of how researchers from different disciplines interact with web and web archive content. Some of the datasets used in this publication are available for reuse from: data.webarchive.org.uk/opendata/

International Internet Preservation Consortium (IIPC)
As individual institutions, the British Library and the National Library of Scotland are members of the International Internet Preservation Consortium (IIPC) and have worked on building collaborative collections covering international events such as the Summer and Winter Olympic/Paralympic Games. Since the formation of the IIPC Content Development Group (CDG) in 2015, there has been a consolidated effort to build collections both on and off the playing field. The British Library took the lead curatorial role in the 2016 Summer Olympic and Paralympic Games and the 2018 Winter Olympic and Paralympic Games collections. All of the IIPC collections are open access.

Get Involved
The UK Web Archive aims to archive, preserve and give access to the entire UK web space. 

If you see content that should be included in one of our sports collections then please fill in our online nomination form.
Alternatively, if you would like to get more hands-on with curating a collection then get in touch.

 

27 September 2018

Web Archives: A Tool for Geographical Research?

By Emmanouil Tranos and Christoph Stich, University of Birmingham

Introduction
If you are a quantitative social scientist there are few things more fascinating than free, under-utilised, quirky and easy to download data that also fits the narrative of 'big data' well.

Combine the above characteristics with data that have the potential to support researchers answering interesting research questions and then you will make a researcher happy! And this is exactly what the JISC UK Web Domain Dataset held by the UK Web Archive is all about.

A detailed description of the data can be found here, but briefly this is a subset of the Internet Archive that includes all the archived webpages under the .UK Top Level Domain (TLD) as well as the archival timestamp for the period January 1996 to March 2013. The UK Web Archive partnered with the Internet Archive and JISC to create this unique data set, which enables researchers to easily access probably the largest national archive of webpages.

The UK web space has several unique characteristics
Apart from the fact that the UK was an early adopter of internet technologies and applications, it also includes some widely recognisable second level domain names such as .co.uk and .ac.uk. While the former (mainly) denotes commercial activities based in the UK, similar to the .com top level domain, the latter is used for UK universities. Moreover, the English language makes the UK web space more accessible to the rest of the world.

How is this dataset useful?
The JISC UK Web Domain Dataset is an easy way to access the Internet Archive data. It is, in essence, a long list of strings (i.e. groups of characters) that include the archival timestamp and the original URL of the archived webpages.

For instance, the first numerical part of the line below indicates when the contact page of the uk.eurogate.co.uk website was archived (9 May 2008 at 16:21:38).

20080509162138/http://uk.eurogate.co.uk/contact_us IG8 8HD

With the use of these strings a researcher can retrieve the HTML documents of the archived webpages from the Internet Archive API. The UK Web Archive further processed this data and created a subset of the archived UK webpages that includes all the .uk webpages that contain a UK postcode.

In the above example, the last element indicates that this specific webpage contains the postcode IG8 8HD.
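As a rough illustration, the example line can be taken apart with a few lines of Python. This is only a sketch: it assumes that every line follows the layout shown above (a 14-digit timestamp, a slash, the URL, then whitespace and the postcode), and the function name `parse_geoindex_line` is made up for this post. The `wayback_url` field uses the public Wayback Machine address pattern as one possible way to reach the archived page; it is an assumption and not necessarily the route used in the research described here.

```python
from datetime import datetime
from urllib.parse import urlsplit


def parse_geoindex_line(line):
    """Split one line into its capture time, original URL and postcode."""
    # The 14-digit timestamp and the URL are separated by the first "/".
    timestamp, rest = line.strip().split("/", 1)
    # Assumption: the URL contains no whitespace, so the postcode is
    # everything after the first run of whitespace.
    url, postcode = rest.split(None, 1)
    return {
        "archived_at": datetime.strptime(timestamp, "%Y%m%d%H%M%S"),
        "url": url,
        "domain": urlsplit(url).hostname,
        "postcode": postcode.strip(),
        # Assumed replay address (Wayback Machine timestamp/URL pattern).
        "wayback_url": f"https://web.archive.org/web/{timestamp}/{url}",
    }


record = parse_geoindex_line(
    "20080509162138/http://uk.eurogate.co.uk/contact_us IG8 8HD"
)
print(record["archived_at"], record["domain"], record["postcode"])
```

Parsing the timestamp into a `datetime` makes it straightforward to group captures by year later on.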

This dataset, which is known as the Geoindex and can be downloaded from here, is probably one of the largest open data sets of georeferenced digital content.

Challenges
There are, however, a number of technical and conceptual challenges attached to the usage of these data. For instance, there is a debate in the literature regarding how much of the web is currently archived (e.g. Hale et al. 2017). Although there is some critique regarding the depth of the archival process (i.e. how many webpages from each website are archived), the Internet Archive is the most extensive digital archive (Holzmann et al., 2016; Ainsworth et al., 2011).

Moreover, the volume of the data requires some upfront investment regarding data analysis skills, but is still doable with some standard off-the-shelf libraries and tools (e.g. Python or R).

Results
After filtering out invalid postcodes, we are left with a dataset that contains about 5.8 million pairs of British postcodes and domain names.
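To give a flavour of what such a filtering step might look like, here is a minimal Python sketch. It is not the filter used for the results reported here: the regular expression only checks that a string has the general shape of a UK postcode, not that the postcode was ever issued, and the function names and sample records are invented for illustration.

```python
import re

# Simplified outward + inward shape, e.g. "IG8 8HD" or "SW1A 1AA".
# This checks the format only, not whether the postcode actually exists.
POSTCODE_SHAPE = re.compile(r"[A-Z]{1,2}[0-9][0-9A-Z]? [0-9][A-Z]{2}")


def looks_like_postcode(text):
    """Return True when the text has the general shape of a UK postcode."""
    # Upper-case the text and collapse any run of whitespace to one space.
    normalised = " ".join(text.upper().split())
    return POSTCODE_SHAPE.fullmatch(normalised) is not None


def unique_pairs(records):
    """Collect distinct (postcode, domain) pairs with plausible postcodes."""
    return {
        (postcode, domain)
        for postcode, domain in records
        if looks_like_postcode(postcode)
    }


# Made-up sample records, for illustration only.
sample = [
    ("IG8 8HD", "uk.eurogate.co.uk"),
    ("IG8 8HD", "uk.eurogate.co.uk"),  # the same pair captured twice
    ("SW1A 1AA", "example.co.uk"),
    ("12345 678", "example.org.uk"),  # not a postcode shape
]
pairs = unique_pairs(sample)
print(len(pairs))
```

A stricter filter would compare each candidate against an official list of postcodes, which a pattern check like this cannot replace.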

As one can see in the plot, the number of domains that reference a postcode grows relatively rapidly in the decade between 1995 and 2005 before growth levels off. The distribution of domains also more or less aligns with the population density of the UK. This is a good indicator that the collected data captures actual activity in the UK.

 

Unsurprisingly the data also reveal a difference between London and the rest of the country. The number of domains that reference a postcode per inhabitant grew faster in London than in other places, but eventually the rest of the country caught up with London. There are, however, quite significant differences in how the domains are distributed within London as well.

So, what research questions can these data help us answer? Utilising funding from the ESRC and the Consumer Data Research Centre (CDRC) we employed this data to explore the evolution of the digital economy in the UK. Firstly, we are utilising this data in order to understand whether the availability of online content attracts individuals online. We do that by employing unique survey data available from CDRC.

Hypothesis
Our underlying hypothesis is that the availability of internet content of local interest can attract people online in order to access and take advantage of potential online opportunities such as accessing local products and services. The first results seem to support our hypothesis.

Secondly, we are using this data to explore the economic activities (e.g. products and services offered by firms) that take place in some of the UK digital clusters. By filtering the data to only focus on archived web pages from specific clusters in the UK and by utilising the textual data available from the archived HTML documents, we are building topic models to reveal what type of economic activities exist in these clusters and how these activities have evolved over time.

We are testing how this archived web data can help us learn more about economic activities and how they have evolved over time. We are also comparing the outputs of this analysis with official industrial classifications from various sources, including freely available data of this kind from CDRC.

Lastly, together with colleagues from City-REDI, we are using the archived web data as a proxy to understand the early adoption of web technologies in the UK. Building upon arguments developed in evolutionary economics, the early adoption of web technologies may signify innovative regions which developed 'digital capacity' early enough, something which may affect their future growth trajectories. The first results indicate that indeed the early adoption of web technologies is related to positive future growth trajectories.

To close, we believe that our on-going research, apart from answering substantive geographical research questions, will also illustrate the value of archived web data for geographical research. It is one of the few available data sources that can provide longitudinal georeferenced data, which also includes a wealth of unstructured textual data.

The latter can also reveal patterns and activities that other more 'conventional' data sources would not have been able to uncover.

References
Ainsworth, S. G., Alsum, A., SalahEldeen, H., Weigle, M. C., & Nelson, M. L. (2011). How much of the web is archived? Paper presented at the Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries.

Hale, S. A., Blank, G., & Alexander, V. D. (2017). Live versus archive: Comparing a web archive to a population of web pages. In N. Brügger & R. Schroeder (Eds.), Web as History: Using Web Archives to Understand the Past and the Present (pp. 45-61). London: UCL Press.

Holzmann, H., Nejdl, W., & Anand, A. (2016). The Dawn of today's popular domains: A study of the archived German Web over 18 years. Paper presented at the Digital Libraries (JCDL), 2016 IEEE/ACM Joint Conference.

24 August 2018

How is the UK Web Archive documenting the ‘bodily autonomy’ debate online?

This blog post follows on from Kelly Burchmore’s post, Building collections on Gender Equality at the UK Web Archive. If you have not read it yet, we would encourage you to do so first.

Background
The UK Web Archive (UKWA) aims to collect online material connected with nationally important issues and debates. Recently this has included the long running discussions around bodily autonomy. Much of this material is published via social media, which can be very challenging to collect.

See the trend online.

Why is UKWA #Archivingthe8th? [1]
Although the UK Web Archive only collects material related to the UK, many individuals and groups connected with the referendum on the 8th Amendment [1] campaigned in the UK, so much of the material falls within our remit.

In Britain, many sections of the Irish-based Abortion Rights Campaign group were set up in various cities, starting with the London Irish Abortion Rights Campaign. In the lead-up to the referendum date they ran a home to vote campaign through the website hometovote.com. The pro-life group London Irish United For Life ran a similar campaign through the website hometovote.uk. All of these websites, and many more related to this subject, are archived in the Bodily Autonomy subsection of the Gender Equality collection.

The UK Web Archive only archives content published in the UK, but other web archives also collected content on this subject. The National Library of Ireland built a special collection on the referendum and George Washington University archived over 2 million tweets that used popular hashtags related to the referendum.

How to get involved?
If there are any UK websites or Twitter accounts that you think should be added to the Bodily Autonomy subsection of the Gender Equality collection, then you can take up the UK Web Archive’s call for action and nominate content by following this link:

beta.webarchive.org.uk/en/ukwa/info/nominate

By Helena Byrne, Curator of Web Archives, The British Library

[1] #Archivingthe8th
On the 25th of May 2018 the Republic of Ireland held a referendum on the 8th Amendment; repealing it would make way for the government to implement legislation on access to abortion services. Although the referendum on the 8th Amendment only affected the laws of the Republic of Ireland, its significance spread across the world and it received a lot of international media attention. Both pro-choice and pro-life solidarity campaign groups formed around the world, mostly made up of the Irish diaspora and other campaigners passionate about the subject. After the result was announced, the hashtag #archivingthe8th started trending on Twitter as people wanted to know how this part of public history was going to be preserved for future generations.

06 August 2018

Building collections on Gender Equality at the UK Web Archive

This is a guest blog by Kelly Burchmore, a graduate trainee digital archivist on the Bodleian Libraries’ Developing the Next Generation Archivist programme. The Bodleian is one of the 6 legal deposit libraries in the UK. One of her projects this year is to help curate special collections in the UK Web Archive. Since May she’s been working on the Gender Equality collection.

Why are we collecting gender equality websites?
2018 is the centenary of the Representation of the People Act 1918. UK-wide memorials and celebrations of this journey, and of the victory of women’s suffrage, are all evident online in events, exhibitions, commemorations and campaigns. Popular topics being discussed at the moment include the hashtags #timesup and #metoo, gender pay disparity and the recent referendum on the 8th Amendment in the Republic of Ireland. These discussions produce a lot of ephemeral material, and without web archiving this material is at risk of moving or even disappearing. Web Archives are able to demonstrate that gender equality is increasingly being discussed in the media and that these discussions have been developing over many years.

Through the UK Web Archive SHINE interface we can see that text matching the phrase ‘gender equality’ increased from 0.002% (24 out of 843,204) of crawled resources in 1996, to 0.044% (23,289 out of 53,146,359) in 2013.
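For anyone who wants to check these proportions, the arithmetic can be reproduced with a short Python sketch that uses only the counts quoted above. The function name `share_percent` is just a label chosen for this example. Note that the 1996 share works out at roughly 0.0028%, so the 0.002% figure appears to be truncated rather than rounded.

```python
def share_percent(matches, total):
    """Share of crawled resources matching a phrase, as a percentage."""
    return 100 * matches / total


# Counts as quoted from the SHINE results above.
share_1996 = share_percent(24, 843_204)
share_2013 = share_percent(23_289, 53_146_359)
relative_increase = share_2013 / share_1996

print(f"1996: {share_1996:.4f}%")
print(f"2013: {share_2013:.4f}%")
print(f"relative increase: {relative_increase:.1f}x")
```

Comparing shares rather than raw counts matters here, because the number of crawled resources itself grew enormously between 1996 and 2013.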

If we search UK web content relating to gender equality we generate a great many results. For example, organisations have published their gender pay discrepancy reports online, and there is a lot to engage with from the social media accounts of both individuals and organisations campaigning for gender equality. When we browse this web content it becomes apparent that gender equality means something different for the many presences online: charities, societies, employers, authorities, heritage centres and individuals such as social entrepreneurs, teachers, researchers and more.

What are we collecting?
The Gender Equality special collection, which is now live on the UK Web Archive, comprises material that provides a snapshot of attitudes towards gender equality in the UK. Web material is harvested under the areas of:

• Bodily autonomy
• Domestic abuse/Gender based violence
• Gender equality in the workplace
• Gender identity
• Parenting
• The gender pay gap
• Women’s suffrage

100 years on from the introduction of limited women’s suffrage, the fight for gender equality continues. The collection is still undergoing curation and growing in archival records - and you can help too!

How to get involved?
If there are any UK websites that you think should be added to the Gender Equality collection then you can take up the UK Web Archive’s call for action and nominate.

03 August 2018

Work Experience at the UK Web Archive

By Emily Mahoney

Upon hearing that I had a work experience placement in the British Library, I immediately thought of books and reading, a main passion of mine from a young age. When I found out about the many other sides to working in such an immense organisation (the British Library employs just over 1,500 people), I realised it would be far more fascinating than I had imagined.

I was assigned a position in Web Archiving with Helena Byrne for the week. Coming into a week of work experience in Web Archiving seemed overwhelming to me as someone with no previous experience in the topic. However, the team working in the department made me feel reassured immediately. Instead of being nervous, I could then focus on the multitude of interesting new information coming my way.

My first task was to identify images for the covers of the newer Special Collections on the UK Web Archive website. I was then informed that I would be working on a project with Leila Nassereldein, a PhD placement student focused on archiving a collection of online zines that are independent, self-published, and authored by Asian, African or Caribbean people in the UK. This was extremely exciting to me, as this is an area most people don’t necessarily think of when considering the British Library. Leila was keen on making a space for these zines, through which the smaller, independent and sometimes radical publications could also leave their mark in our web history. While working on this project with Leila I learnt to appraise, curate and archive contemporary websites using the Annotation Curation Tool (W3 ACT).

Before this week I had never come across the UK Web Archive and this experience has made me aware of just how important it is that we have access to this information in years to come. The online public archive is also an area with a large number of research points that I will definitely be using during any further study. When writing this I was asked what the ‘most interesting’ part of my placement was. However, it would be too hard to choose, given the number of things I have learnt this week that I had never encountered before. Overall, my experience at the British Library was an enriching one that I will never forget, and it helped me consider an aspect of our online life that had never occurred to me before.

11 May 2018

Online Hours: Supporting Open Source

Encouraging collaboration
Here at the UK Web Archive, we're very fortunate to be able to work in the open, with almost all code on GitHub. Some of our work has been taken up and re-used by others, which is great. We’d like to encourage more collaboration, but we've had trouble dedicating time to open project management, and our overall management process and our future plans are unclear. For example, we've experimented with so many different technologies over the years that our list of repositories gives little insight into where we're going next. There are also problems with how issues and pull-requests have been managed: they often languish unanswered, waiting for us to get around to looking at them. This also applies to the IIPC repositories and other projects we are involved in, as well as the projects we lead.

I wanted to block out some time to deal with these things promptly, but also to find a way of making it a bit more, well, fun. A bit more social. Some forum where we can chat about our process and plans without the formality of having to write things up.

Taking inspiration from Jason Scott's live-streamed CD-ripping sessions, we came up with the idea of something like Office Hours for Open Source: a kind of open video conference or live stream, where we'll share our process, discuss issues relating to open source projects and have a forum where anyone can ask questions about what we’re up to.

Who is this for?
All welcome, from lurkers to those brimming with burning questions. Just remember that being *kind* beats being right.

Furthermore, anyone else who manages open source projects like ours is also welcome to join and take the lead for a while! I can only cover the projects we’re leading, but there are many more that would be interesting to hear from.

When?
The plan is to launch the first Online Hours session on the 22nd of May, and then hold regular weekly slots every Tuesday from then on. We may not manage to run it every single week, but if it’s regular and frequent that should mean we can cope more easily with missing the odd one or two.

On the 22nd, we will run two sessions - one in the morning (for the west-of-GMT time-zones) and one in the evening (for the eastern half). Following that, we intend to switch between the two slots, making each a.m. and p.m. slot a fortnightly occurrence.

How?
The sessions will be webcast, with a Slack channel available for chat. See the IIPC Trello board for more information.

The IIPC (International Internet Preservation Consortium) have kindly agreed to help support this event and further Online Hours sessions. Running this initiative in a more open manner should raise the profile of our open source work both inside and outside of the IIPC, and encourage greater adoption of, and collaboration around, open source tools.

For full details, see the IIPC Trello Board card or ask a question in the NetPreserve Slack Channel #oh-sos (ask @NetPreserve to join the Slack).

See you there!

By Andrew Jackson, Web Archive Technical Lead, The British Library

 

04 May 2018

Star Wars in the Web Archive

May the fourth be with you!

It's Star Wars Day and I imagine that you are curious to know which side has won the battle of the UK web space.

Looking at the trends in our SHINE dataset (.uk websites 1996-2013 collected by the Internet Archive), I first looked at the iconic match-up of Luke vs Darth.

Bad news: evil seems to have won this round, mainly due to the popularity of Darth Vader costume mentions on retail websites.

How about a more general 'Light Side vs Dark side'? 

It appears that discussing the 'dark side' of many aspects in life is a lot more fun and interesting than the 'light side'. 

How about just analysing the phrase 'may the force be with you'?

This phrase doesn't seem to have been particularly popular on the UK web until it started to be used a lot on websites offering downloadable ringtones. Go figure.

Try using the trends feature on this dataset yourself here: www.webarchive.org.uk/shine/graph

Happy Star Wars Day!

by Jason Webber, Web Archive Engagement Manager, The British Library

@UKWebArchive

 

01 February 2018

A New Playback Tool for the UK Web Archive

We are delighted to announce that the UK Web Archive will be working with Rhizome to build a version of pywb (Python Wayback) that we hope will greatly improve the quality of playback for access to our archived content.

What is playback of a web archive?

When we archive the web, just downloading the content is not enough. Data can be copied from the web into an archive in a variety of ways, but making this archive actually accessible takes more than just opening downloaded files in a web browser. Technical details of pages and scripts coming out of the archive need to be presented in a way that enables them to work just like the originals, although they aren’t located on their actual servers anymore. Today’s web users have come to expect interactive features and dynamic layouts on all types of websites. Faithfully reproducing these behaviors in the archive has become an increasingly complex challenge, requiring web archive playback software that is on par with the evolution of the web as a whole.

Why change?

Currently, we use the OpenWayback playback system, originally developed by the Internet Archive. But in more recent years, Rhizome have led the development of a new playback engine, called pywb (Python Wayback). This Python toolkit for accessing web archives is part of the Webrecorder project, and provides a modern and powerful alternative implementation that is being run as an open source project. This has led to rapid adoption of pywb, as the toolkit is already being used by the Portuguese Web Archive, perma.cc, the UK National Archives, the UK Parliamentary Archive, and a number of others.

Open development
To meet our needs we need to modify pywb, but as strong believers in open source development, all work will be in the open, and wherever appropriate, we will fold the improvements back into the core pywb project.

If all goes to plan, we expect to contribute the following back to pywb for others to use:

Other UKWA-specific changes, like theming, implementing our Legal Deposit restrictions, and deployment support, will be maintained separately.

Initially we will work with Rhizome to ensure our staff and curators can access our archived material via both pywb and OpenWayback. If the new playback tool performs as expected, we will move towards using pywb to support public access to all our web archives.