UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

30 March 2020

UKWA: What's available when the reading rooms are closed?

By Jason Webber, Web Archiving Engagement Manager, The British Library

Like many public places at the moment, the reading rooms of the UK Legal Deposit Libraries are going to be shut for some time. What does this mean for the UK Web Archive? Well, as some of you might know, we try to collect every UK website at least once a year, and this is done under the provision of the Non-Print Legal Deposit Regulations 2013. A condition of these regulations is that content collected can only be viewed on library premises. Never fear though, we still have lots for you to do!

UKWA website home page

Discover millions of websites
At the end of 2018, we launched our new service for searching the whole UK Web Archive catalogue from anywhere. Go to our website, www.webarchive.org.uk, and search for a web address (URL) or a word or phrase. You will get results showing all of our resources that you can access from anywhere. Tick the box 'At Libraries' to see everything that we have collected.

Access thousands of websites
Over the 15 years that we have been archiving websites, we have frequently sought permission from owners to make their sites publicly viewable outside of library premises. In that time we have received permissions from over 15,000 website owners. These websites have been selected because they relate to specific topics or events, for their importance, or because they were about to go offline. Lots to see!

Browse 'Topics and Themes'
You can browse over 100 different topics and themes. From the extensive 'Brexit' collection to 'Web Comics', there is something for everyone. As a starter, check out 'Online Enthusiasts' and discover many of the hobbies and societies in the UK.

SHINE service
The UK Web Archive holds a collection of all the .uk websites that were archived by the Internet Archive between 1996 and 2013. The service includes a 'trends' feature that we highly recommend you try: https://www.webarchive.org.uk/shine/graph

You can enter a word or phrase (in speech marks) to see its relative popularity in a given year. Enter different terms separated by a comma to compare their popularity, e.g. tom,jane. See which is 'best', cat or dog, or trace the emergence of words such as 'iphone' and 'emoji' or phrases such as 'credit crunch'.

Do tell us what you find!

Trends - the use of the term 'loungewear' in the UK web

Nominate websites
We are still able to add websites to the archive and welcome nominations! We want to archive every single UK website and your help is invaluable. Make your suggestions here: www.webarchive.org.uk/nominate

Stay safe everyone!

@ukwebarchive

23 March 2020

Boris Johnson, fertility and the royal baby: how far does the concept of Olympic Legacy go?

By Caio Mello, Doctoral Researcher at the School of Advanced Study, University of London

Recently, exploring the data available in the UK Web Archive related to London’s 2012 Olympic legacy, I found a very curious fact. Boris Johnson - Mayor of London during the games - told the BBC in 2013 that a baby boom in London that year was among the legacies of the event. According to Johnson, his team at City Hall had looked at the data and found a rise in birth rates that year not seen in the capital since 1967, the year after England won the FIFA World Cup. Moreover, Johnson said that even Kate Middleton and Prince William’s first baby could be considered a post-Olympic outcome.

In a recent blog post I briefly introduced my research on media coverage of the Rio and London Olympics and discussed the wide range of attributes to which the word legacy has been attached. As part of a Digital Humanities project, this study seeks to develop an interdisciplinary approach to the topic, combining both qualitative and quantitative methods. In order to do that, I have been looking at different repositories of news articles, including web archives. In addition to accessing content available in the UK Web Archive by going to the British Library, I have also searched for news articles through SHINE, a tool developed as part of the Big UK Domain Data for the Arts and Humanities project, to explore UK web content collected and stored by the Internet Archive. SHINE offers access to an open data repository that has allowed me to conduct multiple studies by writing Python scripts that return language and textual analysis. By scraping some of the news articles from SHINE and analysing word frequency in an initial exploratory study, I have been able to get a sense of how broad the concept of legacy might be.
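
The word-frequency step mentioned above can be sketched roughly as follows. This is a minimal illustration only, not the scripts used in the research: the sample sentences, the tiny stop-word list and the function name `word_frequencies` are all hypothetical, and a real study would work on article text gathered from the archive.

```python
import re
from collections import Counter

# Hypothetical stand-ins for article text; not real data from SHINE.
ARTICLES = [
    "The Olympic legacy includes new stadiums and a legacy of regeneration.",
    "Critics say the legacy of the Games is gentrification, not regeneration.",
]

# A tiny illustrative stop-word list (an assumption; real studies use fuller lists).
STOP_WORDS = {"the", "a", "and", "of", "is", "not", "say", "includes"}


def word_frequencies(texts, stop_words=STOP_WORDS):
    """Count how often each non-stop word appears across all texts."""
    counts = Counter()
    for text in texts:
        # Lower-case the text and keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts


freqs = word_frequencies(ARTICLES)
print(freqs.most_common(2))
```

Even on two sentences, 'legacy' and 'regeneration' rise to the top, which hints at how a larger corpus can show the range of things the word legacy is attached to.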

Although often concerned with what could be described as ‘material legacy’, the articles and reportage go beyond physical infrastructure - such as stadiums - to describe expectations that more people will practise sports or even that a country might be more strongly recognised as open and welcoming. Legacy definitely seems to carry a very positive meaning per se and, when it refers to negative outcomes, it often seems to flirt with irony. On the other hand, words like gentrification appear in very dubious contexts: sometimes they refer to regeneration and development, at other times to an unsustainable process that leads to people’s exclusion from traditional areas affected by this sort of transformation.

While Johnson’s reference to the baby boom can be understood as a joke, it reveals how obsessive politicians can become in using the official narrative of Olympic legacy as it relates to their particular country or host city. A good legacy, as pointed out by MacRury and Poynter in Olympic Cities: 2012 and the remaking of London, is fundamentally important for managing tensions between Olympic dreams and huge economic investment.

In December 2019, Boris Johnson, now the Prime Minister, tweeted his desire to host the football World Cup in 2030. “I want it to show our national confidence as we get Brexit done”, wrote Johnson. Once again, immaterial legacy emerges at the heart of a political argument defending the choice to participate in such mega-events. Looking deeply into these multiple dimensions of legacy seems to be an important step to understand, through language usage, how narratives have been built around the Olympics and how different actors have appropriated the concept and disputed its meanings.

13 March 2020

Theseus' Data Store

By Andy Jackson, Web Archiving Technical Lead, The British Library

My father used to joke about how he’d had his hammer his whole working life. He’d bought it when he’d started out as a joiner, and decades later it was still going strong. He’d replaced the shaft five times and the head twice, but it was still the same hammer! This simple story of maintenance and renewal springs to mind because a few days ago, we finally managed to replace the most important component of our data store. Our storage cluster has been running near-continuously for almost a decade, but as of now, every single hardware component has been renewed or replaced.

Claw Hammer
Andy's Dad's hammer

We use Apache Hadoop to provide our main data store, via the Hadoop Distributed File System (HDFS). We mostly like it because it's cheap, robust, and helps us run large-scale analysis and indexing tasks over our content. But we also like it because of how we can maintain it over time.

HDFS runs across multiple computers, all working together to ensure there are at least three copies of any data stored in the system, and that these copies are in separate machines and separate server racks. It runs like a beehive. The 'queen' is called the Namenode, and although it doesn't store any data, it keeps track of where all the data is and orchestrates the ingest and replication processes. The 'worker' nodes just store and maintain their own blocks of data, and send data back and forth between themselves as instructed by the Namenode. The Namenode also provides the interface we use to access the system, referring each client to the right set of worker nodes as files are accessed. All the time, the system calculates checksums of the chunks of data and uses them to verify the integrity of the files.
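
To make the division of labour concrete, here is a toy Python sketch of the idea: a 'queen' that records where copies live and what their checksums are, without holding the data itself. It is not how HDFS is implemented. The node and rack names, the `ToyNamenode` class, the simple placement rule and the use of SHA-256 are all assumptions made for illustration; HDFS has its own placement policy and checksum scheme.

```python
import hashlib

REPLICATION = 3  # the post says at least three copies are kept

# Hypothetical worker nodes mapped to the rack each one sits in.
NODE_RACKS = {
    "worker-1": "rack-a",
    "worker-2": "rack-a",
    "worker-3": "rack-b",
    "worker-4": "rack-b",
    "worker-5": "rack-c",
}


def choose_nodes(node_racks, copies=REPLICATION):
    """Pick nodes for a block, preferring racks that are not used yet."""
    chosen, used_racks = [], set()
    # First pass: take at most one node from each rack.
    for node, rack in sorted(node_racks.items()):
        if len(chosen) < copies and rack not in used_racks:
            chosen.append(node)
            used_racks.add(rack)
    # Second pass: if there are fewer racks than copies, fill from unused nodes.
    for node in sorted(node_racks):
        if len(chosen) < copies and node not in chosen:
            chosen.append(node)
    return chosen


def checksum(data: bytes) -> str:
    """Fingerprint a chunk of data (SHA-256 is an illustrative choice)."""
    return hashlib.sha256(data).hexdigest()


class ToyNamenode:
    """Tracks where blocks live and their checksums; stores no data itself."""

    def __init__(self, node_racks):
        self.node_racks = node_racks
        self.block_map = {}  # block id -> (list of nodes, checksum)

    def register_block(self, block_id, data):
        nodes = choose_nodes(self.node_racks)
        self.block_map[block_id] = (nodes, checksum(data))
        return nodes

    def verify(self, block_id, data):
        """Return True if the data still matches the recorded checksum."""
        return self.block_map[block_id][1] == checksum(data)


namenode = ToyNamenode(NODE_RACKS)
placed = namenode.register_block("blk-1", b"archived web page")
print(placed)
print(namenode.verify("blk-1", b"archived web page"))
```

If a copy is damaged, its checksum no longer matches the recorded one, which is the cue to discard it and make a fresh copy from a healthy replica.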

This architecture was designed to anticipate hardware failure and recover from it, which makes the system much easier to maintain. If a drive, or even a full server fails, we can simply remove it, replace it, and keep an eye on it as the data is re-distributed. As new, higher-capacity drives come along, we can upgrade the drives in each node one-by-one, in a rolling update that grows the capacity of the cluster.
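
The recovery step can be pictured with another small sketch: when a node disappears, any block that has dropped below three copies is topped up from the remaining nodes. Again this is a simplified illustration under stated assumptions, not HDFS code. The block and node names and the functions `remove_node` and `re_replicate` are hypothetical, and rack awareness is left out to keep it short.

```python
REPLICATION = 3  # target number of copies per block


def remove_node(block_map, failed):
    """Drop a failed node from every block's list of replicas."""
    return {
        block: [n for n in nodes if n != failed]
        for block, nodes in block_map.items()
    }


def re_replicate(block_map, live_nodes, copies=REPLICATION):
    """Top up each block to `copies` replicas using live nodes that lack it."""
    repaired = {}
    for block, nodes in block_map.items():
        nodes = list(nodes)
        for candidate in sorted(live_nodes):
            if len(nodes) >= copies:
                break
            if candidate not in nodes:
                nodes.append(candidate)
        repaired[block] = nodes
    return repaired


# Hypothetical cluster state: which workers hold each block.
block_map = {
    "blk-1": ["worker-1", "worker-2", "worker-3"],
    "blk-2": ["worker-2", "worker-3", "worker-4"],
}
live = {"worker-1", "worker-3", "worker-4"}

after_failure = remove_node(block_map, "worker-2")
healed = re_replicate(after_failure, live)
print(healed)
```

A rolling drive upgrade follows the same pattern: take one node out, let the copies be restored elsewhere, bring the node back with more capacity, and move on to the next.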

Rear of the UKWA racks

Similarly, over time, we can upgrade the operating system and other supporting software on every node, to make sure we're up to date. Almost all of this can be done while the system is running, without interrupting access. But the exception is the Namenode – as a hive needs its Queen, HDFS needs its Namenode, so we avoid interrupting it unless absolutely required. It had been running on the same hardware all this time, but now it's happily running on a new bit of kit. At last.

Like the Ship of Theseus, every piece has been replaced, but it's still the same store, and the data is still safe. Of course, it's not as easy to manage and as transparently scaleable as Cloud storage, but for on-site storage it does a great job. Rather than having to shift between storage silos every few years, the data is in constant motion, and this design allows the components and support contracts for the different layers to move at different speeds and rates of renewal over the years. This is one of the advantages of open source systems – they can provide a stable interface for a service, decoupled from any particular vendor or hardware, allowing support methods, contracts and contractors to change over time.

But HDFS has strong competition these days. There are many other options, many of which are compatible with the de facto standard, S3 (Simple Storage Service). Being able to work with the same interface whether storage is local or in the cloud might make all the difference. We're happy with HDFS for now, but we'll be preparing for the day when a new ship comes alongside and it's time to shift the cargo...