THE BRITISH LIBRARY

UK Web Archive blog

63 posts categorized "Web/Tech"

24 June 2020

Our new Science web archive collection


 
By Philip Eagle, Subject Librarian - Science, Technology and Medicine at The British Library
 
 
Air pump CC0
A Philosopher Shewing an Experiment on the Air Pump, 1769 by Valentine Green

 

Introduction

We have just activated our new web archive collection on science in the UK. One of the British Library's objectives as an institution is to increase our profile and level of service to the science community. In pursuit of this aim we are curating a web archive collection in collaboration with the UK legal deposit libraries. We already have some collections on science-related subjects, such as the late Stephen Hawking and science at Cambridge University, but not on science as a whole.

 

Collection scope

We have interpreted "science" widely to include engineering and communications, but not IT, as that already has a collection. Our collection is arranged according to the standard disciplines such as biology, chemistry, engineering, earth sciences and physics, and then subdivided according to their common divisions, based on the treatment of science in the Universal Decimal Classification.

The collection has a wide range of types of site. We have tried to be fairly exhaustive on active UK science-related blogs, learned societies, charities, pressure groups, and museums. Because of the sheer number of university departments in the UK, we have not been able to cover them all. Instead we have selected the departments that did best in the 2014 Research Excellence Framework, and then taken a random sample to make sure that our collection properly reflects the whole world of academic science in the UK. We are also adding science-related Twitter accounts. Social media is generally difficult to archive due to its proprietary nature, but public Twitter accounts are openly accessible on the web, so we can archive them more easily.

 

Access

Under the Non-Print Legal Deposit Regulations 2013 we can archive UK websites, but we can only make them available to people outside the Legal Deposit Libraries' Reading Rooms if the website owner has given permission. Some of the sites in the collection have already had permission granted, such as the Hunterian Society, Dame Athene Donald’s blog, and the Royal College of Anaesthetists. Others that have not given permission include Science Sparks, the Wellcome Collection, and the British Pregnancy Advisory Service. The Web Archive page will tell you whether an archived site is only viewable from a library; anything with no statement can be viewed on the public web.


Get involved

As ever, if you have a site to nominate that has been left out, you can tell us by filling in our public nomination form: https://www.webarchive.org.uk/ukwa/info/nominate

23 June 2020

WARCnet and the UK Web Archive


By Jason Webber, Web Archiving Engagement Manager

 

We at the UK Web Archive (UKWA) have recently taken part in a new initiative called WARCnet led by the University of Aarhus in Denmark (and funded by Independent Research Fund Denmark).

“The aim of the WARCnet network is to promote high-quality national and transnational research that will help us to understand the history of (trans)national web domains and of transnational events on the web, drawing on the increasingly important digital cultural heritage held in national web archives.”

 

WARCnet logo

 

The majority of participants are researchers currently using web archives as part of their studies, many with extensive experience and others new to the field. This makes it an exciting project to be part of, as it is an excellent way for content holders such as UKWA to work closely with a group of researchers and try to understand their needs and challenges. The project had a kick-off meeting in May 2020 that was originally intended to be in person but took place virtually. All the speakers pre-recorded their talks, which means that these are now all available (including one by myself). I’d particularly recommend viewing the two keynote speakers, Matthew S. Weber and Ian Milligan.

 

Title slide for Jason Webber's WARCnet presentation

 

Working Groups
It is intended for any outcomes from WARCnet to be driven by the participants themselves and to this end four working groups have been formed:

 

  • Working Group 1 - Comparing entire web domains
  • Working Group 2 - Analysing transnational events
  • Working group 3 - Digital research methods and tools
  • Working group 4 - Research data management across borders

 

The UKWA team is involved with each of the first three working groups, all of which have met in the last weeks to see how we can take this project forward. You can read more about each group here.

There are at least three more small conferences planned (currently as in person), one later this year in Luxembourg and two next year in London and Aarhus.

Look out for updates on our involvement with this initiative on this blog and on Twitter via @UKWebArchive and @WARC_net.

08 June 2020

Documenting the Olympics & Paralympics


 
 
Olympic Stamps
Stamps issued by Greece in 1896, the Universal Postal Union Collection, Philatelic Collections, The British Library.

 

Join our panel discussion to discover more about researchers' experiences when navigating archives, as well as the collection policies related to Olympics/Paralympics of GLAM organisations. This event is a collaboration between the British Society of Sports History (BSSH) and the British Library Web Archive team.

 

Register here to receive the joining details:

https://forms.gle/Tjzikxgjvr3FofSr8 

Date:           19 June 2020

Time:          3-4:30pm (BST) / 10-11:30am (EST)

Location:    Zoom

Twitter hashtag: #ResearchingtheGames

 

Presentations

Heather Dichter, De Montfort University - Finding Olympic history in non-sport archives

Laura Alexandra Brown, Northumbria University - The heritage of the Games: Interpreting urban change in Olympic host cities

Robert McNicol, Librarian, Wimbledon Lawn Tennis Museum - Researching the Olympics/Paralympics at Wimbledon

Helena Byrne, Curator of Web Archives, British Library - Preserving the Olympics/Paralympics online

 

What to expect

A broad mix of physical, digitised and born-digital resources will be covered in the presentations. The Curator of Web Archives, Helena Byrne, will be discussing the UK Web Archive collections related to the Olympics/Paralympics as well as the collaboration with the International Internet Preservation Consortium (IIPC).

The year 2020 was originally an Olympic/Paralympic year before the outbreak of the coronavirus pandemic. It is also a significant milestone for the UK Web Archive and the IIPC. It marks 15 years since the first UK Web Archive collections were published and also 10 years since the IIPC first started archiving the Olympics.

 

UKWA Sports
https://www.webarchive.org.uk/en/ukwa/collection

 

The UK Web Archive and sports

The UK Web Archive has been archiving sports-related websites since it was established in 2005. However, it wasn’t until 2017 that dedicated sports collections were established. There are three broad collection groups: Sports Collection, Sports: Football and Sports: International Events. The subsections of Sports: International Events include two summer and two winter Olympic/Paralympic collections from 2010, 2012, 2014 and 2016. The largest of these is the Olympic & Paralympic Games 2012 collection, as the Games were hosted in the UK.

 

Access and reuse

Under the Non-Print Legal Deposit Regulations 2013 (NPLD) access to archived content is restricted to a UK legal deposit library reading room. However, if we have permission from the website owner, we can make the archived version of their content open access along with government publications under the Open Government Licence. This is why if you browse through the collections on our website, most of the links to archived content will direct you to one of the UK legal deposit libraries for access but some of the content you can view from your personal device.

 

IIPC and the Olympic/Paralympics

The UK Web Archive is made up of the six UK legal deposit libraries. Two of those libraries, the British Library and the National Library of Scotland, are also members of the International Internet Preservation Consortium (IIPC), which was founded in 2003. In 2010 the IIPC started its first collaborative collection on the Winter Olympics 2010 and has covered every Olympic/Paralympic Games since. Since the formation of the IIPC Content Development Group (CDG) the collections have started to include a broader range of subjects on and off the playing field.

 

Get Involved

The UK Web Archive aims to archive, preserve and give access to the entire UK web space.

If you see content that should be included in one of our sports collections then please fill in our online nomination form.

29 May 2020

Using Webrecorder to archive UK political party leaders' social media after the UK General Election 2019


This blog post is by Nicola Bingham, Helena Byrne, Carlos Lelkes-Rarugal and Giulia Carla Rossi

Introduction to Webrecorder

The UK Web Archive aims to capture the whole of the UK web space at least once a year, and targeted websites at more frequent intervals. We conduct this activity under the auspices of the Legal Deposit Regulations 2013 which enable us to capture, preserve and make accessible the UK Web for the benefit of researchers now and in the future.

Along with many cultural and heritage institutions that perform at-scale web archiving, we use Heritrix 3, the state of the art crawler developed by the Internet Archive and maintained and improved by an international community of web archiving technologists.

Heritrix copes very well with large scale, bulk crawling but is not optimised for high fidelity crawling of dynamic content, and in particular does not archive social media content very well.

Researchers are increasingly turning their attention to social media as a significant witness to our times, therefore we have a requirement to capture this content, in certain circumstances and in line with our collection development policy. Usually this will be around public events such as General Elections where much of the campaigning over recent years has been played out online and increasingly on social media in particular. 

For this reason we have looked at alternative web archiving tools such as Webrecorder to complement our existing toolset. 

Webrecorder was developed by Ilya Kreymer under the auspices of Rhizome (a non-profit organisation based in New York which commissions, presents and preserves digital art), as part of its digital preservation program. It is available as a browser-based version, which offers free accounts with up to 5GB of storage, and as a Desktop App.

Webrecorder was already well known to us at the UK Web Archive although we had not used it until recently. It is a web archiving service which creates an interactive copy of web pages that the user explores in their browser, including content revealed by interactions such as playing video and audio, scrolling, clicking buttons etc. This is a much more sophisticated method of acquisition than that used by Heritrix, which essentially only follows HTML links and doesn’t handle dynamic content very well.


What we planned to do

The UK General Election Campaign ran from the 6th of November 2019 when Parliament was dissolved, until polling day on the 12th of December 2019. On the 13th of December 2019 the UK Web Archive team, based at the British Library attempted to archive various social media accounts of the main political party leaders. Seventeen political leaders from the four home nations were identified and a selection of three social media accounts were targeted: Twitter, Facebook and Instagram. Not all leaders have accounts on all three platforms, but in total forty four social media accounts were archived. These accounts are identified in the table below by an X. 

List of UK political party leaders' social media accounts archived
Image credit: Carlos Lelkes-Rarugal

 

 

How we did it

On the 13th of December 2019 we ran the Webrecorder Desktop App across twelve office PCs. Many were running the Webrecorder Autopilot function over the accounts, but we had mixed success, in that not all accounts captured the same amount of data. As the Autopilot functionality didn’t work well on all accounts, a combination of automated and manual capture processes was used where necessary. It took the team a lot longer than expected to archive the accounts, so some were archived on a range of dates the following week.

 

Large political parties’ vs smaller parties’ social media accounts

The leaders of the two largest political parties, Jeremy Corbyn and Boris Johnson, have many more social media followers than the other home nations' party leaders. This meant that it was more difficult to get a comprehensive capture of Corbyn’s and Johnson’s Twitter accounts than, for example, Arlene Foster’s. The more popular Twitter accounts took many hours to crawl; Corbyn’s took almost ten hours to archive thirteen days’ worth of Tweets (which only took us up to 1st December).

 

Technical Issues

We experienced several technical issues with crawling, mainly concerning IP addresses, the app crashing, and Autopilot working on some computers and not others. It was hard to get the app restarted after it crashed, so some time was lost when this happened. Different computers with the same specs ran differently. The Autopilot captures for Jeremy Corbyn’s and Boris Johnson’s Twitter accounts were started at the same time, but Corbyn’s ran uninterrupted while Johnson’s crashed when it reached 475 MB. Although Corbyn’s account was crawled for nearly ten hours, it only collected 93 MB of data. In contrast, Nigel Farage’s Twitter page was crawled for just over four hours and produced 506 MB. It is important to check the size of crawled data, as the number of hours the Webrecorder Desktop App runs on Autopilot does not necessarily translate into a high fidelity crawl.
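To make the comparison above concrete, here is a minimal sketch that turns the quoted figures into an average capture rate. The durations are rounded from "nearly ten hours" and "over four hours", so the rates are only approximate, and the function and variable names are illustrative rather than part of Webrecorder.

```python
# Figures quoted in the post. The hours are rounded approximations
# ("nearly ten hours", "over four hours"), so the rates are indicative only.
captures = {
    "Corbyn (Twitter)": {"hours": 10.0, "megabytes": 93.0},
    "Farage (Twitter)": {"hours": 4.0, "megabytes": 506.0},
}


def mb_per_hour(megabytes: float, hours: float) -> float:
    """Average amount of data captured per hour of crawling."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return megabytes / hours


rates = {
    name: mb_per_hour(c["megabytes"], c["hours"]) for name, c in captures.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} MB/hour")
```

On these rounded figures the second capture gathered data more than ten times faster than the first, which is why running time alone says little about how complete a crawl is.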

 

Added complications when using multiple devices with the same user profile:

Complications arose mainly from the auditing and collating of WARC files; performing QA and keeping track of which jobs were successful and those that were not. 

Initially, all participants in this project had planned to use their own work PC or work laptop and a local desktop installation of Webrecorder. However, an hour or so into the process (early in the day), it became apparent that there would not be enough time to archive all of the social media accounts within our time frame, given the volume of social media accounts and the unanticipated time it would take to archive each one. For example, it took one instance of a desktop Webrecorder application almost ten hours to archive Jeremy Corbyn’s Twitter account (only able to capture Tweets up to a month prior to the day of archiving).

It was then decided that we could potentially, and experimentally, run multiple parallel Webrecorder applications across a number of office desktop PCs; PCs that were free and available for us to use. This was possible because of the IT Architecture in place, allowing users to log into any office machine with the correct credentials and making their personal desktop load up along with all their files and user settings, regardless of the PC they log into. 

The British Library’s IT system, which incorporates a lot of the Windows ecosystem, gives each user their own dedicated central work directory where they are given a virtual hard drive and their own storage space for all their documents and any other work-related files. This allowed one user to be logged into several office PCs at the same time and therefore to run a separate desktop Webrecorder application on each machine. This was indeed very helpful as it allowed each machine to focus on one particular social media account, which in many cases took hours to archive.

Having multiple Webrecorder jobs greatly increased our capacity to archive by removing the previous bottleneck of one Webrecorder job per user. Instead, this was increased to several Webrecorder jobs per user.

Work flow of gathering WARC files from Webrecorder
Image credit: Carlos Lelkes-Rarugal

 

 

Having multiple Webrecorder jobs added complications down the line, not necessarily impacting the archiving process but rather complicating the auditing and collating of WARC files. When a user had several Webrecorder jobs running concurrently, each job would still be downloading to the same user work directory (the user’s virtual hard drive). So if a user had many parallel jobs running, this would create multiple WARC files in the same folder (but with different names, so no clashes), with WARC files being produced by the different desktop PCs that the user had logged in to. This was quite an elaborate setup because once a job had completed, the entire contents of the Webrecorder folder (where the WARCs were stored) were copied to a USB drive so that an initial Quality Assurance (QA) check could be performed on the completed job on a more capable laptop. The difficulty was in finding the WARC file that corresponded to the completed job, which was somewhat convoluted as there would have been multiple WARC files with this type of file-naming convention:

 “rec-20191213100335021576-DESKTOP-AOCGH38-7B5SEXKS.warc.gz”. 

As you can imagine, taking a copy of Webrecorder’s folder contents not only has the completed job, but also the instances of other WARC files from other incomplete jobs. Coupled with multiple jobs per PC, and multiple PCs per user; keeping track of what had completed and which WARCs were either corrupted or not up to standard, was quite demanding. 
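One way to ease this kind of auditing is to sort the WARC files by the machine that produced them. The sketch below is illustrative only and is not the workflow the team used. It assumes, based solely on the single example filename above, that names follow the pattern rec-<20-digit timestamp>-<machine name>-<random ID>.warc.gz, and that the last six digits of the timestamp are microseconds. The function names are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, NamedTuple, Optional

# Assumed layout, inferred from one example filename only:
#   rec-<20-digit timestamp>-<machine name>-<random id>.warc.gz
WARC_NAME = re.compile(r"^rec-(\d{20})-(.+)-([A-Z0-9]+)\.warc\.gz$")


class WarcName(NamedTuple):
    started: datetime  # when the recording file was created (assumed meaning)
    machine: str       # name of the PC that wrote the file (assumed meaning)
    token: str         # trailing identifier, treated here as opaque


def parse_warc_name(filename: str) -> Optional[WarcName]:
    """Split a Webrecorder-style WARC filename into its assumed parts."""
    match = WARC_NAME.match(filename)
    if match is None:
        return None
    stamp, machine, token = match.groups()
    # First 14 digits: YYYYMMDDHHMMSS. Last 6 digits: assumed microseconds.
    started = datetime.strptime(stamp[:14], "%Y%m%d%H%M%S")
    started = started.replace(microsecond=int(stamp[14:]))
    return WarcName(started, machine, token)


def group_by_machine(filenames: List[str]) -> Dict[str, List[str]]:
    """Group recognised WARC filenames by machine, oldest first."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in filenames:
        parsed = parse_warc_name(name)
        if parsed is not None:
            groups[parsed.machine].append(name)
    return {machine: sorted(names) for machine, names in groups.items()}


example = "rec-20191213100335021576-DESKTOP-AOCGH38-7B5SEXKS.warc.gz"
parsed = parse_warc_name(example)
print(parsed)
```

Keeping a simple log of which account was started on which machine and at what time would then let a completed job be matched to its file by machine name and start time.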

 


Review of the data collected 

File size of data collected from UK political party leaders' social media accounts
Image credit: Carlos Lelkes-Rarugal

 

How to access this data

The archived social media accounts can be accessed through the UK General Election 2019 collection in a UK Legal Deposit Library Reading Room. The UK Legal Deposit Libraries are the British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries, Cambridge University Libraries and Trinity College Dublin Library.

The 2019 collection is part of a time series of UK General Election collections dating from 2005. They can be accessed over the Internet on the Topics and Themes page of the UK Web Archive website. All the party leaders' social media accounts are tagged into the subsection UK Party Leaders Social Media Accounts (access to individual websites depends on whether we have an additional permission to allow ‘open’ access). More information about what is included in the UK General Election 2019 collection is available through the UK Web Archive blog.

 

Conclusion


Overall, undertaking this experiment was an interesting experience for our small team of British Library Web Archive Curators. Many valuable lessons were learnt on how best to utilise Webrecorder in our current practice. The major takeaway was that it was a lot more time consuming than we expected. Instead of taking up one working day, it took nearly a whole week to archive our targeted social media accounts with Webrecorder. Our usual practice is to archive social media accounts with the Heritrix crawler, which works reasonably well with Twitter but is less suited to capturing other platforms. For a long time, we were unable to capture any Facebook content with Heritrix, mainly due to the platform’s publishing model; however, the way the platform is published has changed recently, allowing us limited success. Archiving social media will always remain challenging for the UK Web Archive, for myriad technical, ethical and legal reasons. The sheer scale of the UK’s social media output is too large for us to capture adequately (and indeed, this may not even be desirable) and certainly too large a task for us to tackle with manual, high fidelity tools such as Webrecorder. However, our recent experience during the 2019 UK General Election has convinced us that using Webrecorder to capture significant events is a worthwhile exercise, as long as we target selected, in-scope accounts on a case-by-case basis.

 

24 April 2020

Harnessing the Crowd: Coronavirus Topical Collection at the UK Web Archive


By Nicola Bingham, Lead Curator of Web Archiving, The British Library

Note: This post was originally published on the Digital Preservation Coalition (DPC) blog.

The UK Web Archive, a partnership of the 6 UK Legal Deposit Libraries* (LDLs), has been collecting UK websites since the early 2000s. As well as archiving snapshots of the whole UK Web Space we have dozens of curated collections focussing on a wide range of topics, themes and events reflecting all aspects of UK life.

Collections are instigated by a broad range of curators – in this context, ‘curator’ is not necessarily synonymous with job title - including LDL staff, academic researchers, various UK GLAM organisations (e.g., Jersey Heritage, Hampshire Archives and Local Studies, Wimbledon Lawn Tennis Museum) and local community groups. Collections may focus on a researcher’s area of interest, align with an institution’s collection policy or reflect diverse political, sporting or topical events such as the London Olympic Games, Brexit or Climate Change. Below are the members of the Web Archiving team at the British Library.

UK Web Archive Team

We have a particularly strong time-series of collections focusing on UK General Elections, having archived every campaign since 2005. For each event we have used more or less the same categories: candidates’ web presence, national and local political party websites, online news and commentary, interest group manifestos, and comment and analysis by think tanks.

Structuring the collections with consistent sub-categories enables curators to distribute web archiving more efficiently, as does dividing selection broadly along the lines of the geographical interest of the 3 National Libraries that belong to the UKWA.

We hope that our General Election collections will preserve the voices and illustrate the concerns and priorities of a wide spectrum of UK society and help to show how political parties and candidates engaged and responded at pivotal moments in UK history.

It is interesting to note how use of the Internet for political campaigning and communication has evolved over time. In 2005 very little social media existed and politicians were just beginning to explore its capabilities, whereas by 2019 campaigners were making little or no use of websites, concentrating almost exclusively on using social media.

The (somewhat) scheduled nature of UK General Elections, especially since the Fixed-term Parliaments Act of 2011, allows us to plan election web archiving strategies ahead of time. Having said this, we have been tested in recent years with snap elections in June 2017 and December 2019! And of course candidates are only announced a couple of weeks before polling day, which means we have to react at that point to archive candidates’ websites or official, publicly facing social media accounts.

Rapidly unfolding events such as natural disasters or terrorist attacks require a different approach. However, even here we have some experience, having archived collections about the London Terrorist Attack 2005, Grenfell Tower Fire, and Pandemic Outbreaks such as Avian Flu and Swine Flu over the years.

For the past few weeks we have been actively collecting the UK perspective of the Coronavirus (COVID-19) Pandemic. We are clearly facing one of the severest threats in our lifetimes, certainly one of the fastest and most clearly devastating, and while Librarians might not (yet) be members of the Emergency Services, we feel the act of recording the outbreak as it plays out online is a crucial one.

Websites are being selected by a cohort of curators across the LDLs and beyond. We have also been ably assisted by colleagues at the Royal College of Nursing Archives who are nominating health-related websites. However, due to the unpredictable, fast-paced nature of the outbreak and the consequent deluge of online information, it is more important for us to harness the crowd to elicit website nominations. For this reason, we will canvass for website nominations much more widely among our colleagues, the library and archive community and the general public when responding to rapidly unfolding events. We will also visit targeted websites much more frequently than we would usually to capture frequently edited web content.

The collection is not public yet while we concentrate on acquiring the websites. Once we’re finished, it will take time to prepare the collection for publication by performing quality assurance and clearing permissions for open access. In due course, the Coronavirus collection will be available here under the Pandemic Outbreaks Collection. The top-level heading reflects the fact that we have previously collected around Avian Flu and Swine Flu and acknowledges that, sadly, we will be collecting about future outbreaks.

UKWA_PandemicOutbreak_Collection_Screenshot

In terms of getting involved, we welcome submissions from colleagues in the DPC community - and in fact from any member of the public. Details of how to nominate websites for inclusion are here: www.webarchive.org.uk/nominate. Alternatively, please email nominations to web-archivist@bl.uk.

We’re also working on an international collection with the International Internet Preservation Consortium (IIPC). Details of how to contribute to this collection are here: netpreserveblog.wordpress.com/2020/02/13/cdg-collection-novel-coronavirus/ (non-English language websites are particularly welcome here).

If your organisation has not previously done any web archiving and you would like to capture your own institution’s or communities’ response to Coronavirus, plenty of tools exist that can be used remotely. Webrecorder is a good place to start as it can be used in a browser, free of charge up to a 5GB data limit. Of course web archives such as the UKWA and Internet Archive would also be very happy to preserve your websites free of charge (see details above).

*The UK Legal Deposit Libraries: Bodleian Libraries, Oxford University, British Library, Cambridge University Libraries, National Library of Scotland, National Library of Wales, Trinity College, Dublin

15 April 2020

Adding Poetry Websites to the UK Web Archive


By Pete Hebden, PhD Student Placement, Newcastle University

One of the great features of the UK Web Archive is its series of curated collections, which can be found on the UKWA Topics and Themes page. Each collection centres on a specific topic, some responding to particular events, such as Brexit or the First World War Centenary, others drawing upon the knowledge of contributors to create a set of in-depth examples around a particular subject. During my time at the British Library, I spent some time contributing to the Poetry Zines and Journals collection, originally started in 2016 by previous PhD placement student Joe McCarthy. The collection contains an amazing range of UK-based online outlets for poetry, encompassing blogs, Twitter accounts, online journals, and the personal websites of some poets, where there is a significant amount of the poet’s creative work on the site.

Poetry writing

 Although the collection was already very well curated when I came to it, it had not been significantly updated in several years, so many newer publications were not included in the collection. The past few years have seen a serious increase in the number of high-quality online literary journals – a trend that this collection was very astutely responding to when it was first created – and so there were a number of recent but well-established poetry titles that I could add to the list. One example is perverse, an online-only poetry journal started by the poet Chrissy Williams in 2018.

Along with roughly a third of the Poetry Zines and Journals collection, the archived version of perverse is only accessible on-site at the reading rooms of the UK’s legal deposit libraries. The rest of the content, for which open access permission has been obtained, can be viewed from anywhere. The Rialto and Porridge magazine are two examples of recently added sites that are open access, and the links here lead to the archived versions of those sites.

My other choices for inclusion in the collection were guided by some of my own specific areas of knowledge and interest. I included several online journals that are based in, or focus on, the north of England and Scotland, as these are literary scenes that I am more familiar with. Butcher’s Dog and Another North are two relatively recent literary journals based in the north of England. Another North is entirely digital, while Butcher’s Dog is a print journal with a strong online presence. I also added several more websites for print magazines that feature a significant amount of poetry on their site. For example, Popshot and The Rialto, both print magazines, regularly feature poetry from their most recent issues on their websites and/or social media, which gives readers an idea of the journal’s editorial policy and marks a significant change in the way that poetry is distributed by these publications thanks to the internet.

One interesting problem that I encountered during the process was around the formats that some digital publications use for distribution. While most online poetry journals choose to publish in a standard website or blog form, some distribute each issue as a downloadable file, such as a PDF or EPUB. This method of delivery presents a problem when attempting to archive the content, as the web crawler is not necessarily able to access and download these files, meaning that the poetry itself goes unrecorded. For these journals, we had to use alternative ways of recording their poetry in the collection. For example, perverse (mentioned above), as well as distributing each issue as a PDF download, also posts each poem individually to Twitter, and so we set up a regular capture of their Twitter account in order to record all of the poetry. Many other journals use social media in a similar way, and so in these cases I was able to use this as a way of archiving the journal’s output.
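As an illustration of how a curator might spot journals affected by this problem, the sketch below scans a page's HTML for links that point to PDF or EPUB files. It is a minimal example using only the Python standard library and is not the tool used for the collection. The sample HTML, the journal.example address and the function names are all made up for the example.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlparse

# File types mentioned in the post as being hard for a crawler to capture.
DOWNLOAD_EXTENSIONS = (".pdf", ".epub")


class DownloadLinkFinder(HTMLParser):
    """Collect absolute URLs of links that end in a download extension."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.downloads: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)
        # Compare the URL path only, ignoring case and any query string.
        if urlparse(absolute).path.lower().endswith(DOWNLOAD_EXTENSIONS):
            self.downloads.append(absolute)


def find_download_links(html: str, base_url: str) -> List[str]:
    """Return the PDF/EPUB links found in an HTML document."""
    finder = DownloadLinkFinder(base_url)
    finder.feed(html)
    return finder.downloads


# Hypothetical page content, for demonstration only.
sample = (
    '<p><a href="/issues/issue-1.pdf">Issue 1</a> '
    '<a href="/about">About</a> '
    '<a href="issue-2.EPUB">Issue 2</a></p>'
)
links = find_download_links(sample, "https://journal.example/issues/")
print(links)
```

A page that yields links like these is a candidate for an alternative capture route, such as the regular Twitter capture described above.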

Over the past few decades, the web has provided an exciting platform for a diverse range of poets and publishers to showcase their work and it has been a very enjoyable challenge to contribute to the cataloguing of this transformation. I hope that my work on the Poetry Zines and Journals collection will help other readers and researchers exploring the breadth and variety of UK poetry available online today. 

If you know of any websites that should be included in this collection or in the general UK Web Archive, please nominate them.

 

30 March 2020

UKWA: What's available when the reading rooms are closed?


By Jason Webber, Web Archiving Engagement Manager, The British Library

Like many public places at the moment, the reading rooms of the UK Legal Deposit Libraries are going to be shut for some time. What does this mean for the UK Web Archive? Well, as some of you might know, we try to collect every UK website at least once a year, and this is done under the provisions of the Non-Print Legal Deposit Regulations 2013. A condition of these regulations is that content collected can only be viewed on library premises. Never fear though, we still have lots for you to do!

UKWA website home page

Discover millions of websites
At the end of 2018, we launched our new service for searching the whole UK Web Archive catalogue from anywhere. Go to our website: www.webarchive.org.uk and search for a web address (URL) or word/phrase. You will get results showing all of our resources that you can access from anywhere. Tick the box 'At Libraries' to see everything that we have collected.

Access thousands of websites
Over the 15 years that we have been archiving websites we have frequently sought permission from owners to make their sites publicly viewable outside of library premises. In that time we have received permissions from over 15,000 website owners. These websites have been selected because they relate to a specific topic or event, for their importance, or because they were about to go offline. Lots to see!

Browse 'Topics and Themes'
You can browse over 100 different topics and themes. From the extensive 'Brexit' collection to 'Web Comics' there is something for everyone. As a starter, check out 'Online Enthusiasts' and discover many of the hobbies and societies in the UK.

SHINE service
The UK Web Archive holds a collection of all the .uk websites that were archived by the Internet Archive between 1996 and 2013. The service includes a 'trends' feature that we highly recommend that you try: https://www.webarchive.org.uk/shine/graph

You can enter a word or phrase (in speech marks) to see its relative popularity in a given year. Enter different terms separated by a comma and you can compare their popularity, e.g. tom,jane. See who is 'best', cat or dog, or track the emergence of words such as 'iphone' and 'emoji' or phrases such as 'credit crunch'.
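
For the curious, here is a rough idea of what "relative popularity in a given year" could mean. This is only a toy sketch and not how SHINE itself is built: the tiny corpus, the function names `trend` and `compare`, and the choice to measure the share of a year's pages that contain the term are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: (year, page text) pairs standing in for archived pages.
PAGES = [
    (2007, "the credit crunch hits the high street"),
    (2008, "credit crunch latest: banks under pressure"),
    (2008, "my cat learned a new trick"),
    (2009, "dog show results"),
    (2009, "cat and dog photos"),
]


def trend(phrase, pages):
    """Return {year: share of that year's pages that contain the phrase}."""
    totals = Counter()  # pages seen per year
    hits = Counter()    # pages containing the phrase per year
    needle = phrase.lower()
    for year, text in pages:
        totals[year] += 1
        # Plain substring match, so 'cat' would also match 'category'.
        if needle in text.lower():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


def compare(query, pages):
    """Split a comma-separated query such as 'cat,dog' and trend each term."""
    terms = [term.strip() for term in query.split(",")]
    return {term: trend(term, pages) for term in terms}


if __name__ == "__main__":
    for term, series in compare("cat,dog", PAGES).items():
        print(term, series)
```

Dividing by the number of pages collected in each year is what makes the figure relative: a term does not look more popular simply because more of the web was archived that year.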

Do tell us what you find!

Trends - the use of the term 'loungewear' in the UK web

Nominate websites
We are still able to add websites to the archive and welcome nominations! We want to archive every single UK website and your help is invaluable. Make your suggestions here: www.webarchive.org.uk/nominate

Stay safe everyone!

@ukwebarchive

13 March 2020

Theseus' Data Store

Add comment

By Andy Jackson, Web Archiving Technical Lead, The British Library

My father used to joke about how he’d had his hammer his whole working life. He’d bought it when he’d started out as a joiner, and decades later it was still going strong. He’d replaced the shaft five times and the head twice, but it was still the same hammer! This simple story of maintenance and renewal springs to mind because a few days ago, we finally managed to replace the most important component of our data store. Our storage cluster has been running near-continuously for almost a decade, but as of now, every single hardware component has been renewed or replaced.

Claw Hammer
Andy's Dad's hammer

We use Apache Hadoop to provide our main data store, via the Hadoop Distributed File System (HDFS). We mostly like it because it's cheap, robust, and helps us run large-scale analysis and indexing tasks over our content. But we also like it because of how we can maintain it over time.

HDFS runs across multiple computers, all working together to ensure there are at least three copies of any data stored in the system, and that these copies are in separate machines and separate server racks. It runs like a beehive. The 'queen' is called the Namenode, and although it doesn't store any data, it keeps track of where all the data is and orchestrates the ingest and replication processes. The 'worker' nodes just store and maintain their own blocks of data, and send data back and forth between themselves as instructed by the Namenode. The Namenode also provides the interface we use to access the system, referring each client to the right set of worker nodes as files are accessed. All the time, the system calculates checksums of the chunks of data and uses this to verify the integrity of the files.
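
To make the beehive picture a little more concrete, here is a minimal toy model of the bookkeeping described above. It is a sketch under stated assumptions, not Hadoop code: `ToyCluster`, the worker and rack names, and the rule of one copy per rack are invented for illustration, real HDFS has its own placement policy, and SHA-256 simply stands in for whatever checksum the real system uses.

```python
import hashlib

REPLICATION = 3  # the post says at least three copies are kept


class ToyCluster:
    """Toy model: the 'queen' keeps metadata, the 'workers' keep the data."""

    def __init__(self, workers_by_rack):
        # workers_by_rack: {"rack-a": ["a1", "a2"], ...} (hypothetical names)
        self.rack_of = {w: rack for rack, ws in workers_by_rack.items() for w in ws}
        self.stored = {w: {} for w in self.rack_of}  # worker -> {block_id: bytes}
        self.locations = {}  # queen's metadata: block_id -> [workers]
        self.checksums = {}  # block_id -> hex digest recorded at write time

    def write(self, block_id, data):
        """Store REPLICATION copies, each on a worker in a different rack."""
        chosen, used_racks = [], set()
        # Prefer emptier workers; never reuse a rack for the same block.
        for worker in sorted(self.stored, key=lambda w: (len(self.stored[w]), w)):
            if self.rack_of[worker] not in used_racks:
                chosen.append(worker)
                used_racks.add(self.rack_of[worker])
            if len(chosen) == REPLICATION:
                break
        if len(chosen) < REPLICATION:
            raise RuntimeError("need workers in at least three racks")
        for worker in chosen:
            self.stored[worker][block_id] = data
        self.locations[block_id] = chosen
        self.checksums[block_id] = hashlib.sha256(data).hexdigest()
        return chosen

    def verify(self, block_id):
        """Return the workers whose copy no longer matches the checksum."""
        expected = self.checksums[block_id]
        return [
            w for w in self.locations[block_id]
            if hashlib.sha256(self.stored[w][block_id]).hexdigest() != expected
        ]


if __name__ == "__main__":
    cluster = ToyCluster({"rack-a": ["a1", "a2"], "rack-b": ["b1"], "rack-c": ["c1"]})
    print(cluster.write("blk-1", b"archived web page"))
    print(cluster.verify("blk-1"))  # an empty list means every copy is intact
```

The point of the sketch is the division of labour: the metadata says where each copy should be, and the checksum lets any single copy be checked without comparing it against the others.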

This architecture was designed to anticipate hardware failure and recover from it, which makes the system much easier to maintain. If a drive, or even a full server fails, we can simply remove it, replace it, and keep an eye on it as the data is re-distributed. As new, higher-capacity drives come along, we can upgrade the drives in each node one-by-one, in a rolling update that grows the capacity of the cluster.
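
The recovery step can be sketched in a few lines. Again, this is a simplified illustration rather than the real HDFS logic: the function name `re_replicate`, the worker names and the "pick the least-loaded healthy worker" rule are assumptions, and the sketch ignores racks entirely.

```python
REPLICATION = 3  # target number of copies per block


def re_replicate(locations, workers, failed):
    """Return a new block -> workers map with `failed` removed and copies restored.

    locations: {block_id: [worker, ...]}; workers: all worker names in the cluster.
    """
    live = [w for w in workers if w != failed]
    # Count how many blocks each healthy worker already holds.
    load = {w: 0 for w in live}
    for holders in locations.values():
        for w in holders:
            if w in load:
                load[w] += 1

    repaired = {}
    for block_id, holders in sorted(locations.items()):
        survivors = [w for w in holders if w in load]
        if not survivors:
            # No healthy copy is left to copy from, so the block is lost.
            raise RuntimeError(f"no surviving copy of {block_id}")
        while len(survivors) < REPLICATION:
            candidates = [w for w in live if w not in survivors]
            if not candidates:
                raise RuntimeError(f"not enough workers for {block_id}")
            # Copy to the least-loaded healthy worker (ties broken by name).
            target = min(candidates, key=lambda w: (load[w], w))
            survivors.append(target)
            load[target] += 1
        repaired[block_id] = survivors
    return repaired


if __name__ == "__main__":
    before = {"blk-1": ["n1", "n2", "n3"], "blk-2": ["n2", "n3", "n4"]}
    after = re_replicate(before, ["n1", "n2", "n3", "n4", "n5"], failed="n2")
    print(after)
```

Because every block already has other copies elsewhere, losing one worker only means making fresh copies from the survivors, which is why a failed drive or server can be swapped out without interrupting access.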

Rear of the UKWA racks

Similarly, over time, we can upgrade the operating system and other supporting software on every node, to make sure we're up to date. Almost all of this can be done while the system is running, without interrupting access. The exception is the Namenode – as a hive needs its queen, HDFS needs its Namenode, so we avoid interrupting it unless absolutely required. It had been running on the same hardware all this time, but now it's happily running on a new bit of kit. At last.

Like the Ship of Theseus, every piece has been replaced, but it's still the same store, and the data is still safe. Of course, it's not as easy to manage or as transparently scalable as cloud storage, but for on-site storage it does a great job. Rather than having to shift between storage silos every few years, the data is in constant motion, and this design allows the components and support contracts for the different layers to move at different speeds and rates of renewal over the years. This is one of the advantages of open source systems – they can provide a stable interface for a service, decoupled from any particular vendor or hardware, allowing support methods, contracts and contractors to change over time.

But HDFS has strong competition these days. There are many other options, many of which are compatible with the de facto standard, S3 (Simple Storage Service). Being able to work with the same interface whether storage is local or in the cloud might make all the difference. We're happy with HDFS for now, but we'll be preparing for the day when a new ship comes alongside and it's time to shift the cargo...