THE BRITISH LIBRARY

UK Web Archive blog

57 posts categorized "Web/Tech"

30 March 2020

UKWA: What's available when the reading rooms are closed?

By Jason Webber, Web Archiving Engagement Manager, The British Library

Like many public places at the moment, the reading rooms of the UK Legal Deposit Libraries are going to be shut for some time. What does this mean for the UK Web Archive? Well, as some of you might know, we try to collect every UK website at least once a year, and this is done under the provision of the Non-Print Legal Deposit Regulations 2013. A condition of these regulations is that content collected can only be viewed on library premises. Never fear though, we still have lots for you to do!

UKWA website home page

Discover millions of websites
At the end of 2018, we launched our new service for searching the whole UK Web Archive catalogue from anywhere. Go to our website: www.webarchive.org.uk and search for a web address (URL) or word/phrase. You will get results showing all of our resources that you can access from anywhere. Tick the box 'At Libraries' to see everything that we have collected.

Access thousands of websites
Over the 15 years that we have been archiving websites we have frequently sought permission from owners to make their sites publicly viewable outside of library premises. In that time we have received permissions from over 15,000 website owners. These websites have been selected because they relate to a specific topic or event, for their importance, or because they were about to go offline. Lots to see!

Browse 'Topics and Themes'
You can browse over 100 different topics and themes. From the extensive 'Brexit' collection to 'Web Comics' there is something for everyone. As a starter, check out 'Online Enthusiasts' and discover many of the hobbies and societies in the UK.

SHINE service
The UK Web Archive holds a collection of all the .uk websites that were archived by the Internet Archive between 1996 and 2013. The service includes a 'trends' feature that we highly recommend that you try: https://www.webarchive.org.uk/shine/graph

You can enter a word or phrase (in speech marks) to see the relative popularity in a given year. Enter different terms separated by a comma and you can compare popularity, e.g. tom,jane. See who is 'best', cat or dog, or track the emergence of words such as 'iphone', 'emoji' or phrases such as 'credit crunch'.
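For the curious, here is a rough idea of what a 'relative popularity' figure could mean. This is only a toy sketch in Python, not how SHINE itself is implemented: the handful of pages is made up, the function names are hypothetical, and it assumes popularity means the share of a year's pages that mention the term (using a simple substring match).

```python
from collections import Counter

# Made-up (year, page text) pairs standing in for archived pages.
PAGES = [
    (2006, "my cat sat on the mat"),
    (2006, "dog walking tips"),
    (2007, "the credit crunch hits the high street"),
    (2007, "cat pictures"),
    (2008, "credit crunch recipes"),
    (2008, "credit cards and the crunch"),
]


def parse_query(query):
    """Split 'tom,jane' style input into terms, dropping speech marks around phrases."""
    return [t.strip().strip('"').lower() for t in query.split(",") if t.strip()]


def trend(query, pages=PAGES):
    """Return {term: {year: share of that year's pages containing the term}}."""
    totals = Counter(year for year, _ in pages)
    result = {}
    for term in parse_query(query):
        # Naive substring match; a real search index would tokenise the text.
        hits = Counter(year for year, text in pages if term in text.lower())
        result[term] = {year: hits[year] / totals[year] for year in sorted(totals)}
    return result


print(trend('cat,"credit crunch"'))
```

Dividing by the number of pages collected each year is what makes the figure relative: a term can appear on more pages in absolute terms simply because more pages were archived that year.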

Do tell us what you find!

Trends - the use of the term 'loungewear' in the UK web

Nominate websites
We are still able to add websites to the archive and welcome nominations! We want to archive every single UK website and your help is invaluable. Make your suggestions here: www.webarchive.org.uk/nominate

Stay safe everyone!

@ukwebarchive

13 March 2020

Theseus' Data Store

By Andy Jackson, Web Archiving Technical Lead, The British Library

My father used to joke about how he’d had his hammer his whole working life. He’d bought it when he’d started out as a joiner, and decades later it was still going strong. He’d replaced the shaft five times and the head twice, but it was still the same hammer! This simple story of maintenance and renewal springs to mind because a few days ago, we finally managed to replace the most important component of our data store. Our storage cluster has been running near-continuously for almost a decade, but as of now, every single hardware component has been renewed or replaced.

Claw Hammer
Andy's Dad's hammer

We use Apache Hadoop to provide our main data store, via the Hadoop Distributed File System (HDFS). We mostly like it because it's cheap, robust, and helps us run large-scale analysis and indexing tasks over our content. But we also like it because of how we can maintain it over time.

HDFS runs across multiple computers, all working together to ensure there are at least three copies of any data stored in the system, and that these copies are in separate machines and separate server racks. It runs like a beehive. The 'queen' is called the Namenode, and although it doesn't store any data, it keeps track of where all the data is and orchestrates the ingest and replication processes. The 'worker' nodes just store and maintain their own blocks of data, and send data back and forth between themselves as instructed by the Namenode. The Namenode also provides the interface we use to access the system, referring each client to the right set of worker nodes as files are accessed. All the time, the system calculates checksums of the chunks of data and uses these to verify the integrity of the files.
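To make the beehive picture a bit more concrete, here is a toy model in Python. It is a sketch under loud assumptions, not Hadoop code: the node and rack names are invented, `place_replicas` and `ToyNamenode` are hypothetical, the round-robin placement is a stand-in for HDFS's real rack-aware policy, and `zlib.crc32` stands in for whatever checksum the real system uses. The toy also keeps the checksum next to the block map purely to keep the example short.

```python
import zlib
from itertools import cycle

REPLICATION = 3  # the post says at least three copies are kept

# Hypothetical cluster layout: worker node -> server rack.
NODES = {
    "node-a1": "rack-a", "node-a2": "rack-a",
    "node-b1": "rack-b", "node-b2": "rack-b",
    "node-c1": "rack-c",
}


def place_replicas(nodes, copies=REPLICATION):
    """Pick up to `copies` distinct nodes, taking one per rack in turn."""
    by_rack = {}
    for node, rack in sorted(nodes.items()):
        by_rack.setdefault(rack, []).append(node)
    queues = [by_rack[rack] for rack in sorted(by_rack)]
    chosen = []
    for queue in cycle(queues):
        if len(chosen) == copies or not any(queues):
            break
        if queue:
            chosen.append(queue.pop(0))
    return chosen


class ToyNamenode:
    """Tracks where each block lives and its checksum; holds no data itself."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.block_map = {}  # block id -> (nodes holding a copy, checksum)

    def store(self, block_id, data, workers):
        targets = place_replicas(self.nodes)
        for node in targets:
            workers[node][block_id] = data  # each worker keeps its own copy
        self.block_map[block_id] = (targets, zlib.crc32(data))
        return targets

    def verify(self, block_id, workers):
        """Recompute each copy's checksum and compare with the recorded one."""
        targets, expected = self.block_map[block_id]
        return {n: zlib.crc32(workers[n][block_id]) == expected for n in targets}


workers = {node: {} for node in NODES}
namenode = ToyNamenode(NODES)
targets = namenode.store("blk_1", b"archived web page", workers)

workers[targets[0]]["blk_1"] = b"archived web pagE"  # simulate silent corruption
report = namenode.verify("blk_1", workers)
print(report)
```

The point of the sketch is the division of labour: the 'queen' only holds the map of what lives where, while the copies themselves sit on workers in different racks, so one damaged copy can be spotted and two good ones remain.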

This architecture was designed to anticipate hardware failure and recover from it, which makes the system much easier to maintain. If a drive, or even a full server fails, we can simply remove it, replace it, and keep an eye on it as the data is re-distributed. As new, higher-capacity drives come along, we can upgrade the drives in each node one-by-one, in a rolling update that grows the capacity of the cluster.
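The recovery step can be sketched in a few lines of Python. Again this is a toy under stated assumptions rather than how Hadoop does it: the block map, node names and the `re_replicate` helper are all hypothetical, rack placement is ignored, and new copies simply go to whichever live node currently holds the fewest blocks.

```python
REPLICATION = 3  # target number of copies per block, as described in the post


def load(block_map, node):
    """Number of blocks a node currently holds."""
    return sum(node in holders for holders in block_map.values())


def under_replicated(block_map, live_nodes, copies=REPLICATION):
    """Block ids with fewer than `copies` replicas on live nodes."""
    return sorted(
        block for block, holders in block_map.items()
        if len(holders & live_nodes) < copies
    )


def re_replicate(block_map, live_nodes, copies=REPLICATION):
    """Return a new block map with lost replicas re-created on the emptiest live nodes."""
    repaired = {block: holders & live_nodes for block, holders in block_map.items()}
    for block in under_replicated(block_map, live_nodes, copies):
        holders = repaired[block]
        if not holders:
            continue  # no surviving copy left to clone from
        while len(holders) < copies:
            spare = sorted(live_nodes - holders, key=lambda n: (load(repaired, n), n))
            if not spare:
                break  # fewer live nodes than requested copies
            holders.add(spare[0])
    return repaired


# Hypothetical state: block id -> set of nodes holding a copy.
blocks = {
    "blk_1": {"n1", "n2", "n3"},
    "blk_2": {"n2", "n3", "n4"},
    "blk_3": {"n1", "n3", "n4"},
}
live = {"n1", "n2", "n3", "n5"}  # n4 has failed; n5 is its empty replacement

print(under_replicated(blocks, live))
repaired = re_replicate(blocks, live)
print(under_replicated(repaired, live))
```

In this toy run the two blocks that lost a copy on the failed node both end up with a fresh copy on the replacement, which is the 're-distribution' we keep an eye on after swapping hardware.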

Rear of the UKWA racks

Similarly, over time, we can upgrade the operating system and other supporting software on every node, to make sure we're up to date. Almost all of this can be done while the system is running, without interrupting access. But the exception is the Namenode – as a hive needs its Queen, HDFS needs its Namenode, so we avoid interrupting it unless absolutely required. It had been running on the same hardware all this time, but now it's happily running on a new bit of kit. At last.

Like the Ship of Theseus, every piece has been replaced, but it's still the same store, and the data is still safe. Of course, it's not as easy to manage and as transparently scalable as Cloud storage, but for on-site storage it does a great job. Rather than having to shift between storage silos every few years, the data is in constant motion, and this design allows the components and support contracts for the different layers to move at different speeds and rates of renewal over the years. This is one of the advantages of open source systems – they can provide a stable interface for a service, decoupled from any particular vendor or hardware, allowing support methods, contracts and contractors to change over time.

But HDFS has strong competition these days. There are many other options, many of which are compatible with the de facto standard, S3 (Simple Storage Service). Being able to work with the same interface whether storage is local or in the cloud might make all the difference. We're happy with HDFS for now, but we'll be preparing for the day when a new ship comes alongside and it's time to shift the cargo...

02 March 2020

15 Years of the UK Web Archive - The Early Years

Think back 15 years to the beginning of 2005. Future Prime Minister David Cameron wasn't yet Leader of the Conservative party and Google Maps, Twitter and the iPhone all had yet to be launched. It was, however, the year that we started collecting copies of UK published websites for permanent preservation and access.

The original UK Web Archive Consortium website captured by the Internet Archive.
First UKWAC website via the Internet Archive

Our Origins
In the beginning a group of interested UK institutions - The British Library, The National Archives, Wellcome Trust, National Library of Scotland, National Library of Wales and JISC - formed a consortium (UKWAC) to implement a project to archive websites.

A multi-disciplinary team was formed to look into the many challenging technical and curatorial issues involved in archiving websites. The learning curve in this field can be steep and was especially so then. At the time only a few other organisations were in this field, including the Internet Archive and the National Library of Australia (NLA). In fact, initially, it was a tool developed by NLA - Pandas software - that we used in those early days. Later on we switched to using Heritrix to collect the web (and still do).

One of the special elements of archiving the web is that whilst it can be difficult, this very challenge encourages co-operation and partnership. We have been very grateful for all of the input and help along the way.  

What to archive?
In this early era of web archiving, website 'targets' were carefully selected by curators and the owners were asked for permission for us to archive and publicly display them. If the website owner refused, which was very rare, or didn't respond, then the website couldn't be archived.

Overall, the responses that we received from website owners were overwhelmingly positive and led to the formation of key early collections such as the 2005 General Election and the 7 July London terrorist attacks.

BBC news Election website from 2005
BBC News 2005 Election website

Where next?
By early 2005 the first websites were being captured and stored; however, it wasn't until May 2005 that we first displayed them to the public.

The evolution of the UK Web Archive website and how we developed a growing number of 'special collections' (as they were initially known) will be the subject of future blog posts over the next few months.

Watch this space!

#15YearsOfUKWA

26 February 2020

Spotlight on Hedley Sutton, Asian & African Studies Reference Team Leader at The British Library

By Helena Byrne, Curator of Web Archives, The British Library

Hedley Sutton is the Team Leader, Asian & African Studies Reference Services at the British Library. He joined the Library as a cataloguer in what was then called the Bibliographic Services Division in 1982. Early in 1988 he moved to the India Office Library and Records Section (later renamed the Oriental & India Office Collections … then Asia, Pacific & Africa Collections … and now Asian & African Studies) as Serials and Acquisitions Librarian, before taking up his current role in the Reference Enquiry team in 1999.

 In a previous blog post (2014) Hedley stated that:

“A Reference Team Leader spends most of their day answering queries sent in by e-mail, fax and letter or manning Reading Room enquiry desks. Some, however, also help with contributing to the selection of sites for inclusion in the UK Web Archive.”

In 2008, Hedley started to select websites for the web archive team and to date has selected over 6,000 targets. His initial focus was on UK published websites related to his own specialism of Asian and African studies; however, he soon turned to selecting websites on a wide variety of topics that will be of interest to future researchers.

In his free time, Hedley likes to write limericks, and when he started to come across websites that covered interesting niche subject areas he was inspired to write a series of blogs called If Websites Could Talk. In the first blog post (2016), Hedley brings many of the websites to life as they discuss amongst themselves “to which might be regarded as the most fantastic and extraordinary site of all”. In the second blog post (2017), the websites talked about “which one has the best claim to be recognized as the most extraordinary”. After a long break, the third blog post (2020) also tries to determine which website is the most extraordinary site of all.

You can view archived versions of the websites that Hedley has selected by searching on the UK Web Archive website: https://www.webarchive.org.uk/ukwa/   

If you know of a website that you feel should be in the UK Web Archive, please nominate it.

30 September 2019

The Magic of Wimbledon in the UK Web Archive

By Robert McNicol, Librarian at the Wimbledon Lawn Tennis Museum

The magic of Wimbledon is its ability to preserve its history and tradition while simultaneously embracing the future. When you enter the Grounds of The All England Lawn Tennis Club, you know you’re somewhere special. It’s the spiritual home of the sport and you can feel the history all around you. And yet Wimbledon in 2019 is also a thoroughly modern sporting venue with state-of-the-art facilities for players, spectators, officials and broadcasters. While Wimbledon loves its traditions (the grass courts, the all-white clothing, the strawberries & cream), it has always been looking ahead as well. From the very first Lawn Tennis Championships in 1877, to the introduction of Open tennis in 1968, to the building of roofs on Centre and No.1 Courts. Wimbledon is both the past and the future of tennis.

It’s in this same spirit that the Kenneth Ritchie Wimbledon Library has teamed up with the British Library to curate a collection of tennis websites for the UK Web Archive. This is a subsection of the much larger Sports Collection on the UK Web Archive website. Using the latest technology to preserve the past, it’s a project that captures the essence of Wimbledon.

Naturally, one of the first websites we added to the Tennis collection was our own. Wimbledon.com was established in 1995 and is very excited to be celebrating its 25th anniversary next year.  This project ensures that, in future, researchers will be able to go back and search the contents of the Wimbledon website from previous years. We have also archived some Wimbledon social media feeds, including the Twitter feed of the Wimbledon Lawn Tennis Museum, of which the Library is part.

However, the ultimate aim is to archive a complete collection of UK-based tennis websites. This will include sites belonging to governing bodies, clubs, media and individual players. One part of the project already completed is to archive the Twitter feeds of all British players with a world ranking. From Andy Murray and Johanna Konta to Finn Bass and Blu Baker, every British player with a Twitter account has had it saved for posterity!

If you want to hear more about the project, you may be interested in attending Wimbledon’s Tennis History Conference on Saturday 9 November, where Helena Byrne (Curator of Web Archiving at the British Library) will be joining me to do a joint presentation.

And if you’d like to know more about the Wimbledon Library, feel free to get in touch. We’re the world’s biggest and best tennis library, holding thousands of books, periodicals and programmes from more than 90 different countries. We’re open by appointment to anyone with an interest in researching tennis history. https://www.wimbledon.com/en_GB/atoz/library_research_enquiries.html

Finally, if you’d like to nominate a tennis or other sporting website for us to archive, go to our Save a UK website form: https://www.webarchive.org.uk/en/ukwa/info/nominate

16 July 2019

Summer Placement with the UK Web Archive

By Isobelle Degale, Masters student, University of Sussex

My summer placement at the British Library is now coming to an end. As a Masters student studying Human Rights, I contacted the UK Web Archiving team based at the British Library as a way to enrich my understanding of the sources available on London policing, specifically looking at stop and search procedure.

In the first few days of the placement I learnt how to add online content to the UK Web Archive using the 'Annotation and Curation Tool' (ACT). I learnt how to add 'targets' (web addresses) to the web archive using ACT and the importance of crawl frequency of different sources. Over the last few weeks I have been researching and selecting content to add to the online collections: Black and Asian Britain and Caribbean Communities in the UK.

Having previously studied history, including the impact of the British Empire, during my undergraduate degree, I also have an interest in the Windrush generation and have been selecting content such as websites, podcast links, videos and documentaries. I have also gained hands-on experience in web archiving through emailing website authors requesting permission for open access to their content.

As my summer dissertation discusses discrimination and disproportionality of London stop and searches, I have also been adding related content to the UK Web Archive. I have gathered content such as news articles, Twitter accounts of activists, grassroots websites and publications from racial equality think tanks that highlight the disproportionality of stop and searches on young BME (Black and Minority Ethnic) people and communities, which is the central debate of this topic. My dissertation specifically explores the experiences and perspectives of those stopped and searched. I have noted that there is a gap on the web when it comes to content which explores and expresses the opinions of those who are more likely to be stopped, despite the abundance of news reports and statistics on the topic.

My experience with the web archiving team has opened up my thoughts to the value of archiving online content, as with the breadth and depth of the web, socially and culturally important web sites can easily be overlooked if not archived.

I hope that my contribution over the weeks will be useful in documenting the cultural and social celebration of black and Asian communities in Britain, but also in demonstrating that there are negative experiences of black and ethnic minority Britons that make up an important part of daily life and should not be ignored. As a human rights student I feel that it is important to recognise inequality in both past and present Britain. I am, therefore, grateful to the Web Archiving team for the opportunity to add to the UK Web Archive the much debated topic of London stop and searches, which will hopefully provide insight and information into the subject.

29 March 2019

Collecting Interactive Fiction

Intro
Works of interactive fiction are stories where the reader/player can guide or affect the narrative in some way. This can be through turning to a specific page as in 'Choose Your Own Adventure', or clicking a link or typing text in digital works.

Archiving Interactive Fiction
Attempts to archive UK-made interactive fiction began with an exploration of the affordances of a couple of different tools: the British Library’s own ACT (Annotation and Curation Tool), and Rhizome’s Webrecorder. ACT is a system which interfaces with the Internet Archive’s Heritrix crawl engine to provide large scale captures of the UK Web. Webrecorder instead focusses on much smaller scale, higher fidelity captures which include video, audio and other multimedia content. All types of interactive fiction (parser, hypertext, choice-based and multimodal) were tested with both ACT and Webrecorder in order to determine which tools were best suited to which types of content. It should be noted that this project is experimental and ongoing, and as a result, all assertions and suggestions made here are provisional and will not necessarily affect or influence Library collection policy or the final collection. As yet, Webrecorder files do not form part of standard Library collections.

For most parser-based works (those made with Inform 7), Webrecorder appears to work best. It is generally more time-consuming to obtain captures in Webrecorder than in ACT as each page element has to be clicked manually (or at least, the top level page in each branch must be visited) in order to create a fully replayable record. However, this is not the case with most Inform 7 works. For the vast majority, visiting the title page and pressing space bar was sufficient to capture the entire work. The works are then fully replayable in the capture, with users able to type any valid commands in any order. ACT failed to capture most parser works, but there were some successes. For example, Elizabeth Smyth’s Inform 7 game 1k Cupid was fully replayable in ACT, while Robin Johnson’s custom-made Aunts and Butlers also retained full functionality. Unfortunately, games made with Quest failed to capture with either tool.

Works which make use of live data such as location information, maps or other online resources are another form which currently appears to be unarchivable. Matt Bryden’s Poetry Map failed to capture in ACT, and in Webrecorder, although the poems themselves were retained, the background maps were lost. Similarly, Kate Pullinger’s Breathe was recorded successfully with Webrecorder, but naturally only the default text, rather than the adaptive, location-based information, is present. Archiving alternative resources such as blogs describing the works may be necessary for these pieces until another solution is found. However, even where these works don’t capture as intended, running them through ACT may still have benefits. A functional version of J.R. Carpenter’s This Is A Picture of Wind, which makes use of live wind data, could not be captured, but crawling it obtained a sample thumbnail which indicates how the poems display in the live version – something which would not have been possible using Webrecorder alone.

Choice-based works made with Ink generally captured well with ACT, although Isak Grozny’s dripping with the waters of SHEOL required Webrecorder. This could be due to the dynamic menus, the use of JavaScript, or because Autorun has been enabled on itch.io, all of which can prevent ACT from crawling effectively. ChoiceScript games were difficult to capture with either tool for various reasons. Firstly, those which are paywalled could not be captured. Secondly, the manner in which the files are hosted appears to affect capture. When hosted as a folder of individual files rather than as a single compiled HTML file, the works could only be captured with Webrecorder’s Firefox emulator, and even then, the page crashed frequently. Those which had been compiled appeared to capture equally well with either tool.

Twine works generally capture reasonably well with ACT. ACT is probably the best choice for larger Twines in particular, as capturing a large number of branches quickly becomes extremely time-consuming in Webrecorder. Works which rely on images and video to tell their story, such as Chris Godber’s Glitch, however, retain more of their functionality if recorded in Webrecorder. As the game is somewhat sprawling, a route was planned through which would give a good idea of the game’s flavour while avoiding excessively long capture times. Webrecorder also contains an emulator of an older version of Firefox which is compatible with older JavaScript functions and Flash. This allowed for archiving of works which would have otherwise failed to capture, such as Emma Winston’s Cat Simulator 3000 and Daniel Goodbrey’s Icarus Needs.

As alluded to above, using the two tools in tandem is probably the best way to ensure these digital works of fiction are not lost. However, creators are advised to archive their own work too, either by nominating web pages to the UKWA, capturing content with Webrecorder, or saving pages with the Internet Archive’s Wayback Machine.

By Lynda Clark, Innovation Placement, The British Library - @notagoth

21 March 2019

Save UK Published Google + Accounts Now!

The fragility of social media data was highlighted recently when Myspace deleted (by accident) users’ audio and video files without warning. This almost certainly resulted in the loss of many unique and original pieces of work. This is another example of how online social media platforms should not be seen as archives and that if things are important to you they should also be stored elsewhere. The UK Web Archive can play a role in this and we do what we can to preserve websites and selected social media. We do, however, need your help!

Google+
If you have a Google + account you will have seen the warning that the service is shutting down on 2 April 2019, and that users should download any data they want to save by 31 March 2019.

However, it’s not easy to know how to preserve data from social media accounts, and sometimes this information, without the context of the platform it was hosted on, doesn’t give the full picture. In a previous blog post we outlined the challenges involved in archiving social media. Currently the most popular social media platform in the UK Web Archive is Twitter, followed by Facebook, which we haven’t been able to successfully capture since 2015, and a limited amount of Instagram, Weibo, WeChat and Google +.

Under the 2013 Non-Print Legal Deposit Regulations we can legally only collect digital content published in the UK. As these platforms are hosted outside the UK there is no automated way to identify UK accounts so it requires a person to look through and identify the profiles that are added. In general, these are profiles of politicians, public figures, people renowned in their field of study, campaign groups and institutions.

So far, we only have a handful of Google + profiles in the UK Web Archive but we are keen to have more.

How to save your Google+ data
If you have a Google + profile or know of other profiles published in the UK that you think should be preserved, fill in our nomination form before March 29th 2019: https://www.webarchive.org.uk/en/ukwa/info/nominate

If the profiles you want to archive are published outside the UK, you can use the 'save a website now' function on the Internet Archive website: https://archive.org/web/

By Helena Byrne, Web Curator of Web Archiving, The British Library