THE BRITISH LIBRARY

UK Web Archive blog

16 July 2019

Summer Placement with the UK Web Archive

By Isobelle Degale, Masters student, University of Sussex

My summer placement at the British Library is now coming to an end. As a Masters student studying Human Rights, I contacted the UK Web Archiving team based at the British Library as a way to enrich my understanding of the sources available on London policing, specifically looking at stop and search procedure.

During the first few days of the placement I learnt how to add online content to the UK Web Archive using the 'Annotation and Curation Tool' (ACT): adding 'targets' (web addresses) and understanding the importance of choosing an appropriate crawl frequency for different sources. Over the last few weeks I have been researching and selecting content to add to the online collections Black and Asian Britain and Caribbean Communities in the UK.

Having previously studied history, including the impact of the British Empire during my undergraduate degree, I also have an interest in the Windrush generation and have been selecting content such as websites, podcast links, videos and documentaries. I have also gained hands-on experience in web archiving by emailing website authors to request permission for open access to their content.

As my summer dissertation discusses discrimination and disproportionality in London stop and searches, I have also been adding related content to the UK Web Archive. I have gathered news articles, Twitter accounts of activists, grassroots websites and publications from racial equality think tanks that highlight the disproportionate use of stop and search on young BME (Black and Minority Ethnic) people and communities, which is the central debate on this topic. My dissertation specifically explores the experiences and perspectives of those stopped and searched. I have noted that, despite the abundance of news reports and statistics on the topic, there is a gap on the web when it comes to the opinions of those who are most likely to be stopped.

My experience with the web archiving team has opened my eyes to the value of archiving online content: given the breadth and depth of the web, socially and culturally important websites can easily be overlooked if they are not archived.

I hope that my contribution over these weeks will be useful in documenting the cultural and social celebration of black and Asian communities in Britain, while also demonstrating that the negative experiences of black and ethnic minority Britons form an important part of daily life and should not be ignored. As a human rights student I feel it is important to recognise inequality in both past and present Britain. I am therefore grateful to the Web Archiving team for the opportunity to add to the UK Web Archive material on the much-debated topic of London stop and searches, which will hopefully provide insight and information into the subject.

29 March 2019

Collecting Interactive Fiction

Intro
Works of interactive fiction are stories where the reader/player can guide or affect the narrative in some way. This can be through turning to a specific page, as in 'Choose Your Own Adventure' books, or through clicking a link or typing text in digital works.

Archiving Interactive Fiction
Attempts to archive UK-made interactive fiction began with an exploration of the affordances of two different tools: the British Library's own ACT (Annotation and Curation Tool) and Rhizome's Webrecorder. ACT is a system which interfaces with the Internet Archive's Heritrix crawl engine to provide large-scale captures of the UK web. Webrecorder instead focusses on much smaller-scale, higher-fidelity captures which include video, audio and other multimedia content. All types of interactive fiction (parser, hypertext, choice-based and multimodal) were tested with both ACT and Webrecorder in order to determine which tool was best suited to each type of content. It should be noted that this project is experimental and ongoing; as a result, all assertions and suggestions made here are provisional and will not necessarily affect or influence Library collection policy or the final collection. As yet, Webrecorder files do not form part of standard Library collections.

For most parser-based works (such as those made with Inform 7), Webrecorder appears to work best. It is generally more time-consuming to obtain captures in Webrecorder than in ACT, as each page element has to be clicked manually (or at least, the top-level page in each branch must be visited) in order to create a fully replayable record. However, this is not the case with most Inform 7 works: for the vast majority, visiting the title page and pressing the space bar was sufficient to capture the entire work. The works are then fully replayable in the capture, with users able to type any valid commands in any order. ACT failed to capture most parser works, but there were some successes. For example, Elizabeth Smyth's Inform 7 game 1k Cupid was fully replayable in ACT, while Robin Johnson's custom-made Aunts and Butlers also retained full functionality. Unfortunately, games made with Quest failed to capture with either tool.

Another form which currently appears to be unarchivable is works which make use of live data such as location information, maps or other online resources. Matt Bryden's Poetry Map failed to capture in ACT, and in Webrecorder, although the poems themselves were retained, the background maps were lost. Similarly, Kate Pullinger's Breathe was recorded successfully with Webrecorder, but naturally only the default text, rather than the adaptive, location-based information, is present. Archiving alternative resources, such as blogs describing the works, may be necessary for these pieces until another solution is found. However, even where these works don't capture as intended, running them through ACT may still have benefits. A functional version of J.R. Carpenter's This Is A Picture of Wind, which makes use of live wind data, could not be captured, but crawling it obtained a sample thumbnail which indicates how the poems display in the live version – something which would not have been possible using Webrecorder alone.

Choice-based works made with Ink generally captured well with ACT, although Isak Grozny's dripping with the waters of SHEOL required Webrecorder. This could be due to the dynamic menus, the use of JavaScript, or because autorun has been enabled on itch.io, all of which can prevent ACT from crawling effectively. ChoiceScript games were difficult to capture with either tool, for various reasons. Firstly, those which are paywalled could not be captured. Secondly, the manner in which the files are hosted appears to affect capture: when hosted as a folder of individual files rather than as a single compiled HTML file, the works could only be captured with Webrecorder's Firefox emulator, and even then the page crashed frequently. Those which had been compiled appeared to capture equally well with either tool.

Twine works generally capture reasonably well with ACT. ACT is probably the best choice for larger Twines in particular, as capturing a large number of branches quickly becomes extremely time-consuming in Webrecorder. Works which rely on images and video to tell their story, such as Chris Godber's Glitch, however, retain more of their functionality if recorded in Webrecorder. As Glitch is somewhat sprawling, a route was planned through the game which would give a good idea of its flavour while avoiding excessively long capture times. Webrecorder also contains an emulator of an older version of Firefox which is compatible with older JavaScript functions and Flash. This allowed the archiving of works which would otherwise have failed to capture, such as Emma Winston's Cat Simulator 3000 and Daniel Goodbrey's Icarus Needs.

As alluded to above, using the two tools in tandem is probably the best way to ensure these digital works of fiction are not lost. However, creators are advised to archive their own work too, either by nominating web pages to the UKWA, capturing content with Webrecorder, or saving pages with the Internet Archive’s Wayback Machine.

By Lynda Clark, Innovation Placement, The British Library - @notagoth

21 March 2019

Save UK Published Google + Accounts Now!

The fragility of social media data was highlighted recently when Myspace accidentally deleted users' audio and video files without warning. This almost certainly resulted in the loss of many unique and original pieces of work. It is another example of why online social media platforms should not be treated as archives: if things are important to you, they should also be stored elsewhere. The UK Web Archive can play a role in this, and we do what we can to preserve websites and selected social media. We do, however, need your help!

Google+
If you have a Google+ account you will have seen the warning that the service is shutting down on 2 April 2019; Google has advised users to download any data they want to save by 31 March 2019.

However, it's not easy to know how to preserve data from social media accounts, and sometimes this information, without the context of the platform it was hosted on, doesn't give the full picture. In a previous blog post we outlined the challenges involved in archiving social media. Currently the most popular social media platform in the UK Web Archive is Twitter, followed by Facebook (which we haven't been able to successfully capture since 2015) and a limited amount of Instagram, Weibo, WeChat and Google+.

Under the 2013 Non-Print Legal Deposit Regulations we can legally only collect digital content published in the UK. As these platforms are hosted outside the UK there is no automated way to identify UK accounts so it requires a person to look through and identify the profiles that are added. In general, these are profiles of politicians, public figures, people renowned in their field of study, campaign groups and institutions.

So far, we only have a handful of Google+ profiles in the UK Web Archive, but we are keen to have more.

How to save your Google+ data
If you have a Google+ profile, or know of other profiles published in the UK that you think should be preserved, fill in our nomination form before 29 March 2019: https://www.webarchive.org.uk/en/ukwa/info/nominate

If the profiles you want to archive are published outside the UK, you can use the 'save a website now' function on the Internet Archive website: https://archive.org/web/
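For anyone saving more than a few pages, the Internet Archive's 'save' function can also be driven by URL: requesting `https://web.archive.org/save/` followed by the page address asks the Wayback Machine to take a snapshot. A minimal sketch of building such a request URL (the helper function is our own illustration, not an official client):

```python
from urllib.parse import quote

# The Wayback Machine's 'Save Page Now' endpoint; the target page
# address is appended after the trailing slash.
SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_page_now_url(target_url: str) -> str:
    """Build the 'Save Page Now' URL for a given page.

    Fetching the returned URL (e.g. with urllib.request.urlopen)
    asks the Wayback Machine to take a fresh snapshot of the page.
    """
    # Percent-encode the target, but keep ':' and '/' intact so the
    # original URL stays readable inside the save URL.
    return SAVE_ENDPOINT + quote(target_url, safe=":/")

print(save_page_now_url("https://example.org/profile"))
# → https://web.archive.org/save/https://example.org/profile
```

Note that this only triggers a single-page snapshot; it is no substitute for nominating a site to UKWA, where curators can set an ongoing crawl.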

By Helena Byrne, Curator of Web Archiving, The British Library

20 December 2018

The UK Web Archive gets a fresh look

Until recently, if you wanted to research a historic UK website you may have had to look in a number of different places. There was the 'Open' UK Web Archive that contained the 15,000 or so publicly available websites collected since 2005. If you also wanted to check the vast 'Legal Deposit' web archive (containing the whole UK web space) then you would need to travel to the reading room of a UK Legal Deposit Library to see if what you needed was there. For the first time, the new UKWA website offers:

  • The ability to search the entire collection in one place
  • The opportunity to browse over 100 curated collections on a wide range of topics

www.webarchive.org.uk

Who is the UK Web Archive?
UKWA is a partnership of all the UK Legal Deposit Libraries: the British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries Oxford, Cambridge University Libraries and Trinity College Dublin. The Legal Deposit Web Archive is available in the reading rooms of all of these libraries. A reader's pass for each library is required to gain access to its reading room.

How much is available now?
At the time of writing, everything that a human (curators and collaborators) has selected since 2005 is searchable. This constitutes many thousands of websites and millions of individual web pages. We will be adding the huge yearly Legal Deposit collections over the coming year - we'll let you know as they become available.

Among the many websites available are the BBC and many newspaper websites such as The Sun, The Daily Mail and The Guardian.

Do the websites look and work as they did originally?
Yes and no. Every effort is made so that websites look how they did originally, and internal links should work. However, due to a variety of technical issues, many websites will look different or some elements may be missing. As a minimum, all of the text in the collection is searchable and most images should be there. Whilst we collect a considerable amount of video, much of this will not play back.

Is every UK website available?
We aim to collect every website made or owned by a UK resident; however, in reality it is extremely difficult to be comprehensive! Our annual Legal Deposit collections include every .uk (and .london, .scot, .wales and .cymru) website, plus any website on a server located in the UK. Of course, many UK websites are .com, .info etc. and sit on servers in other countries.
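The domain half of that scope can be sketched in a few lines. This is only an illustration of the suffix check described above (the helper is hypothetical, and the server-location half would require GeoIP data, which is omitted here):

```python
from urllib.parse import urlparse

# The UK-associated suffixes named in the collection scope above.
UK_SUFFIXES = (".uk", ".london", ".scot", ".wales", ".cymru")

def in_uk_domain_scope(url: str) -> bool:
    """Return True if the URL's hostname ends in one of the UK suffixes.

    This covers only the domain-based part of the scope; sites like
    example.com hosted on UK servers need a separate location check.
    """
    host = (urlparse(url).hostname or "").lower()
    return host.endswith(UK_SUFFIXES)

print(in_uk_domain_scope("https://www.example.co.uk/page"))  # True
print(in_uk_domain_scope("https://example.com/"))            # False
```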

If you have or know of a UK website that should be in the archive we encourage you to nominate them here.

Keep in touch by following us on Twitter.

By Jason Webber, Web Archive Engagement Manager, The British Library

11 May 2018

Online Hours: Supporting Open Source

Encouraging collaboration
Here at the UK Web Archive, we're very fortunate to be able to work in the open, with almost all of our code on GitHub. Some of our work has been taken up and re-used by others, which is great. We'd like to encourage more collaboration, but we've had trouble dedicating time to open project management, and our overall management process and future plans are unclear. For example, we've experimented with so many different technologies over the years that our list of repositories gives little insight into where we're going next. There are also problems with how issues and pull requests have been managed: they often languish unanswered, waiting for us to get around to looking at them. This applies to the IIPC repositories and other projects we are involved in, as well as the projects we lead.

I wanted to block out some time to deal with these things promptly, but also to find a way of making it a bit more, well, fun. A bit more social. Some forum where we can chat about our process and plans without the formality of having to write things up.

Taking inspiration from Jason Scott's live-streamed CD-ripping sessions, we came up with the idea of something like Office Hours for Open Source: a kind of open video conference or live stream where we'll share our process, discuss issues relating to open source projects and provide a forum where anyone can ask questions about what we're up to.

Who is this for?
All welcome, from lurkers to those brimming with burning questions. Just remember that being *kind* beats being right.

Furthermore, anyone else who manages open source projects like ours is welcome to join and take the lead for a while! I can only cover the projects we're leading, but there are many more that it would be interesting to hear from.

When?
The plan is to launch the first Online Hours session on the 22nd of May, and then hold regular weekly slots every Tuesday from then on. We may not manage to run it every single week, but if it’s regular and frequent that should mean we can cope more easily with missing the odd one or two.

On the 22nd, we will run two sessions - one in the morning (for the west-of-GMT time-zones) and one in the evening (for the eastern half). Following that, we intend to switch between the two slots, making each a.m. and p.m. slot a fortnightly occurrence.

How?
The sessions will be webcast with a slack channel available for chat. See the IIPC Trello board for more information.

The IIPC (International Internet Preservation Consortium) have kindly agreed to help support this event and further Online Hours sessions. Running this initiative in a more open manner should raise the profile of our open source work both inside and outside of the IIPC, and encourage greater adoption of, and collaboration around, open source tools.

For full details, see the IIPC Trello Board card or ask a question in the NetPreserve Slack Channel #oh-sos (ask @NetPreserve to join the Slack).

See you there!

By Andrew Jackson, Web Archive Technical Lead, The British Library

 

04 May 2018

Star Wars in the Web Archive

May the fourth be with you!

It's Star Wars Day, and I imagine that you are curious to know which side has won the battle for the UK web space.

Looking at the trends in our SHINE dataset (.uk websites from 1996-2013, collected by the Internet Archive), I first looked at the iconic match-up of Luke vs Darth.

[SHINE trend graph: 'luke skywalker' vs 'darth vader']

Bad news: evil seems to have won this round, mainly due to the popularity of Darth Vader costume mentions on retail websites.

How about a more general 'Light Side vs Dark Side'?

[SHINE trend graph: 'light side' vs 'dark side']

It appears that discussing the 'dark side' of many aspects of life is a lot more fun and interesting than the 'light side'.

How about just analysing the phrase 'may the force be with you'?

[SHINE trend graph: 'may the force be with you']

This phrase doesn't seem to have been particularly popular on the UK web until it started to be used a lot on websites offering downloadable ringtones. Go figure.

Try using the trends feature on this dataset yourself here: www.webarchive.org.uk/shine/graph

Happy Star Wars Day!

by Jason Webber, Web Archive Engagement Manager, The British Library

@UKWebArchive

 

01 February 2018

A New Playback Tool for the UK Web Archive

We are delighted to announce that the UK Web Archive will be working with Rhizome to build a version of pywb (Python Wayback) that we hope will greatly improve the quality of playback for access to our archived content.

What is playback of a web archive?

When we archive the web, just downloading the content is not enough. Data can be copied from the web into an archive in a variety of ways, but making that archive actually accessible takes more than just opening downloaded files in a web browser. Pages and scripts coming out of the archive need to be presented in a way that enables them to work just like the originals, even though they are no longer located on their original servers. Today's web users have come to expect interactive features and dynamic layouts on all types of websites. Faithfully reproducing these behaviours in the archive has become an increasingly complex challenge, requiring web archive playback software that is on a par with the evolution of the web as a whole.
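One small, concrete part of that challenge is link rewriting: URLs inside archived HTML must be redirected so they resolve back into the archive rather than out to the live web. A toy sketch of the idea (the `<prefix>/<timestamp>/<url>` layout follows the common Wayback-style convention, but the function is purely illustrative and ignores CSS, JavaScript and relative URLs, which real playback engines like pywb must also handle):

```python
import re

def rewrite_links(html: str, prefix: str, timestamp: str) -> str:
    """Rewrite absolute http(s) URLs in href/src attributes so they
    point at an archive replay prefix instead of the live web."""
    pattern = re.compile(r'(href|src)="(https?://[^"]+)"')

    def repl(match):
        attr, url = match.group(1), match.group(2)
        # Wayback-style replay URL: <prefix>/<timestamp>/<original URL>
        return f'{attr}="{prefix}/{timestamp}/{url}"'

    return pattern.sub(repl, html)

page = '<a href="https://example.org/about">About</a>'
print(rewrite_links(page, "https://www.webarchive.org.uk/wayback/archive",
                    "20180201120000"))
```

Even this trivial version hints at why playback is hard: dynamically constructed URLs in scripts can't be caught by static rewriting at all, which is one reason modern engines intercept requests at replay time as well.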

Why change?

Currently, we use the OpenWayback playback system, originally developed by the Internet Archive. In more recent years, however, Rhizome have led the development of a new playback engine called pywb (Python Wayback). This Python toolkit for accessing web archives is part of the Webrecorder project and provides a modern and powerful alternative implementation that is run as an open source project. This has led to rapid adoption of pywb; the toolkit is already being used by the Portuguese Web Archive, perma.cc, the UK National Archives, the UK Parliamentary Archive and a number of others.

Open development
To meet our needs we will have to modify pywb, but as strong believers in open source development, all of this work will be done in the open and, wherever appropriate, we will fold the improvements back into the core pywb project.

If all goes to plan, we expect to contribute a number of these improvements back to pywb for others to use.

Other UKWA-specific changes, like theming, implementing our Legal Deposit restrictions, and deployment support, will be maintained separately.

Initially we will work with Rhizome to ensure our staff and curators can access our archived material via both pywb and OpenWayback. If the new playback tool performs as expected, we will move towards using pywb to support public access to all our web archives.

22 December 2017

What can you find in the (Beta) UK Web Archive?

We recently launched a new Beta interface for searching and browsing our web archive collections but what can you expect to find?

UKWA have been collecting websites since 2005 in a range of different ways, using changing software and hardware. This means that, behind the scenes, we can't bring all of the material collected into one place all at once. What isn't there currently will be added over the next six months (look out for notices on Twitter).

What is available now?

At launch on 5 December 2017 the Beta website includes all of the websites that have been 'frequently crawled' (collected more often than annually) from 2013 to 2016. This includes a large number of 'Special Collections' selected over this time and a reasonable selection of news media.

What is coming soon?

We are aiming to add 'frequently crawled' websites from 2005-2013 to the Beta collection in January/February 2018. This will add our earliest special collections (e.g. 2005 UK General Election) and should complete all of the websites that we have permission to publicly display.

What will be available by summer 2018?

The largest and most difficult task for us is to add all of the websites that have been collected as part of the 'Legal Deposit' since 2013. We do a vast 'crawl' once per year of 'everything' that we can identify as being a UK website. This includes every .uk (and .london, .scot, .wales and .cymru) website, plus any website we identify as being on a server in the UK. This amounts to tens of millions of websites (and billions of individual assets). Due to the scale of this undertaking, we thank you for your patience.

We would love to know your experiences of using the new Beta service, let us know here: www.surveymonkey.co.uk/r/ukwasurvey01

By Jason Webber, Web Archive Engagement Manager, The British Library