UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

6 posts from July 2014

25 July 2014

Special Collection – Tour de France comes to Yorkshire

As curator for sport in the British Library I have had a pretty exciting time in recent years, with plenty of sporting mega-events hitting the headlines in the UK, including the London Olympic Games and, most recently, the Tour de France starting in Yorkshire.

The latter was celebrated by the Library in a number of ways: several members of staff actually biked from St Pancras to our Yorkshire site in Boston Spa (a two-day, 200-mile journey); while I (taking the train!) helped to create a small exhibition of cycling-related collection items in cases close to the newly refurbished Boston Spa reading room. Here I am with my colleague Robert Davies in front of the exhibition.

As with most of the significant events taking place in this country, the web archiving team wanted to make a record of the Tour of Yorkshire’s online presence for future researchers, so I was given a watching brief for relevant websites.

The Grand Depart
Everyone now knows that the Grand Depart was a resounding success in attracting enthusiastic spectators all along its route from Leeds to the Mall in London. The Tour organisers expected three million people to line the roads; in the end more than double that number turned out! I anticipated a great response (similar to the success of the torch relay in 2012), so I was very keen to ensure that we archived the many different websites of the local councils and tourist offices through whose boroughs and counties the Tour would pass. Many of these websites had huge amounts of information on them, from details of local campsites, guest houses and B&Bs to special brochures with interactive maps and lists of events connected to the Tour. Opportunities for future tourism were clearly being optimised.

A mega event
It had to be borne in mind that the Grand Depart was not just a special event for the UK but formed part of a larger sociological and anthropological phenomenon: the mega-event. The mega-event is a growing area of research across a number of disciplines, not only in sport, where the development of organisations like the IOC and FIFA is of interest to sports sociologists and historians, but also in economics and cultural studies. The local activity encouraged by such events, such as the Tour-associated cultural festivals and educational projects, bears witness to their wide-ranging social impact.

Which websites to archive?
So all this had to be recorded if possible. Add to this the day-by-day, hour-by-hour reports of media organisations like broadcasters and newspapers, and there were clearly a large number of websites waiting to be gathered. One aspect did seem to be missing: protest sites, which tend to be much in evidence around events like the Olympic Games. Instead, most Tour websites were celebrating the Tour in every way possible. Where they did echo the Olympics was in their keen embrace of that event's successful outcomes, such as volunteering, with Asda sponsoring a volunteering website which called for route and crossing marshals, ‘dignitary managers’ and coordinators of all kinds.

The riders
The websites of the riders themselves proved problematic at first, as it was not clear until almost the last minute who was going to ride. In the end, as we know, Sir Bradley Wiggins bowed out, but we made sure that we kept a close eye on Chris Froome and Mark Cavendish, as well as UK-based teams like Team Sky, the British Cycling organisation and the Tour de France organisation itself. It was a huge disappointment to see British hopes being dashed by falls, but we can now follow Chris Froome's Twitter feed, from his original expressions of excitement to his report that MRI scans ‘confirmed fractures to the left wrist and right hand’. On his Facebook page, meanwhile, Mark Cavendish displays a picture of himself fresh from the operating theatre! Sad, but interesting, times.

The collection
Websites are marvellous research sources for the study of sport in particular. With their aid you can observe events as they take place from day to day, and get a marvellous feel for the atmosphere surrounding these exciting occasions. The process of archiving the Tour sites is not over. In the aftermath of such events the sites will often sum up their experiences, and others may even spring up in response to what has taken place. So the watching brief is certainly not over!

By Gill Ridgley, Lead Curator, Sociological and Cultural Studies, The British Library

23 July 2014

First World War Centenary – an online legacy in partnership with the HLF

Earlier this year, we at the UK Web Archive were delighted to reach an agreement with the Heritage Lottery Fund (HLF) to enable the archiving of a very large and significant set of websites relating to the Centenary of the First World War.

Throughout the Centenary and beyond, we will be working with the HLF to take archival copies of the websites of all HLF-funded First World War Centenary projects, and to make them available to users in the Open UK Web Archive. The first of these archived sites are already available in the First World War special collection, and we hope the collection will eventually grow to more than 1,000 sites.

HLF Funding
HLF is funding First World War projects throughout the Centenary, ranging from small community projects to major museum redevelopments. Grants start at £3,000 and funding is available through four different grants programmes: First World War: then and now (grants of £3,000 - £10,000), Our Heritage (grants of £10,000 - £100,000), Young Roots (grants of £10,000 - £50,000 for projects led by young people) and Heritage Grants (grants of more than £100,000).

Include your website
If you have HLF funding for a First World War Centenary project, please send the URL (web address) to [email protected] with your project reference number.

If you have a UK-based WW1 website NOT funded by HLF, we would still encourage you to nominate it for permanent archiving through our Nominate form.

Legacy
This set of archived websites will form a key part of our wider Centenary collection, and capture an important legacy of this most significant of anniversaries.

By Jason Webber, Web Archiving Engagement and Liaison Officer, The British Library

21 July 2014

A right to be remembered

A notice placed in a Spanish newspaper 16 years ago, relating to an individual’s legal proceedings over social security debts, appeared many years later in Google’s search results. This led to the recent landmark decision by the European Court of Justice (ECJ) to uphold the Spanish data protection regulator’s initial ruling against Google, who were asked to remove the links from their index and prevent any future access to the digitised newspaper article via searches for the individual’s name.

Right to be forgotten
This “right to be forgotten”, the principle that an individual should be able to remove traces of past events in their life from the Internet or other records, has been mentioned frequently since. It is a concept which has generated a great deal of legal, technical and moral wrangling, and one that is taken into account in practice but not (yet) enforced explicitly by law. In fact, the ECJ did not specifically find that there is a ‘right to be forgotten’ in the Google case, but applied existing provisions in the EU Data Protection Directive and Article 8 of the European Convention on Human Rights, the right to respect for private and family life.

Implications for UK law
At the UK Web Archive our aim is to collect and store information from the Internet and keep it for posterity. The question, therefore, is how the ECJ decision affects web archiving.

To answer this question, we would like to point to our existing notice and takedown policy which allows the withdrawal of public access to, or removal of deposited material under specific circumstances.

There is at present no formal and general “right to be forgotten” in UK law under which a person may demand withdrawal of the lawfully archived copy of lawfully published material on the sole basis that they do not wish it to be available any longer. However, the Data Protection Act 1998 is applied as the legal basis for withdrawing material containing sensitive personal data which may cause substantial damage or distress to the data subject. Our policy is in line with the Information Commissioner's Office's response to the Google ruling, which recommends a focus on "evidence of damage and distress to individuals" when reviewing complaints.

Links only, not data
It is important to recognise that the context of the ECJ’s decision is Google’s activities in locating, indexing and making available links to websites containing information about an individual. It is not about the information itself, and the court did not consider blocking or taking down access to the newspaper article.

The purpose of Legal Deposit is to protect and ensure the “right to be remembered” by keeping snapshots of the UK internet as the nation’s digital heritage. Websites archived for Legal Deposit are only accessible within the Legal Deposit Libraries’ reading rooms, and the content of the archive is not available to search engines. This significantly reduces the potential damage and impact to individuals and the libraries’ exposure to take-down requests.

Summary
Our conclusion is that the Google case does not significantly change our current notice and take-down policy for non-print Legal Deposit material. However, we will review our practice and procedures to reflect the judgement, especially with regard to indexing, cataloguing and resource discovery based on individuals’ names.

By Helen Hockx-Yu, Head of Web Archiving, The British Library

* I would like to thank my colleague Lynn Young, the British Library’s Records Manager, whose various emails and internal papers provided much useful information for this blog post.

18 July 2014

UK Web Domain Crawl 2014 – One month update

The British Library started the annual collection of the UK Web on the 19th of June. Now that we are one month into a process which may continue for several more, we thought we would look at the set-up and what we have found so far.

Setting up a ‘Crawl’
Fundamentally a crawl consists of two elements: ‘seeds’ and ‘scope’. That is, a series of starting points, and decisions about how far from those starting points we permit the crawler to go. In theory, you could crawl the entire UK Web with a broad enough scope and a single seed. However, practically speaking it makes more sense to have as many starting points as possible and tighten the scope, lest the crawler’s behaviour become unpredictable.
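As a rough illustration of that seeds-plus-scope idea, the sketch below shows a toy breadth-first crawler with a deliberately tight ".uk-only" scope rule. It is purely illustrative: the seed list and the fetch_links helper are hypothetical, and the real domain crawl uses a full-featured crawler with far richer scoping rules.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical seed list -- the real crawl starts from millions of hosts.
SEEDS = ["http://www.example.co.uk/", "http://www.example.org.uk/"]

def in_scope(url):
    """A deliberately tight scope rule: only follow links to .uk hosts."""
    host = urlparse(url).hostname or ""
    return host.endswith(".uk")

def crawl(seeds, fetch_links):
    """Breadth-first traversal from the seeds, constrained by in_scope().
    fetch_links(url) is assumed to download the page and return its links."""
    frontier = deque(seeds)
    visited = set()
    while frontier:
        url = frontier.popleft()
        if url in visited or not in_scope(url):
            continue
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
    return visited
```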

Seeds
For this most recent crawl the starting seed list consisted of over 19,000,000 hosts. As it's estimated that there are actually only around 3-4 million active UK websites at this point in time, this might seem an absurdly high figure. The discrepancy arises partly from the difference between what is considered to be a 'website' and a 'domain': Nominet announced the registration of their 10,000,000th .uk domain in 2012. However, each of those domains may have many subdomains, each serving a different site, which vastly inflates the number.

While attempting to build the seed list for the 2014 domain crawl, we counted the number of subdomains per domain: the most populous had over 700,000.
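A back-of-the-envelope version of that counting exercise might look like the following sketch. The registered_domain() heuristic and the sample host list are invented for illustration; a real implementation would consult the Public Suffix List rather than a hard-coded set of second-level labels.

```python
from collections import Counter

def registered_domain(host):
    """Crude approximation of the registered domain: keep three labels for
    'example.co.uk'-style hosts, two otherwise. A real implementation would
    use the Public Suffix List."""
    labels = host.lower().rstrip(".").split(".")
    second_level = {"co", "org", "ac", "gov", "ltd", "plc", "me", "net", "sch"}
    n = 3 if len(labels) >= 3 and labels[-2] in second_level else 2
    return ".".join(labels[-n:])

def subdomain_counts(hosts):
    """Count how many distinct hosts sit under each registered domain."""
    return Counter(registered_domain(h) for h in set(hosts)).most_common()

print(subdomain_counts([
    "www.example.co.uk", "blog.example.co.uk", "shop.example.co.uk",
    "www.another.org.uk",
]))
# [('example.co.uk', 3), ('another.org.uk', 1)]
```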

Scope
The scope definition is somewhat simpler: Part 3 of The Legal Deposit Libraries (Non-Print Works) Regulations 2013 largely defines what we consider to be 'in scope'. The trick becomes translating this into automated decisions. For instance, the legislation rules that a work is in scope if "activities relating to the creation or the publication of the work take place within the United Kingdom". As a result, one potentially significant change for this crawl was the addition of a geolocation module. With this included, every URL we visit is tagged with both the IP address and the result of a geolocation lookup to determine which country hosts the resource. We will therefore automatically include UK-hosted .com, .biz, etc. sites for the first time.
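In outline, the geolocation step resolves each host to an IP address and looks that address up in a country database. The sketch below is an assumption-laden stand-in for the crawler's own geolocation module, using the geoip2 Python package and a MaxMind GeoLite2 country database file as an example data source.

```python
import socket
from urllib.parse import urlparse

import geoip2.database  # pip install geoip2; needs a GeoLite2-Country.mmdb file

reader = geoip2.database.Reader("GeoLite2-Country.mmdb")  # assumed local path

def geolocate(url):
    """Tag a URL with its host's IP address and the country hosting it."""
    host = urlparse(url).hostname
    ip = socket.gethostbyname(host)                 # resolve to an IPv4 address
    country = reader.country(ip).country.iso_code   # e.g. 'GB', 'US'
    return {"url": url, "ip": ip, "country": country}

def in_scope_by_location(url):
    """Non-.uk hosts come into scope when their content is served from the UK."""
    return geolocate(url)["country"] == "GB"
```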

Currently it seems that the crawlers have visited over 350,000 hosts not ending in “.uk”, because their content is hosted in the UK.

Geolocation
Although we automatically consider in-scope those sites served from the UK, we can include resources from other countries—the policy for which is detailed here—in order to obtain as full a representation of a UK resource as possible. Thus far we have visited 110 different countries over the course of this year’s crawl.

With regard to the number of resources archived from each country, at the top end the UK accounts for more than every other country combined, while towards the bottom of the list we have single resources being downloaded from Botswana and Macao, among others:

Visited Countries:

1. United Kingdom
2. United States
3. Germany
4. Netherlands
5. Ireland
6. France
...
106. Macao
107. Macedonia, Republic of
108. Morocco
109. Kenya
110. Botswana

Malware
Curiously we've discovered significantly fewer instances of malware than we did in the course of our previous domain crawl. However, we are admittedly still at a relatively early stage and those numbers are only likely to increase over the course of the crawl. The distribution, however, has remained notably similar: most of the 400+ affected sites have only a single item of malware while one site alone accounts for almost half of those found.

Data collected
So far we have archived approximately 10TB of data, though the actual volume of data downloaded will likely be significantly higher: firstly, all stored data are compressed and, secondly, we don’t store duplicate copies of individual resources (see our earlier blog post regarding size estimates).
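The de-duplication half of that is conceptually simple: if a resource's content digest has been seen before, only a small reference is recorded rather than a second full copy. The following is a much-simplified sketch of the idea (the archive itself records this in WARC files rather than an in-memory dictionary, and the function names here are invented):

```python
import hashlib

seen_digests = {}  # content digest -> URL of the copy already stored

def store(url, payload):
    """Store a payload only once; later duplicates become references."""
    digest = hashlib.sha1(payload).hexdigest()
    if digest in seen_digests:
        return {"url": url, "duplicate_of": seen_digests[digest], "stored": False}
    seen_digests[digest] = url
    return {"url": url, "digest": digest, "stored": True}

print(store("http://example.co.uk/logo.png", b"\x89PNG..."))
print(store("http://example.org.uk/logo.png", b"\x89PNG..."))  # same bytes: not stored again
```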

By Roger G. Coram, Web Crawl Engineer, The British Library

11 July 2014

Researcher in focus: Saskia Huc-Hepher – French in London

Saskia is a researcher at the University of Westminster and worked with the UK Web Archive in putting together a special collection of websites. This is her experience:

Curating a special collection
Over the course of the last two years, I have enjoyed periodically immersing myself in the material culture of the French community in London as it is (re)presented immaterially on-line. In a genuinely web-like fashion, a dip into one particular internet space has invariably led me inquisitively on to others, each as enlightening and as expressive of the here and now of this minority community as the one before, and in turn leading to the discovery of yet more on-line microcosms of the French diaspora.

In fact, the website curation exercise has proven to be a rather addictive activity, with “just one more” site, tantalisingly hyperlinked to the one under scrutiny, delaying the often overdue computer shutdown. These meanderings, however, have a specific objective in mind: to create a collection of websites mirroring the physical presence of the French community in London in its manifold forms, be they administrative, institutional, entrepreneurial, gastronomical, cultural or personal.

Although the collection was intended to display a variety of London French on-line discourses and genres, thereby reflecting the multi-layered realities of the French presence on-land, the aim was also that they should come together as a unified whole, given a new sense of thematic coherence through their culturo-diasporic commonality and shared “home” in the Special Collection.

Open UK Web Archive vs Non-Print Legal Deposit
One of the key challenges with attempting to pull together a unified collection has been whether it can be viewed as a whole online. For websites to be published on the Open UK Web Archive website, permission needs to be granted by the website owner. Any website already captured for the Non-print Legal Deposit (from over 3.5 million domains) can be chosen, but these can only be viewed within the confines of a Legal Deposit Library.

In theory, this would mean that the Non-print Legal Deposit websites selected for the London French collection would be accessible on-site in one of the official libraries, but – crucially – not available for open-access consultation on-line.

As regards this collection, therefore, the practical implications of the legislation could have given rise to a fragmented entity, an archive of two halves arbitrarily divorced from one another, one housed in the ‘ivory towers’ of the research elite and the other freely available to all via the Internet: not the coherent whole I had been so keen to create.

What to select?
In addition to aiming to produce a unified corpus, it was my vision that the rationale of the curation methodology should be informed by the “ethnosemiotic” conceptual framework conceived for my overarching London French research. My doctoral work brings together the ideas of two formerly disparate thinkers, namely (and rather fittingly perhaps) the French ethnographer Pierre Bourdieu and the Anglophone Gunther Kress, of the British “school” of social semiotics, whose particular focus is on multimodal meanings.

Consequently, when selecting websites to be included in the collection, or at least earmarked for permission-seeking, it was vital that I took a three-pronged approach: choosing firstly “material” that demonstrated the official on-line presence of the French in London (what Bourdieu might term the “social field” level), and secondly the unofficial, but arguably more telling, grassroots representations of the community on the ground (the Bourdieusian “habitus”), as portrayed through individuals' blogs. Thirdly, for my subsequent multimodal analysis of the sites to be effective, it would also be necessary to select sites drawing on a multiplicity of modes, for instance written text, photographic images, sound, colour, layout, etc., which all websites do by default, but which some take to greater depths of complexity than others.

Video and audio not always captured
However, in the same way that the non-print legal deposit legislation challenges the integrity of the collection as a whole, so these theoretical aspirations turned out to be rather more optimistic than I had envisaged, not least because of the technical limitations of the special collections themselves.

Despite infinite supplies of generosity and patience from the in-house team at the British Library, the fact that special collections cannot at present accommodate material from audiovisual sites, such as on-line radio and film channels (even some audio, visual and audiovisual content from standard sites can be lost in the crawling process), is an undeniable shortcoming.

It was a particular frustration when curating this collection, as audiovisual data, often containing tacit manifestations of cultural identity, are increasingly relied upon in the 21st-century digital age and are thus of considerable value now and, perhaps more importantly, for future generations.

3D-Wall visualisation tool
Since completion of the inaugural collection, one or two additional positive lessons have been learned, like the “impact” value of the 3D-Wall visualisation tool. When I presented my curation work at the Institut Français de Londres last March, before a diverse public audience of community members, academics, historians, journalists, publishers and students, none of whom were thought to be familiar with the UK Web Archive, making use of the 3D Wall proved to be an effective and tangible way to connect with an uninitiated audience.

It brought the collection to life, transforming it from a potentially dull and faceless list of website names to a vibrant virtual “street” of London French cyberspaces, bringing a new dimension to the term “information superhighway”. It gave the audience a glimpse of the colourful multitude of webpages making up the façades of “London French Street”, to be visited and revisited beyond the confines of my presentation.

Indeed, the appeal of the collection, as displayed through the 3D Wall, generated unanticipated interest among several key players within the institutional and diplomatic bodies of the French community in London, not least the Deputy Consul and the Head of the French Lycée, both of whom expressed a keen desire to become actively involved in the project.

They found the focus on the quality of the everyday lives of the London French community a refreshing change from the media obsession with the quantity of its members, and I am convinced that it was the 3D Wall that enabled the collection to be showcased to its full potential.

In summary
To conclude, I have found the journey from idea, through curation (with the highs and lows of selection, permission-seeking and harvesting), to ultimately “going live”, a rewarding and enlightening process.

It has offered insights into the technical and administrative challenges of attempting to archive the ephemeral on-line world so as to preserve and protect it, as well as providing a rich picture of both the formal and informal representations of ‘Frenchness’ in modern London.

The corpus of websites I have curated aims to play its part in recording the collective identity of this often overlooked minority community, giving it a presence, accessible to all, for generations to come and, as such, contributing prospectively to the collective memory of this diasporic population.

By Saskia Huc-Hepher (University of Westminster)

02 July 2014

How much of the UK's HTML is valid?

How much of the HTML in the UK Web Archive is valid HTML? Despite its apparent simplicity, this turns out to be a rather difficult question to answer.

What is valid HTML anyway?
What do we mean by valid?

Certainly, the W3C works to create appropriate web standards, and provides tools that we could re-use to assist validation against those standards.

However, the web browser software that you are using has its own opinion as to what HTML can be. For example, during the ‘browser wars’, competing software products invented individual features in order to gain market share while ignoring any effort to standardise them. Even now, although the relationship between browsers is much more amicable, some of the browser vendors still maintain their own 'living standard' that is similar to, but distinct from, the W3C HTML specification. Even aside from the issue of which definition to validate against, there is the further complication that browsers have always attempted to resolve errors and problems with malformed documents (a.k.a. ‘tag soup’), and do their best to present the content anyway.
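The practical effect is easy to demonstrate: a lenient parser will happily build a usable document tree, and extract readable text, from markup that no validator would accept. The example below uses BeautifulSoup purely as an illustration; it is not part of our indexing stack.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Unclosed tags, unquoted attribute, no <html> or <body>: classic 'tag soup'.
tag_soup = "<p>First paragraph<p>Second, with <b>bold <i>and italics</p><img src=pic.jpg>"

soup = BeautifulSoup(tag_soup, "html.parser")
print(soup.get_text(" ", strip=True))  # the text is still perfectly recoverable
print(soup.prettify())                 # ...and a tree has been imposed on the soup
```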

Consequently, anecdotally at least, we know that a lot of the HTML on the web is perfectly acceptable despite being invalid, and so it is not quite clear what formal validation would achieve. Furthermore, the validation process itself is quite a computationally intensive procedure, and few web archives have the resources to carry out validation at scale. Based on this understanding of the costs and benefits, we do not routinely run validation processes over our web archives.

What can we look for?
However, we do process our archives in order to index the text from the resources. As each format stores text differently, we have to perform different processes to extract the text from HTML versus, say, a PDF or Office document. Therefore, we have to identify the format of each one in order to determine how to get at the text.

In fact, to help us understand our content, we run two different identification tools, Apache Tika and DROID. The former identifies the general format, and is a necessary part of text extraction processes, whereas the latter attempts to perform a more granular identification. For example, it is capable of distinguishing between the different versions of HTML.

Ideally, one would hope that each of these tools would agree on which documents are HTML, and DROID would provide a little additional information concerning the versions of the formats in use. However, it turns out that DROID takes a somewhat stricter view of what HTML should look like, whereas Tika is a little more forgiving of HTML content that strays further away from standard usage. Another way to look at this is to say that DROID attempts to partially validate the first part of an HTML page, and so those documents that Tika identifies as HTML, but DROID does not, forms a reasonable estimate of the lower-bound of the percentage of invalid HTML in the collection.
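Given per-resource identification results from both tools, that lower-bound estimate reduces to a simple ratio per year. Below is a sketch of the calculation, assuming the two tools' outputs have already been joined into (year, tika_is_html, droid_is_html) records; the field names are invented for illustration.

```python
from collections import defaultdict

def droid_miss_rate_by_year(records):
    """records: iterable of (year, tika_is_html, droid_is_html) tuples.
    Returns {year: percentage of Tika-identified HTML that DROID missed}."""
    tika_hits = defaultdict(int)
    droid_misses = defaultdict(int)
    for year, tika_is_html, droid_is_html in records:
        if tika_is_html:
            tika_hits[year] += 1
            if not droid_is_html:
                droid_misses[year] += 1
    return {year: 100.0 * droid_misses[year] / hits
            for year, hits in tika_hits.items()}

print(droid_miss_rate_by_year([
    (1999, True, False), (1999, True, True),
    (2010, True, True), (2010, False, False),
]))
# {1999: 50.0, 2010: 0.0}
```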

Results
Based on two thirds of our 1996-2010 collection (a randomly selected subset containing 1.7 billion of about 2.5 billion resources hosted under *.uk), we've determined the DROID 'misses' as a percentage of the Tika 'hits' for HTML, year by year, below:

[Chart: DROID 'misses' as a percentage of Tika HTML 'hits', by year, 1995-2010]

From there one can see that pre-2000, at least ten percent of the archived HTML is so malformed that it's difficult to even identify it as being HTML. For 1995, the percentage rises to 95%, with only 5% of the HTML being identified as such by DROID (although note that the 1995 data only contains a few hundred resources). Post-2000 the fraction of 'misses' has dropped significantly and as of 2010 appears to be around 1%.

What next?
While it is certainly good news that we can reliably identify 99% of the HTML on the contemporary web, neither Tika nor DROID performs proper validation, and so the larger question goes unanswered. While at least 1% of the current web is certainly invalid, we know from experience that the true percentage of invalid HTML is likely to be much higher. The crucial point, however, is that it remains unclear whether full, formal validity actually matters. As long as we can extract the text and metadata, we can ensure the content can be found and viewed, whether it is technically valid or not.

Although the utility of validation is not yet certain, we will still consider adding HTML validation to future iterations of our indexing toolkit. We may only pass a smaller, randomly selected sample of the HTML through that costly process, as this would still allow us to understand how much content passes formal validation, and thus how important the W3C (and the standards it promotes) is to the UK web. Perhaps it will tell us something interesting about standards adoption and format dynamics over time.

Written by Andy Jackson, Web Archiving Technical Lead, The British Library