UK Web Archive blog

22 posts categorized "Collections"

21 July 2014

A right to be remembered

A notice placed in a Spanish newspaper 16 years ago, relating to an individual’s legal proceedings over social security debts, appeared many years later in Google’s search results. This led to the recent landmark decision by the European Court of Justice (ECJ) to uphold the Spanish data protection regulator’s initial ruling against Google – who were asked to remove the item from its index and prevent any future access to the digitised newspaper article through searches for the individual’s name.

Right to be forgotten
This “right to be forgotten” has been mentioned frequently since, a principle that an individual shall be able to remove traces of past events in their life from the Internet or other records. The “right to be forgotten” is a concept which has generated a great deal of legal, technical and moral wrangling, and is taken into account in practice but not (yet) enforced explicitly by law. As a matter of fact, the ECJ did not specifically find that there is a ‘right to be forgotten’ in the Google case, but applied existing provisions in the EU Data Protection Directive, and Article 8 of the European Convention on Human Rights, the right to respect for private and family life.

Implications for UK Law
In the UK Web Archive our aim is to collect and store information from the Internet and keep it for posterity. The question, therefore, is how the ECJ decision affects web archiving.

To answer this question, we would like to point to our existing notice and takedown policy which allows the withdrawal of public access to, or removal of deposited material under specific circumstances.

There is at present no formal and general “right to be forgotten” in UK law, on which a person may demand withdrawal of the lawfully archived copy of lawfully published material, on the sole basis that they do not wish it to be available any longer. However, the Data Protection Act 1998 is applied as the legal basis for withdrawing material containing sensitive personal data, which may cause substantial damage or distress to the data subject. Our policy is in line with the Information Commissioner's Office's response to the Google ruling, which recommends a focus on "evidence of damage and distress to individuals" when reviewing complaints.

Links only, not data
It is important to recognise that the context of the ECJ’s decision is Google’s activities in locating, indexing and making available links to websites containing information about an individual. It is not about the information itself, and the court did not consider blocking or taking down access to the newspaper article.

The purpose of Legal Deposit is to protect and ensure the “right to be remembered” by keeping snapshots of the UK internet as the nation’s digital heritage. Websites archived for Legal Deposit are only accessible within the Legal Deposit Libraries’ reading rooms and the content of the archive is not available for search engines. This significantly reduces the potential damage and impact to individuals and the libraries’ exposure to take-down requests.

Our conclusion is that the Google case does not significantly change our current notice and take-down policy for non-print Legal Deposit material. However, we will review our practice and procedures to reflect the judgement, especially with regard to indexing, cataloguing and resource discovery based on individuals’ names.

By Helen Hockx-Yu, Head of Web Archiving, The British Library

* I would like to thank my colleague Lynn Young, British Library’s Records Manager, whose various emails and internal papers provide much useful information for this blog post.

18 July 2014

UK Web Domain Crawl 2014 – One month update

The British Library started the annual collection of the UK Web on the 19th of June. Now that we are one month into a process which may continue for several more, we thought we would look at the set-up and what we have found so far.

Setting up a ‘Crawl’
Fundamentally a crawl consists of two elements: ‘seeds’ and ‘scope’. That is, a series of starting points and decisions as to how far from those starting points we permit the crawler to go. In theory, you could crawl the entire UK Web with a broad enough scope and a single seed. However, practically speaking it makes more sense to have as many starting points as possible and tighten the scope, lest the crawler’s behaviour become unpredictable.
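
To make the interplay of seeds and scope concrete, here is a minimal sketch in Python. It is not the crawler's actual code: the link graph, the host names, the `uk_only` scope rule and the `max_hops` limit are all hypothetical stand-ins, and the 'web' is just a dictionary held in memory.

```python
from collections import deque

def crawl(seeds, links, in_scope, max_hops):
    """Breadth-first walk from the seeds, following only in-scope links."""
    seen = set()
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        url, hops = queue.popleft()
        # Skip anything already visited, out of scope, or too far from a seed.
        if url in seen or not in_scope(url) or hops > max_hops:
            continue
        seen.add(url)
        for target in links.get(url, []):
            queue.append((target, hops + 1))
    return seen

# Hypothetical toy link graph standing in for the live web.
LINKS = {
    "a.example.uk": ["b.example.uk", "elsewhere.example.com"],
    "b.example.uk": ["c.example.uk"],
    "c.example.uk": [],
}

def uk_only(url):
    """Assumed scope rule for this sketch: keep hosts ending in .uk."""
    return url.endswith(".uk")

# One seed and a tight scope: only hosts within one hop are reached.
visited = crawl(["a.example.uk"], LINKS, uk_only, max_hops=1)
```

With a single seed, reaching `c.example.uk` depends on allowing more hops; adding it as a second seed reaches it without loosening the scope, which is the trade-off described above.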


For this most recent crawl the starting seed list consisted of over 19,000,000 hosts. As it's estimated that there are actually only around 3-4 million active UK websites at this point in time this might seem an absurdly high figure. The discrepancy arises partly due to the difference between what is considered to be a 'website' and a 'domain'—Nominet announced the registration of their 10,000,000th domain in 2012. However, each of those domains may have many subdomains, each serving a different site, which vastly inflates the number.

While attempting to build the seed list for the 2014 domain crawl, we counted the number of subdomains per domain: the most populous had over 700,000.
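
A count of this kind can be sketched as follows. This is an illustration only, not the script we ran: the suffix list is a deliberately short assumption (a fuller treatment would consult the Public Suffix List) and the function names are hypothetical.

```python
from collections import defaultdict

# Assumed, incomplete list of second-level suffixes under .uk.
SECOND_LEVEL_SUFFIXES = {"co.uk", "org.uk", "ac.uk", "gov.uk"}

def registered_domain(host):
    """Reduce a host name to its registered domain, e.g. example.co.uk."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in SECOND_LEVEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])

def subdomains_per_domain(hosts):
    """Count the distinct subdomains seen under each registered domain."""
    groups = defaultdict(set)
    for host in hosts:
        host = host.lower().rstrip(".")
        domain = registered_domain(host)
        if host != domain:  # the bare domain is not a subdomain of itself
            groups[domain].add(host)
    return {domain: len(subs) for domain, subs in groups.items()}

counts = subdomains_per_domain([
    "www.example.co.uk",
    "blog.example.co.uk",
    "WWW.example.co.uk",   # duplicate once case is normalised
    "example.co.uk",
    "shop.other.org.uk",
])
```

Sorting `counts` by value in descending order would surface the most populous domains first.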

The scope definition is somewhat simpler: Part 3 of The Legal Deposit Libraries (Non-Print Works) Regulations 2013 largely defines what we consider to be 'in scope'. The trick becomes translating this into automated decisions. For instance, the legislation rules that a work is in scope if "activities relating to the creation or the publication of the work take place within the United Kingdom". As a result, one potentially significant change for this crawl was the addition of a geolocation module. With this included, every URL we visit is tagged with both the IP address and the result of a geolocation lookup to determine which country hosts the resource. We will therefore automatically include UK-hosted .com, .biz, etc. sites for the first time.

Currently it seems that the crawlers have visited over 350,000 hosts not ending in “.uk” as they have content hosted in the UK.
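
The automated decision described above might look something like the sketch below. It is a simplification under stated assumptions: the lookup table `GEO_BY_IP` is a hypothetical stand-in for a real geolocation database, the IP addresses come from ranges reserved for documentation, and the rule shown covers only the automatic cases (a .uk host name or UK hosting), not the wider policy.

```python
# Hypothetical lookup table standing in for a real geolocation database.
GEO_BY_IP = {
    "192.0.2.10": "GB",
    "198.51.100.7": "US",
}

def in_scope(host, hosting_country):
    """Simplified rule: a .uk host name, or hosting in the UK (ISO code GB)."""
    if host.lower().rstrip(".").endswith(".uk"):
        return True
    return hosting_country == "GB"

def tag_url(url, host, ip):
    """Attach the IP address, country and scope decision to a visited URL."""
    country = GEO_BY_IP.get(ip)  # None when the lookup fails
    return {
        "url": url,
        "ip": ip,
        "country": country,
        "in_scope": in_scope(host, country),
    }

uk_hosted_com = tag_url("http://example.com/", "example.com", "192.0.2.10")
us_hosted_com = tag_url("http://example.com/", "example.com", "198.51.100.7")
us_hosted_uk = tag_url("http://example.co.uk/", "example.co.uk", "198.51.100.7")
```

In this sketch a failed lookup leaves a non-.uk host out of scope, which is a conservative choice made for the example rather than a statement of our policy.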

Although we automatically consider in-scope those sites served from the UK, we can include resources from other countries—the policy for which is detailed here—in order to obtain as full a representation of a UK resource as possible. Thus far we have visited 110 different countries over the course of this year’s crawl.

With regard to the number of resources archived from each country, at the top end the UK accounts for more than every other country combined, while towards the bottom of the list we have single resources being downloaded from Botswana and Macao, among others:

Visited Countries:

1. United Kingdom
2. United States
3. Germany
4. Netherlands
5. Ireland
6. France
106. Macao
107. Macedonia, Republic of
108. Morocco
109. Kenya
110. Botswana

Curiously we've discovered significantly fewer instances of malware than we did in the course of our previous domain crawl. However, we are admittedly still at a relatively early stage and those numbers are only likely to increase over the course of the crawl. The distribution, however, has remained notably similar: most of the 400+ affected sites have only a single item of malware while one site alone accounts for almost half of those found.

Data collected
So far we have archived approximately 10TB of data—the actual volume of data downloaded will likely be significantly higher as firstly, all stored data are compressed and secondly, we don’t store duplicate copies of individual resources (see our earlier blog post regarding size estimates).
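
The gap between what is downloaded and what is stored can be illustrated with a short sketch. It assumes gzip compression and SHA-1 digests of each response body purely for the sake of the example; the function name and the sample data are hypothetical and the figures it produces say nothing about the real crawl.

```python
import gzip
import hashlib

def store(resources):
    """Return (bytes_downloaded, bytes_stored) for a list of response bodies."""
    seen_digests = set()
    downloaded = 0
    stored = 0
    for body in resources:
        downloaded += len(body)
        digest = hashlib.sha1(body).hexdigest()
        if digest in seen_digests:
            continue  # an identical body is already held, so keep no second copy
        seen_digests.add(digest)
        stored += len(gzip.compress(body))
    return downloaded, stored

# Three identical pages and one different page, as made-up sample data.
page = b"<html>" + b"hello " * 1000 + b"</html>"
bodies = [page, page, page, b"other page " * 500]
downloaded, stored = store(bodies)
```

Both effects pull in the same direction: duplicates add to the downloaded total but not to the stored one, and compression shrinks whatever is kept.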

By Roger G. Coram, Web Crawl Engineer, The British Library

11 July 2014

Researcher in focus: Saskia Huc-Hepher – French in London

Saskia is a researcher at the University of Westminster and worked with the UK Web Archive in putting together a special collection of websites. This is her experience:

Curating a special collection
Over the course of the last two years, I have enjoyed periodically immersing myself in the material culture of the French community in London as it is (re)presented immaterially on-line. In a genuinely web-like fashion, a dip into one particular internet space has invariably led me inquisitively onto others, equally enlightening, and equally expressive of the here and now of this minority community, as the one before, and in turn leading to the discovery of yet more on-line microcosms of the French diaspora.

In fact, the website curation exercise has proven to be a rather addictive activity, with “just one more” site, tantalisingly hyperlinked to the one under scrutiny, delaying the often overdue computer shutdown. These meanderings, however, have a specific objective in mind: to create a collection of websites mirroring the physical presence of the French community in London in its manifold forms, be they administrative, institutional, entrepreneurial, gastronomical, cultural or personal.

Although the collection was intended to display a variety of London French on-line discourses and genres, thereby reflecting the multi-layered realities of the French presence on-land, the aim was also that they should come together as a unified whole, given a new sense of thematic coherence through their culturo-diasporic commonality and shared “home” in the Special Collection.

Open UK Web Archive vs Non-Print Legal Deposit
One of the key challenges with attempting to pull together a unified collection has been whether it can be viewed as a whole online. For websites to be published on the Open UK Web Archive website, permission needs to be granted by the website owner. Any website already captured for the Non-print Legal Deposit (from over 3.5 million domains) can be chosen but these can only be viewed within the confines of a Legal Deposit Library.

In theory, this would mean that the Non-print Legal Deposit websites selected for the London French collection would be accessible on-site in one of the official libraries, but – crucially – not available for open-access consultation on-line.

As regards this collection, therefore, the practical implications of the legislation could have given rise to a fragmented entity, an archive of two halves arbitrarily divorced from one another, one housed in the ‘ivory towers’ of the research elite and the other freely available to all via the Internet: not the coherent whole I had been so keen to create.

What to select?
In addition to aiming to produce a unified corpus, it was my vision that the rationale of the curation methodology should be informed by the “ethnosemiotic” conceptual framework conceived for my overarching London French research. My doctoral work brings together the ideas of two formerly disparate thinkers, namely (and rather fittingly perhaps) those of French ethnographer, Pierre Bourdieu, and Anglophone, Gunther Kress, of the British “school” of social semiotics, whose particular focus is on multimodal meanings.

Consequently, when selecting websites to be included in the collection, or at least earmarked for permission-seeking, it was vital that I took a three-pronged approach, choosing firstly “material” that demonstrated the official on-line presence of the French in London (what Bourdieu might term the “social field” level) and secondly the unofficial, but arguably more telling, grassroots’ representations of the community on the ground (Bourdieusian “Habitus”), as portrayed through individuals' blogs. Thirdly, for my subsequent multimodal analysis of the sites to be effective, it would also be necessary to select sites drawing on a multiplicity of modes, for instance written text, photographic images, sound, colour, layout, etc., which all websites do by default, but which some take to greater depths of complexity than others.

Video and audio not always captured
However, in the same way that the non-print legal deposit legislation challenges the integrity of the collection as a whole, so these theoretical aspirations turned out to be rather more optimistic than I had envisaged, not least because of the technical limitations of the special collections themselves.
Despite the infinite supplies of generosity and patience from the in-house team at the British Library, the fact that special collections cannot at present accommodate material from audiovisual sites, such as on-line radio and film channels, is an undeniable shortcoming (even some audio, visual and audiovisual content from standard sites can be lost in the crawling process).
It was a particular frustration when curating this collection, as audiovisual data, often containing tacit manifestations of cultural identity, are increasingly relied upon in the 21st-century digital age and thus of considerable value now and, perhaps more importantly, for future generations.

3D-Wall visualisation tool
Since completion of the inaugural collection, one or two additional positive lessons have been learned, like the “impact” value of the 3D-Wall visualisation tool. When disseminating my curation work at the Institut Français de Londres last March, before a diverse public audience, composed of community members, together with academics, historians, journalists, publishers and students, none of whom were thought to be familiar with the UK Web Archive, making use of the 3D Wall proved to be an effective and tangible way to connect with the uninitiated audience.


It brought the collection to life, transforming it from a potentially dull and faceless list of website names to a vibrant virtual “street” of London French cyberspaces, bringing a new dimension to the term “information superhighway”. It gave the audience a glimpse of the colourful multitude of webpages making up the façades of “London French Street”, to be visited and revisited beyond the confines of my presentation.

Indeed, the appeal of the collection, as displayed through the 3D Wall, generated unanticipated interest among several key players within the institutional and diplomatic bodies of the French community in London, not least the Deputy Consul and the Head of the French Lycée, both of whom expressed a keen desire to become actively involved in the project.
They found the focus on the quality of the everyday lives of the London French community a refreshing change from the media obsession with the quantity of its members, and I am convinced that it was the 3D Wall that enabled the collection to be showcased to its full potential.

In summary
To conclude, I have found the journey, from idea through curation (with the highs and lows of selections, permission-seeking and harvesting) to ultimately “going live”, a rewarding and enlightening process.

It has offered insights into the technical and administrative challenges of attempting to archive the ephemeral world of the on-line so as to preserve and protect it as well as providing rich insights into both the formal and informal representations of ‘Frenchness’ in modern London.

The corpus of websites I have curated aims to play its part in recording the collective identity of this often overlooked minority community, giving it a presence, accessible to all, for generations to come and, as such, contributing prospectively to the collective memory of this diasporic population.

By Saskia Huc-Hepher (University of Westminster)

24 June 2014

Your Web Archive Needs You!

With the centenary of the outbreak of World War One taking place this summer the British Library’s Web Archiving team has been working with colleagues across the Library and beyond to initiate a ‘First World War Centenary Special Collection’ of websites.

The collection is part of a wide range of centenary projects under way at the Library including:

These projects will enable thousands of people to engage with the centenary and to showcase the many significant items held by the Library relating to the war.

The Special Collection
The web archive collection will include a huge variety of websites related to the centenary including the various events which will be taking place; resources about the history of the war; academic sites on the meaning of the conflict in modern memory and patterns of memorialisation and critical reflections on British involvement in armed conflict more generally.

The collection will help researchers find out how the First World War shaped our society and continues to touch our lives at a personal level in our local communities and as a nation.

Archiving began in April 2014 and will continue until 2019. Some examples of websites archived so far include:

We need your help!
Do you know of a website which may be suitable for the First World War Centenary Collection? If so, we would love to hear from you, particularly if you edit or publish a WW1 themed website yourself.

Websites could include those created by museums, archives, libraries, special interest groups, universities, performing arts groups, schools and community groups, family and local history societies or individual publications. It does not cost anything to have your website archived by the British Library and involves no work on your part once nominated.

Please nominate UK based WW1 related websites through our nominate form.

If you have HLF funding for a First World War Centenary project, please send the URL (web address) to with your project reference number.

See what we have in the WW1 special collection so far.

Written by Nicola Bingham, Web Archivist, British Library

23 June 2014

Researcher in focus: Paul Thomas - UK and Canadian Parliamentary Archives

At the UK Web Archive, we’re always delighted to learn about specific uses that researchers have been able to make of our data. One such case is from the work of Paul Thomas, a doctoral student in political science at the University of Toronto.

Paul writes:

‘The UK Web Archive has been a huge asset to my dissertation. My research examines how backbench parliamentarians in Canada, the UK and Scotland are increasingly cooperating across party lines through a series of informal organizations known as All-Party Groups (APGs). For the UK, the most important source for my research is the registry of APGs that is regularly produced by the House of Commons. The document, which is published in both web and PDF formats, provides details on the more than 500 groups that are in operation, including which MPs and Peers are involved, and what funding groups have received from outside bodies like lobbyists or charities.

‘A key part of the study involved using the registries to construct a dataset that tracked membership patterns across the various groups, and how they changed over time. Unfortunately, each time a new version of the registry is produced, the previous web copy is taken down.’

While the Parliamentary Archives keep old copies of the registry on file, they only do so in PDF – a format that is not so conducive to the extraction of information into a dataset. Paul was able to find and use successive versions from the UK Web Archive going back to 2006, including a number that were missing from the Internet Archive. Paul was also able to obtain pre-2006 versions from the Internet Archive. ‘Without the UK Web Archive, I would have first needed to purchase the past registries in PDF from the Parliamentary Archives and then painstakingly copy the details on each group into a dataset.’ Overall, Paul writes, ‘the UK Web Archive saved me an enormous amount of time in compiling my data’.

Paul recently gave a paper drawing on this data at the Annual Conference of the Canadian Political Science Association:

31 July 2013

Propaganda, political communication and action on the web

[A guest post from Ian Cooke, lead curator for international studies and politics at the British Library, and curator of the current exhibition Propaganda: Power and Persuasion]

If you’ve visited our summer exhibition, Propaganda: Power and Persuasion, you will have seen our “Chorus” installation. Positioned on a large wall at the end of our exhibition, it displays a huge set of archived tweets that relate to three recent events (the Olympics opening ceremony, the debate on gun control in the United States, and President Obama’s ‘Four more years’, which became the most re-tweeted message). In our exhibition, we’re interested in the impact which social media is having on communicating and challenging influence from state and other powerful institutions.

There are different ways of looking at this. A simplification of one argument runs something like this: social media, through enabling access on an equal footing to the same shared public space, is a democratising tool that allows challenge to other forms of influence. People can respond to and question statements that appear dubious, and put across their own point of view. If propaganda is about narrowing the space for debate, then social media provides a powerful means to open it up. Additionally, the new technologies provide freely-available tools by which communities and grass-roots campaigns can network and co-ordinate action to powerful effect. I attended last year’s Netroots UK conference, where Sue Marsh gave an inspirational talk on digital activism and challenges to perceptions and prejudices used in the debate on cuts to welfare benefits for long-term sick and disabled people.       

However, some would offer a challenge to the view of social media as always empowering. The vast proliferation of information produced, and the speed by which it is received – so that events or messages are commented on immediately – means that it becomes very hard to check sources and accuracy. Misleading information, or just a point of view put strongly, can be repeated and run unchallenged. In some cases, authority and authorship can be hard to trace. Further, some would argue that new communications technologies allow new opportunities for misdirection in political campaigning. One example is so-called “astro-turfing”, where an apparently local and popular campaign  has in fact been set up and co-ordinated by a centralised and well-resourced body. Such activities have existed long before social media, but these new technologies create powerful new ways to both disguise and professionalise the role of the campaigner.

Over the past year, I had the opportunity to create a small collection of websites for the UK Web Archive as part of the Library’s ‘Curator’s Choice’ programme. This was a great opportunity to start exploring some of these issues, under the heading Political Action and Communication. The collection is more concerned with exploring the interpretation of new media as empowering and democratising, although some sites included, such as WhoFundsYou?, are concerned with issues of transparency on the web.

In the collection you’ll find examples of websites set up to support specific campaigns, or organised around specific issues, such as the national and local Frack-Off campaigns against the use of hydraulic fracturing (“fracking”) to extract shale gas from rock. The Occupy protests in London early in 2012 are represented through the Bank of Ideas, which was hosted in disused UBS offices in Hackney, and the Occupied Times of London.

There are also examples of charities and companies that support other organisations in online campaigning. These include FairSay, Social Spark, and Hands Up. All these offer advice, web design and other new media support for charities and campaigning organisations. The Sheila McKechnie Foundation uses its own website and Campaign Central directory to offer support and resources for grass-roots campaigning (on and off-line) around Britain. 

I was also interested in the way that blogging is used in campaigning and political commentary. There are examples of individual blogs including Guido Fawkes and Never Seconds. Co-authored blogs can change the style of discussion by bringing in a wider range of viewpoints. Some present views from one political perspective, such as Left Foot Forward. Others attempt to represent a wider spectrum of debate, such as Speaker’s Chair. The latter is particularly interesting in light of criticisms of political communication on the web, which argue that debate quickly polarises as people essentially only read and follow people with whom they already agree.   

One area of campaigning that I specifically left out in this collection was party political campaigning during general elections. This is of course a huge area and presents its own challenges for web archiving, as sites are often live for only a short period. The UK Web Archive has however collected websites for the 2010 general election and 2005 general election, as well as the 2009 European parliamentary elections. You can also see more examples of campaign websites and political communication in our collection on the impact of the 2010 public spending cuts.

My thanks go to everyone who supported the Political Action and Communication collection, those who suggested sites and to those who agreed to have their websites archived. All the archived websites included here can be viewed from anywhere, and that of course requires permission from owners of websites – who are often busy running or supporting campaigns. As you’ll see this is a collection that I’m just getting started with, so I need to find more examples to explore further. If you have a site to suggest, would like to comment on the collection, or have found the collection useful, then I’d love to hear from you.


[Propaganda: Power and Persuasion runs at the British Library in St Pancras until 17 September. Ian may be contacted by email at ian dot cooke at bl dot uk]

17 June 2013

Innovation in geographical context: the Cambridge Network collection

Most people have heard of Silicon Valley, the area of northern California famous for its concentration of technology companies, both well established and newly started. The term has in more recent years been applied to the area around Cambridge ("Silicon Fen") and perhaps most recently there has appeared "Silicon Roundabout", just a short distance from us here at the British Library. These three examples point towards the key importance of geographical proximity for economic development and for innovation in particular.

We are particularly pleased to have started to capture some of the web archival record for the cluster of companies, educational institutions and other organisations associated with the Cambridge Network. We were also glad to have been able to work with the Network, which exists to bring business and academia together to facilitate the sharing of ideas, and to encourage collaboration and partnership.

Not surprisingly, many parts of the University of Cambridge are represented in the collection, such as the Centre for Business Research or the Centre for Advanced Photonics and Electronics (CAPE).  There are the sites of organisations whose purpose is to facilitate knowledge exchange in general, such as the UK Innovation Research Centre or the Huntingdonshire Business Network. And there are sites from local government, the law, financial services, the charitable sector and the many other parts that go to make up the rich ecology of business in a local area.

And of course there are the companies themselves. Some are well-known, the majority are not. But of the 715 organisations represented in the collection, it may be that some of them grow to become household names. This collection will hopefully be of great value in capturing a snapshot of innovation in progress in a particular geographical area. Browse the collection here.

19 February 2013

Nineteenth century English literature: a new special collection

[A guest post from Andrea Lloyd, Curator of Printed Literary Sources, 1801-1914 at the British Library]

After almost a year of gathering I’m pleased to announce that my ‘Curator’s Choice’ collection of websites relating to 19th century English literature has now been published on the UK Web Archive.

As a curator of printed literary sources for the period 1801-1914 it doesn’t require a great leap of imagination to discover why I chose this particular topic. The collection is intended to reflect the diverse interests in the genre that are substantiated on the web. Opinions about, and interpretations of 19th century literature and its authors are constantly evolving and I hope that this resource contextualises these important scholarly and cultural changes.

The sites included so far display a broad and eclectic array of subject matters – ranging from author societies to museums; from literary adaptations to academic syllabi. 19th century literature is still hugely popular and attracts a wide audience. Given the massive interest in the likes of Jane Austen and Charles Dickens, I initially thought I would concentrate on lesser-known authors, and on literature that has grown somewhat obscure in the intervening years. This ultimately isn’t how the collection has evolved – sometimes because many of the more niche sites are published without giving any administrator contact details (so permission cannot be sought to archive the site). In other cases, the owners have not responded to permission requests – often because they have cast the sites off into the vast ‘webosphere’ to fend for themselves.

As someone who works with 19th century printed ephemera on a regular basis I found this exercise particularly fascinating. Pertinent comparisons can be drawn between the ephemeral items that are published on the web and those that were printed in the 19th century. A great deal of the ephemeral literature produced in the 19th century has survived to this day (albeit in a fragile state) – either through luck or thanks to collectors with foresight. Given its transient and contributory nature there is a great danger that similar items produced in electronic formats may not be so lucky – hence the reason the Web Archive is so vital. Hopefully my 22nd century counterpart will thank me for choosing to preserve for posterity some of the more marginal, fleeting and subjective sites available relating to the genre!

Now it’s available for all to see, I hope that others will recommend sites that they think would complement the theme and help to create a lasting snapshot of 19th century literary scholarship in the 21st century. Do get in touch via this blog, or @UKWebArchive on Twitter.

[Image by anna_t, Creative Commons BY-NC-SA]