THE BRITISH LIBRARY

UK Web Archive blog

5 posts from April 2013

30 April 2013

Dispatches from the domain crawl #1

After the blaze of publicity surrounding the advent of Non-Print Legal Deposit, the web archiving team have been busy putting the regulations into practice. This is the first of a series of dispatches from the domain crawl, documenting our discoveries as we begin crawling the whole of the UK web domain for the first time.

Firstly, some numbers. In the first week, we acquired nearly 3.6TB of compressed data (in its raw, uncompressed form, the data is ~40% larger) from some 191 million URLs. Although we staggered the launch as a series of smaller crawls, by the end of the week we reached a sustained rate of 300Mb/s. The bulk of this was from the general crawl of the whole domain, which we kicked off with a list of 3.8 million hostnames.
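To put those figures in proportion, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes decimal terabytes and reads "300Mb/s" as megabits per second, neither of which the figures above spell out, and the function names are purely illustrative.

```python
# Figures quoted in the post above.
COMPRESSED_TB = 3.6           # "nearly 3.6TB of compressed data"
UNCOMPRESSED_FACTOR = 1.4     # raw data is "~40% larger"
URLS = 191_000_000            # "some 191 million URLs"
RATE_MEGABITS_PER_S = 300     # "300Mb/s", assumed to mean megabits per second

TB = 10**12                   # assumption: decimal terabytes
SECONDS_PER_DAY = 86_400


def uncompressed_tb(compressed_tb: float = COMPRESSED_TB) -> float:
    """Estimated raw size of the first week's data, in TB."""
    return compressed_tb * UNCOMPRESSED_FACTOR


def mean_compressed_bytes_per_url() -> float:
    """Average compressed bytes stored per URL."""
    return COMPRESSED_TB * TB / URLS


def tb_per_day_at_sustained_rate(rate_mbps: float = RATE_MEGABITS_PER_S) -> float:
    """Data moved in one day if the sustained rate were held for 24 hours."""
    bytes_per_second = rate_mbps * 10**6 / 8
    return bytes_per_second * SECONDS_PER_DAY / TB


if __name__ == "__main__":
    print(f"{uncompressed_tb():.2f} TB uncompressed")                        # about 5.04
    print(f"{mean_compressed_bytes_per_url() / 1000:.1f} KB per URL")        # about 18.8
    print(f"{tb_per_day_at_sustained_rate():.2f} TB/day at sustained rate")  # about 3.24
```

On these assumptions the first week comes to roughly 5TB of raw data, or a little under 19KB of compressed data per URL.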

At this stage it is difficult to determine what our success rate is - that is, how successful we are at harvesting each resource we target. This is partly because the Heritrix crawler has what might be described as an optimistic approach to determining what in a harvested page is actually a real link to another resource (particularly when parsing JavaScript). As a result, on some of the occasions when Heritrix does not return a resource, there was no real resource to be had.
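To illustrate what an "optimistic" approach can look like, here is a minimal sketch of speculative link extraction from script text. It is not Heritrix's own code: the function name and the matching rule are assumptions made for illustration. It treats any quoted string without whitespace that contains a slash or a dot as a candidate link.

```python
import re

# Any single- or double-quoted string literal on one line of script text.
STRING_LITERAL = re.compile(r"""(["'])(.*?)\1""")


def speculative_links(script: str) -> list[str]:
    """Return string literals that *might* be links (hypothetical rule)."""
    candidates = []
    for match in STRING_LITERAL.finditer(script):
        value = match.group(2)
        has_whitespace = any(ch.isspace() for ch in value)
        # Optimistic test: no whitespace, and a slash or dot somewhere.
        if value and not has_whitespace and ("/" in value or "." in value):
            candidates.append(value)
    return candidates


if __name__ == "__main__":
    sample = '''
    var img = "/images/logo.png";
    var mime = "text/javascript";
    var label = "Contact us";
    '''
    print(speculative_links(sample))
    # ['/images/logo.png', 'text/javascript']
```

The second candidate is a MIME type, not a link. A crawler that queued it as a URL would record a failed fetch for a resource that never existed, which is the kind of miss described above.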

At this early stage it is also hard to distinguish reliably between an erroneous response for a real resource that has disappeared and an occasion on which access to a real resource was blocked. Over time, we'll learn more about how best to answer some of these questions, which will hopefully start to reveal interesting things about the UK web as a whole.

Roger Coram / Andy Jackson / Peter Webster

 

16 April 2013

Just what is the UK web domain anyway?

This sounds like a simple question. Ten seconds on most sites will tell a human viewer where a site originates from, and a little digging will produce the answer eventually. But under Non-Print Legal Deposit, we need a scalable way of settling the question without human intervention. Our remit under the new regulations extends to sites that are issued from a .uk or other UK geographic top-level domain, or where part of the publishing process takes place in the UK. (See the regulations here, and a summary here.)

[Image: UK map]

We estimate that there are just short of five million sites that end in .uk - a simple, unambiguous and machine-readable way of knowing that a site originates from within the UK and so is covered by the remit we now have. However, not all UK domains end in .uk. Many .com, .org and other sites are in fact published from within the UK, and there are few reliable figures as to how many of these there are. And so to identify which of these fall within the scope of the regulations, we need other methods.
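The machine-readable test for the simple case can be sketched in a few lines. This is an illustration only: the function name and the suffix list are assumptions, and any other UK geographic top-level domains covered by the regulations would need to be added to the list.

```python
from urllib.parse import urlsplit

# Assumption: only ".uk" is listed here; extend for other UK geographic TLDs.
UK_SUFFIXES = (".uk",)


def is_uk_host(url: str, suffixes: tuple = UK_SUFFIXES) -> bool:
    """True if the URL's hostname ends in one of the given suffixes."""
    if "//" not in url:
        # Allow bare hostnames such as "www.bl.uk".
        url = "//" + url
    host = (urlsplit(url).hostname or "").rstrip(".")
    return host.endswith(suffixes)


if __name__ == "__main__":
    for candidate in ("http://www.bl.uk/", "www.example.co.uk",
                      "http://www.example.com/uk", "http://example.uk.com/"):
        print(candidate, is_uk_host(candidate))
```

Only the hostname is examined, so a .com site with "/uk" in its path, or a host such as example.uk.com, is not counted. Sites like these are the ones that need the other methods discussed here.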

One such method is to find out where the site is hosted. www.geoiptool.com provides information on where a server is located, although it is difficult to attain 100% accuracy. Another way is to look at where the domain name is registered, using a service such as www.whois.net. However, in many cases domains are registered by one company on behalf of another or of an individual, perhaps because they want their contact details to remain private. There also isn't (yet) a straightforward way of querying any of these services at scale for thousands or indeed millions of sites.

There may be sites for which we have direct knowledge, from the site owner, that their .com domain is operated from within the UK, but that could only ever be for a tiny proportion of sites. And so after all these possibilities are exhausted, the next step is to make judgements based on the presentation of the site itself. But what in a site is "enough"? A postal address in a Contact Us page is a possibility; so is a UK-domain email address (for those sites whose owners don't use anything as twentieth century as the post).
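As a sketch of how such clues might be gathered automatically, the following looks for things that resemble UK postcodes and .uk email addresses in the text of a page. The patterns are deliberately simplified, the names are hypothetical, and a match is only a hint about where a site is published, not proof.

```python
import re

# Simplified UK postcode shape (e.g. "NW1 2DB", "SW1A 1AA"); upper case only,
# and not a full validation of real postcodes.
POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")

# An email address whose domain ends in ".uk".
UK_EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)*\.uk\b", re.IGNORECASE)


def uk_evidence(page_text: str) -> dict:
    """Collect possible signs of UK publication from page text."""
    return {
        "postcodes": POSTCODE.findall(page_text),
        "uk_emails": UK_EMAIL.findall(page_text),
    }


if __name__ == "__main__":
    # Sample text invented for illustration.
    contact_page = ("Write to us at 96 Euston Road, London NW1 2DB "
                    "or email info@example.co.uk")
    print(uk_evidence(contact_page))
```

How much evidence counts as "enough" remains a matter of judgement. A sketch like this only narrows down which sites a person would need to look at.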

What if a site doesn't disclose the information we might like, but is self-evidently from the UK (once you look at the content)? One example is Conservative Home, a prominent political site, which nowhere explicitly states that it is published in the UK. This is a particular issue for blogs, which are often hosted on a platform service such as WordPress (which is based in San Antonio, Texas) but would be thought by most to be "published" from wherever the author is based. There are similar issues in determining which parts of social media sites such as Twitter or Facebook should be treated as published from within the UK.

All of this of course supposes that all website owners tell the truth about where they are based. There may be cases where a site is published in another country but purports to be from the UK, perhaps to protect the author from a repressive regime. Conversely, an owner might, for reasons which are hard to predict, wish that their site published within the UK did not appear to be.

It's early days for Non-Print Legal Deposit, and some of these issues will become clearer as we gain more experience with just these sorts of difficult questions. 

[Map reproduced courtesy of Showeet.com, under a Creative Commons Attribution-NoDerivs 3.0 licence.]

Peter Webster, Web Archiving Engagement and Liaison Manager

12 April 2013

Health and Social Care Act 2012: collection now available

Some weeks ago we blogged about our effort to capture some of the web estate of the NHS. There was an urgency in this, as Primary Care Trusts (PCTs), Strategic Health Authorities (SHAs) and some other organisations would cease to exist at the beginning of April, as the reforms under the Health and Social Care Act 2012 took effect. And at that point those bodies would no longer be obliged to keep those sites available.

We're now delighted to be able to announce the launch of this collection of over three hundred sites. It contains the sites of the SHAs and the PCTs, grouped by region. It also includes the Local Involvement Networks (now superseded by Healthwatch).

The collection also includes sites such as that of the National Institute for Health and Clinical Excellence (NICE), the Health Protection Agency, and information about the change from the Department of Health, and from the media.

Thanks to the tireless work of Ravish Mistry, the archive of sites from the PCTs and SHAs is comprehensive, and the coverage of the other types of sites is very full. The collection represents a highly important resource for future historians of the National Health Service, as well as being a reference point for more current discussion of the implementation of the reforms as they continue.

Peter Webster
Web Archiving Engagement and Liaison Manager, British Library

05 April 2013

Non-Print Legal Deposit: it's here!

Ten years after the Legal Deposit Libraries Act 2003 established the principle, from tomorrow we shall begin archiving the whole of the UK web domain, in partnership with the other five legal deposit libraries for the UK. The new regulations are here.

I thought it worth drawing together some key information, along with some of the media coverage that has appeared this week.

The British Library's press release is here, and there are also some useful FAQs which fill in some of the detail. These cover:

There has also been much coverage in the media, including (in roughly chronological order):

Associated Press (4 April)

The Verge (4 April)

Wired (5 April)

The Guardian (5 April) (and coverage of the launch event)

BBC News (5 April)

Daily Express (5 April)

Daily Telegraph (5 April)

International Business Times (5 April)

Paidcontent.org (5 April)

Times Higher Education Supplement (6 April)

Al Jazeera (6 April) (with video)

ZDNET (by @jackschofield) (8 April)

The Spectator (Books Blog) (11 April)

I shall keep adding to this list as more coverage appears. From outside the UK, see the New Zealand Herald, La Stampa (Italy) and Computerworld New Zealand.

Peter Webster, Web Archiving Engagement and Liaison Manager

04 April 2013

Librarianship in the 21st century: a new collection

[A guest post from Rossitza Atanassova, Digital Curator at the British Library]

What better institution to archive UK librarianship-related websites than The British Library! The "evolving role of libraries in the UK" collection launches with a modest number of websites worthy of preservation, and with a call to librarians, information professionals, researchers and the public to nominate many more worthwhile sites.

The collection aims to reflect developments within the UK library community in the 21st century, in response to financial, technological, political, social and other pressures and challenges. As well as some important institutional and organisation sites (CILIP, MLA, RIN), the collection showcases collaborations (Inspire, UKRR) and advocacy blogs (Public Libraries News), special interest groups (MMIT) and fora (LILAC), communities of knowledge exchange (LIKE, #UKLibChat) and of research and practice (LIS Research Coalition, Research Active). It tries to highlight the work of inspirational professional individuals (Joeyanne Libraryanne) and groups (Heart of the School); innovative services supporting learning and research (SCARLET) and the visually impaired (RNIB, Reading Sight, Speaking Volumes). One of the more dominant themes in the collection is that of open access institutional repositories and the new role for librarians and information professionals in digital repositories and data management (RSP, UKCoRR, Open and Shut?).

I am most grateful for the enthusiastic response from the website owners whom I contacted, and a huge thank you to the Web Archiving Team for doing all the technical work behind the scenes!