UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

5 posts from October 2012

30 October 2012

How good is good enough? – Quality Assurance of harvested web resources

Quality Assurance is an important element of web archiving. It refers to the evaluation of harvested web resources, which determines whether pre-defined quality standards are being attained.

So the first step is to define quality, which should be a straightforward task considering the aim of web harvesting is to capture or copy resources as they are on the live web. Getting identical copies seems to be the ultimate quality standard.

The current harvesting technology unfortunately does not deliver 100% replicas of web resources. One could draw up a long list of known technical issues in web preservation: dynamic scripts, streaming media, social networks, database-driven content… The definition of quality quickly turns into a statement of what is acceptable, or of how good is good enough. Web curators and archivists regularly look at imperfect copies of web resources and make trade-off decisions about their validity as archival copies.

We use four aspects to define quality:

1. Completeness of capture: whether the intended content has been captured as part of the harvest.

2. Intellectual content: whether the intellectual content (as opposed to styling and layout) can be replayed in the Access Tool.

3. Behaviour: whether the harvested copy can be replayed including the behaviour present on the live site, such as the ability to browse between links interactively.

4. Appearance: whether the harvested copy reproduces the look and feel of the live website.

When applying these quality criteria, more emphasis is placed on intellectual content than on the appearance or behaviour of a website. As long as most of the content of a website is captured and can be replayed reasonably well, the harvested copy is submitted to the archive for long-term preservation, even if the appearance is not 100% accurate.

Example of a "good enough" copy of a web page, despite two missing images.
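To make the weighting concrete, here is a minimal sketch of how a review against these four criteria might be recorded and turned into a submit-or-recrawl decision. The scores, thresholds and decision rule are hypothetical illustrations, not the actual workflow used by the Web Archiving team.

```python
from dataclasses import dataclass

@dataclass
class QualityReview:
    """Hypothetical 0.0-1.0 scores for one harvested website."""
    completeness: float          # was the intended content captured?
    intellectual_content: float  # does the content replay in the Access Tool?
    behaviour: float             # do links and interactions work on replay?
    appearance: float            # look and feel compared with the live site

    def good_enough(self) -> bool:
        # Intellectual content and completeness carry the most weight;
        # an imperfect appearance alone does not fail a harvest.
        return self.completeness >= 0.8 and self.intellectual_content >= 0.8

# Example: two images missing and the styling slightly off, but content intact.
review = QualityReview(completeness=0.9, intellectual_content=0.95,
                       behaviour=0.85, appearance=0.7)
print("Submit to archive" if review.good_enough() else "Flag for re-crawl")
```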

We also have a list of what is “not good enough”, which helps separate the “bad” from the “good enough”. An example is so-called “live leakage”, a common problem in replaying archived resources, which occurs when links in an archived resource resolve to the current copy on the live site instead of to the archival version within a web archive. This is a particular concern when the leakage is to a payment gateway, which could confuse users and lead them to make payments for items that they do not intend to purchase or that do not exist. There are certain remedial actions we can take to address the problem, but there is as yet no global fix. Suppressing the relevant page from the web archive is often a last resort.
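One simple, automatable check for leakage of this kind is to scan the links in a replayed page and flag anything that points outside the archive's replay prefix. The sketch below uses only the Python standard library; the replay prefix and the sample HTML are illustrative, not the archive's actual URL scheme.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

ARCHIVE_PREFIX = "https://www.webarchive.org.uk/wayback/archive/"  # illustrative

class LeakageChecker(HTMLParser):
    """Collect href/src values that escape the archive's replay prefix."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.leaks = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                absolute = urljoin(self.base_url, value)
                if absolute.startswith("http") and not absolute.startswith(ARCHIVE_PREFIX):
                    self.leaks.append(absolute)

# Illustrative archived page: one rewritten link and one link "leaking" to the live site.
html = ('<a href="/wayback/archive/20121030/http://example.org/">ok</a>'
        '<a href="http://example.org/checkout">leaks to the live site</a>')
checker = LeakageChecker(base_url=ARCHIVE_PREFIX + "20121030/http://example.org/")
checker.feed(html)
print(checker.leaks)   # ['http://example.org/checkout']
```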

Quality assurance in web archiving currently relies heavily on visual comparison of the harvested and live versions of a resource, review of previous harvests and inspection of crawl logs. This is time-consuming and does not scale up. For large-scale web archive collections, especially those based on national domains, it is impossible to carry out the selective approach described above, so quality assurance, if undertaken at all, often relies on sampling. Some automatic solutions have been developed in recent years which, for example, examine HTTP status codes to identify missing content. Automatic quality assurance is an area where more development would be welcome.
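As a taste of what such automatic checking can look like, here is a minimal sketch that tallies fetch status codes from a crawl log and lists the URLs reported as missing. It assumes a Heritrix-style crawl log in which the second whitespace-separated field is the status code and the fourth is the URL; the file name is illustrative.

```python
from collections import Counter

def tally_crawl_log(path):
    """Count fetch status codes in a Heritrix-style crawl log and
    return the URLs that were reported missing (HTTP 404)."""
    counts, missing = Counter(), []
    with open(path, encoding="utf-8") as log:
        for line in log:
            fields = line.split()
            if len(fields) < 4:
                continue                       # skip malformed or blank lines
            status, url = fields[1], fields[3]
            counts[status] += 1
            if status == "404":
                missing.append(url)
    return counts, missing

counts, missing = tally_crawl_log("crawl.log")   # illustrative path
print(counts.most_common(5))
print(f"{len(missing)} URLs returned 404")
```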

Helen Hockx-Yu, Head of Web Archiving, British Library

26 October 2012

Ambassador, with these websites, you're really spoiling us

[A special guest post from Stella Wisdom, Digital Curator at the British Library.]

A little help from our friends for the video games collection

In my post last month I discussed the challenges in obtaining permission from website owners for the sites I'm selecting for the new video games collection. I've decided to try a new approach: recruiting an ambassador who is well known and respected in the video game industry to champion the collection and help explain and promote the benefits of web archiving to site owners. I would hereby like to introduce Ian Livingstone, Life President of Eidos, the company behind the success of Lara Croft: Tomb Raider. Ian brings so much experience to the party, we don't even expect him to bring chocolates!

Ian's long history in the gaming industry started in 1975 when he co-founded Games Workshop, launching Dungeons & Dragons in Europe, then building a nationwide retail chain and publishing White Dwarf magazine. In 1982, with Games Workshop co-founder Steve Jackson, he created the Fighting Fantasy role-playing gamebook series, which has sold over 16 million copies to date. Fighting Fantasy is 30 years old this year, and Ian has written a new gamebook, Blood of the Zombies, to celebrate the anniversary; it has also recently launched as an app on iOS and Android.

In 1984, Ian moved into computer games, designing Eureka, the first title released by publisher Domark. He then oversaw a merger that created Eidos Interactive, where he was Chairman for seven years. At Eidos he helped bring to market some of its most famous titles including Lara Croft: Tomb Raider. Ian became Life President of Eidos for Square Enix, which bought the publisher in 2009, and he continues to have creative input in all Eidos-label games.

Ian is known for actively supporting up-and-coming games talent as an advisor and investor in indie studios such as Playdemic, Mediatonic and Playmob. He is vice chair of trade body UKIE, a trustee of industry charity GamesAid, chair of the Video Games Skills Council, chair of Next Gen Skills, a member of the Creative Industries Council and an advisor to the British Council.

In 2010 he was asked by Ed Vaizey, the UK Minister for Culture, Communications and Creative Industries, to become a government skills champion and was tasked with producing a report reviewing the UK video games industry. The NextGen review, co-authored with Alex Hope of visual effects firm Double Negative, was published by NESTA in 2011, recommending changes in education policy, the main one being to bring computer science into the schools National Curriculum as an essential discipline.

With this wealth of experience and connections, I can’t think of anyone better to work with and I’m hopeful the collection will successfully grow with Ian’s support. This is the first time the UK Web Archive has appointed an ambassador for a collection, so it will be interesting to follow its progress. If the use of a champion is successful, then other collections may benefit from the same approach.

I’ve also been doing some advocacy work of my own this week; talking about the video game collection at GameCity7 festival, meeting many interesting people there, discussing video game heritage and engaging them in web archiving.

As ever, I’m still seeking nominations, so if you know of any sites that you think should be included, then please get in touch (at [email protected] or via Twitter @miss_wisdom) or use the nomination form.

18 October 2012

Religion, the state and the law in contemporary Britain

Another in a series of forthcoming new collections is one that I myself am curating with the working title of 'State, religion and law in contemporary Britain.'

The politics of religion in Britain looks like a much more urgent area of inquiry in 2012 than it did a decade ago. In large part due to the terrorist attacks of 9/11 and 7/7, questions about the nexus of faith and national identity have found a new urgency. At the same time, older questions about the place of faith schools and of the bishops in the House of Lords, or of abortion or euthanasia have been given new and sharper focus in a changed climate of public debate.

The period since 2001 is also marked by a massive upswing in the use of the web as a medium for religious and religio-political debate, both by the established churches and campaigning secularist organisations, and by individuals and smaller organisations, most obviously in the blogosphere.

This collection is therefore trying to capture some representative sites concerned with issues of politics, government and law that touch on the disputed role of religious symbolism, belief and practice in the public sphere in Britain.

The collection is still ongoing and suggestions are very welcome, to [email protected], or via the nomination page. So far, the collection is rather weighted towards Christian voices and organisations, and suggestions for sites from amongst other faiths would be particularly welcome.

I've attempted to capture some representative general voices, such as the blog of the human rights campaigner Peter Tatchell, which deals with religious issues; the public theology think-tank Theos, and the National Secular Society.

We have already harvested some interesting sites relating to specific issues and events, such as the official site for the 2010 Papal visit to the UK, and some of the dispute at the time about the appropriateness or otherwise of spending public money on the security arrangements for the visit, from the BBC and elsewhere.

An issue at the 2010 General Election was the place of the bishops in the House of Lords, and the Power2010 campaign pressed for that to change, as did the British Humanist Association.

An issue that has come to prominence in recent weeks is that of the appropriate time limit for abortion, and we have twelve archived instances of the site of the Society for the Protection of the Unborn Child, stretching back as far as 2005.

11 October 2012

BlogForever: a new approach to blog harvesting and preservation?

[Ed Pinsent of the University of London Computer Centre writes about the BlogForever project.]

The European Commission-funded BlogForever project is developing an exciting new system to harvest, preserve, manage and reuse blog content. I'm interested not only as a supplier to the project, but also because I'm fairly familiar with the way that Heritrix copies web content, and the BlogForever spider seems to promise a different method.

The system will perform an intelligent harvesting operation which retrieves and parses hypertext as well as all other associated content (images, linked files, etc.) from blogs. It copies content not only by interrogating the RSS feed of a blog (similar to the JISC ArchivePress project), but also by extracting data from the original HTML. The parsing action will be able to render the captured content into structured data, expressed in XML; it does this in accordance with the project's data model.
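As a rough illustration of that kind of structured harvesting, the sketch below fetches a blog's RSS feed with the Python standard library and recasts each item as a structured record. The feed URL and element names are illustrative; this is not the BlogForever spider or its data model.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

FEED_URL = "http://example-blog.org/feed.xml"   # illustrative

def harvest_feed(feed_url):
    """Fetch an RSS feed and recast each item as a structured <post> record."""
    with urlopen(feed_url) as response:
        rss = ET.parse(response).getroot()

    archive = ET.Element("blog", attrib={"source": feed_url})
    for item in rss.iter("item"):
        post = ET.SubElement(archive, "post")
        for source_tag, target_tag in [("title", "title"), ("link", "url"),
                                       ("pubDate", "date"), ("description", "body")]:
            ET.SubElement(post, target_tag).text = item.findtext(source_tag, default="")
        # A fuller harvester would also fetch each item's link and parse the HTML
        # for comments, tags and embedded files, per the project's data model.
    return ET.tostring(archive, encoding="unicode")

print(harvest_feed(FEED_URL))
```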

The result of this parsing action will carve semantic entities out of blog content at an unprecedented micro-level. Author names, comments, subjects, tags, categories, dates, links and many other elements will be expressed within the hierarchical XML structure. When this content is imported into the BlogForever repository (based on CERN's Invenio platform), a public-facing access mechanism will provide a rendition of the blog which can be interrogated, queried and searched to a high degree of detail. Every rendition, and every updated version thereof, will be different, representing a different time-slice of the web, without the need to create and manage multiple copies of the same content. The resulting block of XML will be much easier to store, preserve and render than content captured by current web-archiving methods.
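To give a flavour of the fine-grained querying this would permit, the fragment below runs simple XPath-style queries over a structured record of roughly that shape, again using only the standard library. The element names are invented for illustration rather than taken from the project's schema.

```python
import xml.etree.ElementTree as ET

# An illustrative structured record of the kind a parsing spider might emit.
record = ET.fromstring("""
<blog source="http://example-blog.org/">
  <post id="1" date="2012-10-11">
    <title>Why preserve blogs?</title>
    <tag>preservation</tag>
    <comment author="alice">Agreed!</comment>
    <comment author="bob">What about the comments themselves?</comment>
  </post>
</blog>
""")

# Every comment by a named author, across all posts in the record.
for comment in record.findall(".//comment[@author='bob']"):
    print(comment.text)

# All tags used, e.g. for faceted browsing or cross-blog aggregation.
print(sorted({tag.text for tag in record.iter("tag")}))
```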

BlogForever are proposing to create a demonstrator system to prove that it would be possible for any organisation, or consortium of like-minded organisations, to curate aggregated databases of blog content on selected themes. If there were a collection of related blogs in fields such as scientific research, media, news, politics, the arts or education, a researcher could search across that content in very detailed ways, revealing significant connections between written content. Potentially, that's an interrogation of web content of a quality that even Google cannot match.

This interests me as it might also offer us the potential to think about web preservation in a new way. In most existing methods, the approach is to copy entire websites from URLs, replicating the folder structure. This approach tends to treat each URL as a single entity and follows the object-based method of digital preservation, by which I mean that all digital objects in a website (images, attachments, media, stylesheets) are copied and stored. We've tended to rely on sophisticated wrapper formats to manage all that content and preserve the folder hierarchy; ARC and WARC are useful in that respect, and in California the BagIt approach also works for websites and is capable of moving large datasets around a network efficiently.
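For contrast, this is roughly what the object-based approach looks like on disk: every harvested URL (page, image, stylesheet and so on) becomes its own record inside the WARC container. A minimal sketch using the third-party warcio Python library, with an illustrative file name:

```python
from warcio.archiveiterator import ArchiveIterator

# List every harvested object stored as a separate record in a WARC file.
with open("example.warc.gz", "rb") as stream:        # illustrative file name
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":            # one record per fetched URL
            uri = record.rec_headers.get_header("WARC-Target-URI")
            mime = record.http_headers.get_header("Content-Type")
            print(uri, mime)
```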

Conversely, the type of content going into the BlogForever repository is material generated by the spider: it’s no longer the unstructured live web. It’s structured content, pre-processed, and parsed, fit to be read by the databases that form the heart of the BlogForever system. The spider creates a “rendition” of the live web, recast into the form of a structured XML file. XML is already known to be a robust preservation format.

If these renditions of blogs were to become the target of preservation, we would potentially have a much more manageable preservation task ahead of us, with a limited range of content and behaviours to preserve and reproduce. It feels like instead of trying to preserve the behaviour, structure and dependencies of large numbers of digital objects, we would instead be preserving very large databases of aggregated content.

BlogForever (ICT No. 269963) is funded by the European Commission under the Framework Programme 7 (FP7) ICT Programme.

04 October 2012

Exploring the lost web

There has been some attention paid recently to the rate at which the web decays. A very interesting recent article by SalahEldeen and Nelson looked at the rate at which online sources shared via social media subsequently disappear. The authors concluded that 11% would disappear in the first year, and that after that there would be a loss of 0.02% per day (that's another 7.24% per year), a startling rate of loss.

There are ways and means of doing something about it, not least through national and international web archives like ourselves. And we preserve many extremely interesting sites that are already lost from the live UK web domain.

Some of them relate to prominent public figures who have either passed away or are no longer in that public role. An example of the former is the site of the late Robin Cook, Labour MP and foreign secretary, who died in 2005; an example of the latter is that of his colleague Clare Short, who left parliamentary politics in 2010 after serving as secretary of state for international development.

Organisations often have limited lives too, of course, and amongst our collections is the site of the Welsh Language Board, set up by Act of Parliament in 1993 and abolished by later legislation in 2012. Perhaps more familiar was one of the major corporate casualties of recent years, Woolworths, which went into administration in late 2008.

Some others relate to events that have happened or campaigns that have ended. In the case of some of the more 'official' sites, we in the web archiving team can anticipate when sites are likely to be at risk, and can take steps to capture them. In other cases, we need members of the public to let us know. If you know of a site which you think is important, and that may be at risk, please let us know using our nomination form.

One such site is One and Other, Antony Gormley's live artwork on the vacant fourth plinth in Trafalgar Square. Also in the archive is David Cameron's campaign site from when he was a candidate for the constituency of Witney at the 2005 general election. Finally, there is What a difference a day makes, a remarkable blog post by someone who experienced the London terrorist attacks of 2005. All three now exist only in the web archive.