UK Web Archive blog

28 posts from December 2011

09 December 2011

Archiving websites with the Web Curator Tool

Have you ever wondered how the UK Web Archive works? Wondered about the technical system that we use to collect and provide access to websites? Wondered about how we process copies of websites before we make them available to the public? Well, you're in the right place to find out more! We're planning to use this blog to publish information like this, including technical developments as well as the collection development side of our work. But before we start talking about WARCs, crawlers, SIPs and the like, we thought it might be useful to give you some background on the system at the very heart of our web archiving activities. That system is the Web Curator Tool.

The Web Curator Tool is an open source, selective web archiving workflow management application. Initiated by the International Internet Preservation Consortium (IIPC), it was developed as a collaborative effort by the British Library and the National Library of New Zealand. We've been using it at the British Library for several years and it's also used by a number of other IIPC members worldwide.

WCT enables multiple non-technical users to manage the workflow associated with selective archiving. This includes permissions management (i.e. permission to archive websites), crawl scheduling (how often websites will be crawled, e.g. weekly, monthly, every six months, etc.), quality assurance, crawl validation, metadata management, and so forth. It’s an incredibly useful piece of software.
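
To make the idea of crawl scheduling a little more concrete, here is a minimal sketch in Python of how a list of crawl dates could be worked out from a start date and a frequency. This is purely illustrative and is not how WCT itself does it (WCT is written in Java): the frequency labels and the functions add_months and crawl_dates are hypothetical names made up for this example.

```python
import calendar
from datetime import date, timedelta

# Hypothetical frequency labels, assumed for this sketch only;
# they are not taken from WCT's own configuration.
FREQUENCIES = {
    "weekly": ("days", 7),
    "monthly": ("months", 1),
    "six-monthly": ("months", 6),
}


def add_months(start, months):
    """Shift a date by whole months, clamping the day to the end of the month."""
    index = start.year * 12 + (start.month - 1) + months
    year, month_zero_based = divmod(index, 12)
    month = month_zero_based + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def crawl_dates(first, frequency, count):
    """Return the first `count` crawl dates, starting on `first`."""
    unit, step = FREQUENCIES[frequency]
    dates = []
    for i in range(count):
        if unit == "days":
            dates.append(first + timedelta(days=step * i))
        else:
            # Always offset from the first date so a clamped day
            # (e.g. 31 January to 28 February) does not drift.
            dates.append(add_months(first, step * i))
    return dates


if __name__ == "__main__":
    for crawl in crawl_dates(date(2011, 12, 9), "six-monthly", 3):
        print(crawl.isoformat())
```

Each date is calculated from the first crawl date rather than from the previous one, so a monthly schedule that starts on the 31st returns to the 31st whenever the month allows it.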

WCT is written in Java and designed to run on Apache Tomcat. A typical installation has the following components:

  • The WCT Core
  • A relational database
  • The WCT Digital Asset Store (where harvested material is held prior to archiving)
  • One or more Harvest Agents (responsible for gathering web content; we use Heritrix)

WCT is the web archiving back office. Our front end to the web archive is our public website, which allows you to browse and search the archive. The archived material itself is replayed by the Internet Archive's Wayback Machine software.

WCT is freely available for anyone to install under the Apache License and can be downloaded from SourceForge.

08 December 2011

07 December 2011

Advent Calendar: December 7th

Bank of England

Website archived on: 7th December 2005

Archived by: The British Library

Subject classifications: 
Government, Law & Politics;
Business, Economy & Industry > Trade, Commerce, and Globalisation;
Business, Economy & Industry > Banking, Insurance, Accountancy and Financial Economics

Special collection? Yes - Credit Crunch

Still available on live web? Yes, though content is frequently updated. 

Other instances available? Yes - 27 in total, collected between 2004 and 2011. The website has been redesigned 3 times over this period. 

 

06 December 2011

Advent Calendar: December 6th

The Hutton Inquiry

Investigation into the Circumstances surrounding the Death of Dr David Kelly in 2003.

Website archived on: 6th December 2004

Still available on the live web? Yes

Archived by: The National Archives

Subject classifications: Government, Law & Politics > Public Inquiries

Special Collection? No

Other instances? Yes: 17 others, collected between Oct 2004 & Feb 2005

(Editor's note: updated 10.14am with correct reference to live site)

05 December 2011

Advent Calendar: December 5th

Journal: Bandolier

'Evidence-based thinking about health care'

Archived on: December 5th 2010

Archived by: The Wellcome Library

Subject classifications: Medicine & Health

Special Collection? No

Still available on the live web? At a different URL

Other instances? Yes, 17 since 2004

 

04 December 2011

Advent Calendar: December 4th

Newport Town Poet: Goff Morgan

Archived on: 4th December 2004

Archived by: Llyfrgell Genedlaethol Cymru / The National Library of Wales

Subject classifications: Arts & Humanities > Literature

Special collection? No

Still available on live web? Uncertain - whilst not a 404, only a partial view of the top banner is shown (possibly a frames issue)

Other instances? Yes: Dec 2005, Dec 2006, Dec 2007

03 December 2011