By Andrew Jackson, Web Archiving Technical Lead
It’s been over a year since we made our historical search system available, and it’s proven itself to be stable and useful. Since then, we’ve been largely focussed on changes to our crawl system, but we’ve also been planning how to take what we learned in the Big UK Domain Data for the Arts and Humanities project and use it to re-develop the UK Web Archive.
Our current website has not changed much since 2013, and doesn’t describe who we are and what we do now that the UK Legal Deposit regulations are in place. It only describes the sites we have crawled by permission, and does not reflect the tens of thousands of sites and URLs that we have curated and categorised under Legal Deposit, nor the billions of web pages in the full collection. To address these issues, we’re currently developing a new website that will open up and refresh our archives.
One of the biggest challenges is the search index. The 3.5 billion resources we’ve indexed for SHINE represent less than a third of our holdings, so now we need to scale our system up to cope with over ten billion documents, and a growth rate of 2-3 billion resources per year. We will continue working with the open source indexer we have developed, while updating our data processing platform (Apache Hadoop) and dedicating more hardware to the SolrCloud that holds our search indexes. If this all works as planned, we will be able to offer a complete search service that covers our entire archive, from 1995 to yesterday.
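To give a rough feel for what that growth rate means for capacity planning, here is a minimal back-of-the-envelope sketch. It assumes a baseline of exactly ten billion documents (the real figure is only described as "over ten billion") and simple linear growth at the low and high ends of the quoted range; the function name and the five-year horizon are illustrative choices, not part of our actual planning tools.

```python
def project_index_size(start_billion, growth_per_year_billion, years):
    """Return projected document counts (in billions) for year 0 through `years`."""
    return [start_billion + growth_per_year_billion * y for y in range(years + 1)]


# Assumed baseline: "over ten billion" taken as exactly 10 for illustration.
BASELINE_BILLION = 10.0
HORIZON_YEARS = 5  # hypothetical planning horizon

# Low and high ends of the 2-3 billion resources per year growth range.
low = project_index_size(BASELINE_BILLION, 2.0, HORIZON_YEARS)
high = project_index_size(BASELINE_BILLION, 3.0, HORIZON_YEARS)

for year, (low_n, high_n) in enumerate(zip(low, high)):
    print(f"year {year}: {low_n:.0f}-{high_n:.0f} billion documents")
```

Under these assumptions the index would need to hold somewhere between two and two-and-a-half times the baseline within five years, which is why the extra hardware matters.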
The first release of the new website is not expected to include all of the functionality offered by the SHINE prototype, just the core functionality we need to make our content and collections more available to a general audience. Quite how we bring together these two distinct views of the same underlying search index is an open question at this point in time. Later in the year, we will make the new website available as a public beta, and we’ll be looking for feedback from all our users, to help us decide how things should evolve from here.
As well as scaling up search, we’ve also been working to scale up our access service. While it doesn’t look all that different, our website playback service has been overhauled to cope with the scale of our full collection. This allows us to make our full holdings knowable, even if they aren’t openly accessible, so you get a more informative error message (and HTTP status code) if you attempt to access content that we can only make available on site at the present time. For example, if you look at our archive of google.co.uk, you can see that we have captured the Google U.K. homepage during our crawls but can’t make it openly available due to the legal framework we operate within.
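The distinction between "we don’t hold this" and "we hold this, but can’t show it here" can be sketched as a small decision function. This is purely an illustration and not the playback service’s actual code: the function name, its parameters, the messages, and the specific status codes (including 451, "Unavailable For Legal Reasons", from RFC 7725) are all assumptions chosen to show the idea.

```python
from typing import Tuple

# Status codes are assumptions for illustration; the real service may use others.
OK = 200
NOT_FOUND = 404
UNAVAILABLE_FOR_LEGAL_REASONS = 451  # defined in RFC 7725


def playback_status(captured: bool, open_access: bool, on_site: bool) -> Tuple[int, str]:
    """Hypothetical access decision for a request to play back an archived URL."""
    if not captured:
        # Nothing held for this URL at all.
        return NOT_FOUND, "No capture of this URL is held in the archive."
    if open_access or on_site:
        # Either cleared for open access, or the reader is on site.
        return OK, "Serving archived capture."
    # The capture exists, so say so, but explain why it cannot be shown here.
    return (
        UNAVAILABLE_FOR_LEGAL_REASONS,
        "This URL has been captured but is only available on site.",
    )


# A captured page without open-access permission, requested from off site.
print(playback_status(captured=True, open_access=False, on_site=False))
```

The point of the third branch is that the reader learns the capture exists, rather than receiving the same response as for a URL that was never crawled.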
The upgrades to our infrastructure will also allow us to update the tools we use to analyse our holdings. In particular, we will be attending the Archives Unleashed 4.0 Datathon and looking at the Warcbase and ArchiveSpark projects, as they provide a powerful set of open source tools and would enable us to collaborate directly with our research community. A stable data-analysis framework will also provide a platform for automated QA and report generation, and make it much easier to update our datasets.
Taken together, we believe these developments will revolutionise the way readers and researchers can use the UK Web Archive. It’s going to be an interesting year.