10 Years of the Web Archive - What have we saved - video
Talk given by Andy Jackson, Web Archiving Technical Lead at the IIPC General Assembly 2015
27 August 2015
13 August 2015
If you have read any of my previous blog posts (Beginner’s Guide to Web Archives 1, 2, 3) you will know that as part of my work at the British Library I have been curating a special web archive collection on climate change. But why did I choose this subject?
Having begun as a topic of scientific interest, the threat of climate change has developed into a potentially world-changing issue with major implications for how we live our lives. The projected impacts of climate change have profound implications for food, water and human health, and therefore for national and international policy and the ‘business as usual’ world economy. Naturally, therefore, the topic is heavily debated in the public arena, from the science of global warming and its associated effects to the policies designed to mitigate or adapt to it.
We might expect different individuals and organisations – as for any topic – to portray the issue in different ways. But how exactly is climate change characterised on the internet? For instance, while there are many websites that accept the current understanding of climate science and actively promote action to limit global warming, there are many others that partially or completely deny the science. How is the issue portrayed by these different groups? Or another example: how is the issue portrayed by renewable energy companies compared to fossil fuel companies, two groups with very conflicting interests? As climate change progresses, how will its online characterisation change? I wanted to build a collection that could help to answer some of these questions.
The collection consists of websites from different societal groups that have an active interest in the subject: for example academics; the energy sector; policy makers; special interest groups; the media and some members of the public. Websites generally fall into one of the following categories: personal blog pages/twitter feeds, non-governmental organisations/coalitions, news, government, energy companies, religious organisations, educational websites, learned societies and university institutions. The proportion of each website devoted to climate change ranges from almost 100% (some blogs/specialist websites) to more limited coverage. Some websites may be notable for the complete absence of climate change references. For example, after discussions in Cardiff, I have included each of the main UK energy companies, even when their websites do not mention climate change. Such absences were themselves considered useful in relation to the questions posed above.
The collection is an evolving beast, so if you have any suggestions regarding extra websites we could include, please fill in the online form here. We are hoping to make as many of the websites openly available as possible, but don’t forget that if you want to view the whole collection, you will need to head to your nearest legal deposit library to do so.
Peter Spooner, Science Policy Intern
10 August 2015
Coming to the end of his short time working on web archives at the British Library, science-policy intern Peter Spooner reflects on the process of creating a web archive special collection.
In my previous blog entry, I covered why we might want to create special collections. Here, I would like to examine the pros and cons of these collections in more detail.
In order for an archivist to create a special collection, he/she must come up with a subject, refine the scope of the topic to prevent the collection from becoming too large, and then collect websites. In my case (climate change), I decided to collect websites to show how the issue is portrayed across society (by charities, the energy sector, interested individuals, learned societies etc.), with a focus on the portrayal of climate science and policy. Whilst I hope such a collection will be interesting and useful, problems do exist.
In July, the British Library team headed to meet some environmental psychologists from Cardiff University. The major success of the meeting was to inform the researchers about web archiving and our climate change special collection. The resource was well received and was seen as being potentially useful. However, a number of issues came up before and during the discussion:
The last of these points I addressed in a previous blog entry, but the remainder are worth commenting on here. As I highlighted above, special collections are designed to be small and easy to use. However, such limited scope may not meet the needs of different researchers. There are several approaches one could take in order to try and resolve this issue. In some cases, collections may focus on a particular event, such as a general election. The web content associated with these collections is often short-lived, and after the event the collection would not need much updating. However, for collections on long-lasting themes, more involvement is required.
One option is for thematic special collections to remain under the control of dedicated archivists. In this case, collection users could send in suggestions of websites to include when important events occur or new web material is created. Collections could be expanded slightly to be broad enough for a variety of user interests. However, the number of collections is necessarily limited by the time commitment of the web archivists.
Another possibility is that the archivists act as technical support whilst researchers create their own collections. This approach requires a greater input on the part of the researcher, but allows more collections to be created and maintained. Since they are designed by the users, each collection should be exactly fit for purpose. However, since each researcher is likely to have slightly different interests or questions in mind, the number of collections may be very large and some collections may closely mirror one another.
Listening to talks by academics involved in the British Library’s BUDDHA project, I noticed that a common starting point for research was to create a corpus: a collection of written texts (in this case websites) of interest that could then be used to inform the research question. This approach is just what I have described above. A large number of corpora created by researchers could be stored by housing different groups of collections under common themes; so the theme of climate change could contain a number of collections on different aspects of the issue.
Perhaps the ideal model that the British Library could adopt is something of a combination of the above ideas. The Library may want to preserve the integrity of its existing special collections, which are carefully curated and designed for a wide range of users. These ‘Special Collections’ could remain under archivist control as described above, with contributions from user feedback. Alongside this core set of special collections could exist the more specific and numerous ‘Research Collections’ - those collections created by researchers. In this way the Library could make available a variety of resources that may be of interest to different users, combining the work of researchers and archivists to accommodate the limited time of both.
One thing we need to do in order to ensure the success of this combined approach is to get more and more researchers involved with creating collections. More projects like BUDDHA and further visits to interested academics will help to increase awareness of the web archive as a research resource, to grow it and turn it into an invaluable tool.
Peter Spooner, Science Policy Intern
05 August 2015
"The term 'malware' is commonly used as a catch-all phrase for unwanted software designed to infiltrate a computer...without the owner's informed consent. It includes but is not limited to viruses, Trojan horses, malware."
"Whilst highly undesirable for most contemporary web users, malware is a pervasive feature of the Internet. Many archives choose to scan harvests and identify malware but prefer not to exclude or delete them from ingest into their repositories, as exclusion threatens the integrity of a site and their prevalence across the web is a valid research interest for future users."
DPC Technology Watch Report, March 2013
The above hopefully goes some way to illustrating our concerns regarding 'viral' content in the data we archive. If overlooked or ignored, such content has the potential to prove hazardous in the future; at the same time, it forms an integral part of the Web as we know it (Professor Stephen Hawking famously stated that he thought that "computer viruses should count as life" and who are we to argue?).
Faced with such considerations, there were several options available:
The latter option was chosen. The specific implementation was that of an XOR cipher, wherein the individual bytes of the viral content are logically XOR'd with a known single-byte key. Applying the same cipher using the same key reverses the operation. Essentially this turns any record flagged as containing viral content into (theoretically safe) pseudo-gibberish.
To quickly illustrate that in Python:
key = "X"
message = "This is a secret message. Shhhhh!"
encoded = [ord(m)^ord(key) for m in message]
print(encoded)
"""
The value of 'encoded' here is just a list of numbers; attempting to convert
it to a string actually broke my Putty session.
"""
decoded = "".join([chr(e^ord(key)) for e in encoded])
print(decoded)
For all our crawling activities we use the Internet Archive's Heritrix crawler. Part of the ethos behind Heritrix's functionality is that content is processed and written to disk as quickly as possible; ideally you should be utilising all available bandwidth. With that in mind, the options for virus-scanners were few. While there are many available, few offer any kind of API, and fewer still can scan streamed content; most must instead scan content on disk. Given that disk-writes are often the slowest part of the process this was not ideal and left us with only one obvious choice: ClamAV.
We created a ViralContentProcessor module which interacts with ClamAV, streaming every downloaded resource to the running daemon and receiving the result. Anything which is found to contain a virus is flagged in the crawl log, XOR'd as described above and written to a separate series of WARC files, away from the 'clean' content.
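By way of illustration only, the sketch below shows one way a downloaded resource might be streamed to a running clamd daemon using ClamAV's INSTREAM command. This is not our actual ViralContentProcessor module, and the host and port are assumptions based on clamd's default TCP configuration.

import socket
import struct

def scan_bytes(payload, host="127.0.0.1", port=3310, chunk_size=8192):
    """Stream 'payload' to a clamd daemon and return its verdict as a string."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(b"zINSTREAM\0")
        for i in range(0, len(payload), chunk_size):
            chunk = payload[i:i + chunk_size]
            # Each chunk is prefixed with its length as a 4-byte big-endian integer.
            sock.sendall(struct.pack("!I", len(chunk)) + chunk)
        sock.sendall(struct.pack("!I", 0))  # a zero-length chunk ends the stream
        return sock.recv(4096).rstrip(b"\0").decode("utf-8")

# scan_bytes(b"...") returns, for example, 'stream: OK' for clean content or
# 'stream: Eicar-Test-Signature FOUND' for the EICAR test file.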
It is worth noting that ClamAV does, in addition to scanning for various types of malware, have the option to identify phishing attempts. However, we disabled this early on in our crawls when we discovered that it was identifying various examples of phishing emails provided by banks and similar websites to better educate their customers.
During the crawl the resources—memory usage, CPU, etc.—necessary for ClamAV are similar to those required by the crawler itself. That said, the virus-scanning is seldom the slowest part of the crawl.
All web content archived by the British Library is stored in WARC format (ISO 28500). A WARC file is essentially a series of concatenated records, each of a specific type. For instance an average HTML page might look like this:
WARC/1.0
WARC-Type: response
WARC-Target-URI: https://www.gov.uk/licence-finder/activities?activities=158_45_196_63&sectors=183
WARC-Date: 2015-07-05T08:54:13Z
WARC-Payload-Digest: sha1:ENRWKIHIXHDHI5VLOBACVIBZIOZWSZ5L
WARC-IP-Address: 185.31.19.144
WARC-Record-ID: <urn:uuid:2b437331-684e-44a8-b9cd-9830634b292e>
Content-Type: application/http; msgtype=response
Content-Length: 23174
HTTP/1.1 200 OK
Server: nginx
Content-Type: text/html; charset=utf-8
Cache-Control: max-age=1800, public
...
<!DOCTYPE html>
...
The above essentially contains the raw HTTP transaction plus additional metadata. There is also another type of record, the 'conversion' record:
A 'conversion' record shall contain an alternative version of another record's content that was created as the result of an archival process.
ISO 28500
It's this type of record we use to store our processed viral content. A record converted as per the above might appear thusly:
WARC/1.0
WARC-Type: conversion
WARC-Target-URI: https://www.gov.uk/licence-finder/activities?activities=158_45_196_63&sectors=183
WARC-Date: 2015-04-20T11:03:11Z
WARC-Payload-Digest: sha1:CWZQY7WV4BJZRG3XHDXNKSD3WEFNBDJD
WARC-IP-Address: 185.31.19.144
WARC-Record-ID: <urn:uuid:e21f098e-18e4-45b9-b192-388239150e76>
Content-Type: application/http; encoding=bytewise_xor_with_118
Content-Length: 23174
>""&YGXGVDFFV9={
...
The two records' metadata do not differ drastically—the main differences being the specified WARC-Type and the Content-Type. In this latter field we include the encoding as part of the MIME type. The two records' content, however, appears drastically different: the former record contains valid HTML while the latter contains a seemingly random series of bytes.
In order to access content stored in WARC files we typically create an index, identifying the various URLs and recording their particular offsets within a given WARC file. As mentioned earlier, content identified as containing a virus is stored in a different series of files to those of 'clean' content. Currently we do not provide access to viral content, but this separation means that, firstly, we can easily index the regular content and omit the viral, and secondly, we can, should the demand arise, easily identify and index the viral content.
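Purely to illustrate the idea (this is not the tooling we actually use, and it assumes a single uncompressed WARC file rather than the compressed files produced by real crawls), a simple URL-to-offset index over the 'response' records might be built like this:

def index_warc(path):
    """Map each response record's target URI to its byte offset in the file."""
    index = {}
    with open(path, "rb") as warc:
        while True:
            offset = warc.tell()
            line = warc.readline()
            if not line:
                break                          # end of file
            if not line.startswith(b"WARC/"):
                continue                       # blank lines between records
            headers = {}
            while True:                        # read the WARC header block
                header = warc.readline().strip()
                if not header:
                    break
                name, _, value = header.partition(b":")
                headers[name.strip().lower()] = value.strip()
            # Skip past the record payload using its declared length.
            warc.seek(int(headers[b"content-length"]), 1)
            if headers.get(b"warc-type") == b"response":
                index[headers[b"warc-target-uri"].decode("utf-8")] = offset
    return index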
The software used to replay our WARC content—OpenWayback—is capable of replaying WARCs of all types. While there would be an additional step wherein we reverse the XOR cipher, access to the content should not prove problematic.
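As a rough sketch of that step (not our actual replay code), once the key has been read from the record's Content-Type header (bytewise_xor_with_118 in the example above), reversing the cipher is a single pass over the payload:

def decode_payload(payload, key=118):
    """XOR every byte of the payload with the key, restoring the original bytes."""
    return bytes(b ^ key for b in payload)

# Because XOR is its own inverse, applying the same function with the same key
# to the decoded output produces the encoded form stored in the WARC.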
In addition to the annual crawl of the UK domain, we also undertake more frequent crawls of a smaller set of sites. These sites are crawled on a daily, weekly, etc. basis to capture more frequently-changing content. In the course of roughly 9,000 frequent crawls since April 2013 only 42 have encountered viral content.
Looking at the logs from the 2014 Domain Crawl which, as mentioned earlier, contain the results from the ClamAV scan, there were 494 distinct viruses flagged. In terms of the most common, the top ten appear to be:
In total there were 40,203 positive results from ClamAV, with Html.Exploit.CVE_2014_6342 in the top spot above, accounting for over a quarter of them.
Roger G. Coram, Web Crawl Engineer, The British Library