Viral Content in the UK Domain
"The term 'malware' is commonly used as a catch-all phrase for unwanted software designed to infiltrate a computer...without the owner's informed consent. It includes but is not limited to viruses, Trojan horses, malware."
"Whilst highly undesirable for most contemporary web users, malware is a pervasive feature of the Internet. Many archives choose to scan harvests and identify malware but prefer not to exclude or delete them from ingest into their repositories, as exclusion threatens the integrity of a site and their prevalence across the web is a valid research interest for future users."
DPC Technology Watch Report, March 2013
The above hopefully goes some way to illustrating our concerns regarding 'viral' content in the data we archive. If overlooked or ignored, such content has the potential to prove hazardous in the future; at the same time, it forms an integral part of the Web as we know it (Professor Stephen Hawking famously stated that he thought "computer viruses should count as life", and who are we to argue?).
Faced with such considerations, there were several options available:
- We could simply not store any content flagged as containing a virus. The problem here is that the effect is unpredictable—what if the content in question is the front page of a website? Excluding it effectively means that site cannot be navigated as intended.
- We could store the content but make it inaccessible.
- We could postpone the scan for viruses until after the crawl. However, this would require amending the output files to either remove or alter infected records.
- We could 'nullify' the content, making it unreadable but potentially reversible such that the original data can be read if required.
The latter option was chosen. The specific implementation was an XOR cipher, wherein the individual bytes of the viral content are logically XOR'd with a known, byte-length key. Applying the same cipher with the same key reverses the operation. Essentially this turns any record flagged as containing viral content into (theoretically safe) pseudo-gibberish.
To quickly illustrate that in Python:

```python
key = "X"
message = "This is a secret message. Shhhhh!"
encoded = [ord(m) ^ ord(key) for m in message]
```

The value of 'encoded' here is just a list of numbers; attempting to convert it to a string actually broke my PuTTY session. Applying the same operation again recovers the original text:

```python
decoded = "".join([chr(e ^ ord(key)) for e in encoded])
```
Heritrix & ClamAV
For all our crawling activities we use the Internet Archive's Heritrix crawler. Part of the ethos behind Heritrix's functionality is that content is processed and written to disk as quickly as possible; ideally you should be utilising all available bandwidth. With that in mind, the options for virus scanners were few. While there are many available, few offer any kind of API, and fewer still can parse streamed content; most must instead scan content on disk. Given that disk writes are often the slowest part of the process this was not ideal, and it left us with only one obvious choice: ClamAV.
We created a ViralContentProcessor module which interacts with ClamAV, streaming every downloaded resource to the running daemon and receiving the result. Anything which is found to contain a virus:
- ...is annotated with the output from ClamAV (this then appears in the log file).
- ...is bytewise XOR'd as previously mentioned and the amended content written to a different set of WARC files than non-viral content.
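Under the hood, streaming content to a running clamd daemon uses its INSTREAM command: the resource is sent over a socket in length-prefixed chunks and the daemon replies with a single status line. The following is a minimal sketch of that framing and of parsing the reply; the function names are illustrative, not those of our actual module:

```python
import struct


def frame_stream(data: bytes, chunk_size: int = 2048):
    """Yield the wire format of clamd's INSTREAM command: the command
    itself, then each chunk prefixed with its length as a 4-byte
    big-endian unsigned integer, then a zero-length terminator chunk."""
    yield b"zINSTREAM\x00"
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        yield struct.pack(">I", len(chunk)) + chunk
    yield struct.pack(">I", 0)  # zero-length chunk ends the stream


def parse_reply(reply: bytes):
    """Turn a clamd reply such as b'stream: Some-Signature FOUND\\x00'
    into a (infected, signature_name_or_None) pair."""
    text = reply.rstrip(b"\x00\n").decode("ascii")
    if text.endswith("FOUND"):
        # e.g. "stream: Eicar-Test-Signature FOUND"
        return True, text.split(": ", 1)[1].rsplit(" ", 1)[0]
    return False, None
```

In practice each yielded frame would be written to a socket connected to clamd, and the signature name returned by `parse_reply` is what ends up annotated in the crawl log.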
It is worth noting that ClamAV does, in addition to scanning for various types of malware, have the option to identify phishing attempts. However, we disabled this early on in our crawls when we discovered that it was flagging the example phishing emails that banks and similar websites provide to educate their customers.
During the crawl the resources—memory usage, CPU, etc.—necessary for ClamAV are similar to those required by the crawler itself. That said, the virus-scanning is seldom the slowest part of the crawl.
All web content archived by the British Library is stored in WARC format (ISO 28500). A WARC file is essentially a series of concatenated records, each of a specific type. For instance an average HTML page might look like this:
```
WARC-Type: response
Content-Type: application/http; msgtype=response

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Cache-Control: max-age=1800, public
```
The above essentially contains the raw HTTP transaction plus additional metadata. There is also another type of record: a conversion:
A 'conversion' record shall contain an alternative version of another record's content that was created as the result of an archival process.
It's this type of record we use to store our processed viral content. A record converted as per the above might appear thusly:
```
WARC-Type: conversion
Content-Type: application/http; encoding=bytewise_xor_with_118
```
The two records' metadata do not differ drastically—the main differences being the specified WARC-Type and the Content-Type; in this latter field we include the encoding as part of the MIME type. The two records' content, however, differs drastically: the former record contains valid HTML while the latter contains a seemingly random series of bytes.
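As a sketch of the transformation that encoding label implies (the key byte 118 is taken from the Content-Type above; the function name is illustrative):

```python
KEY = 118  # from the "bytewise_xor_with_118" encoding label


def xor_transform(payload: bytes, key: int = KEY) -> bytes:
    """XOR every byte of the payload with the key. Applying the same
    transform twice restores the original bytes."""
    return bytes(b ^ key for b in payload)


original = b"<html><body>Hello</body></html>"
nullified = xor_transform(original)   # what the conversion record stores
restored = xor_transform(nullified)   # what reversing the cipher yields
```

The stored bytes are unreadable as HTML, yet a second pass with the same key recovers the record exactly.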
In order to access content stored in WARC files we typically create an index, identifying the various URLs and recording their particular offsets within a given WARC file. As mentioned earlier, content identified as containing a virus is stored in a different series of files to 'clean' content. Currently we do not provide access to viral content, but this separation means that, firstly, we can easily index the regular content and omit the viral, and secondly, we can, should the demand arise, easily identify and index the viral content.
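The offset-based access such an index enables can be sketched as follows. This is a toy illustration with placeholder record bodies, not real WARC parsing; the point is that lookup becomes a seek-and-read rather than a scan:

```python
import io

# Toy stand-ins for WARC records; real records carry full headers and payloads.
records = [
    (b"http://example.com/", b"first record body"),
    (b"http://example.com/page", b"second record body"),
]

# Write the records back to back, noting each one's starting offset.
# This (URL -> offset, length) mapping is the essence of the index.
warc_file = io.BytesIO()
index = {}
for url, body in records:
    index[url] = (warc_file.tell(), len(body))
    warc_file.write(body)

# Retrieval: seek straight to the recorded offset and read the record.
offset, length = index[b"http://example.com/page"]
warc_file.seek(offset)
record = warc_file.read(length)
```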
The software used to replay our WARC content—OpenWayback—is capable of replaying WARC records of all types. While there would be an additional step wherein we reverse the XOR cipher, access to the content should not prove problematic.
In addition to the annual crawl of the UK domain, we also undertake more frequent crawls of a smaller set of sites. These sites are crawled on a daily, weekly, etc. basis to capture more frequently-changing content. In the course of roughly 9,000 frequent crawls since April 2013, only 42 have encountered viral content.
2013 Domain Crawl
- 30TB regular content.
- 4GB viral content.
2014 Domain Crawl
- 57TB regular content.
- 4.7GB viral content.
Looking at the logs from the 2014 Domain Crawl which, as mentioned earlier, contain the results of the ClamAV scans, there were 494 distinct viruses flagged. In total there were 40,203 positive results from ClamAV, with Html.Exploit.CVE_2014_6342 in top spot, accounting for over a quarter of them.
Roger G. Coram, Web Crawl Engineer, The British Library