To process the large amount of data held in the web archive, we have for some time been using a cluster based on Apache Hadoop. The cluster is used primarily for text extraction (via Tika) and various data analytics via Hadoop's MapReduce framework. It also provides a distributed filesystem, HDFS, and it is here that we currently store a copy of the entire archive in WARC format.
With the recent release of Hadoop's WebHDFS it appears that accessing data stored in HDFS via HTTP is becoming commonplace. Earlier in 2011, Cloudera announced the release of Hoop which offers a similar API. Both offer methods to request not only single files but particular blocks of data from within those files. Something we had used in the past for demonstration purposes is Wayback's "RemoteCollection"; in addition to the typically-used "CDXCollection" where indexes and (W)ARCs are local, Wayback offers the facility to request WARC files via HTTP from a remote Wayback instance.
As we currently store a copy of all our WARC files within HDFS, and a WARC record is essentially a block of data within a WARC file, the two technologies seem ideally suited to each other.
Our initial experiments have been done using Hoop but there should be few changes involved to get something similar working with WebHDFS. Within Wayback's RemoteCollection.xml configuration:
- A 'resourceIndex' property defines the location of the remote CDX - this is configured as normal to reference another Wayback instance which has its own, local CDX.
<property name="resourceIndex">
<bean class="org.archive.wayback.resourceindex.RemoteResourceIndex">
<property name="searchUrlBase" value="http://127.0.0.1/wayback/xmlquery" />
</bean>
</property>
- A 'resourceStore' property defines the prefix of the service from which Wayback will request files - this is configured to point to Hoop.
<property name="resourceStore">
<bean class="org.archive.wayback.resourcestore.SimpleResourceStore">
<property name="prefix" value="http://127.0.0.1/hoop?" />
</bean>
</property>
The local, client-facing Wayback installation queries the remote CDX and receives the results as XML:
<?xml version="1.0" encoding="utf-8"?>
<wayback>
<request>
<startdate>19960101000000</startdate>
<numreturned>1</numreturned>
<type>urlquery</type>
<enddate>20111013132720</enddate>
<numresults>1</numresults>
<firstreturned>0</firstreturned>
<url>civictrustwales.org/ehd_pix/wag/wag_logo.gif</url>
<resultsrequested>1000</resultsrequested>
<resultstype>resultstypecapture</resultstype>
</request>
<results>
<result>
<compressedoffset>0</compressedoffset>
<mimetype>image/gif</mimetype>
<file>
/data/60588463/60395493/WARCS/BL-60395493-0.warc.gz?user.name=rcoram&offset=216696&len=4018&bogus=.warc.gz</file>
<redirecturl>-</redirecturl>
<urlkey>civictrustwales.org/ehd_pix/wag/wag_logo.gif</urlkey>
<closest>true</closest>
<digest>A6FHZCVHZ3PUBPLZ75FD2W6QMIN7RDPC</digest>
<httpresponsecode>200</httpresponsecode>
<url>
http://www.civictrustwales.org/ehd_pix/wag/wag_logo.gif</url>
<capturedate>20110630141833</capturedate>
</result>
</results>
</wayback>
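To make the shape of this response concrete, here is a minimal Python sketch that pulls out the fields relevant to fetching a record. It is only an illustration, not Wayback's own parser: the `extract_captures` helper and the trimmed-down `SAMPLE` are assumptions, and the ampersands in the sample are written as `&amp;` so that it parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the XML query response shown above.
# Ampersands are escaped so that the sample is well-formed XML.
SAMPLE = """<wayback>
  <results>
    <result>
      <compressedoffset>0</compressedoffset>
      <file>/data/60588463/60395493/WARCS/BL-60395493-0.warc.gz?user.name=rcoram&amp;offset=216696&amp;len=4018&amp;bogus=.warc.gz</file>
      <capturedate>20110630141833</capturedate>
    </result>
  </results>
</wayback>"""


def extract_captures(xml_text):
    """Return one dict per <result>, holding the file value, offset and capture date."""
    root = ET.fromstring(xml_text)
    captures = []
    for result in root.iter("result"):
        captures.append({
            # strip() copes with values that are wrapped onto their own line
            "file": result.findtext("file", default="").strip(),
            "offset": int(result.findtext("compressedoffset", default="0")),
            "capturedate": result.findtext("capturedate", default="").strip(),
        })
    return captures


captures = extract_captures(SAMPLE)
```

Each entry in `captures` then carries the `<file>` value that will be appended to the 'resourceStore' prefix, together with the offset at which Wayback expects the record to start.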
Ordinarily Wayback will receive the name of the (W)ARC file and request this from the remote server and seek to the relevant offset on receipt. More specifically, it appends the value of the <file> element above to the prefix defined in the "resourceStore" above and makes the subsequent HTTP request. By replacing this <file> value with the parameters we need to pass to Hoop we can use Wayback to make the request.
The <file> tag above shows the amendments we have made. Hoop requires the full path in HDFS, plus the offset and length of the required data. Currently Wayback requires that the 'file' being requested ends with "\.w?arc(\.gz)" and, if it finds otherwise, will append ".arc.gz". By adding the 'bogus' parameter as the final part of the value we avoid this and force Wayback to expect a WARC record. Note that the <compressedoffset> is also set to zero: we will be receiving a single record rather than a whole file in which we have to seek.
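The construction of the <file> value and the effect of the 'bogus' parameter can be sketched in Python as follows. This is an illustration under stated assumptions rather than Wayback's actual code: the helper names are hypothetical, the pattern is used exactly as quoted above, and anchoring it to the end of the value with `$` is our reading of "ends with".

```python
import re
from urllib.parse import urlencode

# Pattern as quoted in the text; the trailing "$" anchor is an assumption.
ARC_SUFFIX = re.compile(r"\.w?arc(\.gz)$")


def hoop_file_value(hdfs_path, offset, length, user):
    """Build the value placed in the CDX <file> element (hypothetical helper)."""
    params = urlencode({"user.name": user, "offset": offset, "len": length})
    # The 'bogus' parameter exists only so that the value ends in ".warc.gz".
    return f"{hdfs_path}?{params}&bogus=.warc.gz"


def name_wayback_would_request(file_value):
    """Mimic the described behaviour: append ".arc.gz" unless the suffix matches."""
    if ARC_SUFFIX.search(file_value):
        return file_value
    return file_value + ".arc.gz"


value = hoop_file_value(
    "/data/60588463/60395493/WARCS/BL-60395493-0.warc.gz",
    offset=216696,
    length=4018,
    user="rcoram",
)
unchanged = name_wayback_would_request(value)
# Without the trailing parameter the value no longer ends in ".warc.gz".
without_bogus = name_wayback_would_request(value.replace("&bogus=.warc.gz", ""))
```

With the 'bogus' parameter in place the value passes through untouched; without it, the value ends in "len=4018" and would have ".arc.gz" appended, which would corrupt the length passed to Hoop.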
After this full request is made ('resourceStore' prefix + the Hoop data), Hoop returns the WARC record, which Wayback handles as normal and renders to the browser.
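For anyone wanting to check what Hoop returns independently of Wayback, the same request can be reproduced with a few lines of Python. This is a rough sketch only: `full_request_url` and `fetch_record` are hypothetical helpers, the prefix and file value are those from the examples above, and actually calling `fetch_record` assumes a Hoop service reachable at that address.

```python
from urllib.request import urlopen


def full_request_url(prefix, file_value):
    """Concatenate the 'resourceStore' prefix and the CDX <file> value."""
    return prefix + file_value


def fetch_record(prefix, file_value, timeout=30):
    """Fetch the raw bytes of one record; requires a running Hoop service."""
    with urlopen(full_request_url(prefix, file_value), timeout=timeout) as response:
        return response.read()


PREFIX = "http://127.0.0.1/hoop?"
FILE_VALUE = (
    "/data/60588463/60395493/WARCS/BL-60395493-0.warc.gz"
    "?user.name=rcoram&offset=216696&len=4018&bogus=.warc.gz"
)
url = full_request_url(PREFIX, FILE_VALUE)
```

The number of bytes returned by `fetch_record(PREFIX, FILE_VALUE)` should correspond to the requested 'len', since only the single record is being served rather than the whole file.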
* There is an obvious limitation insofar as this requires two running instances of Wayback: one which interacts with Hoop and another which does little more than serve a CDX. Allowing the former to use a local CDX while still requesting remote files would be far simpler.
Roger Coram
Web Archiving Engineer, UK Web Archive