TechTalk: UKWA web archiving tools on GitHub
We recently made a release to GitHub courtesy of the Open Planets Foundation. This primarily consists of two tools: the 'ArchiveExplorer' and 'ArchiveFS'.
The UK Web Archive is stored by the Library in a series of WARC files. As explained in an earlier post, we use the Internet Archive's Wayback software to replay these files and provide access to websites. However, there are times when we only need to look inside a single WARC file, and we could not find any tool, from the Internet Archive or anywhere else, for doing so.
ArchiveExplorer meets this need: the records in a WARC file can be viewed directly in the tool, or double-clicked to open them in an external application.
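To give a feel for what "the records in a WARC file" means, here is a minimal Python sketch that walks the records of an uncompressed WARC stream. It is not taken from ArchiveExplorer; the function names `iter_warc_records` and `make_record` are hypothetical, and the toy records omit mandatory header fields such as WARC-Record-ID and WARC-Date. It assumes only the basic record layout: a version line, named header fields, a blank line, then exactly Content-Length bytes of content.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Hypothetical helper for illustration; not part of ArchiveExplorer.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() == b"":
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)

        # Read "Name: value" header fields up to the blank line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The content block is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def make_record(record_type, uri, body):
    """Build a simplified toy record (assumption: only three header fields)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


sample = make_record("response", "http://example.org/", b"hello")
for headers, payload in iter_warc_records(io.BytesIO(sample)):
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

Real WARC files are usually gzip-compressed record by record, which this sketch does not handle.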
FUSE (Filesystem in Userspace) tools already exist for mounting and browsing the contents of many file types. Using the FUSE libraries, we've created ArchiveFS, a tool for mounting ARC/WARC files. Once an archive is mounted, its contents are available as ordinary files, with a directory structure mimicking that of the original site. While the filesystem is read-only, any required file operations (e.g. MIME identification, virus scanning) can be performed as normal.
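As a rough illustration of a directory structure that mimics the original site, the Python sketch below maps an archived URL to a relative path. The exact layout the tool uses is not described here, so everything in this example is an assumption: the function name `url_to_path`, the use of the host name as the top-level directory, the `index.html` default for directory-style URLs, and the choice to ignore query strings.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def url_to_path(url, index_name="index.html"):
    """Map a URL to a relative path that mirrors the site's structure.

    Hypothetical mapping for illustration only; the real tool may differ.
    """
    parts = urlsplit(url)
    path = parts.path

    # Directory-style URLs ("/" or "/images/") get a default file name.
    if path == "" or path.endswith("/"):
        path += index_name

    # Drop empty, "." and ".." segments so the result stays under the host.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]

    host = parts.hostname or "_unknown_host"
    return PurePosixPath(host, *segments)


for url in (
    "http://www.example.org/",
    "http://www.example.org/images/logo.png",
):
    print(url_to_path(url))
```

With a mapping of this kind, a scanner or MIME identifier can be pointed at the mount point and will see each archived resource as a regular file.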
Other web archiving institutions are welcome to make use of the tools, which are available from GitHub.
Web Archiving Engineer, UK Web Archive