BlogForever: a new approach to blog harvesting and preservation?
The European Commission-funded BlogForever project is developing an exciting new system to harvest, preserve, manage and reuse blog content. I'm interested not only as a supplier to the project, but also because I'm fairly familiar with the way that Heritrix copies web content, and the BlogForever spider seems to promise a different method.
The system will perform an intelligent harvesting operation which retrieves and parses hypertext as well as all other associated content (images, linked files, etc.) from blogs. It copies content not only by interrogating the RSS feed of a blog (similar to the JISC ArchivePress project), but also by parsing data out of the original HTML. The parsing step renders the captured content into structured data, expressed as XML in accordance with the project's data model.
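To make that two-pronged approach concrete, here is a minimal sketch of the sort of harvesting being described, written in Python (which is not necessarily what the project uses). The libraries (feedparser, requests, BeautifulSoup), the CSS selector and the XML element names are my own assumptions for illustration, not details of the BlogForever spider.

```python
# Illustrative sketch only -- not the BlogForever spider itself.
import xml.etree.ElementTree as ET

import feedparser
import requests
from bs4 import BeautifulSoup

def harvest_post(feed_url: str) -> ET.Element:
    """Combine RSS metadata with data scraped from the post's original HTML."""
    feed = feedparser.parse(feed_url)
    entry = feed.entries[0]                        # first post in the feed

    # The feed alone (the ArchivePress-style route) gives us title, date, link...
    post = ET.Element("post", attrib={"url": entry.link})
    ET.SubElement(post, "title").text = entry.get("title", "")
    ET.SubElement(post, "published").text = entry.get("published", "")

    # ...but the original HTML holds extra structure (comments, images, etc.).
    html = requests.get(entry.link, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    comments = ET.SubElement(post, "comments")
    for node in soup.select("div.comment"):        # selector is an assumption
        ET.SubElement(comments, "comment").text = node.get_text(strip=True)
    return post

# print(ET.tostring(harvest_post("https://example.org/feed"), encoding="unicode"))
```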
This parsing will carve semantic entities out of blog content at an unprecedented micro-level. Author names, comments, subjects, tags, categories, dates, links and many other elements will be expressed within the hierarchical XML structure. When this content is imported into the BlogForever repository (based on CERN’s Invenio platform), a public-facing access mechanism will provide a rendition of the blog which can be interrogated, queried and searched to a high degree of detail. Every rendition, and each updated version of it, will be different, representing a different time-slice of the web, without the need to create and manage multiple copies of the same content. The resulting block of XML will be much easier to store, preserve and render than the output of current web-archiving methods.
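Purely as an illustration of the kind of hierarchical record that implies, here is a hypothetical example; the element names are invented, and the project's real data model will be richer and will differ in its details.

```python
# A hypothetical record, showing how each semantic entity becomes
# individually addressable once the content is parsed into XML.
import xml.etree.ElementTree as ET

SAMPLE_RECORD = """
<post url="https://example.org/2011/10/some-post">
  <title>Some post</title>
  <author>A. Blogger</author>
  <published>2011-10-03</published>
  <categories><category>preservation</category></categories>
  <tags><tag>web-archiving</tag><tag>blogs</tag></tags>
  <links><link href="https://example.org/another-post"/></links>
  <comments>
    <comment author="Reader One" date="2011-10-04">Interesting point.</comment>
  </comments>
</post>
"""

record = ET.fromstring(SAMPLE_RECORD)
print(record.findtext("author"))                     # A. Blogger
print([t.text for t in record.findall("tags/tag")])  # ['web-archiving', 'blogs']
```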
BlogForever is proposing to create a demonstrator system to prove that it would be possible for any organisation, or consortium of like-minded organisations, to curate aggregated databases of blog content on selected themes. If there were a collection of related blogs in fields such as scientific research, media, news, politics, the arts or education, a researcher could search across that content in very detailed ways, revealing significant connections between pieces of writing. Potentially, that's an interrogation of web content of a quality that even Google cannot match.
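As a rough sketch of what that kind of detailed, cross-blog searching could look like, here is an illustrative query over a folder of records shaped like the hypothetical one above. The directory layout and element names are assumptions, and the real system would query its Invenio-based repository rather than loose XML files.

```python
# Find posts carrying a given tag, published on or after a given date,
# across an aggregated collection of parsed blog records.
import xml.etree.ElementTree as ET
from pathlib import Path

def posts_tagged(archive_dir: str, tag: str, since: str):
    """Yield (author, title, url) for matching posts. Dates are ISO strings,
    so plain string comparison gives chronological order."""
    for xml_file in Path(archive_dir).glob("*.xml"):
        post = ET.parse(xml_file).getroot()
        tags = {t.text for t in post.findall("tags/tag")}
        if tag in tags and post.findtext("published", "") >= since:
            yield post.findtext("author"), post.findtext("title"), post.get("url")

# for author, title, url in posts_tagged("archive/", "open-access", "2011-01-01"):
#     print(author, title, url)
```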
This interests me as it might also offer us the potential to think about web preservation in a new way. In most existing methods, the approach is to copy entire websites from URLs, replicating the folder structure. This approach tends to treat each URL as a single entity, and follows the object-based method of digital preservation; by which I mean that all the digital objects in a website (images, attachments, media, stylesheets) are copied and stored. We've tended to rely on sophisticated wrapper formats to manage all that content and preserve the folder hierarchy; ARC and WARC are useful in that respect, and the BagIt approach developed in California also works for websites and is capable of moving large datasets around a network efficiently.
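For contrast, here is a rough, standard-library-only sketch of the object-based approach: every file that makes up a captured site is stored as a separate object inside a wrapper structure, in this case a simple BagIt-style bag (a data/ payload directory plus a checksum manifest). The filenames are invented, and real harvesting tools such as Heritrix write ARC/WARC containers rather than laying out bags like this.

```python
# Lay captured files out as a minimal BagIt-style bag.
import hashlib
from pathlib import Path
from typing import Dict

def make_bag(bag_dir: str, captured_files: Dict[str, bytes]) -> None:
    bag = Path(bag_dir)
    (bag / "data").mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for relative_path, content in captured_files.items():
        target = bag / "data" / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)                      # one object per file
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{relative_path}")
    (bag / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")

# make_bag("blog-capture-bag", {
#     "index.html": b"<html>...</html>",
#     "css/style.css": b"body { }",
#     "images/photo.jpg": b"...",
# })
```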
Conversely, the type of content going into the BlogForever repository is material generated by the spider: it’s no longer the unstructured live web. It’s structured content, pre-processed and parsed, ready to be read by the databases that form the heart of the BlogForever system. The spider creates a “rendition” of the live web, recast into the form of a structured XML file. XML is already known to be a robust preservation format.
If these renditions of blogs were to become the target of preservation, we would potentially have a much more manageable preservation task ahead of us, with a limited range of content and behaviours to preserve and reproduce. It feels as though, instead of trying to preserve the behaviour, structure and dependencies of large numbers of digital objects, we would be preserving very large databases of aggregated content.
BlogForever (ICT No. 269963) is funded by the European Commission under the Framework Programme 7 (FP7) ICT Programme.