Web archiving as a challenging business
My internship with the British Library’s Web Archiving team is coming to an end, so I will try to sum up my impressions. I have been struck by how daunting a task web archiving is, and by how many challenges it creates for professionals.
Displaying an open collection
The British Library provides the public with an open collection of websites, accessible from anywhere. These open collections are resource-heavy, because they are enriched with metadata and descriptions. This task is done by web curators and web archivists. The latter are also in charge of quality assurance: they check whether the harvest was done properly by the web crawling software. Giving open access means asking permission from the website owners. This is a very labour-intensive and slow process, which could easily require two or three times the resources currently available. To cope with urgent events, such as the next General Election, the selection is done now, while the permission requests have to be postponed to a less busy time. For some resources open access is not an option, as is the case for some news websites that charge for access to their own archives.
You’d think things would have become easier since the 2013 Legal Deposit Libraries (Non-Print Works) Regulations allowed the British Library to collect and preserve UK websites without asking permission. But new issues arise: collecting a huge quantity of data, indexing it, preserving it for the long term, and dealing with the fact that an archived website may not look the same as its live version. And then all this content must be made available to users (restricted to the reading rooms for websites without permission).
But how does one search a web archive? Anyone who has tried probably had the annoying sense that there is simply too much data to deal with. One of the challenges is therefore to provide users with efficient tools that help them find their way through this maze of data. Users in turn need to learn how to use these tools, bearing in mind that their expectations may be shaped by the habit of using Google. Yet using the web archive for scholarly purposes is a completely different approach, and a historical search engine must meet specific requirements. There is no Google-like relevance sorting here, but a plain chronological ranking enhanced with powerful options for refining results, such as events or a timeline. This research project from the L3S Research Centre in Germany is one among others involving web archives, and it shows that the tools are being built hand in hand with the researchers who use web archives as material for their work.
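To make the contrast with relevance ranking concrete, here is a minimal sketch of what a chronological search with refinement by date and event might look like. It is only an illustration: the Capture record, the search function and the sample entries are all hypothetical names and invented data, not the way the British Library or L3S tools are actually built.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Capture:
    """One archived snapshot of a page (hypothetical record layout)."""
    url: str
    captured: date
    text: str
    events: set = field(default_factory=set)


def search(captures, keyword, start=None, end=None, event=None):
    """Return matching captures, oldest first, with no relevance scoring."""
    keyword = keyword.lower()
    hits = []
    for c in captures:
        if keyword not in c.text.lower():
            continue  # keyword filter
        if start is not None and c.captured < start:
            continue  # refine: lower bound of the date range
        if end is not None and c.captured > end:
            continue  # refine: upper bound of the date range
        if event is not None and event not in c.events:
            continue  # refine: event tag
        hits.append(c)
    # Chronological ranking: order by capture date only.
    return sorted(hits, key=lambda c: c.captured)


# Invented sample data, for illustration only.
ARCHIVE = [
    Capture("http://example.org/news", date(2010, 5, 7),
            "Election results announced", {"general election 2010"}),
    Capture("http://example.org/news", date(2010, 4, 12),
            "Election campaign begins", {"general election 2010"}),
    Capture("http://example.org/blog", date(2012, 7, 27),
            "Olympic opening ceremony", {"olympics 2012"}),
    Capture("http://example.org/blog", date(2009, 1, 5),
            "Local election notes"),
]

for hit in search(ARCHIVE, "election", event="general election 2010"):
    print(hit.captured.isoformat(), hit.url, hit.text)
```

The point of the sketch is that the order of results depends only on when a page was captured, while the date range and event tag narrow the set rather than re-rank it.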
Being involved in web archiving today is really fascinating. It means observing and being part of an emerging field. This was also discussed at the opening presentation of the 2014 IIPC General Assembly.
A new job?
Web archiving is not really part of librarians’ training yet, and professionals have to learn by doing. At the moment web archiving concerns only a few people, no more than a handful, mostly based in national libraries (this is becoming less true over time, as can be seen in the composition of the IIPC).
But the issues arising with web archiving are in line with general trends for libraries. The same issues affect the management of electronic journals, which are mostly bought and displayed as packages, and mass digitisation projects. The new challenge consists in dealing with matters of scale. The core business of librarians seems to be shifting from selecting resources to highlighting them. Social media channels are one of the librarian’s new tricks for doing so. Most digital libraries have a Twitter account (see the often humorous @GallicaBnF), as do the web archives (@internetarchive, @UKWebArchive, @DLWebBnF).
Apart from the archiving work these teams of specialists do, another task is promoting web archives inside the libraries themselves. The reference staff may not be comfortable with this new material yet, and very few readers use the web archive so far. Another challenge to come!
Clémence Agostini (intern at the BL Web Archiving team from ENSSIB)