English and Drama blog

15 April 2020

Collecting Literature on the Web: a Q&A

A Q&A with Carlos Rarugal, Assistant Web Archivist, about the UK Web Archive’s literary collections and the challenges faced by colleagues trying to collect the web, conducted by Callum McKean, Curator of Contemporary Literary and Creative Archives. For more news and information about the work of the UK Web Archive, visit their blog or follow them on Twitter.

‘Oh, you work at the British Library? You guys have everything, right?’ My colleagues and I hear this more often than you’d think. Usually it’s in reference to the Legal Deposit Act, which states that one copy of every book (which includes pamphlets, magazines, newspapers, sheet music and maps) published in the United Kingdom must be sent to the British Library, and that five other UK libraries have the right to request a free copy within one year of publication. But books, even if they include unusual printed formats, aren’t everything. So much knowledge, entertainment, culture and community is created and shared without ever going to print in the traditional sense. This is why Legal Deposit legislation was updated in 2013 to include regulations around ‘non-print’ works, making provision for the collection of works published online or offline in formats other than print, such as websites, blogs, e-journals and CD-ROMs. The UK Web Archive (which celebrated its 15th birthday this year!) exists to collect, make accessible and preserve web resources of scholarly and cultural importance from the UK domain in line with this legislation. But the Archive also plays an important role in selecting and curating this material. I spoke to Carlos Rarugal, Assistant Web Archivist, about some of their literary collections, the challenges they face, and how you can get involved.

Hi Carlos. Considering the UK Web Archive is relatively young, it’s often presumed that its focus is exclusively on contemporary material, but I’ve noticed that there’s a lot of content about contemporary reception of older literary material, like the Dickens Bicentenary and the 19th Century Literature collections. How do you decide which anniversaries to collect?

Hi Callum. The Dickens Bicentenary and 19th Century Literature collections are examples of what we call Special Collections, in that they’re actively curated, either by Library staff or outside experts. Web archiving happens in two streams, called the Domain Crawl and the Frequent Crawl. The Domain Crawl takes place once a year over several months and is a ‘shallow crawl’ of all known UK-hosted websites. As you can imagine, this involves many millions of websites, so we deliberately cap the amount of data per site at 500 megabytes and refer to it as a ‘shallow capture’ because it’s unlikely to capture the whole complexity of the original website.
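(Ed note: for readers who think in code, the per-site cap described above can be pictured with a small sketch. This is an illustration only, not the Archive's crawler: the class and function names are hypothetical, and skipping any resource that would exceed the cap is an assumption about how such a limit might be applied.)

```python
from collections import defaultdict

# 500 MB per site, as mentioned in the interview (binary megabytes assumed).
SITE_CAP_BYTES = 500 * 1024 * 1024


class CappedCrawlBudget:
    """Hypothetical helper: tracks bytes stored per host against a fixed cap."""

    def __init__(self, cap_bytes=SITE_CAP_BYTES):
        self.cap_bytes = cap_bytes
        self.bytes_by_host = defaultdict(int)

    def try_record(self, host, size_bytes):
        """Record a download and return True only if it fits the host's budget."""
        if self.bytes_by_host[host] + size_bytes > self.cap_bytes:
            return False
        self.bytes_by_host[host] += size_bytes
        return True


def shallow_capture(pages, budget):
    """Return the URLs kept from an iterable of (host, url, size_bytes) tuples."""
    captured = []
    for host, url, size_bytes in pages:
        if budget.try_record(host, size_bytes):
            captured.append(url)
    return captured
```

With a cap this small relative to a large site, some pages are inevitably left out, which is why the result is only a 'shallow' picture of the original.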

The Frequent Crawl is different because it’s curated and deals with a relatively small subset of websites, around 100,000, each with its own database record and metadata. Special Collections are often created to highlight frequently crawled sites. A site ends up there because a curator or external partner outside the UK Web Archive, with access to our tools, has added it for crawling, because the public has nominated the topic for inclusion, or because curators and archivists within the UK Web Archive team itself were made aware of the occasion. Our network of internal and external experts means that we’re particularly good at capturing material relating to contemporary reception of historical events, as well as more active events as they happen. (As you can imagine, our whole team is very busy with the Pandemic Outbreaks collection right now.)

A screenshot taken from the UK Web Archive's 19th Century Literature Special Collections Page, showing the breadth of content being captured, from news and blogs to academic journals.

One of the collections which I think will be of particular interest to readers of this blog is the Poetry and Zines Special Collection. This must be a very difficult collection to build in some ways, as this activity often happens in the nooks and crannies of the internet. Can you say a little bit about how these collections are built?

It would be fair to say that we have archived millions of websites that have yet to be discovered or accessed by the public. Websites that feature zines, poetry zines and journals have been captured, though their numbers are few. Contributions come from the curatorial team responsible for contemporary published collections, especially Debbie Cox and Jerry Jenkins (ed note: whose extensive work on Artists’ Books has appeared on this blog). We rely on them and their highly specialised knowledge and professional connections to build these obscure collections. Our work is always collaborative in this way. We try to partner with as many Curators and Archivists as possible, and also with external experts who are keen to get involved in web archiving.

One of these experts, Pete Hebden, who we were lucky enough to work with recently, just posted to the UK Web Archive Blog talking about his experience of exploring and helping to build this collection. If your readers are looking for a place to start, I’d recommend they take a look.

You spoke about contemporary events earlier. Two of the largest collections which are available from home seem to be the ones relating to Black and Asian Britain and LGBT issues. This isn’t surprising given how active the online discourse around social justice has become in the past ten or fifteen years. The Library has been active in these spaces more generally, with exhibitions such as Windrush: Songs in a Strange Land and Gay UK: Love, Law and Liberty, and their corresponding web spaces, Black and Asian Britain and LGBTQ Histories. But these are highly controlled spaces, sensitively and co-operatively curated by experts and activists. Collecting web content, which is relatively uncontrolled and sometimes hurtful and offensive, must present huge issues in terms of data protection and hate speech.

Yes, definitely. There are quite a few sensitive areas in the UK Web Archive: adult sites, for instance, aren’t promoted but are still archived. Personally, I think it’s important that we archive all sides of the story; and if certain narratives are controversial, we should pay particular attention. All sides of the conversation will be important for the historians of the future, who will see value in a discursive and highly active medium with rich research potential. We are limited, of course, but more so in technical than in curatorial and legal terms.

If a site is public, then we are simply operating within our obligation as outlined in the 2013 Non-Print Legal Deposit act. Whether or not we can capture this content is more of a technical issue, as many sites that include forums for discussions are highly dynamic (PHP or JavaScript heavy) and very difficult to capture using our current setup. Online discussions are also increasingly taking place on app-based platforms, which again we are unable to capture, and sites such as Facebook (to name one) are designed to inhibit crawlers, so we are prevented from archiving them.

Under these regulations, if archived content is illegal, it will be suppressed from public access. The content we collect grows daily by gigabytes and is not ‘processed’ at the point of archiving; rather, there is a delay before archived content is made available (to both curators and the public). Only part of our archived content is full-text indexed, so it would not be possible to perform deep searches on recent crawls.

There are times when content can be removed from public access for other limited reasons, for example if sensitive personal data has been mistakenly published on a live website. Our Notice and Takedown process is robust and we are quick to respond; thankfully, there have only been a handful of such requests in the past few years. Archiving is permitted under GDPR, as stated in Article 89, which allows ‘archiving in the public interest’.

The Black and Asian Britain Special Collection page is one of the largest, and contains a wide variety of literary material, from author pages to literary festivals and Twitter pages.

You mentioned social media and how difficult it is to archive. Whilst there’s clearly work going on in this area, I think the Archive does a good job of capturing some of the everyday interactions that happen online, especially on public forums. One of the most charming hubs, and I think it’s important too, is the one relating to Online Enthusiast Communities in the UK. The internet seems able to bring people with similar interests together across huge geographical distances. Most of this activity happens on specialist forums. Everything is represented here, from fans of Japanese Anime to Pylon enthusiasts; it’s a real curiosity shop. Some of the collections that might interest our readers are the Comics UK Forum and the Writers Online Forum. What are some of the issues around collecting this kind of highly social material?

People are fascinated by the spectrum of content in the Online Enthusiast Communities in the UK collection. Although a lot has been tagged into that collection, it’s likely that more sites that have already been archived are waiting to be added to it, or have yet to be nominated and are waiting to be archived. Whilst all Collections remain active, only a few at a time have the focus of curators and archivists. When a Collection is in focus, curators, archivists and external partners all work together to add targets en masse, with a sustained amount of archiving within a short time frame. Collections in focus need attention so that time-sensitive content is quickly captured. For example, Twitter feeds or news articles that should be captured, perhaps daily, require curation to add or amend the crawling schedules of those websites.
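(Ed note: the idea of amending a crawl schedule can be sketched as follows. Everything here is hypothetical and for illustration only: the frequency labels, the Target record and the example URLs are assumptions, and the Archive's actual tools and scheduling options are not described in this interview.)

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical frequency labels mapped to the minimum gap between crawls.
INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(days=7),
    "annual": timedelta(days=365),
}


@dataclass
class Target:
    """Hypothetical record for one website in a curated collection."""
    url: str
    frequency: str
    last_crawled: date


def due_for_crawl(targets, today):
    """Return the URLs whose crawl interval has elapsed as of `today`."""
    return [
        t.url for t in targets
        if today - t.last_crawled >= INTERVALS[t.frequency]
    ]


targets = [
    Target("https://example.org/news", "daily", date(2020, 4, 14)),
    Target("https://example.org/forum", "annual", date(2019, 12, 1)),
]
```

In this sketch, changing the forum's frequency from "annual" to "weekly" is the kind of curatorial amendment that would bring a time-sensitive site back into the next round of crawling.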

Another issue when archiving uncommon content is the lack of discovery: if we are unaware of a site’s existence then it is unlikely that we will capture its content, even in the Domain Crawl. This issue is compounded when delving deeper into a site, that is, looking at its online forums where regular discussions occur. Without the proper intervention of users, these overlooked forums may not be crawled, and if they are, they may only receive shallow crawls that occur infrequently. We do have forums that are being captured often, sometimes daily. However, it is rare that we would have a frequent crawl of a forum unless a user updated a record accordingly. The complex structure of forums, and the fact that they often sit behind a login, makes them more difficult to capture effectively.

The Online Enthusiast Communities in the UK page is a fascinating insight into how communities of shared interest, including those of a literary bent, evolve online.

Thanks, Carlos.