[In this guest post, Jules Mataly describes his research at the University of Amsterdam, making comparative use of three different web archives, including the UK Web Archive. His thesis, The Three Truths of Margaret Thatcher, was completed earlier this year.]
As a Master’s student of New Media and Digital Culture at the University of Amsterdam, I made extensive use of the UK Web Archive in my final thesis. The goal of the thesis was a comparative analysis of different archives on a given topic. I wanted to compare, and to find a way to quantify, the impact of different curating approaches on archived materials. This was not in relation to the gigabytes collected (all the collections are huge anyway), but rather in terms of the sources and origins of the archived pages. After some deliberation, the choice of research topic fell on Margaret Thatcher.
At that point in time, Mrs Thatcher had just passed away. Given her status, there emerged a great number of online articles seeking to establish what her impact on politics had been. And so I wondered: what will an historian ten or twenty years from now be likely to find, if researching the online publications of today? What parts of the seemingly significant material of our time will be successfully archived and preserved for future times, and what is likely to be lost? Here I discuss the methods used for my thesis, entitled “The Three Truths of Margaret Thatcher”.
To carry out the research, I needed other web archives with which to compare the UK Web Archive. After being introduced to my research project, the head of the UK Web Archive, Helen Hockx-Yu, very kindly offered access to a brand new research interface. This yet-to-be-officially-released interface was built by the UK Web Archive team and is based upon what the Internet Archive had collected of the UK web domain (the JISC UK Web Domain Dataset). Finally, thanks to the help of Erik Borra from the Digital Methods Initiative in Amsterdam, I created a list of URLs that were curated by Google and accessible through the Internet Archive. Google is the front door of the Internet for most people, and it also allows querying pages within a time range, giving access to pages of the past that would not appear in today’s results. Studying pages retrieved through the Google search engine, I tried to find which ones the Internet Archive had saved. These pages were then used to create a third corpus.
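As an illustration only (this is not the method used in the thesis), one simple way to quantify how far three corpora differ in their sources is to reduce each URL list to its set of hostnames and measure the overlap between pairs of sets. The sketch below uses the Jaccard index; the function names and the choice of measure are assumptions made for this example.

```python
from urllib.parse import urlsplit


def hosts(urls):
    """Return the set of lower-cased hostnames found in a list of URLs."""
    found = set()
    for url in urls:
        host = urlsplit(url).hostname  # None when the URL has no network location
        if host:
            found.add(host)
    return found


def jaccard(a, b):
    """Share of hosts present in both sets, out of all hosts in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compare(corpora):
    """Pairwise host overlap for a dict mapping corpus name -> list of URLs."""
    names = sorted(corpora)
    host_sets = {name: hosts(corpora[name]) for name in names}
    return {
        (left, right): jaccard(host_sets[left], host_sets[right])
        for i, left in enumerate(names)
        for right in names[i + 1:]
    }
```

A score of 1.0 would mean two corpora draw on exactly the same hosts, and 0.0 that they share none. Comparing hostnames rather than full URLs is a deliberate simplification; comparing complete URLs would only require skipping the `hosts` step.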
The UK Web Archive is, unlike numerous other national web archive initiatives, online and available to all without restrictions. This makes it a great prospect for research. As I was researching a topic that has not been archived purposely by the British Library (i.e. not as part of a special collection), I used the text-based search to query the archive’s databases, thus entering web archives in a “Google fashion”, by keywords.
Users of the Internet Archive previously only had the option to search by URL. Now, however, it is also possible to extract archived material through full-text search in other archives. For my thesis research, I needed to generate lists of URLs, which text search could easily provide. When creating such lists, it is possible to group the results by domain, but I purposely ignored that option. My interest was primarily in complete URLs, and only secondarily in web domains and top-level domains. Had it been possible to display more than ten results per page, the workflow would have improved significantly. An optimal situation would have been to be able to download complete lists of the resulting URLs, perhaps in .csv format, preferably with corresponding metadata (e.g. date of crawl, number of times visited by crawlers).
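To make the wished-for workflow concrete, here is a minimal sketch of what could be done once a complete URL list is in hand: tally the results by host and by top-level domain, and write the list out as .csv with its metadata. The column names (`url`, `crawl_date`, `crawl_count`) are hypothetical, and nothing here reflects an existing export feature of any archive.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlsplit


def tally(urls):
    """Count URLs per host and per top-level domain."""
    host_counts = Counter()
    tld_counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if not host:
            continue  # skip entries that are not absolute URLs
        host_counts[host] += 1
        # Simplification: the TLD is taken to be the last dot-separated label,
        # so "www.bbc.co.uk" is counted under "uk", not "co.uk".
        tld_counts[host.rsplit(".", 1)[-1]] += 1
    return host_counts, tld_counts


def to_csv(rows, fieldnames):
    """Serialise a list of dicts to CSV text, header row first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Keeping the complete URL as the primary column, and deriving hosts and top-level domains from it afterwards, matches the priority described above: full URLs first, domains second.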
By querying the UK Web Archive as one would query the live web through search engines, I obtained a list of websites. This list was sorted by unknown criteria. Had it been possible to sort search results according to various parameters (occurrences of the search terms, number of times the specific page had been visited by the crawler, crawl date), the resulting list would have opened up greater research possibilities. This would not only make it possible to study the archived materials (or a sample of them, as I did in my research), but also enable studies of what users are confronted with when browsing the archives. Given the sheer volume of available data, knowing which sites users actually access is of great importance.
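The kind of sorting described above can be sketched in a few lines, assuming each search result came with the three parameters mentioned. The `Result` record and its field names are hypothetical; the archive interface did not expose such data at the time.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Result:
    """One search hit, with hypothetical metadata fields."""
    url: str
    term_hits: int      # occurrences of the search terms on the page
    crawl_count: int    # number of times the crawler visited the page
    crawl_date: date    # date of the capture


SORTABLE = {"term_hits", "crawl_count", "crawl_date"}


def sort_results(results, key, descending=True):
    """Return a new list of results ordered by one of the sortable fields."""
    if key not in SORTABLE:
        raise ValueError(f"cannot sort by {key!r}")
    return sorted(results, key=lambda r: getattr(r, key), reverse=descending)
```

With an explicit, documented ordering like this, a researcher could state exactly which results a user would have seen first, which is not possible when the ranking criteria are unknown.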
In conclusion, I would like to use this opportunity to sincerely thank Helen Hockx-Yu for sharing interesting thoughts and providing access to the user interface prototype built upon the Internet Archive data. My thesis “The Three Truths of Margaret Thatcher” would not have been complete without it.