UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites


4 posts from September 2012

27 September 2012

Digital Research 2012, Oxford

I recently made the trip to the Digital Research 2012 gathering in Oxford, with my colleagues in the web archiving team Helen Hockx-Yu and Andy Jackson. We were taking part in a day of presentations and workshops on the theme of digital research using web archives. (See the programmes of our session and of the whole conference.)

It was an excellent opportunity to showcase a cluster of current projects, both here at the BL and in association with us, and to make connections between them. Andy demonstrated forthcoming visualisation tools for the archive, some of which are already available on the UK Web Archive site (see earlier post). Helen presented some summary results from a recent survey of our users, about which she wrote in an earlier post.

Recently, the JISC very generously funded two projects to explore the use of the UK Web Domain Dataset, and there were presentations from both. Helen Margetts from the Oxford Internet Institute presented the Big Data project, which is conducting a link analysis of the whole dataset, showing its usefulness for political scientists and other social science researchers by analysing the place of government in information networks in the UK.

I then presented some early findings from the Analytical Access to the Domain Dark Archive project, led by the Institute of Historical Research (University of London). I reported on a series of workshops with potential users of the dataset, who raised important questions about research of this type. How far should researchers trust analytical tools inside a 'black box', presenting results generated by algorithms that are not (and often cannot be) transparent? And how far does research on datasets of this scale present new questions of research ethics, and who should be looking for the answers to them?

In the afternoon we discussed some of the themes raised in the morning, to do with potential users and their needs. Some of these were:

(i) large datasets present amazing opportunities for analysis at a macro level, but at the same time many scholars will still want to use web archives simply as another resource discovery option, to find and consult individual sites. Both approaches need to be catered for.

(ii) possible interaction with Wikipedia. As more and more sites disappear from the live web over time and UKWA increasingly becomes the repository for the only copy, we might expect UKWA to be cited more often as a source in Wikipedia. There may also be ways to actively aid and encourage this process.

(iii) how do we identify potential user groups? We can't safely say that scholars in Discipline A are more likely to use the archive than those in Discipline B. It may be that sub-groups within each discipline find their own uses. For instance: one wouldn't find much data about the Higgs boson in the archive, but a physicist interested in public engagement with the issue might find a great deal. One wouldn't look in UKWA for the texts of the Man Booker Prize shortlist, but a literature specialist could find a wealth of reviews and other public engagement with those texts.

Overall, it was a most successful day, which gave us much food for thought.

20 September 2012

Valuing Video Games Heritage: an update on our new video games collection

[British Library Digital Curator Stella Wisdom updates us on a forthcoming special collection, preserving the rich digital heritage of video games.]

Some of you may remember my blog post from February this year, where I explained that I was selecting websites for a new Web Archive collection that will preserve information about computer games, including resources documenting gaming culture and the impact that video games have had on education and contemporary cultural life.

Since then I’ve been busy researching several target areas for sites that I would like to add to the collection, such as:

  • Sites which illustrate the experience of playing games, e.g. walkthroughs, image galleries, videos of game play and FAQs
  • Fansites
  • Forums
  • Vulnerable sites, e.g. industry sites for companies that have ceased trading
  • Sites about popular games, i.e. the types of games played by people who do not identify themselves as "gamers"
  • Gamification, i.e. use of game features and techniques being adopted in non-game contexts
  • Educational games and sites which illustrate the progression of game development education
  • Events, e.g. game launches, game culture festivals
  • Pro and anti-video games and game culture sites
  • Sites which chart the evolution of video games
  • Game development competitions, including those that showcase student and independent game developers’ work
  • Game publishers, retailers and reviewers, including journalistic output

One of my challenges has been obtaining permission from website owners, as not everyone within the video game industry or player community seems to value the richness of its history and heritage, or to understand the concepts of digital preservation and web archiving. However, I’ve been making progress in networking, both online and in person, with those who create and play video games, so I’m hoping that this engagement activity will encourage more site owners to respond positively and give their support to the project. I’m also still seeking nominations, so if you know of any sites that you think should be included, then please get in touch (at [email protected] or via Twitter @miss_wisdom) or use the nomination form.

So far, I’ve discovered some wonderful resources and have been able to archive interesting sites, which include:

  • GameCity: an annual videogame culture festival that takes place in Nottingham
  • Dare to be Digital: a video games development competition at Abertay University for students at UK universities and art colleges
  • BAFTA Games: presenters of the British Academy Games Awards, who also organise a competition for 11 to 16 year olds to recognise and encourage young games designers
  • North Castle: one of the oldest fansites for the Nintendo game The Legend of Zelda
  • The Oliver Twins: a site that tells the story of Philip & Andrew Oliver, who from the age of 12 began writing games for the UK games market and co-founded Blitz Games Studios in 1990.

13 September 2012

Web Archives and Chinese Literature

The following is a guest post by Professor Michel Hockx, School of Oriental and African Studies, University of London, who explains how doing research on internet literature differs from doing research on printed literature, and how web archives help.


In July of this year, Brixton-based novelist Zelda Rhiando won the inaugural Kidwell-e Ebook Award. The award was billed as “the world’s first international e-book award.” It may have been the first time that e-writers in English from all over the world had been invited to compete for an award, but for e-writers in Chinese such awards have been around for well over a decade. This might sound surprising, since the Chinese Internet is most frequently in the news here for the way in which it is censored, i.e. for what does not appear on it. What people often forget, however, is that the environment for print-publishing in China is much more restricted and much more heavily censored. Therefore, those with literary interests and ambitions have gone online in huge numbers. Reading and writing literature is consistently ranked among the top-ten reasons why Chinese people spend time online.


I have been following the development of Chinese internet literature almost since its inception and I am currently finalizing a monograph on the subject, simply titled Internet Literature in China and due to be published by Columbia University Press. (That scholars of literature feel compelled to publish their research outcomes on topics like this in the form of printed books shows how poorly attuned the humanities world still is to the new technologies.) Doing research on internet literature is substantially different from doing research on printed literature, most importantly because born-digital literary texts are not stable. Printed novels may come in different editions, but scholars who do research on the same novel generally assume that they have all read the same text. For internet literature there can be no such assumption, because “the text” often evolves over time and looks different depending on when you visit it and what you do with it. So one of the methods I employ is to present my interpretations of such texts at different moments in time. For traditional literature scholars this is unusual: they don’t normally tell you in their research “when I read this text in 2011, I interpreted it like this, but when I read it again in 2012, I interpreted it like that.” Using this method relies on the availability of the material, and on the possibility of preserving it so that other scholars can reproduce my readings. And that is where web archives come in.


As far as I know, there is no Chinese equivalent of the UK Web Archive. In the area of preservation of born-digital material, China is very far behind the UK (instead it devotes huge resources to the digitization and preservation of its printed cultural heritage). Some literary websites in China have their own archives. In the case of popular genre fiction sites these archives can be huge, and they can be searchable by author, genre, popularity (number of hits or comments), and so on. Genre fiction (romance fiction, martial arts fiction, erotic fiction, and so on) is hugely popular on the Chinese Internet, because of the relatively few legal restrictions compared to print publishing. Readers subscribe to novels they like and they then receive regular new instalments, often on a daily basis. However, no matter how large the archives, there tends to be a cut-off point after which works are taken offline. When I first started my research in 2002, I was blissfully unaware of such potential problems. As a result, roughly 90% of the URLs mentioned in the footnotes to my first scholarly articles on the topic are no longer accessible. Fortunately, when I began to rework some of my earlier articles for my book, I found that the Internet Archive had preserved a substantial number of the links, so in many cases my footnotes now refer to the Internet Archive. Although the Internet Archive does not preserve images and other visual material (which can play an important role in online literature), having the texts as I saw them in 2002 is definitely better than having nothing at all, and will convince my fellow scholars that I am not just making them all up!


During my later research, I took care to save pages, and sometimes entire sites or parts of sites, to my own computer to ensure preservation of what I had seen. But archiving material on my computer does not make it any more accessible to others. That is why I use the services of the Digital Archive for Chinese Studies (DACHS, with one server in Heidelberg, and one in Leiden), where scholars in my field can store copies of online material they refer to in footnotes to publications. DACHS also has another important function: it preserves copies of online material from China that is in danger of disappearing, because it is political or ephemeral, or both. DACHS also invites scholars to introduce such materials and place them in context, as in Nicolai Volland’s collection of online documents pertaining to “Control of the Media in the People’s Republic of China”, or Michael Day’s annotated collection of Chinese avant-garde poetry websites.


In order for online Chinese-language literature to be preserved, its cultural value needs to be appreciated not just by foreign enthusiasts like myself, but more generally by scholars and critics in China itself. The first decade or so of Chinese writing on the Internet will probably never be restored in any detail, although a relatively complete picture might still emerge if existing partial archives were merged. Meanwhile, I hope that new archiving options for later material will become available soon. 

05 September 2012

How to Make Websites More Archivable?

I was contacted by an organisation which is going to be disbanded in a couple of months. When the organisation disappears, so will its website. Fortunately we have already archived a few instances of their website in the UK Web Archive.

The lady who contacted me, however, complained that the archival copies are incomplete, as they do not include the “database”, and would like to deposit a copy with us. On examination, it turned out that a section called “events”, which has a calendar interface, had not been copied by our crawler. I also found that two other sections, whose content is pulled dynamically from an underlying database, seem to be accessible only via a search interface. These would have been missed by the crawler too.

The above situation reflects some common technical challenges in web archiving. The calendar is likely to send the crawler inadvertently into a so-called “crawler trap”, as it would follow the (hyperlinked) dates on the calendar endlessly. For that reason, the “events” section was excluded from our previous crawls. The database-driven search interface presents content in response to searches or interactions, which the crawler cannot perform. Archiving crawlers are generally capable of capturing explicitly referenced content which can be served by requesting a URL, but they cannot deal with URLs which are not explicit in the HTML, for example those embedded in JavaScript or Flash presentations or generated dynamically.
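To make that limitation concrete, here is a minimal sketch (in Python, not our actual crawler software) of how link discovery works: the crawler parses the HTML it has fetched and queues the href values it finds, so a calendar's endless "next month" links are all discovered and followed, while a URL assembled by JavaScript never appears as a link at all. The sample page and URLs are invented for illustration.

```python
# Minimal sketch of crawler link discovery (illustrative only).
from html.parser import HTMLParser

SAMPLE_PAGE = """
<a href="/about">About us</a>                    <!-- explicit link: discovered -->
<a href="/events?month=2012-10">Next month</a>   <!-- each month links to the next month,
                                                      endlessly: a crawler trap -->
<script>
  // URL assembled in JavaScript: never appears as an href, so never discovered
  window.location = "/projects?id=" + projectId;
</script>
"""

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [value for name, value in attrs if name == "href"]

extractor = LinkExtractor()
extractor.feed(SAMPLE_PAGE)
print(extractor.links)  # ['/about', '/events?month=2012-10']; the JavaScript URL is missed
```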

We found out the earliest and latest dates related to the events in the organisation’s database and used these to limit the date range the crawler should follow. We then successfully crawled the “events” section without trapping our crawler. For the other two sections, we noticed that the live website also has a map interface which provides browsable lists of projects per region. Unfortunately only the first pages are available, because the links to subsequent pages are broken on the live site. The crawler copied the website as it was, including the broken links.
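For readers curious about the mechanics, the fix amounts to a scope rule of roughly the following shape; the calendar URL pattern and dates below are invented for illustration, and in practice such limits are expressed in the crawler's own configuration rather than in ad-hoc code.

```python
# Sketch of a date-range scope rule that stops a calendar from trapping the crawler.
import re
from datetime import date

# Earliest and latest event dates, as found in the organisation's database (illustrative values)
EARLIEST, LATEST = date(2005, 1, 1), date(2012, 12, 31)

# Hypothetical calendar URL pattern, e.g. /events?month=2012-10
CALENDAR_URL = re.compile(r"/events\?month=(\d{4})-(\d{2})$")

def in_scope(url: str) -> bool:
    """Follow ordinary URLs; follow calendar URLs only for months that can hold real events."""
    match = CALENDAR_URL.search(url)
    if match is None:
        return True  # not a calendar page: normal crawl rules apply
    year, month = int(match.group(1)), int(match.group(2))
    return EARLIEST <= date(year, month, 1) <= LATEST

print(in_scope("/events?month=2010-06"))  # True: within the event range, crawl it
print(in_scope("/events?month=2031-01"))  # False: empty future month, skip it
```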

There are a few basic things which, if taken into account when a website is designed, will make it a lot more archivable. These measures help ensure preservation and avoid information loss if, for any reason, the website has to be taken offline.

1. Make sure important content is also explicitly referenced.
This requirement does not conflict with having cool, interactive features. All we ask is that you provide an alternative, crawler-friendly way of access, using explicit or static URLs (a sketch illustrating points 1 and 2 follows this list). A rule of thumb is that each page should be reachable from at least one static URL.

2. Have a site map.
Provide a site map, in XML or in HTML, listing the pages of your website so that they can be found by crawlers and human users alike.

3. Make sure all links work on your website.
If your website contains broken links, copies of your website will also have broken links.
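
To make points 1 and 2 concrete, here is a minimal sketch (in Python) of one way a database-driven site could expose its records: give each record a stable, explicitly referenced URL and list those URLs in an XML site map. The URL scheme, record fields and file name are invented for illustration; any sitemap generator or CMS feature that produces the same result is equally good.

```python
# Sketch: publish one static URL per database record and list them all in sitemap.xml.
from xml.etree import ElementTree as ET

# Hypothetical rows from the projects database that sits behind the search/map interface
records = [
    {"id": 17, "slug": "river-restoration"},
    {"id": 42, "slug": "community-orchards"},
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for record in records:
    entry = ET.SubElement(urlset, "url")
    # One stable, crawler-friendly page per record, reachable without the search interface
    ET.SubElement(entry, "loc").text = (
        f"http://www.example.org/projects/{record['id']}-{record['slug']}"
    )

# Write the site map; a crawler that reads it can request every project page directly
ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```

A crawler that finds sitemap.xml can then fetch each project page directly, whether or not the search or map interface behaves well.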

There are more things one can do to make websites archivable. Google, for example, has issued guidelines to webmasters to help them make their websites easier to find, crawl and index: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35769. Many of the best practices mentioned there are also applicable to archiving crawlers. Although archiving crawlers work in a way that is very similar to search engine crawlers, it is important to understand the difference: search engine crawlers are only interested in files which can be indexed, whereas archiving crawlers intend to copy all files, of all formats, belonging to a website.

Helen Hockx-Yu, Head of Web Archiving, British Library