UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites



24 July 2015

Geo-location in the 2014 UK Domain Crawl

In April 2013 the Legal Deposit Libraries (Non-Print Works) Regulations 2013 came into force. Of particular relevance is the section which specifies which parts of that ephemeral place we call the Web are considered to be part of "the UK":

  • 18 (1) “…a work published on line shall be treated as published in the United Kingdom if:
    • “(b) it is made available to the public by a person and any of that person’s activities relating to the creation or the publication of the work take place within the United Kingdom.”

In more practical terms, resources are to be considered as being published in the United Kingdom if the server which serves said resources is physically located in the UK. Here we enter the realm of Geolocation.

[Image: "Comparison satellite navigation orbits" by Cmglee, Geo Swan (own work), licensed under CC BY-SA 3.0 via Wikimedia Commons]

Heritrix & Geolocation

Geolocation is the practice of determining the "real world" location of something—in our case the whereabouts of a server, given its IP address.

The web-crawler we use, Heritrix, already has many of the features necessary to accomplish this. Among its many DecideRules (a series of ACCEPT/REJECT rules which determine whether a URL is to be downloaded) is the ExternalGeoLocationDecideRule. This requires:

  • A list of ISO 3166-1 country-codes to be permitted in the crawl
    • GB, FR, DE, etc.
  • An implementation of ExternalGeoLookupInterface.

This latter ExternalGeoLookupInterface is where our own work lies. This is essentially a basic framework on which you must hang your own implementation. In our case, our implementation is based on MaxMind’s GeoLite2 database. Freely available under the Creative Commons Attribution-ShareAlike 3.0 Unported License, this is a small database which translates IP addresses (or, more specifically, IP address ranges) into country (or even specific city) locations.
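
As a quick way of seeing what such a lookup returns, MaxMind's mmdblookup command-line tool (part of libmaxminddb, and quite separate from the crawler itself) can be pointed at the same .mmdb file; the IP address below is just an example:

# Query the GeoLite2 database for an IP's ISO country code
mmdblookup --file /dev/shm/geoip-city.mmdb --ip 8.8.8.8 country iso_code
# prints something like:  "US" <utf8_string>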

Taken from our Heritrix configuration, the below shows how this is included in the crawl:

<!-- GEO-LOOKUP: specifying location of external database. -->
<bean id="externalGeoLookup" class="uk.bl.wap.modules.deciderules.ExternalGeoLookup">
  <property name="database" value="/dev/shm/geoip-city.mmdb"/>
</bean>
<!-- ...  ACCEPT those in the UK... -->
<bean id="externalGeoLookupRule" class="org.archive.crawler.modules.deciderules.ExternalGeoLocationDecideRule">
  <property name="lookup">
    <ref bean="externalGeoLookup"/>
  </property>
  <property name="countryCodes">
    <list>
      <value>GB</value>
    </list>
  </property>
</bean>

The GeoLite2 database itself is, at around only 30MB, very small. Part of the beauty of this implementation is that the entire database can be held comfortably in memory. The above shows that we keep the database in Linux's shared memory, avoiding any disk I/O when reading from it.
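
Getting the database there is nothing more exotic than a copy, along these lines (the source filename is simply whatever your downloaded copy of GeoLite2 is called):

# /dev/shm is a tmpfs (RAM-backed) mount, so reads never touch disk
cp GeoLite2-City.mmdb /dev/shm/geoip-city.mmdb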

Testing

To test the above we performed a short, shallow test crawl of 1,000,000 seeds. A relatively recent addition to Heritrix's DecideRules is this property:

<property name="logToFile" value="true" />

During a crawl, this will create a file, scope.log, containing the final decision for every URI along with the specific rule which made that decision. For example:

2014-11-05T10:17:39.790Z 4 ExternalGeoLocationDecideRule ACCEPT http://www.jaymoy.com/
2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT https://t.co/Sz15mxnvtQ
2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT http://twitter.com/2017Hull7

So, of the above, the last two URLs were rejected outright, while the first was ruled in-scope by the ExternalGeoLocationDecideRule.
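
As an aside, the same whitespace-separated format makes it easy to see how often each rule fired; a quick sketch, assuming every line of scope.log follows the pattern above:

# field 3 is the deciding rule, field 4 its decision; tally each combination
awk '{ print $3, $4 }' scope.log | sort | uniq -c | sort -rn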

Parsing the full output from our test crawl, we find:

  • 89,500,755 URLs downloaded in total.
  • 26,072 URLs which were not on .uk domains (and therefore would, ordinarily, not be in scope).
    • 137 distinct hosts.

[Image: "British Isles Euler diagram 15" by TWCarlson (own work), licensed under CC0 via Wikimedia Commons]

2014 Domain Crawl

The process for examining the output of our first Domain Crawl is largely unchanged from the above. The only real difference is the size: the scope.log file gets very large when dealing with domain-scale data, as it logs not only the decision for every URL downloaded but for every URL not downloaded (and the reason why).

Here we can use a simple sed command (admittedly implemented slightly differently, distributed via Hadoop Streaming, to cope with the scale) to parse the log output:

sed -rn 's@^.+ ExternalGeoLocationDecideRule ACCEPT https?://([^/]+)/.*$@\1@p' scope.log | grep -Ev "\.uk$" | sort -u

This will produce a list of all the distinct hosts which have been ruled in-scope by the ExternalGeoLocationDecideRule (excluding, of course, any .uk hosts which are considered in scope by virtue of a different part of the legislation).
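
For the curious, a Hadoop Streaming version of that pipeline might look roughly like the following; the jar name, HDFS paths and script name are illustrative rather than a record of the job we actually ran. The mapper is simply the sed/grep pair from above, reading log lines on stdin:

#!/bin/sh
# geo-hosts.sh: emit non-.uk hosts ruled in-scope by the geo rule
sed -rn 's@^.+ ExternalGeoLocationDecideRule ACCEPT https?://([^/]+)/.*$@\1@p' |
  grep -Ev '\.uk$'
exit 0  # grep would otherwise fail any split containing no matches

The job then ships that script to the cluster and de-duplicates in the reducer (Streaming sorts the mapper output, so uniq suffices):

hadoop jar hadoop-streaming.jar -files geo-hosts.sh \
  -input /crawls/2014/scope-logs/ -output /crawls/2014/geo-hosts \
  -mapper geo-hosts.sh -reducer uniq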

For the 2014 Domain Crawl, this process produced a list of 2,544,426 hosts ruled in-scope by geolocation.

By Roger G. Coram, Web Crawl Engineer, The British Library 

17 July 2015

Curating the Election - Archiving the most complex General Election yet…

[Image: 2015 General Election outcome map, via https://en.wikipedia.org/wiki/United_Kingdom_general_election,_2015]

This year’s General Election was not only one of the closest fought in recent times; with more parties in the limelight than ever before, it was almost certainly the most complex.

As so much of the election was played out in the here-today-gone-tomorrow world of the Net and broadcast media, the archiving challenge is all the greater: many political pages disappear soon after the election results are announced.

The Library is capturing these transient messages before they are lost. Across the Library, and across the Legal Deposit Library network, staff led by Jennie Grimshaw in Research Engagement have been working on a special web collection, joining the several General Election collections we have created in the past.

Meanwhile, we have been adding extra recordings to the Broadcast News service (these are available within hours of having been broadcast). Because of the significant Scottish dimension, the TV channels STV and BBC Scotland have been added to the mix, creating a lasting archive for years to come.

Because we have archived the 2005 and 2010 elections, we can also see that there have been significant changes in the way the internet is used, and increasingly the web archive is showing how it can support long-term research of this kind.

Compared with the 2010 General Election, it is clear that there has been a mushrooming of campaigning on the web. In excess of 7,400 websites and webpages have been selected in 2015 compared to approximately 770 pages in the 2010 collection, and 139 in 2005.

One reason for this growth is the way prospective candidates now attempt to engage the electorate on multiple channels. In addition to setting up their own campaigning websites and having a page on their party’s constituency website, they increasingly use social media channels such as Facebook and Twitter to reach out to voters. For example, a total of 951 Twitter accounts have been selected across all the subject categories, illustrating just how prominent a part social media has played.

Led by Jennie Grimshaw in Research Engagement at the British Library, the team included curators from the three national libraries, from Northern Ireland and from the Bodleian Library, Oxford.

One element of the project was to endeavour to capture websites from the same constituencies as were selected in the 2010 and 2005 crawls, in an effort to offer some comparison of how constituency web presences evolve from one election to the next.

[Image: UK opinion polling 2010–2015, via https://en.wikipedia.org/wiki/United_Kingdom_general_election,_2015]

The 2015 General Election web archive collection has harvested 32 opinion polls and 100 blogs, supplementing the comment and analysis of more traditional news websites. There are also the webpages and publications of 62 think tanks and 412 interest groups, all of which creates a rich online documentary archive of the Election, including much material which will disappear rapidly from the live web.

By Jerry Jenkins, Curator of Emerging Media at the British Library

Note by the editor: the links provided in this post point to the Open UK Web Archive, which gives access to archived webpages where permission has been granted for open access. The complete collections for all three General Elections can only be accessed in the British Library reading rooms under the terms of the Non-Print Legal Deposit legislation.

10 July 2015

UK Web Archives Forum @ BBC Broadcasting House

Friday 19th June saw the first UK Web Archives Forum at Broadcasting House. This was set up by BBC Archives as an opportunity to bring the British Library, the National Archives and Channel 4 together with BBC Archives to discuss current archiving policies & practice in the ever-shifting world of web & social media archiving. Representatives from all the aforementioned institutions were present, including the BBC's own Web Archives team.

The session was very well received, and everyone involved came away with lots of new ideas and potential future collaborations. Presentations and overviews of state-of-play web archiving activities were shared, followed by in-depth discussions on the moving landscape of web archiving methodology and the challenges of archiving social media.

[Image: UK Web and Social Media Archive Forum, June 2015]
Of great interest was the work underway by BBC Archives, the British Library and the National Archives on archiving Twitter communications. Other major areas of interest were standards and practices. The BBC has, for example, adopted a number of solutions for web archiving, including crawling to WARCs, generating PDFs, screencasts and document archiving, to ensure all bases are covered in preserving bbc.co.uk. It was interesting to see the scale adopted by the British Library in preserving the .uk web domain, and the National Archives also explained their challenges in archiving .gov websites and the large array of government-funded organisations at national level.

It was decided that we would meet again in the future to look at further collaborations, quality assurance of our archive results, and how best to tackle future online distribution platforms, especially social media and mobile applications, where younger generations are now consuming content at a faster rate than ever before. There are a lot of exciting challenges in the area of web archiving, so the need for a forum to discuss and shape policies and practices is vital. We hope to work with the Digital Preservation Coalition on future workshops in this field, to help provide common standards for all concerned and for those standards to be shared with the wider UK web archiving community.

Some of the tools under discussion were those used to download and preserve web content for future use.

By Carl Davies, Archive Manager, Radio & Multiplatform, BBC Engineering

08 July 2015

Big UK Domain Data for the Arts and Humanities: working with the archive of UK web space, 1996–2013


In January 2014, the Institute of Historical Research, University of London (in partnership with the British Library, the Oxford Internet Institute and Aarhus University) was awarded funding by the Arts and Humanities Research Council for a project to explore ways in which humanities researchers could engage with web archives. The main aims of ‘Big UK Domain Data for the Arts and Humanities’ were to highlight the value of web archives for research; to develop a theoretical and methodological framework for their analysis; to explore the ethical implications of this kind of big data research; to train researchers in the use of big data; and to inform collections development and access arrangements at the British Library.

[Image: Helen Hockx-Yu showing the BUDDAH interface to visitors at the Being Human Festival 2014]

For the past 15 months the project team have been working with 10 researchers, drawn from a range of arts and humanities disciplines, to address these issues and particularly to develop a prototype interface which will make the historical archive (1996–2013) accessible. The researchers came armed with a range of fascinating questions, from analysing Euro-scepticism on the web to studying the Ministry of Defence’s recruitment strategy, and from examining the history of disability campaigning groups and charities online to looking at Beat literature in the contemporary imagination. The case studies that they have produced demonstrate some of the challenges posed by the archived web, but also its value and significance. They are available from the project website.

Along the way, the project has produced not only one of the largest full-text indexes of web archive (WARC) files in the world, but also a sophisticated interface which supports complex query building and gives researchers the ability to create and manipulate corpora derived from the larger dataset.

This interface is accessible as a beta version. It opens up a fascinating range of options, now that you no longer need to know the URL of a vanished website in order to find it in the archive.


For those less familiar with the concept of web archives, we’ve also produced two short animations, ‘What is a Web Archive?’ and ‘What does the UK Web Archive collect?’. They’re both available under a CC-BY-NC-SA licence, so do please share!

Jane Winters
Professor of Digital History
Institute of Historical Research, School of Advanced Study, University of London
@jfwinters

03 July 2015

What is a Web Archive? (in less than 3 mins)

You may have heard of the term 'Web Archiving' but what is it and why is it important that the UK Legal Deposit libraries support this? This short video is a good start:

 

What does the UK Web Archive collect?
What can you expect to find, and where might you go to access the three collections that the UK Web Archive holds?

These videos were produced as part of the AHRC-funded 'Big UK Domain Data for the Arts and Humanities' project.