UK Web Archive blog

2 posts from April 2017

28 April 2017

What websites do we collect during UK General Elections?

The UK Web Archive has been archiving websites connected to General Elections since 2005.

During the 2005 and 2010 elections, collecting was done on a permissions-cleared basis: curators had to contact individual website owners and request permission before a website could be captured and stored. Any site whose publisher refused permission, did not respond or could not be contacted was not archived. The 2015 election was collected following the introduction of new Legal Deposit regulations in 2013 that allow any UK website to be collected without permission.

The collections are not comprehensive. Due to various factors, such as the time-consuming permissions process and the ephemeral nature of websites (which often do not include contact details), large sections of content relating to the General Elections could not be covered.

Collection Summary:

2005

The UK General Election 2005 was the first of our Election collections. It includes 139 different items, or ‘Targets’, which cover a wide variety of websites such as those of individual candidates, major political parties and interest groups, as well as a selection of election manifestos. Even though this collection is fairly small, it is worth highlighting that until relatively recently election campaigning was predominantly carried out through print media; in 2005 it was by no means the case that all political candidates had a website.

2010

The UK General Election 2010 collection is much bigger, totalling 770 items. This collection has eleven subcategories that cover:

Candidates (15 items)

Election Blogs (27 items)

Interest Groups (113 items)

News and Commentary (30 items)

Opinion Polls (7 items)

Other (8 items)

Political Parties - Local (191 items)

Political Parties - National (54 items)

Public and Community Engagement (13 items)

Regulation and Guidance (15 items)

Research Centres and Think Tanks (14 items)

2015

The UK General Election 2015 collection is the biggest collection of its type, with 7,861 items. By 2015 we observed that much more traditionally paper-based content had moved onto the web. This shift in publishing, along with the introduction of the Non-Print Legal Deposit Regulations (NPLD) in 2013, which enabled the Legal Deposit Libraries to collect online UK content at scale without seeking explicit permissions, meant that this collection was bigger than those of previous years. This collection has eleven subcategories that cover:

Candidates (1,957 items)

Election Blogs (100 items)

Interest Groups (416 items)

News and Commentary (4,582 items)

Opinion Polls (32 items)

Other (75 items)

Political Parties - Local (442 items)

Political Parties - National (142 items)

Public and Community Engagement (45 items)

Regulation and Guidance (7 items)

Research Centres and Think Tanks (62 items)

All content archived in 2015 will be available to users later this year, either via the UK Web Archive website or through UK Legal Deposit Library Reading Rooms, depending on the permission status of the individual websites.

2017

As the June 2017 general election was called at short notice, the collection will likely be much smaller than the 2015 collection. However, as a number of the websites in the 2015 collection are still live, they will be re-tagged for the 2017 collection, which will give the curators more time to focus on selecting the more ephemeral websites and social media content.

By Helena Byrne, Assistant Web Archivist, The British Library

18 April 2017

The Challenges of Web Archiving Social Media

By Helena Byrne, Assistant Web Archivist, British Library

What is the UK Web Archive?
The UK Web Archive aims to archive, preserve and give access (where permissions allow) to the UK web space. It only collects information that is publicly available online in the UK. Therefore, any web pages that require a login, such as membership-only areas, are not captured; neither are emails or private intranets. As most of the popular social media platforms are not hosted in the UK, being largely based in the US, their public interfaces are not automatically picked up in our annual domain crawl. Thus, all social media sites in the archive have to be manually selected and scoped in so that they are legitimately archived under Non-Print Legal Deposit Regulations.
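The scoping rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the archive's actual crawler configuration: a `.uk` hostname is used as a rough stand-in for "UK web space", and `MANUALLY_SCOPED` and `in_scope` are hypothetical names with made-up example profiles.

```python
from urllib.parse import urlsplit

# Hypothetical curator-maintained list of manually scoped-in profiles.
# The entries are invented examples, not real targets.
MANUALLY_SCOPED = {
    "twitter.com/exampleparty",
    "facebook.com/examplecampaign",
}


def in_scope(url: str) -> bool:
    """Return True if the URL would be collected under this simplified model."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Assumption: a .uk hostname approximates what the domain crawl picks up.
    if host.endswith(".uk"):
        return True
    # Anything else must have been explicitly selected by a curator.
    key = host + parts.path.rstrip("/").lower()
    return key in MANUALLY_SCOPED
```

In this model a page on a `.uk` site is picked up automatically, while a profile on a US-hosted platform is collected only if a curator has added it to the list.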

What Social Media is in the UK Web Archive?
The UK Web Archive selectively collects publicly accessible Facebook and Twitter profiles related to thematic collections such as the EU Referendum, or ‘Brexit’, or those accounts of prominent individuals and organisations in the UK, such as the Prime Minister and the main political parties. In the main, social media is collected when building special collections on big events that shape society, for instance elections and referendums. We collect profiles that are related directly to political parties or interest groups campaigning on relevant issues. As we can only archive content from the UK web space, we cannot crawl individual hashtags like #BBCRecipes and #Brexit, as a lot of this content is generated outside the UK and we cannot ascertain the provenance of third-party comments.

Difficulties with web archiving social media
Archiving social media is technically challenging as these platforms are presented in a different way to ‘traditional’ websites. Social media platforms use Application Programming Interfaces (APIs) as a way to ‘enable controlled access [to] their underlying functions and data’ (Day Thomson). In the past we have tried to crawl other platforms such as Instagram and Flickr but have been unsuccessful, due to a combination of technical difficulties and restrictions that are sometimes set to prevent crawler access.

How to access the UK Web Archive
Under the 2013 Non-Print Legal Deposit Regulations the UK Legal Deposit Libraries are permitted to archive UK content published on the web. However, access to this content is limited to Legal Deposit Library premises unless explicit permission is obtained from the site owner to make content available on the Open UK Web Archive website. More information on Non-Print Legal Deposit can be found here and information on how to access the UK Web Archive can be found here.

What to expect when using this resource
The success rate of crawling Twitter and Facebook is limited and the quality of the captures varies. In the worst-case scenario, what is presented to the user amounts to the date a post was made in a blank white box. There are many reasons why a crawler cannot follow links. One reason is that the user used a shortened URL that is now broken or couldn’t be read at the time of the crawl. The Internet Archive is currently working with companies that provide this service to ensure the longevity of shortened URLs. Advertisements on social media and archived websites are not always captured, resulting in either a ‘Resource Not in Archive’ message or leakage to the live web. More information on this can be found here.
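To illustrate why a broken shortened URL leaves a gap in a capture, here is a minimal Python sketch. It is a simplified model, not a description of how the UK Web Archive crawler works: `resolve` is a hypothetical helper, and the `redirects` table stands in for whatever the shortening service would have returned at crawl time (with `None` marking a short link that no longer resolves).

```python
def resolve(url, redirects, max_hops=5):
    """Follow recorded redirects and return the final URL.

    Returns None if the chain is broken, loops, or exceeds max_hops.
    """
    seen = set()
    for _ in range(max_hops + 1):
        if url not in redirects:
            return url  # not a (known) short link: this is the destination
        if url in seen:
            return None  # redirect loop
        seen.add(url)
        url = redirects[url]
        if url is None:
            return None  # the short link is dead
    return None  # too many hops


# Invented example data: one working short link and one broken one.
redirects = {
    "https://short.example/abc": "https://www.example.org.uk/report",
    "https://short.example/dead": None,
}
```

With this table, `resolve("https://short.example/abc", redirects)` yields the destination page, while the dead link yields `None`, which corresponds to the empty white box a user sees in the archived post.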

Twitter

1. Unison Scotland Twitter

Unison Scotland – Twitter from April 8th 2016

2. RC of Psychiatrists

RC of Psychiatrists – Twitter from August 2nd 2016

Facebook

When we first started archiving public Facebook pages, the crawls were quite successful, albeit with the caveat around archiving external links. As you can see from the Unison Scotland example, there are white boxes where an external link was shared using a shortened URL which wasn’t captured. In spring 2015 Facebook changed its display settings and we were only able to capture a white screen. However, more recent captures have been successful.

3. Unison Facebook

Unison Scotland – Facebook from April 8th 2016

4. EU Citizens for an Independent Scotland Facebook

EU Citizens for an Independent Scotland – Facebook from 15th November 2014

Conclusion

As you can see from the few samples here, the quality of the capture can vary, but a lot of valuable information can still be gathered from these instances. In March 2017 the UK Web Archive deployed a new version of its web crawler, which will take a screenshot of the home page of websites before it archives the content. Although it will be some time before the technology is available for researchers to view these screenshots, it is hoped that they will bridge the gap between what is captured and what is not.

Internationally, more research needs to be done on archiving social media, along with the assistance of the platform proprietors. No two platforms are the same, and each requires a tailored approach to ensure a successful crawl.

More information about the UK Web Archive can be found here.

[Update 26 Jan 2018]

Over the last few years Facebook has made many changes that affect the ability of web archives to capture its pages. Our last complete captures seem to be from mid-2015. It is no longer possible to view publicly available content on Facebook without logging in; this means that we are unable to archive any Facebook pages.