The Challenges of Web Archiving Social Media
What is the UK Web Archive?
The UK Web Archive aims to archive, preserve and give access (where permissions allow) to the UK web space. It only collects information that is publicly available online in the UK. Therefore, any web pages that require a login, such as membership-only areas, are not captured; neither are emails or private intranets. As most of the popular social media platforms are not hosted in the UK, being largely based in the US, their public interfaces are not automatically picked up in our annual domain crawl. Thus, all social media sites in the archive have to be manually selected and scoped in so that they are legitimately archived under the Non-Print Legal Deposit Regulations.
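As a rough illustration of what "scoped in" means, the sketch below models a two-rule scope check: a URL is collected if it sits on a .uk host, or if a curator has manually added that profile. Treating the UK web space as "hosts ending in .uk" is a simplification, and the profile list, the `in_scope` function and the example URLs are all hypothetical; this is not the archive's actual crawler configuration.

```python
from urllib.parse import urlsplit

# Hypothetical examples of manually selected social media profiles;
# illustrative only, not the archive's real selection list.
MANUALLY_SCOPED = {
    ("twitter.com", "/exampleparty"),
    ("www.facebook.com", "/exampleparty"),
}


def in_scope(url: str) -> bool:
    """Return True if this simplified policy would collect the URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/").lower()

    # Rule 1 (assumed): anything on a .uk host is covered by the domain crawl.
    if host.endswith(".uk"):
        return True

    # Rule 2: a non-UK host is collected only if that profile was scoped in by hand.
    return any(
        host == scoped_host
        and (path == scoped_path or path.startswith(scoped_path + "/"))
        for scoped_host, scoped_path in MANUALLY_SCOPED
    )
```

Under these assumptions, `in_scope("https://twitter.com/ExampleParty")` is true because the profile was added by hand, while any other Twitter profile is skipped even though it is publicly visible.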
What Social Media is in the UK Web Archive?
The UK Web Archive selectively collects publicly accessible Facebook and Twitter profiles related to thematic collections such as the EU Referendum, or ‘Brexit’, and the accounts of prominent individuals and organisations in the UK, such as the Prime Minister and the main political parties. In the main, social media is collected when building special collections on big events that shape society, for instance elections and referendums. We collect profiles that are related directly to political parties or interest groups campaigning on relevant issues. As we can only archive content from the UK web space, we cannot crawl individual hashtags like #BBCRecipes and #Brexit, because a lot of this content is generated outside the UK and we cannot ascertain the provenance of third-party comments.
Difficulties with web archiving social media
Archiving social media is technically challenging because these platforms are presented differently from ‘traditional’ websites. Social media platforms use Application Programming Interfaces (APIs) as a way to ‘enable controlled access [to] their underlying functions and data’ (Day Thomson). In the past we have tried to crawl other platforms such as Instagram and Flickr but have been unsuccessful, due to a combination of technical difficulties and restrictions that are sometimes set to prevent crawler access.
How to access the UK Web Archive
Under the 2013 Non-Print Legal Deposit Regulations, the UK Legal Deposit Libraries are permitted to archive UK content published on the web. However, access to this content is limited to Legal Deposit Library premises unless explicit permission is obtained from the site owner to make content available on the Open UK Web Archive website. More information on Non-Print Legal Deposit can be found here and information on how to access the UK Web Archive can be found here.
What to expect when using this resource
The success rate of crawling Twitter and Facebook is limited and the quality of the captures varies. In the worst-case scenario, all that is presented to the user is the date a post was made, in a blank white box. There are many reasons why a crawler cannot follow links. One reason is that the user used a shortened URL that is now broken or couldn’t be read at the time of the crawl. The Internet Archive is currently working with companies that provide this service to ensure the longevity of shortened URLs. Advertisements on social media and archived websites are not always captured, resulting in either a ‘Resource Not in Archive’ message or leakage to the live web. More information on this can be found here.
Unison Scotland – Twitter from April 8th 2016
RC of Psychiatrists – Twitter from August 2nd 2016
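To make the shortened-URL problem concrete, here is a minimal sketch of following a short link through its redirects until it reaches a real page. If any hop fails, the target cannot be captured. It does not describe how the archive's crawler works: the `resolve_short_url` function, the `fetch` callback and every URL in it are made up for illustration, and the "web" is simulated with a dictionary so the example runs without network access. Relative redirect targets are not handled.

```python
from typing import Callable, Optional, Tuple

# (HTTP status code, value of the Location header or None)
Response = Tuple[int, Optional[str]]

REDIRECT_CODES = (301, 302, 303, 307, 308)


def resolve_short_url(
    url: str,
    fetch: Callable[[str], Response],
    max_hops: int = 5,
) -> Optional[str]:
    """Follow redirects from a shortened URL.

    Returns the final URL, or None if the chain is broken.
    """
    seen = set()
    for _ in range(max_hops + 1):
        if url in seen:
            return None  # redirect loop
        seen.add(url)
        status, location = fetch(url)
        if status in REDIRECT_CODES and location:
            url = location  # follow the redirect and try again
            continue
        # A 200 means we reached the real page; anything else is a broken link.
        return url if status == 200 else None
    return None  # too many redirects


# Simulated responses standing in for live HTTP requests (all URLs are invented).
FAKE_WEB = {
    "https://short.example/abc": (301, "https://www.example.org.uk/report"),
    "https://www.example.org.uk/report": (200, None),
    "https://short.example/dead": (404, None),
}


def fake_fetch(url: str) -> Response:
    """Look up a simulated response; unknown URLs behave like a 404."""
    return FAKE_WEB.get(url, (404, None))
```

With the simulated data, the first short link resolves to the report page, while the dead one returns `None`, which corresponds to the white boxes visible in the captures.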
When we first started archiving public Facebook pages, the crawls were quite successful, albeit with the caveat around archiving external links. As you can see from the Unison Scotland example, there are white boxes where an external link was shared using a shortened URL which wasn’t captured. In spring 2015 Facebook changed its display settings and we were only able to capture a white screen. However, more recent captures have been successful.
Unison Scotland – Facebook from April 8th 2016
EU Citizens for an Independent Scotland – Facebook from 15th November 2014
As you can see from the few samples here, the quality of the capture can vary, but a lot of valuable information can still be gathered from these instances. In March 2017 the UK Web Archive deployed a new version of its web crawler, which takes a screenshot of the home page of each website before archiving the content. Although it will be some time before the technology is available for researchers to view these screenshots, it is hoped that they will bridge the gap between what is captured and what is not.
Internationally, more research needs to be done on archiving social media, with the assistance of the platform proprietors. No two platforms are the same, and each requires a tailored approach to ensure a successful crawl.
More information about the UK Web Archive can be found here.
[Update 26 Jan 2018]
Over the last few years Facebook has made many changes that affect the ability of web archives to capture its pages. Our last complete captures seem to be from mid-2015. It is no longer possible to view publicly available content on Facebook without logging in, which means that we are unable to archive any Facebook pages.