THE BRITISH LIBRARY

UK Web Archive blog

6 posts categorized "Social media"

31 July 2020

LGBTQ+ Lives Online


 
 A white banner with the LGBTQ+ flag colours painted on with the text - love is love
Photo by 42 North from Pexels

By Steven Dryden, British Library LGBTQ+ Staff Network & Ash Green CILIP LGBTQ+ Network

 

When the internet first rose to prominence in the late 1990s, one of the primary modes of communicating with others was through internet chat rooms and forums. Suddenly, isolated people all over the world with a personal computer and internet access could communicate with others ‘like them’.

By using the term ‘like them’ we acknowledge that there is some form of social oppression which makes a person, perhaps alone in a rural community, feel unable to be themselves - to know anything about themselves at all. It is perhaps partly because of the need to feel more connected with other people ‘like them’ that LGBTQ+ people adapted to online community-building quickly. Now, as we have been living online for over 25 years, it seems pertinent to consider what traces of early digital lives survive, and how we can begin to make sense of them. What survives of digital campaigns to equalise the age of consent for all sexualities in the UK (2001), to gain recognition and protections for members of the trans community (Gender Recognition Act 2004), or of the battle for marriage equality in the UK (England and Wales 2013, Scotland 2014, Northern Ireland 2019)? As well as historical content such as this, we must also ensure we are ready and able to curate current and future online discussions and websites surrounding LGBTQ+ lives.

Part of this process has already begun. Through the UK Web Archive, the British Library, along with the other five UK Legal Deposit Libraries, has been able to run an annual domain crawl of the UK web since April 2013, after the implementation of the Non-Print Legal Deposit Regulations. Prior to this, websites had been archived on a permissions basis since January 2005. Through the Shine interface you can search the JISC UK Web Domain Dataset (1996-2013), which holds all the .uk websites archived by the Internet Archive from 1996 to April 2013. As a next step, the British Library and the Chartered Institute of Library and Information Professionals (CILIP) LGBTQ+ Network are pleased to work collaboratively to develop LGBTQ+ Lives Online. This project will tag and subject categorise relevant websites in the UK Web Archive, and expand the scope of websites we collect for future generations. We look forward to sharing with you over the coming months the work that is being undertaken and how you can contribute.

CILIP LGBTQ+ Network members are pleased to be working collaboratively with the British Library and the UK Web Archive on this project, and recognise the historical value and importance of developing the LGBTQ+ Lives Online web archive.

The aim of the UK Web Archive is to collect content published on the UK web that reflects all aspects of life in the UK. This includes important aspects of British culture and events that shape society. The LGBTQ+ Lives Online collection reflects the important role this community plays in British society. The UK Web Archive is delighted to collaborate with the British Library LGBTQ+ Staff Network and the CILIP LGBTQ+ Network to build on the existing LGBTQ+ collection. Although there is a dedicated collection about the LGBTQ+ community, many of the websites tagged in this collection also intersect with other collections in the archive such as our various sports collections, Political Action and Communication and Oral History in the UK.

 

Get Involved:

CILIP LGBTQ+ Network, the British Library and the UK Web Archive welcome nominations for UK websites which should be included in the LGBTQ+ Lives Online collection.

Nominations can be made via this form: https://www.webarchive.org.uk/en/ukwa/nominate

 

Keep an eye on the CILIP LGBTQ+ Network Twitter as well as the UK Web Archive blog and Twitter account for more updates on the LGBTQ+ Lives Online collection.

 

24 June 2020

Our new Science web archive collection


 
By Philip Eagle, Subject Librarian - Science, Technology and Medicine at The British Library
 
 
Air pump CC0
A Philosopher Shewing an Experiment on the Air Pump, 1769 by Valentine Green

 

Introduction

We have just activated our new web archive collection on science in the UK. One of the British Library's objectives as an institution is to increase our profile and level of service to the science community. In pursuit of this aim we are curating a web archive collection in collaboration with the UK legal deposit libraries. We already have some collections on science-related subjects, such as the late Stephen Hawking and science at Cambridge University, but not on science as a whole.

 

Collection scope

We have interpreted "science" widely to include engineering and communications, but not IT, as that already has a collection. Our collection is arranged according to the standard disciplines such as biology, chemistry, engineering, earth sciences and physics, and then subdivided according to their common divisions, based on the treatment of science in the Universal Decimal Classification.

The collection has a wide range of types of site. We have tried to be fairly exhaustive on active UK science-related blogs, learned societies, charities, pressure groups, and museums. Because of the sheer number of university departments in the UK, we have not been able to cover them all. Instead we have selected the departments that did best in the 2014 Research Excellence Framework, and then taken a random sample to make sure that our collection properly reflects the whole world of academic science in the UK. We are also adding science-related Twitter accounts. Social media is generally difficult to archive due to its proprietary nature, but public Twitter content is more openly accessible, so we can archive it more easily.

 

Access

Under the Non-Print Legal Deposit Regulations 2013 we can archive UK websites, but we are only able to make them available to people outside the Legal Deposit Libraries' Reading Rooms if the website owner has given permission. Some of the sites in the collection have already had permission granted, such as the Hunterian Society, Dame Athene Donald’s blog, and the Royal College of Anaesthetists. Others that have not given permission include Science Sparks, the Wellcome Collection, and the British Pregnancy Advisory Service. The Web Archive page will tell you whether an archived site is only viewable from a library; anything with no statement can be viewed on the public web.


Get involved

As ever, if you have a site to nominate that has been left out, you can tell us by filling in our public nomination form: https://www.webarchive.org.uk/ukwa/info/nominate

29 May 2020

Using Webrecorder to archive UK political party leaders' social media after the UK General Election 2019


This blog post is by Nicola Bingham, Helena Byrne, Carlos Lelkes-Rarugal and Giulia Carla Rossi

Introduction to Webrecorder

The UK Web Archive aims to capture the whole of the UK web space at least once a year, and targeted websites at more frequent intervals. We conduct this activity under the auspices of the Legal Deposit Regulations 2013 which enable us to capture, preserve and make accessible the UK Web for the benefit of researchers now and in the future.

Along with many cultural and heritage institutions that perform at-scale web archiving, we use Heritrix 3, the state of the art crawler developed by the Internet Archive and maintained and improved by an international community of web archiving technologists.

Heritrix copes very well with large scale, bulk crawling but is not optimised for high fidelity crawling of dynamic content, and in particular does not archive social media content very well.

Researchers are increasingly turning their attention to social media as a significant witness to our times, so we need to capture this content in certain circumstances and in line with our collection development policy. Usually this will be around public events such as General Elections, where much of the campaigning over recent years has been played out online and increasingly on social media in particular.

For this reason we have looked at alternative web archiving tools such as Webrecorder to complement our existing toolset. 

Webrecorder was developed by Ilya Kreymer under the auspices of Rhizome (a non-profit organisation based in New York which commissions, presents and preserves digital art), as part of its digital preservation program. It offers a browser-based version, with free accounts of up to 5GB of storage, and a Desktop App.

Webrecorder was already well known to us at the UK Web Archive, although we had not used it until recently. It is a web archiving service which creates an interactive copy of web pages that the user explores in their browser, including content revealed by interactions such as playing video and audio, scrolling, clicking buttons etc. This is a much more sophisticated method of acquisition than that used by Heritrix, which essentially only follows HTML links and doesn’t handle dynamic content very well.
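To illustrate why a crawler that only follows HTML links misses dynamic content, here is a minimal Python sketch. It is not how Heritrix is implemented; the `LinkExtractor` class, the `extract_links` helper and the sample page are all hypothetical, and the sketch simply collects `href` values from static markup using the standard library.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags found in static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every link that is visible in the markup itself."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page: one ordinary link, plus a timeline that a script
# would fill in only after the page has loaded in a browser.
SAMPLE_PAGE = """
<html><body>
  <a href="/about">About</a>
  <div id="timeline"></div>
  <script>loadMoreTweets("/api/timeline?page=2");</script>
</body></html>
"""

print(extract_links(SAMPLE_PAGE))
```

Only the `/about` link is found. The timeline content requested by the script never appears in the static markup, which is the kind of material a browser-based tool such as Webrecorder can capture because the page actually runs.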


What we planned to do

The UK General Election Campaign ran from the 6th of November 2019, when Parliament was dissolved, until polling day on the 12th of December 2019. On the 13th of December 2019 the UK Web Archive team, based at the British Library, attempted to archive various social media accounts of the main political party leaders. Seventeen political leaders from the four home nations were identified and three social media platforms were targeted: Twitter, Facebook and Instagram. Not all leaders have accounts on all three platforms, but in total forty-four social media accounts were archived. These accounts are identified in the table below by an X.

List of UK political party leaders' social media accounts archived
Image credit: Carlos Lelkes-Rarugal

 

 

How we did it

On the 13th of December 2019 we ran the Webrecorder Desktop App across twelve office PCs. Many were running the Webrecorder Autopilot function over the accounts, but we had mixed success, in that not all accounts captured the same amount of data. As the Autopilot functionality didn’t work well on all accounts, a combination of automated and manual capture processes was used where necessary. It took the team a lot longer than expected to archive the accounts, so some were archived on a range of dates the following week.

 

Large political parties' vs smaller parties' social media accounts

The leaders of the two largest political parties, Jeremy Corbyn and Boris Johnson, have many more social media followers than the other home nations' party leaders. This meant that it was more difficult to get a comprehensive capture of Corbyn’s and Johnson’s Twitter accounts than, for example, Arlene Foster’s. The more popular Twitter accounts took many hours to crawl; Corbyn’s took almost ten hours to archive thirteen days’ worth of Tweets (which only took us back to 1st December).

 

Technical Issues

We experienced several technical issues with crawling, mainly concerning IP addresses, the app crashing, and Autopilot working on some computers and not others. It was hard to get the app restarted after it crashed, so some time was lost when this happened. Different computers with the same specs ran differently. The Autopilot captures for Jeremy Corbyn’s and Boris Johnson’s Twitter accounts were started at the same time, but Corbyn’s ran uninterrupted while Johnson’s crashed when it reached 475 MB. Although Corbyn’s account was crawled for nearly ten hours, it only collected 93 MB of data. In contrast, Nigel Farage’s Twitter page was crawled for just over four hours and produced 506 MB. It is important to check the size of crawled data, as the number of hours the Webrecorder Desktop App runs on Autopilot does not necessarily translate into a high fidelity crawl.
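The size check described above can be sketched in a few lines of Python. This is a hedged illustration rather than the workflow we used: the `warc_sizes_mb` and `mb_per_hour` helpers are hypothetical names, the durations are the rounded figures quoted in this post, and a low rate is only a prompt to inspect a capture, not proof that it failed.

```python
from pathlib import Path


def warc_sizes_mb(folder):
    """Return {file name: size in MB} for every .warc.gz file in a folder."""
    sizes = {}
    for path in sorted(Path(folder).glob("*.warc.gz")):
        sizes[path.name] = path.stat().st_size / 1_000_000
    return sizes


def mb_per_hour(size_mb, hours):
    """Rough capture rate; a low value may point to a stalled or shallow crawl."""
    return size_mb / hours


# Approximate figures quoted in this post (durations are rounded).
corbyn_rate = mb_per_hour(93, 10)   # about 93 MB in nearly ten hours
farage_rate = mb_per_hour(506, 4)   # about 506 MB in just over four hours

print(f"Corbyn: {corbyn_rate:.1f} MB per hour")
print(f"Farage: {farage_rate:.1f} MB per hour")
```

Comparing rates like this makes it easier to spot a long-running job that has collected very little, before time is spent on detailed QA.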

 

Added complications when using multiple devices with the same user profile:

Complications arose mainly from the auditing and collating of WARC files; performing QA and keeping track of which jobs were successful and those that were not. 

Initially, all participants in this project had planned to use their own work PC or work laptop and a local desktop installation of Webrecorder. However, an hour or so into the process (early in the day), it became apparent that there would not be enough time to archive all of the social media accounts within our time frame, given the number of accounts and the unanticipated time it would take to archive each one. For example, it took one instance of the desktop Webrecorder application almost ten hours to archive Jeremy Corbyn’s Twitter account (and it was only able to capture Tweets up to a month prior to the day of archiving).

It was then decided that we could potentially, and experimentally, run multiple parallel Webrecorder applications across a number of office desktop PCs; PCs that were free and available for us to use. This was possible because of the IT Architecture in place, allowing users to log into any office machine with the correct credentials and making their personal desktop load up along with all their files and user settings, regardless of the PC they log into. 

The British Library’s IT system, which incorporates a lot of the Windows ecosystem, gives each user their own dedicated central work directory, where they are given a virtual hard drive and their own storage space for all their documents and any other work-related files. This allowed one user to be logged into several office PCs at the same time and therefore run a separate desktop Webrecorder application on each machine. This was very helpful, as it allowed each machine to focus on one particular social media account, which in many cases took hours to archive.

Having multiple Webrecorder jobs greatly increased our capacity to archive by removing the previous bottleneck, which was one Webrecorder job per user. Instead, this was increased to several Webrecorder jobs per user.

Work flow of gathering WARC files from Webrecorder
Image credit: Carlos Lelkes-Rarugal

 

 

Having multiple Webrecorder jobs added complications down the line, not necessarily impacting the archiving process, but rather complicating the auditing and collating of WARC files. When a user had several Webrecorder jobs running concurrently, each job would still be downloading to the same user work directory (the user’s virtual hard drive). So if a user had many parallel jobs running, this would create multiple WARC files in the same folder (but with different names, so no clashes), with WARC files being produced by the different desktop PCs that the user had logged in to. This was quite an elaborate setup because once a job had completed, the entire contents of the Webrecorder folder (where the WARCs were stored) were copied to a USB so that an initial Quality Assurance (QA) could be performed on the completed job on a more capable laptop. The difficulty was in finding the WARC file that corresponded to the completed job, which was somewhat convoluted as there would have been multiple WARC files with this type of file-naming convention:

 “rec-20191213100335021576-DESKTOP-AOCGH38-7B5SEXKS.warc.gz”. 

As you can imagine, a copy of Webrecorder’s folder contents contains not only the completed job, but also WARC files from other, incomplete jobs. With multiple jobs per PC and multiple PCs per user, keeping track of what had completed and which WARCs were either corrupted or not up to standard was quite demanding.
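One way to ease this kind of bookkeeping is to split each file name into its parts. The Python sketch below is hypothetical and rests on our reading of the example name above: we assume the first fourteen digits are a start timestamp (year to seconds), the next six digits are a sub-second fraction, the middle part is the name of the PC, and the final part is a random job identifier. Webrecorder does not document this layout here, so treat the pattern as an assumption.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed layout (not confirmed by Webrecorder documentation):
# rec-<14-digit timestamp><6 more digits>-<machine name>-<random ID>.warc.gz
WARC_NAME = re.compile(
    r"^rec-(?P<stamp>\d{14})(?P<fraction>\d{6})"
    r"-(?P<machine>.+)-(?P<job_id>[A-Z0-9]+)\.warc\.gz$"
)


def parse_warc_name(file_name):
    """Split a WARC file name into start time, machine name and job ID."""
    match = WARC_NAME.match(file_name)
    if match is None:
        return None  # the name does not follow the assumed layout
    return {
        "started": datetime.strptime(match.group("stamp"), "%Y%m%d%H%M%S"),
        "machine": match.group("machine"),
        "job_id": match.group("job_id"),
    }


def group_by_machine(file_names):
    """Group WARC names by the PC that produced them, oldest first."""
    groups = defaultdict(list)
    for name in file_names:
        info = parse_warc_name(name)
        if info is not None:
            groups[info["machine"]].append((info["started"], name))
    return {machine: sorted(items) for machine, items in groups.items()}


example = "rec-20191213100335021576-DESKTOP-AOCGH38-7B5SEXKS.warc.gz"
print(parse_warc_name(example))
```

Grouping by machine name and start time would let a log of "which PC ran which account, and when" be matched against the files copied to the USB, rather than relying on memory.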

 


Review of the data collected 

File size of data collected from UK political party leaders' social media accounts
Image credit: Carlos Lelkes-Rarugal

 

How to access this data

The archived social media accounts can be accessed through the UK General Election 2019 collection in a UK Legal Deposit Library Reading Room. The UK Legal Deposit Libraries are the British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries, Cambridge University Library and Trinity College Dublin Library.  

The 2019 collection is part of a time series of UK General Election collections dating from 2005. They can be accessed over the Internet on the Topics and Themes page of the UK Web Archive website. All the party leaders' social media accounts are tagged into the subsection UK Party Leaders Social Media Accounts (access to individual websites depends on whether we have an additional permission to allow ‘open’ access). More information about what is included in the UK General Election 2019 collection is available through the UK Web Archive blog.

 

Conclusion


Overall, undertaking this experiment was an interesting experience for our small team of British Library Web Archive Curators. Many valuable lessons were learnt on how best to utilise Webrecorder in our current practice. The major takeaway was that it was a lot more time consuming than we expected. Instead of taking up one working day, it took nearly a whole week to archive our targeted social media accounts with Webrecorder. Our usual practice is to archive social media accounts with the Heritrix crawler, which works reasonably well with Twitter but is less suited to capturing other platforms. For a long time, we were unable to capture any Facebook content with Heritrix, mainly due to the platform’s publishing model; however, the way the platform is published has changed recently, allowing us limited success. Archiving social media will always remain challenging for the UK Web Archive, for myriad technical, ethical and legal reasons. The sheer scale of the UK’s social media output is too large for us to capture adequately (and indeed, this may not even be desirable) and certainly too large a task for us to tackle with manual, high fidelity tools such as Webrecorder. However, our recent experience during the 2019 UK General Election has convinced us that using Webrecorder to capture significant events is a worthwhile exercise, as long as we target selected, in-scope accounts on a case-by-case basis.

 

22 September 2016

Web Archiving Rio 2016 Olympic and Paralympic Games


‘For the Olympics, the whole world is captivated, turns on its television and supports their country’

Introduction
The Olympic and Paralympic Games in Rio de Janeiro, Brazil may be over but it will be some time before they are forgotten about in the press and social media. Web archives play a vital role in preserving the narratives that have come out of these Games. The Content Development Group (CDG) at the International Internet Preservation Consortium (IIPC) has been archiving both the Winter and Summer Games since 2010 and the Rio 2016 Collection will be available in October 2016.


Rio 2016 is the first time the CDG has archived events both on and off the playing field, making this its biggest collection so far in terms of the number of nominations and geographical coverage. The CDG also enlisted the help of subject experts as well as the general public to nominate sites from countries not usually covered in IIPC collections. As the IIPC only has members in around 33 countries, public nominations played an important role in filling this void.

What’s involved?
But what’s involved in web archiving the Olympics? CDG members the British Library and the National Library of Scotland co-hosted a Twitter chat on 10th August 2016 to give an insight on what’s involved. The Twitter chat was based on set questions published in an IIPC blog post with a Q&A session and some time for live nominations. This was an international chat with participants from the USA, Ireland, England, Scotland, Serbia and even Australia. The chat was added to Storify as well as the final archived collection of the Games. Even though the chat was small it helped us to connect with a wider audience and increase the number of public nominations. You can follow updates on this project on Twitter by using the collection hashtag #Rio2016WA.

How can you get involved?
There is still time for you to get involved in web archiving the Olympics and Paralympics. The public nomination form will be open until 23rd September 2016. If you would like to make a nomination you can follow these guidelines. As Carly Lloyd stated above, the whole world is captivated by the Olympics; now is your opportunity to be part of it.

By Helena Byrne, Assistant Web Archivist, The British Library

17 May 2016

Saving BBC Recipes Website


There's been much coverage today of plans to remove the recipe pages from the BBC website.


The UK Web Archive has been collecting selected pages from the BBC, mainly news, for over ten years and since 2013 we have attempted to capture the entirety of the BBC web estate. A small number of pages are available on the Open UK Web Archive website. Most of the BBC's online presence, however, is only available in the reading rooms of UK Legal Deposit libraries, including both of the British Library sites at St. Pancras and Boston Spa in Yorkshire.

We have today instigated a further crawl of the BBC website with the specific aim of ensuring that we save the recipes from the food pages. We can also report that the Internet Archive, Library of Alexandria and the National Library of Iceland have also captured these pages so their future is assured.

Polly Russell, British Library Curator and Food Historian says 

"Cookery books, like cookery websites, obviously serve a practical purpose but that is not all. For historians, sociologists and anthropologists they also tell us about people's culinary aspirations and anxieties, cultural tastes and trends, dietary preoccupations, social expectations and economic conditions. They are, therefore, a rich source for researchers. So while it's sad news to hear about plans to close the much trusted and well-loved BBC Food website, it's a relief that the British Library is going to be able to archive the website for posterity."

 

 

02 December 2011

Twittervane: Crowdsourcing selection



We’re excited to announce development of a new tool to automate the selection of websites for archiving: the Twittervane.

At the moment, our selection process is manual, dependent upon internal subject specialists or external experts to contact us and nominate websites for archiving in the UK Web Archive. We benefit from their expertise and wouldn’t be without it, but we recognise that this manual selection process can sometimes be time consuming for frequent selectors. It’s also inevitably subjective, reflecting the interests of a relatively small number of selectors. 

Automated selection is an efficient and under-utilised alternative, but up until now it has been difficult to see how an automated approach could clearly identify the most popular and widely relevant websites. Our answer?  Twittervane. 

The Twittervane project will investigate how the power and wisdom of the crowd can be leveraged to automatically select websites for archiving. In essence, it's a crowdsourcing approach to selection that will complement the manual selections provided by subject specialists and other experts.

The project will:

  • Deliver a prototype tool for analysing Twitter content that will:
    • determine which websites are shared most frequently around a given theme over a given time period;
    • link to our existing web archiving infrastructure to support harvesting of sites that fall within the UK domain
  • Generate at least one pilot special collection comprising websites most frequently shared across the crowd that address or are relevant to a unifying theme
  • Assess the viability of the approach from a curatorial perspective and investigate the ‘wisdom of crowds’ in this context. 
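As a rough idea of what the first aim might look like in code, here is a minimal Python sketch. It is not the Twittervane prototype: the tweet records, their field names (`date`, `text`, `urls`) and the helper functions are hypothetical, and a real tool would need to fetch tweets and expand shortened links first.

```python
from collections import Counter
from datetime import date
from urllib.parse import urlparse


def most_shared_sites(tweets, theme, start, end, top_n=10):
    """Count the hosts of links shared in tweets that mention a theme
    within a date range, most frequently shared first."""
    counts = Counter()
    for tweet in tweets:
        if not (start <= tweet["date"] <= end):
            continue
        if theme.lower() not in tweet["text"].lower():
            continue
        for url in tweet["urls"]:
            host = urlparse(url).hostname
            if host:
                counts[host] += 1
    return counts.most_common(top_n)


def in_uk_domain(host):
    """Very simple test for hosts under the .uk top-level domain."""
    return host.endswith(".uk")


# Hypothetical sample data for illustration only.
SAMPLE_TWEETS = [
    {"date": date(2011, 12, 1), "text": "Great #Olympics preview",
     "urls": ["http://www.example.co.uk/a"]},
    {"date": date(2011, 12, 2), "text": "More olympics news",
     "urls": ["http://www.example.co.uk/b", "http://example.com/x"]},
    {"date": date(2011, 12, 3), "text": "Unrelated chatter",
     "urls": ["http://example.org/"]},
]

ranked = most_shared_sites(
    SAMPLE_TWEETS, "olympics", date(2011, 12, 1), date(2011, 12, 31)
)
uk_candidates = [host for host, _ in ranked if in_uk_domain(host)]
print(ranked)
print(uk_candidates)
```

The ranked list corresponds to "shared most frequently around a given theme over a given time period", and the `.uk` filter stands in for the step that hands UK-domain sites to the harvesting infrastructure. Sites in scope for UK archiving but hosted outside `.uk` would need a different test.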

It’s important to get curatorial input to this approach, so we’ll be asking curators from the Library to assess the quality and relevance of resulting selections. The project will start in December and the prototype will be completed in time for next year’s IIPC General Assembly in Washington in May, which is particularly important as the IIPC are contributing funding for the project.

We aim to provide regular progress updates as development takes place, so watch this space - and Twitter, of course - for more details.