THE BRITISH LIBRARY

UK Web Archive blog

8 posts categorized "Legal deposit"

24 June 2020

Our new Science web archive collection


 
By Philip Eagle, Subject Librarian - Science, Technology and Medicine at The British Library
 
 
Air pump CC0
A Philosopher Shewing an Experiment on the Air Pump, 1769 by Valentine Green

 

Introduction

We have just activated our new web archive collection on science in the UK. One of the British Library's objectives as an institution is to raise our profile with, and level of service to, the science community. In pursuit of this aim we are curating a web archive collection in collaboration with the UK legal deposit libraries. We already have some collections on science-related subjects, such as the late Stephen Hawking and science at Cambridge University, but none covering science as a whole.

 

Collection scope

We have interpreted "science" widely to include engineering and communications, but not IT, as that already has a collection. Our collection is arranged according to the standard disciplines such as biology, chemistry, engineering, earth sciences and physics, and then subdivided according to their common divisions, based on the treatment of science in the Universal Decimal Classification.

The collection covers a wide range of types of site. We have tried to be fairly exhaustive on active UK science-related blogs, learned societies, charities, pressure groups, and museums. Because of the sheer number of university departments in the UK, we have not been able to cover them all. Instead we have selected the departments that did best in the 2014 Research Excellence Framework, and then taken a random sample to make sure that our collection properly reflects the whole world of academic science in the UK. We are also adding science-related Twitter accounts. Social media is generally difficult to archive due to its proprietary nature, but Twitter is more openly accessible, so we can archive it more easily.

 

Access

Under the Non-Print Legal Deposit Regulations 2013 we can archive UK websites, but we are only able to make them available to people outside the Legal Deposit Libraries' Reading Rooms if the website owner has given permission. Some of the sites in the collection have already had permission granted, such as the Hunterian Society, Dame Athene Donald’s blog, and the Royal College of Anaesthetists. Others that have not given permission include Science Sparks, the Wellcome Collection, and the British Pregnancy Advisory Service. The Web Archive page will tell you whether an archived site is only viewable from a library; anything with no such statement can be viewed on the public web.


Get involved

As ever, if you have a site to nominate that has been left out, you can tell us by filling in our public nomination form: https://www.webarchive.org.uk/ukwa/info/nominate

08 June 2020

Documenting the Olympics & Paralympics


 
 
Olympic Stamps
Stamps issued by Greece in 1896, the Universal Postal Union Collection, Philatelic Collections, The British Library.

 

Join our panel discussion to discover more about researchers' experiences when navigating archives, as well as the Olympics/Paralympics collection policies of GLAM organisations. This event is a collaboration between the British Society of Sports History (BSSH) and the British Library Web Archive team.

 

Register here to receive the joining details:

https://forms.gle/Tjzikxgjvr3FofSr8 

Date:           19 June 2020

Time:          3-4:30pm (BST) / 10-11:30am (EST)

Location:    Zoom

Twitter hashtag: #ResearchingtheGames

 

Presentations

Heather Dichter, De Montfort University - Finding Olympic history in non-sport archives

Laura Alexandra Brown, Northumbria University - The heritage of the Games: Interpreting urban change in Olympic host cities

Robert McNicol, Librarian, Wimbledon Lawn Tennis Museum - Researching the Olympics/Paralympics at Wimbledon

Helena Byrne, Curator of Web Archives, British Library - Preserving the Olympics/Paralympics online

 

What to expect

A broad mix of physical, digitised and born-digital resources will be covered in the presentations. The Curator of Web Archives, Helena Byrne, will discuss the UK Web Archive collections related to the Olympics/Paralympics as well as the collaboration with the International Internet Preservation Consortium (IIPC).

The year 2020 was due to be an Olympic/Paralympic year before the outbreak of the coronavirus pandemic. It is also a significant milestone for the UK Web Archive and the IIPC: it marks 15 years since the first UK Web Archive collections were published and 10 years since the IIPC first started archiving the Olympics.

 

UKWA Sports
https://www.webarchive.org.uk/en/ukwa/collection

 

The UK Web Archive and sports

The UK Web Archive has been archiving sports-related websites since it was established in 2005. However, it wasn’t until 2017 that dedicated sports collections were established. There are three broad collection groups: Sports Collection, Sports: Football and Sports: International Events. The subsections of Sports: International Events include two summer and two winter Olympic/Paralympic collections from 2010, 2012, 2014 and 2016. The largest of these is the Olympic & Paralympic Games 2012 collection, as the Games were hosted in the UK.

 

Access and reuse

Under the Non-Print Legal Deposit Regulations 2013 (NPLD) access to archived content is restricted to a UK legal deposit library reading room. However, if we have permission from the website owner, we can make the archived version of their content open access, along with government publications under the Open Government Licence. This is why, if you browse through the collections on our website, most of the links to archived content will direct you to one of the UK legal deposit libraries for access, while some of the content can be viewed from your personal device.

 

IIPC and the Olympics/Paralympics

The UK Web Archive is made up of the six UK legal deposit libraries. Two of those libraries, the British Library and the National Library of Scotland, are also members of the International Internet Preservation Consortium (IIPC), which was founded in 2003. In 2010 the IIPC started its first collaborative collection on the Winter Olympics 2010 and has covered every Olympic/Paralympic Games since. Since the formation of the IIPC Content Development Group (CDG) the collections have started to include a broader range of subjects on and off the playing field.

 

Get Involved

The UK Web Archive aims to archive, preserve and give access to the entire UK web space.

If you see content that should be included in one of our sports collections then please fill in our online nomination form.

29 May 2020

Using Webrecorder to archive UK political party leaders' social media after the UK General Election 2019


This blog post is by Nicola Bingham, Helena Byrne, Carlos Lelkes-Rarugal and Giulia Carla Rossi

Introduction to Webrecorder

The UK Web Archive aims to capture the whole of the UK web space at least once a year, and targeted websites at more frequent intervals. We conduct this activity under the auspices of the Legal Deposit Regulations 2013 which enable us to capture, preserve and make accessible the UK Web for the benefit of researchers now and in the future.

Along with many cultural and heritage institutions that perform at-scale web archiving, we use Heritrix 3, the state of the art crawler developed by the Internet Archive and maintained and improved by an international community of web archiving technologists.

Heritrix copes very well with large scale, bulk crawling but is not optimised for high fidelity crawling of dynamic content, and in particular does not archive social media content very well.

Researchers are increasingly turning their attention to social media as a significant witness to our times, so we have a requirement to capture this content in certain circumstances and in line with our collection development policy. Usually this will be around public events such as General Elections, where much of the campaigning over recent years has been played out online, and increasingly on social media in particular.

For this reason we have looked at alternative web archiving tools such as Webrecorder to complement our existing toolset. 

Webrecorder was developed by Ilya Kreymer under the auspices of Rhizome (a non-profit organisation based in New York which commissions, presents and preserves digital art), as part of its digital preservation program. It is available as a browser-based service, which offers free accounts with up to 5GB of storage, and as a Desktop App.

Webrecorder was already well known to us at the UK Web Archive, although we had not used it until recently. It is a web archiving service which creates an interactive copy of the web pages that the user explores in their browser, including content revealed by interactions such as playing video and audio, scrolling and clicking buttons. This is a much more sophisticated method of acquisition than that used by Heritrix, which essentially only follows HTML links and doesn’t handle dynamic content very well.


What we planned to do

The UK General Election Campaign ran from the 6th of November 2019, when Parliament was dissolved, until polling day on the 12th of December 2019. On the 13th of December 2019 the UK Web Archive team, based at the British Library, attempted to archive various social media accounts of the main political party leaders. Seventeen political leaders from the four home nations were identified and three social media platforms were targeted: Twitter, Facebook and Instagram. Not all leaders have accounts on all three platforms, but in total forty-four social media accounts were archived. These accounts are identified in the table below by an X.

List of UK political party leaders' social media accounts archived
Image credit: Carlos Lelkes-Rarugal

 

 

How we did it

On the 13th of December 2019 we ran the Webrecorder Desktop App across twelve office PCs. Many were running the Webrecorder Autopilot function over the accounts, but we had mixed success, in that the amount of data captured varied from account to account. As the Autopilot functionality didn’t work well on all accounts, a combination of automated and manual capture processes was used where necessary. It took the team a lot longer than expected to archive the accounts, so some were archived on a range of dates the following week.

 

Large political parties’ vs smaller parties’ social media accounts

The leaders of the two largest political parties, Jeremy Corbyn and Boris Johnson, have many more social media followers than the other home nations party leaders. This meant that it was more difficult to get a comprehensive capture of Corbyn and Johnson’s Twitter accounts than, for example, Arlene Foster’s. The more popular Twitter accounts took many hours to crawl; Corbyn’s took almost ten hours to archive thirteen days’ worth of Tweets (which only took us back to 1st December).

 

Technical Issues

We experienced several technical issues with crawling, mainly concerning IP addresses, the app crashing, and Autopilot working on some computers and not others. It was hard to get the app restarted after it crashed, so some time was lost when this happened. Different computers with the same specs ran differently. The Autopilot captures for Jeremy Corbyn’s and Boris Johnson’s Twitter accounts were started at the same time, but Corbyn’s ran uninterrupted while Johnson’s crashed when it reached 475 MB. Although Corbyn’s account was crawled for nearly ten hours, it only collected 93 MB of data. In contrast, Nigel Farage’s Twitter page was crawled for over four hours and produced 506 MB. It is important to check the size of the crawled data, as the number of hours the Webrecorder Desktop App runs on Autopilot does not necessarily translate into a high fidelity crawl.
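As a rough illustration of why size matters more than running time, the two crawls quoted above can be compared by data captured per hour. This is only a sketch: the durations are rounded from "nearly ten hours" and "over four hours", and the function name is made up for this example.

```python
# Approximate figures taken from the paragraph above:
# (hours on Autopilot, megabytes captured). The hours are rounded.
crawls = {
    "Jeremy Corbyn": (10.0, 93.0),
    "Nigel Farage": (4.0, 506.0),
}


def mb_per_hour(hours, megabytes):
    """Average capture rate for one crawl, in MB per hour."""
    return megabytes / hours


rates = {name: mb_per_hour(*figures) for name, figures in crawls.items()}

for name, rate in rates.items():
    print(f"{name}: about {rate:.1f} MB per hour")
```

A low rate does not prove that a capture is poor, but it is a cheap signal that a job deserves a closer look during QA.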

 

Added complications when using multiple devices with the same user profile:

Complications arose mainly from the auditing and collating of WARC files; performing QA and keeping track of which jobs were successful and those that were not. 

Initially, all participants in this project had planned to use their own work PC or work laptop and a local desktop installation of Webrecorder. However, an hour or so into the process (early in the day), it became apparent that there would not be enough time to archive all of the social media accounts within our time frame, given the volume of accounts and the unanticipated time it would take to archive each one. For example, it took one instance of a desktop Webrecorder application almost ten hours to archive Jeremy Corbyn’s Twitter account (only able to capture Tweets up to a month prior to the day of archiving).

It was then decided that we could potentially, and experimentally, run multiple parallel Webrecorder applications across a number of office desktop PCs; PCs that were free and available for us to use. This was possible because of the IT Architecture in place, allowing users to log into any office machine with the correct credentials and making their personal desktop load up along with all their files and user settings, regardless of the PC they log into. 

The British Library’s IT system, which incorporates a lot of the Windows ecosystem, gives each user their own dedicated central work directory, where they are given a virtual hard drive and their own storage space for all their documents and any other work-related files. This allowed one user to be logged into several office PCs at the same time and therefore to run a separate desktop Webrecorder application on each machine. This was very helpful, as it allowed each machine to focus on one particular social media account, which in many cases took hours to archive.

Having multiple Webrecorder jobs greatly increased our capacity to archive by removing the previous bottleneck, namely one Webrecorder job per user. Instead, this was increased to several Webrecorder jobs per user.

Work flow of gathering WARC files from Webrecorder
Image credit: Carlos Lelkes-Rarugal

 

 

Having multiple Webrecorder jobs added complications down the line, not necessarily impacting the archiving process, but rather complicating the auditing and collating of WARC files. When a user had several Webrecorder jobs running concurrently, each job would still be downloading to the same user work directory (the user’s virtual hard drive). So if a user had many parallel jobs running, this would create multiple WARC files in the same folder (but with different names, so no clashes), produced by the different desktop PCs that the user had logged in to. This was quite an elaborate setup because, once a job had completed, the entire contents of the Webrecorder folder (where the WARCs were stored) was copied to a USB drive so that an initial Quality Assurance (QA) check could be performed on the completed job on a more capable laptop. The difficulty was in finding the WARC file that corresponded to the completed job, which was somewhat convoluted as there would have been multiple WARC files with this type of file-naming convention:

 “rec-20191213100335021576-DESKTOP-AOCGH38-7B5SEXKS.warc.gz”. 

As you can imagine, a copy of Webrecorder’s folder contents includes not only the completed job but also the WARC files from other, incomplete jobs. With multiple jobs per PC and multiple PCs per user, keeping track of what had completed and which WARCs were either corrupted or not up to standard was quite demanding.
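One way to make this kind of audit less manual is to let the filenames do some of the work. The sketch below is not what the team used. It assumes, from the single example above, that a name is made up of "rec-", a 14-digit date and time followed by six further digits, the name of the PC, and a random suffix. The function names are invented for this example, and the corruption check only confirms that the gzip data decompresses fully, not that the WARC records are valid.

```python
import gzip
import re
import zlib
from collections import defaultdict
from datetime import datetime
from pathlib import Path

# Assumed layout, inferred from one example filename in the post:
# rec-<14-digit timestamp><6 more digits>-<machine name>-<suffix>.warc.gz
WARC_NAME = re.compile(
    r"^rec-(?P<stamp>\d{14})(?P<extra>\d{6})"
    r"-(?P<host>.+)-(?P<suffix>[A-Z0-9]+)\.warc\.gz$"
)


def parse_warc_name(name):
    """Return (start time, machine name, suffix), or None if the name does not match."""
    match = WARC_NAME.match(name)
    if match is None:
        return None
    started = datetime.strptime(match["stamp"], "%Y%m%d%H%M%S")
    return started, match["host"], match["suffix"]


def is_readable_gzip(path):
    """True if the whole file decompresses without error (a basic check only)."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(1024 * 1024):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False


def inventory(folder):
    """Group the WARC files in a folder by the machine that wrote them, oldest first."""
    by_host = defaultdict(list)
    for path in sorted(Path(folder).glob("*.warc.gz")):
        parsed = parse_warc_name(path.name)
        if parsed is None:
            continue  # not a file following the assumed naming pattern
        started, host, _suffix = parsed
        by_host[host].append((started, path.name, is_readable_gzip(path)))
    for entries in by_host.values():
        entries.sort()
    return dict(by_host)


example = "rec-20191213100335021576-DESKTOP-AOCGH38-7B5SEXKS.warc.gz"
print(parse_warc_name(example))
```

Calling inventory() on a copied folder would give, for each PC, a time-ordered list of its WARC files with a flag for any that fail to decompress, which narrows down which file belongs to the job that has just finished.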

 


Review of the data collected 

File size of data collected from UK political party leaders' social media accounts
Image credit: Carlos Lelkes-Rarugal

 

How to access this data

The archived social media accounts can be accessed through the UK General Election 2019 collection in a UK Legal Deposit Library Reading Room. The UK Legal Deposit Libraries are the British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries, Cambridge University Libraries and Trinity College Dublin Library.

The 2019 collection is part of a time series of UK General Elections dating from 2005. They can be accessed over the Internet on the Topics and Themes page of the UK Web Archive website. All the party leaders' social media accounts are tagged into the subsection UK Party Leaders Social Media Accounts (access to individual websites depends on whether we have an additional permission to allow ‘open’ access). More information about what is included in the UK General Election 2019 collection is available through the UK Web Archive blog.

 

Conclusion


Overall, undertaking this experiment was an interesting experience for our small team of British Library Web Archive Curators. Many valuable lessons were learnt on how best to utilise Webrecorder in our current practice. The major takeaway was that it was a lot more time consuming than we expected. Instead of taking up one working day, it took nearly a whole week to archive our targeted social media accounts with Webrecorder. Our usual practice is to archive social media accounts with the Heritrix crawler, which works reasonably well with Twitter but is less suited to capturing other platforms. For a long time we were unable to capture any Facebook content with Heritrix, mainly due to the platform’s publishing model; however, the way the platform is published has changed recently, allowing us limited success. Archiving social media will always remain challenging for the UK Web Archive, for myriad technical, ethical and legal reasons. The sheer scale of the UK’s social media output is too large for us to capture adequately (and indeed, this may not even be desirable) and certainly too large a task for us to tackle with manual, high fidelity tools such as Webrecorder. However, our recent experience during the 2019 UK General Election has convinced us that using Webrecorder to capture significant events is a worthwhile exercise, as long as we target selected, in-scope accounts on a case by case basis.

 

24 August 2018

How is the UK Web Archive documenting the ‘bodily autonomy’ debate online?


This blog post follows on from Kelly Burchmore’s post, Building collections on Gender Equality at the UK Web Archive. If you’ve not done so, we would encourage you to read it first.

Background
The UK Web Archive (UKWA) aims to collect online material connected with nationally important issues and debates. Recently this has included the long-running discussions around bodily autonomy. Much of this material is on social media, which can be very challenging to collect.

Archivingthe8th

See the trend online.

Why is UKWA #Archivingthe8th? [1]
Although the UK Web Archive only collects material related to the UK, many individuals and groups connected with the referendum on the 8th Amendment [1] campaigned in the UK, so much of the material falls within our remit.

In Britain, many sections of the Irish-based Abortion Rights Campaign group have been set up in various cities, starting with the London Irish Abortion Rights Campaign. In the lead-up to the referendum they ran a home to vote campaign through the website hometovote.com. The pro-life group London Irish United For Life ran a similar campaign through the website hometovote.uk. All of these websites, and many more related to this subject, are archived in the Bodily Autonomy subsection of the Gender Equality collection.

The UK Web Archive only archives content published in the UK, but other web archives also collected content on this subject. The National Library of Ireland built a special collection on the referendum and George Washington University archived over 2 million tweets that used popular hashtags related to the referendum.

How to get involved?
If there are any UK websites or Twitter accounts that you think should be added to the Bodily Autonomy subsection of the Gender Equality collection, then you can take up the UK Web Archive’s call for action and nominate content by following this link:

beta.webarchive.org.uk/en/ukwa/info/nominate

By Helena Byrne, Curator of Web Archives, The British Library

[1] #Archivingthe8th
On the 25th of May 2018 the Republic of Ireland held a referendum on the 8th Amendment. If repealed, this would make way for government to implement legislation on access to abortion services. Although the referendum on the 8th Amendment only affected the laws of the Republic of Ireland, its significance spread across the world and it received a lot of international media attention. Both pro-choice and pro-life solidarity campaign groups formed around the world, mostly made up of the Irish diaspora and other campaigners passionate about the subject. After the result was announced, the hashtag #archivingthe8th started trending on Twitter as people wanted to know how this part of public history was going to be preserved for future generations.

06 August 2018

Building collections on Gender Equality at the UK Web Archive


This is a guest blog by Kelly Burchmore, a graduate trainee digital archivist on the Bodleian Libraries’ Developing the Next Generation Archivist programme. The Bodleian is one of the 6 legal deposit libraries in the UK. One of her projects this year is to help curate special collections in the UK Web Archive. Since May she’s been working on the Gender Equality collection.

Why are we collecting gender equality websites?
2018 is the centenary of the Representation of the People Act 1918. UK-wide memorials and celebrations of this journey, and of the victory for women’s suffrage, are evident online in events, exhibitions, commemorations and campaigns. Popular topics being discussed at the moment include the hashtags #timesup and #metoo, gender pay disparity and the recent referendum on the 8th Amendment in the Republic of Ireland. These discussions produce a lot of ephemeral material, and without web archiving this material is at risk of moving or even disappearing. Web archives are able to demonstrate that gender equality is increasingly being discussed in the media and that these discussions have been developing over many years.

Through the UK Web Archive's SHINE interface we can see that matching text for the phrase ‘gender equality’ increased from 0.002% (24 out of 843,204) of crawled resources in 1996 to 0.044% (23,289 out of 53,146,359) in 2013.
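The counts quoted above can be turned back into percentages with a few lines of arithmetic, which also shows the size of the relative increase. This is a small sketch using only the numbers given in the post; the function name is made up for this example.

```python
def share_percent(matches, total):
    """Matching resources as a percentage of all crawled resources."""
    return 100.0 * matches / total


# Counts quoted in the post for the phrase 'gender equality'.
share_1996 = share_percent(24, 843_204)
share_2013 = share_percent(23_289, 53_146_359)
growth = share_2013 / share_1996

print(f"1996: {share_1996:.4f}% of crawled resources")
print(f"2013: {share_2013:.4f}% of crawled resources")
print(f"Relative increase: roughly {growth:.0f} times")
```

Worked this way, the 1996 share is a little under 0.003% (the 0.002% quoted appears to be truncated rather than rounded), the 2013 share is about 0.044%, and the share grew roughly fifteen-fold between the two years.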

SHINEgenderequality

If we search UK web content relating to gender equality we get a great many results; for example, organisations have published their gender pay discrepancy reports online, and there is a lot to engage with from the social media accounts of both individuals and organisations campaigning for gender equality. When we browse this web content, it becomes apparent that gender equality means something different for the many presences online: charities, societies, employers, authorities, heritage centres and individuals such as social entrepreneurs, teachers, researchers and more.

What are we collecting?
The Gender Equality special collection, which is now live on the UK Web Archive, comprises material that provides a snapshot of attitudes towards gender equality in the UK. Web material is harvested under the areas of:

• Bodily autonomy
• Domestic abuse/Gender based violence
• Gender equality in the workplace
• Gender identity
• Parenting
• The gender pay gap
• Women’s suffrage

100 years on from the introduction of limited women’s suffrage, the fight for gender equality continues. The collection is still undergoing curation and growing in archival records - and you can help too!

How to get involved?
If there are any UK websites that you think should be added to the Gender Equality collection then you can take up the UK Web Archive’s call for action and nominate.

Fawcett_teachingequalrights.jpeg

03 August 2018

Work Experience at the UK Web Archive


By Emily Mahoney

Upon hearing that I had a work experience placement in the British Library, I immediately thought of books and reading, a main passion of mine from a young age. When I found out about the many other sides to working in such an immense organisation (the British Library employs just over 1,500 people), I realised it would be far more fascinating than I had imagined.


I was assigned a position in Web Archiving with Helena Byrne for the week. Coming into a week of work experience in Web Archiving seemed overwhelming to me as someone with no previous experience in the topic; however, the team working in the department made me feel reassured immediately. Instead of being nervous, I could then focus on the multitude of interesting new information coming my way.


My first task was to identify images for the covers of the newer Special Collections on the UK Web Archive website. I was then informed that I would be working on a project with Leila Nassereldein, a PhD placement student focused on archiving a collection of online zines that are independent, self-published, and authored by Asian, African or Caribbean people in the UK. This was extremely exciting to me, as this is an area most people don’t necessarily think of when considering the British Library, and Leila was keen on making a space for these zines through which the smaller, independent and sometimes radical publications could also leave their mark in our web history. While working on this project with Leila I learnt to appraise, curate and archive contemporary websites using the Annotation and Curation Tool (W3ACT).


Before this week I had never come across the UK Web Archive, and this experience has made me aware of just how important it is that we have access to this information in years to come. The online public archive is also an area with a large number of research points that I will definitely be using during any further study. When writing this I was asked what the ‘most interesting’ part of my placement was; however, it would be too hard to choose, given the number of things I have learnt this week that I had never encountered before. Overall, my experience at the British Library was an enriching one that I will never forget, and it helped me consider an aspect of our online life that had never occurred to me before.

11 May 2018

Online Hours: Supporting Open Source


Encouraging collaboration
Here at the UK Web Archive, we're very fortunate to be able to work in the open, with almost all code on GitHub. Some of our work has been taken up and re-used by others, which is great. We’d like to encourage more collaboration, but we've had trouble dedicating time to open project management, and our overall management process and our future plans are unclear. For example, we've experimented with so many different technologies over the years that our list of repositories gives little insight into where we're going next. There are also problems with how issues and pull requests have been managed: often languishing unanswered, waiting for us to get around to looking at them. This also applies to the IIPC repositories and other projects we are involved in, as well as the projects we lead.

I wanted to block out some time to deal with these things promptly, but also to find a way of making it a bit more, well, fun. A bit more social. Some forum where we can chat about our process and plans without the formality of having to write things up.

Taking inspiration from Jason Scott's live-streamed CD-ripping sessions, we came up with the idea of something like Office Hours for Open Source -- a kind of open video conference or live stream, where we'll share our process, discuss issues relating to open source projects and have a forum where anyone can ask questions about what we’re up to.

Who is this for?
All welcome, from lurkers to those brimming with burning questions. Just remember that being *kind* beats being right.

Furthermore, anyone else who manages open source projects like ours is also welcome to join and take the lead for a while! I can only cover the projects we’re leading, but there are many more that would be interesting to hear from.

When?
The plan is to launch the first Online Hours session on the 22nd of May, and then hold regular weekly slots every Tuesday from then on. We may not manage to run it every single week, but if it’s regular and frequent that should mean we can cope more easily with missing the odd one or two.

On the 22nd, we will run two sessions - one in the morning (for the west-of-GMT time-zones) and one in the evening (for the eastern half). Following that, we intend to switch between the two slots, making each a.m. and p.m. slot a fortnightly occurrence.

How?
The sessions will be webcast, with a Slack channel available for chat. See the IIPC Trello board for more information.

The IIPC (International Internet Preservation Consortium) have kindly agreed to help support this event and further Online Hours sessions. Running this initiative in a more open manner should raise the profile of our open source work both inside and outside of the IIPC, and encourage greater adoption of, and collaboration around, open source tools.

For full details, see the IIPC Trello Board card or ask a question in the NetPreserve Slack Channel #oh-sos (ask @NetPreserve to join the Slack).

See you there!

By Andrew Jackson, Web Archive Technical Lead, The British Library

 

04 May 2018

Star Wars in the Web Archive


May the fourth be with you!

It's Star Wars day and I imagine that you are curious to know which side has won the battle of the UK web space?

Using the trends in our SHINE dataset (.uk websites 1996-2013, collected by the Internet Archive), I first looked at the iconic match-up of Luke vs Darth.

Shine-darth-vader

Bad news: evil seems to have won this round, mainly due to the popularity of Darth Vader costume mentions on retail websites.

How about a more general 'Light Side vs Dark Side'?

Shine-lightside-v-darkside

It appears that discussing the 'dark side' of many aspects in life is a lot more fun and interesting than the 'light side'. 

How about just analysing the phrase 'may the force be with you'?

Shine-may the force be with you

This phrase doesn't seem to have been particularly popular on the UK web until it started to be used a lot on websites offering downloadable ringtones. Go figure.

Try using the trends feature on this dataset yourself here: www.webarchive.org.uk/shine/graph

Happy Star Wars day!

by Jason Webber, Web Archive Engagement Manager, The British Library

@UKWebArchive