UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

5 posts categorized "Social sciences"

22 May 2024

Reflections on the IIPC Early Scholars Spring School on Web Archives 2024

By Cameron Huggett, PhD Student (CDP), British Library/Teesside University

IIPC Early Scholars Spring School on Web Archives banner

My name is Cameron, and I am currently undertaking an AHRC funded Collaborative Doctoral Partnership (CDP) project, between the British Library and Teesside University. My research centres on racial discourses within association football fanzines and e-zines from c.1975 to the present, and aims to examine the broader connections between football fandom, race and identity. 

I attended the Early Scholars Spring School on Web Archives, prior to commencement of the conference, which allowed me to share knowledge with colleagues from a number of different countries, institutions and disciplines, offering new perspectives on my own research. Within this school, I was fortunate enough to be able to deliver a short lightning talk, outlining my own use of web archiving within my research into the history of racial discourses within football fanzines. This generated an engaging discussion around my methodologies and led me to reflect upon how quantitative techniques can be better adopted within historical research practices.

I also particularly enjoyed discovering more about the collections of the Bibliothèque Nationale de France (BNF) and Institut National de L'audiovisuel (INA). The scope of the collections and innovative user interfaces were particularly impressive. For example, INA had created a programme that allowed the user to view a collection item, such as an election debate broadcast, alongside archived tweets relating to the event in real time.

 My primary takeaway was how web archives can be innovatively employed to record the breadth and depth of online communities and discourses, as well as supplement more traditional sources within a historian’s research framework.  

06 October 2022

WARCnet Special Report: Skills, Tools and Knowledge Ecologies in Web Archive Research, 2022

by Sharon Healy, Maynooth University (Project Lead)

WARST report image - skills, tools and knowledge ecologies in web archive research

The WARST team are delighted to announce the publication of a WARCnet Special Report, titled: Skills, Tools and Knowledge Ecologies in Web Archive Research. This study is part of a collaborative project by researchers from Maynooth University, the British Library, the International Internet Preservation Consortium, Bayerische Staatsbibliothek, and the University of Siegen. The research team are all members of the Web ARChive studies network researching web domains and events (WARCnet).

The study focuses on individuals around the globe who participate in web archive research, in the context of web archiving, curation, and the use of web archives and archived web content for research or other purposes. We consider web archive research to be representative of the processes and activities described in Archive-It’s web archiving life cycle model from appraisal, acquisition, and preservation, to replay, access, use and reuse (Bragg & Hannah, 2013).

The methodology for the study entailed desk research, participation in WARCnet meeting discussions, and an online questionnaire. The study sought to identify and document the skills, tools, and knowledge required to achieve a broad range of goals within the web archiving life cycle and to explore the challenges for participation in web archive research, and the interludes of such challenges across communities of practice. We suggest that there is a perpetual need to examine the roles of skills, tools, and methods associated with the web archiving life cycle as long as internet, web and software technologies keep advancing, upgrading, and changing.

The Executive summary offers an overview of the findings, and is translated into Danish, French, Spanish and Catalan.

The Report is available to download from the WARCnet website:

https://cc.au.dk/fileadmin/dac/Projekter/WARCnet/Healy_et_al_Skills_Tools_and_Knowledge_Ecologies.pdf

A section of the Report that focused on the software, tools and methods used in the web archive research life cycle was presented in a poster at iPres 2022.

30 September 2022

Celebrating Sporting Heritage Day 2022

By Helena Byrne, Curator of Web Archives, The British Library

This blog post gives an overview of our sports-related activities for the year to celebrate Sporting Heritage Day 2022.

2022 has been, and continues to be, a really busy year for international sport, especially in the UK. The Winter Olympics in Beijing and the Commonwealth Games in Birmingham had been scheduled for 2022 years in advance. However, the Covid-19 pandemic disrupted many events in 2020 and 2021, and several sporting events were postponed. The UEFA Women's Euros and the Rugby League World Cup, both hosted by England, were moved from 2021 to 2022, meaning that 2022 was even busier than normal in terms of major sporting events.

Sport has always been an important part of the UK Web Archive, so 2022 has been a busy year for us so far. Since 2017, sports content has been grouped into three separate collections:

Sports Collection - https://www.webarchive.org.uk/en/ukwa/collection/1768 

Sports: Football - https://www.webarchive.org.uk/en/ukwa/collection/1490 

Sports: International Events - https://www.webarchive.org.uk/en/ukwa/collection/2315 

The UK Web Archive regularly publishes blog posts about sport, which can be found here: https://blogs.bl.uk/webarchive/sports/

2022 Winter Olympics and Paralympics

As members of the International Internet Preservation Consortium (IIPC) both the British Library and the National Library of Scotland contributed to the IIPC Content Development Group (CDG) 2022 Winter Olympics and Paralympics collection.

The Olympics took place in Beijing from 4 to 20 February 2022, while the Paralympics were also in Beijing from 4 to 13 March 2022. 

The collection archived 863 items, which included whole websites, subsections or individual pages from websites. These items come from 38 countries, and 24 different languages are represented in the collection. Topics covered events both on and off the sporting field.

Browse the collection here:

https://archive-it.org/collections/18422 

UEFA Women’s Euro England 2022

The UEFA Women's Euro 2022 competition took place across England from 6 to 31 July 2022. Although the event is over, we are still collecting websites about the Euros from around the UK until the end of October.

This collection covers both the sporting and cultural achievements of the event. There are over 275 items in the UEFA Women’s Euro England 2022 collection.

So far we have published seven blog posts about the Women’s Euros and there are still more to come. They can be found on the UK Web Archive blog with the sports tag here:

https://blogs.bl.uk/webarchive/sports/ 

Browse the collection here: https://www.webarchive.org.uk/en/ukwa/collection/4278

Commonwealth Games Birmingham 2022

Commonwealth Games Birmingham 2022 ran from 28 July to 8 August. Although the sporting events are over, the cultural programme is continuing for a number of weeks. This means that UKWA still has an open call for nominations for this collection.

The collection covers both the sporting and cultural achievements as well as the social impact of this mega event. So far there are 434 items in the Commonwealth Games Birmingham 2022 collection.

Browse the collection here: https://www.webarchive.org.uk/en/ukwa/collection/4228 

Rugby League World Cup 2021

The Rugby League World Cup 2021 will take place from 15 October to 19 November 2022 across England. 

This event is unique in that the men's, women's and wheelchair competitions all take place alongside each other. You can nominate your UK-published Rugby League World Cup content here: https://www.webarchive.org.uk/nominate 

Updates on this collection will be published on the UK Web Archive blog and Twitter account.

When published, this collection will sit as a subsection of the Sports: International Events collection on the UKWA Topics & Themes page and will be available here: https://www.webarchive.org.uk/en/ukwa/collection/2315 

Access to the collections 

All of the archived content in the IIPC CDG 2022 Winter Olympics and Paralympics collection is open access. CDG collaborative collections are archived using the Archive-It platform, meaning that all archived content is open access, although a publisher may request its removal under the Internet Archive’s general terms and conditions.

All CDG collections can be viewed here: https://archive-it.org/home/IIPC 

UK Web Archive content has a mix of on-site and remote access due to the Non-Print Legal Deposit Regulations implemented in 2013. The full manifest of content selected for UK Web Archive collections is visible on the website, but access to individual archived websites depends on permission being granted by website publishers. A note under each title informs users whether they can view the archived website online or whether they need to visit a UK Legal Deposit Library to view the archived content. 

All curated collections can be found on the Topics and Themes page of the UK Web Archive website: https://www.webarchive.org.uk/en/ukwa/category 

Get involved

The UK Web Archive is a partnership of the six UK Legal Deposit Libraries and works with other external partners in order to expand our subject expertise. However, we can’t curate the whole of the UK web on our own: we need your help to ensure that information, discussions and creative output related to sports are preserved for future generations.

Anyone can suggest UK-published websites to be included in the UK Web Archive by filling in our nomination form: https://www.webarchive.org.uk/nominate 

19 October 2021

Clouds and blackberries: how web archives can help us to track the changing meaning of words

By Dr Barbara McGillivray (Turing Fellow), Pierpaolo Basile (Assistant Professor in Computer Science, University of Bari), Dr Marya Bazzi (Turing Fellow), and Dr Jenny Basford and Jason Webber (British Library)

NOTE: This is a re-blog from the Alan Turing Institute, with permission.

The meaning of words changes all the time. Think of the word ‘blackberry’, for example, which has been used for centuries to refer to a fruit. In 1999, a new brand of mobile devices was launched with the name BlackBerry. Suddenly, there was a new way of using this old word. ‘Cloud’ is another example of a well-established word whose association with ‘cloud computing’ only emerged in the past couple of decades. Linguists call this phenomenon ‘semantic change’ and have studied its complex mechanisms for a long time. What has changed in recent years is that we now have access to huge collections of data which can be mined to find these changes automatically. Web archives are a great example of such collections, because they contain a record of the changing content of web pages.

But how can we automatically detect in a huge web archive when a word has changed its meaning? A common strategy is to build geometric representations of words called word embeddings. Word embeddings use lots of data about the context in which words are used so that similar words can be clustered together. We can then do operations on these embeddings, for example to find the words that are closest (and most similar in meaning) to a given word. It’s a useful technique, but building embeddings takes a lot of computing power. Having access to pre-trained embeddings can therefore make a big difference, enabling those in the scientific community without sufficient computational resources to participate in this research.
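As a rough illustration of the 'closest words' operation described above, the sketch below ranks words by cosine similarity to a query word. The vocabulary and the three-dimensional vectors are invented toy values, not taken from any real embedding (real embeddings typically have hundreds of dimensions), and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_words(query, embeddings, k=2):
    """Return the k words whose vectors are most similar to the query word."""
    query_vector = embeddings[query]
    scored = [
        (word, cosine_similarity(query_vector, vector))
        for word, vector in embeddings.items()
        if word != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:k]]


# Toy vectors, made up for illustration only.
TOY_EMBEDDINGS = {
    "blackberry": [0.9, 0.1, 0.0],
    "raspberry": [0.8, 0.2, 0.1],
    "jam": [0.7, 0.3, 0.0],
    "phone": [0.1, 0.9, 0.2],
    "email": [0.0, 0.8, 0.3],
}

print(nearest_words("blackberry", TOY_EMBEDDINGS))  # ['raspberry', 'jam']
```

With these toy values 'blackberry' sits next to the fruit-related words; in an embedding built from later web pages one would expect device-related words to move closer.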

A team of researchers from The Alan Turing Institute and the Universities of Bari, Oxford and Warwick, in collaboration with the UK Web Archive team based at the British Library, has now released DUKweb, a set of large-scale resources that make pre-trained word embeddings freely available. Described in this article, DUKweb was created from the JISC UK Web Domain Dataset (1996-2013), a collection of all .uk websites archived by the Internet Archive between 1996 and 2013. (This dataset is held and maintained by the UK Web Archive, which has been collecting websites since 2005, initially on a selective basis and since 2013 at a whole domain level.) DUKweb contains 1.3 billion word occurrences and two types of word embeddings for each year of the JISC UK Web Domain Dataset. The size of DUKweb is 330GB.

Researchers can use DUKweb to study semantic change in English between 1996 and 2013, looking at, for instance, the effects of the growth of the internet and social media on word meanings. For example, if the word ‘blackberry’ is used mostly to refer to fruits in 1996 and to mobile phones in 2000, the 1996 embedding for this word will be quite different from its 2000 embedding. In this way, we can find words that may have changed meaning in this time period. The figure below (from Tsakalidis et al., 2019) shows four words whose contexts of use have changed in the last couple of decades: ‘blackberry’, ‘cloud’, ‘eta’ and ‘follow’. The bars indicate words most similar to these four words in 2000 (red bars) and in 2013 (blue bars). The scale along the bottom gives a measure of the change.
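The year-to-year comparison described above can be sketched as follows. This is a minimal example, not the method used to build or analyse DUKweb: the vectors are invented toy values, the function names are hypothetical, and it assumes the vectors for both years already live in a shared, comparable space (embeddings trained independently for each year generally need to be aligned before such a comparison makes sense).

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity: 0 means same direction, larger means more change."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_change(embeddings_early, embeddings_late):
    """Rank words present in both years by how far their vectors have moved."""
    shared = set(embeddings_early) & set(embeddings_late)
    scored = [
        (word, cosine_distance(embeddings_early[word], embeddings_late[word]))
        for word in shared
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors, made up for illustration only.
EMBEDDINGS_1996 = {"blackberry": [0.9, 0.1, 0.0], "jam": [0.7, 0.3, 0.0]}
EMBEDDINGS_2000 = {"blackberry": [0.3, 0.8, 0.2], "jam": [0.7, 0.3, 0.1]}

for word, distance in rank_by_change(EMBEDDINGS_1996, EMBEDDINGS_2000):
    print(f"{word}: {distance:.2f}")
```

Words at the top of the ranking are candidates for semantic change; they would still need manual inspection, since a shift in context does not always mean a shift in meaning.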

figure 02 - analysis - clouds, blackberries

The resources that underpin DUKweb are hosted on the British Library’s research repository, and are available for anyone in the world to download, reuse and repurpose for their own projects. This repository is part of the BL’s Shared Research Repository for cultural heritage organisations, which brings together the research outputs produced by participating institutions, and makes them discoverable to anybody with an internet connection. Providing a stable, dedicated location to hold heritage datasets in order to share them with a wider research community has been one of the key drivers in the implementation and development of this repository service. We are grateful to the British Library’s Repository Services team for supporting this collaboration between the UK Web Archive team and the Turing by making the content for DUKweb available.

Read the paper: DUKweb: diachronic word representations from the UK Web Archive corpus

 

25 August 2020

Cats vs Dogs on the Archived Web

 By Helena Byrne, Curator of Web Archives at the British Library

 

Cats and dogs, two of the most popular pets in the world, have international days of celebration in August. 8 August is International Cat Day and 26 August is International Dog Day. 

 

How popular are cats and dogs on the archived web?

 

Cats vs Dogs
Screenshot of the search results on Shine for Cat and Dog

 

One way to answer this question is to use the Shine Trends feature. Shine was developed as part of the Big UK Data Arts and Humanities project funded by the AHRC. The data was acquired by JISC from the Internet Archive and includes all .uk websites in the Internet Archive web collection crawled between 1996 and April 2013. The collection comprises over 3.5 billion items (URLs, images and other documents) and has been full-text indexed by the UK Web Archive. Every word of every website in the collection can be searched for and analysed.

 

Taking the Shine graph at face value, overall it would seem that cats are more popular on the archived .uk domain than dogs.

 

The graph shows the percentage of resources archived for each year. The largest peak on the graph doesn’t necessarily mark the year with the most mentions of your search term, because the amount of data archived differs from year to year. When it comes to ‘Cats vs Dogs’, the largest peak for ‘Cat’ is also its year with the most mentions, while the year with the most mentions of ‘Dog’ sits slightly below the peak in the graph. In 2005, there were almost 14.2 million mentions of ‘cat’ out of 331 million resources archived, while in 2012 there were almost 13 million mentions of ‘Dog’ out of 464 million resources archived that year.
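To make the difference between raw counts and percentages concrete, the short sketch below turns the figures quoted above into shares of the resources archived in each year. It treats the rounded counts from this post as exact, and the function name is hypothetical; it is not how Shine itself computes its graph.

```python
def share_of_resources(mentions, resources):
    """Mentions as a percentage of all resources archived that year."""
    return 100.0 * mentions / resources


# Rounded figures quoted in this post ("almost" 14.2 million and 13 million).
cat_2005 = share_of_resources(14_200_000, 331_000_000)
dog_2012 = share_of_resources(13_000_000, 464_000_000)

print(f"'cat' in 2005: {cat_2005:.2f}%")  # 4.29%
print(f"'dog' in 2012: {dog_2012:.2f}%")  # 2.80%
```

The raw counts differ by under 10%, but because far more was archived in 2012 the percentage for ‘Dog’ is noticeably lower than that for ‘Cat’.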

It is not possible to view every archived resource attributed to the generated stats, but you can click on markers along the plotted graph and you will be supplied with a random sample of matching records for that year. The sample displays a sentence where the term appears, as well as a link out to the Internet Archive so that you can review the archived website.

When we review the random sample for ‘Cat’ generated for 2005, we can see that very few of the references are to our furry friends; instead, the word “Cat” mostly appears as an abbreviation for catalogue (for shopping online). This reflects changes in how the web was used, as online shopping became more popular during this period. By looking through some of the other samples we can see the use of the term ‘CAT’ as an acronym for various different systems.

On the other hand, when we look at the sample results for ‘Dog’ in 2012, most of the results are about the animal or related products such as dog food and dog accessories.

 

Possible big data project

 

After reviewing the use of the terms ‘Cat’ and ‘Dog’, can we really say that the animal-related variation is the most popular on the archived .uk domain?

A possible way to truly determine which family pet is the most popular would be through an in-depth analysis of the .uk domain. Something similar to the project ‘Mining the UK Web Archive for Semantic Change Detection’, run by the Alan Turing Institute, would provide more insight into which animal is more popular in this dataset. 

This project identified words whose meaning has changed over time on the archived web. For example, the word ‘tweet’ shifted from commonly referring to the sound a bird makes to more often describing a message sent through the social media platform Twitter.

Pierpaolo Basile, a visiting researcher at the Alan Turing Institute, used the same data that is behind Shine in his research project ‘Detecting semantic shift in large corpora by exploiting temporal random indexing’. You can watch a recording of a presentation about this research on the Alan Turing Institute YouTube channel.

 

What cat and dog websites are in the UK Web Archive?

 

The general UK Web Archive and a number of curated collections on the Topics and Themes page of the website feature many animal-related websites, and a lot of these focus on cats and dogs. Although archiving social media is very challenging, we do have a wide selection of Twitter accounts in the archive. These include many cat persona profiles, from libraries to political cats. Some of the political cats included in the archive are Larry the Cat from 10 Downing Street and Palmerston from the Foreign Office. We haven’t come across any similar UK dog persona profiles, so if you know of any, please nominate them to be included in the UK Web Archive. However, there are other Twitter profiles that collect images of dogs, such as Non-League Dogs. This profile is included in both the soccer section of our Sports: Football collection and our Online Enthusiast Communities in the UK collection.

Animal welfare websites are also well represented in our UK General Election series of collections dating from 2005 to 2019, as many publish political manifestos during the election period.

As mentioned in the International Owl Awareness Day blog post, the Online Enthusiast Communities in the UK curated collection has an Animal Related Hobbies subsection. Here you can find a number of cat and dog-related sites but we know there are many more out there. Why not nominate your favourite websites and forums?

 

How can you access these archived websites?

 

Under the Non-Print Legal Deposit Regulations 2013, we can archive UK websites, but we are only able to make them available to people outside the Legal Deposit Libraries’ Reading Rooms if the website owner has given permission. The UK Legal Deposit Libraries are the British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries, Cambridge University Library and Trinity College Dublin Library. Some of the sites in the collection have already had permission granted, such as the Battersea Dogs & Cats Home, Cats Protection and Library Cat. Some examples of websites that are onsite-only access include Dogs Trust, Dog Forum and Purrs In Our Hearts Forum.

 

As the content of the UK Web Archive has mixed access, the message ‘Viewable only on Library premises’ will appear under the title if you need to visit a Legal Deposit Library to view the content. If there is no message underneath then the archived version of the website should be available on your personal device.

 

Get involved with preserving cats and dogs online with the UK Web Archive

 

The UK Web Archive aims to archive, preserve and give access to the UK web space. We endeavour to include important aspects of British culture and events that shape society. Animals and especially pets in the UK are an important aspect of our collective national culture and are represented in several collections across the UK Legal Deposit Libraries, including the UK Web Archive.

 

We can’t, however, curate the whole of the UK web on our own; we need your help to ensure that information, discussion and creative output on this subject are preserved for future generations. Anyone can suggest UK websites to be included in the UK Web Archive by filling in our nominations form: https://www.webarchive.org.uk/en/ukwa/nominate

 

Browse through what we have so far and please nominate more content!