UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

2 posts from October 2021

19 October 2021

Clouds and blackberries: how web archives can help us to track the changing meaning of words

By Dr Barbara McGillivray (Turing Fellow), Pierpaolo Basile (Assistant Professor in Computer Science, University of Bari), Dr Marya Bazzi (Turing Fellow), and Dr Jenny Basford and Jason Webber (British Library)

NOTE: This is a re-blog from the Alan Turing Institute, with permission.

The meaning of words changes all the time. Think of the word ‘blackberry’, for example, which has been used for centuries to refer to a fruit. In 1999, a new brand of mobile devices was launched with the name BlackBerry. Suddenly, there was a new way of using this old word. ‘Cloud’ is another example of a well-established word whose association with ‘cloud computing’ only emerged in the past couple of decades. Linguists call this phenomenon ‘semantic change’ and have studied its complex mechanisms for a long time. What has changed in recent years is that we now have access to huge collections of data which can be mined to find these changes automatically. Web archives are a great example of such collections, because they contain a record of the changing content of web pages.

But how can we automatically detect in a huge web archive when a word has changed its meaning? A common strategy is to build geometric representations of words called word embeddings. Word embeddings use lots of data about the context in which words are used so that similar words can be clustered together. We can then do operations on these embeddings, for example to find the words that are closest (and most similar in meaning) to a given word. It’s a useful technique, but building embeddings takes a lot of computing power. Having access to pre-trained embeddings can therefore make a big difference, enabling those in the scientific community without sufficient computational resources to participate in this research.
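As a rough illustration of the "closest words" operation described above, here is a minimal sketch in Python. It assumes each word is represented by a short list of numbers and measures closeness with cosine similarity, a common choice for word embeddings. The vectors in `toy_embeddings` and the function names are made up for this example and are not taken from any real embedding set.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_u * norm_v)


def nearest_neighbours(word, embeddings, k=3):
    """Return the k words whose vectors are most similar to `word`."""
    target = embeddings[word]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical three-dimensional vectors, invented for illustration only.
# Real embeddings typically have hundreds of dimensions.
toy_embeddings = {
    "blackberry": [0.9, 0.1, 0.0],
    "raspberry": [0.8, 0.2, 0.1],
    "jam": [0.7, 0.3, 0.0],
    "phone": [0.1, 0.9, 0.2],
    "email": [0.0, 0.8, 0.3],
}

for neighbour, score in nearest_neighbours("blackberry", toy_embeddings, k=2):
    print(f"{neighbour}: {score:.3f}")
```

With these invented vectors, the fruit-related words come out closest to ‘blackberry’ because their vectors point in a similar direction.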

A team of researchers from The Alan Turing Institute and the Universities of Bari, Oxford and Warwick, in collaboration with the UK Web Archive team based at the British Library, has now released DUKweb, a set of large-scale resources that make pre-trained word embeddings freely available. Described in this article, DUKweb was created from the JISC UK Web Domain Dataset (1996-2013), a collection of all .uk websites archived by the Internet Archive between 1996 and 2013. (This dataset is held and maintained by the UK Web Archive, which has been collecting websites since 2005, initially on a selective basis and since 2013 at a whole domain level.) DUKweb contains 1.3 billion word occurrences and two types of word embeddings for each year of the JISC UK Web Domain Dataset. The size of DUKweb is 330GB.

Researchers can use DUKweb to study semantic change in English between 1996 and 2013, looking at, for instance, the effects of the growth of the internet and social media on word meanings. For example, if the word ‘blackberry’ is used mostly to refer to fruits in 1996 and to mobile phones in 2000, the 1996 embedding for this word will be quite different from its 2000 embedding. In this way, we can find words that may have changed meaning in this time period. The figure below (from Tsakalidis et al., 2019) shows four words whose contexts of use have changed in the last couple of decades: ‘blackberry’, ‘cloud’, ‘eta’ and ‘follow’. The bars indicate words most similar to these four words in 2000 (red bars) and in 2013 (blue bars). The scale along the bottom gives a measure of the change.

figure 02 - analysis - clouds, blackberries
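One simple way to sketch the comparison described above is to look at how much a word's list of most similar words differs between two years. The Python example below does this with a Jaccard distance over neighbour sets. It is only an illustration under stated assumptions: the neighbour lists in `neighbours_by_year` are invented for this example, the function names are hypothetical, and this is not necessarily the measure used in the figure or in the DUKweb paper.

```python
def jaccard_distance(first, second):
    """Return 1 - |intersection| / |union| for two collections of words."""
    first, second = set(first), set(second)
    if not first and not second:
        return 0.0  # two empty neighbour lists are treated as identical
    return 1.0 - len(first & second) / len(first | second)


def rank_by_change(neighbours_by_year, year_a, year_b):
    """Rank words from most to least changed between two years."""
    scores = {
        word: jaccard_distance(by_year[year_a], by_year[year_b])
        for word, by_year in neighbours_by_year.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Invented neighbour lists, for illustration only (not taken from DUKweb).
neighbours_by_year = {
    "blackberry": {
        2000: ["raspberry", "jam", "fruit", "bramble"],
        2013: ["phone", "smartphone", "email", "fruit"],
    },
    "cloud": {
        2000: ["rain", "sky", "storm", "fog"],
        2013: ["computing", "storage", "sky", "rain"],
    },
    "garden": {
        2000: ["flowers", "lawn", "plants", "shed"],
        2013: ["flowers", "lawn", "plants", "patio"],
    },
}

for word, score in rank_by_change(neighbours_by_year, 2000, 2013):
    print(f"{word}: {score:.2f}")
```

Comparing neighbour lists rather than raw vectors sidesteps the question of whether the vector spaces for different years are directly comparable. A word whose neighbours barely overlap between the two years is a candidate for semantic change and worth a closer look.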

The resources that underpin DUKweb are hosted on the British Library’s research repository, and are available for anyone in the world to download, reuse and repurpose for their own projects. This repository is part of the BL’s Shared Research Repository for cultural heritage organisations, which brings together the research outputs produced by participating institutions, and makes them discoverable to anybody with an internet connection. Providing a stable, dedicated location to hold heritage datasets in order to share them with a wider research community has been one of the key drivers in the implementation and development of this repository service. We are grateful to the British Library’s Repository Services team for supporting this collaboration between the UK Web Archive team and the Turing by making the content for DUKweb available.

Read the paper: DUKweb: diachronic word representations from the UK Web Archive corpus

 

04 October 2021

UK Web Archive Climate Change Collection

By Andrea Deri, Cataloguer, Lead Curator of UK Web Archive Climate Change Collection; Nicola Bingham, Lead Curator, Web Archives; Eilidh MacGlone, Web Archivist; Trevor Thomson, General Collections Assistant (Collection Development) National Library of Scotland


What public climate and sustainability related UK websites would you preserve for future research?

What public UK websites tell the story of climate change actions in your areas of living, travelling, working, study and passions?

Nominate these websites to the UK Web Archive Climate Change Collection. You can nominate as many websites or webpages as you feel are relevant.

Desert landscape - Photo by '_Marion'

About the Climate Change Collection
The UK Web Archive Climate Change Collection is not only an archive of past digital content preserved for future research. It is also a live, dynamic, growing resource for decisions, research and learning today.  

Much of the debate around climate change is taking place on the Web and is, therefore, highly ephemeral, meaning it is important to capture it now, in real time. The UK Web Archive Climate Change collection does just that: it captures climate-related public UK websites and archives them regularly, according to how frequently each website is updated.

What is the UK Web Archive?
The UK Web Archive (UKWA) is a collaboration of the six UK legal deposit libraries working together to preserve websites for future generations. The Climate Change collection is one of over a hundred curated collections of the UK Web Archive. Given the multi-, inter- and transdisciplinary nature of the climate crisis, researchers may also find several other UKWA collections relevant for studying climate change, for example News Sites, Science Collection, British Countryside, Energy, Local History Societies, District Councils, Political Action and Communication, and Brexit, among others.

While all the UK legal deposit libraries contribute subject expertise to the Climate Change collection’s development, to make it more representative we solicit nominations as widely as possible. To this end we have developed a simple form, which allows anyone to nominate public websites or web pages published in the UK. If you would like to nominate a website for the UK Web Archive Climate Change collection, add the title, URL and a brief description of the website or webpage.

UKWA Climate change nomination-form

If you would like us to acknowledge your nomination, enter your name and email address.

What can UKWA archive?
Before you nominate, you might want to check your nomination for scope and duplication. The UK Web Archive cannot archive sound and video platforms in which the audio and video content dominate. Nor can it archive websites that require personal log-in details (for example Facebook sites), private intranets, emails, personal data on social networking sites, or websites accessible only to restricted groups.

What happens to my nomination?
All nominations are checked manually by a curator. If the website meets the requirements of non-print legal deposit, it is added to the collection by library staff without any prejudice regarding content. We want to make the climate change collection representative of diverse perspectives. The annotation process includes assigning broad subject labels, setting the crawl frequency (the frequency of archiving), and sending a licensing request for making historical pages public. While all UKWA Climate Change collection titles are listed online, archived versions of the websites can be accessed only in legal deposit libraries’ reading rooms unless licensed.

Why is this collection important?
The UKWA Climate Change collection serves several functions, three being particularly important: 

  1. Supports research - Supports research related to climate change issues
  2. Raises awareness & curiosity - Makes readers aware of and curious about the diversity of climate change impacts, mitigation and adaptation activities across scales
  3. Engages in action - Inspires readers to take action including nominating websites for future preservation and by doing so contributing to the knowledge base of climate change

By inviting nominations, the UKWA Climate Change collection draws on a citizen science approach: in other words, it engages members of the public in academic research and in developing the collection. The integration of library science and citizen science acknowledges the complementary values of diverse forms of knowledge, including diverse forms of local knowledge. With their nominations, contributors can diversify existing sub-collections and initiate the creation of new sub-collections. For example, a new sub-collection dedicated to the climate change & sustainability strategies of UK galleries, libraries, archives and museums (GLAM sector) has recently been suggested.

History of the Collection
The collection was established when the Paris Agreement was negotiated at the UNFCCC COP21, in 2015. The acceleration of the climate crises, the exponential growth of digital climate content publishing and the demand for innovations that can be inspired by a diversity of knowledge (local, practical, technical and academic) called for an upgrade. The Climate Change collection is an important source of knowledge in preparation for the UNFCCC COP26 conference in Glasgow.

Websites and webpages archived over time tell the stories of how individuals and organisations have been making sense of and responding to the climate crises. We encourage you to nominate the public websites that tell the stories of your engagement with the changing climate and websites you want to preserve for future generations.

Further recommended sources