UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

2 posts from January 2015

30 January 2015

Collecting Data To Improve Tools

Like many other institutions, we are heavily dependent on a number of open source tools. We couldn’t function without them, and so we like to find ways to give back to those communities. We don’t have a lot of spare time or development capacity to contribute, but recently we have found another way to provide useful feedback.

Large-scale extraction

At the heart of our discovery stack lies Apache Tika, the piece of software we use to try to parse the myriad of data formats in our collection in order to extract the textual representation (along with any useful metadata) that goes into our search indexes. Consequently, we have now executed Apache Tika on many billions of distinct resources, dating from 1995 to the present day. Due to the age and variability of the content, this often tests Tika to its limits. As well as failing to identify many formats, it sometimes simply fails, either by throwing an unexpected error or by getting locked in an infinite loop.

Logging losses

Each of those failures represents a loss – a resource that may never be discovered because we can’t understand it. This may be because it’s malformed, perhaps even damaged during download. It may also be a sign of obsolescence, in that it may indicate the presence of data formats that are poorly understood, and are therefore likely to present a challenge to our discovery and access systems. So, instead of ignoring these errors, we decided to remember them. Specifically, each is logged as a facet of our full-text index, alongside the identity of the resource that caused the problem.
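
A minimal sketch of this idea in Python, and not the actual indexing code: each parser failure is stored as a field on the index record, next to the URL of the resource, so that failures can later be counted per error type in the way a facet query would. The names `index_record`, `error_facets`, `toy_parser` and the `parse_error` field are all hypothetical, and `toy_parser` is only a stand-in for a real parser such as Apache Tika. The sketch does not handle the infinite-loop case, which would need a timeout around the parser call.

```python
from collections import Counter


def index_record(url, payload, parser):
    """Build an index record, keeping any parser failure as a field."""
    doc = {"url": url, "text": None, "parse_error": None}
    try:
        doc["text"] = parser(payload)
    except Exception as exc:  # deliberately broad: every failure is recorded
        # Store the exception class name so it can be faceted on later.
        doc["parse_error"] = type(exc).__name__
    return doc


def error_facets(docs):
    """Count records per error type, much like a facet over the index."""
    return Counter(d["parse_error"] for d in docs if d["parse_error"])


def toy_parser(payload):
    """Hypothetical stand-in for a real parser such as Apache Tika."""
    if payload.startswith(b"%PDF"):
        raise ValueError("pretend this PDF is truncated")
    return payload.decode("utf-8")


docs = [
    index_record("http://example.org/a", b"hello", toy_parser),
    index_record("http://example.org/b", b"%PDF-1.4", toy_parser),
    index_record("http://example.org/c", b"\xff\xfe", toy_parser),
]
facets = error_facets(docs)
```

Because the URL is stored alongside the error, each facet value leads straight back to the resources that triggered it, which is what makes the data useful for telling a damaged file from an unsupported format.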

Sharing the results

We’ve been collecting this data for a while, in order to help us tell a broken bitstream from a forgotten format. However, in a recent discussion, the Apache Tika developers indicated that they would also find this data useful as a way of improving the coverage and robustness of their software.

This turns out to be a win-win situation. We store the data we were intending to store anyway, but also share it with the tool developers, who get to improve their software in ways we will be able to take direct advantage of as we run later versions of the tool over our archives in the future.

And it feels good to give a little something back.

– by Andy Jackson

@anjacks0n 

 

28 January 2015

Spam as a very ephemeral (and annoying) genre…

Spam is a part of modern life. Anyone who hasn’t received any recently is a lucky person indeed. But just try putting your email address out in the open and you’ll be blessed with endless messages you don’t want, from people you don’t know, from places you’ve never heard of! And then it’s just delete, de-le-te, block sender…

Imagine, though, someone researching our web lives in, say, 50 years and finding that this part of our daily existence is nowhere to be found. Spam is the ugly sister of the Web Archive: it is unlikely we’ll keep spam messages in our inboxes, and almost certainly no institution will keep them for posterity. And yet they are such great research materials. They vary in topic, they can be funny, they can be dangerous (especially to your wallet), and they make you shake your head in disbelief…

We all know the spam emails about people who got stuck somewhere, can’t pay the bill and ask for a modest sum of £2,500 or so. These always make me think: if I had a spare £2,500, it’d be Bora Bora here I come, but that’s just selfish me! Now these have been taken to a new level. It’s about giving us the money that is inconveniently placed in a bank somewhere far, far away:

Charity spree

From Mrs A.J., a widow of a Kuwait embassy worker in Ivory Coast with a very English surname:

…Currently, this money is still in the bank. Recently, my doctor told me I would not last for the next eight months due to cancer problem. What disturbs me most is my stroke sickness. Having known my condition I decided to donate this fund to a charity or the man or woman who will utilize this money the way I am going to instruct here godly.

Strangely, two weeks later a Libyan lady, who is also a widow, wrote to tell me that she too had suffered a stroke and that all she wants is to shower me with money as part of her charity spree:

Having donated to several individuals and charity organization from our savings, I have decided to anonymously donate the last of our family savings to you. Irrespective of your previous financial status, please do accept this kind and peaceful offer on behalf of my beloved family.

Mr. P. N. ‘an accountant with the ministry of Energy and natural resources South Africa’ was straight to the point:

… presently we discovered the sum of 8.6 million British pounds sterling, floating in our suspense Account. This money as a matter of fact was an over invoiced Contract payment which has been approved for payment Since 2006, now we want to secretly transfer This money out for our personal use into an overseas Account if you will allow us to use your account to Receive this fund, we shall give you 30% for all your Effort and expenses you will incure if you agree to Help.

My favourite is quite light-hearted. I got it from a 32-year-old Swedish girl:

My aim of writing you is for us to be friends, a distance friend and from there we can take it to the next level, I writing this with the purest of heart and I do hope that it will your attention. In terms of what I seek in a relationship, I'd like to find a balance of independence and true intimacy, two separate minds and identities forged by trust and open communication. If any of this strikes your fancy, do let me know...

So what if I’m a girl too, with a husband and a kid? You never know what may be handy…

Blog post by Dorota Walker 
Assistant Web Archivist

@DorotaWalker 

 

Further reading: Spam emails received by [email protected]. Please note that the quotations come from the emails and I left the original spelling intact.