This sounds like a simple question. Ten seconds on most sites will tell a human viewer where the site originates, and for the rest a little digging will eventually produce the answer. But under Non-Print Legal Deposit, we need a scalable way of settling the question without human intervention. Our remit under the new regulations extends to sites that are issued from a .uk or other UK geographic top-level domain, or where part of the publishing process takes place in the UK. (See the regulations here, and a summary here.)
We estimate that there are just short of five million sites that end in .uk. That suffix is a simple, unambiguous and machine-readable sign that a site originates from within the UK and so is covered by the remit we now have. However, not all UK sites end in .uk. Many .com, .org and other sites are in fact published from within the UK, and there are few reliable figures for how many. And so to identify which of these fall within the scope of the regulations, we need other methods.
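The suffix test is the one part of the problem that is easy to automate. A minimal sketch in Python follows; the function name `has_uk_tld` and the `UK_TLDS` set are illustrative assumptions, and the set lists only .uk, so any other UK geographic top-level domain covered by the regulations would need to be added.

```python
from urllib.parse import urlsplit

# Assumption: only "uk" is listed here. Other UK geographic top-level
# domains within the remit would have to be added to this set.
UK_TLDS = frozenset({"uk"})


def has_uk_tld(url, uk_tlds=UK_TLDS):
    """Return True if the hostname in `url` ends in one of `uk_tlds`."""
    # urlsplit only recognises the host when a scheme or "//" is present,
    # so bare hostnames such as "www.bl.uk" are given a "//" prefix first.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = urlsplit(url).hostname or ""  # .hostname is already lower-cased
    # Drop a trailing root dot, then keep only the last label.
    tld = host.rstrip(".").rpartition(".")[2]
    return tld in uk_tlds
```

Only the hostname is examined, so "uk" appearing in a path or as a subdomain of a .com or .org address does not count.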
One such method is to find out where the site is hosted. www.geoiptool.com provides information on where a server is located, although it is difficult to attain 100% accuracy. Another way is to look at where the domain name is registered, using a service such as www.whois.net. However, in many cases domains are registered by one company on behalf of another company or of an individual, perhaps because the real owner wants their contact details to remain private. There also isn't (yet) a straightforward way of querying any of these services at scale for thousands or indeed millions of sites.
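To illustrate what querying at scale might involve, here is a rough sketch that resolves many hostnames in parallel and passes each IP address to a geolocation lookup. The names `resolve` and `hosted_in` are hypothetical, and `country_of_ip` is deliberately left as a caller-supplied function: it stands in for whatever geolocation source is available and is not the API of any of the services named above.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve(host):
    """Return an IPv4 address for `host`, or None if DNS resolution fails."""
    try:
        return socket.gethostbyname(host)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def hosted_in(hosts, country_of_ip, resolver=resolve, workers=16):
    """Map each host to the country reported for its server, or None.

    `country_of_ip` is an assumption: any callable that takes an IP
    address string and returns a country code (or None if unknown).
    """
    hosts = list(hosts)  # iterated twice below

    def lookup(host):
        ip = resolver(host)
        return country_of_ip(ip) if ip else None

    # Lookups are network-bound, so a thread pool lets many run at once.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(hosts, pool.map(lookup, hosts)))
```

The result says only where a server sits, which is not necessarily where the site is published, so it is best treated as one signal among several.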
There may be sites about which we have direct knowledge, from the site owner, that their .com domain is operated from within the UK, but that could only ever be true of a tiny proportion of sites. And so after all these possibilities are exhausted, the next step is to make judgements based on the presentation of the site itself. But what in a site is "enough"? A postal address in a Contact Us page is a possibility; so is a UK-domain email address (for those sites whose owners don't use anything as twentieth century as the post).
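Both of those clues can be searched for mechanically in the text of a page. The sketch below is an assumption-laden illustration rather than a tested classifier: the function name `uk_signals` is hypothetical, the postcode pattern is a simplified approximation of the UK format, and a match is evidence for a human or a later rule to weigh, not proof of origin.

```python
import re

# Simplified approximation of a UK postcode, e.g. "NW1 2DB" or "SW1A 1AA".
# It will miss some valid codes and accept some invalid ones.
POSTCODE = re.compile(r"\b[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}\b")

# An email address whose domain ends in ".uk". The lookahead rejects
# addresses where "uk" is only an inner label, as in "name@example.uk.com".
UK_EMAIL = re.compile(
    r"\b[\w.+-]+@(?:[\w-]+\.)+uk(?![\w-]|\.\w)", re.IGNORECASE
)


def uk_signals(text):
    """Return the postcode-like strings and .uk email addresses in `text`."""
    return {
        # Postcodes are matched against an upper-cased copy of the text.
        "postcodes": POSTCODE.findall(text.upper()),
        "uk_emails": UK_EMAIL.findall(text),
    }
```

How many such signals count as "enough" remains a policy question that the code cannot answer.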
What if a site doesn't disclose the information we might like, but is self-evidently from the UK (once you look at the content)? One example is Conservative Home, a prominent political site, which nowhere explicitly states that it is published in the UK. This is a particular issue for blogs, which are often hosted on a platform service such as WordPress (which is based in San Antonio, Texas) but would be thought by most to be "published" from wherever the author is based. There are similar issues in determining which parts of social media sites such as Twitter or Facebook should be treated as published from within the UK.
All of this of course supposes that all website owners tell the truth about where they are based. There may be cases where a site is published in another country but purports to be from the UK, perhaps to protect the author from a repressive regime. Conversely, an owner might, for reasons which are hard to predict, wish a site that is published within the UK not to appear so.
It's early days for Non-Print Legal Deposit, and some of these issues will become clearer as we gain more experience with just these sorts of difficult questions.
[Map reproduced courtesy of Showeet.com, under a Creative Commons Attribution-NoDerivs 3.0 licence.]
Peter Webster, Web Archiving Engagement and Liaison Manager