Posted by Mahendra Mahey, Manager of BL Labs on behalf of Dr. Melodee Beals, Lecturer in Digital History, Department of Politics, History and International Relations, Loughborough University.
In the wild west of the World Wide Web, if you compose a hilarious joke, provide a simple solution to a complex problem or break a major new story, it is almost certain that your work will be copied. Although intellectual property laws exist, they are inconsistently enforced because of the sheer number of sites where reposting occurs - a number that increases with each passing second. If you are lucky, and your re-poster is honest, you may discover how far your ideas have spread through a pingback, an automatically generated comment on your original blog post with a link to its reprint.
In the nineteenth century, reprinting, especially unauthorised reprinting, was the backbone of Atlantic journalism but, unlike modern bloggers, these authors had no effective means of discovering the fate of their quips or queries, except through chance encounters with competing papers or their readers. Although concerns of commercial losses are long past, this lack of attribution continues to plague researchers working with newspapers. Without a precise date of composition or of original publication, and without a specific or even a corporate author, the provenance of these texts remains frustratingly uncertain.
One solution to this problem is to track reprinting through text-matching. Using plagiarism detection software, we can carefully reconnect different versions appearing in a wide range of publications. Yet, however efficient our text-matching processes become, two major problems remain.
First, text-matching requires machine-readable versions of the articles: electronic texts rather than images. While the sheer number of historical newspapers that have been digitised is impressive, the number that have high-quality, searchable text is deceptively limited. Many community sites have uploaded images of their physical or microfilm archives but do not have the resources to create fully searchable transcriptions. Others, created by state or commercial providers, have relied upon optical character recognition (OCR), the accuracy of which is subject to wild variations. Even when OCR texts are excellent, these represent a considerable investment to providers and often remain locked behind subscription fees.
Reprints within the British Library's 19th Century Newspaper Database, 1818-1819, based on analysis with Copyfind
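For readers curious about what text-matching involves in practice, here is a minimal sketch in Python. It is not how Copyfind works internally; it is a simple stand-in that breaks each article into overlapping five-word sequences and measures how many of them two articles share. The function names, the five-word window and the sample sentences are all invented for illustration.

```python
import re


def shingles(text, n=5):
    """Return the set of overlapping n-word sequences found in text."""
    # Lower-case and keep only letters and digits, so that differences
    # in punctuation and capitalisation do not hide a match.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(a, b, n=5):
    """Share of n-word sequences common to both texts (0.0 to 1.0)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Invented sample sentences, not taken from the newspaper collection.
original = ("The ship Mary arrived at Greenock on Tuesday last "
            "with a cargo of timber from Quebec")
reprint = ("We learn that the ship Mary arrived at Greenock on Tuesday last "
           "with a cargo of timber")
unrelated = ("The price of wheat at the corn exchange rose sharply "
             "during the past week")

print(overlap(original, reprint))
print(overlap(original, unrelated))
```

A high score suggests that one text was copied from the other, or that both drew on a common source. On its own, it cannot say which version came first.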
Thanks to the efforts of public institutions, including the British Library, National Library of Wales, National Library of Australia and the Library of Congress, machine-readable transcriptions for a large number of nineteenth-century newspapers are now available to researchers. But within these collections, a second, more sinister problem arises. No matter how diligently archivists have worked to provide a representative or diverse selection, these digital holdings remain only a slice of the sprawling news network that once existed. Even if we find every single digital copy of a text, how can we know for sure that the original is among them?
It is here that the humble pingback returns to the fore. Whether prompted by the innate honesty of editors or by their desire to establish the authenticity of their materials, a significant minority of newspaper articles contained an attribution. Whether appearing as an introductory dateline or a concluding tagline, these Georgian pingbacks offer tantalising clues as to the true origins of these anonymised texts. Yet, because only a minority of articles contain these attributions, because they can appear in many different forms or locations within the article text and because OCR is frustratingly inconsistent in transcribing italic and gothic typefaces, searching for datelines algorithmically is exceedingly difficult.
A Snippet from the Ipswich Journal, 13 January 1821. Courtesy of the British Library.
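To give a sense of why an algorithmic search struggles, here is a deliberately naive sketch in Python. It assumes a single attribution form, "From the" followed by a capitalised title, which is only one hypothetical pattern among the many that appear in practice; the pattern, the function name and the sample sentences are invented for illustration.

```python
import re

# Assumed pattern: an optional opening bracket, the word "From",
# an optional article, then one or more capitalised words.
ATTRIBUTION = re.compile(
    r"\(?\bFrom (?:the |a )?(?P<source>[A-Z][\w']*(?: [A-Z][\w']*)*)"
)


def find_attribution(article):
    """Return the first source title matching the assumed pattern, or None."""
    match = ATTRIBUTION.search(article)
    return match.group("source") if match else None


# Invented examples: a clean transcription and one with a typical OCR slip.
clean = "(From the Glasgow Herald.) A fire broke out last night."
damaged = "(Frorn the Glasgow Herald.) A fire broke out last night."

print(find_attribution(clean))
print(find_attribution(damaged))
```

A single misread letter is enough to hide the attribution from this pattern, and loosening the pattern to tolerate such slips quickly produces false matches from ordinary sentences that happen to begin with "From".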
That is where the crowd comes in. Although computers can process data very quickly, the human brain is still more adept at finding patterns when the parameters for those patterns are particularly fuzzy. Because of this, it was easier for astronomers to train volunteers to identify dusty debris disks in nebulae than to train computers to do the same thing. And what is true for nebulae is equally true of these Georgian pingbacks. Using thousands of images from the British Library's 19th-Century Newspapers collection, we have created a new site where you can help spot these attributions and provide researchers with what Georgian authors could only dream of: an in-depth understanding of just who was stealing from whom! The site includes an in-depth tutorial on the structure of nineteenth-century newspaper articles as well as three different ways you can help us tag the database. So, whether you have a smartphone and 5 minutes waiting for your train or want to explore the collection in more depth at your home PC, please visit Georgian Pingbacks and try your hand at uncovering a 200-year-old case of plagiarism.
Dr M. H. Beals is a historian of migration and media at Loughborough University. She would like to thank the following undergraduate students at Loughborough University's Department of Politics, History and International Relations for their work on this project: Will Dickinson, Alice Gilbert, Ollie Luhrs, Alex Mackinder, Pooja Makwana, Matthew McCulloch, Jonny Ord, Emily Stanyard and Rebecca Thompson.