12 March 2019
How do I love thee? Let me %>% group_by (ways) %>% count()
Counting is very simple. We’ve been doing it for 50,000 years. One of the first things we learn as children is how to count: before, or at the same time as, we learn the alphabet, we learn to count to ten. First we learn to count on our fingers, perhaps next we count on an abacus. Eventually we graduate to counting on a calculator or a computer. Computers are very good at it, too, which is useful. Give a computer some text, and it can really quickly count lots of things for you: things like the total number of words, the total number of characters or the number of unique words. Counting helps us do lots of useful things. Counting can help us to break codes or compress data. Samuel Morse counted the average frequency of letters in the English language and assigned the most frequent ones to shorter dot-dash combinations. Your computer is doing much the same thing when it zips or unzips a file.
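To make those three counts concrete, here is a minimal sketch in Python. The function name `basic_counts` and the rule for what counts as a word (runs of letters and apostrophes, lowercased) are our own assumptions for illustration; a different rule would give slightly different numbers.

```python
import re


def basic_counts(text):
    """Return total words, total characters and unique words for a text."""
    # Lowercase first so that "The" and "the" count as the same word.
    # Assumption: a "word" is any run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "words": len(words),
        "characters": len(text),  # includes spaces and punctuation
        "unique_words": len(set(words)),
    }


print(basic_counts("How do I love thee? Let me count the ways."))
```

Even in something this small, the decision about what counts as a word has already been made for us by one line of code.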
Corpus analysis is the study of lots and lots of words of a particular type. Google N-Gram browser finds words or short phrases in millions of digitised books. EEBO N-Gram browser does the same for millions of transcribed texts from the 17th and 18th centuries. At this scale, simple counting becomes really powerful. Using these tools, researchers can count the frequency of words, which can be the starting point for understanding how words were used and how ideas gained or lost momentum over time. These tools count the relative frequency of words: how unusual is it to have this word here? Are there many more instances of a word than one would expect from its usual frequency? Simply counting can tell us the importance of terms, ideas and concepts in particular texts, or at particular times.
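Relative frequency is just a count divided by another count. The sketch below shows the idea, not how either N-Gram browser actually works: the function names and the tiny token lists are invented for illustration.

```python
from collections import Counter


def relative_frequency(word, tokens):
    """Occurrences of `word` per million tokens."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens) * 1_000_000


def over_representation(word, tokens, reference_tokens):
    """How many times more frequent `word` is in `tokens` than in a reference."""
    expected = relative_frequency(word, reference_tokens)
    if expected == 0:
        return None  # the word never appears in the reference, so no ratio
    return relative_frequency(word, tokens) / expected


# Invented toy data: one text and a reference collection to compare it with.
text = ["steam", "horse", "steam", "engine"]
reference = ["steam", "horse", "horse", "horse"]
print(over_representation("steam", text, reference))
```

A ratio well above 1 suggests the word turns up more often than its usual frequency would lead us to expect.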
We can divide things up and then count them: How many times did a particular phrase appear in a particular location? At a particular time? In a particular title?
We can count counts: How many titles were printed in a particular year, and how many words did each of those titles contain?
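Dividing things up and then counting them is the group-and-count pattern the title of this post is joking about. A rough Python equivalent, using made-up records (the years, titles and word totals below are invented, not real data):

```python
from collections import Counter, defaultdict

# Hypothetical records: (year, title, number of words in one issue).
issues = [
    (1821, "Title A", 1200),
    (1821, "Title B", 800),
    (1822, "Title A", 1500),
    (1822, "Title A", 500),
]

# Group by year, then count the distinct titles in each group.
titles_by_year = defaultdict(set)
for year, title, _ in issues:
    titles_by_year[year].add(title)
titles_per_year = {year: len(titles) for year, titles in titles_by_year.items()}

# Group by year and title, then add up the words in each group.
words_per_title = Counter()
for year, title, n_words in issues:
    words_per_title[(year, title)] += n_words

print(titles_per_year)
print(dict(words_per_title))
```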
What else can we count? How about whole documents: how many newspapers were printed in the 19th century? How many titles? How many times was the word ‘Gladstone’ mentioned, vs ‘Disraeli’? Did mentions of ‘steam’ overtake mentions of ‘horse’? Counting can be a blunt tool, but it’s a starting point.
A couple of crude word searches using millions of pages of text from selected 19th century British newspapers
To take a concrete example: let’s do some counting on a single issue of one of the newspapers we’re digitising as part of our Heritage Made Digital project. We’ve taken the text of this issue and uploaded it to a web app called Voyant Tools. Voyant Tools takes text files and gives statistics and visualisations of the words within. What are the counts in this issue? This single issue has 29,734 words. It has 7,793 unique words, which could tell us something about the type of audience, or the ‘footprint’ of the author or title. What are the most common words?
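A list of most common words can be approximated in a few lines of Python. This is a sketch of the general idea rather than a description of how Voyant Tools does it: the tiny stop-word list and the sample sentence are our own assumptions.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stop-word list, for illustration only.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "at"}


def most_common_words(text, n=5, stop_words=STOP_WORDS):
    """Return the n most frequent words that are not stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)


# Invented sample sentence, not taken from the newspaper itself.
print(most_common_words("Mr Smith of High Street and Mr Jones of the Street", n=2))

# The share of unique words in the issue described above: roughly a quarter.
unique_share = 7_793 / 29_734
print(round(unique_share, 2))
```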
Let’s quickly think about some of these words and their implications.
Mr tells us that news is, unsurprisingly, biased towards reporting on one gender.
Street, house and place are intriguing, if not surprising. News is so much about space and place. Without a sense of time, news ceases, really, to be news. Perhaps the same can be said about news and space?
Which leads into the next word: Jan (the abbreviated version of January). This is a newspaper from 6 January 1821. This, alongside Dec (December shortened), tells us something about the age of news. Would you expect more or fewer mentions of December once news is transmitted via telegraph? There’s also day and time. It’s unlikely these words would be so common in, say, a novel, or a scientific paper. Can counting tell us something about genre?
We can count the counts: Can the words be divided into categories and counted?
What does this tell us? Well, it probably tells us more about the makeup of each individual page than anything else. We could probably guess the front page by looking at its unique words. The front page was often mostly advertisements, whose contact details would include words like street and Mr. It also confirms our belief that news is about information in space and time: clearly there’s a focus on place, time and people, in a way that would presumably not be so apparent in, say, a novel. If we counted the change in common words over time, we could get a picture of the changing makeup of the front page, as it moved from advertisements to headline news.
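Counting by category only needs a lookup from word to category. In the sketch below the mapping is our own hypothetical one, built from the words discussed above; deciding which word belongs where is a judgement call, not something the computer can settle.

```python
from collections import Counter

# Hypothetical word-to-category mapping, using words mentioned in this post.
CATEGORIES = {
    "street": "place", "house": "place", "place": "place",
    "jan": "time", "dec": "time", "day": "time", "time": "time",
    "mr": "people",
}


def count_categories(words):
    """Count how many words fall into each category, ignoring the rest."""
    return Counter(CATEGORIES[w] for w in words if w in CATEGORIES)


# Invented word list standing in for the text of one page.
page_words = ["mr", "street", "jan", "house", "the", "mr"]
print(count_categories(page_words))
```

Running the same function over each page, or over the same page in different years, would give the kind of comparison described above.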
Counting is a most natural human urge and one that can have very interesting outcomes. It’s a start for all sorts of interesting research: a way to make all sorts of (often wrong) assumptions. Because counting is dangerous. It attempts to put numbers on things that may not be enumerable. We may find our attempts at counting frustrated by the stubborn fuzziness of the world, stymied by our need to put order on disorder. Over the coming months we hope to show some of the interesting things that can be done with the millions of pages being digitised by Heritage Made Digital, and lots of this research will involve, at its core, counting.
In digital scholarship, it sometimes feels like there is a move away from counting to produce results. Machine learning seems at a great distance from a chart of the most commonly used words in a bunch of text. But machine learning still often takes a simple count as its raw material. The ‘features’ (the attributes of things we feed machine learning algorithms to make predictions about those things) are often elements like the total count of words in a particular document, or the count of unique words. No matter how sophisticated these methods get, they still, in the end, rely on counting.
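As a rough illustration of counts as features, the sketch below turns a document into a list of numbers: total words, unique words, and a count for each word in a chosen vocabulary. The function name, the vocabulary and the sample tokens are all assumptions made for this example; real pipelines vary a great deal.

```python
from collections import Counter


def count_features(tokens, vocabulary):
    """Build a simple numeric feature list for one document from counts."""
    counts = Counter(tokens)
    # First two features: total words and unique words.
    # Remaining features: how often each vocabulary word appears.
    return [len(tokens), len(counts)] + [counts[word] for word in vocabulary]


# Hypothetical vocabulary and document.
vocabulary = ["horse", "steam", "telegraph"]
print(count_features(["steam", "horse", "steam"], vocabulary))
```

A list like this, one per document, is the sort of raw material a machine learning algorithm might be given.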
Curator, Newspaper Data