20 July 2016
Dealing with Optical Character Recognition errors in Victorian newspapers
Have you browsed through the British Library’s Collection of Nineteenth Century Newspapers? Then you have probably searched for a word in an article, only to find that some instances of that word were highlighted and others were not. In the following article (from the 24th August 1833 edition of the Leeds Mercury), for example, searching for ‘Magistrates’ (without ‘fuzzy search’) highlights one instance in the second paragraph but misses the instance in the first paragraph.
That’s because what you see is a picture of the original source, and you (as a human) are able to read it. But the search engine is searching through OCR output – text generated by Optical Character Recognition (OCR) software which tries to guess what characters are represented on an image. The OCR output for the passage above actually looks like this:
COUNTY RATE tvtaN s s fl s Loud complaintst have been madc and we believe jstly of the unequal pressure of the County Rate ripon the differenrt townships and parishes of and it has In consequence been deter inmosl to make a general survey and to establisB a new scale of ment To this the trading and tnanufacturing interests of the Riding do not object tiorgfl tile effect will doubtless be to advance their assessmcnts in coparlison with those of the agricultural parhitras But we confess that it wa with setrprise we heard that any of the Mogistrates in holding their Courts for the assessment of the respective townships had reated them into secret tribunals and that they lad excluded from their sittings thoso wlto are mainly interested in ascertaining the principles which goreen the raluation of propertt and the full and fair develtpmemnt of which can alone rcuider the decislons of their Courts either satisfactory or permaneent The frank and manly example set by tire township of Leeds dorg h0onour to tbe parish officers and we must say wIthout wishling to give offence to those for swhoimt we feel nothing but respect that the line of conduct r sued by ithe Magistrates at Bradford on Btoaday last in excludintgi a parist officer from their Court swhen they knew that he was tire organ of tie towvnship hltich contributes most targely to this impost il the ltole Riding and when lie lasi explained to them in latigniagr srfaitiently courteous anid respectful that lie sotght only rltv crlsis of public jusrice requires a anuch ittore satisfnectory explanation than toas either given on Lhat tccasion or than ee apprehendl con be give n for adopting one of the roost objectionrble characteristics of the Court of the Holy lrquisition
You can read a lot of it, but there are errors, including the first occurrence of ‘Magistrates’, which is spelt ‘Mogistrates’. An exact search for ‘Magistrates’ therefore cannot find it.
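To make the difference between exact and fuzzy matching concrete, here is a minimal sketch in Python. It is only an illustration: the two excerpts are copied from the OCR output quoted above, but the function names, the use of Python's `difflib` module and the similarity cutoff of 0.8 are assumptions made for this example, and say nothing about how the newspaper collection's own search engine or its 'fuzzy search' option actually works.

```python
import difflib

# Two short excerpts of the OCR output quoted above, copied with their errors.
ocr_text = (
    "we heard that any of the Mogistrates in holding their Courts "
    "sued by ithe Magistrates at Bradford on Btoaday last"
)


def exact_hits(query, text):
    """Return the tokens that are identical to the query."""
    return [token for token in text.split() if token == query]


def fuzzy_hits(query, text, cutoff=0.8):
    """Return the tokens whose similarity ratio to the query is at least `cutoff`.

    The cutoff of 0.8 is an arbitrary value chosen for this illustration.
    """
    return [
        token
        for token in text.split()
        if difflib.SequenceMatcher(None, query, token).ratio() >= cutoff
    ]


print("exact:", exact_hits("Magistrates", ocr_text))
print("fuzzy:", fuzzy_hits("Magistrates", ocr_text))
```

The exact search returns only the correctly recognised ‘Magistrates’, while the fuzzy search also returns ‘Mogistrates’, because the two strings differ by a single character.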
Guessing what characters are in an image is not an easy task for computers, especially when the images are of historical newspapers which can be in varying states of conservation, and often contain complex layouts with columns, illustrations and different font types and sizes all on the same page.
So, how much of a problem is this, and can the errors be corrected?
This is what I have been investigating for my PhD project, as part of the Spatial Humanities project and in association with the Centre for Corpus Approaches to the Social Sciences.
In a nutshell: it’s not very easy to correct OCR errors automatically, because errors can be very dissimilar to their correct form. In the passage above, for example, the phrase ‘language sufficiently courteous’ has become ‘latigniagr srfaitiently courteous’ in the OCR output. Normalization software (like spell-checkers) often assumes that errors and their corrections will have many letters in common (as if they were playing a game of anagrams), but as this example shows, that assumption is often incorrect. So how can OCR errors be corrected? One state-of-the-art commercial software package I tested, Overproof, uses a technique the designers call ‘reverse OCR’: basically, they compare images of correct words to the image of the source! It is a simple-sounding idea which turns out to work well; you can read more about it in ‘Correcting noisy OCR: context beats confusion’ (login may be required).
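One common way to measure how far an error is from its correct form is edit distance: the number of single-character insertions, deletions and substitutions needed to turn one string into the other. The sketch below is a plain Levenshtein distance written for this illustration; it is an assumption that this is a reasonable stand-in for the 'letters in common' idea, and it is not a description of how any particular spell-checker, or Overproof, works.

```python
def levenshtein(a, b):
    """Count the single-character edits needed to turn string a into string b."""
    # previous[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]


# Correct forms paired with the OCR forms from the passage above.
pairs = [
    ("Magistrates", "Mogistrates"),
    ("language", "latigniagr"),
    ("sufficiently", "srfaitiently"),
]

for correct, ocr in pairs:
    print(f"{correct} -> {ocr}: {levenshtein(correct, ocr)} edit(s)")
```

‘Mogistrates’ is a single edit away from ‘Magistrates’, which is the kind of error a spell-checker handles well. The other two pairs need several edits each, and with that many changes a tool that ranks candidates by shared letters has much less to go on.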
And how much of a problem are the errors? Well, it depends what you are using the texts for. Leaving aside the question of using search engines, and their 'traps for the unwary', if you are interested in analysing patterns of discourse in texts, the main problem you will face is that the errors are not distributed evenly throughout the texts. This makes it difficult to predict how the errors might affect the retrieval of a particular word or phrase you are interested in. But if you follow some common-sense advice, you can stay on safe ground:
- Don’t over-interpret absences. (In OCR’ed texts, something which is missing may simply be something which is irretrievable because it is affected by OCR errors.)
- Focus on patterns for which you can find many different examples: ‘real-word errors’ (errors which happen to coincide with a word which actually exists, such as ‘Prussia’ which becomes ‘Russia’ when the OCR misses out the ‘P’) do exist, but they do not normally occur very often. Keep an eye out for them, but if you form a hypothesis on the basis of many examples, you are on safe ground!
In conclusion, digitized historical texts may suffer from OCR errors. It is important to be aware of the issue, but do not let this hold you back from using such sources in your research – following some simple rules of thumb (such as not placing too much emphasis on absences and focussing on patterns for which there are many different examples) can keep you on safe ground.