Digital scholarship blog

Enabling innovative research with British Library digital collections

26 March 2019

BL Labs Staff Award Runners Up: 'The Digital Documents Harvester'

This guest blog is by Jennie Grimshaw on behalf of her team, who were the BL Labs Staff Award runners-up for 2018.

Harvest Haystack uk

The UK Legal Deposit Web Archive (LDWA) contains terabytes of data harvested from the UK web domain. It has a public search interface at https://webarchive.org.uk/, but finding individual documents in what is in effect a vast unstructured dataset is challenging. The analogy of looking for a needle in a haystack is entirely appropriate.

The Digital Documents Harvesting and Processing Tool (DDHAPT) was designed to overcome the problem of finding individual known documents in the LDWA. It is an adaptation of the web archiving software that enables selectors to set up regular in-depth crawls of target, document-heavy websites. The system then extracts new PDFs published since its previous visit from the target websites and presents them to the selector in a list with the most recent at the top:

DDH image 1
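
For readers who like to see the idea in code, the "new since the last visit, newest first" step can be sketched roughly as follows. This is only an illustration and not the DDHAPT's actual implementation: the `HarvestedPdf` class, its fields, the `new_documents` function and the example.org URLs are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Set


@dataclass(frozen=True)
class HarvestedPdf:
    """Hypothetical stand-in for one PDF found during a crawl."""
    url: str
    title: str
    first_seen: date


def new_documents(current_crawl: Iterable[HarvestedPdf],
                  previously_seen_urls: Set[str]) -> List[HarvestedPdf]:
    """Return PDFs not seen on earlier visits, most recent first."""
    fresh = [
        doc for doc in current_crawl
        if doc.url.lower().endswith(".pdf")
        and doc.url not in previously_seen_urls
    ]
    # Newest documents go to the top of the selector's list.
    return sorted(fresh, key=lambda doc: doc.first_seen, reverse=True)


# Illustrative data only; these URLs and titles are made up.
seen = {"https://example.org/reports/old.pdf"}
crawl = [
    HarvestedPdf("https://example.org/reports/old.pdf", "Old report", date(2018, 1, 10)),
    HarvestedPdf("https://example.org/reports/a.pdf", "Report A", date(2018, 3, 1)),
    HarvestedPdf("https://example.org/reports/b.pdf", "Report B", date(2018, 3, 5)),
]
for doc in new_documents(crawl, seen):
    print(doc.first_seen, doc.title)
```

The sketch assumes that a document is identified by its URL and that the crawl records the date each document was first seen; a real harvester may well identify and date documents differently.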

The selector can then view an image of the document on the screen by clicking on the title. If the document is in scope, basic metadata is created by completing an on-screen form. If the document doesn’t make the grade for the creation of an individual record, it can be removed from the list of new documents for selection by clicking on the green Ignore button on the right of the screen.

The metadata we create records the title and subtitle, publication year and publisher, edition, series, personal and corporate authors and ISBN (if present). Some fields such as title, publication year and publisher are automatically populated. A broad subject heading is assigned from a pick list. Our aim is to create a “good enough” record that can stand without upgrading by the digital cataloguers, avoiding double handling.

DDH image 2

To save time and avoid transcription errors, the system allows the selector to highlight information in the document such as personal author, publisher, series title or ISBN. Releasing the mouse button then calls up a list of fields. Clicking on the appropriate field automatically transfers the data into it.

DDH image 3
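
As a rough illustration of the record and the highlight-to-field step described above, here is a minimal sketch. The field list follows the metadata named in this post, but the `DocumentRecord` class, the `apply_highlight` function and the way repeated authors are stored are assumptions made for the example, not a description of how the DDHAPT stores its data.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class DocumentRecord:
    """Hypothetical record holding the metadata fields named in the post."""
    title: str = ""
    subtitle: str = ""
    publication_year: Optional[int] = None
    publisher: str = ""
    edition: str = ""
    series: str = ""
    personal_authors: List[str] = field(default_factory=list)
    corporate_authors: List[str] = field(default_factory=list)
    isbn: str = ""
    subject_heading: str = ""


def apply_highlight(record: DocumentRecord, field_name: str,
                    highlighted_text: str) -> DocumentRecord:
    """Copy text highlighted in the document into the chosen field."""
    if field_name not in {f.name for f in fields(record)}:
        raise ValueError(f"Unknown field: {field_name}")
    # Collapse line breaks and runs of spaces picked up from the PDF.
    text = " ".join(highlighted_text.split())
    current = getattr(record, field_name)
    if isinstance(current, list):
        current.append(text)          # author fields can hold several names
    elif field_name == "publication_year":
        record.publication_year = int(text)
    else:
        setattr(record, field_name, text)
    return record


# Illustrative values only.
record = DocumentRecord(title="An example report", publication_year=2018)
apply_highlight(record, "personal_authors", "A. N.\n  Author")
apply_highlight(record, "publisher", "Example Department")
print(record)
```

Copying the highlighted text directly, with only whitespace tidied, mirrors the aim stated above of avoiding retyping and the transcription errors that come with it.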

Once the metadata has been created, the selector clicks on a submit button which starts the process of loading it into the British Library catalogue and the catalogues of the other five legal deposit libraries – the national libraries of Scotland and Wales, the Universities of Oxford and Cambridge, and Trinity College Dublin. The document remains in the Legal Deposit Web Archive. Its URL in the web archive is recorded in the metadata and creates the link between the document and its catalogue record. Readers who find the record in the British Library’s public catalogue or those of any of the legal deposit libraries can then click on the “I want this” button and view the document on screen.

The DDHAPT is currently being used to monitor the publications of Westminster government departments and help us ensure that future generations of researchers can reliably access known official documents via the catalogues of the six legal deposit libraries. However, we intend to extend its use to cover the output of other non-commercial publishers such as campaigning charities, think tanks, academic research centres, and pressure groups as a way of making their archived publications easily discoverable.

Normally, material collected under the non-print legal deposit regulations can by law only be viewed on the premises of one of the six legal deposit libraries. However, the Libraries have negotiated licences with the UK government and many other non-commercial online publishers that allow us to make their archived websites and the documents on them open and available remotely. These licences lift non-print legal deposit restrictions and allow us to make the documents covered by them available 24/7 from anywhere in the world.

In these ways the DDHAPT improves the discoverability of non-commercially published documents collected under non-print legal deposit, facilitates metadata creation through auto-population of some fields, and avoids double handling through creation of good quality metadata at the point of selection.

Watch the Digital Documents Harvester team receiving their award and talking about their project on our YouTube channel (clip runs from 8.15 to 14.45):

Find out more about Digital Scholarship and BL Labs. If you have a project which uses British Library digital content in innovative and interesting ways, consider applying for an award this year! The 2019 BL Labs Symposium will take place on Monday 11 November at the British Library.
