20 May 2020
Bringing Metadata & Full-text Together
In Searching eTheses for the openVirus project we put together a basic system for searching theses. This only used the information from the PDFs themselves, which meant the results looked like this:
The basics are working fine, but the document titles are largely meaningless, the last-modified dates are clearly suspect (26 theses in the year 1600?!), and the facets aren’t terribly useful.
The EThOS metadata has much richer information that the EThOS team has collected and verified over the years. This includes:
- DOI, ISNI, ORCID
- Dewey Decimal Classification
- EThOS Service URL
- Repository (‘Landing Page’) URL
So, the question is, how do we integrate these two sets of data into a single system?
Linking on URLs
The EThOS team supplied the PDF download URLs for each record, but we need a common identifier to merge these two datasets. Fortunately, both datasets contain the EThOS Service URL, which looks like this:
This (or just the uk.bl.ethos.755301 part) can be used as the ‘key’ for the merge, leaving us with one data set that contains the download URLs alongside all the other fields. We can then process the text from each PDF, and look up the URL in this metadata dataset, and merge the two together in the same way.
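As a rough illustration of that merge, here is a minimal sketch in Python. The field names (ethos_url, pdf_url, title) and the example.org URLs are placeholders invented for this example, not the real EThOS export format; the only detail taken from the data is the shape of the uk.bl.ethos.NNNNNN identifier.

```python
import re

# Matches identifiers of the form uk.bl.ethos.755301 anywhere in a URL.
ETHOS_ID = re.compile(r"uk\.bl\.ethos\.\d+")


def ethos_key(service_url):
    """Return the EThOS identifier embedded in a service URL, or None."""
    match = ETHOS_ID.search(service_url)
    return match.group(0) if match else None


def merge_on_key(metadata_rows, download_rows):
    """Merge two lists of dicts that share an EThOS service URL.

    The field names used here are hypothetical.
    """
    by_key = {}
    for row in metadata_rows:
        key = ethos_key(row["ethos_url"])
        if key:
            by_key[key] = dict(row)
    for row in download_rows:
        key = ethos_key(row["ethos_url"])
        if key in by_key:
            # Attach the download URL to the matching metadata record.
            by_key[key]["pdf_url"] = row["pdf_url"]
    return by_key


# Placeholder records, for illustration only.
metadata = [{"ethos_url": "https://example.org/record?uin=uk.bl.ethos.755301",
             "title": "An example thesis"}]
downloads = [{"ethos_url": "https://example.org/record?uin=uk.bl.ethos.755301",
              "pdf_url": "http://example.org/thesis.pdf"}]
merged = merge_on_key(metadata, downloads)
```

Keying on the extracted identifier rather than the whole URL means small differences in how the service URL is written (scheme, host, query layout) should not break the match.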
Except… it doesn’t work.
The web is a messy place: those PDF URLs may have been direct downloads in the past, but now many of them are no longer simple links, but chains of redirects. As an example, this original download URL:
Now redirects (HTTP 301 Moved Permanently) to the HTTPS version:
Which then redirects (HTTP 302 Found) to the actual PDF file:
So, to bring this all together, we have to trace these links between the EThOS records and the actual PDF documents.
Re-tracing Our Steps
While the crawler we built to download these PDFs worked well enough, it isn’t quite as sophisticated as our main crawler, which is based on Heritrix 3. In particular, Heritrix offers detailed crawl logs that can be used to trace crawler activity. This functionality would be fairly easy to add to Scrapy, but that’s not been done yet. So, another approach is needed.
To trace the crawl, we need to be able to look up URLs and then analyse what happened. In particular, for every starting URL (a.k.a. seed) we want to check if it was a redirect and if so, follow that URL to see where it leads.
We already use content (CDX) indexes to allow us to look up URLs when accessing content. In particular, we use OutbackCDX as the index, and then the pywb playback system to retrieve and access the records and see what happened. So one option is to spin up a separate playback system and query that to work out where the links go.
However, as we only want to trace redirects, we can do something a little simpler. We can use the OutbackCDX service to look up what we got for each URL, and use the same warcio library that pywb uses to read the WARC record and find any redirects. The same process can then be repeated with the resulting URL, until all the chains of redirects have been followed.
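The chain-following part of this can be sketched without any crawl infrastructure. In the version below, the lookup argument is a stand-in for the real step (querying OutbackCDX for the URL, then reading the status code and Location header from the WARC record with warcio); here it is just a dictionary of made-up example.org URLs, so the function names and data are assumptions for illustration rather than the code actually used.

```python
from urllib.parse import urljoin

# HTTP status codes that indicate a redirect.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def trace_redirects(seed, lookup, max_hops=10):
    """Follow redirects from a seed URL using previously crawled responses.

    lookup(url) is a hypothetical callable returning (status, location)
    for a crawled URL, or None if the URL was never captured.
    Returns the list of URLs visited, starting with the seed.
    """
    chain = [seed]
    url = seed
    for _ in range(max_hops):
        result = lookup(url)
        if result is None:
            break  # nothing captured for this URL
        status, location = result
        if status not in REDIRECT_CODES or not location:
            break  # reached a non-redirect response
        # Location headers may be relative, so resolve against the current URL.
        url = urljoin(url, location)
        if url in chain:
            break  # redirect loop
        chain.append(url)
    return chain


# Made-up crawl results mirroring the 301-then-302 pattern described above.
crawled = {
    "http://example.org/thesis.pdf": (301, "https://example.org/thesis.pdf"),
    "https://example.org/thesis.pdf": (302, "/files/thesis-final.pdf"),
    "https://example.org/files/thesis-final.pdf": (200, None),
}
chain = trace_redirects("http://example.org/thesis.pdf", crawled.get)
```

The hop limit and the loop check are defensive additions: redirect chains found on the live web are not guaranteed to terminate.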
This leaves us with a large list, linking every URL we crawled back to the original PDF URL. This can then be used to link each item to the corresponding EThOS record.
This large look-up table allowed the full-text and metadata to be combined. It was then imported into a new Solr index that replaced the original service, augmenting the records with the new metadata.
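To make the combining step concrete, here is a small sketch under some assumptions: each redirect chain is available as a list of URLs starting with the seed, the full-text records are dicts with url and content fields, and the metadata is keyed by the original PDF download URL. All of these names and the example.org data are hypothetical.

```python
def build_lookup(chains):
    """Map every URL in each redirect chain back to its seed URL."""
    lookup = {}
    for chain in chains:
        seed = chain[0]
        for url in chain:
            # Keep the first seed seen if two chains share a URL.
            lookup.setdefault(url, seed)
    return lookup


def combine(fulltext_docs, lookup, metadata_by_pdf_url):
    """Attach metadata to each full-text record via the look-up table.

    Returns the combined records plus the URLs that could not be matched.
    """
    combined, unmatched = [], []
    for doc in fulltext_docs:
        seed = lookup.get(doc["url"])
        record = metadata_by_pdf_url.get(seed)
        if record is None:
            unmatched.append(doc["url"])
            continue
        combined.append({**record, "content": doc["content"]})
    return combined, unmatched


# Placeholder data, for illustration only.
chains = [["http://example.org/thesis.pdf",
           "https://example.org/files/thesis-final.pdf"]]
metadata_by_pdf_url = {"http://example.org/thesis.pdf": {"title": "An example thesis"}}
docs = [{"url": "https://example.org/files/thesis-final.pdf", "content": "Full text..."},
        {"url": "https://example.org/orphan.pdf", "content": "No metadata..."}]
combined, unmatched = combine(docs, build_lookup(chains), metadata_by_pdf_url)
```

Collecting the unmatched URLs separately, rather than dropping them silently, gives a starting point for investigating any gaps in the merged set.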
Updating the Interface
The new fields are accessible via the same API as before – see this simple search as an example.
The next step was to update the UI to take advantage of these fields. This was relatively simple, as it mostly involved exchanging one field name for another (e.g. from
year_i), and adding a few links to take advantage of the fact we now have access to the URLs to the EThOS records and the landing pages.
The result can be seen at:
This new service provides a much better interface to the collection, and really demonstrates the benefits of combining machine-generated and manually curated metadata.
There are still some issues with the source data that need to be resolved at some point. In particular, there are now only 88,082 records, which indicates that some gaps and mismatches emerged during the process of merging these records together.
But it’s good enough for now.
The next question is: how do we integrate this into the openVirus workflow?