Searching for a solution

Chemist and Druggist is a trade journal for pharmacists, but the variety of its content and the abundance of images and advertising from 1859 through to 2010 give it a far wider appeal. This is probably why Chemist and Druggist was our first choice for a journal digitisation project.

As project manager for the digitisation of Chemist and Druggist, I anticipated that the sheer mass of the print copies (150 years’ worth of monthly issues) would cause physical problems for Internet Archive, our partner in the digitisation project. What I hadn’t anticipated were the challenges it would raise after digitisation.

We photographed nearly 7,000 issues containing 535,000 pages of OCR data, all freely available for anyone to view or to download by issue or page. In addition, you can search within a particular issue, either through the Library catalogue or on the Internet Archive website.

The biggest challenge was to enable searching across the entire 150-year run of the journal, which would make it an invaluable search tool for researchers. But the sheer quantity of the Chemist and Druggist OCR data, all linked to the standard single catalogue record for the journal title, exceeded the maximum file size that our Library catalogue’s Java 6 environment could handle. We’re still working on a solution that will allow searching across the whole journal run. Understandably, we’re all frustrated at the delay.

In the meantime I’ve been looking for alternative ways to search the whole journal. I noticed that Google indexes all the OCR data for every work on the Internet Archive website. This means that by creating a custom search for the site (the equivalent of a local Google search box) and limiting it to select only pages with a URL containing our unique identifier for Chemist and Druggist (b19974760), we could have a word search across all the Chemist and Druggist pages on the Internet Archive site. By appending .txt to the search I was able to limit the results to just the full-text OCR.
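For anyone who wants to try the same idea by hand, here is a rough sketch in Python of how such a restricted query might be put together. It is not the custom search tool itself: it assumes that Google's ordinary `site:` and `inurl:` operators can stand in for the custom search engine's URL restrictions, that the pages live under `archive.org`, and the function names `build_query` and `build_search_url` are made up for illustration. Only the identifier b19974760 and the `.txt` restriction come from the description above.

```python
from urllib.parse import urlencode

# Identifier for Chemist and Druggist, as quoted in the post.
JOURNAL_ID = "b19974760"

# Assumption: the OCR text pages are served from this domain.
SITE = "archive.org"


def build_query(phrase: str) -> str:
    """Return a search string limited to the journal's full-text OCR pages.

    The phrase is quoted so it is matched as an exact phrase; the
    inurl: terms mimic the URL restrictions described in the post.
    """
    return f'"{phrase}" site:{SITE} inurl:{JOURNAL_ID} inurl:.txt'


def build_search_url(phrase: str) -> str:
    """Return a Google search URL carrying the restricted query."""
    return "https://www.google.com/search?" + urlencode({"q": build_query(phrase)})


if __name__ == "__main__":
    print(build_query("cod liver oil"))
    print(build_search_url("cod liver oil"))
```

Pasting the printed query into a normal Google search box should behave in roughly the same way, although how strictly Google honours `inurl:` is outside our control.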

From there it was just a case of finding the optimal layout to display as many results as possible – Google limits results to just 10 pages regardless of how many results are on each page. So here’s what my Google custom search tool for Chemist and Druggist looks like:

It’s not pretty and the output is fairly crude, but at least you can identify instances of a keyword or phrase across the whole run of the journal. There are limitations:

You can’t search within a date range
The results don’t link to the digitised page where the text occurs, but to the OCR text for the page (hence the crude output)

What it does give you is the context of your search terms in the full text and details of the issues they were found in. OK, so you then have to go back to the viewer and find the issue to see the original page, but it works! Colleagues who’ve tried it say it’s especially good at tracking down names of people, brands and businesses across the issues and years, or at providing clues for where to start researching a topic.

Why not give it a go and let us know what you think? Your comments and suggestions might just help us solve the catalogue search problem as well.

Author: Damian Nicolau was Project Manager – Digitisation and Collection Management at the Wellcome Library.

22/03/2016
