A good catalogue is transparent: if a library user becomes aware in more than a passing fashion of the technicalities of the catalogue, it is usually a sign that it is not doing its job smoothly. Nonetheless, from time to time it can be interesting to learn what goes on behind the scenes. At this year’s annual conference of the Archives and Records Association in Cardiff – the Association being the U.K.’s professional body for those people working with archive material – a round table on descriptive standards set off discussions about several interrelated topics, clustering around the rise of the machines.
The archive profession first began to use an agreed international standard for its catalogues in the late 1990s – ISAD(G), the International Standard on Archival Description (General). (We wrote about this standard and its impact in summer 2012, here.) Not entirely coincidentally, this was at the time that the internet really took off and made possible both remote access to data and sharing of that data. In the 16 years since ISAD(G) was approved, the way that we access archive catalogues has changed utterly. Formerly, we went to a record office and browsed printed or typed pages. Now we see the information almost exclusively on a screen, as a database or web document: the first aspect of the rise of the machines discussed by the round table. This has serious implications. Because ISAD(G) is built around the fact that documents in an archive are interrelated, it does not require cataloguers to repeat information over and over: typically, an archive catalogue will have a detailed description of the collection as a whole, and of the organisation or body whose papers are being catalogued, at the top- or collection-level: what would, in an old-fashioned paper catalogue, form the general introduction. Descriptions of individual files are typically much more slim-line. However, an on-line database with keyword searching means that we can no longer assume a reader starting at page one and reading (or at least skimming) the whole catalogue: instead, a hit in a search can bring them into the collection anywhere, without context.
Yet context matters in making sense of brief, one-line file descriptions. If, for example, Professor X works in cancer research funded by grants from the major charities, and then much later, in retirement, suffers from cancer herself, a file titled “Macmillan Cancer Care” will imply very different contents depending on where in her career its date places it, but a reader who comes to the record without an overview of her career, an overview of the kind set out in a catalogue’s introduction, will not be able to make the appropriate inferences. How the profession reacts to this new, less sequential way of consuming archive information, and the changes we may have to make to our data, was a key question to emerge in Cardiff.
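One way to picture the problem is to treat each catalogue record as a node that knows its parent, so that a search hit on a file can be shown together with the collection-level context above it. The following is a minimal sketch only: the `Record` class, the `context_trail` function, the reference codes and the series title are all hypothetical, invented for illustration, and do not describe any existing catalogue system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    """One level of a hierarchical catalogue (collection, series or file)."""
    reference: str                     # hypothetical reference code
    title: str
    parent: Optional["Record"] = None  # None at collection level


def context_trail(record: Record) -> List[str]:
    """Return the titles from collection level down to the record itself."""
    trail = []
    current: Optional[Record] = record
    while current is not None:
        trail.append(current.title)
        current = current.parent
    # The loop walked upwards, so reverse to read from the top down.
    return list(reversed(trail))


# Invented example data, following the Professor X scenario above.
collection = Record("PP/X", "Papers of Professor X, cancer researcher")
series = Record("PP/X/A", "Research grant applications", parent=collection)
file_record = Record("PP/X/A/1", "Macmillan Cancer Care", parent=series)

print(" > ".join(context_trail(file_record)))
```

A reader landing on the file-level record through a keyword search would then see at a glance that it sits among grant applications rather than personal medical papers.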
Databases and their online front-ends also enable data to be pooled and seen in many different places. Last century, for instance, when a reader picked up a paper binder entitled “Francis Crick (PP/CRI)” they knew that this was a catalogue of archives held at the Wellcome Library, because they had had to visit the Library and come all the way through to the rare materials room to see it: it was not necessary to spell out what their physical environment had already made plain. Now, however, that information can be accessed anywhere – in a pool of all the Library’s catalogues, and increasingly via a search across many institutions’ resources. Catalogue records need to be much more explicit in their titles about what sort of material this is: for instance, that these are Crick’s own papers, not a file of letters to him in the papers of another scientist, or a printed biography, or a TV documentary about him. We will be doing some global changes on our own archive catalogues to reflect this shortly.
So: catalogues are increasingly seen on machines. The second issue takes this further: catalogues are increasingly seen by machines. Web technology such as Linked Data raises the possibility that catalogues in future will consist not of records for individual items, hidden from the wider world of the internet by a search screen whose white box can be either a gateway or, if one does not know the correct term to enter, a barrier. Instead they may consist of a web of interconnected pages and facts, into which one flows from the world beyond the Library by surfing exactly as one would any other web page.
Excitingly, much of this can be made to happen by automated processes. For machines to interrogate and consume our data, however, that data has to be much more precisely defined: we need to know exactly what we are saying about something in a particular database field, and to be confident that we use that field always to say the same thing.
This is not always as simple as it sounds. Take the “Date” field in an archive catalogue; on the face of it, a simple concept, the point in time at which the object or objects were created, the analogue to a book’s publication date (but, with archive material, potentially fuzzier – we might say “1950s” or “early 17th century”). However, what is this actually the date of? Most young archivists starting to catalogue soon come across what we might call “the photocopy problem”: a 19th century title deed, say, but held in the form of a 1970s photocopy. What does one put in the “Date” field of one’s description? Pragmatism tells us that most people will be interested in the item for the information that it holds rather than looking for examples of late 20th century copying technology, so the young archivist will typically enter the 19th century date (and maybe note elsewhere that this is a later copy). What this has done, however, is to define the “Date” field as “Date of intellectual content” and separate it off from “Date of that content’s particular physical manifestation”. Human beings, reading, can cope with this and interpret what is being said in “Date” accordingly, in the light of a mention elsewhere that this is a copy. A machine is more literal and needs to have this spelled out explicitly. Greater precision in our understanding of what the various ISAD(G) fields are actually saying is also going to be necessary in future.
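The distinction can be made concrete with a small sketch in which the single “Date” field is split in two. The class and field names here (`DescribedItem`, `content_date`, `manifestation_date`) are assumptions made for illustration; they are not ISAD(G) element names, and dates are kept as free text because archival dates are often fuzzy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DescribedItem:
    title: str
    # Date of intellectual content, e.g. "19th century" or "1950s".
    content_date: str
    # Date of this particular physical manifestation; None if it is the original.
    manifestation_date: Optional[str] = None

    def is_later_copy(self) -> bool:
        """True when a separate, differing manifestation date is recorded."""
        return (
            self.manifestation_date is not None
            and self.manifestation_date != self.content_date
        )


# The photocopy problem: 19th century content, 1970s copy.
deed = DescribedItem(
    "Title deed",
    content_date="19th century",
    manifestation_date="1970s",
)
print(deed.is_later_copy())
```

With the two dates held separately, a machine no longer has to infer from a free-text note that the item is a copy.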
Catalogues seen on machines, and seen by machines: the third issue, of course, is that increasingly those catalogues describe information created by machines. Archive standards and practices devised for paper material now describe the born-digital products of people’s computers, sometimes material that never had a life on paper. We need to examine our existing practices and see whether assumptions formed in the era of paper are still valid for born-digital material. Since what we have generally included in a catalogue is a description of the material’s content rather than its physical format (we tell you, for instance, the title of a file, its date and its serial number, but not the colour of the cover) most of our standards are probably still applicable, with a little fine tuning.
One example of the adjustments that will be necessary takes us back to the photocopy problem discussed above. Born-digital material may pass through many versions, migrating from one format to another in an effort to keep it readable: here, then, the date at which the content was created will be different from the date at which this given version came into existence. Again, we have separated “Date of intellectual content” from “Date of that content’s particular physical manifestation”, and will need new fields, both more tightly defined, to hold that information. In many cases that information may itself be given to us by machines – files contain “date created” and “date modified” information automatically – but we will not always be able to rely on it: most of us will have experience of copying old material from one PC to another and seeing, with dismay, that files created today nonetheless tell us they were last modified some years ago, before their creation date. Likewise, a file will tell us what it was named by its creator, but this is no guarantee that the name will be sensible or illuminating – think of your own file-naming practices and tell us, if you can, that every file has always been crystal-clear in its nature from the name alone! Yet the sheer bulk of born-digital material means that we will often have to rely on automatic harvesting of this type of information without necessarily subjecting it to human quality-checking. Catalogues of the future, built by this new non-human agency, will have to carry appropriate health warnings for the user.
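As a rough illustration of what such a health warning might look like, the sketch below builds a catalogue record from a file's name and its reported dates, and attaches warnings where the harvested information looks untrustworthy. Everything in it (`HarvestedRecord`, `build_record`, the wording of the warnings, the sample file) is hypothetical; a real workflow would read the dates from the file system or from a digital preservation tool rather than passing them in by hand.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class HarvestedRecord:
    title: str          # taken straight from the file name
    created: datetime   # "date created" as reported by the file
    modified: datetime  # "date modified" as reported by the file
    warnings: List[str] = field(default_factory=list)


def build_record(name: str, created: datetime, modified: datetime) -> HarvestedRecord:
    """Build a record from machine-supplied metadata, flagging doubtful values."""
    record = HarvestedRecord(title=name, created=created, modified=modified)
    # Every harvested record carries a general warning.
    record.warnings.append("Harvested automatically; not checked by an archivist.")
    # A modification date earlier than the creation date suggests the file
    # was copied between machines and its creation date was reset.
    if modified < created:
        record.warnings.append(
            "Modified date precedes created date; created date may reflect a later copy."
        )
    return record


# Invented example: an old file copied to a new PC.
record = build_record(
    "doc1.doc",
    created=datetime(2014, 9, 1),
    modified=datetime(2009, 3, 15),
)
for warning in record.warnings:
    print(warning)
```

The point of the design is that the doubtful value is kept and labelled rather than silently corrected, leaving the judgement to the reader.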
Machines to show us the catalogues, machines to read them, machines to help assemble them – the rise of the machines makes this a time of interesting change for archivists and for archive users. One thing that will not change, of course, is our basic mission: whatever the means, we are focussed on getting information about our holdings to human readers, and we still need you – whether in person or online – to look at them. In the year 2525 the archive user may have dwindled to a tiny, etiolated figure sitting in a pod with a screen, with one huge index finger for pressing buttons: we will not care, we will still be serving that sci-fi figure as we do now.
Author: Dr Chris Hilton is a senior archivist at the Wellcome Library.