In 2009, the Wellcome Library set out an ambitious vision to digitise a large proportion of its historic collections. This would take the Library's annual digitisation output from hundreds, or at most thousands, of images per year to several million. Collections were to include a wide range of content types: archives, printed books from the 15th to the 20th century, manuscripts, paintings and drawings, and ephemera. Once we added up all these collections, using broad estimates of what we believed was there, we realised the programme could generate up to 30 million images over five years. Exciting, but perhaps slightly daunting, considering we didn’t yet have an infrastructure to fully support such a large collection of digital assets.
Anyone reading this blog will understand why the scale of the programme is key to the blog topic. When we asked our IT department to tell us how much it would cost to store 30 million TIFF files – our de facto standard for the couple of hundred thousand images in our existing picture library – we were stunned. Two petabytes of online, spinning-disk storage with a top-of-the-line enterprise management system and remote backup would cost how much? We learned that the cost would be something like a fifth of our total budget for the entire digitisation programme.
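For readers who like to see the sums, here is a back-of-the-envelope sketch of what those two figures imply. It is only an illustration: the function names are made up for this post, decimal units (1 PB = 10^15 bytes) are an assumption, and the result is an implied average across all material rather than a measured file size. The two-petabyte figure may also have included overheads that are not broken out here.

```python
# Rough storage arithmetic; all names here are hypothetical helpers.
PB = 10**15  # assumption: decimal petabyte
MB = 10**6   # assumption: decimal megabyte


def implied_mb_per_image(total_bytes: float, image_count: int) -> float:
    """Average size per image, in MB, implied by a total storage figure."""
    return total_bytes / image_count / MB


def total_petabytes(image_count: int, mb_per_image: float) -> float:
    """Total storage, in PB, for a given image count and average size."""
    return image_count * mb_per_image * MB / PB


if __name__ == "__main__":
    average = implied_mb_per_image(2 * PB, 30_000_000)
    print(f"Implied average size: {average:.1f} MB per image")
```

On these assumptions, the quote works out to roughly 67 MB per image on average. Because the total scales linearly with the average file size, halving that average would halve the storage needed.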
Should we consider a lower-cost storage solution? Even tape backup was quite expensive at that scale, and you can’t serve images online from tape anyway. We revised our image sizes, factoring in smaller and smaller resolutions and/or bit depths for material like the printed books, which didn’t need full-colour, high-resolution images. We still couldn’t afford the storage costs.
Finally, we saw the light and started looking into a relatively new image format called JPEG 2000. We knew almost nothing about it, except that it employed an extremely efficient compression algorithm that could, possibly, allow us to reduce our storage costs without compromising too much on quality.
This was the start of our journey into the complicated and mystifying world of JPEG 2000. This blog charts our progress to date in determining what type of JPEG 2000 we would use, how we would use it, and how it would affect the rest of the Digital Library infrastructure. We have by no means worked out all the details of how we are going to implement JPEG 2000, so this blog will also serve as a diary of our progress as we go along. Happy reading, and feel free to post comments.