Determining rates of JPEG 2000 compression on a collection-by-collection basis

As a result of our decision to “go lossy”, we need to make sure that the level of lossiness is appropriate to the image content. We can’t do this on the individual image level, as there are simply too many images. But we can do this on the collection level. We came up with a rule of thumb:

For any given collection of physical formats, we apply a range of compression levels to a representative sample from that collection. We continue compressing at regular intervals (e.g. 2:1, 4:1, 6:1, and so on) until visual artefacts begin to appear on any individual image.
Once we have determined the compression level at which the worst-performing image begins to show visual artefacts, we choose the next lowest compression level (if the worst-performing image shows artefacts at 10:1, we choose 6:1) and apply that to the entire collection, regardless of how much more compression other material types in that collection might bear.
This rule of thumb allowed us to strike a balance between storage savings and the time and effort in assessing compression levels for a large number of images.
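
As a rough illustration, the rule of thumb could be sketched in Python as follows. This is not the tooling used for the test: the function name, the sample names and the handling of the case where no artefacts are seen at all are assumptions made for the example, and the artefact judgements themselves still come from a person looking at the images.

```python
def choose_collection_level(levels, first_artefact_level):
    """Pick one compression level for a whole collection.

    levels: tested ratios in increasing order, e.g. [2, 4, 6, 10] for 2:1 to 10:1.
    first_artefact_level: maps each sample image to the lowest tested ratio at
        which a reviewer saw artefacts, or None if none were seen at any ratio.
    """
    seen = [lvl for lvl in first_artefact_level.values() if lvl is not None]
    if not seen:
        # Assumption: if nothing ever showed artefacts, fall back to the
        # highest ratio that was actually tested.
        return levels[-1]
    worst = min(seen)  # the worst-performing image is the first to show artefacts
    idx = levels.index(worst)
    if idx == 0:
        raise ValueError("artefacts at the lowest tested level; test lower ratios")
    return levels[idx - 1]  # the next lowest tested level


# Hypothetical observations for three sample images.
tested = [1, 2, 4, 6, 10, 25, 50, 100]
observed = {
    "colour_photo_of_painting": 10,
    "typescript_letter": None,
    "newsprint": 100,
}
print(choose_collection_level(tested, observed))
```

With these made-up observations the function returns 6, matching the 10:1 to 6:1 example in the rule above.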

The first “real life” test of this methodology was carried out in relation to our archives digitisation project. We are currently digitising a series of paper archives (letters, notebooks, photos, invitations, memos, etc.) in-house. The scope runs to something like half a million images over a couple of years, and includes the papers of some notable individuals and organisations (Francis Crick being the foremost of these). Archives can be quite miscellaneous in the types of things that you find, but different collections within the archives tend to contain a similar range of materials. This presents a problem if you want to treat images differently depending on their content. The photographer doesn’t know, from one file of material to the next, what sort of content they will be handling. So even for a miscellaneous collection, once the image count gets high enough, you have to make the compromise by taking a collection-level decision on compression rates.

For archival collections we needed to test things like faint pencil marks on a notebook page, typescript on translucent letter paper, black and white photos, printed matter, newsprint, colour drawings, and so on. We chose 10 samples for the test. As this was our first test, and we were curious just how far we could go for some of the material types in our sample, we started with 1:1 lossy compression and increased this to 100:1. We used LuraWave for this testing.

For the archives, the compression intervals were: 1:1 lossy, 2:1, 4:1, 6:1, 10:1, 25:1, 50:1, and 100:1. The idea is that at 2:1, the compression will reduce the file size by half in comparison to the source TIFF, and so on.

Not surprisingly, the biggest drop in file size came from converting from TIFF to JPEG 2000 in the first place. At a 1:1 compression rate, this reduced the average file size by 86% (ranging from 67% to 95%). A 2:1 compression resulted in no noticeable drop in file size from 1:1, raising the question of what differences there could possibly be between 1:1 and 2:1 in the LuraWave software. At the average file size at this compression (about 5 Mb at 2:1), a 500,000-image repository (our estimate for the archives project) would require 2.4 Tb of storage. These averages are somewhat misleading, because while they represent a spread of material, they do not represent the relative proportions of this material in the actual collection as a whole (and we can’t estimate that yet).

File size reduction was relatively small between 2:1 and 10:1. What is obvious here is that setting the compression rate at, say, 2:1 does not give you a 2:1 ratio relative to the source TIFF. In fact, you can achieve a 14:1 ratio or higher. An interesting point about the very high experimental compression rates of 25:1 and above is that output file sizes were essentially homogeneous across all the images, whereas at 10:1 and lower, file sizes ranged from 1.5 Mb to 11.5 Mb.

TIFF = 35 Mb
1:1/2:1 = 4.96 Mb (86% reduction)
4:1 = 4.56 Mb (87% reduction)
6:1 = 3.89 Mb (89% reduction)
10:1 = 2.87 Mb (92% reduction)
25:1 = 1.39 Mb (96% reduction)
50:1 = 0.72 Mb (98% reduction)
100:1 = 0.37 Mb (99% reduction)
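
For readers who want to check the arithmetic, the percentages above can be reproduced from the average sizes with a few lines of Python. This is only a sketch; the function and variable names are made up for the example. Note that it works from averages, so the effective ratio it prints is a ratio of average sizes, not a per-image figure. Individual images vary widely (a 67% reduction is roughly 3:1, a 95% reduction is 20:1).

```python
TIFF_MB = 35.0  # average source TIFF size from the list above

# Average JPEG 2000 sizes (in Mb, as written in the post) per compression setting.
avg_size_mb = {
    "1:1/2:1": 4.96,
    "4:1": 4.56,
    "6:1": 3.89,
    "10:1": 2.87,
    "25:1": 1.39,
    "50:1": 0.72,
    "100:1": 0.37,
}


def reduction_percent(original_mb, compressed_mb):
    """Percentage by which the compressed file is smaller than the original."""
    return 100.0 * (1.0 - compressed_mb / original_mb)


def effective_ratio(original_mb, compressed_mb):
    """Ratio actually achieved against the source, e.g. 7.0 means 7:1."""
    return original_mb / compressed_mb


for setting, size in avg_size_mb.items():
    print(
        f"{setting:>8}: {reduction_percent(TIFF_MB, size):.0f}% reduction, "
        f"about {effective_ratio(TIFF_MB, size):.1f}:1 against the TIFF"
    )
```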

We found that the most colourful images in the collection (such as a colour photograph of a painting) performed the worst, as expected, and started to show artefacts at 10:1. These were extremely minor artefacts, but they could be seen. Surprisingly, other material types, mostly black and white textual items, were impossible to differentiate from the originals even at 50:1 or 100:1. Using our rule of thumb, we chose 6:1 lossy compression for the archive collections. Were an archive to consist solely of printed pieces of paper, we would reassess and choose a higher compression rate, but an 89% reduction was highly acceptable in storage savings terms.

You may ask: why not just use 1:1 across the board? Is the extra saving actually worth it? Compared with the 1:1 setting, we were getting a better than 20% reduction at 6:1 on average. This represents a significant storage saving when you consider that the ultimate goal is to digitise around 3.5 million images from the archive collections. Bearing in mind all the other collections we plan to digitise in future (up to 30 million images), the savings are magnified further if we strive to reduce file sizes within the limits of what is visually acceptable.
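
To put rough numbers on this, here is a small Python sketch using the average sizes from the test (4.96 Mb at 1:1 and 3.89 Mb at 6:1). The function names are invented for the example, binary units (1 Tb = 1024 × 1024 Mb) are an assumption, and the projections inherit the caveat that the sample averages do not reflect the real mix of material in the collection.

```python
AVG_1_TO_1_MB = 4.96  # average size at the 1:1 setting
AVG_6_TO_1_MB = 3.89  # average size at the 6:1 setting


def relative_saving_percent(baseline_mb, compressed_mb):
    """Saving of one JPEG 2000 setting measured against another."""
    return 100.0 * (baseline_mb - compressed_mb) / baseline_mb


def storage_tb(image_count, avg_mb):
    """Projected storage in Tb, assuming 1 Tb = 1024 * 1024 Mb."""
    return image_count * avg_mb / (1024 * 1024)


print(f"6:1 vs 1:1: {relative_saving_percent(AVG_1_TO_1_MB, AVG_6_TO_1_MB):.1f}% smaller")

for count in (500_000, 3_500_000):
    print(
        f"{count:>9,} images: "
        f"{storage_tb(count, AVG_1_TO_1_MB):.1f} Tb at 1:1, "
        f"{storage_tb(count, AVG_6_TO_1_MB):.1f} Tb at 6:1"
    )
```

The same two functions can be reused with any other image count or average size once better estimates of the collection mix are available.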

There are a couple of follow-on questions remaining from all this. First, what size original should you begin with? And secondly, is it possible to automate compression using a quality control measure (such as peak signal-to-noise ratio) that allows you to compress different images at different rates depending on an accepted level of accuracy? These will be the subject of future posts.
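
As a taste of the second question, peak signal-to-noise ratio (PSNR) is defined as 10 · log10(MAX² / MSE), where MSE is the mean squared difference between original and compressed pixel values and MAX is the largest possible pixel value (255 for 8-bit images). The Python sketch below computes it for two flat lists of pixel values and then picks the highest tested ratio that stays above a threshold. The function names, the threshold and the example PSNR values are all hypothetical, and PSNR does not always agree with what the eye sees, so this is a starting point, not a tested workflow.

```python
import math


def psnr(original, compressed, max_value=255):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    if len(original) != len(compressed) or not original:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(original, compressed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


def highest_acceptable_ratio(psnr_by_ratio, min_psnr_db):
    """Highest tested ratio whose PSNR meets the threshold, or None."""
    acceptable = [r for r, value in psnr_by_ratio.items() if value >= min_psnr_db]
    return max(acceptable) if acceptable else None


# Hypothetical PSNR measurements for one image at three tested ratios.
measured = {2: 50.0, 10: 41.0, 25: 35.0}
print(highest_acceptable_ratio(measured, min_psnr_db=40.0))
```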