Compressor evaluation using the available corpus of files.

The following graphs plot the total T-information content (in bytes) of a reference corpus file versus its size (in bytes) before and after compression, one graph per corpus file, each using a range of popular compressors: gzip, compress, ppmz, bzip, Huffman, adaptive Huffman and Shannon-Fano.

Each graph uses a single packed binary file selected from the corpus of files available here.

The following graph illustrates how the results may be interpreted. An incompressible file has Shannon entropy 1.0, and thus lies on the line with slope 1.0, at a position corresponding to its file size. Likewise, a corpus file with entropy of approximately 0.1 lies on the line with slope 0.1, at a position corresponding to its length, here 250,000 bytes. When a file is compressed, its entropy increases. Most compressors approach the entropy bound of 1.0, but also "add" information, a compression cost that lifts the mapping above the ideal (horizontal) mapping. This lifting ultimately prevents the compressor from achieving the compression bound corresponding to the line at slope 1.0.
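The geometry described above can be sketched in a few lines of Python. This is only an illustration, assuming that entropy is normalised so that 1.0 means incompressible and that the plotted information content is simply entropy times file size; the function names are hypothetical and do not come from the paper.

```python
def plot_point(size_bytes, entropy):
    """Return (x, y) for a file: x is its size, y its information content, both in bytes.

    Assumes information = entropy * size, i.e. the file lies on the line
    through the origin whose slope equals its (normalised) entropy.
    """
    return size_bytes, entropy * size_bytes


def ideal_compressed_point(size_bytes, entropy):
    """Return (x, y) after an ideal compression.

    The ideal mapping is horizontal (information is unchanged) and ends on
    the slope-1.0 line, so the compressed size equals the information content.
    """
    info = entropy * size_bytes
    return info, info


# The example from the text: entropy of about 0.1, length 250,000 bytes.
before = plot_point(250_000, 0.1)
after = ideal_compressed_point(250_000, 0.1)
print("before compression:", before)
print("ideal after compression:", after)
```

Under these assumptions the 250,000-byte file carries about 25,000 bytes of information, and an ideal compressor would move it horizontally to a file of about 25,000 bytes on the slope-1.0 line. A real compressor ends somewhere above that horizontal line.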

If a compressor loses information (or relies on the compressor having internally a detailed model of a class of sources), the mapping will fall below the ideal. In such a situation the compressed image file contains less information than the initial file.
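The distinction between a mapping that is lifted above the ideal and one that falls below it can be expressed as a simple comparison of information content before and after compression. The sketch below is illustrative only: the function name, the labels and the tolerance parameter are assumptions, and the two information values would have to come from a T-information measurement that is not shown here.

```python
def classify_mapping(info_before, info_after, tolerance=0.0):
    """Classify a compression mapping relative to the ideal (horizontal) mapping.

    info_before, info_after: information content in bytes, measured on the
    original and the compressed file respectively (hypothetical inputs).
    tolerance: differences no larger than this many bytes count as ideal.
    """
    delta = info_after - info_before
    if delta > tolerance:
        # The compressor "added" information: a compression cost.
        return "above ideal"
    if delta < -tolerance:
        # Information was lost, or supplied by a model inside the compressor.
        return "below ideal"
    return "ideal"


# Made-up numbers, used only to exercise the three cases.
print(classify_mapping(25_000, 27_500))
print(classify_mapping(25_000, 21_000))
print(classify_mapping(25_000, 25_000))
```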

A paper describing this in more detail is available here. See: "Compressor performance, absolutely!", Titchener 2001 (postscript, PDF).

1. lgst3.573550p.gz    H =  0.1  (.0943)

2. lgst3.586787p.gz    H =  0.2  (.1882)

3. lgst3.611055p.gz    H =  0.3  (.2971)

4. lgst3.651050p.gz    H =  0.4  (.3855)

5. lgst3.687660p.gz    H =  0.5  (.5010)

6. lgst3.766200p.gz    H =  0.6  (.5885)

7. lgst3.907580p.gz    H =  0.7  (.6799)

8. lgst3.925405p.gz    H =  0.8  (.7769)

9. lgst3.971029p.gz    H =  0.9  (.8550)

10. lgst4.000000p.gz    H = 1.0  (1.000)