Using tcalc (page in flux)

"tcalc" (and its web accessible counterpart, wtcalc) is a UNIX tool for computing, amongst other things, T-complexity, T-information and T-entropy. The use of this tool and the interpretation of the results it produces needs some care and understanding. As with any tool, it has limitations, and on top of this there are some apparent paradoxes that need to be understood if one is to use this tool to compute for example the Shannon entropy of a source from a sample string. There exists a tutorial page which may help you understand these and other points.

We use the terms T-entropy, T-information and T-complexity to distinguish these from other better-known but less useful variants. What is important to understand is that the T-entropy is essentially equivalent, up to a scaling constant, to the Shannon entropy (and also, in the right circumstances, to the Kolmogorov-Sinai entropy, referred to as the KS-entropy, or sometimes simply the K-entropy).

These scaling issues are central to making proper use of the output of tcalc.

tcalc measures a string's complexity (rather like the LZ complexity) in terms of the "effective number of steps required to produce the string from its alphabet". Unlike the LZ scheme, the T-complexity assumes a specific recursive hierarchical parsing of the string. This means that no search for an optimum is performed as the vocabulary is compiled (see "Uniqueness theorems for T-codes" by Nicolescu and Titchener). The decomposition of a string into its constituent vocabulary is strictly mechanical. Recent implementations of the basic tcalc decomposition algorithm achieve computation times roughly proportional to the string length.
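The "number of steps" view can be illustrated with a toy sketch. This is not the actual tcalc decomposition algorithm, only a restricted special case of T-augmentation, assumed here for illustration: at each step the whole current string is copied k times and a new literal is appended, and each step contributes log2(k+1) "taugs" to the running complexity.

```python
from math import log2

def t_augment(literal, steps):
    """Toy T-augmentation-style construction (illustrative only).

    Starting from a single literal, each step (k, a) replaces the
    current string s with k copies of s followed by the literal a,
    and costs log2(k + 1) taugs.  The total mimics how T-complexity
    counts the effective number of construction steps.
    """
    s = literal
    c_t = 0.0
    for k, a in steps:
        s = s * k + a          # k copies of the current string, then a
        c_t += log2(k + 1)     # this step's contribution in taugs
    return s, c_t

s, c = t_augment("0", [(1, "1"), (2, "0"), (1, "1")])
print(s, c)   # the string grows quickly while the taug count grows slowly
```

Note how a long string can be produced by very few steps: this is exactly the sense in which a highly repetitive string has low complexity.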

Whereas the LZ algorithm's vocabulary size (i.e. production complexity) is bounded asymptotically by n/log(n), the T-complexity is bounded by what turns out to be a very similar function, the logarithmic integral (see Abramowitz and Stegun, Handbook of Mathematical Functions). In fact C_T is bounded nominally by li(n x ln(#A)), where #A is the alphabet size. One may show using L'Hopital's rule that these two functions are asymptotically similar as n -> infinity, which is not surprising.
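For readers who want to evaluate the bound numerically, li can be computed from its standard series li(x) = gamma + ln(ln x) + sum_{k>=1} (ln x)^k / (k * k!) for x > 1. A minimal sketch (the function names are illustrative, not part of tcalc):

```python
from math import log

EULER_GAMMA = 0.5772156649015329

def li(x, terms=200):
    """Logarithmic integral li(x) via its standard series:
    li(x) = gamma + ln(ln x) + sum_{k>=1} (ln x)^k / (k * k!),  x > 1."""
    u = log(x)
    total = EULER_GAMMA + log(u)
    term = 1.0
    for k in range(1, terms + 1):
        term *= u / k          # term is now u**k / k!
        total += term / k
    return total

# Nominal T-complexity bound (in taugs) for a binary string of length n:
n = 10_000
bound_taugs = li(n * log(2))   # li(n ln 2) = li(ln 2**n)
print(bound_taugs)
```

The series converges quickly here because the argument of li enters only through its logarithm.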

For a binary alphabet we have nominally C_T < li(n x ln(2)) = li(ln(2^n)). The bound is useful especially in that it allows us to linearise the measure. Indeed the argument ln(2^n) is highly suggestive. In Shannon terms, were this log_2(2^n) = n, we would recognise it as the number of bits required to communicate distinctly all of the 2^n possible n-bit messages. Thus ln(2^n) is a measure of the information content of the maximal-complexity strings, but measured in "nats" (natural logarithm) rather than "bits".

We have conjectured that the T-complexity of a string (measured in T-augmentation steps, i.e. "taugs") is thus simply the logarithmic integral of the total information content of the string, measured in nats. Turning this around, we may define the T-information (in nats) of the string to be the inverse logarithmic integral of the T-complexity (in taugs), and thus the information per symbol (the T-entropy) is simply (T-information)/(number of symbols).
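This inversion is straightforward to carry out numerically, since li is increasing for x > 1. A hedged sketch using bisection (the complexity and string length below are made-up numbers, not real tcalc output):

```python
from math import log

EULER_GAMMA = 0.5772156649015329

def li(x, terms=200):
    """li(x) = gamma + ln(ln x) + sum_{k>=1} (ln x)^k / (k * k!),  x > 1."""
    u = log(x)
    total = EULER_GAMMA + log(u)
    term = 1.0
    for k in range(1, terms + 1):
        term *= u / k
        total += term / k
    return total

def t_information(c_t):
    """T-information (nats) = inverse logarithmic integral of the
    T-complexity (taugs), found by bisection on the increasing li."""
    lo, hi = 1.0 + 1e-12, 2.0
    while li(hi) < c_t:        # expand until the root is bracketed
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if li(mid) < c_t:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical example: C_T = 50 taugs for a 100-symbol string.
info_nats = t_information(50.0)
entropy = info_nats / 100      # T-entropy in nats/symbol
print(info_nats, entropy)
```

Dividing the recovered T-information by the string length then gives the T-entropy in nats per symbol, exactly as described above.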

So, given a string, tcalc basically returns three numbers: the T-complexity in taugs, the T-information in nats, and the T-entropy in nats/symbol.

It assumes each character in the string occupies one byte.

More on this shortly.
(last updated May 2005)

Papers to read...

Relevant papers are available here:

"Compressor performance, absolutely!" Titchener 2002
  postscript, PDF
(This paper uses the above corpora to effectively analyse the absolute performance of  a number of popular compressors)

"Partition based entropies of  Deterministic and Stochastic maps"
    Ebeling, Steuer, Titchener, 2001  postscript, PDF

"Deterministic Chaos and Information Theory",
       Titchener  ebeling 2001 postscript, PDF

"A measure of information", Titchener 1999
  postscript, PDF

"Towards a calibrated Corpus for Compression testing"
    Titchener, Fenwick, Chen, 1998  postscript, PDF