"tcalc" (and its web accessible counterpart, wtcalc) is a UNIX tool for computing, amongst other things, T-complexity, T-information and T-entropy. The use of this tool and the interpretation of the results it produces needs some care and understanding. As with any tool, it has limitations, and on top of this there are some apparent paradoxes that need to be understood if one is to use this tool to compute for example the Shannon entropy of a source from a sample string. There exists a tutorial page which may help you understand these and other points.
We use the terms T-entropy, T-information and T-complexity to distinguish these from other better known but less useful variants. What is important to understand is that the T-entropy is essentially equivalent but for a scaling constant, with the Shannon entropy (and also in the right circumstances with the Kolmogorov-Sinai entropy, referred to as KS-entropy, sometimes referred to simply as the K-entropy.)
The scaling issues are paramount to understanding how to make use of the output results of tcalc.
tcalc measures a strings complexity (Rather like the LZ complexity) in terms of the "effective number of steps required to produce the string from its alphabet". Unlike the LZ scheme, the T-complexity assumes a specific recursive hierarchal parsing of the string. This means that as the vocabulary is compiled no search for the optimum is performed. (see "Uniqueness theorems for tcodes" by Nicolescu and Titchener) The decomposition of a string into its constituent vocabulary is strictly mechanical. Recent implementations of the basic tcalc decomposition algorithm approach computational times proportional with string length.
Where as the LZ algorithm
vocabulary size (i.e. production complexity) is bounded asymptotically
by n/log(n), the T-complexity is bounded by what turns out to be a very
similar function, the logarithmic integral (see Abramowitz and Stegan,
mathematical tables) , in fact C_T is bounded nominally by li(n x
ln(#A)), where #A is the alphabet size. One may show using L'Hopital's
rule that these two functions are asymptotically similar as n->\infty which
is not surprising.
For a binary alphabet we have nominally C_T < li(n x ln(2)) = li(ln(2^n)) . The bound is useful especially in that it allows us to linearise the measure. Indeed the argument ln(2^n) is highly suggestive. In Shannon speak, were this to be log_2(2^n) = n we might recognise this to be the number of bits required to communicate distinctly all of the 2^n possible n-bit messages. Thus ln(2^n) is a measure of the information content of the maximal complexity strings, but measured in "nats" (natural log) rather than "bits".
We have conjectured
that the T-complexity of a string (measured in T-augmentation steps, i.e..
"taugs") is thus simply the logarithmic integral of the total information
content of the string, measured in "nats"). Turning this around we may
define the T-information (nats) of the string to be the inverse logarithmic
integral of the T-complexity (taugs), and thus the information per symbol
(T-entropy) is simply the (T-information)/(no of symbols)
So given a string tcalc basically returns three numbers: the T-complexity in taugs, the T-information in nats, and the T-entropy in nats/symbol
It assumes each character
in the string occupies a byte character.
More on this shortly.
(last updated May.2005)
Relevant papers are available here:
performance, absolutely!" Titchener 2002
(This paper uses the above corpora to effectively analyse the absolute performance of a number of popular compressors)
based entropies of Deterministic and Stochastic maps"
Ebeling, Steuer, Titchener, 2001 postscript, PDF
Chaos and Information Theory",
Titchener ebeling 2001 postscript, PDF
of information", Titchener 1999
calibrated Corpus for Compression testing"
Titchener, Fenwick, Chen, 1998 postscript, PDF