The Entropy Pages (2000)
last update Aug. 2004
prepared by M.R. Titchener  (email mark@tcode.auckland.ac.nz)
Department of Computer Science
University of Auckland

This set of web pages is intended as an entropy resource with a difference. A software application, "tcalc.c", is provided for the practical computation of entropy in symbolically encoded strings. [tcalc implements a string-decomposition algorithm, resulting in a particular recursive hierarchical parsing from which string complexity, information content, and information rate, i.e. entropy, are computed. This particular implementation is not efficient and runs in O(n^2) time, where n is the length of the input string. More recent implementations, ftcalc and qtcalc, achieve O(n) decomposition, amounting to speedup factors of more than 4000 for strings 10,000,000 symbols long.]

In addition, links are provided to a series of pages documenting the entropy computed for a variety of sources, for example from non-linear dynamics, natural-language texts, and DNA sequences. Links are also provided to papers and articles describing the measure.

These results are computed using a novel grammar-based measure of information. Here the definitions of string complexity, information, and entropy are all based on the notion that a finite string of characters may be systematically and mechanically constructed from an alphabet, and that information is simply a measure of the number of steps required to create the string from its alphabet.

Our measure is based on counting the number of steps required to construct a string from its alphabet. Clearly such a notion is related to Kolmogorov's algorithmic information content (1965), and also to Lempel and Ziv's measure of complexity, proposed in 1976 for finite strings, which ultimately gave rise to the LZ family of compressors. It is less evident that the measure we use here relates to Shannon's measure, which is defined in probabilistic terms, but in fact our measure is demonstrably equivalent. We use results from non-linear dynamics to illustrate its unequivocal correspondence. On the other hand, our measure is easily computable for any finite string. It makes no reference to the statistical characteristics of the string: it uses no probabilities and relies on no knowledge of the source or source statistics, yet delivers results for any finite string.
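
As a concrete illustration of step counting, consider the exhaustive Lempel-Ziv (1976) production history of the string "aabaab" over the alphabet {a,b}: a . ab . aab, i.e. three production steps. The sketch below (in C, to match tcalc.c) computes this LZ76 production complexity. It is offered only as a minimal illustration of the step-counting idea, using the related Lempel-Ziv measure rather than the T-decomposition that tcalc itself implements; the function name lz76_complexity is ours.

    #include <stdio.h>
    #include <string.h>

    /* LZ76 production complexity (Lempel & Ziv, 1976), computed with
       the scanning algorithm of Kaspar & Schuster (1987).  Returns the
       number of components in an exhaustive production history of
       s[0..n-1]. */
    static int lz76_complexity(const char *s, int n)
    {
        int c = 1;      /* component count; the first symbol is one step */
        int l = 1;      /* length of the prefix already produced         */
        int i = 0;      /* start of the candidate copy source            */
        int k = 1;      /* current match length                          */
        int kmax = 1;   /* longest match found for this component        */

        if (n <= 1)
            return n;
        for (;;) {
            if (s[i + k - 1] == s[l + k - 1]) {
                k++;
                if (l + k > n) { c++; break; }  /* matched to the end */
            } else {
                if (k > kmax)
                    kmax = k;
                i++;
                if (i == l) {   /* no copy source left: new component */
                    c++;
                    l += kmax;
                    if (l + 1 > n)
                        break;
                    i = 0; k = 1; kmax = 1;
                } else {
                    k = 1;
                }
            }
        }
        return c;
    }

    int main(void)
    {
        const char *s = "aabaab";
        printf("LZ76 complexity of %s = %d\n",
               s, lz76_complexity(s, (int)strlen(s)));
        return 0;
    }

For "aabaab" this prints 3, matching the hand parse above.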

Our software has been tested on strings of up to 100 million symbols and accepts alphabet sizes of up to 256 characters. A number of diagnostic outputs are available.


Deterministic Information Theory

DET-IT (in progress... only the first tentative pages are provided here) defines a new approach to information theory, one which contrasts with Shannon's probabilistic formulation and with Kolmogorov's and Chaitin's algorithmic formulations. To distinguish this new variant of information theory from these well-established areas, I have chosen to call it Deterministic-IT.

Recent developments in non-linear dynamics have shown that deterministic systems can exhibit behaviours we might more usually associate with stochastic indeterminism. Shannon (p. 50 of "The Mathematical Theory of Communication" by Shannon and Weaver) noted that the form of his equation was that of "...entropy as defined in certain formulations of statistical mechanics". It would appear that even by 1938, before Shannon published his now famous work on IT, the idea of symbolically encoding the state-space trajectories of physical dynamical systems had already been conceived. In any case, Shannon in the aforementioned publication reiterates the connection, which was picked up by, amongst others, Kolmogorov. Since the entropy is sensitive to the particular encoding partition, Kolmogorov's focus was on the maximum value taken over all possible finite or infinite encodings. For many systems the supremum is infinite (noted also by Shannon for the continuous limit of the entropy definitions), but for certain dissipative non-linear dynamical systems the supremum remains finite and is, as Schuster points out, a fundamental characteristic of the dynamical system. Hence we have the notion of Kolmogorov-Sinai entropy (KS-entropy), essentially the supremum of the Shannon entropies computed over all possible partitions (finite and infinite) of the phase-space encodings.

In 1977 Pesin proved that the KS-entropy of certain classes of dissipative or hyperbolic non-linear dynamical systems is given precisely by the sum of the corresponding positive Lyapunov exponents of the system. A number of practical techniques exist for computing the Lyapunov exponents of simple systems quite precisely from the observed dynamics. Thus the KS-entropy may be computed indirectly, without reference to the source statistics, for several simple dynamical systems. The "logistic map" provides an excellent example of a coded information source of known entropy.
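
For the logistic map x(n+1) = r.x(n).(1 - x(n)) the single Lyapunov exponent may be estimated directly as the orbit average of ln|f'(x)| = ln|r(1 - 2x)|, and by Pesin's identity this equals the KS-entropy when it is positive. A minimal sketch in C follows; the parameter r = 4, the iteration counts, and the initial condition are our choices for illustration (at r = 4 the exact value is ln 2 nats, i.e. one bit per iteration):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const double r = 4.0;   /* fully chaotic parameter     */
        double x = 0.3;         /* arbitrary initial condition */
        double sum = 0.0;
        long i, transient = 1000, steps = 1000000;

        for (i = 0; i < transient; i++)      /* discard transient */
            x = r * x * (1.0 - x);

        for (i = 0; i < steps; i++) {
            double d = fabs(r * (1.0 - 2.0 * x));
            if (d > 0.0)        /* skip the critical point x = 1/2 */
                sum += log(d);
            x = r * x * (1.0 - x);
        }
        printf("Lyapunov exponent ~ %.6f nats/iteration (ln 2 = %.6f)\n",
               sum / steps, log(2.0));
        return 0;
    }

Compiled with, e.g., "cc lyapunov.c -lm", this prints a value close to 0.693147.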

What may be taken from non-linear dynamics is the idea that a duality exists between deterministic and probabilistic formulations of a system's behaviour. Thus it was that Kolmogorov in effect anticipated that Shannon's probabilistic theory might have an alternative formulation. Algorithmic Information Theory was born out of this conjecture. However, AIT places significant practical limits upon the determination of the information content of individual finite strings, simply because of its reliance on the existence of a universal computing engine, which is critical to the notion of the "shortest algorithm". In a sense AIT fails because it mixes syntactic information content with semantic content by implicitly interpreting a string as a computer program, where symbols are attributed instructional meaning. Deterministic IT avoids reference to any computing machinery, and in this regard follows the initial steps of Lempel and Ziv (1976). But we go much further in relating information content to string complexity. Whereas the probabilistic treatment abstracts away the underlying mechanisms of the information processes, making it difficult, indeed well nigh impossible, to apply meaningfully to individual finite strings, the deterministic measure gives direct access to the information and entropy of individual finite strings without reference to the source or ensemble statistics.

Here the logistic map is used as a known information source to demonstrate DET-IT's unequivocal correspondence with Shannon's entropy, though there remains a clear challenge to prove formally their absolute equivalence.
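
To use the logistic map as a test source, an orbit must first be symbolically encoded. Below is a minimal sketch, again in C, using the standard binary partition at x = 1/2, which is generating for r = 4, so that the output string carries exactly one bit per symbol. The output length and initial condition are arbitrary choices of ours, and the resulting string is the kind of input an entropy calculator such as tcalc is intended to analyse:

    #include <stdio.h>

    int main(void)
    {
        const double r = 4.0;
        double x = 0.3;
        long i, transient = 1000, n = 10000;

        for (i = 0; i < transient; i++)      /* discard transient */
            x = r * x * (1.0 - x);
        for (i = 0; i < n; i++) {
            putchar(x < 0.5 ? '0' : '1');    /* binary partition at 1/2 */
            x = r * x * (1.0 - x);
        }
        putchar('\n');
        return 0;
    }

Over long runs the relative frequencies of '0' and '1' each approach 1/2, yet the string is fully deterministic, which is precisely the duality discussed above.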

The software "tcalc.c" which is used to compute these results is provided subject to conditions of the gnu general public license. executable versions of ftcalc and qtcalc may be available for PC Windows and Mac/Unix/Linux, upon request.



Links to entropy results: