Properties of the T-codes (In flux 21 Aug 2008)
 
Property 1 (string production):
T-augmentation may be viewed not only as an algorithm for the construction of code sets, but also, via the longest strings in the code set, as a string production algorithm.
 
Here, for example, the string ‘1001011110110’ is thus constructed in four steps.
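The production view can be made concrete with a short Python sketch. The particular prefixes and copy factors below are one set of construction parameters consistent with the example string (recovered with the decomposition sketch given under Property 2 below); the convention that each step prepends copies of the chosen codeword is assumed here.

    # String production by T-augmentation: x_i = p_i^k_i x_(i-1),
    # starting from a single alphabet symbol x_0.
    steps = [("1", 2), ("10", 1), ("10111", 1), ("100", 1)]  # (prefix, copy factor)
    x = "0"                                                  # x_0
    for p, k in steps:
        x = p * k + x                    # prepend k copies of the prefix
        print(x)
    # prints: 110, 10110, 1011110110, 1001011110110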
 
 
 
The idea that the information content of a string might be measured in terms of the number of steps required to construct it has been proposed by Solomonoff, Kolmogorov, and Chaitin at various points in time. Kolmogorov used the term algorithmic entropy to describe his proposed measure. Following the maxim that “to him who has, more shall be given”, the term algorithmic entropy has been renamed Kolmogorov complexity (sometimes the Kolmogorov-Chaitin complexity), notwithstanding the fact that Solomonoff was actually the first to identify this approach.
 
The Kolmogorov complexity of a string turns out to be uncomputable, making the notion impracticable as a working measure. In consequence of this profound proof of uncomputability, it is widely believed that all other measures of string complexity are likewise flawed or of no practical value.
 
Nevertheless, in 1976 Lempel and Ziv proposed a simple computable measure of string production complexity, based on the idea that a finite string may be constructed from its alphabet by copying patterns already produced, and that the minimum number of such linear pattern-copying steps measures, in some sense, the complexity of the string.
 
Their ideas spawned a series of popular data compressors. Many have since attempted to apply the original LZ76 parsing rule to measure string complexity, but for strings of any substantial size the parsing algorithms tend to be slow.
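For concreteness, the following is a minimal (and deliberately naive) Python sketch of an LZ76-style production parse, assuming the usual convention that each phrase is a copy of an earlier substring (overlap allowed) extended by one innovative symbol. Its quadratic search behaviour also illustrates why naive parsers become slow on long strings.

    def lz76_complexity(s: str) -> int:
        # count the phrases in an LZ76-style production parse of s
        n, i, phrases = len(s), 0, 0
        while i < n:
            l = 1
            # extend the phrase while it can still be copied from earlier
            # in the string (overlap with the phrase itself is allowed)
            while i + l <= n and s[i:i + l] in s[:i + l - 1]:
                l += 1
            phrases += 1
            i += l
        return phrases

    print(lz76_complexity("1001011110110"))   # 6 phrases for the Property 1 string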
 
T-augmentation offers an alternative production process to that proposed by Lempel and Ziv. Nevertheless, no proof exists to date as to whether this algorithm achieves optimality or something close to it, nor whether it is significantly better or worse than the LZ76 algorithm in terms of its measure. The uncomputability result of algorithmic information theory (AIT) suggests that it is not possible to prove optimality, or indeed to establish how close to optimality a T-code based measure might be.
 
What motivates the thrust of this work, in part, is the practical ease with which the production costs may be assessed for finite strings even tens of millions of symbols in length. In this regard the next property is significant.
 
Property 2 (Top-2-Bottom symmetries of T-code trees):
The nature of the copy/append steps, beginning with the alphabet, results in a reflected symmetry. As the example below shows, the intermediate sets appear both at the top and at the bottom of the final code set. It is this property that ensures that the longest string traverses all of the respective append nodes, corresponding, in reverse order, to the sequence of selected prefixes.
 
The result is an algorithm that enables one to discover the T-code tree construction parameters from knowledge of any one of the maximal-length strings alone.
 
 
Interestingly, one may not make a conclusive assessment about the actual alphabet used by the source. For example, the string ‘11111111’ is a valid string for an infinity of alphabet sizes: it could be from a binary set, but equally from an alphabet of radix 10. It is interesting to see, then, that the copy factors that make up the construction steps for the T-code tree are independent of the alphabet.
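Both observations may be illustrated with a small, deliberately naive Python sketch (not the efficient published T-decomposition): repeatedly parse the string with the current code set, take the penultimate codeword of the parse as the next T-prefix, and count its adjacent repetitions as the copy factor. The prepend convention of Property 1 is assumed.

    def t_augment(S, p, k):
        # S' = { p^j s : s in S \ {p}, 0 <= j <= k }  U  { p^(k+1) }
        rest = S - {p}
        return {p * j + s for j in range(k + 1) for s in rest} | {p * (k + 1)}

    def parse(x, S):
        # unique left-to-right parse of x into codewords of the prefix code S
        tokens, i = [], 0
        while i < len(x):
            w = next(w for w in S if x.startswith(w, i))   # assumes x is parseable
            tokens.append(w)
            i += len(w)
        return tokens

    def t_decompose(x, alphabet):
        S, steps = set(alphabet), []
        while True:
            tokens = parse(x, S)
            if len(tokens) == 1:              # x is itself a codeword: done
                return steps
            p, k = tokens[-2], 1              # penultimate codeword is the prefix
            while k + 2 <= len(tokens) and tokens[-2 - k] == p:
                k += 1                        # count the run of adjacent copies
            steps.append((p, k))
            S = t_augment(S, p, k)

    print(t_decompose("1001011110110", "01"))      # [('1', 2), ('10', 1), ('10111', 1), ('100', 1)]
    print(t_decompose("11111111", "01"))           # [('1', 7)]
    print(t_decompose("11111111", "0123456789"))   # [('1', 7)]: same copy factors, larger alphabet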
 
This is useful in respect of the next property, which leads to the definition of a string complexity.
 
Property 3 (Uniqueness properties of T-codes):
In 1995 Nicolescu furnished a proof (based on a T-decomposition algorithm implemented by the author in 1995) that for every string there exists a unique T-code set for which the given string represents a maximal-length code word.
 
From this result one may define a string complexity that is single-valued and numerically computable from the T-augmentation production steps of the corresponding T-code tree for the string.
 
The string complexity is defined to be:

$C_T(x) = \sum_{i=1}^{m} \log_2(k_i + 1)$

where $k_1, \ldots, k_m$ are the copy factors of the successive T-augmentation steps.
The copy factors may be derived from a string by way of the T-decomposition algorithm, as sketched under Property 2.
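Using the production steps recovered above for the example string of Property 1, the copy factors are (2, 1, 1, 1), so that $C_T = \log_2(3) + 3\log_2(2) = \log_2 3 + 3 \approx 4.58$.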
 
The following graph shows both the discretised nature of the complexity measure and its logarithmically driven density.
 
 
Property 4 (Lower bound on complexity):
The lower bound is easily derived. Clearly, to obtain a minimum complexity one must limit the number of contributions to the sum. This may be achieved with strings that are simply a repetition of a single symbol, for example ‘11111111...111’. We then have a single copy factor k = n - 1, where n is the length of the string. The complexity thus reduces to:
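$C_T = \log_2\big((n - 1) + 1\big) = \log_2 n$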
 
 
Property 5 (Upper bound on complexity for exhaustive simple T-augmentation):
By 1997 it was also evident from empirical results that the growth in the maximum T-complexity as a function of string length was quite precisely described by the function:
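$C_T^{\max}(n) \approx \mathrm{li}(n \ln r) \,/\, \ln 2$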
 
 
where r is the radix of the smallest alphabet encompassing the string elements, and li is the logarithmic integral function. This result has since been proven for exhaustive simple T-augmented strings (c.f. the complexity paper). While the bound is observed to hold for the intervening string lengths, no proof yet exists for these intermediate values. It seems reasonable, and is indeed useful, to assume that it does in fact hold.
 
We observe in passing that the asymptotic form of the Lempel-Ziv bound derived for the LZ76 algorithm for binary strings, n/log(n), is very similar to that given above. However, our bound on the T-complexity is good for almost all n, and not just in the limit of large n. This is exploited in the following linearisation of the complexity measure.
 
 
 
 
 
 
Property 6 (sensitivity to sources of constant information rate):
It is well known that certain non-linear dynamical systems can be used as information sources. The logistic map is an example of a one-dimensional recurrence relation with a single control parameter. It is a well-established result that rounding the real-valued iterates to 0 or 1 creates corresponding binary-encoded sequences whose Shannon entropy is simply a function of the control parameter. By selecting control parameters that result in a relatively uniformly distributed range of entropies, we may investigate the responsiveness of the complexity measure to these source entropies:
 
It is evident that the growth rates of the respective source complexities are dominated by the logarithmic integral function. We may thus effectively linearise the complexity measure by applying the inverse logarithmic integral function. We define:
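$I_T(x) = \mathrm{li}^{-1}\big(C_T(x) \ln 2\big) \quad \text{(nats)}$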
 
The motivation for assigning units follows from our observation that the term contained inside the logarithmic integral function in the upper bound is:
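$n \ln r$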
 
If the logarithm were taken to base 2, and the alphabet were binary, i.e. r = 2, this term would simply reduce to n, the number of bits, that is, the amount of information needed to prescribe a string of n bits. So we surmise that the term is in fact the maximum information content of strings from an alphabet of radix r, measured in nats rather than bits.
 
Property 7 (A linearised measure):
The linearisation above defines the T-information of a string, $I_T(x) = \mathrm{li}^{-1}(C_T(x) \ln 2)$, measured in nats.
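A numeric sketch of the linearisation in Python, assuming SciPy is available: li(x) = Ei(ln x), and the inverse may be obtained by bisection, since li is monotonic increasing on (1, infinity).

    import math
    from scipy.special import expi

    def li(x):
        return expi(math.log(x))             # li(x) = Ei(ln x)

    def li_inv(y, lo=1.0 + 1e-12, hi=1e15):
        for _ in range(200):                 # bisection in log space
            mid = math.exp(0.5 * (math.log(lo) + math.log(hi)))
            if li(mid) < y:
                lo = mid
            else:
                hi = mid
        return hi

    C_T = math.log2(3) + 3                   # T-complexity of the Property 1 string
    print(li_inv(C_T * math.log(2)))         # its T-information, in nats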
 
 
 
 
Property 8 (An Information  rate: the T-entropy):
The T-entropy is the corresponding linearised information rate, i.e. the T-information per symbol: $h_T(x) = I_T(x)/n$, in nats per symbol.
 
 
 
Property 9 (normalisation):
Normalisation is with respect to the typical string.
 
 
 
 
 
 
 
Property 10 (Relationship of normalised T-entropy to Shannon entropy rate):
The normalised T-entropy, i.e. the T-entropy normalised with respect to the typical string, may then be compared directly with the Shannon entropy rate of the source.
 
 
 
 
 
Property 11 (Relationship to Lyapunov exponent):
Here we may use a simple recurrence relation often referred to as the logistic map, or quadratic map: $x_{t+1} = r\,x_t(1 - x_t)$. The equation includes a control parameter r in the range [0, 4]. The initial value $x_0$ is also constrained to the unit interval, ensuring that all further iterates remain in the unit interval.
 
The time series $[x_t]$ then exhibits a variety of behaviours depending on the value of r. For r less than the Feigenbaum point (approximately 3.5699...) the time series exhibits stable or periodic behaviour. Above the Feigenbaum point a mixture of behaviours may be observed, much of it chaotic, but dependent on the precise value of r. The Lyapunov exponent measures the average tendency for initially close neighbouring states to diverge exponentially. If the Lyapunov exponent is negative or zero, the system exhibits stable or periodic behaviour; if positive, the system tends to be unpredictable. Pesin's identity establishes that for certain classes of dynamical systems, including the logistic map, the positive Lyapunov exponent is equal to the K-entropy (Kolmogorov-Sinai entropy) of the system, essentially the maximum Shannon entropy of the optimally symbolically encoded time series.
 
 
It has also been proven that the binary partition of the unit interval at 0.5 is a generating partition for the logistic map. That is, the binary-coded strings resulting, in effect, from rounding the iterates $x_t$ achieve the optimum encoding: the Shannon entropy of this symbolically coded trajectory is that of the K-entropy, and thus, by inference, equal to the positive Lyapunov exponent. Since it is easier and more precise to compute the Lyapunov exponent from the real-valued time series, we may deduce the expected Shannon entropy of the corresponding binary sample strings indirectly by way of the positive Lyapunov exponent.
 
In effect we are using the logistic map as an information source with ‘known’ entropy output, controlled by the parameter r. The relationship between r and the entropy is not at all straightforward, however, so it becomes necessary to compute the Lyapunov exponent across a range of values at small increments of r.
 
The block diagram of our string generator is thus:
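A minimal Python sketch of such a generator; the burn-in length, the initial value, and the threshold encoding at 0.5 are our choices, and the Lyapunov exponent is accumulated along the real-valued orbit as the average of ln|r(1 - 2 x_t)|.

    import math

    def logistic_source(r, n=100_000, x0=0.3, burn_in=1_000):
        x = x0
        for _ in range(burn_in):                         # discard the transient
            x = r * x * (1.0 - x)
        bits, lyap = [], 0.0
        for _ in range(n):
            lyap += math.log(abs(r * (1.0 - 2.0 * x)))   # ln|f'(x_t)|
            x = r * x * (1.0 - x)
            bits.append('1' if x >= 0.5 else '0')        # threshold encoding
        return ''.join(bits), lyap / n                   # sample string, Lyapunov exponent (nats)

    s, lam = logistic_source(4.0)
    print(lam)                                           # close to ln 2 = 0.693... at r = 4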
 
 
Taking strings of 100,000 bits each, one for each incremented value of r, we obtain computed values for the Lyapunov exponent and the scaled/normalised T-entropy. As the graph below shows, the T-entropy and the positive Lyapunov exponent match closely over the full range. We observe that the longer the sample strings, the closer the convergence between the two series.
 
 
Property 12 (Linearity with respect to logistic map with noise):
Yet another way exists to use the logistic map as an information source. This is due to Crutchfield. Setting the control parameter r to the Feigenbaum point, i.e. the onset of chaos, and then injecting noise into the logistic map, the entropy of the resultant series becomes linearly related to the amplitude of the noise.
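A minimal sketch of this noisy source; uniform noise and clipping to the unit interval are our assumptions.

    import random

    R_F = 3.5699456720                           # Feigenbaum point (onset of chaos)

    def noisy_logistic_bits(eps, n, x0=0.3):
        x, bits = x0, []
        for _ in range(n):
            x = R_F * x * (1.0 - x) + random.uniform(-eps, eps)
            x = min(max(x, 0.0), 1.0)            # keep iterates in [0, 1]
            bits.append('1' if x >= 0.5 else '0')
        return ''.join(bits)

    s = noisy_logistic_bits(1e-6, 10_000_000)    # one sample at noise amplitude 1e-6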
 
 
This provides one means of exploring the linearity of the T-entropy measure over an extended range of entropy values. By varying the noise amplitude over 10 orders of magnitude, and using strings of sufficient length (10,000,000 bits long), we have verified a near-linear sensitivity of the T-entropy over three and a half orders of entropy magnitude.
 
 
 
 
This provides one further means of establishing what size of sample string is necessary for identifying especially the low entropy values.