An introduction to T-codes
(last updated 21 Aug 2008)
 
T-codes: a class of variable-length self-synchronising codes:

Imagine you are at the receiving end of a communications channel. All you see are the 1’s and 0’s of the binary encoded data stream, and you wish to make sense of this. 

As a rule, the symbols are not interpreted individually, but in groups. Grouping bits together to form words enriches the range of signaling representations, just as in English we group letters from our alphabet to form words. It is the richness of the variety of words which then enables us to communicate effectively.


Our problem is to identify correctly the symbol groupings in order to make sense of the message. This problem is often referred to as word synchronisation. Modern communications systems often use fixed-length words, for then, once the decoder is synchronized, it is a simple matter to count off the bits associated with each word.

There are costs associated with synchronisation. One is the overhead required to establish synchronization every time it is lost. The second is that, once synchronization is lost, it may take some time to recover it, leading to significant data loss. A compromise must be reached between the synchronization delay, with its associated data loss, and the signaling overhead required to establish synchronization quickly.

In written English a ‘space’ character is used to frame words, occurring on average every 5-6 characters, implying a cost of some 15-20% of the available bandwidth.

But we would like to achieve synchronization without necessarily compromising robustness. To help us identify the symbol groupings in English text we use a ‘space’ delimiter. In effect the space character establishes decoder synchronization: every time we see the space we ‘know’ we are at the end of a word.

Wemaywellbeabletomakesenseofasentencethatdoesnotincludespacesbutthespacescertainlymakeiteasier.

The ‘space’ character not only serves as a synchronization marker but also helps to resolve ambiguities that are implicit in our language. For example, without spaces the word “together” might well be read as “to get her”.

A further cost, though not always apparent, is an inefficiency inherent in assigning fixed word sizes to items that may not all occur with equal frequency. Huffman coding addresses this with a construction that invariably results in variable-length codes. Variable-length codes such as these are fine until an error occurs in the bit stream, at which point decoder synchronization is almost certainly lost.

Of importance then, in the area of communications, is the detection and correction of errors. However, the associated costs must be balanced against the efficiency gains from using, for example, a Huffman coding scheme. There is never any free lunch....

On the other hand, many applications don’t require 100% integrity. Human communications have evolved to cope with unreliable exchanges of information, and much of our data can tolerate a degree of loss. For example, the popular mp3 music coding and jpeg image compression schemes dispense with some of the data to achieve efficiency gains. Speech, music, video, and to some extent text, may all tolerate some degree of loss. Ironically, it is generally the computing and communications machinery we rely on that behaves unpredictably, even catastrophically, in the face of errors, so the drive for perfection is invariably a consequence of the design choices implicit in the data processing machinery.

This need not be the case. The T-codes are an example of a coding solution suited to communications or data storage applications exposed to error processes, where robustness and stability of behaviour in the face of noise and disruption may be more important than absolute data integrity.

Ironically, fixed-length codes cannot achieve the kind of robustness available with the variable-length T-codes, so variable-length coding turns out to be a prerequisite. We may capitalise on the opportunity to make efficiency gains, though this is not the main point of our discussion. We simply observe that shorter words ought to be assigned to items that are communicated frequently, with the longer words retained for less frequently occurring items. English has evolved towards this ideal: short words such as ‘the’ and ‘a’ occur frequently, whereas longer words like ‘tintinnabulation’ are rather less frequently used.

So let’s begin by considering the most basic of alphabets, the binary alphabet {0,1}, and think about how to form code words that would allow us to identify the correct word groupings in the message stream. We take our lead from English text, where the ‘space’ is reserved as a marker for the end of each word. Let’s interpret the ‘0’ character as such a marker.

We may represent the alphabet graphically as in the following figure. In traversing the inverted tree from its root node (at the top), the convention will be to interpret a branch to the left as a ‘0’, and a branch to the right as a ‘1’.

The sets to the right of the alphabet have been constructed using a simple two-step rule. Rule 1: copy the initial set k times. Rule 2: append the copy (or copies) at a selected ‘prefix’ node. Here the prefix code word for each of the three examples is ‘1’.

When more than a single copy is made, as for example in the two rightmost examples, these copies are each appended in turn to the identical prefix position in the prior copy. 

In this respect the construction is entirely characterised by the prefix node and the copy factor.
For completeness the code tree always includes the all-1’s code word formed by repeating the prefix k+1 times. Though this code word contributes nothing to the synchronization process, it enables us to preserve set completeness under recursive application of our construction rules.
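The rule just described lends itself to a few lines of code. The following Python sketch is ours rather than part of any published T-code software; the names t_augment, code_set, prefix and k are illustrative, and code words are represented simply as bit strings.

    def t_augment(code_set, prefix, k=1):
        # One round of T-augmentation: every code word other than the chosen
        # prefix is kept with the prefix prepended 0..k times, and the word
        # formed by repeating the prefix k+1 times is added for completeness.
        assert prefix in code_set
        new_set = {prefix * (k + 1)}               # the repeated-prefix word, e.g. '11'
        for i in range(k + 1):                     # i = 0 keeps the original non-prefix words
            for word in code_set:
                if word != prefix:
                    new_set.add(prefix * i + word)
        return new_set

    # Starting from the binary alphabet with prefix '1' and a single copy (k = 1):
    print(sorted(t_augment({'0', '1'}, '1')))      # ['0', '10', '11']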

You will notice that, for each of the three new sets shown to the right of the initial alphabet, it suffices to receive a ‘0’, i.e. the non-prefix character, to know that a code word has just ended. The ‘0’ unequivocally marks the terminating symbol of a word, just as the ‘space’ character does in English text.

The only code word which does not end this way is the one formed from the repeated prefix, ‘1...1’. A repetitive run of 1’s inevitably delays synchronization but in fact decodes correctly, at least until the point at which synchronization is established.
We might have elected to use the ‘0’ as our prefix to obtain an alternative series of code sets as illustrated to the left.

Thus our construction rules achieve a code that allows the decoder to synchronize with no more effort than the decoding itself.

So far we are still left with a rather uninteresting set. What we may now consider is reapplying the same ‘set augmentation’ rules (referred to as T-augmentation) to one of the sets already created from the alphabet. With each new round of T-augmentation we select a new prefix, but with each enlargement of the set, i.e. each T-augmentation, we multiply up the number of ‘non-prefix’ patterns, all of which have the effect of the ‘space’ character in establishing correct synchronization. These come to dominate the set in number, resulting in an overwhelming opportunity for resynchronising the decoder if for some reason synchronization has been lost. Only the appearance of the prefix characters may delay resynchronization. Repeated application of the T-augmentation rule not only creates opportunities for synchronisation to occur but also creates code distributions more likely to be useful in practice.
 
The following example shows the application of T-augmentation a total of four times beginning with the binary alphabet. Here we have selected one of the smallest available words as a prefix for each level of T-augmentation. 

We also limit the number of copies, k_i, at each level to 1, ensuring that the depth of the tree tends to grow minimally with each T-augmentation. Constraining the prefix to the smallest available at each step, and limiting the number of copies to 1, is referred to as Simple T-augmentation. If we continue the process of Simple T-augmentation to exhaustion with respect to the prefix patterns of a given length L, the process is referred to as Exhaustive Simple T-augmentation. We observe later that these sets have interesting significance in terms of the bounds on set size and code distributions.
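Simple T-augmentation is then just the repeated application of the rule above with k = 1, taking a shortest available code word as the prefix each time. The sketch below builds on the t_augment function given earlier; the tie-breaking rule for choosing among equally short prefixes is our own, so it will not necessarily reproduce the exact set shown in the figure, although the set sizes agree.

    def simple_t_augmentation(levels):
        # Repeatedly T-augment the binary alphabet, each time taking a
        # shortest (and, as a tie-break, lexicographically smallest) word
        # as the prefix and making a single copy (k = 1).
        s = {'0', '1'}
        for _ in range(levels):
            prefix = min(s, key=lambda w: (len(w), w))
            s = t_augment(s, prefix, k=1)
        return s

    print(len(simple_t_augmentation(4)))   # 17: with k = 1 each round takes a set of size n to 2n - 1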

It is not unexpected that Exhaustive Simple T-augmentation creates code-word length distributions which, by the central limit theorem, tend towards the normal distribution. In an application it is the frequency with which the shortest code words are used that ultimately dominates the coding rate.

Enumerating all code words from the above rightmost set in rank order of size, and assigning the letters of the English alphabet in accordance with their expected usage, yields the assignment illustrated in the table to the left.

The letter frequencies, here taken from Wikipedia, are not strictly representative of encoded text since the space character has been overlooked. In English the ‘space’ is the most frequent character in text, occurring nearly 20% of the time. Amending the percentages to take account of the ‘space’ leads to an average cost for transmitting English text of around 4.2 bits/character, close to that achievable by Huffman coding. In contrast, however, the T-code is robust, and the cost of achieving this in terms of encoding efficiency is negligible. There is almost certainly no need for error control here.
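As a rough check on a figure like the 4.2 bits/character quoted above, the average cost is simply the frequency-weighted code-word length when the shortest words are assigned to the most frequent symbols. The sketch below assumes a caller-supplied table of symbol probabilities (including the ‘space’); the function name and arguments are illustrative only.

    def expected_bits_per_character(code_words, probabilities):
        # Assign the shortest code words to the most frequent symbols and
        # return the average code-word length in bits per symbol.
        # Assumes there are at least as many code words as symbols.
        words = sorted(code_words, key=len)                        # shortest first
        symbols = sorted(probabilities, key=probabilities.get, reverse=True)
        total = float(sum(probabilities.values()))
        return sum(probabilities[sym] / total * len(word)
                   for sym, word in zip(symbols, words))

Called with the set from the table and letter frequencies adjusted to include the ‘space’, this is essentially the calculation behind the figure quoted above.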

It has to be said that all variable-length codes exhibit some finite probability of synchronizing randomly following an error, but for the vast majority this will be slow and unpredictable, whereas the T-codes exhibit very quick synchronization: typically 2-3 code words is all it takes. The synchronisation delay tends to grow approximately logarithmically with set size, so it is possible to have very large sets indeed without seriously compromising the synchronization performance. Situations where synchronization may be delayed arise only from repetitions of prefix patterns, but as we have noted the decoder decodes correctly during such prolonged delays, making a single error at the point of resynchronization for each intermediate level of T-augmentation. Allowing for a possible error at the point when synchronization is lost, the total number of decoding errors which may occur during resynchronization is thus bounded by (q+1), where q is the number of T-augmentation steps used to construct the set.

For the code set in our table the maximum number of errors is 6 code words, but the average will be close to 2.4, i.e. about half the length of the average English word. With high likelihood, one should be able to read around any small error.

Examples of self-synchronization:
It is sufficient to demonstrate this by encoding the first two words of:
    the quick brown fox jumps ... 

Using ‘|’ to mark the code word boundaries we obtain for ‘the quick...’:
    0010|00010|011|010|1100011|11011|1010|11010|1000010| ...

Consider a late decoding start. We show the incorrectly decoded letters in red
and mark the point where synchronization occurs with ‘|’.
Example 1 (missing 1st bit):
    010|00010|011|010|1100011|11011|1010|11010|1000010| ...
    <spc>  H  E  <spc>  Q  U  I  C  K  ...

Example 2 (missing 2 bits):
    1000010|011|010|1100011|11011|1010|11010|1000010| ...
    K  E  <spc>  Q  U  I  C  K  ...

Example 3 (missing 3 bits):
    0000|10011|010|1100011|11011|1010|11010|1000010| ...
    O  L  <spc>  Q  U  I  C  K  ...
   and so on.

Now let’s consider additive errors (bit inversion)
Example 4 (1st bit inversion):
    1010|00010|011|010|1100011|11011|1010|11010|1000010| ...
    I  H  E  <spc>  Q  U  I  C  K  ...

Example 5 (2nd bit inversion):
    011|0000|10011|010|1100011|11011|1010|11010|1000010| ...
    E  O  L  <spc>  Q  U  I  C  K  ...

Example 6 (3rd bit inversion):
    0000|00010|011|010|1100011|11011|1010|11010|1000010| ...
    O  H  E  <spc>  Q  U  I  C  K  ...
  and so on.

Now consider timing errors (missing or added bits)
Example 7 (duplicate bit in 4th place):
    0010|0000|10011|010|1100011|11011|1010|11010|1000010| ...
    T  O  L  <spc>  Q  U  I  C  K  ...
and so on.

The bit patterns that can delay synchronization are ‘0000’, ‘0101’ and ‘1111’.

Thus a sequence of the form:
    ‘111111...11110000000...0000001010...10101111111111...11111...’
may delay synchronization but will decode correctly with no more than 6 errors!
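The behaviour in these examples is easy to reproduce with a greedy decoder for a prefix-free code. The Python sketch below uses only the partial code-word table that can be read off the encoded stream above (it is not the complete set from the table, so it simply gives up on any bits it cannot match) and replays Example 1 by dropping the first bit.

    # Partial code-word table read off the encoded example above.
    CODE = {'0010': 't', '00010': 'h', '011': 'e', '010': ' ',
            '1100011': 'q', '11011': 'u', '1010': 'i', '11010': 'c', '1000010': 'k'}
    MAX_LEN = max(len(w) for w in CODE)

    def decode(bits):
        # Greedy decoding of a prefix-free code: accumulate bits until they
        # match a code word, emit its symbol, and start on the next word.
        out, word = [], ''
        for b in bits:
            word += b
            if word in CODE:
                out.append(CODE[word])
                word = ''
            elif len(word) > MAX_LEN:
                raise ValueError('bits not in the partial table: ' + word)
        return ''.join(out)

    stream = '0010' '00010' '011' '010' '1100011' '11011' '1010' '11010' '1000010'
    print(repr(decode(stream)))       # 'the quick'
    print(repr(decode(stream[1:])))   # ' he quick' -- Example 1: back in step after the first code word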

For additional material on the T-codes please see the Papers page: Papers.html