Abstract
The genetic code maps the 64 possible triplet codons over a 4-letter alphabet (A, C, G, U) onto 20 amino acids and one STOP signal for protein synthesis. The pattern of degeneracies in codon assignments to amino acids suggests an underlying variable-length code that meets the optimal coding criteria of Shannon's information theory. The genetic code can be viewed as an instantaneous prefix code that is uniquely decipherable and compact. Using the Huffman algorithm, we have determined optimal codon assignments and average code lengths for 20 species from 10 groups in the phylogenetic tree from the available codon usage data. The average binary code length of the genetic code exceeds the optimal average length of the Huffman code by only 2 to 5%, showing that the genetic code is close to optimal. Functionally, however, the genetic code is a fixed-length code of 6 binary bits (3 bases, at 2 bits per base). This provides the redundancy (approximately 25%) needed to tolerate errors in DNA sequences due to mutations. This hybrid character of the genetic code, combining the advantages of variable-length and fixed-length codes, supports the speculation that the genetic code was once a variable-length code that evolved into its modern version. From an information-theoretic viewpoint, the DNA sequence bears striking similarities to linguistic discourses. Both are complex adaptive systems with intrinsic elements of order and randomness. The complexity parameter, which we defined earlier for linguistic discourses, is close to maximal in both DNA and natural languages. This article, the first of two parts, focuses on the variable-length genetic code. Part II treats the DNA sequence as a complex adaptive system and examines possible correlations of certain parameters with evolution.
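As a rough illustration of the kind of calculation summarized above, the following Python sketch builds a binary Huffman code over 21 symbols (20 amino acids plus STOP) and compares its average length with the Shannon entropy and with the fixed 6-bit codon length. It is a minimal sketch under stated assumptions, not the procedure used in this work: the function names are hypothetical, the weights in `toy_usage` are made-up placeholders rather than measured codon usage, and the abstract does not specify how the average binary length of the genetic code itself is computed, so that quantity is not reproduced here.

```python
import heapq
import math


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a binary Huffman code."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    # Heap entries are (weight, tie_breaker, {symbol: depth}); the unique
    # tie_breaker ensures the dictionaries are never compared.
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def average_length(freqs, lengths):
    """Frequency-weighted mean code length in bits per symbol."""
    total = sum(freqs.values())
    return sum(freqs[s] * lengths[s] for s in freqs) / total


def entropy(freqs):
    """Shannon entropy in bits per symbol."""
    total = sum(freqs.values())
    return -sum((w / total) * math.log2(w / total)
                for w in freqs.values() if w > 0)


def kraft_sum(lengths):
    """Sum of 2^-l over all code lengths; at most 1 for any prefix code."""
    return sum(2.0 ** -l for l in lengths.values())


if __name__ == "__main__":
    # ASSUMPTION: made-up placeholder weights, NOT measured codon usage.
    toy_usage = {
        "Leu": 10, "Ser": 8, "Ala": 8, "Gly": 7, "Val": 7, "Glu": 6,
        "Lys": 6, "Thr": 6, "Arg": 5, "Asp": 5, "Ile": 5, "Pro": 5,
        "Asn": 4, "Gln": 4, "Phe": 4, "Tyr": 3, "His": 2, "Met": 2,
        "Cys": 1, "Trp": 1, "STOP": 1,
    }
    lengths = huffman_code_lengths(toy_usage)
    print("Huffman average length (bits):", average_length(toy_usage, lengths))
    print("Shannon entropy (bits):      ", entropy(toy_usage))
    print("Kraft sum:                   ", kraft_sum(lengths))
    print("Fixed codon length (bits):   ", 3 * math.log2(4))
```

The Kraft sum is included because it gives a quick check of the prefix property mentioned above: a complete binary prefix code, such as a Huffman code, has a Kraft sum of exactly 1.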