The Estimated Implications

Implications for Fundamental Understanding of Properties of Proteins

The convergence of linguistics and biology provides a framework for connecting biological information gathered in large numbers of studies, including both large-scale genome-wide experiments and more traditional small-scale experiments. The ultimate goal is to catalogue all the words occurring in genomic sequences, together with their meanings, in a "biological dictionary." Sophisticated statistical language models will be able to estimate the probability of a specific amino acid given its surrounding protein context. It will be possible to examine which combinations of amino acid sequences give a meaningful sentence, and to predict where spelling mistakes are inconsequential for function and where they will cause dysfunction.
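As a rough illustration of what "the probability of a specific amino acid within a protein context" could mean in practice, the sketch below fits a simple n-gram model with add-alpha smoothing. It is only one of many possible statistical language models and is not taken from the text; the function names, the choice of a trigram context, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def train_ngram(sequences, n=3):
    """Count how often each residue follows each (n-1)-residue context."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(n - 1, len(seq)):
            context = seq[i - n + 1:i]
            counts[context][seq[i]] += 1
    return counts


def residue_probability(counts, context, residue, alpha=1.0):
    """Estimate P(residue | context) with add-alpha smoothing over 20 residues."""
    seen = counts.get(context, Counter())
    total = sum(seen.values()) + alpha * len(AMINO_ACIDS)
    return (seen[residue] + alpha) / total


# Toy sequences, invented for illustration only (not real proteins).
toy_sequences = ["MKTAYIAKQR", "MKTAYLAKQR", "MKSAYIAKQR"]
model = train_ngram(toy_sequences, n=3)

p_likely = residue_probability(model, "TA", "Y")    # Y follows "TA" in the toy data
p_unlikely = residue_probability(model, "TA", "W")  # W never follows "TA"
```

In this toy setting a residue that was observed after a given context receives a higher probability than one that was not, which is the kind of signal that could flag a "spelling mistake" as unusual. A realistic model would need far more data and a richer notion of context.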

Cataloguing Biological Languages at Hierarchical Levels: Individual Proteins, Cell Types, Organs, and Related and Divergent Species

The language modeling approach is applicable to distinguishing biological systems at various levels, just as language varies among individuals, groups of individuals, and nations. At the most fundamental level, we aim to decipher the rules of a general biological language, i.e., to discover which aspects are common to all sequences. This will enhance our fundamental understanding of biological molecules, in particular how proteins fold and function. At the second level, we ask how differences in the concentrations, interactions, and activities of proteins result in the formation and function of different cell types, and ultimately of organs, within the same individual. This will allow us to understand the principles underlying cell differentiation. The third level will be to analyze the variations among individuals of the same species, such as single nucleotide polymorphisms. We may then understand how differences in characteristics, such as intelligence or predisposition to diseases, are encoded in the genome sequence. Finally, the most general level will be to analyze differences in the biological languages of different organisms with varying degrees of relatedness.

Ideally, all life on earth will be catalogued. The impact on understanding the complexity and evolution of species would be profound. Currently, it is estimated that there are 2-100 million species on earth. While it is not feasible to sequence the genomes of all of these species, language modeling may significantly speed up obtaining "practical" sequences (Figure F.8). One of the bottlenecks in genome sequencing is the step from draft to finished sequence, which requires error correction and the filling of gaps. However, if we define a vocabulary of the words for an organism from a partial or draft sequence, we should be able to predict blanks and correct mistakes by applying language modeling in both the forward and backward directions.
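The idea of predicting blanks from both directions can be sketched with two small bigram models, one trained on sequences as written and one trained on their reversals. This is a minimal illustration under stated assumptions, not the method behind Figure F.8: the helper names, the use of `N` as the gap symbol, the bigram context, and the toy reads are all hypothetical.

```python
from collections import Counter, defaultdict

NUCLEOTIDES = "ACGT"


def train_bigram(sequences):
    """Count how often each symbol follows each preceding symbol."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return counts


def prob(counts, context, symbol, alpha=1.0):
    """Add-alpha smoothed estimate of P(symbol | context)."""
    seen = counts.get(context, Counter())
    return (seen[symbol] + alpha) / (sum(seen.values()) + alpha * len(NUCLEOTIDES))


def fill_gaps(draft, forward, backward, gap="N"):
    """Replace each gap with the symbol that best fits its left and right neighbours."""
    filled = list(draft)
    for i, symbol in enumerate(draft):
        if symbol != gap:
            continue
        left = draft[i - 1] if i > 0 else gap
        right = draft[i + 1] if i + 1 < len(draft) else gap

        def score(candidate):
            s = 1.0
            if left != gap:   # forward model: what tends to follow the left neighbour
                s *= prob(forward, left, candidate)
            if right != gap:  # backward model: what tends to precede the right neighbour
                s *= prob(backward, right, candidate)
            return s

        filled[i] = max(NUCLEOTIDES, key=score)
    return "".join(filled)


# Toy "vocabulary" sequences and a draft with one unknown base (illustrative only).
known = ["GATTACA", "CATTAGA"]
forward = train_bigram(known)
backward = train_bigram([s[::-1] for s in known])
repaired = fill_gaps("GATNACA", forward, backward)
```

Multiplying the forward and backward scores is one simple way to combine evidence from both sides of a gap. With only single-symbol context the prediction is weak; longer contexts, or a vocabulary of longer "words", would be needed for anything beyond a toy example.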
