The Transforming Strategy

One test for the convergence of technologies is that their methods are interchangeable; that is, language technologies should be directly applicable to biological sequences. To date, many computational methods used extensively in language modeling, including hidden Markov models, neural networks, and other machine learning algorithms, have been applied successfully to biological sequences, demonstrating the utility of the methodology. The next step is to explore fully the linguistically inspired analysis of biological sequences. To this end, the Carnegie Mellon and Cambridge Statistical Language Modeling (SLM) Toolkit, used for natural language modeling and speech recognition in more than 40 laboratories worldwide, was applied to protein sequences: the 20 amino acids were treated as words, and each protein sequence in an organism as a sentence in a book. Two exemplary results are described here.

1. In human languages, frequent words usually do not reveal the content of a text (e.g., "I", "and", "the"). However, abnormalities in the usage of frequent words in a particular text, as compared to others, can be a signature of that text. For example, in Mark Twain's "Tom Sawyer", the word "Tom" is among the 10 most frequently used words. When the SLM toolkit was applied to the protein sequences of 44 different organisms (bacterial, archaeal, human), specific n-grams (runs of n consecutive amino acids) were found to be very frequent in one organism while being rare or absent in all the other organisms. This suggests that there are organism-specific phrases that can serve as "genome signatures."

2. In human languages, rare events reveal the content of a text. Analysis of the distribution of rare and frequent n-grams over a particular protein sequence, that of lysozyme, a model system for protein folding studies, showed that the locations of rare n-grams correlate with nucleation sites for protein folding that have been identified experimentally (Klein-Seetharaman 2002). This striking observation suggests that rare events in biological sequences have a status for the folding of proteins similar to that of rare words for the topic of a text.
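The two analyses above can be illustrated with a minimal sketch. It is not the SLM Toolkit and not the authors' procedure; it is plain n-gram counting in Python under stated assumptions. The function names, the cutoffs min_freq and max_other_freq, and the toy sequences are all hypothetical and chosen only for illustration. The first part looks for n-grams that are frequent in one organism and rare or absent in the others; the second reports, for each position in a sequence, how often the n-gram starting there occurs in a background collection, so that low values mark rare n-grams.

```python
from collections import Counter


def ngram_counts(sequences, n):
    """Count overlapping n-grams (runs of n residues) across protein sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def relative_frequencies(counts):
    """Convert raw counts to fractions of all n-grams observed."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def signature_ngrams(proteomes, n, min_freq, max_other_freq):
    """Per organism, list n-grams frequent there but rare or absent elsewhere.

    min_freq and max_other_freq are hypothetical cutoffs, not values from the text.
    """
    freqs = {org: relative_frequencies(ngram_counts(seqs, n))
             for org, seqs in proteomes.items()}
    signatures = {}
    for org, own in freqs.items():
        others = [f for name, f in freqs.items() if name != org]
        signatures[org] = sorted(
            gram for gram, p in own.items()
            if p >= min_freq
            and all(f.get(gram, 0.0) <= max_other_freq for f in others)
        )
    return signatures


def ngram_profile(sequence, background_counts, n):
    """Background count of the n-gram starting at each position of a sequence.

    Low values mark windows that are rare in the background collection.
    """
    return [background_counts.get(sequence[i:i + n], 0)
            for i in range(len(sequence) - n + 1)]


# Made-up toy "proteomes"; real input would be full sets of protein sequences.
toy_proteomes = {
    "org_a": ["MKKLLA", "KKLLMA"],
    "org_b": ["MAGAGA", "GAGAMA"],
}
signatures = signature_ngrams(toy_proteomes, n=2, min_freq=0.15, max_other_freq=0.0)
profile = ngram_profile("MAGAKK", ngram_counts(toy_proteomes["org_b"], 2), 2)
print(signatures)
print(profile)
```

With real data, the thresholds and the value of n would have to be chosen with care, and raw counts would likely need to be compared against what is expected from amino acid composition alone; the sketch only shows the bookkeeping involved.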

These two examples, describing the usage of rare and frequent "words" and "phrases" in biology and in language, clearly demonstrate that the convergence of computational linguistics and biological chemistry yields important information about the mapping between sequence and biological function. This was observed even with the simplest of computational methods, statistical n-gram analysis. In the following, examples of the potential benefits of such information for improving human health and performance are described.
