Article ID: | iaor19971393 |
Country: | United States |
Volume: | 248 |
Issue: | 1 |
Start Page Number: | 1 |
End Page Number: | 18 |
Publication Date: | Jan 1995 |
Journal: | Journal of Molecular Biology |
Authors: | Snyder E.E., Stormo G.D. |
Keywords: | programming: dynamic |
The authors have developed a computer program, GeneParser, which identifies and determines the fine structure of protein genes in genomic DNA sequences. The program scores all subintervals in a sequence for content statistics indicative of introns and exons, and for sites that identify their boundaries. This information is weighted by a neural network to approximate the log-likelihood that each subinterval exactly represents an intron or exon (first, internal or last). A dynamic programming algorithm is then applied to these scores to find the combination of introns and exons that maximizes the likelihood function. Using this method, the authors can rapidly generate ranked suboptimal solutions, each of which is the optimum solution containing a given intron-exon junction. They have tested the system on a large collection of human genes. On sequences not used in training, the authors achieved a correlation coefficient for exon nucleotide prediction of 0.89. For a subset of G+C-rich genes, a correlation coefficient of 0.94 was achieved. They have also quantified the robustness of the method to substitution and frame-shift errors and show how the system can be optimized for performance on sequences with known levels of sequencing errors.
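The dynamic programming step described in the abstract can be illustrated with a minimal sketch. This is not GeneParser's implementation: it assumes a simplified model with only two segment kinds ("E" for exon, "I" for intron) that must alternate, starting and ending with an exon, and it takes the per-interval log-likelihoods as a ready-made dictionary rather than computing them with a neural network. The function name `best_parse`, the `score` dictionary layout, and the toy scores are all hypothetical.

```python
import math

def best_parse(n, score):
    """Find the highest-scoring exon/intron parse of a sequence of length n.

    score maps (kind, start, end) -> log-likelihood, where kind is "E" or "I"
    and [start, end) is a half-open interval. Intervals missing from the
    dictionary are treated as disallowed. (Hypothetical layout for this sketch.)
    """
    neg_inf = -math.inf
    other = {"E": "I", "I": "E"}
    # best[end][kind]: best total score of a parse of [0, end) whose last
    # segment has the given kind and stops exactly at `end`.
    best = [{"E": neg_inf, "I": neg_inf} for _ in range(n + 1)]
    back = [{"E": None, "I": None} for _ in range(n + 1)]
    best[0]["I"] = 0.0  # virtual start, so the first real segment is an exon

    for end in range(1, n + 1):
        for kind in ("E", "I"):
            for start in range(end):
                s = score.get((kind, start, end))
                if s is None:
                    continue
                cand = best[start][other[kind]] + s
                if cand > best[end][kind]:
                    best[end][kind] = cand
                    back[end][kind] = start

    if best[n]["E"] == neg_inf:
        return neg_inf, []  # no valid parse ends with an exon at position n

    # Trace back from the end of the sequence to recover the segments.
    segments = []
    end, kind = n, "E"
    while end > 0:
        start = back[end][kind]
        segments.append((kind, start, end))
        end, kind = start, other[kind]
    segments.reverse()
    return best[n]["E"], segments


# Toy log-likelihoods, invented purely for illustration.
toy_scores = {
    ("E", 0, 3): -1.0,
    ("I", 3, 7): -0.5,
    ("E", 7, 10): -1.0,
    ("E", 0, 10): -4.0,
    ("I", 3, 6): -2.0,
    ("E", 6, 10): -0.2,
}
total, parse = best_parse(10, toy_scores)
print(total, parse)
```

With these toy scores, the exon-intron-exon parse over [0, 3), [3, 7), [7, 10) wins over the single-exon parse and over the alternative with the intron ending at position 6. The abstract's distinction between first, internal and last exons, and its ranked suboptimal solutions constrained to contain a given junction, are deliberately left out of this sketch.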