Article ID: | iaor201112700 |
Volume: | 27 |
Issue: | 3 |
Start Page Number: | 336 |
End Page Number: | 362 |
Publication Date: | Aug 2011 |
Journal: | Computational Intelligence |
Authors: | Wu Xindong, Zhu Xingquan, He Dan |
Keywords: | pattern recognition, sequencing, DNA |
The rapid increase of available DNA, protein, and other biological sequences has made the problem of discovering meaningful patterns from sequences an important task for Bioinformatics research. Among all types of patterns defined in the literature, the most challenging one is to find repeating patterns with gap constraints. In this article, we identify a new research problem for mining approximate repeating patterns (ARPs) with gap constraints, where the appearance of a pattern is subject to an approximate match, which is very common in biological sequences. To solve the problem, we propose an ArpGap (ARP mining with Gap constraints) algorithm with three major components for ARP mining: (1) a data-driven pattern generation approach to avoid generating unnecessary candidates for validation; (2) a back-tracking pattern search process to discover approximate occurrences of a pattern under user specified gap constraints; and (3) an Apriori-like deterministic pruning approach to progressively prune patterns and cease the search process if necessary. Experimental results on synthetic and real-world protein sequences assert that ArpGap is efficient in terms of memory consumption and computational cost. The results further suggest that the proposed method is practical for discovering approximate patterns for protein sequences where the sequence length is usually several hundreds to one thousand and the pattern length is relatively short.