Mining approximate repeating patterns from sequence data with gap constraints

0.00 Avg rating—0 Votes

Article ID:	iaor201112700
Volume:	27
Issue:	3
Start Page Number:	336
End Page Number:	362
Publication Date:	Aug 2011
Journal:	Computational Intelligence
Authors:	Wu Xindong, Zhu Xingquan, He Dan
Keywords:	pattern recognition, sequencing, DNA

Abstract:

The rapid increase of available DNA, protein, and other biological sequences has made the problem of discovering meaningful patterns from sequences an important task for Bioinformatics research. Among all types of patterns defined in the literature, the most challenging one is to find repeating patterns with gap constraints. In this article, we identify a new research problem for mining approximate repeating patterns (ARPs) with gap constraints, where the appearance of a pattern is subject to an approximate match, which is very common in biological sequences. To solve the problem, we propose an ArpGap (ARP mining with Gap constraints) algorithm with three major components for ARP mining: (1) a data-driven pattern generation approach to avoid generating unnecessary candidates for validation; (2) a back-tracking pattern search process to discover approximate occurrences of a pattern under user specified gap constraints; and (3) an Apriori-like deterministic pruning approach to progressively prune patterns and cease the search process if necessary. Experimental results on synthetic and real-world protein sequences assert that ArpGap is efficient in terms of memory consumption and computational cost. The results further suggest that the proposed method is practical for discovering approximate patterns for protein sequences where the sequence length is usually several hundreds to one thousand and the pattern length is relatively short.

Reviews

Required fields are marked *. Your email address will not be published.