Estimating change‐points in biological sequences via the cross‐entropy method

Article ID:	iaor20119390
Volume:	189
Issue:	1
Start Page Number:	155
End Page Number:	165
Publication Date:	Sep 2011
Journal:	Annals of Operations Research
Authors:	Kroese P, Evans E, Sofronov Y, Keith M
Keywords:	cross-entropy

Abstract:

The genomes of complex organisms, including the human genome, are known to vary in GC content along their length. That is, they vary in the local proportion of the nucleotides G and C, as opposed to the nucleotides A and T. Changes in GC content are often abrupt, producing well‐defined regions. We model DNA sequences as a multiple change‐point process in which the sequence is separated into segments by an unknown number of change‐points, with each segment assumed to have been generated by a different process. Multiple change‐point problems are important in many biological applications, particularly in the analysis of DNA sequences, and they also arise in the segmentation of protein sequences according to hydrophobicity. We use the Cross‐Entropy method to estimate the positions of the change‐points. Parameters of the process for each segment are approximated with maximum likelihood estimates. Numerical experiments illustrate the effectiveness of the approach. We obtain estimates of the locations of change‐points in artificially generated sequences and compare the accuracy of these estimates with those obtained via other methods, such as IsoFinder (2004) and Markov Chain Monte Carlo. Lastly, we provide examples with real data sets to illustrate the usefulness of our method.
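
Illustration (not from the article): the abstract describes estimating change‐point positions with the Cross‐Entropy method while fitting each segment's parameters by maximum likelihood. The following is a minimal sketch of that general idea, not the authors' implementation. It rests on several assumptions that the abstract does not state: the sequence is reduced to a 0/1 indicator (1 for G or C), each segment is treated as independent Bernoulli draws, the number of change‐points `k` is fixed in advance (the article treats it as unknown), and candidate positions are sampled from independent normal distributions. All function and parameter names are hypothetical.

```python
import math
import random

def segment_loglik(prefix, start, end):
    """Bernoulli log-likelihood of x[start:end], evaluated at its MLE
    (the fraction of ones in the segment)."""
    n = end - start
    ones = prefix[end] - prefix[start]
    if ones == 0 or ones == n:
        return 0.0  # MLE is 0 or 1, so the likelihood equals 1
    p = ones / n
    return ones * math.log(p) + (n - ones) * math.log(1 - p)

def total_loglik(prefix, cps, n):
    """Sum of per-segment log-likelihoods for sorted change-points cps."""
    bounds = [0] + list(cps) + [n]
    return sum(segment_loglik(prefix, a, b) for a, b in zip(bounds, bounds[1:]))

def ce_change_points(x, k, n_samples=200, n_elite=20, n_iter=40, seed=0):
    """Cross-Entropy search for k change-points in a 0/1 sequence x.
    Assumption: k is known; positions are drawn from normal distributions."""
    rng = random.Random(seed)
    n = len(x)
    prefix = [0]
    for v in x:
        prefix.append(prefix[-1] + v)  # prefix sums give O(1) segment counts
    mu = [n * (i + 1) / (k + 1) for i in range(k)]  # evenly spaced start
    sigma = [n / (k + 1)] * k                       # deliberately wide
    best_score, best_cps = -math.inf, None
    for _ in range(n_iter):
        scored = []
        for _ in range(n_samples):
            cps = sorted(min(n - 1, max(1, round(rng.gauss(m, s))))
                         for m, s in zip(mu, sigma))
            if len(set(cps)) < k:
                continue  # skip candidates with coinciding change-points
            scored.append((total_loglik(prefix, cps, n), cps))
        if not scored:
            break
        scored.sort(key=lambda t: t[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best_cps = scored[0]
        elite = [cps for _, cps in scored[:n_elite]]
        for j in range(k):  # refit the sampling distribution to the elite set
            vals = [cps[j] for cps in elite]
            mu[j] = sum(vals) / len(vals)
            sigma[j] = math.sqrt(sum((v - mu[j]) ** 2 for v in vals) / len(vals))
        if max(sigma) < 0.5:
            break  # sampling distribution has effectively collapsed
    return best_cps

# Synthetic example: three segments with different GC proportions.
data_rng = random.Random(1)
true_cps = [250, 700]
probs = [0.2, 0.8, 0.3]
bounds = [0] + true_cps + [1000]
x = [1 if data_rng.random() < p else 0
     for p, a, b in zip(probs, bounds, bounds[1:]) for _ in range(b - a)]
estimate = ce_change_points(x, k=2)
print("true:", true_cps, "estimated:", estimate)
```

In this sketch the score of a candidate segmentation is its log‐likelihood with each segment's Bernoulli parameter replaced by its maximum likelihood estimate, and each iteration refits the sampling means and standard deviations to the best‐scoring candidates. Handling an unknown number of change‐points, as the article does, would need an additional mechanism that is not shown here.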
