论文标题
计算天然基因组的重排距离
Computing the rearrangement distance of natural genomes
论文作者
论文摘要
在过去的25年中,基因组距离的计算一直是计算比较基因组学的非常活跃的领域。实质性结果包括汉嫩哈利和佩夫斯纳在1995年的反转距离的多项式时间可计算性以及Yancopoulos等人的双切割和联接(DCJ)距离的引入。然而,在2005年。这两个结果都取决于以下假设:比较中的基因组包含相同的独特标记集(同步基因组区域,有时也称为基因)。在2015年,Shao,Lin和Moret通过在分析中允许重复标记来放松这种情况。这种基因组距离问题的广义版本是NP-HARD,它们提供了一个足够有效的ILP解决方案,可以应用于现实世界数据集。对其方法的限制是,它只能应用于平衡基因组,这些基因组具有相等数量的任何标记物。因此,它仍然需要对输入数据进行微妙的预处理,其中必须去除过多的不平衡标记副本。 在本文中,我们提出了一种解决天然基因组基因组距离问题的算法,其中任何标记物可能发生任意次数。我们的方法基于一个新的图数据结构,即多关系图,该图允许Shao,Lin和Moret对ILP的优雅扩展,以计数一个在一个基因组中相对于另一个基因组而言不足或过度代表的标记,并且需要分别插入或删除。通过此扩展,对基因组构型的先前限制被取消,这是首次实现不妥协的重排分析。任何标记序列都可以直接用于距离计算。 对我们方法的评估表明,它可用于分析具有多达一万个标记的基因组,我们在模拟和真实数据上进行了证明。
The computation of genomic distances has been a very active field of computational comparative genomics over the last 25 years. Substantial results include the polynomial-time computability of the inversion distance by Hannenhalli and Pevzner in 1995 and the introduction of the double-cut and join (DCJ) distance by Yancopoulos et al. in 2005. Both results, however, rely on the assumption that the genomes under comparison contain the same set of unique markers (syntenic genomic regions, sometimes also referred to as genes). In 2015, Shao, Lin and Moret relax this condition by allowing for duplicate markers in the analysis. This generalized version of the genomic distance problem is NP-hard, and they give an ILP solution that is efficient enough to be applied to real-world datasets. A restriction of their approach is that it can be applied only to balanced genomes, that have equal numbers of duplicates of any marker. Therefore it still needs a delicate preprocessing of the input data in which excessive copies of unbalanced markers have to be removed. In this paper we present an algorithm solving the genomic distance problem for natural genomes, in which any marker may occur an arbitrary number of times. Our method is based on a new graph data structure, the multi-relational diagram, that allows an elegant extension of the ILP by Shao, Lin and Moret to count runs of markers that are under- or over-represented in one genome with respect to the other and need to be inserted or deleted, respectively. With this extension, previous restrictions on the genome configurations are lifted, for the first time enabling an uncompromising rearrangement analysis. Any marker sequence can directly be used for the distance calculation. The evaluation of our approach shows that it can be used to analyze genomes with up to a few ten thousand markers, which we demonstrate on simulated and real data.