上下文模式匹配

论文标题

上下文模式匹配

Contextual Pattern Matching

论文作者

Navarro, Gonzalo

论文摘要

关于索引重复字符串集合的研究集中在用于常规字符串收集的相同搜索问题上，尽管在这种情况下它们毫无意义。例如，当$ p $出现的所有位置列出所有位置时，基本模式匹配的查询都会产生巨大的输出，当$ p $出现在许多文档共享的区域中。所有这些发生基本上都是相同的。在本文中，我们提出了一个新查询，在这些集合中可以更合适，我们称{\ em上下文模式匹配}。此类型的基本查询还提供了上下文长度$ \ ell $以外的$ \ ell $，并要求报告所有{\ em Dintife}字符串的出现$ xpy $，带有$ | x | = | = | y | = \ ell $。尽管此查询在最佳时间和线性空间中很容易解决，但我们专注于使用与文本集合的重复性相关的空间，并介绍此类解决方案的第一个解决方案。 Letting $\ovr$ be the maximum of the number of runs in the BWT of the text $T[1..n]$ and of its reverse, our structure uses $O(\ovr\log(n/\ovr))$ space and finds the $c$ contextual occurrences $XPY$ of $(P,\ell)$ in time $O(|P| + c \log n)$.我们还为压缩和未压缩的索引提供了其他时空折衷。

The research on indexing repetitive string collections has focused on the same search problems used for regular string collections, though they can make little sense in this scenario. For example, the basic pattern matching query "list all the positions where pattern $P$ appears" can produce huge outputs when $P$ appears in an area shared by many documents. All those occurrences are essentially the same. In this paper we propose a new query that can be more appropriate in these collections, which we call {\em contextual pattern matching}. The basic query of this type gives, in addition to $P$, a context length $\ell$, and asks to report the occurrences of all {\em distinct} strings $XPY$, with $|X|=|Y|=\ell$. While this query is easily solved in optimal time and linear space, we focus on using space related to the repetitiveness of the text collection and present the first solution of this kind. Letting $\ovr$ be the maximum of the number of runs in the BWT of the text $T[1..n]$ and of its reverse, our structure uses $O(\ovr\log(n/\ovr))$ space and finds the $c$ contextual occurrences $XPY$ of $(P,\ell)$ in time $O(|P| + c \log n)$. We give other space/time tradeoffs as well, for compressed and uncompressed indexes.

下载PDF全文

下载文献需遵守相关版权规定

论文标题