论文标题
了解与联接的近似查询处理的硬度
Understanding the hardness of approximate query processing with joins
论文作者
论文摘要
我们研究了涉及多个可能不同尺寸的表上的各种查询的近似查询处理(AQP)的硬度。如果查询结果为单个值(例如,计数,总和和计数(不同)),我们证明了AQP问题的最差信息理论理论理论下限,这些问题给出了参数$ε$和$δ$,并且返回的估计结果均在最大损失$δ$的错误结果的1+$ε$之内。特别是,在我们的结果中包含了在各种设置下的基础性估计的下限。从非正式的角度来看,我们的结果表明,对于带有连接的各种数据库查询,除非仅限于一组查询,否则总有保证的结果始终超过一个非常大的阈值,否则AQP算法返回准确近似值的信息量至少是最大桌子中的行数至少是线性的。在某些特殊情况下,类似的下限甚至可以容纳其他信息,例如Top-K重击球手和所有频率向量。如果查询结果不是单个数字的组,我们研究了任何近似算法所使用的信息量,该信息量没有报告任何不存在的组,并且不会错过大型总规模的组。我们的工作扩展了Alon,Gibbons,Matias和Szegedy [AGMS99]的工作。我们将我们的下限与Bernoulli采样所需的信息量进行比较,以提供准确的近似值。对于在相同大小的多个表上加入的计数查询,上限匹配下限,除非问题设置仅限于一组查询集,后者始终保证其结果超过了很大的阈值。
We study the hardness of Approximate Query Processing (AQP) of various types of queries involving joins over multiple tables of possibly different sizes. In the case where the query result is a single value (e.g., COUNT, SUM, and COUNT(DISTINCT)), we prove worst-case information-theoretic lower bounds for AQP problems that are given parameters $ε$ and $δ$, and return estimated results within a factor of 1+$ε$ of the true results with error probability at most $δ$. In particular, the lower bounds for cardinality estimation over joins under various settings are contained in our results. Informally, our results show that for various database queries with joins, unless restricted to the set of queries whose results are always guaranteed to be above a very large threshold, the amount of information an AQP algorithm needs for returning an accurate approximation is at least linear in the number of rows in the largest table. Similar lower bounds even hold for some special cases where additional information such as top-K heavy hitters and all frequency vectors are available. In the case of GROUP-BY where the query result is not a single number, we study the lower bound for the amount of information used by any approximation algorithm that does not report any non-existing group and does not miss groups of large total size. Our work extends the work of Alon, Gibbons, Matias, and Szegedy [AGMS99].We compare our lower bounds with the amount of information required by Bernoulli sampling to give an accurate approximation. For COUNT queries with joins over multiple tables of the same size, the upper bound matches the lower bound, unless the problem setting is restricted to the set of queries whose results are always guaranteed to be above a very large threshold.