Paper Title
Range-Net: A High Precision Streaming SVD for Big Data Applications
Paper Authors
Paper Abstract
In a Big Data setting, computing the dominant SVD factors is restrictive due to the main memory requirements. Recently introduced streaming Randomized SVD schemes work under the restrictive assumption that the singular value spectrum of the data has exponential decay. This is seldom true for practical data. Although these methods are claimed to be applicable to scientific computations due to their associated tail-energy error bounds, the approximation errors in the singular vectors and values are high when this assumption does not hold. Furthermore, from a practical perspective, oversampling can still be memory intensive or, worse, can exceed the feature dimension of the data. To address these issues, we present Range-Net as an alternative to randomized SVD that satisfies the tail-energy lower bound given by the Eckart-Young-Mirsky (EYM) theorem. Range-Net is a deterministic two-stage neural optimization approach with random initialization, where the main memory requirement depends explicitly on the feature dimension and the desired rank, independent of the sample dimension. The data samples are read in a streaming setting, with the network minimization problem converging to the desired rank-r approximation. Range-Net is fully interpretable: all network outputs and weights have a specific meaning. We provide theoretical guarantees that the SVD factors extracted by Range-Net satisfy the EYM tail-energy lower bound at machine precision. Our numerical experiments on real data at various scales confirm this bound. A comparison against the state-of-the-art streaming Randomized SVD shows that Range-Net accuracy is better by six orders of magnitude while being memory efficient.
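For reference, the EYM tail-energy lower bound mentioned in the abstract can be written as follows. This is the standard Frobenius-norm statement of the theorem, not a formula quoted from the paper. The symbols are notation assumed here for illustration: $X$ is the data matrix with $m$ samples and $n$ features, $\sigma_i$ are its singular values, and $X_r$ is its truncated rank-$r$ SVD.

```latex
% Assumed notation (not from the abstract): X \in \mathbb{R}^{m \times n},
% singular values \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_{\min(m,n)} \ge 0.
\[
  \min_{\operatorname{rank}(B) \le r} \; \lVert X - B \rVert_F^2
  \;=\; \lVert X - X_r \rVert_F^2
  \;=\; \sum_{i = r+1}^{\min(m,n)} \sigma_i^2 ,
  \qquad
  X_r = U_r \Sigma_r V_r^{\top} .
\]
```

The right-hand sum is the tail energy. No rank-r approximation can have a smaller squared Frobenius error, so a method that attains this value at machine precision has recovered a best rank-r approximation.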