Paper title
Detecting silent data corruptions in the wild
Paper authors
Paper abstract
Silent errors within hardware devices occur when an internal defect manifests in a part of the circuit that has no check logic to detect the incorrect circuit operation. The results of such a defect can range from flipping a single bit in a single data value to causing the software to execute the wrong instructions. Silent data corruptions (SDCs) in hardware impact computational integrity for large-scale applications. Manifestations of silent errors are accelerated by datapath variations, temperature variance, and age, among other silicon factors. These errors do not leave any record or trace in system logs. As a result, silent errors stay undetected within workloads, and their effects can propagate across several services, causing problems to appear in systems far removed from the original defect. In this paper, we describe testing strategies to detect silent data corruptions within a large-scale infrastructure. Given the challenging nature of the problem, we experimented with different methods for detection and mitigation. We compare and contrast two such approaches: (1) Fleetscanner (out-of-production testing) and (2) Ripple (in-production testing). We evaluate the infrastructure tradeoffs associated with the silicon testing funnel across 3+ years of production experience.
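Because a silent error leaves no trace in system logs, one generic way to surface it is a known-answer test: run a deterministic computation and compare its result with a reference value recorded on trusted hardware. The Python sketch below illustrates that idea only. It is not the Fleetscanner or Ripple implementation, which the abstract does not describe, and every name in it (`workload`, `digest`, `check_once`, `flip_bit`) is hypothetical.

```python
import hashlib
from typing import Callable

MASK64 = 0xFFFFFFFFFFFFFFFF


def workload(seed: int, rounds: int = 1000) -> bytes:
    """Deterministic integer workload (hypothetical stand-in for a real test).

    The same seed must always produce the same bytes on healthy hardware.
    """
    state = seed & MASK64
    out = bytearray()
    for _ in range(rounds):
        # 64-bit linear congruential step with fixed constants.
        state = (state * 6364136223846793005 + 1442695040888963407) & MASK64
        out += state.to_bytes(8, "little")
    return bytes(out)


def digest(data: bytes) -> str:
    """Compact fingerprint of a result, so references are cheap to store."""
    return hashlib.sha256(data).hexdigest()


def check_once(seed: int, expected: str,
               run: Callable[[int], bytes] = workload) -> bool:
    """Return True if the run matches the reference digest, False otherwise.

    A False result flags a possible silent corruption on the machine under test.
    """
    return digest(run(seed)) == expected


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit corruption, for exercising the check itself."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)
```

In this sketch the reference would be computed once as `digest(workload(42))` on a machine assumed to be healthy. A later `check_once(42, reference)` on a machine under test passes only if every byte matches, and wrapping the workload with `flip_bit` shows that a single flipped bit is enough to fail the check. Detecting real defects would additionally require workloads that exercise the affected datapaths under varying conditions, which this example does not attempt.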