Title
Informal Data Transformation Considered Harmful
Authors
Abstract
In this paper we take the common position that AI systems are limited more by the integrity of the data they learn from than by the sophistication of their algorithms, and we take the uncommon position that the solution to achieving better data integrity in the enterprise is not to clean and validate data ex post facto whenever needed (the so-called data lake approach to data management, which can lead to data scientists spending 80% of their time cleaning data), but rather to formally and automatically guarantee that data integrity is preserved as data is transformed (migrated, integrated, composed, queried, viewed, etc.) throughout the enterprise, so that data, and the programs that depend on that data, need not constantly be re-validated for every particular use.
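The contrast the abstract draws can be illustrated with a toy sketch (our own illustration, not code from the paper; all names here are hypothetical): instead of each consumer re-validating data after the fact, a transformation carries a declared integrity constraint as a pre- and post-condition, so the invariant is guaranteed to hold after migration and downstream uses need not re-check it.

```python
# Hypothetical sketch: a transformation that preserves a declared
# integrity constraint, rather than leaving validation to each consumer.
from dataclasses import dataclass

@dataclass(frozen=True)
class Employee:
    name: str
    dept: str

def check_integrity(employees, departments):
    """Invariant: every employee's dept must exist in departments
    (a foreign-key-style constraint)."""
    return all(e.dept in departments for e in employees)

def migrate(employees, departments, rename):
    """Rename departments, applying the same renaming to both tables.
    The invariant is checked on entry and re-established by construction,
    so consumers of the result need not re-validate it."""
    assert check_integrity(employees, departments)  # pre-condition
    new_departments = {rename.get(d, d) for d in departments}
    new_employees = [Employee(e.name, rename.get(e.dept, e.dept))
                     for e in employees]
    assert check_integrity(new_employees, new_departments)  # post-condition
    return new_employees, new_departments

emps = [Employee("Alice", "R&D"), Employee("Bob", "Sales")]
depts = {"R&D", "Sales"}
new_emps, new_depts = migrate(emps, depts, {"R&D": "Research"})
```

In the data lake approach, by contrast, `check_integrity` would be rediscovered and re-run by every data scientist who consumes the migrated tables.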