Title
Data Engineering for HPC with Python
Authors
Abstract
Data engineering is becoming an increasingly important part of scientific discovery with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movement. One goal of data engineering is to transform raw data into the vector/matrix/tensor formats accepted by deep learning and machine learning applications. Many structures, such as tables, graphs, and trees, can represent data in these data engineering phases. Among them, tables are a versatile and commonly used format for loading and processing data. In this paper, we present a distributed Python API based on a table abstraction for representing and processing data. Unlike existing state-of-the-art data engineering tools written purely in Python, our solution adopts high-performance compute kernels in C++ together with an in-memory table representation exposed through Cython-based Python bindings. In the core system, we use MPI for distributed-memory computation, taking a data-parallel approach to processing large datasets on HPC clusters.
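To make the table abstraction concrete, the following is a minimal, illustrative sketch of an in-memory columnar table with a filter and an inner join, the kind of relational operations such an API typically exposes. The `Table` class and its method names are hypothetical, not the API described in the paper, and this pure-Python sketch omits the C++ kernels, Cython bindings, and MPI distribution that the paper's system provides.

```python
class Table:
    """Sketch of a columnar table: a dict mapping column name -> list of values.
    Illustrative only; not the paper's actual API."""

    def __init__(self, columns):
        self.columns = columns

    def num_rows(self):
        # All columns are assumed to have equal length.
        return len(next(iter(self.columns.values()), []))

    def filter(self, column, predicate):
        """Return a new Table keeping rows where predicate(value) is True."""
        keep = [i for i, v in enumerate(self.columns[column]) if predicate(v)]
        return Table({name: [vals[i] for i in keep]
                      for name, vals in self.columns.items()})

    def join(self, other, on):
        """Inner hash join with `other` on a shared key column."""
        # Build a hash index over the right table's key column.
        index = {}
        for i, key in enumerate(other.columns[on]):
            index.setdefault(key, []).append(i)
        # Prepare output columns: all of self's, plus other's non-key columns.
        out = {name: [] for name in self.columns}
        for name in other.columns:
            if name != on:
                out[name] = []
        # Probe the index with the left table's keys.
        for i, key in enumerate(self.columns[on]):
            for j in index.get(key, []):
                for name, vals in self.columns.items():
                    out[name].append(vals[i])
                for name, vals in other.columns.items():
                    if name != on:
                        out[name].append(vals[j])
        return Table(out)


left = Table({"id": [1, 2, 3], "x": [10, 20, 30]})
right = Table({"id": [2, 3, 4], "y": ["a", "b", "c"]})
joined = left.join(right, on="id")
# joined.columns == {"id": [2, 3], "x": [20, 30], "y": ["a", "b"]}
```

In a distributed setting of the kind the paper describes, each MPI rank would hold one partition of such a table and operations like the join would add a data-shuffling step across ranks before the local hash join.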