Paper Title

MIXCODE: Enhancing Code Classification by Mixup-Based Data Augmentation

Paper Authors

Zeming Dong, Qiang Hu, Yuejun Guo, Maxime Cordy, Mike Papadakis, Zhenya Zhang, Yves Le Traon, Jianjun Zhao

Paper Abstract

Inspired by the great success of Deep Neural Networks (DNNs) in natural language processing (NLP), DNNs have been increasingly applied to source code analysis and have attracted significant attention from the software engineering community. Due to its data-driven nature, a DNN model requires massive, high-quality labeled training data to achieve expert-level performance. Collecting such data is often not hard, but the labeling process is notoriously laborious. DNN-based code analysis worsens the situation further, because labeling source code also demands sophisticated expertise. Data augmentation has been a popular approach for supplementing training data in domains such as computer vision and NLP. However, existing data augmentation approaches in code analysis rely on simple methods, such as data transformation and adversarial example generation, and thus bring limited performance gains. In this paper, we propose MIXCODE, a data augmentation approach that aims to effectively supplement valid training data, inspired by the recent Mixup technique from computer vision. Specifically, we first utilize multiple code refactoring methods to generate transformed code whose labels remain consistent with those of the original data. Then, we adapt the Mixup technique to mix the original code with the transformed code to augment the training data. We evaluate MIXCODE on two programming languages (Java and Python), two code tasks (problem classification and bug detection), four benchmark datasets (JAVA250, Python800, CodRep1, and Refactory), and seven model architectures (including the two pre-trained models CodeBERT and GraphCodeBERT). Experimental results demonstrate that MIXCODE outperforms the baseline data augmentation approach by up to 6.24% in accuracy and 26.06% in robustness.
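The abstract only sketches the mixing step. For intuition, the following minimal Python sketch (not the authors' implementation) shows how a standard Mixup-style linear interpolation between the embedding of an original snippet and the embedding of a label-preserving refactored variant could look. The function name mixup_code_embeddings, the toy embeddings, and the Beta(α, α) sampling of the mixing coefficient are assumptions based on the original Mixup formulation, not details taken from the paper.

```python
import numpy as np

def mixup_code_embeddings(emb_original, emb_refactored, label_a, label_b, alpha=0.2):
    """Hypothetical Mixup-style interpolation between an original code sample
    and a refactored counterpart. Embeddings and one-hot labels are mixed
    linearly with a Beta(alpha, alpha)-distributed coefficient, following the
    standard Mixup formulation (an assumption, not the paper's exact recipe)."""
    lam = np.random.beta(alpha, alpha)                     # mixing coefficient lambda
    mixed_emb = lam * emb_original + (1.0 - lam) * emb_refactored
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_emb, mixed_label

# Toy usage: 4-dimensional "embeddings" and one-hot labels for a binary task
# (e.g., buggy vs. non-buggy).
emb_a = np.array([0.1, 0.4, -0.2, 0.7])   # embedding of the original snippet (made up)
emb_b = np.array([0.2, 0.3, -0.1, 0.6])   # embedding of its refactored variant (made up)
y_a = np.array([1.0, 0.0])
y_b = np.array([1.0, 0.0])                # refactoring preserves the label
x_mix, y_mix = mixup_code_embeddings(emb_a, emb_b, y_a, y_b)
```

When the two inputs share a label, as in the example above, the interpolated label collapses to that same label; the general form also covers mixing inputs whose labels differ, in which case the labels are interpolated as well.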
