DATA AUGMENTATION FOR NATURAL LANGUAGE PROCESSING

Authors

  • Parul Gupta, Tata Consultancy Services
  • Maha Mahmood, Iraqi Prime Minister's Office

DOI:

https://doi.org/10.69511/ijdsaa.v6i6.239

Keywords:

Natural Language Processing, Data Augmentation, Easy Data Augmentation, Machine Learning, Performance

Abstract

Recent years have seen a rise in the use of data augmentation approaches in natural language processing (NLP) to create more trustworthy models. Interest in data augmentation for NLP has grown because of new tasks, more work in low-resource domains, and the popularity of large-scale neural networks, which need large amounts of training data. Despite this recent growth, the area remains relatively under-researched, possibly because of the challenges that linguistic data presents. In this paper we compare four data augmentation approaches (Easy Data Augmentation (EDA), backtranslation, mixup, and generative models such as GPT-2 and BERT) on two datasets for the NLP tasks of sentiment classification and question classification, using accuracy, precision, recall, and F1 score as evaluation metrics. We show that accuracy alone is not sufficient for choosing the best model and that the other evaluation metrics are also required, especially when the dataset is imbalanced. We also show that these data augmentation approaches perform well only in the low-data regime, and that the evaluation metrics begin to degrade as the amount of training data increases. Further, we find that the performance of backtranslation depends on the language used for translation. Based on these findings, we make several recommendations for future research.
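
The point about imbalanced datasets can be illustrated with a minimal sketch. The labels below are hypothetical toy data, not results from the paper, and the helper name binary_metrics is an assumption introduced only for this illustration. It uses the standard definitions of accuracy, precision, recall, and F1 for a binary task.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when the class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced test set: 95 negative examples, 5 positive examples.
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the majority (negative) class.
y_pred = [0] * 100

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case the majority-class predictor reaches high accuracy while its precision, recall, and F1 for the minority class are all zero, which is why accuracy alone can be misleading when comparing augmentation methods on imbalanced data.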

Published

2025-07-26

How to Cite

Gupta, P., & Mahmood, M. (2025). DATA AUGMENTATION FOR NATURAL LANGUAGE PROCESSING. International Journal of Data Science and Advanced Analytics, 6(2), 352–359. https://doi.org/10.69511/ijdsaa.v6i6.239

Section

Articles