DATA AUGMENTATION FOR NATURAL LANGUAGE PROCESSING
DOI:
https://doi.org/10.69511/ijdsaa.v6i6.239
Keywords:
Natural Language Processing, Data Augmentation, Easy Data Augmentation, Machine Learning, Performance
Abstract
Recent years have seen a rise in the use of data augmentation approaches in natural language processing (NLP) to build more robust models. Interest in data augmentation for NLP has grown because of new tasks, more work in low-resource domains, and the popularity of large-scale neural networks, which need large amounts of training data. Despite this recent interest, the area remains relatively under-researched, possibly because of the challenges posed by the discrete nature of linguistic data. In this paper we compare four data augmentation approaches (Easy Data Augmentation (EDA), back-translation, Mixup, and generative models such as GPT-2 and BERT) on two datasets for the NLP tasks of sentiment classification and question classification, using accuracy, precision, recall and F1 score as evaluation metrics. We show that accuracy alone is not sufficient to choose the best model, and that the other evaluation metrics are also required, especially when the dataset is imbalanced. We also show that these data augmentation approaches perform well only in the low-data regime, and that the evaluation metrics begin to degrade as the amount of training data increases. We further find that the performance of back-translation depends on the intermediate language used for translation. Based on these findings, we make several recommendations for future work.
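The point about imbalanced datasets can be illustrated with a small worked example. The sketch below is not taken from the paper: the label counts and the helper name `binary_metrics` are hypothetical, chosen only to show how a classifier that always predicts the majority class can reach high accuracy while its precision, recall and F1 score for the minority class are all zero.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced test set: 95 negative examples, 5 positive examples.
y_true = [0] * 95 + [1] * 5

# A degenerate classifier that always predicts the majority (negative) class.
majority_pred = [0] * 100

metrics = binary_metrics(y_true, majority_pred)
print(metrics)
```

Here accuracy looks strong because 95 of the 100 predictions are correct, yet no positive example is ever found, so a comparison of augmentation methods based on accuracy alone would miss the failure on the minority class.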
Published
2025-07-26
How to Cite
Gupta, P., & Mahmood, M. (2025). DATA AUGMENTATION FOR NATURAL LANGUAGE PROCESSING. International Journal of Data Science and Advanced Analytics, 6(2), 352–359. https://doi.org/10.69511/ijdsaa.v6i6.239
Issue
Section
Articles
License
Copyright (c) 2024 Parul Gupta, Maha Mahmood

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

International Journal of Data Science and Advanced Analytics (IJDSAA) is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. This license allows users to copy, distribute, transmit and adapt an article, as long as the author is attributed and the article is not used for commercial purposes.
The author(s) confirm that:
- The manuscript submission has not been previously published, nor is it before another journal for consideration (or an explanation has been provided in Comments to the Editor).
- Permission for reproduction has been obtained for any published materials used in the manuscript.