DATA AUGMENTATION FOR NATURAL LANGUAGE PROCESSING

Parul Gupta; Maha Mahmood

doi:10.69511/ijdsaa.v6i6.239

Authors

Parul Gupta Tata Consultancy Services
Maha Mahmood Iraqi Prime Minister's Office

DOI:

https://doi.org/10.69511/ijdsaa.v6i6.239

Keywords:

Natural Language Processing, Data Augmentation, Easy Data Augmentation, Machine Learning, Performance

Abstract

Recent years have seen a rise in the use of data augmentation approaches in natural language processing (NLP) to create more trustworthy models. Interest in data augmentation for NLP has been driven by new tasks, more work in low-resource domains, and the popularity of large-scale neural networks, which require large amounts of training data. Despite this, research in the area remains limited, possibly because of the challenges posed by linguistic data. In this paper we compare four data augmentation approaches (easy data augmentation (EDA), backtranslation, Mix-up, and generative models such as GPT-2 and BERT) on two datasets for the NLP tasks of sentiment classification and question classification, using accuracy, precision, recall, and F1 score as evaluation metrics. We show that accuracy alone is not sufficient for choosing the best model and that the other evaluation metrics are also required, especially when the dataset is imbalanced. We also show that these augmentation approaches perform well only in the low-data regime, and that the evaluation metrics begin to degrade as the amount of training data increases. We further conclude that the performance of backtranslation depends on the language used for translation. Based on these findings, we make several recommendations for future work.
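
The point about accuracy on imbalanced datasets can be illustrated with a minimal sketch. The labels and predictions below are hypothetical and are not taken from the paper's datasets or results; the helper names `accuracy` and `precision_recall_f1` are likewise assumptions introduced for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical imbalanced test set: 95 negative examples, 5 positive.
y_true = [0] * 95 + [1] * 5

# Model A (hypothetical): always predicts the majority class.
pred_a = [0] * 100

# Model B (hypothetical): 6 false positives among the negatives,
# and 4 of the 5 positives found.
pred_b = [0] * 89 + [1] * 6 + [1] * 4 + [0] * 1

acc_a = accuracy(y_true, pred_a)
acc_b = accuracy(y_true, pred_b)
prf_a = precision_recall_f1(y_true, pred_a)
prf_b = precision_recall_f1(y_true, pred_b)

print("A: accuracy=%.2f precision=%.2f recall=%.2f f1=%.2f" % (acc_a, *prf_a))
print("B: accuracy=%.2f precision=%.2f recall=%.2f f1=%.2f" % (acc_b, *prf_b))
```

In this toy case model A has the higher accuracy (0.95 against 0.93) even though it never identifies a positive example, while model B has the higher F1 on the minority class. This is the kind of situation in which accuracy alone would pick the less useful model.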

Published

2025-07-26

How to Cite

Gupta, P., & Mahmood, M. (2025). DATA AUGMENTATION FOR NATURAL LANGUAGE PROCESSING. International Journal of Data Science and Advanced Analytics, 6(2), 352–359. https://doi.org/10.69511/ijdsaa.v6i6.239

Issue

Vol. 6 No. 2 (2024)

Section

Articles

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

International Journal of Data Science and Advanced Analytics (IJDSAA) is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. This license allows users to copy, distribute, transmit, and adapt an article, as long as the author is attributed and the article is not used for commercial purposes.

The author(s) confirm that:

The manuscript submission has not been previously published, nor is it before another journal for consideration (or an explanation has been provided in Comments to the Editor).
Permission for reproduction has been obtained for any published materials used in the manuscript.
