Comparison of dataset oversampling algorithms and their applicability to the categorization problem

Teslenko, D.; Sorokina, A.; Khovrat, A.; Huliiev, N.; Kyriy, V.

Publication:
Comparison of dataset oversampling algorithms and their applicability to the categorization problem

dc.contributor.author	Teslenko, D.
dc.contributor.author	Sorokina, A.
dc.contributor.author	Khovrat, A.
dc.contributor.author	Huliiev, N.
dc.contributor.author	Kyriy, V.
dc.date.accessioned	2024-01-17T18:27:26Z
dc.date.available	2024-01-17T18:27:26Z
dc.date.issued	2023
dc.description.abstract	The subject of the research is the classification problem in machine learning when the classes in a dataset are imbalanced. The purpose of the work is to analyze existing solutions and algorithms for dataset imbalance across different data types and industries, and to compare the algorithms experimentally. The article addresses the following tasks: to analyze approaches to the problem (preprocessing methods, learning methods, hybrid methods, and algorithmic approaches); to define and describe the oversampling algorithms most often used to balance datasets; to select classification algorithms that serve as a tool for assessing the quality of balancing by checking the applicability of the datasets obtained after oversampling; to determine the metrics used to compare classification quality; and to conduct experiments according to the proposed methodology. For clarity, datasets with varying degrees of imbalance were considered: the number of minority-class instances was 15, 30, 45, and 60% of the number of majority-class samples. The following methods are used: analytical and inductive methods for determining the necessary set of experiments and forming hypotheses about their results, and experimental and graphical methods for obtaining a visual comparison of the selected algorithms. The following results were obtained: using quality metrics, experiments were conducted for all algorithms on two different datasets, the Titanic passenger dataset and a dataset for detecting fraudulent transactions in bank accounts. The results indicated the best applicability for the SMOTE and SVM SMOTE algorithms and the worst performance for Borderline SMOTE and k-means SMOTE; the results of each algorithm and its potential uses are also described.
Conclusions: applying the analytical and experimental methods provided a comprehensive comparative description of the existing balancing algorithms. The superiority of oversampling algorithms over undersampling algorithms was proven. The selected algorithms were compared using different classification algorithms. The results are presented in graphs and tables and summarized in heat maps. The conclusions can be used when choosing the optimal balancing algorithm in the field of machine learning.
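As a rough illustration of the kind of oversampling the abstract compares, the following is a minimal sketch of SMOTE-style interpolation in plain Python. It is not the authors' implementation, and the record does not state which library or code the experiments used. The names `smote_like_oversample` and `minority_size`, the default `k=5`, and the fixed seed are assumptions made for this example only.

```python
import random
from math import dist


def minority_size(majority_count, ratio_percent):
    """Minority-class size for a given imbalance level, e.g. 15, 30, 45 or 60
    percent of the majority-class size (hypothetical helper)."""
    return round(majority_count * ratio_percent / 100)


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (the basic SMOTE idea, without any of its variants)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of the base point among the other minority samples
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # relative position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic
```

For example, with 200 majority samples and a 15% imbalance level, `minority_size(200, 15)` gives 30 minority samples, so balancing the classes would mean calling `smote_like_oversample(minority, n_new=170)`. Variants such as Borderline SMOTE, SVM SMOTE, and k-means SMOTE differ mainly in how they choose the base points and neighbours, which this sketch does not attempt to reproduce.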
dc.identifier.citation	Comparison of dataset oversampling algorithms and their applicability to the categorization problem / D. Teslenko, A. Sorokina, A. Khovrat, N. Huliiev, V. Kyriy // Сучасний стан наукових досліджень та технологій в промисловості. – 2023. – № 2(24). – С. 161–171.
dc.identifier.uri	https://openarchive.nure.ua/handle/document/25335
dc.language.iso	en
dc.publisher	ХНУРЕ
dc.subject	categorization
dc.subject	machine learning
dc.subject	methods of balancing
dc.subject	data generation methods
dc.subject	dataset
dc.subject	unbalanced datasets
dc.title	Comparison of dataset oversampling algorithms and their applicability to the categorization problem
dc.type	Article
dspace.entity.type	Publication

Files

Original bundle

Name: EK_SSND_Pr_2023_n2_161-171.pdf
Size: 610.81 KB
Format: Adobe Portable Document Format

Collections

Department of Economic Cybernetics and Management of Economic Security (EK)
