Cost-Sensitive Variable Selection for Multi-Class Imbalanced Datasets Using Bayesian Networks

Bibliographic Details
Main Authors: Ramos-López, Darío, Maldonado, Ana D.
Format: info:eu-repo/semantics/article
Language: English
Published: MDPI 2021
Subjects: multi-class classification; imbalanced data; Bayesian networks; variable selection
Online Access: http://hdl.handle.net/10835/9318
author Ramos-López, Darío
Maldonado, Ana D.
collection DSpace
description Multi-class classification on imbalanced datasets is a challenging problem. In these cases, common validation metrics (such as accuracy or recall) are often unsuitable. In many such problems, which often arise in real-world health-related settings, some classification errors may be tolerated, whereas others must be avoided entirely. Therefore, a cost-sensitive variable selection procedure for building a Bayesian network classifier is proposed. It employs a flexible validation metric (a cost/loss function) that encodes the impact of the different classification errors, so that the model is learned to optimize a cost function specified a priori. The proposed approach was applied to forecasting an air quality index from current levels of air pollutants and climatic variables, using a highly imbalanced dataset. For this problem, the method yielded better results in the less frequent class states than selection guided by standard validation metrics. The ability to fine-tune the objective validation function can improve prediction quality on imbalanced data or when asymmetric misclassification costs must be considered.
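The idea summarized in the description, selecting variables by a user-specified misclassification cost rather than by accuracy, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a naive Bayes classifier as the simplest Bayesian network classifier, a greedy forward search, and a hypothetical cost matrix and toy dataset (the class labels "good", "poor", "hazard", the variables "pm" and "noise", and all function names are invented for illustration).

```python
import math
from collections import Counter, defaultdict

def mean_cost(y_true, y_pred, cost):
    """Average penalty; cost[t][p] is the cost of predicting p when the truth is t."""
    return sum(cost[t][p] for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_nb(rows, labels, features):
    """Count class frequencies and per-class value frequencies for each feature."""
    prior = Counter(labels)
    counts = {f: defaultdict(Counter) for f in features}
    levels = {f: set() for f in features}
    for row, c in zip(rows, labels):
        for f in features:
            counts[f][c][row[f]] += 1
            levels[f].add(row[f])
    return prior, counts, levels

def predict_nb(model, row, features, alpha=1.0):
    """Most probable class under naive Bayes with Laplace smoothing (alpha)."""
    prior, counts, levels = model
    n = sum(prior.values())
    def log_post(c):
        lp = math.log(prior[c] / n)
        for f in features:
            num = counts[f][c][row[f]] + alpha
            lp += math.log(num / (prior[c] + alpha * len(levels[f])))
        return lp
    return max(sorted(prior), key=log_post)

def forward_select(train, y_train, valid, y_valid, candidates, cost):
    """Greedily add the variable that most reduces the validation cost."""
    def score(features):
        model = fit_nb(train, y_train, features)
        preds = [predict_nb(model, r, features) for r in valid]
        return mean_cost(y_valid, preds, cost)
    selected, best = [], score([])  # baseline: class prior only
    while True:
        trials = [(score(selected + [f]), f) for f in candidates if f not in selected]
        if not trials:
            break
        trial_cost, f = min(trials)
        if trial_cost >= best:  # stop when no candidate lowers the cost
            break
        selected.append(f)
        best = trial_cost
    return selected, best

# Hypothetical asymmetric costs: missing a "hazard" day is penalized most.
COST = {
    "good":   {"good": 0,  "poor": 1, "hazard": 1},
    "poor":   {"good": 2,  "poor": 0, "hazard": 1},
    "hazard": {"good": 10, "poor": 5, "hazard": 0},
}
# Toy imbalanced data: "pm" is informative, "noise" is not.
TRAIN = [{"pm": p, "noise": z} for p in ["low", "low", "mid", "high"] for z in "ab"]
Y_TRAIN = ["good"] * 4 + ["poor"] * 2 + ["hazard"] * 2
VALID = [{"pm": "low", "noise": "a"}, {"pm": "mid", "noise": "b"}, {"pm": "high", "noise": "a"}]
Y_VALID = ["good", "poor", "hazard"]

if __name__ == "__main__":
    print(forward_select(TRAIN, Y_TRAIN, VALID, Y_VALID, ["pm", "noise"], COST))
```

Changing the entries of the cost matrix changes which variables the search keeps, which is the sense in which the validation function can be tuned toward the rare but costly class states.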
format info:eu-repo/semantics/article
id oai:repositorio.ual.es:10835-9318
institution Universidad de Cuenca
language English
publishDate 2021
publisher MDPI
record_format dspace
spelling oai:repositorio.ual.es:10835-9318 2023-04-12T19:36:23Z
Cost-Sensitive Variable Selection for Multi-Class Imbalanced Datasets Using Bayesian Networks
Ramos-López, Darío
Maldonado, Ana D.
multi-class classification
imbalanced data
Bayesian networks
variable selection
2021-01-18T09:33:56Z
2021-01-18T09:33:56Z
2021-01-13
info:eu-repo/semantics/article
2227-7390
http://hdl.handle.net/10835/9318
en
https://www.mdpi.com/2227-7390/9/2/156
Attribution-NonCommercial-NoDerivatives 4.0 International
http://creativecommons.org/licenses/by-nc-nd/4.0/
info:eu-repo/semantics/openAccess
MDPI
title Cost-Sensitive Variable Selection for Multi-Class Imbalanced Datasets Using Bayesian Networks
topic multi-class classification
imbalanced data
Bayesian networks
variable selection
url http://hdl.handle.net/10835/9318