Imbalanced Data Classification Using a Relevant Information-Based Sampling Approach

Hoyos, Keider; Fernández, Jorge; Martinez, Beatriz; Henao, Óscar; Orozco, Álvaro; Daza, Genaro

Please use this identifier to cite or link to this item: https://repositorio.uci.cu/jspui/handle/123456789/9472

Title:	Imbalanced Data Classification Using a Relevant Information-Based Sampling Approach
Authors:	Hoyos, Keider; Fernández, Jorge; Martinez, Beatriz; Henao, Óscar; Orozco, Álvaro; Daza, Genaro
Keywords:	LEARNING ALGORITHMS; DATA PREPROCESSING; DATA VALIDATION
Publication date:	2018
Publisher:	Springer
Citation:	Hoyos K., Fernández J., Martinez B., Henao Ó., Orozco Á., Daza G. (2018) Imbalanced Data Classification Using a Relevant Information-Based Sampling Approach. In: Hernández Heredia Y., Milián Núñez V., Ruiz Shulcloper J. (eds) Progress in Artificial Intelligence and Pattern Recognition. IWAIPR 2018. Lecture Notes in Computer Science, vol 11047. Springer, Cham. https://doi.org/10.1007/978-3-030-01132-1_32
Abstract:	The imbalanced data refer to datasets where the number of samples in one class (majority class) is much higher than the other (minority class) causing biased classifiers in favor of the majority class. Currently, it is difficult to develop an effective model using machine learning algorithms without considering data preprocessing to balance the imbalanced data sets. In this paper, we propose a Relevant Information based under-sampling (RIS) approach to improve the classification performance for the minority class by selecting the most relevant samples from the majority class as training data. Our RIS approach is based on a self-organizing principle of relevant information, which allows extracting the underlying structure of the majority class preserving different levels of detail of the original data with a smaller number of samples. Additionally, the RIS captures the data structure beyond second order statistics by estimating information theoretic measures which quantify the statistical structure of the majority class accurately, decreasing the consequences of the imbalanced classes distribution problem. We test our methodology on synthetic and real-world imbalanced datasets. Finally, we use a cross-validation scheme to quantify the classifier performance by evaluating the geometric mean. Results show that our proposal outperforms the state of the art methods for imbalanced class distributions regarding classification geometric mean, especially in highly imbalanced datasets.
URI:	https://repositorio.uci.cu/jspui/handle/123456789/9472
Appears in collections:	UCIENCIA 2018
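
Note on the evaluation metric: the abstract reports performance as a geometric mean but does not define it. The sketch below assumes the definition commonly used for binary imbalanced problems, the square root of sensitivity times specificity. The function name `g_mean`, the label convention (1 = minority class), and the toy labels are illustrative assumptions and are not taken from the paper.

```python
import math


def g_mean(y_true, y_pred, minority_label=1):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == minority_label:
            if pred == minority_label:
                tp += 1  # minority sample correctly detected
            else:
                fn += 1  # minority sample missed
        else:
            if pred == minority_label:
                fp += 1  # majority sample flagged as minority
            else:
                tn += 1  # majority sample correctly rejected
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical toy labels: 8 majority samples (0) and 2 minority samples (1).
y_true = [0] * 8 + [1] * 2
always_majority = [0] * 10                     # 80% accuracy, ignores minority class
mixed_guess = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]   # finds one of two minority samples

print(g_mean(y_true, always_majority))
print(g_mean(y_true, mixed_guess))
```

Under this definition, a classifier that always predicts the majority class scores zero even though its accuracy is high, which is why the metric suits highly imbalanced datasets.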

Files in this item:

File	Size	Format
A054.pdf	118.23 kB	Adobe PDF
