TY - JOUR
T1 - A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification
AU - Algamal, Zakariya Yahya
AU - Lee, Muhammad Hisyam
N1 - Publisher Copyright:
© 2018, Springer-Verlag GmbH Germany, part of Springer Nature.
PY - 2019/9/1
Y1 - 2019/9/1
N2 - The common issues of high-dimensional gene expression data are that many of the genes may not be relevant, and there exists a high correlation among genes. Gene selection has been proven to be an effective way to improve the results of many classification methods. Sparse logistic regression using least absolute shrinkage and selection operator (lasso) or using smoothly clipped absolute deviation is one of the most widely applicable methods in cancer classification for gene selection. However, this method faces a critical challenge in practical applications when there are high correlations among genes. To address this problem, a two-stage sparse logistic regression is proposed, with the aim of obtaining an efficient subset of genes with high classification capabilities by combining the screening approach as a filter method and adaptive lasso with a new weight as an embedded method. In the first stage, sure independence screening method as a screening approach retains those genes representing high individual correlation with the cancer class level. In the second stage, the adaptive lasso with new weight is implemented to address the existence of high correlations among the screened genes in the first stage. Experimental results based on four publicly available gene expression datasets have shown that the proposed method significantly outperforms three state-of-the-art methods in terms of classification accuracy, G-mean, area under the curve, and stability. In addition, the results demonstrate that the top selected genes are biologically related to the cancer type. Thus, the proposed method can be useful for cancer classification using DNA gene expression data in real clinical practice.
AB - The common issues of high-dimensional gene expression data are that many of the genes may not be relevant, and there exists a high correlation among genes. Gene selection has been proven to be an effective way to improve the results of many classification methods. Sparse logistic regression using least absolute shrinkage and selection operator (lasso) or using smoothly clipped absolute deviation is one of the most widely applicable methods in cancer classification for gene selection. However, this method faces a critical challenge in practical applications when there are high correlations among genes. To address this problem, a two-stage sparse logistic regression is proposed, with the aim of obtaining an efficient subset of genes with high classification capabilities by combining the screening approach as a filter method and adaptive lasso with a new weight as an embedded method. In the first stage, sure independence screening method as a screening approach retains those genes representing high individual correlation with the cancer class level. In the second stage, the adaptive lasso with new weight is implemented to address the existence of high correlations among the screened genes in the first stage. Experimental results based on four publicly available gene expression datasets have shown that the proposed method significantly outperforms three state-of-the-art methods in terms of classification accuracy, G-mean, area under the curve, and stability. In addition, the results demonstrate that the top selected genes are biologically related to the cancer type. Thus, the proposed method can be useful for cancer classification using DNA gene expression data in real clinical practice.
KW - Cancer classification
KW - Gene selection
KW - Lasso
KW - SCAD
KW - Sparse logistic regression
UR - http://www.scopus.com/inward/record.url?scp=85071493243&partnerID=8YFLogxK
U2 - 10.1007/s11634-018-0334-1
DO - 10.1007/s11634-018-0334-1
M3 - Article
AN - SCOPUS:85071493243
SN - 1862-5347
VL - 13
SP - 753
EP - 771
JO - Advances in Data Analysis and Classification
JF - Advances in Data Analysis and Classification
IS - 3
ER -