A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification

Zakariya Yahya Algamal, Muhammad Hisyam Lee*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

64 Citations (Scopus)

Abstract

High-dimensional gene expression data pose two common problems: many of the genes may not be relevant, and the genes are highly correlated with one another. Gene selection has proven to be an effective way to improve the results of many classification methods. Sparse logistic regression with the least absolute shrinkage and selection operator (lasso) or with the smoothly clipped absolute deviation (SCAD) penalty is among the most widely used methods for gene selection in cancer classification. However, these methods face a critical challenge in practical applications when genes are highly correlated. To address this problem, a two-stage sparse logistic regression is proposed. It aims to obtain an efficient subset of genes with high classification capability by combining a screening approach as a filter method with an adaptive lasso that uses a new weight as an embedded method. In the first stage, sure independence screening retains the genes that show high individual correlation with the cancer class level. In the second stage, the adaptive lasso with the new weight is applied to handle the high correlations among the genes retained in the first stage. Experimental results on four publicly available gene expression datasets show that the proposed method significantly outperforms three state-of-the-art methods in terms of classification accuracy, G-mean, area under the curve, and stability. In addition, the results demonstrate that the top selected genes are biologically related to the cancer type. Thus, the proposed method can be useful for cancer classification using DNA gene expression data in real clinical practice.
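The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the "new weight", so the sketch substitutes the standard adaptive lasso weight 1 / (|initial coefficient| + eps)^gamma, taken from an ordinary lasso fit. The function names (`sis_screen`, `weighted_l1_logistic`, `two_stage_select`), the proximal gradient solver, the synthetic data, and all parameter values are assumptions made for illustration.

```python
import numpy as np


def sis_screen(X, y, d):
    """Stage 1: keep the d genes with the largest |marginal correlation| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    score = np.abs(Xc.T @ yc) / denom
    return np.sort(np.argsort(score)[::-1][:d])


def weighted_l1_logistic(X, y, lam, weights, n_iter=2000):
    """Proximal gradient for mean logistic loss + lam * sum_j weights[j] * |beta_j|."""
    n, p = X.shape
    b0, beta = 0.0, np.zeros(p)
    # Step size 1/L, with L bounded by ||[1, X]||_2^2 / (4 n) for the logistic loss.
    Xa = np.hstack([np.ones((n, 1)), X])
    step = 4.0 * n / np.linalg.norm(Xa, 2) ** 2
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-np.clip(b0 + X @ beta, -30, 30)))
        resid = prob - y
        b0 -= step * resid.mean()  # the intercept is not penalised
        z = beta - step * (X.T @ resid) / n
        # Soft-thresholding with a separate threshold for each coefficient.
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam * weights, 0.0)
    return b0, beta


def two_stage_select(X, y, d, lam, gamma=1.0, eps=1e-6):
    """Screen genes, then fit an adaptive lasso logistic model on the survivors."""
    keep = sis_screen(X, y, d)
    Xs = X[:, keep]
    Xs = (Xs - Xs.mean(axis=0)) / (Xs.std(axis=0) + 1e-12)
    # ASSUMPTION: initial estimate from an ordinary lasso fit (all weights 1).
    _, beta_init = weighted_l1_logistic(Xs, y, lam, np.ones(Xs.shape[1]))
    # ASSUMPTION: standard adaptive lasso weight, not the paper's new weight.
    weights = 1.0 / (np.abs(beta_init) + eps) ** gamma
    _, beta = weighted_l1_logistic(Xs, y, lam, weights)
    nonzero = beta != 0
    return keep[nonzero], beta[nonzero]


# Synthetic example: 80 samples, 200 "genes", only genes 0 and 1 drive the class.
rng = np.random.default_rng(0)
n, p, d = 80, 200, 20
X = rng.standard_normal((n, p))
y = (X[:, 0] - X[:, 1] > 0).astype(float)
genes, coefs = two_stage_select(X, y, d=d, lam=0.05)
print("selected genes:", genes)
print("coefficients:", coefs)
```

In practice the screening size `d` and the penalty `lam` would be chosen by cross-validation rather than fixed as they are here.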

Original language: English
Pages (from-to): 753-771
Number of pages: 19
Journal: Advances in Data Analysis and Classification
Volume: 13
Issue number: 3
Publication status: Published - 1 Sept 2019
Externally published: Yes

Keywords

  • Cancer classification
  • Gene selection
  • Lasso
  • SCAD
  • Sparse logistic regression
