TY - JOUR
T1 - A proficient two stage model for identification of promising gene subset and accurate cancer classification
AU - Dass, Sayantan
AU - Mistry, Sujoy
AU - Sarkar, Pradyut
AU - Barik, Subhasis
AU - Dahal, Keshav
PY - 2023/3/10
Y1 - 2023/3/10
N2 - Over the past few decades, there has been a massive growth in the volume of biological data. In such datasets, the influence of dimensionality bias or the existence of redundant, noisy, and irrelevant genes has become a severe barrier to classifying gene expression data. Therefore, feature selection strategies are employed to reduce the impact of noisy genes and to identify gene patterns precisely, thereby enhancing classification accuracy. This article proposes an innovative hybrid feature selection model that combines statistical and filter-based feature selection methodologies. Following the initial step of normalizing each sample, a non-parametric Kruskal–Wallis (KW) test and the Bonferroni Correction (BC) are used together to pick relevant genes. Finally, a correlation-based feature selection (CFS) method is employed to determine how different genes are related, and a greedy search policy is used to eliminate redundant genes and discover promising gene subsets. Based on the results and comparison on six distinct microarray datasets, the proposed method is superior to the state-of-the-art feature selection algorithms Chi-square, Joint Mutual Information (JMI), Conditional Mutual Information Maximization (CMIM), Relief-F, and Minimum Redundancy Maximum Relevance (mRMR) when used with Support Vector Machine (SVM), Naïve Bayes (NB), K-Nearest Neighbors (k-NN), and Decision Tree (DT) classifiers. These findings lead us to believe that the suggested feature selection algorithm can effectively discriminate cancer patients from healthy persons.
AB - Over the past few decades, there has been a massive growth in the volume of biological data. In such datasets, the influence of dimensionality bias or the existence of redundant, noisy, and irrelevant genes has become a severe barrier to classifying gene expression data. Therefore, feature selection strategies are employed to reduce the impact of noisy genes and to identify gene patterns precisely, thereby enhancing classification accuracy. This article proposes an innovative hybrid feature selection model that combines statistical and filter-based feature selection methodologies. Following the initial step of normalizing each sample, a non-parametric Kruskal–Wallis (KW) test and the Bonferroni Correction (BC) are used together to pick relevant genes. Finally, a correlation-based feature selection (CFS) method is employed to determine how different genes are related, and a greedy search policy is used to eliminate redundant genes and discover promising gene subsets. Based on the results and comparison on six distinct microarray datasets, the proposed method is superior to the state-of-the-art feature selection algorithms Chi-square, Joint Mutual Information (JMI), Conditional Mutual Information Maximization (CMIM), Relief-F, and Minimum Redundancy Maximum Relevance (mRMR) when used with Support Vector Machine (SVM), Naïve Bayes (NB), K-Nearest Neighbors (k-NN), and Decision Tree (DT) classifiers. These findings lead us to believe that the suggested feature selection algorithm can effectively discriminate cancer patients from healthy persons.
KW - biomarker
KW - Bonferroni correction
KW - classification
KW - correlation-based feature selection
KW - gene expression
KW - gene selection
KW - Kruskal–Wallis test
KW - microarray
UR - http://www.scopus.com/inward/record.url?scp=85149806180&partnerID=8YFLogxK
U2 - 10.1007/s41870-023-01181-2
DO - 10.1007/s41870-023-01181-2
M3 - Article
AN - SCOPUS:85149806180
SN - 2511-2104
VL - 15
SP - 1555
EP - 1568
JO - International Journal of Information Technology (Singapore)
JF - International Journal of Information Technology (Singapore)
IS - 3
ER -