Abstract:
In efforts to discover disease mechanisms and improve the clinical diagnosis of tumors, it is useful to mine profiles for informative genes with definite biological meanings and to build robust classifiers with high precision. In this study, we developed a new method for tumor-gene selection, the chi-square test-based integrated rank gene and direct classifier (χ²-IRG-DC). First, we obtained the weighted integrated rank of gene importance from chi-square tests of single and pairwise gene interactions. Then, we sequentially introduced the ranked genes and removed redundant genes by using leave-one-out cross-validation of the chi-square test-based Direct Classifier (χ²-DC) within the training set to obtain informative genes. Finally, we determined the accuracy on independent test data by using the genes obtained above with χ²-DC. Furthermore, we analyzed the robustness of χ²-IRG-DC by comparing the generalization performance of different models, the efficiency of different feature-selection methods, and the accuracy of different classifiers. An independent test on ten multiclass tumor gene-expression datasets showed that χ²-IRG-DC could efficiently control overfitting and had higher generalization performance. The informative genes selected by χ²-IRG-DC could dramatically improve the independent test precision of other classifiers; meanwhile, the informative genes selected by other feature-selection methods also performed well in χ²-DC.
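The single-gene part of the ranking step can be illustrated with a minimal sketch. It assumes each gene is binarized at its median before a Pearson chi-square statistic is computed against the class labels; the abstract does not state the discretization, the pairwise-interaction scoring, or the weighting used by χ²-IRG-DC, so those parts are not reproduced here. The names `chi_square` and `rank_genes` and the toy data are hypothetical.

```python
from collections import Counter
from statistics import median


def chi_square(feature, labels):
    """Pearson chi-square statistic for two categorical sequences."""
    n = len(labels)
    row = Counter(feature)
    col = Counter(labels)
    observed = Counter(zip(feature, labels))
    stat = 0.0
    for r, n_r in row.items():
        for c, n_c in col.items():
            expected = n_r * n_c / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


def rank_genes(expr, labels):
    """Rank genes by descending chi-square score.

    expr maps a gene name to its expression values (one per sample).
    Assumption: each gene is binarized at its own median.
    """
    scores = {}
    for gene, values in expr.items():
        cut = median(values)
        binary = [v > cut for v in values]
        scores[gene] = chi_square(binary, labels)
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Toy data: g1 separates the classes perfectly, g2 barely at all.
expr = {
    "g1": [1.0, 1.2, 0.9, 5.0, 5.5, 4.8],
    "g2": [2.0, 5.0, 2.1, 5.1, 2.2, 4.9],
}
labels = ["A", "A", "A", "B", "B", "B"]
ranking, scores = rank_genes(expr, labels)
print(ranking, scores)
```

On this toy data the perfectly separating gene receives the largest score, so it would be introduced first in a sequential selection.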
Journal:
Bulgarian Journal of Agricultural Science, 2013, 19(6): 1327-1336; ISSN: 1310-0351
Corresponding author:
Zhang, H. Y.
Author affiliations:
Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, China;Hunan Agricultural University, College of Information Science and Technology, Changsha 410128, China;Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha 410128, China;[Wang H.Y.] Kansas State University, Department of Statistics, Manhattan, KS 66506, United States;[Yuan Z.M.; Wang L.F.] Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, China, Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha 410128, China
Corresponding institution:
[Zhang, H. Y.] H;Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, China
Keywords:
Geo-statistics tool;Multidimensional time series;Prediction;Reasonable sample rejection;Support vector machine regression
Abstract:
This paper proposes a method that applies a Geo-statistics tool (GS) to achieve fast and adequate order determination and introduces a novel algorithm, named Reasonable Sample Rejection (RSR), to realize rational sample selection. Combined with Support Vector Machine Regression (SVR), these yield a high-precision nonlinear prediction method for multidimensional time series, named GS-RSR-SVR. The main steps of the method are: 1) determine the order for the dependent variable of the training samples based on the one-dimensional GS aftereffect duration (range); 2) screen the independent variables by Leave-One-Out Cross-Validation (LOOCV) based on the minimum Mean Squared Error (MSE); 3) reject some of the oldest training samples based on the minimum correlation coefficient of fitting absolute relative error of training sets of different rejected sizes and sample number. Three real-world datasets were used to test the effectiveness of GS-RSR-SVR. The results show that GS-RSR-SVR has higher prediction precision and more stable prediction ability than MLR, ARIMA, CAR, BPNN, SVR and SVR-CAR.
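Step 2, screening independent variables by LOOCV with minimum MSE, can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: a 1-nearest-neighbour regressor stands in for SVR so the example needs only the standard library, and greedy forward selection is assumed because the abstract does not say how candidate subsets are searched. The names `loocv_mse` and `forward_screen` and the toy data are hypothetical.

```python
def loocv_mse(X, y, cols):
    """LOOCV mean squared error of a 1-nearest-neighbour regressor
    that uses only the variables listed in `cols`."""
    total = 0.0
    for i in range(len(X)):
        # Find the training sample closest to the held-out sample i.
        j = min(
            (k for k in range(len(X)) if k != i),
            key=lambda k: sum((X[i][c] - X[k][c]) ** 2 for c in cols),
        )
        total += (y[i] - y[j]) ** 2
    return total / len(X)


def forward_screen(X, y):
    """Greedy forward selection: add the variable that lowers the
    LOOCV MSE most, and stop when no addition improves it."""
    remaining = list(range(len(X[0])))
    selected, best = [], float("inf")
    while remaining:
        mse, col = min((loocv_mse(X, y, selected + [c]), c) for c in remaining)
        if mse >= best:
            break
        selected.append(col)
        remaining.remove(col)
        best = mse
    return selected, best


# Toy data: y depends only on variable 0; variable 1 is unrelated.
X = [[0, 5], [1, 0], [2, 4], [3, 1], [4, 3], [5, 2]]
y = [2 * row[0] for row in X]
selected, best_mse = forward_screen(X, y)
print(selected, best_mse)
```

Replacing the stand-in learner with an SVR implementation would leave the screening loop unchanged; only `loocv_mse` would need to fit and predict with that model.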
Keywords:
Support Vector Machine;Chisquare Statistic;Family Classifier;Marker Pair;Informative Gene
Abstract:
One of the challenges in classifying cancer tissue samples based on gene expression data is to establish an effective method that can select a parsimonious set of informative genes. The Top Scoring Pair (TSP), k-Top Scoring Pairs (k-TSP), Support Vector Machines (SVM), and prediction analysis of microarrays (PAM) are four popular classifiers that have comparable performance on multiple cancer datasets. SVM and PAM tend to use a large number of genes, and TSP and k-TSP always use an even number of genes. In addition, the selection of distinct gene pairs in k-TSP simply combines the pairs of top-ranking genes without considering that the gene set with the best discrimination power may not be the combined pairs. The k-TSP algorithm also requires the user to specify an upper bound for the number of gene pairs. Here we introduce a computational algorithm to address these problems. The algorithm is named the Chisquare-statistic-based Top Scoring Genes (Chi-TSG) classifier, abbreviated as TSG. The TSG classifier starts with the top two genes and sequentially adds further genes to the candidate gene set to perform informative gene selection. The algorithm automatically reports the total number of informative genes selected with cross-validation. We provide the algorithm for both binary and multi-class cancer classification. The algorithm was applied to 9 binary and 10 multi-class gene expression datasets involving human cancers. The TSG classifier outperforms the TSP family classifiers by a large margin in most of the 19 datasets. In addition to improved accuracy, our classifier shares all the advantages of the TSP family classifiers, including easy interpretation, invariance to monotone transformations, selection of a small number of informative genes that allows follow-up studies, and resistance to sampling variation because its operations are performed within each sample.
Redefining the scores for gene sets and the classification rules in TSP family classifiers by incorporating sample size information can lead to better selection of informative genes and higher classification accuracy. The resulting TSG classifier offers a useful tool for cancer classification based on numerical molecular data.
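The invariance to monotone transformations mentioned above follows from the way TSP-style classifiers compare genes within each sample. The sketch below illustrates the standard TSP pair score, |P(Xi < Xj | class 1) − P(Xi < Xj | class 2)|, for the baseline TSP classifier only; it is not the TSG scoring rule, which the abstract does not specify. The function names and toy data are hypothetical.

```python
import math
from itertools import combinations


def tsp_score(xi, xj, labels, c1, c2):
    """TSP pair score: difference between the two classes in the
    frequency of the within-sample ordering xi < xj."""
    def p(c):
        idx = [k for k, lab in enumerate(labels) if lab == c]
        return sum(xi[k] < xj[k] for k in idx) / len(idx)
    return abs(p(c1) - p(c2))


def top_scoring_pair(expr, labels, c1, c2):
    """Return the gene pair with the highest TSP score and that score."""
    best = max(
        combinations(sorted(expr), 2),
        key=lambda pr: tsp_score(expr[pr[0]], expr[pr[1]], labels, c1, c2),
    )
    return best, tsp_score(expr[best[0]], expr[best[1]], labels, c1, c2)


expr = {
    "g1": [1.0, 2.0, 1.5, 6.0, 7.0, 6.5],
    "g2": [3.0, 4.0, 3.5, 2.0, 3.0, 2.5],
    "g3": [5.0, 1.0, 5.0, 5.0, 8.0, 5.0],
}
labels = ["A", "A", "A", "B", "B", "B"]

pair, score = top_scoring_pair(expr, labels, "A", "B")

# Apply a monotone transformation (exp) to every value: the
# within-sample orderings, and hence the scores, are unchanged.
expr_exp = {g: [math.exp(v) for v in vals] for g, vals in expr.items()}
pair_exp, score_exp = top_scoring_pair(expr_exp, labels, "A", "B")
print(pair, score, pair_exp, score_exp)
```

Because only the ordering of two values inside one sample matters, any strictly increasing transformation applied to every value (log, exp, rescaling) leaves the selected pair unchanged.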