Abstract:
Splice site prediction has been a long-standing problem in bioinformatics. Although many computational approaches to splice site prediction have achieved satisfactory accuracy, further improvement remains important because it contributes to more accurate prediction of gene structure. A proper window size must be determined before prediction. An overly long window may introduce irrelevant features that reduce predictive accuracy, whereas a short window that retains maximum information may perform better in terms of both predictive accuracy and time cost. Furthermore, because the number of false splice sites following the GT–AG rule far exceeds that of true splice sites, accurate and rapid prediction of splice sites from large, imbalanced samples has always been a challenge. Therefore, working with a short window and large, imbalanced samples, we developed a new computational method named chi-square decision table (χ2-DT) for donor splice site prediction. Using a short window size of 11 bp, χ2-DT extracts improved positional and compositional features based on the chi-square test, introduces features one by one according to information gain, and constructs a balanced decision table for imbalanced pattern classification. With a 2000:271,132 (true sites:false sites) training set, χ2-DT achieves the highest independent test accuracy (93.34%) when compared with three classifiers (random forest, artificial neural network, and relaxed variable kernel density estimator) and requires a short computation time (89 s). χ2-DT also exhibits good independent test accuracy (92.40%) when validated on BG-570 mutated sequences with frameshift errors (nucleotide insertions and deletions). Moreover, χ2-DT outperforms both the long-window and the short-window methods it was compared with in terms of predictive accuracy.
Based on short window size and imbalanced large samples, the proposed method not only achieves higher predictive accuracy than some existing methods, but also has high computational speed and good robustness against nucleotide insertions and deletions. This article was reviewed by Ryan McGinty, Ph.D. and Dirk Walther.
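As a rough illustration of chi-square scoring of positional features in an 11 bp window, the sketch below computes a plain Pearson chi-square statistic per window position from a nucleotide-by-class contingency table. It is not the χ2-DT method itself: the abstract does not specify the "improved" positional features, the information-gain step or the decision table, so the function name `chi_square_by_position` and the toy sequences are hypothetical.

```python
from collections import Counter

def chi_square_by_position(true_sites, false_sites, alphabet="ACGT"):
    """Return one Pearson chi-square statistic per window position.

    Each position yields a 4 x 2 contingency table (nucleotide x class).
    This is a generic test, assumed here for illustration only.
    """
    n_pos, n_neg = len(true_sites), len(false_sites)
    total = n_pos + n_neg
    stats = []
    for pos in range(len(true_sites[0])):
        pos_counts = Counter(seq[pos] for seq in true_sites)
        neg_counts = Counter(seq[pos] for seq in false_sites)
        chi2 = 0.0
        for base in alphabet:
            row_total = pos_counts[base] + neg_counts[base]
            if row_total == 0:
                continue  # base never observed at this position
            for observed, class_total in ((pos_counts[base], n_pos),
                                          (neg_counts[base], n_neg)):
                expected = row_total * class_total / total
                chi2 += (observed - expected) ** 2 / expected
        stats.append(chi2)
    return stats

# Hypothetical 11 bp windows; every window carries GT at indices 3 and 4.
true_sites = ["CAGGTAAGTAT", "AAGGTGAGTGC", "CAGGTAAGAAC", "AAGGTAAGTTT"]
false_sites = ["TCAGTCCTTAG", "GCTGTTTCACC", "ACCGTCAGGCA", "TTTGTGCCATC"]
scores = chi_square_by_position(true_sites, false_sites)
```

Positions where both classes show the same base (the GT dinucleotide here) score zero, while positions whose base composition differs between true and false sites score higher and would be the natural candidates for feature selection.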
Authors:
Cai, X. H.;Chen, T.;Wang, R. Y.;Fan, Y. J.;Li, Y.;...
Journal:
Theoretical and Applied Climatology, 2019, 137(3-4): 2139-2149. ISSN: 0177-798X
Corresponding authors:
Zhou, W.;Zhou, Q. M.
Author affiliations:
[Wang, R. Y.; Fan, Y. J.; Hu, S. N.; Yuan, Z. M.; Li, Y.; Cai, X. H.; Zhou, W.] Hunan Agr Univ, Hunan Prov Engn & Technol Res Ctr Agr Big Data An, Changsha 410128, Hunan, Peoples R China.;[Wang, R. Y.; Fan, Y. J.; Hu, S. N.; Yuan, Z. M.; Li, Y.; Cai, X. H.; Zhou, W.] Hunan Agr Univ, Hunan Prov Key Lab Biol & Control Plant Dis & Ins, Changsha 410128, Hunan, Peoples R China.;[Wang, R. Y.; Fan, Y. J.; Hu, S. N.; Yuan, Z. M.; Li, Y.; Cai, X. H.; Zhou, W.] Hunan Agr Univ, Hunan Prov Engn & Technol Res Ctr Biopesticide &, Changsha 410128, Hunan, Peoples R China.;[Zhou, Q. M.; Chen, T.] Hunan Agr Univ, Coll Agr, Changsha 410128, Hunan, Peoples R China.;[Li, H. G.; Li, X. Y.] Hunan Tobacco Co, Chenzhou Co, Chenzhou 423000, Peoples R China.
Corresponding institutions:
[Zhou, W.; Zhou, Q. M.] H;[Zhou, W.] T;Hunan Agr Univ, Hunan Prov Engn & Technol Res Ctr Agr Big Data An, Changsha 410128, Hunan, Peoples R China.;Hunan Agr Univ, Hunan Prov Key Lab Biol & Control Plant Dis & Ins, Changsha 410128, Hunan, Peoples R China.;Hunan Agr Univ, Hunan Prov Engn & Technol Res Ctr Biopesticide &, Changsha 410128, Hunan, Peoples R China.
Abstract:
Tobacco wildfire disease is common globally, and climate change may increase the risk of outbreaks. Therefore, there is an urgent need to establish an effective climate model to forecast the occurrence of wildfire disease. To design such a model, we collected data for 40 wildfire disease indices via tobacco field surveys and data for 15 climate factors of Guiyang County in China from 2012 to 2016. First, we built multiple linear regression (MLR), stepwise linear regression (SLR) and support vector regression (SVR) models using three climate features (precipitation, mean daily temperature and sunshine duration), but none of them proved effective. Second, we built three corresponding models using an expanded set of 15 climate features and an in-house WDEM method (worst descriptor elimination multi-roundly), and the independent test results showed that the best SVR model had not only higher predictive accuracy ($Q_{ext}^2 = 0.94$) but also better stability. Finally, we evaluated the biological significance of the retained climate features and the single-factor effects of the best model through interpretability analysis. Our results indicated that (1) three climate factors (minimum wind velocity, daily temperature range and daily pressure) strongly affected the occurrence of wildfire disease; and (2) the ranges of relative humidity and sunshine hours were negatively correlated with the occurrence of wildfire disease, while daily mean vapour pressure was positively correlated with it. Our work enables useful theoretical prediction of wildfire disease, especially in terms of climate-related predictions.
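The abstract reports an external predictive accuracy $Q_{ext}^2$ without defining it. A minimal sketch follows, assuming one common definition in which the prediction error on the independent test set is compared with the spread of the test responses around the training-set mean; the function names `mse_ext` and `q2_ext` and the toy values are hypothetical, and the authors may have used a different variant.

```python
def mse_ext(y_test, y_pred):
    """Mean squared error on the independent (external) test set."""
    return sum((y - p) ** 2 for y, p in zip(y_test, y_pred)) / len(y_test)

def q2_ext(y_test, y_pred, y_train_mean):
    """External Q^2 under the assumed definition:
    1 - sum((y - y_hat)^2) / sum((y - mean(y_train))^2), summed over the test set.
    """
    press = sum((y - p) ** 2 for y, p in zip(y_test, y_pred))
    ss_tot = sum((y - y_train_mean) ** 2 for y in y_test)
    return 1.0 - press / ss_tot

# Hypothetical toy values, not data from the study.
y_train = [10.0, 20.0, 30.0, 40.0]
y_test = [15.0, 25.0, 35.0]
y_pred = [14.0, 27.0, 34.0]

error = mse_ext(y_test, y_pred)
score = q2_ext(y_test, y_pred, sum(y_train) / len(y_train))
```

Under this definition a value close to 1 means the model predicts unseen samples far better than simply using the training mean, and a value at or below 0 means it does no better.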
Corresponding institution:
Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, College of Plant Protection, Hunan Agricultural University, Changsha, China
Author affiliations:
[Xing, Pengwei; Chen, Yuan; Yuan, Zheming] Hunan Agr Univ, Hunan Engn & Technol Res Ctr Agr Big Data Anal De, Changsha 410128, Hunan, Peoples R China.;[Xing, Pengwei; Chen, Yuan; Yuan, Zheming] Hunan Agr Univ, Hunan Prov Key Lab Biol & Control Plant Dis & Ins, Changsha 410128, Hunan, Peoples R China.;[Gao, Jun] Univ Arkansas Med Sci, Dept Biochem & Mol Biol, Little Rock, AR 72205 USA.;[Bai, Lianyang] Hunan Acad Agr Sci, Biotechnol Res Ctr, Changsha 410125, Hunan, Peoples R China.
Corresponding institutions:
[Yuan, Zheming; Bai, Lianyang] H;Hunan Agr Univ, Hunan Engn & Technol Res Ctr Agr Big Data Anal De, Changsha 410128, Hunan, Peoples R China.;Hunan Agr Univ, Hunan Prov Key Lab Biol & Control Plant Dis & Ins, Changsha 410128, Hunan, Peoples R China.;Hunan Acad Agr Sci, Biotechnol Res Ctr, Changsha 410125, Hunan, Peoples R China.
Abstract:
Selecting informative genes, including individually discriminant genes and synergic genes, from expression data has been useful for medical diagnosis and prognosis. Detecting synergic genes is more difficult than selecting individually discriminant genes. Several efforts have recently been made to detect gene-gene synergies, such as dendrogram-based I(X1; X2; Y) (mutual information), doublets (gene pairs) and MIC(X1; X2; Y) based on the maximal information coefficient. It is unclear whether dendrogram-based I(X1; X2; Y) and doublets can capture synergies efficiently. Although MIC(X1; X2; Y) can capture a wide range of interactions, it has a high computational cost caused by its 3-D search. In this paper, we developed a simple and fast approach based on the abs conversion type (i.e., Z = |X1 − X2|) and the t-test to detect interactions in simulated and real-world datasets.
Our results showed that dendrogram-based I(X1; X2; Y) and doublets are ineffective at discovering pair-wise gene interactions, whereas our approach can discover typical pair-wise synergic genes efficiently. These synergic genes can reach accuracy comparable to that of individually discriminant genes when the same number of genes is used. A classifier cannot learn well if synergic genes have not been converted properly. Combining individually discriminant and synergic genes can improve prediction performance.
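The abs conversion described above can be sketched as follows: form Z = |X1 − X2| for a gene pair and compare Z between the two classes with a t-test. The sketch assumes Welch's unequal-variance t statistic (the abstract only says "t-test"), and the helper names and toy expression values are hypothetical.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances assumed)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

def t_by_class(values, labels):
    """t statistic comparing class-1 samples against class-0 samples."""
    class0 = [v for v, y in zip(values, labels) if y == 0]
    class1 = [v for v, y in zip(values, labels) if y == 1]
    return welch_t(class1, class0)

# Hypothetical expression values: in class 0 the two genes track each other,
# in class 1 they diverge, yet neither gene separates the classes alone.
x1 = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
x2 = [1.1, 1.8, 3.3, 3.9, 5.2, 5.7, 3, 0.5, 5.5, 2, 6.5, 4.5]
labels = [0] * 6 + [1] * 6

z = [abs(a - b) for a, b in zip(x1, x2)]  # the abs conversion
t_x1 = t_by_class(x1, labels)
t_x2 = t_by_class(x2, labels)
t_z = t_by_class(z, labels)
```

In this toy case the individual genes give t statistics near zero, while the converted variable Z gives a large one, which is the pattern a pair-wise synergy screen of this kind would look for.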
Author affiliations:
[Zhe-Ming Yuan] Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, China;[Aiping Wu; Taijiao Jiang] Center for Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100005;Suzhou Institute of Systems Medicine, Suzhou, China;[Xinlei Zhang] Suzhou Geneworks Technology Company Limited, Suzhou, China;[Mingming Su] Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
Corresponding institutions:
[Taijiao Jiang] C;[Zhe-Ming Yuan] H;Center for Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100005;Suzhou Institute of Systems Medicine, Suzhou, China;Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, China
Abstract:
High-throughput sequencing-based metagenomics has garnered considerable interest in recent years. Numerous methods and tools have been developed for the analysis of metagenomic data. However, it is still a daunting task to install a large number of tools and complete a complicated analysis, especially for researchers with minimal bioinformatics backgrounds. To address this problem, we constructed an automated software platform named MetaDP for 16S rRNA sequencing data analysis, including data quality control, operational taxonomic unit clustering, diversity analysis, and disease risk prediction modeling. Furthermore, a support vector machine-based prediction model for irritable bowel syndrome (IBS) was built by applying MetaDP to microbial 16S sequencing data from 108 children. The success of the IBS prediction model suggests that the platform may also be applied to other diseases related to gut microbes, such as obesity, metabolic syndrome, or intestinal cancer (http://metadp.cn:7001/).
Abstract:
Informative gene selection can have important implications for the improvement of cancer diagnosis and the identification of new drug targets. Individual-gene-ranking methods ignore interactions between genes. Furthermore, popular pair-wise gene evaluation methods, e.g., TSP and TSG, are ineffective at discovering pair-wise interactions. Several efforts to discover pair-wise synergy have been made based on information-theoretic approaches, such as EMBP and FeatKNN. However, the methods employed to estimate mutual information, e.g., binarization, histogram-based and KNN estimators, depend on known data or domain characteristics. Recently, Reshef et al. proposed a novel maximal information coefficient (MIC) measure that captures a wide range of associations between two variables and has the property of generality. An extension from MIC(X; Y) to MIC(X1; X2; Y) is therefore desired. We developed an approximation algorithm for estimating MIC(X1; X2; Y) where Y is a discrete variable. MIC(X1; X2; Y) is employed to detect pair-wise synergy in simulation and cancer microarray data. The results indicate that MIC(X1; X2; Y) also has the property of generality.
It can discover synergic genes that are undetectable by reference feature selection methods such as MIC(X; Y) and TSG. Synergic genes can distinguish different phenotypes. Finally, the biological relevance of these synergic genes is validated with GO annotation and the OUgene database.
Abstract:
The maximal information coefficient (MIC) captures dependences between paired variables, including both functional and non-functional relationships. In this paper, we develop a new method, ChiMIC, to calculate MIC values. The ChiMIC algorithm uses the chi-square test to terminate grid optimization and thereby removes the maximal grid size restriction of the original ApproxMaxMI algorithm. Computational experiments show that the ChiMIC algorithm maintains the same MIC values for noiseless functional relationships but gives much smaller MIC values for independent variables. For noisy functional relationships, the ChiMIC algorithm reaches the optimal partition much faster. Furthermore, the MCN values based on MIC calculated by ChiMIC better capture the complexity of functional relationships, and the statistical power of MIC calculated by ChiMIC is higher than that of MIC calculated by ApproxMaxMI. Moreover, the computational cost of ChiMIC is much lower than that of ApproxMaxMI. We apply the MIC values to feature selection and obtain better classification accuracy using features selected by the MIC values from ChiMIC.
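To make the quantity being optimised concrete, the sketch below computes a naive MIC-style score: mutual information on a grid, normalised by the log of the smaller grid dimension, maximised over grid sizes whose cell count stays within n^0.6. It uses fixed equal-frequency bins rather than an optimised partition, so it implements neither ApproxMaxMI's partition search nor ChiMIC's chi-square stopping rule; the function names and the `alpha` parameter are assumptions made for illustration.

```python
import math
from collections import Counter

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding nearly equal numbers of points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

def grid_mutual_information(bx, by):
    """Mutual information (in bits) between two discretised variables."""
    n = len(bx)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum(c / n * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def naive_mic(x, y, alpha=0.6):
    """Maximum normalised grid MI over grids with nx * ny <= n ** alpha.

    Equal-frequency grids only; a crude stand-in for a real MIC estimator.
    """
    budget = len(x) ** alpha
    best = 0.0
    for nx in range(2, int(budget // 2) + 1):
        for ny in range(2, int(budget // nx) + 1):
            mi = grid_mutual_information(equal_frequency_bins(x, nx),
                                         equal_frequency_bins(y, ny))
            best = max(best, mi / math.log2(min(nx, ny)))
    return best

# Noiseless linear relationship on hypothetical data.
x = list(range(64))
y = list(range(64))
linear_score = naive_mic(x, y)
```

The grid-size budget in `naive_mic` is the kind of maximal grid size restriction the abstract says ChiMIC removes, replacing it with a statistical test of whether further grid refinement is justified.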
Abstract:
Assessing the risk of chemicals is an important task in environmental protection. In this paper, we developed quantitative structure-activity relationship (QSAR) methods to evaluate the toxicity of phenol to Photobacterium phosphoreum, which is an important indicator of water quality. We first built a support vector regression (SVR) model using three descriptors, and the SVR model (t = 2) had the highest external prediction ability (MSEext = 0.068, $Q_{ext}^2$ = 0.682), about 40% higher than that of the literature model. Second, to identify more effective descriptors, we applied in-house methods to select descriptors with clear meanings from 2835 descriptors calculated by PCLIENT and used them to construct the SVR models. Our results showed that our twenty new QSAR models significantly increased the standard regression coefficient on the test set (MSEext values ranged from 0.003 to 0.063 and $Q_{ext}^2$ values ranged from 0.708 to 0.985). The Y random response permutation test and different splits of the training/test datasets also supported the excellent predictive power of the best SVR model. We further evaluated the regression significance of our SVR model and the importance of each single descriptor of the model through interpretability analysis. Our work provides a useful theoretical understanding of the toxicity of phenol analogues. (C) 2015 Published by Elsevier B.V.