Journal:
Frontiers in Genetics, 2020, 10:488214. ISSN: 1664-8021
Corresponding author:
Chen, Yuan
Author affiliations:
[Zhang, Haojian; Jiang, Heling; Yuan, Zheming; Chen, Yuan; Wang, Qifei; Liang, Yuqing] Hunan Agr Univ, Hunan Engn & Technol Res Ctr Agr Big Data Anal &, Changsha, Peoples R China.;[Tan, Siqiao] Hunan Agr Univ, Sch Informat Sci & Technol, Changsha, Peoples R China.;[Luo, Feng] Clemson Univ, Sch Comp, Clemson, SC USA.
Corresponding institution:
[Chen, Yuan] Hunan Agr Univ, Hunan Engn & Technol Res Ctr Agr Big Data Anal &, Changsha, Peoples R China.
Keywords:
RNA sequencing;Maximal information coefficient;Differential expressed gene;Gene selection;normalized differential correlation
Abstract:
For precision medicine, there is a need to identify genes that accurately distinguish the physiological state or response to a particular therapy, but this can be challenging. Many methods of analyzing differential expression have been established and applied to this problem, such as the t-test, edgeR, and DESeq2. A common feature of these methods is their focus on a linear relationship (differential expression) between gene expression and phenotype. However, they may overlook nonlinear relationships due to various factors, such as the degree of disease progression, sex, age, ethnicity, and environmental factors. The maximal information coefficient (MIC) was proposed to capture a wide range of associations between two variables, covering both linear and nonlinear relationships. However, with MIC it is difficult to highlight genes with nonlinear expression patterns, because the genes giving the most strongly supported hits are linearly expressed, especially for noisy data. It is thus important to also efficiently identify nonlinearly expressed genes in order to unravel the molecular basis of disease and to reveal new therapeutic targets. We propose a novel nonlinearity measure called normalized differential correlation (NDC) to efficiently highlight nonlinearly expressed genes in transcriptome datasets. Validation using six real-world cancer datasets revealed that the NDC method could highlight nonlinearly expressed genes that could not be highlighted by the t-test, MIC, edgeR, and DESeq2, although MIC could capture nonlinear correlations. The classification accuracy indicated that analysis of these genes could adequately distinguish cancer and paracarcinoma tissue samples. Furthermore, the results of biological interpretation of the identified genes suggested that some of them were involved in key functional pathways associated with cancer progression and metastasis. All of this evidence suggests that these nonlinearly expressed genes may play a central role in regulating cancer progression.
Author affiliations:
[谭泗桥] School of Information Science and Technology, Hunan Agricultural University, Changsha, 410128, China;[艾陈] College of Medicine, Shaoyang University, Shaoyang, 422000, China;[张席] College of Plant Protection, Hunan Agricultural University, Changsha, 410128, China;[李钎; 谭泗桥; 张席] Hunan Engineer Research Center for Information Technology in Agriculture, Changsha, 410128, China
Corresponding institution:
College of Medicine, Shaoyang University, Shaoyang, China
Abstract:
Phosphorylation is a major post-translational modification of proteins, and its prediction can be classified as kinase-specific or non-kinase-specific. This paper focuses on non-kinase-specific prediction methods. Using Dou's dataset of phosphorylation sites as the template, it develops a position-based chi-square table feature, χ²-pos, and then integrates this feature with the pseudo position-specific scoring matrix (PsePSSM). A support vector machine (SVM) classifier was built with balanced positive and negative samples. In independent tests on S, T, and Y sites, the Matthews correlation coefficient, the area under the ROC curve, and the precision were (0.59, 0.87, 79.74%), (0.55, 0.85, 77.68%) and (0.50, 0.81, 75.22%), respectively, which are significantly superior to previously reported results. The integration of χ²-pos and PsePSSM offers a promising method to predict phosphorylation sites in proteins more accurately.
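For readers unfamiliar with the three evaluation measures named in this abstract, the following is a minimal sketch of their standard textbook definitions. It is not the authors' evaluation code; the function names and the toy counts and scores are hypothetical and chosen only for illustration.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


def precision(tp, fp):
    """Fraction of predicted positives that are true positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    Counts the fraction of (positive, negative) pairs in which the positive
    sample receives the higher score; ties count as half a win.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical balanced test set: 50 true sites and 50 non-sites.
mcc_value = matthews_corrcoef(tp=40, tn=40, fp=10, fn=10)
precision_value = precision(tp=40, fp=10)
auc_value = roc_auc([0.9, 0.8], [0.1, 0.85])
```

With balanced classes, as in the abstract, the Matthews correlation coefficient and precision are not distorted by class imbalance, which is one reason to balance positive and negative samples before comparing classifiers.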
Abstract:
Toxicity prediction can provide important information for environmental protection. The toxicities of 228 alcohols and phenols were predicted using quantitative structure-activity relationship (QSAR) models. Feature selection can reduce the training time of modelling, improve the prediction accuracy and enhance the interpretability of a model. Both the dependent variables (toxicity) and the independent variables (molecular descriptors) of QSAR data sets are usually continuous. The well-known feature selection method minimal redundancy maximal relevance (mRMR) can eliminate redundancy and extract relevant features effectively, but it can only be applied to discrete dependent variables. The distance correlation (dCor) can detect nonlinear correlation between two continuous variables. In the present work, a new mRMR-dCor feature selection method was developed by combining mRMR with dCor and used to construct QSAR models for three datasets based on the retained molecular descriptors and support vector regression (SVR). The mRMR-dCor feature selection method showed better prediction performance (the Q² values of the three datasets are 0.954, 0.941 and 0.981, respectively) than the reference feature selection methods and other methods reported in the literature. Overall, mRMR-dCor feature selection has promising application prospects in the many domains that require high-dimensional feature selection, such as QSAR.
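The claim that distance correlation detects nonlinear dependence between continuous variables can be illustrated with a short sketch. It uses the standard sample definition of distance correlation (double-centered pairwise distance matrices) for one-dimensional variables only. It does not reproduce the mRMR-dCor selection procedure, whose details are not given in the abstract, and the function names and toy data are hypothetical.

```python
import math


def _centered_distances(values):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(values)
    dist = [[abs(values[i] - values[j]) for j in range(n)] for i in range(n)]
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def _mean_product(a, b):
    """Mean of the element-wise product of two n x n matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length 1-D sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    a, b = _centered_distances(x), _centered_distances(y)
    dcov_sq = _mean_product(a, b)
    denom = math.sqrt(_mean_product(a, a) * _mean_product(b, b))
    if denom == 0:
        return 0.0
    return math.sqrt(max(dcov_sq, 0.0) / denom)


def pearson(x, y):
    """Pearson correlation coefficient, included for comparison."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((u - mx) * (v - my) for u, v in zip(x, y))
    var_x = sum((u - mx) ** 2 for u in x)
    var_y = sum((v - my) ** 2 for v in y)
    return cov / math.sqrt(var_x * var_y)


# Toy data: y depends on x only through a symmetric quadratic.
x = [i / 50 - 1 for i in range(101)]
y = [v * v for v in x]
linear_y = [2 * v + 1 for v in x]

pearson_quadratic = pearson(x, y)            # close to zero
dcor_quadratic = distance_correlation(x, y)  # clearly above zero
dcor_linear = distance_correlation(x, linear_y)
```

On this toy example the Pearson coefficient is essentially zero even though y is fully determined by x, while the distance correlation is clearly positive. This is the property that makes dCor a candidate relevance and redundancy measure for continuous descriptors and continuous toxicity values.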