Evaluation of predictive accuracy of gene expression in pigs using machine learning models
Abstract:

Objective: To compare the performance of various machine learning models in predicting gene expression in pigs from cis single-nucleotide polymorphisms (cis-SNPs), and to investigate how gene cis-heritability (cis-h²) and the number of cis-SNPs relate to the prediction accuracy of different models.

Method: Using protein-coding genes from pig muscle tissue samples of the PigGTEx project, we trained 18 machine learning models on the cis-SNPs located within ±1 Mb of each gene's transcription start site and evaluated the prediction accuracy of each model.

Result: The prediction accuracy of the machine learning models was positively correlated with gene cis-h². The elastic net regression model and the Lasso regression model had the highest overall prediction accuracy, with mean R² values of 0.0362 and 0.0358, respectively. Within a certain range, prediction accuracy was also positively correlated with the number of cis-SNPs of a gene.

Conclusion: The accuracy of machine learning models in predicting pig gene expression is strongly influenced by gene cis-h² and the number of cis-SNPs. Selecting a machine learning model suited to each gene's cis-h² and number of cis-SNPs can therefore improve the accuracy of predicted expression levels.
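
The Method step can be illustrated with a minimal Python sketch, which is not the authors' pipeline. It assumes genotypes are coded as 0/1/2 allele counts in a NumPy array, selects the SNPs within ±1 Mb of a gene's transcription start site, and scores a closed-form ridge regression (one of the model families compared) by cross-validation. The function names, the 5-fold split, the `alpha` value, and the use of the squared Pearson correlation as R² are all assumptions made for illustration; the abstract does not state how R² was computed.

```python
import numpy as np

WINDOW_BP = 1_000_000  # ±1 Mb around the transcription start site (TSS), as in the Method


def cis_snp_mask(snp_chrom, snp_pos, gene_chrom, tss, window=WINDOW_BP):
    """Boolean mask of SNPs on the gene's chromosome within `window` bp of the TSS."""
    snp_chrom = np.asarray(snp_chrom)
    snp_pos = np.asarray(snp_pos)
    return (snp_chrom == gene_chrom) & (np.abs(snp_pos - tss) <= window)


def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Closed-form ridge regression on centered data (the intercept is not penalized)."""
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean()
    Xc = X_train - x_mean
    # Solve (Xc'Xc + alpha * I) beta = Xc'(y - mean(y))
    gram = Xc.T @ Xc + alpha * np.eye(Xc.shape[1])
    beta = np.linalg.solve(gram, Xc.T @ (y_train - y_mean))
    return (X_test - x_mean) @ beta + y_mean


def cross_validated_r2(X, y, n_folds=5, alpha=1.0, seed=0):
    """Squared Pearson correlation between observed and out-of-fold predicted expression.

    Assumption: this definition of R2 is chosen for illustration only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    pred = np.empty(len(y), dtype=float)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        pred[test_idx] = ridge_fit_predict(X[train_idx], y[train_idx], X[test_idx], alpha)
    r = np.corrcoef(y, pred)[0, 1]
    return r ** 2


if __name__ == "__main__":
    # Synthetic data only: 200 samples, 60 SNPs spread over 6 Mb of chromosome "1".
    rng = np.random.default_rng(1)
    n_samples, n_snps = 200, 60
    snp_chrom = np.array(["1"] * n_snps)
    snp_pos = np.linspace(0, 6_000_000, n_snps).astype(int)
    genotypes = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)

    mask = cis_snp_mask(snp_chrom, snp_pos, gene_chrom="1", tss=3_000_000)
    X_cis = genotypes[:, mask]
    effects = rng.normal(0, 0.1, size=X_cis.shape[1])
    expression = X_cis @ effects + rng.normal(0, 1.0, size=n_samples)
    r2 = cross_validated_r2(X_cis, expression)
    print(f"{mask.sum()} cis-SNPs, cross-validated R2 = {r2:.3f}")
```

Scoring on out-of-fold predictions keeps the accuracy estimate from being inflated when the number of cis-SNPs is large relative to the number of samples.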

Keywords: Machine learning; Pig; Gene expression; Transcriptome-wide association study; cis-heritability; cis-SNP

Table 1 Cis-heritability of four groups of genes

| cis-h² | Number of genes | Median value | Mean value | Standard error |
| --- | --- | --- | --- | --- |
| ≤ 0.01 | 433 | 1.000×10⁻⁶ | 1.000×10⁻³ | 1.02×10⁻⁴ |
| (0.01, 0.10] | 372 | 0.043 | 0.048 | 1.35×10⁻³ |
| (0.10, 0.20] | 139 | 0.139 | 0.144 | 2.49×10⁻³ |
| > 0.20 | 141 | 0.281 | 0.315 | 9.21×10⁻³ |

Table 2 Correlation coefficients between model prediction accuracy (R²) and gene cis-heritability (cis-h²) within different ranges of cis-SNP number

Columns give the range of cis-SNP number.

| Model | [0, 1564] | (1564, 3097] | (3097, 5384] | > 5384 |
| --- | --- | --- | --- | --- |
| OLS | 0.638 | 0.647 | 0.673 | 0.694 |
| PLS | 0.656 | 0.668 | 0.698 | 0.647 |
| Ridge | 0.649 | 0.621 | 0.639 | 0.626 |
| Lasso | 0.782 | 0.780 | 0.795 | 0.774 |
| LassoLars | 0.796 | 0.792 | 0.801 | 0.767 |
| Elastic net | 0.806 | 0.802 | 0.811 | 0.792 |
| Kernel ridge_linear | 0.651 | 0.613 | 0.654 | 0.625 |
| Kernel ridge_poly | 0.787 | 0.803 | 0.814 | 0.808 |
| Kernel ridge_RBF | 0.794 | 0.805 | 0.810 | 0.768 |
| Kernel ridge_sigmoid | 0.657 | 0.650 | 0.653 | 0.615 |
| Bayesian ridge | 0.777 | 0.778 | 0.786 | 0.750 |
| SVR_linear | 0.617 | 0.617 | 0.634 | 0.632 |
| SVR_poly | 0.742 | 0.770 | 0.788 | 0.774 |
| SVR_RBF | 0.795 | 0.801 | 0.814 | 0.772 |
| GLM_Gaussian | 0.640 | 0.671 | 0.704 | 0.709 |
| KNN | 0.710 | 0.723 | 0.754 | 0.732 |
| Decision tree | 0.618 | 0.644 | 0.680 | 0.631 |
| Random forest | 0.756 | 0.766 | 0.783 | 0.773 |

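
The layout of Table 2 can be reproduced in outline with the sketch below. It groups genes by cis-SNP count using the bin boundaries from the table and computes, within each bin, the correlation between per-gene cis-h² and per-gene R² for one model. The function name and input arrays are hypothetical, and the Pearson coefficient is an assumption, since the table does not say which correlation coefficient was used.

```python
import numpy as np

# Upper bin edges taken from Table 2; the last bin (> 5384) is open-ended.
SNP_BIN_EDGES = [1564, 3097, 5384]
SNP_BIN_LABELS = ["[0, 1564]", "(1564, 3097]", "(3097, 5384]", "> 5384"]


def r2_h2_correlation_by_snp_bin(n_cis_snps, cis_h2, r2):
    """Pearson correlation between cis-h2 and model R2 within each cis-SNP count bin.

    Each argument holds one value per gene. Bins with fewer than 3 genes
    return NaN because a correlation is not meaningful there.
    """
    n_cis_snps = np.asarray(n_cis_snps)
    cis_h2 = np.asarray(cis_h2, dtype=float)
    r2 = np.asarray(r2, dtype=float)

    # right=True closes each bin on the right, matching the (a, b] ranges in Table 2.
    bin_idx = np.digitize(n_cis_snps, SNP_BIN_EDGES, right=True)

    result = {}
    for i, label in enumerate(SNP_BIN_LABELS):
        in_bin = bin_idx == i
        if in_bin.sum() < 3:
            result[label] = float("nan")
        else:
            result[label] = float(np.corrcoef(cis_h2[in_bin], r2[in_bin])[0, 1])
    return result


if __name__ == "__main__":
    # Synthetic per-gene values, for illustration only.
    rng = np.random.default_rng(0)
    n_genes = 400
    n_cis_snps = rng.integers(0, 8000, size=n_genes)
    cis_h2 = rng.uniform(0.0, 0.5, size=n_genes)
    r2 = np.clip(0.5 * cis_h2 + rng.normal(0, 0.03, size=n_genes), 0, 1)
    for label, corr in r2_h2_correlation_by_snp_bin(n_cis_snps, cis_h2, r2).items():
        print(f"{label}: {corr:.3f}")
```

Running the function once per model, with that model's per-gene R² values, would give one row of the table per call.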