
Evaluation of predictive accuracy of gene expression in pigs using machine learning models

ZHOU Tianle, TENG Jinyan, XU Zhiting, ZHANG Zhe

Citation: ZHOU Tianle, TENG Jinyan, XU Zhiting, et al. Evaluation of predictive accuracy of gene expression in pigs using machine learning models[J]. Journal of South China Agricultural University, 2025, 46(4): 549-557. DOI: 10.7671/j.issn.1001-411X.202409024


Funding:

National Swine Industry Technology System (CARS-35); National Key Research and Development Program of China (2022YFF1000900)

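Details
    About the author:

    ZHOU Tianle, E-mail: tianle_zhou@foxmail.com

    Corresponding author:

    ZHANG Zhe, whose research focuses on molecular quantitative genetics and animal breeding, E-mail: zhezhang@scau.edu.cn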

  • CLC number: TP181; S828

Evaluation of predictive accuracy of gene expression in pigs using machine learning models

  • Abstract:
    Objective 

    To compare how well different machine learning models predict gene expression in pigs from cis single-nucleotide polymorphisms (cis-SNPs), and to investigate how the cis-heritability (cis-h²) of a gene and its number of cis-SNPs relate to the prediction accuracy of each model.

    Method 

    Using protein-coding genes from pig muscle tissue samples of the PigGTEx project, we trained 18 machine learning models on the cis-SNPs located within ±1 Mb of each gene's transcription start site, and evaluated the prediction accuracy of each model.
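The ±1 Mb window around the transcription start site can be illustrated with a minimal sketch. The function name `cis_snp_indices`, the inclusive window boundaries, and the toy positions are assumptions made for illustration; they are not taken from the PigGTEx pipeline.

```python
from bisect import bisect_left, bisect_right

WINDOW_BP = 1_000_000  # ±1 Mb around the TSS, as stated in the Method paragraph


def cis_snp_indices(tss, snp_positions, window=WINDOW_BP):
    """Return indices of SNPs lying within `window` bp of the TSS.

    `snp_positions` must be the sorted base-pair positions of SNPs on the
    same chromosome as the gene. Both window boundaries are treated as
    inclusive here, which is an assumption.
    """
    start = bisect_left(snp_positions, tss - window)
    stop = bisect_right(snp_positions, tss + window)
    return list(range(start, stop))


# Hypothetical positions for one chromosome, not real genotype data.
positions = [500_000, 1_200_000, 2_000_000, 2_999_999, 3_000_000, 3_000_001]
selected = cis_snp_indices(2_000_000, positions)
print(selected)
```

Binary search on sorted positions keeps the lookup cheap when the same SNP list is reused for every gene on a chromosome.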

    Result 

    The prediction accuracy of the machine learning models was positively correlated with the cis-h² of genes. The elastic net and Lasso regression models had the highest overall prediction accuracy, with mean R² values of 0.0362 and 0.0358, respectively. Within a certain range, prediction accuracy was also positively correlated with the number of cis-SNPs of a gene.

    Conclusion 

    The accuracy of machine learning models in predicting pig gene expression depends strongly on both the cis-h² of a gene and its number of cis-SNPs. Choosing a model suited to the cis-h² and cis-SNP number of each gene can therefore improve prediction accuracy.
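As a rough illustration of the best-performing model family, the sketch below fits an elastic net by coordinate descent and scores it with a squared Pearson correlation. Everything here is an assumption for illustration: the function names, the toy genotype matrix, the penalty values, and the choice of squared correlation as R² (the abstract does not state which R² definition was used). The penalty is parameterized with `alpha` and `l1_ratio` in the same way as scikit-learn's `ElasticNet` objective.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold` (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, alpha=0.1, l1_ratio=0.5, n_iter=200):
    """Coordinate descent for
    (1/2n)||y - Xw - b||^2 + alpha*l1_ratio*||w||_1
    + 0.5*alpha*(1 - l1_ratio)*||w||^2.
    Returns (weights, intercept)."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    weights = [0.0] * p
    residual = [yi - y_mean for yi in y]  # residual of the centered problem
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # monomorphic SNP carries no information
            rho = sum(Xc[i][j] * (residual[i] + Xc[i][j] * weights[j])
                      for i in range(n)) / n
            new_w = (soft_threshold(rho, alpha * l1_ratio)
                     / (col_sq[j] + alpha * (1.0 - l1_ratio)))
            delta = new_w - weights[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= Xc[i][j] * delta
                weights[j] = new_w
    intercept = y_mean - sum(w * m for w, m in zip(weights, col_means))
    return weights, intercept


def predict(X, weights, intercept):
    return [intercept + sum(w * x for w, x in zip(weights, row)) for row in X]


def squared_correlation(observed, predicted):
    """Squared Pearson correlation; returns 0.0 if either input is constant."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mo) * (q - mp) for o, q in zip(observed, predicted))
    vo = sum((o - mo) ** 2 for o in observed)
    vp = sum((q - mp) ** 2 for q in predicted)
    if vo == 0.0 or vp == 0.0:
        return 0.0
    return cov * cov / (vo * vp)


# Toy genotype dosages (rows: animals, columns: two hypothetical cis-SNPs).
genotypes = [[0, 1], [1, 0], [2, 1], [0, 0], [1, 1], [2, 0]]
expression = [2.0 * row[0] for row in genotypes]  # driven by the first SNP only
weights, intercept = fit_elastic_net(genotypes, expression)
r2 = squared_correlation(expression, predict(genotypes, weights, intercept))
print(weights, intercept, r2)
```

The toy score is computed on the training data only to keep the example short; a real evaluation would score predictions on animals held out from fitting, for example by cross-validation.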

  • Figure 1. Overall prediction accuracy (R²) of different machine learning models

    Red points in the columns indicate the mean R²; the numbers in the figure are the mean values.

    Figure 2. Relationship between the prediction accuracy (R²) of different machine learning models and the cis-heritability (cis-h²) of genes

    Figure 3. Prediction accuracy (R²) of the top 7 models within different cis-heritability ranges
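The model ranking summarized in these figures can be sketched as a mean of per-gene R² values for each model, sorted from highest to lowest. The function name and the toy R² values below are hypothetical; the same ranking could be applied to the genes of a single cis-h² range to obtain a per-range top list.

```python
from statistics import mean


def rank_models_by_mean_r2(r2_by_model, top_n=None):
    """Sort models by mean per-gene R² (highest first).

    `r2_by_model` maps a model name to a list of per-gene R² values.
    """
    ranked = sorted(
        ((name, mean(values)) for name, values in r2_by_model.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked if top_n is None else ranked[:top_n]


# Made-up per-gene R² values for three of the model names used in the tables.
toy_r2 = {
    "OLS": [0.01, 0.00, 0.02],
    "Elastic net": [0.10, 0.00, 0.05],
    "Lasso": [0.09, 0.00, 0.03],
}
top_two = rank_models_by_mean_r2(toy_r2, top_n=2)
print(top_two)
```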

    Table 1. Cis-heritability of four groups of genes

    | cis-h² | Number of genes | Median | Mean | Standard error |
    | ≤0.01 | 433 | 1.000×10⁻⁶ | 1.000×10⁻³ | 1.02×10⁻⁴ |
    | (0.01, 0.10] | 372 | 0.043 | 0.048 | 1.35×10⁻³ |
    | (0.10, 0.20] | 139 | 0.139 | 0.144 | 2.49×10⁻³ |
    | >0.20 | 141 | 0.281 | 0.315 | 9.21×10⁻³ |
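The grouping in Table 1 can be reproduced in outline as follows. The bin boundaries come from the table; the function names, the toy cis-h² values, and the definition of the standard error as sample standard deviation divided by the square root of the group size are assumptions.

```python
from math import sqrt
from statistics import mean, median, stdev

BIN_LABELS = ["<=0.01", "(0.01, 0.10]", "(0.10, 0.20]", ">0.20"]


def cis_h2_bin(h2):
    """Assign a cis-h² value to one of the four ranges used in Table 1."""
    if h2 <= 0.01:
        return BIN_LABELS[0]
    if h2 <= 0.10:
        return BIN_LABELS[1]
    if h2 <= 0.20:
        return BIN_LABELS[2]
    return BIN_LABELS[3]


def summarize_by_bin(h2_values):
    """Return count, median, mean and standard error per non-empty bin."""
    groups = {label: [] for label in BIN_LABELS}
    for h2 in h2_values:
        groups[cis_h2_bin(h2)].append(h2)
    summary = {}
    for label, values in groups.items():
        if not values:
            continue
        # Assumed definition: sample SD / sqrt(n); 0.0 for a single gene.
        se = stdev(values) / sqrt(len(values)) if len(values) > 1 else 0.0
        summary[label] = {
            "n": len(values),
            "median": median(values),
            "mean": mean(values),
            "se": se,
        }
    return summary


# Made-up cis-h² estimates, one per gene.
toy_h2 = [0.0, 0.01, 0.05, 0.10, 0.15, 0.25, 0.35]
table = summarize_by_bin(toy_h2)
for label in BIN_LABELS:
    print(label, table.get(label))
```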

    Table 2. Correlation coefficients between model prediction accuracy (R²) and gene cis-heritability (cis-h²) within different ranges of cis-SNP number (columns give the range of cis-SNP number)

    | Model | [0, 1564] | (1564, 3097] | (3097, 5384] | >5384 |
    | OLS | 0.638 | 0.647 | 0.673 | 0.694 |
    | PLS | 0.656 | 0.668 | 0.698 | 0.647 |
    | Ridge | 0.649 | 0.621 | 0.639 | 0.626 |
    | Lasso | 0.782 | 0.780 | 0.795 | 0.774 |
    | LassoLars | 0.796 | 0.792 | 0.801 | 0.767 |
    | Elastic net | 0.806 | 0.802 | 0.811 | 0.792 |
    | Kernel ridge_linear | 0.651 | 0.613 | 0.654 | 0.625 |
    | Kernel ridge_poly | 0.787 | 0.803 | 0.814 | 0.808 |
    | Kernel ridge_RBF | 0.794 | 0.805 | 0.810 | 0.768 |
    | Kernel ridge_sigmoid | 0.657 | 0.650 | 0.653 | 0.615 |
    | Bayesian ridge | 0.777 | 0.778 | 0.786 | 0.750 |
    | SVR_linear | 0.617 | 0.617 | 0.634 | 0.632 |
    | SVR_poly | 0.742 | 0.770 | 0.788 | 0.774 |
    | SVR_RBF | 0.795 | 0.801 | 0.814 | 0.772 |
    | GLM_Gaussian | 0.640 | 0.671 | 0.704 | 0.709 |
    | KNN | 0.710 | 0.723 | 0.754 | 0.732 |
    | Decision tree | 0.618 | 0.644 | 0.680 | 0.631 |
    | Random forest | 0.756 | 0.766 | 0.783 | 0.773 |
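A per-range correlation like the ones in Table 2 could be computed along the following lines for a single model. The range boundaries are taken from the table; the use of the Pearson coefficient, the function names, the minimum of three genes per range, and the toy records are assumptions, since this page does not state how the coefficients were calculated.

```python
from math import sqrt

SNP_BIN_EDGES = [1564, 3097, 5384]  # upper bounds of the first three ranges
SNP_BIN_LABELS = ["[0, 1564]", "(1564, 3097]", "(3097, 5384]", ">5384"]


def snp_count_bin(n_snps):
    """Assign a gene's cis-SNP count to one of the Table 2 ranges."""
    for edge, label in zip(SNP_BIN_EDGES, SNP_BIN_LABELS):
        if n_snps <= edge:
            return label
    return SNP_BIN_LABELS[-1]


def pearson(x, y):
    """Pearson correlation; NaN when either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return float("nan")
    return cov / sqrt(vx * vy)


def correlation_by_snp_bin(records, min_genes=3):
    """`records` holds (n_cis_snps, cis_h2, r2) tuples for one model."""
    grouped = {}
    for n_snps, h2, r2 in records:
        grouped.setdefault(snp_count_bin(n_snps), []).append((h2, r2))
    return {
        label: pearson([h for h, _ in pairs], [r for _, r in pairs])
        for label, pairs in grouped.items()
        if len(pairs) >= min_genes
    }


# Made-up per-gene records: (number of cis-SNPs, cis-h², R²).
toy_records = [
    (1000, 0.0, 0.00), (1200, 0.1, 0.02), (1500, 0.2, 0.04),
    (6000, 0.1, 0.05), (7000, 0.2, 0.03), (8000, 0.3, 0.01),
]
by_bin = correlation_by_snp_bin(toy_records)
print(by_bin)
```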

Publication history
  • Received: 2024-09-23
  • Published online: 2025-03-03
  • Release date: 2025-03-03
  • Issue date: 2025-07-09
