GSAE: An autoencoder with embedded gene-set nodes for genomics functional characterization

Hung I.Harry Chen, Yu Chiao Chiu, Tinghe Zhang, Songyao Zhang, Yufei Huang, Yidong Chen

Research output: Contribution to journal, Article

3 Citations (Scopus)

Abstract

Background: Bioinformatics tools have been developed to interpret gene expression data at the gene set level, and these gene set-based analyses improve biologists' ability to discover the functional relevance of their experimental design. However, gene sets are typically elucidated individually, and associations between gene sets are rarely taken into consideration. Deep learning, an emerging machine learning technique in computational biology, can be used to generate an unbiased combination of gene sets and to determine the biological relevance and analysis consistency of these combined gene sets by leveraging large genomic data sets.

Results: In this study, we proposed the gene superset autoencoder (GSAE), a multi-layer autoencoder model that incorporates a priori defined gene sets, which retain the crucial biological features in the latent layer. We introduced the concept of the gene superset, an unbiased combination of gene sets with weights trained by the autoencoder, where each node in the latent layer is a superset. With the model trained on genomic data from TCGA and evaluated against the accompanying clinical parameters, we showed that gene supersets can discriminate tumor subtypes and carry prognostic capability. We further demonstrated the biological relevance of the top component gene sets in the significant supersets.

Conclusions: Using an autoencoder model with gene supersets at its latent layer, we demonstrated that gene supersets retain sufficient biological information with respect to tumor subtypes and clinical prognostic significance. Supersets also provide high reproducibility in survival analysis and accurate prediction of cancer subtypes.
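
The layered structure described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the a priori gene sets enter the model as a binary membership mask on the gene-to-gene-set weights, and every name, layer size, activation, and the toy mask itself are hypothetical choices made for illustration. Only the forward pass and the reconstruction loss are shown; training is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): 6 genes, 3 a priori gene sets, 2 supersets.
n_genes, n_sets, n_supersets = 6, 3, 2

# Assumed membership mask: mask[g, s] = 1 if gene g belongs to gene set s.
mask = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 1],
], dtype=float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Randomly initialised weights; a real model would learn these.
W_gene_to_set = rng.normal(size=(n_genes, n_sets))
W_set_to_super = rng.normal(size=(n_sets, n_supersets))
W_super_to_gene = rng.normal(size=(n_supersets, n_genes))

def encode(x):
    # The mask zeroes every weight from a gene to a set it is not a member of,
    # so each gene-set node only sees its own genes.
    gene_set_scores = sigmoid(x @ (W_gene_to_set * mask))
    # Each latent node (a "superset") is a weighted combination of gene sets.
    supersets = sigmoid(gene_set_scores @ W_set_to_super)
    return gene_set_scores, supersets

def decode(supersets):
    # Reconstruct the expression profile from the superset layer.
    return sigmoid(supersets @ W_super_to_gene)

def top_gene_sets(superset_index, k=2):
    # Rank gene sets by the absolute weight they contribute to one superset.
    weights = np.abs(W_set_to_super[:, superset_index])
    return np.argsort(weights)[::-1][:k]

# Four synthetic samples with expression values scaled to [0, 1).
x = rng.random((4, n_genes))
gene_set_scores, supersets = encode(x)
reconstruction = decode(supersets)
loss = np.mean((x - reconstruction) ** 2)  # mean squared reconstruction error
```

In this sketch, `top_gene_sets(0)` returns the indices of the gene sets with the largest absolute weights into the first superset, which mirrors the idea of inspecting the top component gene sets of a superset.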

Original language: English (US)
Article number: 142
Journal: BMC Systems Biology
Volume: 12
DOI: 10.1186/s12918-018-0642-2
State: Published - Dec 21 2018

Keywords

  • Autoencoder
  • Deep learning
  • Gene superset analysis
  • Survival analysis

ASJC Scopus subject areas

  • Structural Biology
  • Modeling and Simulation
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

GSAE: An autoencoder with embedded gene-set nodes for genomics functional characterization. / Chen, Hung I.Harry; Chiu, Yu Chiao; Zhang, Tinghe; Zhang, Songyao; Huang, Yufei; Chen, Yidong.

In: BMC Systems Biology, Vol. 12, 142, 21.12.2018.

Chen, Hung I.Harry; Chiu, Yu Chiao; Zhang, Tinghe; Zhang, Songyao; Huang, Yufei; Chen, Yidong. / GSAE: An autoencoder with embedded gene-set nodes for genomics functional characterization. In: BMC Systems Biology. 2018; Vol. 12.

@article{905e34b39f9d450cb79d282f37872111,
title = "GSAE: An autoencoder with embedded gene-set nodes for genomics functional characterization",
keywords = "Autoencoder, Deep learning, Gene superset analysis, Survival analysis",
author = "Chen, {Hung I.Harry} and Chiu, {Yu Chiao} and Tinghe Zhang and Songyao Zhang and Yufei Huang and Yidong Chen",
year = "2018",
month = "12",
day = "21",
doi = "10.1186/s12918-018-0642-2",
language = "English (US)",
volume = "12",
journal = "BMC Systems Biology",
issn = "1752-0509",
publisher = "BioMed Central",

}

TY - JOUR

T1 - GSAE

T2 - An autoencoder with embedded gene-set nodes for genomics functional characterization

AU - Chen, Hung I.Harry

AU - Chiu, Yu Chiao

AU - Zhang, Tinghe

AU - Zhang, Songyao

AU - Huang, Yufei

AU - Chen, Yidong

PY - 2018/12/21

Y1 - 2018/12/21

KW - Autoencoder

KW - Deep learning

KW - Gene superset analysis

KW - Survival analysis

UR - http://www.scopus.com/inward/record.url?scp=85058915812&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85058915812&partnerID=8YFLogxK

U2 - 10.1186/s12918-018-0642-2

DO - 10.1186/s12918-018-0642-2

M3 - Article

C2 - 30577835

AN - SCOPUS:85058915812

VL - 12

JO - BMC Systems Biology

JF - BMC Systems Biology

SN - 1752-0509

M1 - 142

ER -