TY - JOUR
T1 - Predicting Sites of Epitranscriptome Modifications Using Unsupervised Representation Learning Based on Generative Adversarial Networks
AU - Salekin, Sirajul
AU - Mostavi, Milad
AU - Chiu, Yu Chiao
AU - Chen, Yidong
AU - Zhang, Jianqiu
AU - Huang, Yufei
N1 - Funding Information:
We thank UTSA's HPC cluster Shamu, operated by the Office of Information Technology, for computational support. Funding: This work was supported by the National Institutes of Health (R01GM113245 to YH, CTSA 1UL1RR025767-01 to YC, and K99CA248944 to Y-CC), Cancer Prevention and Research Institute of Texas (RP190346 to YC and YH and RP160732 to YC), San Antonio Life Sciences Institute (SALSI Innovation Challenge Award 2016 to YH and YC and SALSI Post-doctoral Research Fellowship 2018 to Y-CC), and the Fund for Innovation in Cancer Informatics (ICI Fund to Y-CC and YC).
Publisher Copyright:
© 2020 Salekin, Mostavi, Chiu, Chen, Zhang and Huang.
PY - 2020/6/19
Y1 - 2020/6/19
AB - Epitranscriptomics is an exciting field that studies different types of modifications in transcripts, and the prediction of such modification sites from the transcript sequence is of significant interest. However, the scarcity of positive sites for most modifications poses critical challenges for training robust algorithms. To circumvent this problem, we propose MR-GAN, a generative adversarial network (GAN)-based model, which is trained in an unsupervised fashion on entire pre-mRNA sequences to learn a low-dimensional embedding of transcriptomic sequences. MR-GAN was then applied to extract embeddings of the sequences in a training dataset we created for nine epitranscriptome modifications, namely, m6A, m1A, m1G, m2G, m5C, m5U, 2′-O-Me, pseudouridine (Ψ), and dihydrouridine (D), for which positive samples are very limited. Prediction models were trained on the embeddings extracted by MR-GAN. We compared the prediction performance against one-hot encoding of the training sequences and against SRAMP, a state-of-the-art m6A site prediction algorithm, and demonstrated that the learned embeddings outperform one-hot encoding by a significant margin, with up to 15% improvement. Using MR-GAN, we also investigated the sequence motifs for each modification type and uncovered known motifs as well as new motifs that could not be identified from the sequences directly. The results demonstrated that transcriptome features extracted using unsupervised learning could lead to high precision for predicting multiple types of epitranscriptome modifications, even when the data size is small and extremely imbalanced.
KW - N6-methyladenosine (m6A)
KW - RNA modification site prediction
KW - epitranscriptome
KW - generative adversarial networks (GANs)
KW - methylated RNA immunoprecipitation sequencing (MeRIP-Seq)
KW - unsupervised representation learning
UR - http://www.scopus.com/inward/record.url?scp=85087484249&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85087484249&partnerID=8YFLogxK
U2 - 10.3389/fphy.2020.00196
DO - 10.3389/fphy.2020.00196
M3 - Article
AN - SCOPUS:85087484249
VL - 8
JO - Frontiers in Physics
JF - Frontiers in Physics
SN - 2296-424X
M1 - 196
ER -