Search in:

RNA Biology Volume 18, 2021 - Issue sup2

Submit an article Journal homepage

Free access

1,601

Views

CrossRef citations to date

Altmetric

Listen

Brief Communication

GlyinsRNA: a webserver for predicting glycosylation sites on small RNAs

Chunmei CuiDepartment of Biomedical Informatics, Moe Key Lab of Cardiovascular Sciences, School of Basic Medical Sciences, Peking University, Beijing, ChinaView further author information

Xiaobin WuDepartment of Biomedical Informatics, Moe Key Lab of Cardiovascular Sciences, School of Basic Medical Sciences, Peking University, Beijing, ChinaView further author information

Yuan ZhouDepartment of Biomedical Informatics, Moe Key Lab of Cardiovascular Sciences, School of Basic Medical Sciences, Peking University, Beijing, ChinaCorrespondence[email protected]
View further author information

Pages 600-603 | Received 08 Jul 2021, Accepted 15 Sep 2021, Published online: 24 Sep 2021

Cite this article
https://doi.org/10.1080/15476286.2021.1982574
CrossMark

In this article

ABSTRACT
Introduction
Results and discussion
Materials and methods
Supplemental material
Acknowledgements
Disclosure statement
Additional information
References

Full Article
Figures & data
References
Supplemental
Citations
Metrics
Reprints & Permissions
View PDF PDF View EPUB EPUB

ABSTRACT

Versatile RNA modifications play important roles in post-transcriptional regulations of gene expression, among which glycosylation modifications on small RNAs emerge as a novel clade whose characteristics need further interrogations. Here, we demonstrated that the sequence pattern around RNA glycosylation sites was not random and could be exploited for glycosylation site prediction. A machine learning predictor, GlyinsRNA, which integrated multiple RNA sequence representation encodings, was established. GlyinsRNA achieved AUROC (area under the receiver operating characteristic curve) of 0.7933 and 0.7979 in five-fold cross-validation and independent tests, respectively. GlyinsRNA was implemented as an online webserver, where both the predicted glycosylation sites and the overrepresented RNA-binding protein (RBP)-related motifs were annotated to facilitate the users. GlyinsRNA webserver is freely available at http://www.rnanut.net/glyinsrna.

KEYWORDS:

RNA modification
RNA glycosylation
modification site prediction
webserver
machine learning
RNA sequence analysis

Introduction

Various RNA modifications are extensively involved in the post-transcriptional gene expression regulation processes like RNA localization, RNA degradation, alternative splicing and translational efficiency regulation [Citation1]. Although more than one hundred types of RNA modification had been discovered [Citation1], just recently, the glycosylation modification has been shown to further extend the scope of RNA modification world [Citation2]. Unlike most of the known types of RNA modifications, in glycosylation, instead of typical small chemical group (e.g. methyl- or acetyl-group), a much larger carbohydrate residue (e.g. sialic acid and fucose) was conjugated onto the RNA substrates. In the leading reports from Flynn et al., RNA glycosylation was mainly enriched on small RNAs and plays roles on RNA localization and RNA recognition, suggesting its functional importance in cell signalling [Citation2]. However, the mechanism-of-action for RNA glycosylation remains to be fully elucidated, of which whether and how the sequence pattern contributes to this process has not been investigated.

In this study, we demonstrated that the sequence context around RNA glycosylation sites contains predictive information for RNA glycosylation, where both the positional sequence pattern and the RNA-binding protein (RBP)-related motifs are distributed in a non-random fashion. Based on these observations, GlyinsRNA, a sequence-based machine learning predictor of glycosylation sites on small RNA, was established.

Results and discussion

Non-random sequence pattern around RNA glycosylation sites

The sequence context around one site can be represented by a sequence window of 51 nucleotides (51 nt) with the modification sites at the centre. shows the positional sequence pattern of RNA glycosylation sites, where several positions, especially those proximal to the central glycosylation site (e.g. positions 0, +1, +2, +4 and +5), show non-random nucleotide propensities. On the other hand, however, no positional nucleotide propensity is observed for most of the other positions. One possible explanation is that there are position-independent sequence patterns (like short motifs) in addition to the observed positional sequence patterns. To test this possibility, we scanned the sequence windows with the pre-computed binding motifs of known RBPs. Indeed, the motifs of seven RBPs (RPS11, LIN28B, CPSF-100, IGF2BP1, FAM120A, DDX54 and NOP58) were significantly overrepresented in the sequence windows of the glycosylation sites in comparison with those of the non-modified sites (, bar plots), and most of the results could be still observed with shorter window sizes until those of 37 nt (Supplementary Fig. S1), indicating plausible functional links between glycosylated RNA and these RBP regulators. Weak or moderate position preferences were observed when checking the positional distribution of few overrepresented motifs along the sequence windows (, violin plots). For example, RPS11 and LIN28B have a tendency of binding to the upstream regions of glycosylation sites. Nonetheless, no strong position-biased distributions could be observed for any of these motifs. These results suggest the largely position-independent, short motif-like sequence patterns around glycosylation sites.

Figure 1. The positional and position-independent sequence pattern around the RNA glycosylation sites. (a) The positional sequence pattern around glycosylation sites. (b) The overrepresented RBP-related motifs found in the flanking sequence windows of glycosylation sites

Sequence-based prediction of RNA glycosylation site by machine learning models

Based on the above observations, we employed multiple sequence encoding schema to depict this non-random sequence pattern and trained random forest models based on these sequence feature encodings. We firstly optimized the window size k for each encoding. The performances of prediction models changing with the window size were summarized in Supplementary Fig. S2. The optimal window sizes of models by six encodings (PSNSP, PSDSP, One-hot, KNF, KSNPF, and PseKNC) were 51 nt, 51 nt, 51 nt, 17 nt, 19 nt and 37 nt, respectively.

Then, we constructed the models using the optimal window size for each encoding. As shown in , the best-performed model in cross-validation is that based on the PSDSP encoding (AUROC = 0.7704). We further tested if combining multiple models could achieve better performance. As a result, a predictor combining three encodings (PSDSP, One-hot and KNF) was found to be the most competitive one (AUROC = 0.7933). Interestingly, One-hot [Citation3,Citation4] and KNF [Citation5,Citation6] encodings were designed to represent positional and position-independent sequence patterns, while PSDSP [Citation7] encoding considered both aspects. Thus, the selection of the best encoding combination again suggests that both the positional and the position-independent sequence patterns provide predictive information for glycosylation sites.

Figure 2. Predictor performance evaluation results. (a) The ROC curve showing the performance of single and combined random forest models in five-fold cross-validation. (b) The ROC curve showing the performance of predictor on independent test dataset

To further validate the predictor performance, we tested the combined predictor on the independent test dataset. Samples in this dataset were not used for model training nor for model parameter and model combination selection and therefore could serve as a more rigorous and unbiased validation of the performance. On the independent test dataset (), the combined predictor (AUROC = 0.7979) also outperformed the best single-encoding-based model (AUROC = 0.7711). We also tried to evaluate the prediction model performance by using the precision–recall curve and found similar tendency (Supplementary Fig. S3).

GlyinsRNA: webserver for small RNA glycosylation site prediction

Based on the above performance evaluation results, we finally selected the combined predictor to build our RNA glycosylation site prediction tool, i.e. GlyinsRNA. GlyinsRNA has been implemented as an online webserver, which requires only RNA sequences as its input. Users could also select one stringent threshold according to their expected specificity level. Using a low-specificity threshold means to report more low-scored prediction results and therefore allow a higher tolerance to false-positive results. Here, we provided six alternatives for specificity level to users, including 99.9%, 99.5%, 99.0%, 95.0%, 90.0% and 85.0%. The primary output of the webserver is the position and score of each predicted glycosylation site. Besides, the sequence context around each predicted glycosylation site is also displayed, where both the glycosylation sites and overrepresented RBP-related motifs (if any) are highlighted by the distinct colours (Supplementary Fig. S4). We hope this easy-to-use webserver would facilitate further experimental investigations of small RNA glycosylation in the future.

Materials and methods

Dataset

The genomic coordinate annotation of glycosylation site was obtained from GEO (accession: GSE136967). The sites exclusively observed on the glycoRNA library (i.e. RNA library with glycosylation modification) were selected as the positive samples, while those only observed in the input library (i.e. background RNA library) but not in the glycoRNA library were assigned as the unmodified negative samples. Compared to the random negative sites, such selection could exclude unrelated sites and therefore is more rigorous. We primarily obtained 2720 positive and 5078 negative sites, which were randomly separated into training dataset and independent test dataset with the proportion of 7:3 (which is same for both the positive samples and the negative samples) to construct and assess the predictor. After mapping the sample sites to small RNA sequences, we employed CD-HIT-EST tool [Citation8] to remove the redundant sequence samples with 90% identity threshold, by which 857 positive samples and 1068 negative samples were retained. Among all retained sample sequences, 114 positive and 186 negative samples were in the independent test set, while the rest were in the training set (with no overlap between the training and the testing samples).

Sequence pattern analysis

The sequence context around one site can be represented by a sequence window of k nucleotides with the modification/unmodified sites at the centre. The window size k was preliminarily optimized as 51 nt according to the cross-validation performance of the best-performed encodings. Meanwhile, we also explored the enrichment of RBP motifs with other window sizes and observed overrepresentation of most of the seven RBP motifs in shorter windows until those of 37 nt (Supplementary Fig. S1). If a sequence window extends beyond the border of an RNA, gap characteristics were filled to ensure all sequence windows were aligned to the centre. To investigate the positional sequence pattern of glycosylation sites, the aligned positive and negative sequence windows were analysed by TwoSampleLogo server [Citation9]. Besides, we screened the significantly overrepresented RBP motifs in the positive sequence windows (in contrast to the negative sequence windows) by the AME tool of MEME Suite [Citation10]. The Bonferroni adjusted P-value <0.05 was adopted as the significance threshold for the motif overrepresentation. The known RBP motifs were derived from the high-throughput CLIP-Seq experiment datasets, with the detailed description available in the previous work [Citation11].

Predictor construction and evaluation

Here, the One-hot [Citation3,Citation4], PSNSP [Citation12], PSDSP [Citation7], KNF [Citation5,Citation6], KSPNF [Citation4,Citation13] and PseKNC [Citation14] encodings were employed to depict the sequence pattern around glycosylation sites. These encodings are briefly introduced as following:

1) One-hot encoding translates four nucleotides, A, C, G, T, into binary numeric vectors, 0001,0010,0100,1000.

2) Position-specific nucleotide sequence profile (PSNSP) encoding displays the difference in frequencies of four nucleotides between positive samples and negative samples at each position of the sequence window.

3) Position-specific dinucleotide sequence profile (PSDSP) encoding is similar to PSNSP encoding, while it compares the frequency difference in dinucleotide between positive samples and negative samples.

4) K-nucleotide frequencies (KNF) encoding depicts the frequencies of continuous k-nucleotide on the sequence. The selection of K was preliminarily optimized as 2, 3 and 4.

5) K-spaced nucleotide pair frequencies (KSPNF) encoding also describes the frequencies of dinucleotide on the sequence. The dinucleotide is split by K-random nucleotides, and selection of K was preliminarily optimized as 0, 1, 2, 3 and 4.

6) Pseudo k-tuple nucleotide composition (PseKNC) encoding is the combination of the frequency and the physico-chemical property of dinucleotide or trinucleotide.

For each encoding, a random forest model was trained, and the prediction results from each were combined by the weighted summing schema where the weights of all models sum to 1. To determine the optimal window sizes of sequences, we trained models with various windows from 13 nt (larger than the RBP motif length which is 12 nt) to 51 nt with the step of 2 and selected optimal window sizes for each encoding. The best combination of random forest models was selected based on five-fold cross-validation performance. Besides, we also tested the models on the independent test set to avoid overestimation of performance. The performance was measured by typical area under the receiver operating characteristic curve (AUROC) metric. Finally, the predictor was implemented as an online webserver which was constructed under the ‘Linux + Apache + PHP’ framework.

Author contributions

CC established and tested the models. CC constructed the online webserver. CC and XW performed the sequence analysis. YZ and CC prepared and revised the manuscript. YZ supervised the study.

Supplemental material

Supplemental Material

Download MS Word (1.6 MB)

Acknowledgments

We thank Prof. Qinghua Cui, Yiran Zhou and Wanqing Zhao for their suggestions.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The RNA glycosylation site data from Flynn et al. is publicly available in public GEO series (accession: GSE136967) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE136967here

Supplementary material

Supplemental data for this article can be accessed here.

Additional information

Funding

This work has been supported by the National Natural Science Foundation of China [31801099 and 32070658 to Y.Z.].

References

Wiener D, Schwartz S. The epitranscriptome beyond m(6)A. Nat Rev Genet. 2021;22(2):119–131.
PubMed Web of Science ®Google Scholar
Flynn RA, Pedram K, Malaker SA, et al. Small RNAs are modified with N-glycans and displayed on the surface of living cells. Cell. 2021;184(12):3109–24 e22.
PubMed Web of Science ®Google Scholar
Uriarte-Arcia AV, Lopez-Yanez I, Yanez-Marquez C. One-hot vector hybrid associative classifier for medical data classification. PloS One. 2014;9(4):e95715.
PubMed Web of Science ®Google Scholar
Zhou Y, Zeng P, Li YH, et al. SRAMP: prediction of mammalian N6-methyladenosine (m6A) sites based on sequence-derived features. Nucleic Acids Res. 2016;44(10):e91.
PubMed Web of Science ®Google Scholar
Liu B, Fang L, Wang S, et al. Identification of microRNA precursor with the degenerate K-tuple or Kmer strategy. J Theor Biol. 2015;385:153–159.
PubMed Web of Science ®Google Scholar
Xiang S, Yan Z, Liu K, et al. AthMethPre: a web server for the prediction and query of mRNA m6A sites in Arabidopsis thaliana. Mol Biosyst. 2016;12(11):3333–3337.
PubMedGoogle Scholar
Li GQ, Liu Z, Shen HB, et al. TargetM6A: identifying N6-Methyladenosine sites from RNA sequences via position-specific nucleotide propensities and a support vector machine. IEEE Trans Nanobioscience. 2016;15(7):674–682.
PubMed Web of Science ®Google Scholar
Huang Y, Niu B, Gao Y, et al. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26(5):680–682.
PubMed Web of Science ®Google Scholar
Vacic V, Iakoucheva LM, Radivojac P. Two sample logo: a graphical representation of the differences between two sets of sequence alignments. Bioinformatics. 2006;22(12):1536–1537.
PubMed Web of Science ®Google Scholar
Bailey TL, Boden M, Buske FA, et al. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 2009;37:W202–8.
PubMed Web of Science ®Google Scholar
Wu X, Zhao W, Cui Q, et al. Computational screening of potential regulators for mRNA-protein expression level discrepancy. Biochem Biophys Res Commun. 2020;523(1):196–201.
PubMed Web of Science ®Google Scholar
Wang X, Yan R, Li J, et al. SOHPRED: a new bioinformatics tool for the characterization and prediction of human S-sulfenylation sites. Mol Biosyst. 2016;12(9):2849–2858.
PubMedGoogle Scholar
Chen YZ, Tang YR, Sheng ZY, et al. Prediction of mucin-type O-glycosylation sites in mammalian proteins using the composition of k-spaced amino acid pairs. BMC Bioinformatics. 2008;9(1):101.
PubMedGoogle Scholar
Chen W, Zhang X, Brooker J, et al. PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions. Bioinformatics. 2015;31(1):119–120.
PubMed Web of Science ®Google Scholar

Reprints and Corporate Permissions

Please note: Selecting permissions does not provide access to the full text of the article, please see our help page How do I view content?

To request a reprint or corporate permissions for this article, please click on the relevant link below:

Order Reprints Request Corporate Permissions

Academic Permissions

Please note: Selecting permissions does not provide access to the full text of the article, please see our help page How do I view content?

Obtain permissions instantly via Rightslink by clicking on the button below:

Request Academic Permissions

If you are unable to obtain permissions via Rightslink, please complete and submit this Permissions form. For more information, please visit our Permissions help page.

Download PDF

Share icon
Back to Top

Related research

People also read lists articles that other readers of this article have read.

Recommended articles lists articles that we recommend and is powered by our AI driven recommendation engine.

Cited by lists all citing articles based on Crossref citations.
Articles with the Crossref icon will open in a new tab.

People also read
Recommended articles
Cited by

To cite this article:

Reference style: APA Chicago Harvard

Citation copied to clipboard

Reference styles above use APA (6th edition), Chicago (16th edition) & Harvard (10th edition)

Download citation

Download a citation file in RIS format that can be imported by citation management software including EndNote, ProCite, RefWorks and Reference Manager.

Choose format: RIS BibTex RefWorks Direct Export

Choose options: Citation Citation & abstract Citation & references

Your download is now in progress and you may close this window

Did you know that with a free Taylor & Francis Online account you can gain access to the following benefits?

Choose new content alerts to be informed about new research of interest to you
Easy remote access to your institution's subscriptions on any device, from any location
Save your searches and schedule alerts to send you new results
Export your search results into a .csv file to support your research

Have an account?
Login now Don't have an account?
Register for free

Login or register to access this feature

Have an account?
Login now Don't have an account?
Register for free

Choose new content alerts to be informed about new research of interest to you
Easy remote access to your institution's subscriptions on any device, from any location
Save your searches and schedule alerts to send you new results
Export your search results into a .csv file to support your research

GlyinsRNA: a webserver for predicting glycosylation sites on small RNAs

ABSTRACT

Introduction

Results and discussion

Non-random sequence pattern around RNA glycosylation sites

Sequence-based prediction of RNA glycosylation site by machine learning models

GlyinsRNA: webserver for small RNA glycosylation site prediction

Materials and methods

Dataset

Sequence pattern analysis

Predictor construction and evaluation

Author contributions

Supplemental Material

Acknowledgments

Disclosure statement

Data availability statement

Supplementary material

References

Information for

Open access

Opportunities

Help and information

GlyinsRNA: a webserver for predicting glycosylation sites on small RNAs

ABSTRACT

Introduction

Results and discussion

Non-random sequence pattern around RNA glycosylation sites

Sequence-based prediction of RNA glycosylation site by machine learning models

GlyinsRNA: webserver for small RNA glycosylation site prediction

Materials and methods

Dataset

Sequence pattern analysis

Predictor construction and evaluation

Author contributions

Supplemental Material

Acknowledgments

Disclosure statement

Data availability statement

Supplementary material

Additional information

Funding

References

Reprints and Corporate Permissions

Academic Permissions

Related research

To cite this article:

Download citation

Your download is now in progress and you may close this window

Login or register to access this feature

Information for

Open access

Opportunities

Help and information

Keep up to date