The Single-molecule long-read sequencing of Scylla paramamosain

Aug 29, 2019Scientific reports

Detailed genetic sequencing of the mud crab Scylla paramamosain

AI simplified

Circadian Biology on OpenScience ↗PubMed ↗DOI ↗OA ↗

Abstract

A total of 79,005 high-quality unique transcripts were obtained from 284,803 full-length non-chimeric reads sequenced from various tissues of the crab Scylla paramamosain.

The transcriptome was constructed from 12 different tissues, including gill, hepatopancreas, and muscle.
A total of 52,544 transcripts were successfully annotated using multiple protein databases.
The study identified 23,644 (lncRNAs) and 131,561 (SSRs).
Isoforms of many genes were also detected in the analysis.
This research provides a comprehensive set of full-length cDNA sequences for Scylla paramamosain, which may enhance future studies on this species.

AI simplified

Scylla paramamosain is an important aquaculture crab, which has great economical and nutritional value. To the best of our knowledge, few full-length crab transcriptomes are available. In this study, a library composed of 12 different tissues including gill, hepatopancreas, muscle, cerebral ganglion, eyestalk, thoracic ganglia, intestine, heart, testis, ovary, sperm reservoir, and hemocyte was constructed and sequenced using Pacific Biosciences single-molecule real-time (SMRT) long-read sequencing technology. A total of 284803 full-length non-chimeric reads were obtained, from which 79005 high-quality unique transcripts were obtained after error correction and sequence clustering and redundant. Additionally, a total of 52544 transcripts were annotated against protein database (NCBI nonredundant, Swiss-Prot, KOG, and KEGG database). A total of 23644 (lncRNAs) and 131561 (SSRs) were identified. Meanwhile, the isoforms of many genes were also identified in this study. Our study provides a rich set of full-length cDNA sequences for S. paramamosain, which will greatly facilitate S. paramamosain research.

Key numbers

79005

Total Unique Transcripts

High-quality unique transcripts obtained from sequencing.

66.5%

Annotated Transcripts Percentage

Percentage of transcripts annotated against protein databases.

23154

Identified

Common predicted by three different bioinformatics software.

Full Text

What this is

This research focuses on the sequencing of the crab Scylla paramamosain using advanced long-read technology.
A library from 12 different tissues was constructed and sequenced, yielding a wealth of transcript data.
The study aims to enhance genetic understanding and facilitate future research in crustaceans.

Essence

The study produced 79,005 high-quality unique transcripts from Scylla paramamosain, significantly enriching the genetic information available for this species.

Key takeaways

A total of 284,803 full-length non-chimeric reads were obtained, leading to 79,005 high-quality unique transcripts. This extensive dataset enhances the genetic resources available for S. paramamosain.
The sequencing identified 52,544 annotated transcripts, with 66.5% successfully matched to protein databases. This high annotation rate supports the utility of the dataset for functional studies.
The analysis revealed 23,154 (), providing a foundation for further exploration of their roles in crustacean biology.

Caveats

The study's reliance on third-generation sequencing technology, while advantageous, may still encounter challenges related to raw data error rates and the absence of a reference genome.
Identifying was limited by the lack of genomic data for S. paramamosain, which could affect the classification and validation of these transcripts.

Definitions

long non-coding RNAs (lncRNAs): Non-coding RNA molecules longer than 200 nucleotides that play critical roles in regulating gene expression.
simple sequence repeats (SSRs): Short, repetitive sequences in DNA that can serve as genetic markers.

AI simplified

Introduction

Scylla paramamosain is an important aquaculture crab and has great economical and nutritional value. According to the statistics result, it had been estimated that the aquaculture production of S. paramamosain reached approximately 157,712 tons in China in 2018 (China Fishery Statistical Yearbook 2019). Up to date, the genome information for most crustaceans is not available. However, the application of second-generation sequencing technologies that do not need genome data has greatly accelerated the research of the crustacean. In crab and shrimp, the high-throughput sequencing technology has been applied in Eriocheir sinensis^1–21, Portunus trituberculatus^22–29, S. paramamosain^30–33, S. olivacea³⁴, Carcinus maenas^35,36, Gecarcinus lateralis^37–41, P. sanguinolentus⁴², Charybdis feriatus⁴³, Litopenaeus vannamei^44–48, Macrobrachium rosenbergii^49–52, M. nipponense^53,54, Exopalaemon carinicauda^55–57, Oratosquilla oratoria^58–60, Homarus americanus⁶¹, and so on. Many genes related with reproduction, growth, and immunity of crab and shrimp have been obtained through the transcriptome data.

However, the length of sequencing reads obtained using the second-generation sequencing technologies was usually short (usually 100–250 bp), which needs further bioinformatics analysis to assemble using the software such as Trinity to obtain the transcript sequence⁶². But it had been estimated that many repetitive elements exist in the crustacean genome DNA^63,64, which could influence the assembled result, such as the undesirable N50 length of assembled unigenes and the majority of non-full-length transcript sequences.

The third-generation sequencing technology is also called the single-molecule real-time sequencing technology which include smart sequencing and nanopore sequencing developed by Pacific Biosciences and Oxford Nanopore Technologies, respectively. Compared to the second-generation sequencing technologies, the third-generation sequencing technology has many advantages, such as (1) the longer sequencing length, (2) the obtainment of full-length transcripts, (3) the direct sequencing without the need for fragmentation or post-sequencing assembly, (4) the analysis of alternative splicing⁶⁵. But up to date, the application of the third-generation sequencing technology in crustacean is scare.

In this study, a RNA library consisted of multiple tissues of S. paramamosain (gill, hepatopancreas, muscle, cerebral ganglion, eyestalk, thoracic ganglia, intestine, heart, testis, ovary, sperm reservoir and hemocyte) was constructed and sequenced using the third-generation sequencing technology (Pacbio) for the first time, which would not only further enrich the genetic information and promote the application of proteomic techniques in S. paramamosain, but also pave the way for the application of the third-generation sequencing technology in other crustacean.

Results

The quality examination of pooled RNA used for library construction and the evaluation of sequencing result

The quality of pooled total RNA extracted from twelve tissues was examined before library construction. The examined result indicated that the RNA was high quality and was appropriate for following experiment. The evaluation of sequencing result was carried out using 3 methods and the results were as follows: (1) The analysis result of BUSCO software revealed that 876 (82.2%) complete single-copy and duplicated BUSCOs, 34 (3.2%) fragmented BUSCOs (Benchmarking Universal Single Copy Orthologs), 156 (14.6%) missing BUSCOs (Fig. 1) (2) the aligned ratio of published transcriptome data sequenced by second-generation technology with that sequenced by Pacbio technology in this study was more than 77% (3) the sequences of published genes (relish, dorsal, TGF-beta type I receptor and amine oxidase) were consistent with sequencing result performed by Pacbio technology.

Figure 1

The evaluation of sequencing result analyzed by BUSCO software.

Functional annotation of transcripts

The identified transcripts were blasted against protein database (Nr, Swiss-prot, KOG, and KEGG) and the result indicated that a total of 52,544 transcripts (66.5%) were annotated. Of which 52,262 transcripts were annotated in Nr database, 41,054 transcripts in Swiss-prot database, 37,117 transcripts in KOG database, and 27,777 transcripts in KEGG database. The venn diagram was shown in Fig. 2. GO analysis result indicated that 13,441 transcripts were annotated in biological process, 7,288 transcripts in molecular function, and 8,055 transcripts in cellular component. The detail information of GO annotation was shown in Fig. 3.

According to the annotated results, the species distribution of transcripts BLASTx matches against the Nr protein database was performed and the result indicated that the top 10 species all belong to invertebrate, which included Hyalella Azteca, S. paramamosain, Zootermopsis nevadensis, Thermobia domestica, Daphnia magna, Limulus Polyphemus, Diaphorina citri, Lingula anatine, E. sinensis, and L. vannamei. The detailed information of species distribution was shown in Fig. 4.

Figure 2

The venn diagram of annotated result in 4 different databases.

Figure 3

The GO annotation result.

Figure 4

The top 10 species distribution of BLASTX results.

Identification of long non-coding RNAs (lncRNAs)

In this study, the coding potential of the unannotated transcripts was analyzed with three different bioinformatics softwares, Coding Potential Calculator (CPC), Coding-Non-Coding Index (CNCI), and Protein family (Pfam). The predicted result revealed that 24,201 LncRNAs were identified with the software of CPC, 23,644 LncRNAs with the software of CNCI and 26,147 LncRNAs with the software of Pfam, among which 23,154 common LncRNAs were predicted by three different bioinformatics software (Fig. 5).

Figure 5

The venn diagram of LncRNAs prediction result by three softwares.

Identification of simple sequence repeats (SSRs)

A total of 131,561 SSRs were identified across all the transcripts, with 28,267 transcripts containing more than one SSR. Most of the SSRs identified were di-nucleotide repeats (58.53%), followed by the tri-nucleotide repeats (30.35%), tetra-nucleotide repeats (8.96%), penta-nucleotide repeats (1.82%) and Hexa-nucleotide (0.34%). In the di-nucleotide repeats, tri-nucleotide repeats, tetra-nucleotide repeats, the motif of AC/GT, AAT/ATT and AAAT/ATTT was the most dominant style, respectively. The detailed information was shown in Fig. 6A,B.

Figure 6

Distribution of simple sequence repeat (SSR) nucleotide classes among different nucleotide types found in the transcriptome of.. S paramamosain

The analysis of alternative splicing in transcriptome

The analysis result of alternative splicing indicated that there were seven different types existing in transcriptome, including 247 skipping exon (SE), 580 alternative 5′ splice site (A5), 600 alternative 3′ splice site (A3), 160 mutually exclusive exon (MX), 1780 retained intron (RI), 38 alternative first exon (AF), and 40 alternative last exon (AL), among which retained intron was the main type of alternative splicing, accounting for more than 5% (Fig. 7). The isoform analysis result indicated that the isoform number of some genes was more than ten (Fig. 8). For example, a total of 22 different isoforms of LIM domain-binding protein 3 were identified in this study and the sequence analysis result was shown in Fig. 9 (an example of RI). Additionally, 7 different isoforms of ferritin were identified and the sequence analysis result was shown in Fig. 10 (an example of A5).

Figure 7

The statistics of alternative splicing events in the transcriptome of.shown in pie chart. Note: A3 represents alternative 3′ splice site, A5 represents alternative 5′ splice site, AF represents alternative first exon, AL represents alternative last exon, MX represents mutually exclusive exon, RI represents retained intron, SE represents skipping exon. S paramamosain

Figure 8

The statistics result of isoform of some genes.

Figure 9

The sequence analysis of different isoforms of LIM domain-binding protein 3. Note: COGENT002635 represents the super-transcripts constructed with different isoforms of LIM domain-binding protein 3, the others represent the different isoforms.

Figure 10

The sequence analysis of different isoforms of ferritin. Note: COGENT003937 represents the super-transcripts constructed with different isoforms of ferritin, the others represent the different isoforms.

The validation of sequencing result with several published full-length genes

In order to validate the accuracy of sequencing result, several published genes, for example, relish (GI number MH047674.1), dorsal (GI number MH047675.1), TGF-beta type I receptor (GI number MH187960.1), and amine oxidase (GI number MG878093.1) were blasted against sequencing result using the blast software and the results indicated that the sequences of several published full-length genes were completely identical to the sequencing result except dorsal gene, which indicated the accuracy of sequencing result. The detailed blast results were shown in Supplemental File.

Discussion

The obtainment of full-length gene is the first step to study gene function, but it can’t obtain on a large scale and is time consuming, labor intensive and expensive through rapid amplification of cDNA ends (RACE) technology in general. With the development of technology, the second-generation sequencing technologies are developed such as Illumine, Roche 454, Solexa, SOLID, the sequencing reads length of which is usually short. Though, part of full-length transcripts could be obtained through the transcriptome data sequenced by second-generation sequencing technologies on a large scale, majority of assembled transcripts is short and is not full-length. The third-generation sequencing technology is the most advanced technology, which could obtain full-length transcripts on a large scale. In this study, a total of 79005 high-quality unique transcripts is obtained, among which 50% transcripts is full-length, which is more efficient than RACE and the second-generation sequencing technology^30,32,33,48. These full-length transcripts identified in this study will facilitate further study of S. paramamosain.

It is well known that the sequencing length of the third-generation sequencing technology could reach as long as 2 Mb, avoiding the influence of the complex repeat motif. In this study, the longest transcript is 14701 bp and the N50 (an important parameter used for evaluating the quality of assembly) is 3160 bp, which is longer than that in S. paramamosain studies that used the second-generation sequencing technologies. For instance, in the gonad transcriptome, gill transcriptome, and hemocyte transcriptome of S. paramamosain, the N50 of assembled unigenes is only 477 bp, 1601bp, and 1488 bp, respectively^30–32, which is far shorter than that in this study and indicates that the result of the third-generation sequencing technology is better than that of the second-generation sequencing technology.

Alternative splicing is an important way of regulating gene expression and plays vital roles in a variety of biological processes including sex differentiation and immunological resistance. In the study of E. sinensis, the two splice isoforms of the gene fruitless are obtained and could play important roles in sex-specific character development⁶⁶. In the study of L. vannamei, a total of 6 sex-lethal splice isoforms are cloned used RACE technology and the different isoform may play different roles during embryo development⁶⁷. In the study of S. paramamosain, the gene of down syndrome cell adhesion molecule (Dscam) is cloned and the bioinformatics result reveals that it could encode as high as 36,736 unique isoforms to bind different pathogen to protect the crab from the pathogen infection⁶⁸. However, in crustacean, the identification of alternative splicing on a large scale is scare because of the absence of genome information which makes the study of alternative splicing in crustacean difficult. Because of the longer sequencing length, the third-generation sequencing technology could obtain the full-length of transcripts, which provides the basis for the research of alternative splicing in S. paramamosain. In this study, the constructed sequencing library was consistent of 12 different tissues, therefore, more isoforms were identified comparing to the result that obtained using single tissue constructed sequencing library, which also indicated that different isoforms may play different roles in different tissues and the function of these isoforms needed further research. For example, a total of 6 different ferminazation-1 transcripts were identified in this study and their predicted protein sequences were completely identical to the protein sequences obtained through gonad transcriptome data in our laboratory (unpublished data). However, only 3 different ferminazation-1 transcripts (fem-1a, fem-1b, fem-1c) were identified in E. sinensis transcriptome data sequencing using second-generation sequencing technology, which indicated the third-generation sequencing technology is more efficient than second-generation sequencing technology in identifying isoforms.

It has been reported that the transcripts sequenced using the third-generation sequencing technology has more annotation rate than the second-generation sequencing technology in L.vannamei⁴⁸. In published articles about S. paramamosain transcriptome, the annotation rate of transcripts was 59%, 15.7% and 48.38%, respectively^30–32. In this study, the annotation rate of obtained transcripts was 66.5%, which was higher than that previously obtained using the second-generation sequencing technology and consistent with the result in L. vannamei⁴⁸.

Previous studies have shown that raw data error rate of the third-generation sequencing technology is relatively high, but the raw data error rate could be corrected by the data of second-generation sequencing technology⁶⁹. In this study, the raw data has been corrected by the transcriptome data sequenced using Illumina platform in our laboratory (unpublished result), which ensure the reality of the sequencing result. The consistent blast result of several published genes, relish, dorsal, TGF-beta type I receptor, amine oxidase with sequencing result also indicate the reliability of sequencing result in this study.

LncRNAs are non-coding RNAs that are longer than 200 nucleotides long and play vital roles in many physiological processes⁷⁰. However, the identification of LncRNAs in S. paramamosain using the third-generation sequencing technology has never been reported. In this study, a total of 23154 common LncRNAs predicted by three softwares are obtained, which will facilitate the function study of these LncRNAs in S. paramamosain. In spite of the identification of LncRNAs through the third-generation sequencing technology in this study, the classification and false rate of identified LncRNAs could not be done because of the absence of genome data of S. paramamosain.

Materials and Methods

Samples

Healthy sexually adult male (n = 4) and female (n = 4) S. paramamosain (weight = 250 ± 10 g) were purchased from a local agricultural market in Xiamen, China. A total of 12 different tissues (gill, hepatopancreas, muscle, cerebral ganglion, eyestalk, thoracic ganglia, intestine, heart, testis, ovary, sperm reservoir and hemocyte) were collected. The total RNA was extracted using the E.Z.N.A.®. Total RNA Kit II (Omega, Norcross, GA, USA) following the protocol provided by the manufacturer. The integrity of the RNA was determined with the Agilent 2100 Bioanalyzer and agarose gel electrophoresis. The purity and concentration of the RNA were determined with the Nanodrop micro-spectrophotometer (Thermo Fisher, USA).

SMRT library construction, sequencing, and quality control

mRNA was enriched by Oligo (dT) magnetic beads. Then the enriched mRNA was reverse transcribed into cDNA using Clontech SMARTer PCR cDNA Synthesis Kit (Takara, Shiga, Japan). PCR cycle optimization was used to determine the optimal amplification cycle number for the downstream large-scale PCR reactions. Then the optimized cycle number was used to generate double-stranded cDNA, followed by size selection using the Blue Pippin TM Size-Selection System to generate three libraries (1–2 kb, 2–3 kb, 3–6 kb). Then large-scale PCR was performed for the different size libraries for the next SMRT bell library construction. Different input amount of cDNA of size-selected samples was used to DNA damage repaired, end repaired, and ligated to sequencing adapters. The SMRT bell template was annealed to sequencing primer and bound to polymerase, and sequenced on the PacBio sequel platform by Gene Denovo Biotechnology Company (Guangzhou, China).

Data processing

The raw sequencing reads of cDNA libraries were classified and clustered into transcript consensus using the SMRT Link v5.0.1 pipeline⁷¹ supported by Pacific Biosciences. Briefly, CCS (circular consensus sequence) reads were extracted out of subreads BAM file. Then CCS reads were classified into full-length non-chimeric (FL), non-full-length (nFL), chimeras, and short reads based on cDNA primers and polyA tail signal. Short reads were discarded. Subsequently, the full-length non-chimeric (FLNC) reads were clustered by Iterative Clustering for Error Correction (ICE) software to generate the cluster consensus isoforms. Then non full-length reads were used to polish the above obtained cluster consensus isoforms by Quiver software to finally obtain the FL polished high quality consensus sequences (accuracy ≥ 99%). The final transcriptome isoform sequences were filtered by removing the redundant sequences with software CD-HIT-v4.6.7 using a threshold of 0.99 identities.

The evaluation of sequencing result and functional annotation of transcripts

The evaluation of sequencing result was performed through 3 different methods: (1) The protein sequences predicted from the sequencing result were analyzed by BUSCO v2.0 using arthropoda database to evaluate the completeness of sequencing result. (2) The published transcriptome data (SRR8792478, SRR8792479, SRR5814909, SRR5814910, SRR5814911, SRR5814912, SRR5814913, SRR5814914, SRR5814915, SRR5814916, SRR5814917) downloaded from NCBI database and the transcriptome results sequenced by our laboratory were aligned to sequencing result with bowtie2 software to evaluate the sequencing result. (3) Several recently published genes (relish: MH047674.1, dorsal: MH047675.1, TGF-beta type I receptor: MH187960.1 and amine oxidase: MG878093.1) were compared with the sequencing result to validate the accuracy of sequencing result. Basic annotation of transcripts includes protein functional annotation, pathway annotation, COG/KOG functional annotation and Gene Ontology (GO) annotation. To annotate the transcripts, transcripts were blasted against the NCBI non-redundant protein (Nr) database (http://www.ncbi.nlm.nih.gov), the Swiss-Prot protein database (http://www.expasy.ch/sprot), the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (http://www.genome.jp/kegg), and the COG/KOG database (http://www.ncbi.nlm.nih.gov/COG) with BLASTx program (http://www.ncbi.nlm.nih.gov/BLAST/) at an E-value threshold of 1e–5 to evaluate sequence similarity with genes of other species. GO annotation was analyzed by Blast2GO software⁷² with Nr annotation results of transcripts. Transcripts ranking the first 20 highest score and no shorter than 33 HSPs (High-scoring Segment Pair) hits were selected to conduct Blast2GO analysis. Then, functional classification of transcripts was performed using WEGO software⁷³.

Characterization of long non-coding RNAs

CNCI v2.0⁷⁴, pfam⁷⁵ and CPC v1.0⁷⁶ were used to assess the protein-coding potential of transcripts without annotations by default parameters for potential long non-coding RNAs. To better annotate lncRNAs in evolution level, the software Infernal (http://eddylab.org/infernal/) was used in sequence alignment. The lncRNAs were classified by secondary structures and sequence conservation.

Alternative splicing detection

To analyze alternative splicing events of transcript isoforms, COding GENome reconstruction Tool (Cogent) was firstly used to partition transcripts into gene families based on k-mer similarity and reconstructed each family into a coding reference genome based on De Bruijn graph methods. Then SUPPA tool was used to analyze alternative splicing events of transcript isoforms.

Identification of SSRs

The SSR identification was analyzed employing the software of MISA v1.0 (http://pgrc.ipk-gatersleben.de/misa/) 64 with default parameters in the whole transcriptome. The primers used for PCR were designed using primer3 with default parameters. The overall analysis pipeline was shown in Fig. 11.

Figure 11

The overall analysis pipeline performed in this study.

Supplementary information

Acknowledgements

This study was supported by grants from the National Key R&D Program of China (2018YFD0900205) and the Natural Science Foundation of China (41676161, 31672681).

Author Contributions

H.W. wrote the main manuscript text. Z.Z. and Y.W. designed the experiments and revised the main manuscript. H.W., X.J. and P.Z. carried out the experiments and analyzed the data. All authors approved and read the final manuscript.

Competing Interests

The authors declare no competing interests.

Footnotes

Contributor Information

Ziping Zhang, Email: zhangziping@fafu.edu.cn.

Yilei Wang, Email: ylwang@jmu.edu.cn.

Supplementary information

Supplementary information accompanies this paper at 10.1038/s41598-019-48824-8.

The Single-molecule long-read sequencing of Scylla paramamosain

Abstract

Key numbers

Full Text

What this is

Essence

Key takeaways

Caveats

Definitions

Introduction

Results

The quality examination of pooled RNA used for library construction and the evaluation of sequencing result

Functional annotation of transcripts

Identification of long non-coding RNAs (lncRNAs)

Identification of simple sequence repeats (SSRs)

The analysis of alternative splicing in transcriptome

The validation of sequencing result with several published full-length genes

Discussion

Materials and Methods

Samples

SMRT library construction, sequencing, and quality control

Data processing

The evaluation of sequencing result and functional annotation of transcripts

Characterization of long non-coding RNAs

Alternative splicing detection

Identification of SSRs

Supplementary information

Acknowledgements

Author Contributions

Competing Interests

Footnotes

Contributor Information

Supplementary information

References

Associated Data

Supplementary Materials

You found one interesting study. We’ll send the next 7.

what lands in your inbox each week:

Recent issues from the circadian biology brief

Abstract

Key numbers

Full Text

What this is

Essence

Key takeaways

Caveats

Definitions

Introduction

Results

The quality examination of pooled RNA used for library construction and the evaluation of sequencing result

Functional annotation of transcripts

Identification of long non-coding RNAs (lncRNAs)

Identification of simple sequence repeats (SSRs)

The analysis of alternative splicing in transcriptome

The validation of sequencing result with several published full-length genes

Discussion

Materials and Methods

Samples

SMRT library construction, sequencing, and quality control

Data processing

The evaluation of sequencing result and functional annotation of transcripts

Characterization of long non-coding RNAs

Alternative splicing detection

Identification of SSRs

Supplementary information

Acknowledgements

Author Contributions

Competing Interests

Footnotes

Contributor Information

Supplementary information

References

Associated Data

Supplementary Materials

Related papers

You found one interesting study. We’ll send the next 7.

what lands in your inbox each week:

Recent issues from the circadian biology brief