
Genome Biol. 2013; 14(2): R12.

SOAPfuse: an algorithm for identifying fusion transcripts from paired-end RNA-Seq data

Wenlong Jia,^# ¹,² Kunlong Qiu,^# ¹,² Minghui He,^# ¹,² Pengfei Song,^# ² Quan Zhou,¹,²,³ Feng Zhou,²,⁴ Yuan Yu,² Dandan Zhu,² Michael L Nickerson,⁵ Shengqing Wan,¹,² Xiangke Liao,⁶ Xiaoqian Zhu,⁶,⁷ Shaoliang Peng,⁶,⁷ Yingrui Li,¹,² Jun Wang,¹,²,⁸,⁹ and Guangwu Guo ¹,²

Wenlong Jia

¹BGI Tech Solutions Co., Ltd, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

²BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

Kunlong Qiu

¹BGI Tech Solutions Co., Ltd, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

²BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

Minghui He

¹BGI Tech Solutions Co., Ltd, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

²BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

Pengfei Song

²BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

Quan Zhou

¹BGI Tech Solutions Co., Ltd, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

²BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

³School of Life Science and Technology, University of Electronic Science and Technology of China, No.4, Section 2, North Jianshe Road, Chengdu 610054, China

Feng Zhou

²BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

⁴School of Bioscience and Bioengineering, South China University of Technology, Guangzhou Higher Education Mega Center, Panyu District, Guangzhou 510006, China

Yuan Yu

²BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

Dandan Zhu

²BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

Michael L Nickerson

⁵Cancer and Inflammation Program, National Cancer Institute, National Institutes of Health, 1050 Boyles Street, Frederick, MD 21702, USA

Shengqing Wan

¹BGI Tech Solutions Co., Ltd, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

²BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

Xiangke Liao

⁶School of Computer Science, National University of Defense Technology, No.47, Yanwachi street, Kaifu District, Changsha, Hunan 410073, China

Xiaoqian Zhu

⁶School of Computer Science, National University of Defense Technology, No.47, Yanwachi street, Kaifu District, Changsha, Hunan 410073, China

⁷State Key Laboratory of High Performance Computing, National University of Defense Technology, No.47, Yanwachi street, Kaifu District, Changsha, Hunan 410073, China

Shaoliang Peng

⁶School of Computer Science, National University of Defense Technology, No.47, Yanwachi street, Kaifu District, Changsha, Hunan 410073, China

⁷State Key Laboratory of High Performance Computing, National University of Defense Technology, No.47, Yanwachi street, Kaifu District, Changsha, Hunan 410073, China

Yingrui Li

¹BGI Tech Solutions Co., Ltd, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

²BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

Jun Wang

¹BGI Tech Solutions Co., Ltd, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

²BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

⁸The Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, DK-1165 Copenhagen, Denmark

⁹Department of Biology, University of Copenhagen, DK-1165 Copenhagen, Denmark

Guangwu Guo

¹BGI Tech Solutions Co., Ltd, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

²BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

Received 2012 Aug 16; Revised 2012 Oct 24; Accepted 2013 Feb 14.

Supplementary Materials: Additional file 1 Table S1 - data on all known fusions from two previous studies. Additional detailed information on the known fusions in two previous studies (melanoma and breast cancer research). All fusion information is based on release 59 of the Ensembl hg19 annotation database.

GUID: 4EF8DE2B-80C4-43F9-8BC7-201DE4866486

Additional file 2 Table S2 - software selected for evaluation of performance and sensitivity.

GUID: F95A3201-E56F-4409-87D6-740097877B20

Additional file 3 Supplementary notes.

GUID: 95DB537D-6F2D-4F4A-87D3-40DC593675C2

Additional file 4 Table S3 - detailed data on performance and fusion detection sensitivity of six tools. CPU time, maximum memory usage and sensitivity of fusion detection for each tool are shown. For multi-process operations, CPU time has been converted to single-process usage.

GUID: 5FA567E6-E8F7-4C5A-94B1-14ED1CD46F66

Additional file 5 Table S4 - detection screen of six tools on two previous study datasets.

GUID: 68610B0A-2D40-4FF9-A67A-E7A8895B0E7D

Additional file 6 Tables S5, S6 and S7. Table S5: detailed data on simulated RNA-Seq reads. Table S6: list of 150 simulated fusion events. Table S7: number of fusion-supporting reads for each fusion event.

GUID: AE8ABB7D-3B23-45D6-8399-BB293E215BF5

Additional file 7 Tables S8 and S9. Table S8: TP and FP rates of SOAPfuse, deFuse and TopHat-Fusion based on simulated datasets. Table S9: detailed data on the false fusion events detected by SOAPfuse, deFuse and TopHat-Fusion.

GUID: 9341CD41-665C-4A5C-8AE5-07D8FD274B0F

Additional file 8 Tables S10 and S11. Table S10: fusion transcripts detected by SOAPfuse and deFuse in two bladder cancer cell lines. Table S11: primers and Sanger sequences of confirmed fusions in two bladder cancer cell lines.

GUID: 64BEEF65-249B-43D4-B8AC-19B3B094779E

Additional file 9 Figure S1 - models of fusion transcripts generated by genome rearrangement. (a) Fusion transcript created by genomic inversion of Gene A and Gene B, which are from different DNA strands. (b) Fusion transcript formed by genomic translocation in which Gene C and Gene D are from the same DNA strand and are far from each other.

GUID: 6ECC3D42-6905-4DCF-8E45-C46831E87F75

Additional file 10 Figure S2 - schematic diagrams of nine steps in the SOAPfuse pipeline. The SOAPfuse algorithm consists of nine steps (from S01 to S09) and details of each step are in the Materials and methods or Additional file 3.

GUID: 92DFE4EC-41F6-41D0-A457-67D2866A62A8

Additional file 11 Table S12 - sixteen combinations of span-reads. There are 16 combinations based on serial numbers of reads and their mapped orientations, but only four combinations are rational, supporting two types of fusions in which the upstream and downstream genes are different.

GUID: 941D54D3-47AC-4CE6-8416-357396EF5769

Additional file 12 Figure S3 - schematic diagrams of fusion event RECK-ALX3. (a) Alignment of supporting reads against the predicted junction sequence. The upstream part of the junction sequence is in green, and the downstream part is in red. Span-reads are displayed above the predicted junction sequence with the colored dotted line linking paired-end reads. Junc-reads are shown below the junction sequence. (b,c) Expression analysis of the exons in RECK and ALX3 by RNA-Seq read coverage. Transcripts of RECK and ALX3 are shown beneath the coordinates. The junction site is shown as a red round dot and a green arrow indicates the transcript orientation in the genome sequence. The region covered by the red line is the region mapped by supporting reads. In this case, we found that the expression levels of RECK and ALX3 exons on the two sides of the junction sites are significantly different. The exons involved in the fusion transcript are expressed more highly than the other ones.

GUID: F6C97596-E688-47BF-94BB-A633D583C258

Abstract

We have developed a new method, SOAPfuse, to identify fusion transcripts from paired-end RNA-Seq data. SOAPfuse applies an improved partial exhaustion algorithm to construct a library of fusion junction sequences, which can be used to efficiently identify fusion events, and employs a series of filters to nominate high-confidence fusion transcripts. Compared with other released tools, SOAPfuse achieves higher detection efficiency and consumes fewer computing resources. We applied SOAPfuse to RNA-Seq data from two bladder cancer cell lines, and confirmed 15 fusion transcripts, including several novel events common to both cell lines. SOAPfuse is available at http://soap.genomics.org.cn/soapfuse.html.

Background

Gene fusions, arising from the juxtaposition of two distinct regions in chromosomes, play important roles in carcinogenesis and can serve as valuable diagnostic and therapeutic targets in cancer. Aberrant gene fusions have been widely described in malignant hematological disorders and sarcomas [1-3], with the recurrent BCR-ABL fusion gene in chronic myeloid leukemia as the classic example [4]. In contrast, the biological and clinical impact of gene fusions in more common solid tumor types has been less appreciated [2]. Nevertheless, recent discoveries of recurrent gene fusions, such as TMPRSS2-ERG in a majority of prostate cancers [5,6], EML4-ALK in non-small-cell lung cancer [7] and VTI1A-TCF7L2 in colorectal cancer [8], point to their functionally important role in solid tumors. These fusion events were not detected until recently due to technical and analytic problems encountered in the identification of balanced chromosomal aberrations in complex karyotypic profiles of solid tumors.

Massively parallel RNA sequencing (RNA-Seq) using a next-generation sequencing (NGS) platform provides a revolutionary new tool for precise measurement of levels of transcript abundance and structure in a large variety of species [9-16]. In addition, RNA-Seq has been proven to be a sensitive and efficient approach to gene fusion discovery in many types of cancers [17-20]. Compared with whole genome sequencing, which is also able to detect gene-fusion-creating rearrangements, RNA-Seq identifies fusion events that generate aberrant transcripts that are more likely to be functional or causal in biological or disease settings.

Recently, several computational methods, including FusionSeq [21], deFuse [22], TopHat-Fusion [23], FusionHunter [24], SnowShoes-FTD [25], chimerascan [26] and FusionMap [27], have been developed to identify fusion transcript candidates by analyzing RNA-Seq data. Although these methods were capable of detecting genuine fusion transcripts, many challenges and limitations remain. For instance, to determine the junction sites in a given fusion transcript, FusionSeq selected all exons that were potentially involved in the junction from both genes of a pair, and then covered the exons with a set of 'tiles' that were spaced 1 nucleotide apart [21]. A fusion junction library was constructed by creating all pairwise junctions between these tiles, and the junctions were identified by mapping the RNA-Seq reads to the junction library. This module makes FusionSeq time consuming, especially for genes with more and larger exons. In addition, FusionHunter identified only fusion transcripts with junction sites at the exon edge (splicing junction), and could not detect a fusion transcript with junction sites in the middle of an exon [24]. Many homologous genes and repetitive sequences frequently masquerade as fusion events due to ambiguous alignments of short NGS sequencing reads. The lack of effective filtering mechanisms promoted frequent detection of spurious fusion transcripts. Furthermore, several software tools consumed large amounts of computational resources (CPU time and memory usage), which was a serious problem when analyzing hundreds of samples in parallel.
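To see why an all-pairs tile library grows quickly with exon number and size, a rough count can be sketched in Python. This is only an illustration of the combinatorics described above: the tile length of 40 nt, the exon lengths and the function names are assumptions made for the example, not FusionSeq parameters.

```python
def tiles_in_exon(exon_len: int, tile_len: int, spacing: int = 1) -> int:
    """Number of tiles of length tile_len whose start positions are `spacing` nt apart."""
    if exon_len < tile_len:
        return 0
    return (exon_len - tile_len) // spacing + 1


def junction_library_size(exons_a, exons_b, tile_len: int = 40) -> int:
    """Count all pairwise junctions between tiles of gene A and tiles of gene B."""
    tiles_a = sum(tiles_in_exon(n, tile_len) for n in exons_a)
    tiles_b = sum(tiles_in_exon(n, tile_len) for n in exons_b)
    return tiles_a * tiles_b


# Hypothetical exon lengths (nt) for two candidate genes.
print(junction_library_size([100, 200], [150]))
```

Because the count is a product of the two tile totals, doubling the exon content of both genes roughly quadruples the library.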

To address the limitations above, we present a new algorithm, SOAPfuse, which detects fusion transcripts in cancer from paired-end RNA-Seq data. SOAPfuse combines alignment of RNA-Seq paired-end reads against the human genome reference sequence and annotated genes, with detection of candidate fusion events. It seeks two types of reads supporting a fusion event (Figure 1a): discordant mapping paired-end reads (span-read) that connect the candidate fusion gene pairs; and junction reads (junc-read) that confirm the exact junction sites. SOAPfuse applies an improved partial exhaustion algorithm to efficiently construct a putative junction library and also adopts a series of filters and quality control measures to discriminate likely genuine fusions from sequencing and alignment artifacts (Figure 1b; see Materials and methods). The program reports a high-confidence list of fusion transcripts with the precise locations of junction sites at single nucleotide resolution. Furthermore, SOAPfuse supplies the predicted junction sequences of fusion transcripts, which are helpful for the design of bilateral primers in preparation for RT-PCR validation. Moreover, SOAPfuse creates schematic diagrams that display the alignment of supporting reads (span-reads and junc-reads) on junction sequences and the expression levels of exons from each gene pair. Figures are created in a lossless image format (SVG, scalable vector graphics) and, with detailed information on fusion events, will facilitate comprehensive characterization of fusion transcripts at single base resolution and will greatly aid manual selection of the fusion events of interest for further research. SOAPfuse can distinguish specific features of RNA-Seq data, such as insert size and read length, so it still works well even when a single sample includes different types of paired-end RNA-Seq data.

Figure 1

Framework of SOAPfuse for discovering fusion events. (a) Model of the fusion event from Gene A and Gene B that is supported by span-reads and junc-reads. Gene A and reads mapped to it are in blue, and Gene B is in orange. The junction site is marked by a yellow dot. Two span-reads and two junc-reads are shown. (b) The four parts of the SOAPfuse algorithm: read alignment against the human genome reference and annotated transcript sequences; identifying candidate gene pairs by seeking span-reads; detection of predicted fusions; filtering fusions with several criteria to generate the final high-confidence fusion transcript list. Methods with key roles in the algorithm are indicated in red. Steps marked by an asterisk indicate key filtering steps.

Results

Evaluation of performance and sensitivity of SOAPfuse

To assess the performance and sensitivity of SOAPfuse, we applied SOAPfuse to paired-end RNA-Seq datasets from two previous studies: dataset A, consisting of six melanoma samples and one chronic myelogenous leukemia sample, in which 15 confirmed fusions were detected [19]; and dataset B from four breast cancer cell lines with 27 validated fusions [20]. According to Sanger sequences, we characterized these fusion transcripts using release 59 of the Ensembl annotation database [28], including gene symbols, chromosome locations and exact genomic coordinates of junction sites (Additional file 1). To compare SOAPfuse with other published software (Additional file 2), we also ran deFuse [22], TopHat-Fusion [23], FusionHunter [24], SnowShoes-FTD [25] and chimerascan [26] on both RNA-Seq datasets. FusionSeq [21] and FusionMap [27] were abandoned due to computational limitations (Additional file 3). We examined different parameters for each tool to obtain higher sensitivity with lower consumption of computational resources (Additional file 3). For a given fusion event, the distance between a junction site identified by these tools and the real one as determined by previous reports had to be less than 10 bp; otherwise the fusion event was considered not detected. Figure 2 shows the computing resources (CPU time and memory usage) and sensitivity for SOAPfuse and the other five methods (Additional file 4).
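The 10 bp matching rule used to score sensitivity can be sketched as follows. This is a minimal illustration, not the evaluation script used in the study: the `Junction` record, the function names, and the choice to apply the tolerance to both sides of the junction are assumptions, and the predicted coordinates are made up for the example.

```python
from dataclasses import dataclass

MAX_DISTANCE_BP = 10  # tolerance stated in the text ("less than 10 bp")


@dataclass(frozen=True)
class Junction:
    up_chrom: str
    up_pos: int
    down_chrom: str
    down_pos: int


def is_detected(known: Junction, predicted: Junction,
                max_dist: int = MAX_DISTANCE_BP) -> bool:
    """True if the predicted junction lies within max_dist bp of the known one.

    Assumption: the tolerance is checked on both the upstream and downstream site.
    """
    return (known.up_chrom == predicted.up_chrom
            and known.down_chrom == predicted.down_chrom
            and abs(known.up_pos - predicted.up_pos) < max_dist
            and abs(known.down_pos - predicted.down_pos) < max_dist)


def sensitivity(known_fusions, predictions) -> float:
    """Fraction of known fusions matched by at least one prediction."""
    hits = sum(any(is_detected(k, p) for p in predictions) for k in known_fusions)
    return hits / len(known_fusions)


# Two known junctions and two hypothetical predictions (second one is 26 bp off).
known = [Junction("16", 69184807, "16", 69117388),
         Junction("2", 26502983, "2", 28070964)]
predicted = [Junction("16", 69184810, "16", 69117388),
             Junction("2", 26502983, "2", 28070990)]
print(sensitivity(known, predicted))
```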

Figure 2

Performance and sensitivity comparison among six tools based on datasets from previous studies. Dataset A is from a melanoma study, and dataset B is from breast cancer research. (a-c) Average CPU time (a), maximum memory usage (b) and sensitivity of fusion detection (c) of six tools are shown in histograms with detailed values (top).

For dataset A, which contains approximately 111 million paired-end reads, SOAPfuse consumed the least CPU time (approximately 5.2 hours) and the second least memory (approximately 7.1 Gigabytes) to complete the data analysis (including the alignment of reads against the reference), and was able to detect all 15 fusion events. DeFuse and FusionHunter detected comparable numbers of known fusion events (12 to 13 of the 15 fusions), but took 82.1 and 21.3 CPU hours, respectively, at least four times as much as SOAPfuse (Additional file 5). The computational resource cost of SnowShoes-FTD was comparable with SOAPfuse, but SnowShoes-FTD identified only 8 of 15 events. The remaining two tools, chimerascan and TopHat-Fusion, detected four confirmed fusion events but used significantly more CPU hours or memory. For dataset B, containing approximately 55 million paired-end reads, SOAPfuse detected 26 of the 27 reported fusion events with 4.1 CPU hours and 6.3 Gigabytes of memory. The other five tools were able to identify comparable numbers of reported fusions (15 to 21) and cost at least 6.4 hours of CPU time. One fusion event, NFS1-PREX1, was missed by all methods, including SOAPfuse (Additional file 3).

The process of data analysis for all six tools included two stages: read alignment and then detection of fusion events. For both datasets, SOAPfuse, SnowShoes-FTD, and chimerascan consumed less memory than the other three tools. Chimerascan used less memory than SOAPfuse because it used Bowtie [29], which required less memory than SOAP2 [30] in SOAPfuse, to align reads. The memory usage of the other tools (deFuse, FusionHunter, and TopHat-Fusion) was almost two to three times that of SOAPfuse. They reached maximum memory usage at the fusion detection stage rather than at the read alignment stage, which suggests there may still be room for algorithm improvement for fusion detection. SOAPfuse uses several optimized algorithms to reduce memory consumption with low cost to computation speed. For the two datasets, SOAPfuse expended less CPU time and memory than most of the other five tools, and reached the highest detection sensitivity, with almost all reported fusion events rediscovered (41 of 42), showing its superior performance and high sensitivity.

Estimate of the false negative and false positive rates by simulated datasets

To estimate the false negative (FN) and false positive (FP) rates of fusion detection by SOAPfuse, we applied SOAPfuse to a simulated RNA-Seq dataset. We used the short-read simulator provided by MAQ [31] to generate paired-end RNA-Seq reads from 150 simulated fusions with nine different expression levels (5- to 200-fold; Additional files 3 and 6). We mixed simulated reads with the RNA-Seq dataset (approximately 19 million paired-end reads) from human embryonic stem cells, which was also used as background data by FusionMap [27]. Chimerascan, FusionHunter and SnowShoes-FTD only detected fusion events with junction sites at the exon boundaries. Their performance could not be evaluated because some simulated fusion events harbored junction sites in the middle of exons. We tested deFuse, TopHat-Fusion and SOAPfuse on simulated paired-end reads. Several strategies were applied to fairly compare the performance of these tools (Additional file 3). In total, 149 (99%) of the 150 fusion events were rediscovered, and 142 (94%) were detected by at least two tools, indicating our simulation was reasonable. To be conservative, the performance comparison was based on the 142 events that were supported by at least two algorithms (Figure 3; Additional file 7).

Figure 3

Evaluation of false negative and false positive rates of SOAPfuse based on simulated datasets. A comparison among three tools based on simulated fusion events is shown with different expression levels of the fusion transcripts. (a,b) FN rate (a) and FP rate (b) of three tools are shown in line graphs. (c) Distribution of 142 simulated fusion events detected by the three methods. SOAPfuse missed three simulated fusion events (red) that were identified by both deFuse and TopHat-Fusion.

As expected, FN rates decreased with increasing expression levels of fusion transcripts (Figure 3a). SOAPfuse and deFuse achieved the lowest FN rates at 5% with fusion transcript expression levels of 30-fold or greater. TopHat-Fusion had higher FN rates, especially at low fusion transcript expression levels (5- to 20-fold). For the FP rate (Figure 3b), only SOAPfuse achieved <5% at different fusion transcript expression levels, while deFuse and TopHat-Fusion had higher FP rates at lower fusion transcript expression levels.

Generally, lower FN rates and lower FP rates are conflicting goals in fusion detection; however, SOAPfuse and deFuse are good at reducing both FN and FP rates during fusion transcript identification. SOAPfuse missed three simulated fusions that were detected by both deFuse and TopHat-Fusion (Figure 3c), revealing a weakness in the analysis of homologous gene sequences and short fusion transcripts of long genes (Additional file 3). In summary, SOAPfuse showed optimal performance with low FN and FP rates at different expression levels of fusion transcripts.
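The two rates discussed above can be computed from the set of simulated events and the set of events a tool reports. The sketch below is an assumption about the definitions, which are not spelled out in this section: the FN rate is taken as the fraction of simulated events that are missed, and the FP rate as the fraction of reported events that were not simulated. The fusion names are placeholders.

```python
def fn_fp_rates(simulated, reported):
    """Return (FN rate, FP rate) for one tool at one expression level.

    Assumed definitions:
      FN rate = missed simulated events / all simulated events
      FP rate = reported events that were not simulated / all reported events
    """
    simulated, reported = set(simulated), set(reported)
    fn_rate = len(simulated - reported) / len(simulated)
    fp_rate = len(reported - simulated) / len(reported) if reported else 0.0
    return fn_rate, fp_rate


# Placeholder event names: one simulated event is missed, one reported event is spurious.
simulated_events = {"A-B", "C-D", "E-F", "G-H"}
reported_events = {"A-B", "C-D", "E-F", "X-Y"}
print(fn_fp_rates(simulated_events, reported_events))
```

Under these definitions the two rates pull in opposite directions: stricter filtering removes spurious calls but also risks discarding true events.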

Application to bladder cancer cell lines

We next applied SOAPfuse to two bladder cancer cell lines, 5637 and T24. We performed high-throughput RNA-Seq, using Illumina HiSeq sequencing technology, on mRNA from both cell lines and acquired more than 30 million paired-end reads for each (Table 1; see Materials and methods). SOAPfuse identified a total of 16 fusion transcripts, all of which are intrachromosomal and fused at the exon boundaries. We designed primers for RT-PCR experimental validation of all predicted fusions, and Sanger sequencing of the amplicons confirmed 15 (94%) events, of which 6 were detected in both cell lines (Figure 4a; Table 2; Additional file 8). Detailed analysis showed that several confirmed fusion events (Table 2) might be consequences of chromosomal rearrangements. For instance, the HADHB-RBKS fusion transcript (Figure 4b) fuses two genes from different DNA strands, indicating a potential inversion (Figure S1a in Additional file 9). Furthermore, some fusions implied possible intrachromosomal translocations (Figure S1b in Additional file 9), such as CIRH1A-TMCO7, PSMD8-SIPA1L3, and TIAM1-ATP5O (Figure 4c-e). Intrachromosomal translocations as a mechanism to create fusions were also found in ovarian carcinoma [32] and glioblastoma [33]. To our knowledge, none of the confirmed fusion events has been reported by previous studies on bladder cancer, indicating their potential significance for further research.

Table 1

RNA-Seq data from two bladder cancer cell lines

Sample ID	Read type	Insert size	Read length	Number of paired-end reads
5637	Paired end	200	90	32,228,742
T24	Paired end	200	90	36,830,100

Figure 4

Confirmed fusions in two bladder cancer cell lines. (a) RT-PCR amplifications of confirmed fusions in two bladder cancer cell lines. Marker (M), positive (β-actin) and negative (ddH₂O) controls are also shown. Fusion events in red are detected in both cell lines. For fusions that have multiple RT-PCR products, genuine amplicons of a fusion transcript are boxed in yellow. One fusion, SNAP23-LRRC57, reported by deFuse, is further discussed in the text. (b-e) Fusion events that indicate potential chromosomal rearrangements, including potential inversion (b) and intrachromosomal translocations (c-e), are shown. Blue segments are upstream genes, and downstream genes are in orange. Gene symbols are followed by their DNA strands. Exons around the junction sites are drawn with a double slash indicating exons that are not shown. The start positions of upstream genes and end positions of downstream genes are noted with a colon separating chromosomal location and reference genome coordinate. The span-reads and junc-reads from RNA-Seq are shown above and below the junction sequences, respectively. Sanger sequencing results of junction sequences are displayed under the junction sites.

Table 2

Confirmed fusion events from two bladder cancer cell lines

Sample ID	Fusion genes (5'-3')	Chromosome (5'-3')	5' position	3' position	Fusion reads (span/junc)	Detected in both cell lines	Potential chromosomal rearrangement
5637	BDKRB2-BDKRB1	14-14	96703518	96728989	2/3	Yes	No
5637	CIRH1A-TMCO7	16-16	69184807	69117388	20/20	Yes	Yes
5637	CLN6-CALML4	15-15	68521840	68489966	4/4	Yes	No
5637	GATSL1-GTF2I	7-7	74867229	74143124	3/15	No	Yes
5637	HADHB-RBKS	2-2	26502983	28070964	12/10	Yes	Yes
5637	POLA2-CDC42EP2	11-11	65063461	65088015	3/5	No	No
5637	PSMD8-SIPA1L3	19-19	38871639	38673159	5/5	Yes	Yes
5637	TIAM1-ATP5O	21-21	32537279	35276325	9/21	Yes	Yes
T24	BDKRB2-BDKRB1	14-14	96703518	96728989	3/3	Yes	No
T24	CIRH1A-TMCO7	16-16	69184807	69117388	19/24	Yes	Yes
T24	CLN6-CALML4	15-15	68521840	68489966	6/5	Yes	No
T24	CTBS-GNG5	1-1	85028940	84967653	3/6	No	No
T24	HADHB-RBKS	2-2	26502983	28070964	6/7	Yes	Yes
T24	PSMD8-SIPA1L3	19-19	38871639	38673159	6/5	Yes	Yes
T24	TIAM1-ATP5O	21-21	32537279	35276325	8/28	Yes	Yes

All data are based on release 59 of the Ensembl hg19 annotation database. Six fusions were detected in both bladder cancer cell lines, and five events may be derived from potential chromosomal rearrangements on the genome. For Fusion reads (span/junc), numbers of span-reads and junc-reads are separated by a slash.

We also used deFuse to reanalyze this dataset and identified 11 fusions, of which 10 (91%) events could be confirmed by RT-PCR experiments. Nine of the ten confirmed events were also detected by SOAPfuse (Table S10 in Additional file 8) and the remaining fusion transcript (SNAP23-LRRC57; Figure 4a) was missed by SOAPfuse. Sanger sequencing shows that exon 5 of SNAP23 is fused to the antisense sequence of exon 4 of LRRC57. This implies that deFuse has a somewhat different definition of a fusion compared to SOAPfuse (Figure 5a; Additional file 3). The distance between the junction sites in SNAP23 and LRRC57 is approximately 30 kbp, which is within the range permitted by alternative splicing. We speculated that the fusion predicted by deFuse might be an alternative splicing event in the upstream gene, SNAP23. We therefore checked the latest version of the Ensembl annotation database (release 69) and found a transcript sequence (SNAP23-017) of gene SNAP23 in which the antisense sequence of exon 4 in LRRC57 has been annotated as a new exon in the SNAP23 gene (Figure 5b). Based on this discovery, we believe the SNAP23-LRRC57 fusion event reported by deFuse is an alternative splicing event in SNAP23.

Figure 5

Analysis of SNAP23-LRRC57 reported by deFuse. (a) SNAP23-LRRC57 analysis based on release 59 of the Ensembl annotation database. Sanger sequencing of RT-PCR amplicons of the SNAP23-LRRC57 fusion event reported by deFuse is shown. The upstream gene (SNAP23) is in blue, and the downstream part is in orange. Gene symbols are followed by their DNA strands. The downstream fusion part is the antisense strand sequence of exon 4 of LRRC57. (b) In the latest release (release 69) of the Ensembl annotation database, the downstream part of the SNAP23-LRRC57 fusion is annotated as part of exon 4 of SNAP23-017, one of the SNAP23 transcripts. The fusion SNAP23-LRRC57 reported by deFuse is in fact an alternative splicing event in gene SNAP23.

Discussion

We have developed a new method called SOAPfuse to aid in fusion transcript discovery from paired-end RNA-Seq data. Comparing SOAPfuse with other tools on two previously published datasets, one simulated dataset and two bladder cancer cell line datasets, we demonstrated the superior performance and high sensitivity of SOAPfuse. When evaluated on a simulated dataset, SOAPfuse showed a low FP rate (5%) at different expression levels of fusion transcripts and it also achieved a low FN rate of 5% when the expression levels of fusion transcripts were greater than 30-fold. Using the bladder cancer cell line datasets, we demonstrated with RT-PCR-validated fusions that SOAPfuse has high accuracy (15 of 16, 94%) and we also identified several novel fusion transcripts that may be derived from chromosomal rearrangements.

In the simulated dataset, SOAPfuse missed three fusion transcripts. The program had some difficulties detecting fusion transcripts from gene pairs having highly similar sequences, and fusion transcripts involving short transcripts of long genes. However, preliminary solutions have been applied to remedy these shortcomings successfully (Additional file 3), and will be included in future versions of SOAPfuse. After analyzing the characteristics of the fusion events, we found that several novel fusion transcripts detected in the bladder cancer cell lines were more likely to be derived from chromosomal rearrangements of the DNA. Whole genome sequencing will be helpful for determining whether the fusion transcripts are from genomic DNA variations and whether the breakpoints can be detected. We have started to develop a new algorithm to detect chromosomal rearrangements that can generate predicted fusion transcripts from whole genome sequencing data based on the results from SOAPfuse. It will be complementary to SOAPfuse for performing genome analysis of fusions with tools like CREST [34]. We will continuously refine SOAPfuse and update it on our official website.

Conclusions

Here we present an optimized, publicly available methodology for identifying novel fusion transcripts from RNA-Seq data. Our results suggest that SOAPfuse achieves better performance than other published tools and that it produces a highly accurate list of fusion events in a time-efficient manner. Furthermore, it provides predicted junction sequences and schematic diagrams of fusion events, which are helpful for analyzing detected fusions. Overall, SOAPfuse is a useful method that will enable other research groups to make discoveries from their own RNA-Seq data collections.

Materials and methods

Outline of the general approach

SOAPfuse seeks two types of reads (span-reads and junc-reads; Figure 1a) to identify fusion transcripts. Paired-end reads that map to two different genes (a gene pair) are defined as span-reads, and reads covering the junction sites are called junc-reads. Span-reads are used to identify candidate gene pairs, and junc-reads are used to characterize the exact junction sites at single base resolution. Duplicate span-reads and junc-reads are removed before computing the number of supporting reads (Figure 6a). SOAPfuse contains nine steps in its pipeline (Additional file 10), which can be divided into four parts (Figure 1b): (i) read alignment (steps S01 to S03); (ii) identifying candidate gene pairs (steps S04 and S05); (iii) detection of predicted fusions (steps S06 and S07); and (iv) filtering fusions (steps S08 and S09). A detailed description of the algorithm is in Additional file 3.
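The idea of counting de-duplicated span-reads per candidate gene pair can be illustrated with a short Python sketch. It is not the SOAPfuse implementation: the `PairAlignment` record, the function name, and the rule that two pairs are duplicates when both mates map to the same gene and position are assumptions made for the example, and the positions are invented.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class PairAlignment(NamedTuple):
    read_id: str
    gene1: str   # gene hit by mate 1
    pos1: int    # mapped position of mate 1
    gene2: str   # gene hit by mate 2
    pos2: int    # mapped position of mate 2


def count_span_reads(pairs: Iterable[PairAlignment]) -> Counter:
    """Count unique span-reads (mates in two different genes) per gene pair."""
    seen = set()
    support = Counter()
    for p in pairs:
        if p.gene1 == p.gene2:
            continue  # both mates in one gene: not a span-read
        # Assumed duplicate rule: identical mapping positions for both mates,
        # regardless of which mate is listed first.
        key = tuple(sorted([(p.gene1, p.pos1), (p.gene2, p.pos2)]))
        if key in seen:
            continue  # keep only one copy of a duplicated read pair
        seen.add(key)
        support[(key[0][0], key[1][0])] += 1
    return support


pairs = [
    PairAlignment("r1", "HADHB", 100, "RBKS", 500),
    PairAlignment("r2", "HADHB", 100, "RBKS", 500),   # duplicate of r1
    PairAlignment("r3", "RBKS", 520, "HADHB", 120),
    PairAlignment("r4", "HADHB", 100, "HADHB", 300),  # concordant pair, ignored
]
print(count_span_reads(pairs))
```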

Figure 6. Basic filtering of candidate gene pairs in SOAPfuse. (a) Duplicated span-reads and junc-reads are removed before calculating the number of supporting reads and only one duplicated read is retained. (b) Genes C and D are adjacent, and they share two exons: exon 4 and exon 5 from Gene C overlap with exon 1 and exon 2 of Gene D, respectively. Span-reads from the overlapped exons are excluded by SOAPfuse. (c) Gene pair M and N has regions with homogenous/similar sequences and reads from these regions are filtered out.

Read alignment

SOAPfuse initially aligns paired-end reads against the human reference genome sequence (hg19) using SOAP2 [30] (SOAP-2.21; step S01 in Additional file 10). We divided the reads into three types according to the read alignment results: PE-S01, SE-S01 and UM-S01, where PE stands for paired-end mapped result, SE for single-end mapped result, and UM for unmapped read. PE-S01 contains the paired-end reads that map to the genome with proper insert sizes (<10,000 bp). SE-S01 contains paired-end reads in which only one of the two ends mapped to the reference genome, and paired-end reads indicating a fragment with an abnormal insert size or mapped orientation. All unmapped reads are saved in UM-S01 in FASTA format. PE-S01 is used to evaluate insert size (Additional file 3). SOAPfuse then aligns UM-S01 reads against annotated transcripts (Ensembl release; step S02 in Additional file 10) and generates SE-S02 and UM-S02. To filter unmapped reads caused by small indels, UM-S02 reads are realigned to annotated transcripts using BWA [35] (BWA-0.5.9; maximum number of gap extensions is 5), and the remaining unmapped reads are called filtered-unmapped (FUM).
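The three-way split of step S01 can be sketched as a small classifier. The code below is a hypothetical simplification, not SOAPfuse's implementation: the Hit record, the read_len default, and the exact test for a proper orientation (plus-strand end to the left of the minus-strand end on the same chromosome) are assumptions, while the 10,000 bp limit comes from the text.

```python
from typing import NamedTuple, Optional

MAX_INSERT = 10_000  # proper insert size limit stated in the text (bp)


class Hit(NamedTuple):
    chrom: str
    pos: int      # leftmost mapped coordinate
    strand: str   # "+" or "-"


def classify_pair(end1: Optional[Hit], end2: Optional[Hit], read_len: int = 90) -> str:
    """Return "PE", "SE" or "UM" for one read pair after genome alignment."""
    if end1 is None and end2 is None:
        return "UM"
    if end1 is None or end2 is None:
        return "SE"  # only one end mapped
    fwd, rev = (end1, end2) if end1.strand == "+" else (end2, end1)
    proper = (
        end1.chrom == end2.chrom
        and end1.strand != end2.strand            # ends face each other
        and fwd.pos <= rev.pos
        and rev.pos + read_len - fwd.pos < MAX_INSERT
    )
    return "PE" if proper else "SE"  # abnormal size or orientation goes to SE
```

For example, classify_pair(Hit("chr1", 100, "+"), Hit("chr1", 300, "-")) falls in the PE class, whereas the same pair split across two chromosomes falls in SE and becomes a candidate source of span-reads.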

Iteratively trimming and realigning reads

The latest protocols for NGS RNA-Seq library preparation can generate paired-end reads with an insert size shorter than the total length of both reads (with the 3' ends of both reads overlapped). Paired-end reads with overlapped 3' ends may come from the junction regions containing the junction sites, and these paired-end reads are not mapped to the reference if the overlapped regions cover the junction sites. These reads are components of FUM generated in step S02 (Additional file 10) and cannot become span-reads, which reduces the capability of fusion detection. SOAPfuse estimates whether the number of these paired-end reads with overlapped 3' ends exceeds a threshold (20% of total reads by default). If so, or if the user enables the trimming operation in the configuration file, SOAPfuse iteratively trims and realigns FUM reads to annotated transcripts (Figure 7; step S03 in Additional file 10). The length of reads after trimming should be at least 30 nucleotides (default parameter in SOAPfuse). The trimmed reads that can be mapped to annotated transcripts are stored in SE-S03 (Additional file 3). Two steps were used to finish the trimming and realigning operation: first, FUM reads were progressively trimmed by five bases from the 3' end and mapped to annotated transcripts again until a match was found; second, using the same strategy, we trimmed the remaining FUM reads from the 5' end. All mapped paired-end reads from these two steps were merged together (step S04 in Additional file 10).
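The trimming loop can be sketched as follows. This is a minimal illustration under stated assumptions: the `maps` callback is a hypothetical stand-in for the real aligner, the toy "alignment" is exact substring matching, and the 5' pass is assumed to restart from the untrimmed read. The step of five bases and the 30-nucleotide minimum are the defaults given in the text.

```python
from typing import Callable, Optional


def trim_until_mapped(read: str, maps: Callable[[str], bool], from_3prime: bool,
                      step: int = 5, min_len: int = 30) -> Optional[str]:
    """Trim `step` bases per cycle from one end until `maps` accepts the read."""
    seq = read
    while len(seq) - step >= min_len:
        seq = seq[:-step] if from_3prime else seq[step:]
        if maps(seq):
            return seq
    return None  # never mapped at or above the minimum length


def trim_and_realign(read: str, maps: Callable[[str], bool]) -> Optional[str]:
    """Try 3'-end trimming first, then 5'-end trimming on the original read."""
    trimmed = trim_until_mapped(read, maps, from_3prime=True)
    if trimmed is None:
        trimmed = trim_until_mapped(read, maps, from_3prime=False)
    return trimmed


# Toy transcript and exact-substring "aligner" (hypothetical stand-in for SOAP2).
transcript_a = "A" * 20 + "C" * 20 + "G" * 20
maps_to_a = lambda s: s in transcript_a
read_3p = transcript_a[10:50] + "T" * 10   # foreign bases at the 3' end
read_5p = "T" * 10 + transcript_a[10:50]   # foreign bases at the 5' end
```

With these toy reads, read_3p maps after two 3' trimming cycles, while read_5p fails every 3' cycle and maps only in the 5' pass.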

Figure 7. Trimming and realigning the paired-end reads in which both 3' ends overlap each other. A junction sequence is shown with the junction site noted by a yellow dot. The blue region is from Gene A, and orange is from Gene B. The paired-end read with overlapped 3' ends (black thick line) cannot map to Gene A and Gene B, as the reads cover the junction site. A series of trimmed reads (grey thick line) are obtained by iteratively trimming 5 nucleotides (nts) each time from the 3' ends until the reads can map to Gene A and Gene B. In this case, end 1 of a paired-end read requires two cycles of trimming to achieve successful alignment, while end 2 needs five cycles.

Identifying candidate gene pairs

From all discordantly aligned reads, SOAPfuse seeks span-reads to support candidate gene pairs (step S05 in Additional file 10). Both the span-reads that mapped uniquely to the reference (human genome and annotated transcripts) and the trimmed reads that have multiple hits were used to detect the candidate gene pairs. The maximum number of hits for each span-read is a parameter in the configuration file. To ensure accurate detection of the fusion gene pairs, SOAPfuse imposes several filters on the predicted candidate gene pair list (Additional file 3), such as excluding gene pairs from the same gene families and pairs with overlapped or homogenous exon regions (Figure 6b).

Determining the upstream and downstream genes in the fusion events

After obtaining the candidate gene pairs, the upstream and the downstream genes of the fusion were determined based on the information from span-read alignment against the reference. In the process of paired-end sequencing, the fragments are sequenced from bilateral edges to the middle part: one end starts from the 3' end of the fragment, while the other end starts from the 3' end of the complementary base-pairing sequence of the fragment (Figure 8a). This information is used to define the up- and downstream genes in a fusion transcript.

Figure 8. Determining the upstream and downstream genes in fusion events. (a) A fragment of paired-end sequencing is shown with its complementary fragment. Paired-end reads (reads 'a' and 'b') are shown with their sequencing direction (from 5' to 3', noted by arrows on reads). Read 'a' is generated from the fragment itself, while read 'b' is from the complementary fragment. The sequencing orientation is from bilateral edges to the center of the fragment, so the paired-end reads are generated head-to-head. (b,c) Different classifications of span-read (read 'a' and 'b') support different upstream and downstream genes. The gene aligned by reads in the plus orientation must be the upstream gene. In (b), read 'a' aligns to Gene A in a plus orientation. Based on the paired-end sequencing shown in (a), Gene A must be the upstream gene and Gene B must be the downstream gene. In (c), read 'b' aligns to Gene B in a plus orientation. So, Gene B is an upstream gene and Gene A is a downstream gene.

A span-read (paired-end reads 'a' and 'b') supports a candidate gene pair (Gene A and Gene B). According to the serial number ('1' or '2') and mapped orientation ('+' or '-') of the paired-end reads (read 'a' and 'b'), there are 16 combinations, but only four are rational. These four combinations support two types of fusions in which the upstream and downstream genes are different (Additional file 11, Table S12). The judgment rule is: the gene aligned by reads in the plus orientation must be the upstream gene. Here, we presume that read 'a' maps to Gene A and read 'b' maps to Gene B (Figure 8b,c). In Figure 8b, read 'a' aligns to Gene A (annotated transcripts) in the plus orientation, so Gene A must be the upstream gene; while in Figure 8c, read 'b' aligns to Gene B in the plus orientation, so Gene B must be the upstream gene. According to this rule, SOAPfuse defines the upstream and downstream genes in fusion events.
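The judgment rule can be written as a short function. The sketch below is a simplification for illustration only: it assumes that orientations are given relative to the annotated transcripts, it ignores the read serial number, and it treats same-orientation pairs as not rational. The function name is hypothetical.

```python
from typing import Optional, Tuple


def assign_upstream(gene_a: str, strand_a: str,
                    gene_b: str, strand_b: str) -> Optional[Tuple[str, str]]:
    """Return (upstream, downstream), or None if the orientations conflict."""
    if strand_a == "+" and strand_b == "-":
        return gene_a, gene_b  # plus-oriented read marks the upstream gene
    if strand_a == "-" and strand_b == "+":
        return gene_b, gene_a
    return None  # "+/+" or "-/-" cannot come from head-to-head paired ends
```

Applied to Figure 8b, assign_upstream("GeneA", "+", "GeneB", "-") gives Gene A as the upstream gene; swapping the orientations, as in Figure 8c, gives Gene B.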

Obtaining the fused regions

Before we defined the fused regions in which the junction sites may be located, we obtained a non-redundant transcript sequence from the transcript(s) of each annotated gene (Additional file 3). Two methods were used to define the fused regions in gene pairs. In the first method, SOAPfuse bisects each FUM read, and generates two isometric segments, each called a half-unmapped read (HUM read; step S06 in Additional file 10). HUM reads are aligned against candidate gene pairs with SOAP2. A genuine junction read (junc-read) should have at least one HUM read that does not cover the junction site and can map to one gene of the pair. Based on the mapped HUM read, SOAPfuse extends one HUM read length from the mapped position in non-redundant transcripts to define the fused region wherein the junction site might be located (Figure 9a). For HUM reads with multiple hits, all locations of the hits are taken into account. The original reads of mapped HUM reads are called useful unmapped reads (UUM reads).
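A rough sketch of the first method is given below. It is an interpretation rather than the published implementation: coordinates are assumed to be 1-based and inclusive on the non-redundant transcript, an odd trailing base is dropped so that both HUM reads have equal length, and the extension direction is assumed to depend on which half of the read mapped. The function names are hypothetical.

```python
from typing import Tuple


def bisect_read(read: str) -> Tuple[str, str]:
    """Split a FUM read into two equal-length HUM reads (odd tail base dropped)."""
    half = len(read) // 2
    return read[:half], read[half:2 * half]


def fused_region_1(hum_start: int, hum_len: int, mapped_half: str) -> Tuple[int, int]:
    """Extend one HUM length beyond the mapped HUM read (1-based, inclusive)."""
    if mapped_half == "first":    # junction lies downstream of the mapped half
        return hum_start + hum_len, hum_start + 2 * hum_len - 1
    if mapped_half == "second":   # junction lies upstream of the mapped half
        return max(1, hum_start - hum_len), hum_start - 1
    raise ValueError("mapped_half must be 'first' or 'second'")
```

For a 90-bp read whose first 45-bp HUM read maps at position 101, this sketch places the fused region at positions 146 to 190 of the transcript.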

Figure 9. Obtaining fused regions by two methods. A junction sequence in a fusion transcript from a gene pair, Gene A and Gene B, in blue and orange, respectively, is shown. The junction site is displayed as yellow round dots on the fusion sequence. (a) Two unmapped reads (candidate junc-reads) are shown around the fusion sequence. Each read is bisected into two isometric HUM reads: one HUM read can map to one gene of the pair, while the other one cannot map to the gene as it covers the junction site (yellow round dot). From the location of the mapped HUM read, SOAPfuse extends one HUM read length to obtain the fused region, in which the junction site is located (yellow triangle). (b) Span-read mapping to the gene pair is shown. End 1 maps to Gene A (with position MP1), and end 2 maps to Gene B (with position MP2). From the mapped positions of both ends, SOAPfuse determines the potential fused region based on insert sizes (INS), standard deviation of insert sizes (SD), and the length of reads (RL1 and RL2 for both ends, respectively) and extends proper flanking bases to obtain the fused region.

SOAPfuse also uses span-reads to detect the fused regions in candidate gene pairs (step S07-a in Additional file 10). Span-reads, the paired-end reads supporting the candidate fusion gene pairs, are derived from the fused transcripts, and the junction sites are often located in regions of the fused transcripts between both ends of the span-reads. For upstream and downstream genes, we can extend one region with length equal to the insert size (evaluated in step S01) from the mapped position of each 3' end span-read to estimate the fused region covering the junction site (Figure 9b). Every gene pair is always supported by at least two span-reads, corresponding to several fused regions that may overlap with each other. We presumed that end 1 of a span-read mapped to position MP1 in Gene A, and end 2 of the span-read mapped to position MP2 in Gene B. The lengths of ends 1 and 2 of the span-reads are RL1 and RL2, respectively. The average of insert sizes (INS) and their standard deviation (SD) are evaluated in step S01. The fused regions were estimated by the following intervals:

The intervals of fused regions for the upstream genes are:

\[ [\,\mathrm{MP1} + \mathrm{RL1} - \mathrm{FLB},\ \ \mathrm{MP1} + \mathrm{INS} + 3 \times \mathrm{SD} - \mathrm{RL2} + \mathrm{FLB} - 1\,] \]

And the intervals of fused regions for the downstream genes are:

\[ [\,\mathrm{MP2} + \mathrm{RL2} - \mathrm{INS} - 3 \times \mathrm{SD} + \mathrm{RL1} - \mathrm{FLB},\ \ \mathrm{MP2} + \mathrm{FLB} - 1\,] \]

In the above formulas, a flanking region of length FLB was considered because sometimes a few bases from the 3' end of a span-read cover the junction sites in the mismatch-allowed alignment.

SOAPfuse combined the fused regions determined by the above two methods to detect the junction sites using the partial exhaustion algorithm as described below.
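Both intervals depend only on MP1, MP2, RL1, RL2, INS, SD and FLB, so they can be evaluated directly. The short Python sketch below transcribes the two interval definitions; the function names and the numeric values in the example are illustrative assumptions, not values from the study.

```python
from typing import Tuple


def upstream_fused_region(mp1: int, rl1: int, rl2: int,
                          ins: int, sd: int, flb: int) -> Tuple[int, int]:
    """Interval on the upstream gene that should contain the junction site."""
    return mp1 + rl1 - flb, mp1 + ins + 3 * sd - rl2 + flb - 1


def downstream_fused_region(mp2: int, rl1: int, rl2: int,
                            ins: int, sd: int, flb: int) -> Tuple[int, int]:
    """Interval on the downstream gene that should contain the junction site."""
    return mp2 + rl2 - ins - 3 * sd + rl1 - flb, mp2 + flb - 1


# Illustrative values: 90-bp reads, INS = 180, SD = 10, FLB = 5.
up = upstream_fused_region(mp1=1000, rl1=90, rl2=90, ins=180, sd=10, flb=5)
down = downstream_fused_region(mp2=500, rl1=90, rl2=90, ins=180, sd=10, flb=5)
```

With these values the upstream interval is [1085, 1124] and the downstream interval is [465, 504]; both have the same width, INS + 3*SD - RL1 - RL2 + 2*FLB = 40 positions.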

Construction of fusion junction sequence library with partial exhaustion algorithm

To simplify the explanation of the algorithm, we call the fused regions determined by the above two methods fused region 1 and fused region 2, respectively. Fused region 1, defined by the mapped HUM reads, is a small region covering the junction sites with a length smaller than one NGS read. Fused region 2 is a large region defined by the NGS library insert sizes, which are always much longer than HUM reads. Generally, fused region 1 is more useful than fused region 2 for defining the junction sites.

However, not all mapped HUM reads are from genuine junc-reads. Sometimes, an unmapped read from a given gene does not map to this gene because it has more mismatches than are allowed by SOAP2. Unmapped reads like this are not junc-reads, and after the bisection into two HUM reads, one of the HUM reads could be mapped to the original gene, which results in spurious fused regions. Fused region 2 involves alignments of the two ends of a span-read simultaneously, which are also filtered by several effective criteria (see the 'Identifying candidate gene pairs' section). SOAPfuse combined fused regions 1 and 2 to efficiently define the junction sites. SOAPfuse classifies fused region 2 into two types of sub-regions: overlapped parts between fused regions 1 and 2 are called the credible-region, while the other parts of fused region 2 are called the potential-region (Figure 10a).

Figure 10. Building the fusion junction sequence library using a partial exhaustion algorithm. A junction sequence in a fusion transcript from a gene pair, Gene A and Gene B in blue and orange, respectively, is shown. The junction site is shown as yellow round dots on the fusion segment, and as yellow triangles on the gene pair. (a) Fused regions 1 and 2 from two different methods are shown and fused region 2 is divided into credible-regions and potential-regions with the coordinates of each sub-region labeled in red font. An upstream putative junction site (U_i) is selected from fused region 2 in Gene A, and a downstream putative junction site (D_j) is selected from fused region 2 in Gene B. (b) For each U_i and D_j, SOAPfuse generates the candidate fusion junction sequence by creating pair-wise connections between U_i and D_j. U_i and D_j should not be located in potential-regions at the same time.

In order to build the fusion junction sequence library, we covered fused region 2 from each gene pair with 'tiles' that are spaced one nucleotide apart and we finally generated the candidate fusion junction library by creating all pair-wise connections between these tiles (Figure 10b). To eliminate the false positives in the junction sequence library, only the junction sequences in which at least one of the two junction sites in a gene pair is located in the credible-region were selected for further analysis. SOAPfuse carried out this partial exhaustion algorithm to reduce the size of the putative junction library and retain as many genuine junction sequences as possible.
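The selection rule can be sketched as a filtered Cartesian product over putative junction sites. The following Python code is an illustrative reading of the description, not the SOAPfuse implementation: intervals are assumed to be 1-based and inclusive, U_i is taken as the last upstream base and D_j as the first downstream base, and the `flank` length of each junction sequence half is a hypothetical parameter.

```python
from itertools import product
from typing import Iterator, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive


def in_any(pos: int, intervals: List[Interval]) -> bool:
    return any(start <= pos <= end for start, end in intervals)


def candidate_junctions(up_region2: Interval, up_credible: List[Interval],
                        down_region2: Interval, down_credible: List[Interval]
                        ) -> Iterator[Tuple[int, int]]:
    """Yield (U_i, D_j) pairs, skipping pairs with both sites in potential-regions."""
    ups = range(up_region2[0], up_region2[1] + 1)
    downs = range(down_region2[0], down_region2[1] + 1)
    for u, d in product(ups, downs):
        if in_any(u, up_credible) or in_any(d, down_credible):
            yield u, d


def junction_sequence(up_seq: str, down_seq: str, u: int, d: int, flank: int) -> str:
    """Join `flank` bases ending at upstream site u with `flank` bases starting at d."""
    return up_seq[max(0, u - flank):u] + down_seq[d - 1:d - 1 + flank]


pairs = list(candidate_junctions((1, 10), [(4, 6)], (1, 8), [(2, 3)]))
```

In this toy case the full exhaustion would produce 10 x 8 = 80 junctions; discarding the 7 x 6 = 42 pairs whose sites both fall in potential-regions leaves 38 candidates, which illustrates how the credible-region constraint shrinks the library.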

Detection of junction sites in fusion transcripts

To identify the junction sites of fusion transcripts, we mapped the useful-unmapped-reads (UUM reads; see the 'Obtaining the fused regions' section) to the putative fusion junction sequence library to seek the junction reads (step S07-b in Additional file 10). We required that a candidate fusion be supported by multiple span-reads and junction reads, and meet other criteria (step S08 in Additional file 10; Additional file 3). To exclude FP fusion events, we removed the initial candidate fusion gene pairs that were close to each other and that had homogenous/overlapping regions around the junction sites (Figure 6c; step S09 in Additional file 10). SOAPfuse not only reports high-confidence fusions but also provides the predicted junction sequences for further RT-PCR experimental validation. SVG figures are also created, showing the alignments of supporting reads on junction sequences and the expression level of gene pairs (for example, Additional file 12).

Preparation of simulated datasets

Simulated RNA-Seq data were generated to evaluate the FN and FP rates of SOAPfuse. We generated 150 simulated fusion transcripts in two steps based on human annotated genes. The first step involved randomly selecting candidate gene pairs with several criteria, such as controlling the distance between paired genes and avoiding gene pairs from gene families. The second step involved randomly selecting transcripts and junction sites at the exon edges or in the middle of exons. Using the short-read simulator provided by MAQ [31], we generated paired-end reads at nine sequencing depths (5- to 200-fold) to simulate different expression levels of fusion transcripts. Paired-end reads from H1 human embryonic stem cells were used as background data. Details of the simulation work can be found in Additional file 3.

Total RNA preparation from bladder cancer cell lines

Two bladder cancer cell lines (5637 and T24) were purchased from the American Type Culture Collection (Manassas, VA, USA). They were cultured in RPMI 1640 medium (Invitrogen, Grand Island, NY, USA) containing 10% fetal bovine serum (Sigma, Saint Louis, MO, USA). Total RNAs were prepared using Trizol (Invitrogen) according to the manufacturer's instructions. They were treated with RNase-free DNase I to remove residual DNA. The quality of total RNAs was evaluated using an Agilent 2100 Bioanalyser.

cDNA library construction for RNA-Seq

The cDNA libraries were constructed as described in previous studies [36,37]. Briefly, beads (Invitrogen) with oligo (dT) were used to isolate poly (A) mRNA from total RNAs. To avoid priming bias in the process of synthesizing cDNA, mRNA was fragmented before the cDNA synthesis. Purified mRNA was then fragmented in fragmentation buffer at an elevated temperature. Using these short fragments as templates, random hexamer-primers were used to synthesize the first-strand cDNA. The second-strand cDNA was synthesized using buffer, dNTPs, RNase H and DNA polymerase I. Short double-stranded cDNA fragments were purified with a QIAquick PCR extraction kit (Qiagen, Hilden, Germany) and then subjected to an end repair process and the addition of a single 'adenine' base. Next, the short fragments were ligated to Illumina sequencing adaptors. cDNA fragments of a selected size were gel-purified and amplified by PCR. In total, we constructed one paired-end transcriptome library for each cell line, and sequenced them on the Illumina HiSeq2000 platform. Both paired-end libraries were sequenced to a 90-bp read length with insert sizes ranging from 150 to 200 bp. RNA-Seq data from the two bladder cancer cell lines have been submitted to the NCBI Sequence Read Archive (SRA) and are available under accession number [SRA052960].

Fusion validation by RT-PCR

The digested total RNAs from the bladder cancer cell lines were reverse-transcribed to cDNA for validation using reverse transcriptase (Invitrogen) and oligo-d(T) primers (TaKaRa, Dalian, China). Then, fusion transcripts were validated using RT-PCR amplification followed by Sanger sequencing. For the RT-PCR amplification, the primers were designed using Primer (version 5.0) and all primer sequences can be found in Table S11 in Additional file 8. We carried out the RT-PCR amplifications using TaKaRa Taq™ Hot Start Version and performed reactions in 20 μl volumes with 2 μl of 10× PCR buffer (Mg2+ Plus), 2 μl of dNTP mixture (each 2.5 mM), 2 μl of primers (each 10 μM), 0.5 μl of TaKaRa Taq HS (5 U/μl), 20 ng of cDNA and ddH2O up to 20 μl. The thermocycler program used was the following: (i) 95°C for 4 minutes, (ii) 95°C for 40 seconds, (iii) 55°C to 62°C for 30 seconds, (iv) 72°C for 45 seconds, (v) steps ii through iv repeated 35 times, and (vi) 72°C for 10 minutes. The products of RT-PCR amplification were analyzed on a 2% agarose gel to make sure that no unexpected bands were amplified. The purified RT-PCR products were sequenced in forward and reverse directions with the ABI PRISM Big Dye Terminator Cycle Sequencing Ready Reaction kit (version 3) and ABI PRISM 3730 Genetic Analyzer (Applied Biosystems, Foster City, CA, USA). Chromatograms were generated by Chromas (version 2.22), and then were analyzed by BLAT (online genome alignment on the UCSC Genome Browser [38]).

Abbreviations

bp: base pair; CPU: central processing unit; FN: false negative; FP: false positive; FUM: filtered unmapped reads; HUM: half unmapped read bisected from FUM reads; NGS: next-generation sequencing; PE: paired-end mapped result; RL: read length; RT-PCR: reverse transcription polymerase chain reaction; SE: single-end mapped result; SVG: Scalable Vector Graphics; UM: unmapped read.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JW and GG conceived and designed the basic algorithm of SOAPfuse. WJ and KQ implemented and optimized the algorithm. PS and SW performed the validation experiments. WJ, KQ, MH, PS, QZ and FZ carried out the comparison among different tools. YY developed the cell lines. XL, XZ and SP tested and deployed the software on TianHe series supercomputers. JW, YL and GG supervised the project and gave advice. WJ, KQ, MH, PS, DZ, MLN and GG wrote and revised the manuscript. All authors read and approved the final manuscript.

Supplementary Material

Additional file 1:

Table S1 - information on all known fusions from two previous studies. Additional detailed information on the known fusions in two previous studies (melanoma and breast cancer research). All information on fusions is based on release 59 of the Ensembl hg19 annotation database.

Additional file 2:

Table S2 - software selected for evaluation of performance and sensitivity.

Additional file 3:

Supplementary notes.

Additional file 4:

Table S3 - detailed information on performance and fusion detection sensitivity of six tools. CPU time, maximum memory usage and sensitivity of fusion detection for each tool are shown. For the multiple process operations, CPU time has been translated to single process usage.

Additional file 5:

Table S4 - detection screen of six tools on two previous study datasets.

Additional file 6:

Tables S5, S6 and S7. Table S5: detailed information on simulated RNA-Seq reads. Table S6: list of 150 simulated fusion events. Table S7: number of fusion-supporting reads for each fusion event.

Additional file 7:

Tables S8 and S9. Table S8: TP and FP rates of SOAPfuse, deFuse and TopHat-Fusion based on simulated datasets. Table S9: detailed information on the simulated fusion events detected by SOAPfuse, deFuse and TopHat-Fusion.

Additional file 8:

Tables S10 and S11. Table S10: fusion transcripts detected by SOAPfuse and deFuse in two bladder cancer cell lines. Table S11: primers and Sanger sequences of confirmed fusions in two bladder cancer cell lines.

Additional file 9:

Figure S1 - models of fusion transcripts generated by genome rearrangement. (a) Fusion transcript created by genomic inversion of Gene A and Gene B, which are from different DNA strands. (b) Fusion transcript formed by genomic translocation in which Gene C and Gene D are from the same DNA strand and are far from each other.

Additional file 10:

Figure S2 - schematic diagrams of nine steps in the SOAPfuse pipeline. The SOAPfuse algorithm consists of nine steps (from S01 to S09) and details of each step are in the Materials and methods or Additional file 3.

Additional file 11:

Table S12 - sixteen combinations of span-reads. There are 16 combinations based on serial numbers of reads and their mapped orientations, but only four combinations are rational, supporting two types of fusions in which the upstream and downstream genes are different.

Additional file 12:

Figure S3 - schematic diagrams of fusion event RECK-ALX3. (a) Alignment of supporting reads against the predicted junction sequence. The upstream part of the junction sequence is in green, and the downstream part is in red. Span-reads are displayed above the predicted junction sequence with the colored dotted line linking paired-end reads. Junc-reads are shown below the junction sequence. (b,c) Expression analysis of the exons in RECK and ALX3 by RNA-Seq read coverage. Transcripts of RECK and ALX3 are shown below the coordinates. The junction site is shown as a red round dot and a green arrow indicates the transcript orientation in the genome sequence. The region covered by the red line is the region mapped by supporting reads. In this case, we found that the expression levels of RECK and ALX3 exons at bilateral sides of junction sites are significantly different. The exons involved in the fusion transcript are expressed more highly than other ones.

Acknowledgements

This work was supported by the National Basic Research Program of China (973 program 2011CB809200) and the National High Technology Research and Development Program of China (863 Program, 2012AA02A201). This project was also funded by the Shenzhen municipal government of China and the local government of the Yantian District of Shenzhen. Thanks to Xueda Hu (BGI-Shenzhen) for giving advice and supporting this project. Thanks to the TianHe research and development team of the National University of Defense Technology for testing, optimizing and deploying the software on TianHe series supercomputers. We thank D Kim (TopHat-Fusion group), SL Salzberg (TopHat-Fusion group), and A McPherson (deFuse group) for help on operating software. We also thank the FusionMap group for helpful comments on simulation data.

References

Mitelman F, Johansson B, Mertens F. Fusion genes and rearranged genes equally a linear function of chromosome aberrations in cancer. Nat Genet. 2004;14:331–334. doi: x.1038/ng1335. [PubMed] [CrossRef] [Google Scholar]
Mitelman F, Johansson B, Mertens F. The affect of translocations and cistron fusions on cancer causation. Nat Rev Cancer. 2007;14:233–245. doi: x.1038/nrc2091. [PubMed] [CrossRef] [Google Scholar]
Frohling South, Dohner H. Chromosomal abnormalities in cancer. Northward Engl J Med. 2008;fourteen:722–734. doi: 10.1056/NEJMra0803109. [PubMed] [CrossRef] [Google Scholar]
Tkachuk DC, Westbrook CA, Andreeff K, Donlon TA, Cleary ML, Suryanarayan 1000, Homge 1000, Redner A, Grayness J, Pinkel D. Detection of bcr-abl fusion in chronic myelogeneous leukemia by in situ hybridization. Science. 1990;14:559–562. doi: 10.1126/science.2237408. [PubMed] [CrossRef] [Google Scholar]
Tomlins SA, Rhodes DR, Perner Southward, Dhanasekaran SM, Mehra R, Sunday XW, Varambally Due south, Cao X, Tchinda J, Kuefer R, Lee C, Montie JE, Shah RB, Pienta KJ, Rubin MA, Chinnaiyan AM. Recurrent fusion of TMPRSS2 and ETS transcription gene genes in prostate cancer. Science. 2005;14:644–648. doi: 10.1126/scientific discipline.1117679. [PubMed] [CrossRef] [Google Scholar]
Tomlins SA, Laxman B, Dhanasekaran SM, Helgeson BE, Cao X, Morris DS, Menon A, Jing X, Cao Q, Han B, Yu J, Wang L, Montie JE, Rubin MA, Pienta KJ, Roulston D, Shah RB, Varambally Southward, Mehra R, Chinnaiyan AM. Distinct classes of chromosomal rearrangements create oncogenic ETS factor fusions in prostate cancer. Nature. 2007;fourteen:595–599. doi: 10.1038/nature06024. [PubMed] [CrossRef] [Google Scholar]
Soda Yard, Choi YL, Enomoto M, Takada S, Yamashita Y, Ishikawa Southward, Fujiwara Due south, Watanabe H, Kurashina K, Hatanaka H, Bando Grand, Ohno S, Ishikawa Y, Aburatani H, Niki T, Sohara Y, Sugiyama Y, Mano H. Identification of the transforming EML4-ALK fusion factor in not-small-cell lung cancer. Nature. 2007;14:561–566. doi: 10.1038/nature05945. [PubMed] [CrossRef] [Google Scholar]
Bass AJ, Lawrence MS, Caryatid LE, Ramos AH, Drier Y, Cibulskis Yard, Sougnez C, Voet D, Saksena G, Sivachenko A, Jing R, Parkin M, Pugh T, Verhaak RG, Stransky North, Boutin AT, Barretina J, Solit DB, Vakiani East, Shao Westward, Mishina Y, Warmuth 1000, Jimenez J, Chiang DY, Signoretti S, Kaelin WG, Spardy N, Hahn WC, Hoshida Y, Ogino S. et al.Genomic sequencing of colorectal adenocarcinomas identifies a recurrent VTI1A-TCF7L2 fusion. Nat Genet. 2011;fourteen:964–968. doi: 10.1038/ng.936. [PMC costless commodity] [PubMed] [CrossRef] [Google Scholar]
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;fourteen:621–628. doi: x.1038/nmeth.1226. [PubMed] [CrossRef] [Google Scholar]
Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB. Culling isoform regulation in human being tissue transcriptomes. Nature. 2008;14:470–476. doi: 10.1038/nature07509. [PMC complimentary article] [PubMed] [CrossRef] [Google Scholar]
Hillier LW, Reinke 5, Green P, Hirst M, Marra MA, Waterston RH. Massively parallel sequencing of the polyadenylated transcriptome of C. elegans. Genome Res. 2009;fourteen:657–666. doi: 10.1101/gr.088112.108. [PMC costless article] [PubMed] [CrossRef] [Google Scholar]
Nagalakshmi U, Wang Z, Waern Thousand, Shou C, Raha D, Gerstein M, Snyder K. The transcriptional mural of the yeast genome defined by RNA sequencing. Science. 2008;fourteen:1344–1349. doi: 10.1126/science.1158441. [PMC gratis commodity] [PubMed] [CrossRef] [Google Scholar]
Filichkin SA, Priest HD, Givan SA, Shen R, Bryant DW, Fox SE, Wong WK, Mockler TC. Genome-broad mapping of alternative splicing in Arabidopsis thaliana. Genome Res. 2010;14:45–58. doi: 10.1101/gr.093302.109. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
McManus CJ, Coolon JD, Duff MO, Eipper-Mains J, Graveley BR, Wittkopp PJ. Regulatory deviation in Drosophila revealed by mRNA-seq. Genome Res. 2010;14:816–825. doi: 10.1101/gr.102491.109. [PMC gratuitous commodity] [PubMed] [CrossRef] [Google Scholar]
Zhang G, Guo Yard, Hu X, Zhang Y, Li Q, Li R, Zhuang R, Lu Z, He Z, Fang 10, Chen L, Tian Due west, Tao Y, Kristiansen K, Zhang X, Li S, Yang H, Wang J. Deep RNA sequencing at single base-pair resolution reveals high complexity of the rice transcriptome. Genome Res. 2010;14:646–654. doi: 10.1101/gr.100677.109. [PMC free commodity] [PubMed] [CrossRef] [Google Scholar]
Wang B, Guo G, Wang C, Lin Y, Wang X, Zhao M, Guo Y, He Chiliad, Zhang Y, Pan L. Survey of the transcriptome of Aspergillus oryzae via massively parallel mRNA sequencing. Nucleic Acids Res. 2010;14:5075–5087. doi: ten.1093/nar/gkq256. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, Jing 10, Sam 50, Barrette T, Palanisamy N, Chinnaiyan AM. Transcriptome sequencing to discover factor fusions in cancer. Nature. 2009;14:97–101. doi: 10.1038/nature07638. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
Maher CA, Palanisamy N, Brenner JC, Cao X, Kalyana-Sundaram S, Luo Southward, Khrebtukova I, Barrette TR, Grasso C, Yu J, Lonigro RJ, Schroth M, Kumar-Sinha C, Chinnaiyan AM. Chimeric transcript discovery by paired-finish transcriptome sequencing. Proc Natl Acad Sci USA. 2009;14:12353–12358. doi: x.1073/pnas.0904720106. [PMC gratuitous article] [PubMed] [CrossRef] [Google Scholar]
Berger MF, Levin JZ, Vijayendran Chiliad, Sivachenko A, Adiconis X, Maguire J, Johnson LA, Robinson J, Verhaak RG, Sougnez C, Onofrio RC, Ziaugra Fifty, Cibulskis G, Laine Eastward, Barretina J, Winckler Due west, Fisher DE, Getz Thousand, Meyerson M, Jaffe DB, Gabriel SB, Lander ES, Dummer R, Gnirke A, Nusbaum C, Garraway LA. Integrative assay of the melanoma transcriptome. Genome Res. 2010;xiv:413–427. doi: 10.1101/gr.103697.109. [PMC costless commodity] [PubMed] [CrossRef] [Google Scholar]
Edgren H, Murumagi A, Kangaspeska South, Nicorici D, Hongisto V, Kleivi K, Rye IH, Nyberg S, Wolf M, Borresen-Dale AL, Kallioniemi O. Identification of fusion genes in chest cancer by paired-end RNA-sequencing. Genome Biol. 2011;14:R6. doi: 10.1186/gb-2011-12-1-r6. [PMC gratis article] [PubMed] [CrossRef] [Google Scholar]
Sboner A, Habegger L, Pflueger D, Terry S, Chen DZ, Rozowsky JS, Tewari AK, Kitabayashi N, Moss BJ, Chee MS, Demichelis F, Rubin MA, Gerstein MB. FusionSeq: a modular framework for finding gene fusions by analyzing paired-end RNA-sequencing data. Genome Biol. 2010;11:R104. doi: 10.1186/gb-2010-11-10-r104.
McPherson A, Hormozdiari F, Zayed A, Giuliany R, Ha G, Sun MG, Griffith M, Heravi Moussavi A, Senz J, Melnyk N, Pacheco M, Marra MA, Hirst M, Nielsen TO, Sahinalp SC, Huntsman D, Shah SP. deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data. PLoS Comput Biol. 2011;7:e1001138. doi: 10.1371/journal.pcbi.1001138.
Kim D, Salzberg SL. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biol. 2011;12:R72. doi: 10.1186/gb-2011-12-8-r72.
Li Y, Chien J, Smith DI, Ma J. FusionHunter: identifying fusion transcripts in cancer using paired-end RNA-seq. Bioinformatics. 2011;27:1708–1710. doi: 10.1093/bioinformatics/btr265.
Asmann YW, Hossain A, Necela BM, Middha S, Kalari KR, Sun Z, Chai HS, Williamson DW, Radisky D, Schroth GP, Kocher JP, Perez EA, Thompson EA. A novel bioinformatics pipeline for identification and characterization of fusion transcripts in breast cancer and normal cell lines. Nucleic Acids Res. 2011;39:e100. doi: 10.1093/nar/gkr362.
Iyer MK, Chinnaiyan AM, Maher CA. ChimeraScan: a tool for identifying chimeric transcription in sequencing data. Bioinformatics. 2011;27:2903–2904. doi: 10.1093/bioinformatics/btr467.
Ge H, Liu K, Juan T, Fang F, Newman M, Hoeck W. FusionMap: detecting fusion genes from next-generation sequencing data at base-pair resolution. Bioinformatics. 2011;27:1922–1928. doi: 10.1093/bioinformatics/btr310.
Flicek P, Amode MR, Barrell D, Beal K, Brent S, Chen Y, Clapham P, Coates G, Fairley S, Fitzgerald S, Gordon L, Hendrix M, Hourlier T, Johnson N, Kahari A, Keefe D, Keenan S, Kinsella R, Kokocinski F, Kulesha E, Larsson P, Longden I, McLaren W, Overduin B, Pritchard B, Riat HS, Rios D, Ritchie GR, Ruffier M, Schuster M, et al. Ensembl 2011. Nucleic Acids Res. 2011;39:D800–806. doi: 10.1093/nar/gkq1064.
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25. doi: 10.1186/gb-2009-10-3-r25.
Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009;25:1966–1967. doi: 10.1093/bioinformatics/btp336.
Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008;18:1851–1858. doi: 10.1101/gr.078212.108.
Salzman J, Marinelli RJ, Wang PL, Green AE, Nielsen JS, Nelson BH, Drescher CW, Brown PO. ESRRA-C11orf20 is a recurrent gene fusion in serous ovarian carcinoma. PLoS Biol. 2011;9:e1001156. doi: 10.1371/journal.pbio.1001156.
Singh D, Chan JM, Zoppoli P, Niola F, Sullivan R, Castano A, Liu EM, Reichel J, Porrati P, Pellegatta S, Qiu K, Gao Z, Ceccarelli M, Riccardi R, Brat DJ, Guha A, Aldape K, Golfinos JG, Zagzag D, Mikkelsen T, Finocchiaro G, Lasorella A, Rabadan R, Iavarone A. Transforming fusions of FGFR and TACC genes in human glioblastoma. Science. 2012;337:1231–1235. doi: 10.1126/science.1220834.
Wang J, Mullighan CG, Easton J, Roberts S, Heatley SL, Ma J, Rusch MC, Chen K, Harris CC, Ding L, Holmfeldt L, Payne-Turner D, Fan X, Wei L, Zhao D, Obenauer JC, Naeve C, Mardis ER, Wilson RK, Downing JR, Zhang J. CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nat Methods. 2011;8:652–654. doi: 10.1038/nmeth.1628.
Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324.
Peng Z, Cheng Y, Tan BC, Kang L, Tian Z, Zhu Y, Zhang W, Liang Y, Hu X, Tan X, Guo J, Dong Z, Bao L, Wang J. Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome. Nat Biotechnol. 2012;30:253–260. doi: 10.1038/nbt.2122.
Gao F, Liu X, Wu XP, Wang XL, Gong D, Lu H, Xia Y, Song Y, Wang J, Du J, Liu S, Han X, Tang Y, Yang H, Jin Q, Zhang X, Liu M. Differential DNA methylation in discrete developmental stages of the parasitic nematode Trichinella spiralis. Genome Biol. 2012;13:R100. doi: 10.1186/gb-2012-13-10-r100.
BLAT Search Genome. http://genome.ucsc.edu/cgi-bin/hgBlat?command=start

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4054009/


