Manual for circseq_cup

(http://ibi.zju.edu.cn/bioinplant/)

circseq_cup is a combined tool to identify full-length circular RNAs (circRNAs).
We refer to the CIRCexplorer algorithm from Yang Lab for Tophat-Fusion pipeline and the way to deal with BAM file and ref.txt.

Version: 1.0

Last Modified: 2016-4-18

Authors: Xingchen Zhang (zhangxc_bio@foxmail.com), Chu-Yu Ye (yecy@zju.edu.cn)

##Prerequisites

###Software / Package

####TopHat / STAR / segemehl

* TopHat & TopHat-Fusion
+ [TopHat](http://ccb.jhu.edu/software/tophat/index.shtml) 2.1.0
+ [TopHat-Fusion](http://ccb.jhu.edu/software/tophat/fusion_index.html) included in TopHat 2.1.0
* STAR
+ [STAR](https://github.com/alexdobin/STAR) 2.4.2a
* segemehl
+ [segemehl](http://www.bioinf.uni-leipzig.de/Software/segemehl) 0.1.9

####Others
* [bedtools](https://github.com/arq5x/bedtools2)
* [SAMtools](http://samtools.sourceforge.net)
* [pysam](https://github.com/pysam-developers/pysam)
* [interval](https://github.com/kepbod/interval) (:bangbang: Note: it is not the interval package in PyPI, please do not install it using `pip install interval`)
* [docopt](https://github.com/docopt/docopt)
* [biopython](https://pypi.python.org/pypi/biopython/1.65)
* [cap3](http://seq.cs.iastate.edu/cap3.html)

###Aligner

circseq_cup was originally developed as a circRNA analysis toolkit supporting TopHat & TopHat-Fusion, STAR, segemehl and other aligners.

####TopHat & TopHat-Fusion

To obtain junction reads for circRNAs, two-step mapping strategy was exploited:

* Multiple mapping with TopHat

```bash
tophat2 -a 6 --microexon-search -m 2 -p 10 -G knownGene.gtf -o tophat hg19_bowtie2_index RNA_seq.fastq
```

* Convert unmapped reads (using bamToFastq from [bedtools](https://github.com/arq5x/bedtools2))

```bash
bamToFastq -i tophat/unmapped.bam -fq tophat/unmapped.fastq
```

* Unique mapping with TopHat-Fusion

```bash
tophat2 -o tophat_fusion -p 15 --fusion-search --keep-fasta-order --bowtie1 --no-coverage-search hg19_bowtie1_index tophat/unmapped.fastq
```

####STAR

To detect fusion junctions with STAR, `--chimSegmentMin` should be set to a positive value. For more details about STAR, please refer to [STAR manual](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf).

```bash
star_parse.py Chimeric.out.junction fusion_junction.txt
```

* parse `fusion_junction.txt`

####segemehl

To activate the split read mapping, it is only required to give the option [-S, --splits]. For more details about segemehl, please refer to [segemehl manual](http://www.bioinf.uni-leipzig.de/Software/segemehl/segemehl_manual_0_1_7.pdf).

```bash
grep :C: splitmap.txt > splitmap_c.txt
segemehl_parse.pl splitmap_c.txt fusion_info.txt
```

* parse `fusion_info.txt`

##Usage

```bash

circseq_cup -- circular RNA analysis toolkits.

Usage: RunMe.py [options]

Options:
-h --help Show this screen.
--version Show version.
-o PREFIX --output=PREFIX Output prefix [default: circ].
-t TYPE --type=TYPE The type of upstream aligners.(tophat_fusion or others) [default: others]
-f FILE --file=FILE The input file the upstream aligner made.\
TopHat-Fusion fusion BAM file (used in TopHat-Fusion mapping) dir for tophat_fusion, \
the fusion site file for others.
-l LENGHT --length=LENGHT The maximal length between candidate start and end [default: 5000].
-r REF --ref=REF Gene annotation.
-g GENOME --genome=GENOME Genome FASTA file.
-1 FASTQ_1 --fastq_1=FASTQ_1 Reads1.(*.fastq or *.fastq.gz)
-2 FASTQ_2 --fastq_2=FASTQ_2 Reads2.
```

###Note

* Gene annotation is in the format ([Gene Predictions and RefSeq Genes with Gene Names](https://genome.ucsc.edu/FAQ/FAQformat.html#format9))

* GENOME is genome sequence in FASTA format and indexed (using `samtools faidx` from [SAMtools](http://samtools.sourceforge.net))

* You could download relevant ref.txt (Known Genes, RefSeq or Ensembl) and the genome fasta file from UCSC.

###Example

####TopHat & TopHat-Fusion

```bash
RunMe.py -o prefix -t tophat_fusion -f tophat_fusion_dir -r ref.txt -g genome.fa -1 Read_1.fastq -2 Read_2.fastq > prefix.out
```

[example](python /workcenters/circseq_cup/RunMe.py -o hg19_circ -t tophat_fusion -f /workcenters/tophat_output/hg19_circ_fusion -r /workcenters/reference/hg19/hg19_refFlat.txt -g /workcenters/reference/hg19/hg19_genome.fa -1 /workcenters/data/hg19/SRR444974_1.fastq.gz -2 /workcenters/hg19/SRR444974_2.fastq.gz > hg19_circ.out)

####Other aligners

```bash
RunMe.py -o prefix -t other_aligner -f fusion_site_file -r ref.txt -g genome.fa -1 Read_1.fastq -2 Read_2.fastq > prefix.out
```

The fusion_site_file should be in the following format:

chrom start end
...

The chrom name should be consistent with the genome.See [the example file] (example/input_example/example_fusion_site_file.txt)

*Attention: fusion_site_file should be 0-base

####Scripts

To run RunMe.py would call following scripts:

* get_candidate_site.py (To get circle candidates according to TopHat-Fusion fusion BAM file or fusion info file)
* pack_and_assemble.py (To pack the PEs supporting joint and use Cap3 to assemble circle sequence)
* circ_anotation.py (To filter assembled circle sequence with some rules. It compared the contig with references including the union of all exons, every linear transcript and the genomic sequence between the backsplicing sites to found the best hit and the compared result was output)
* circ_seq_statistics.pl (To make statistics of the output)

##Output

The programme will create a folder in the current directory named prefix_output containing following folders and files.

* circ_index (folder, with bowtie2 index)
* circ_thout (folder,the output of Tophat of '2n' reference)
* pack_reads (folder,pack the pair reads covering joint)
* cap3_circ_res (file,the raw result of Cap3)
* prefix_reads_num (file,the number of pair reads for each joint)
* prefix_res (target file,the brief results)
* prefix_res_dtails (target file,the details of the results)
* prefix_res_statistics.out (target file,the statistics of the results)

The positions of all the sites are 0-base.

### prefix_res

chrom_start_end
overlap (>=5)
overlap error (0 or 1)
circle type (exon intron inter(intergenic))
length of predict circle sequence
supporting PE number (>=2)
splicing site
mismatch number
length equal flag (-1 0 1)

predict circle sequence

See details in [the example file] (example/output_example/example_res)

### prefix_res_dtails

chrom_start_end
ovlap:x ovlap_err:x pair_num:x circ_len:x map_err:x(x=0) gene:x bound:x x exon_block:x x
circ_seq:
xxx

chrom_start_end
ovlap:x ovlap_err:x pair_num:x circ_len:x map_err:x(x>0) gene:x bound:x x exon_block:x x
ref_seq:
xxx
circ_seq:
xxx

* ovlap (>=5, overlap length, overlap means the same sequence in both ends)
* ovlap_err (0 or 1, the mismatch number)
* pair_num (>=2, supporting PE pair number)
* circ_len (the length of the predicted circle sequence)
* map_err (the mismatch number compared with the best hit reference)
* gene (parental gene. the union of all exons is the best hit, e.g. LOC_Os01g09550; one of the linear transcript is the best hit, e.g. LOC_Os01g09550.1)
* bound (splicing signals and annotated spliced sites flag. For annotated splicing sites flag, 0 means that the site is not an annotated splicing site; 1 means that it is an annotated splicing site with the same direction(start to start or end to end),-1 means that it is an annotated splicing site with different directions(start to end or end to start))
* exon_block (the region of exon between the two sites)

See details in [the example file] (example/output_example/example_res_dtails)

### prefix_res_statistics

Use union-find sets to find alternative circularization.