RNA-MATE user manual (preliminary documentation)
Version 1.01
March 2009
Contact:
Nicole Cloonan
n.cloonan@https://www.wendangku.net/doc/2f17911552.html,
Institute for Molecular Bioscience
The University of Queensland
St Lucia, QLD, 4072
License:
Copyright ? 2008, 2009 Nicole Cloonan, Qinying Xu, Geoffrey Faulkner, and Sean Grimmond.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This package is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see
Table of Contents
License 2
pipeline
4
RNA-MATE
The
4
Description
General
5
(optional)
Part 1: Quality checking of
tag
the
Part 2: Recursive alignment to the human or mouse genome 5
(optional) 5 Part 3: Multi-mapping tag rescue
Part 4: Creation of visualization files 5
Availability 6
Requirements 6 Installation 6
Scripts 7
7
rna-mate-v1.01.pl
script:
Master
7
Configuration
file
options 8 Configuration
Modules 11 File 12 Log
outputs
13
inputs
and
Module
Modifying the pipeline to work with other queues 17
Optimizing performance on your cluster 17
The RNA-MATE pipeline
General Description
For mammalian genomes, there are technical challenges associated with mapping and counting short-tag sequences derived from high-throughput sequencing data. Firstly, mammalian transcripts are non-contiguous due to the splicing of introns from the pre-mRNA. This means that there will be a portion of tags that cross exon-exon boundaries that will not map directly to the genome. The ability to use short tag information relies directly upon being able to place short tags uniquely within the genome. The presence of genome wide repeats and other repetitive sequence in the mouse and human genomes mean that a sizeable proportion of short tags can not be placed uniquely. Finally, the random fragmentation of mRNA creates a distribution of sizes, of which a significant proportion will be less than the full length of the tag, and these will contain adaptor sequence that will not map to the genome. Here we present a computational pipeline to map RNAseq data, which generates both tag counting and genome-browser visualization of genomic and exon-junction matching results. RNA-MATE (Mapping and Alignment Tool for Expression) is designed for the rapid mapping of data from the Applied Biosystems SOLiD system (Figure 1).
Figure 1. The RNA-MATE recursive mapping pipeline. The pipeline consists of 4 major components. (1) The optional tag quality module filters tags based on the quality values for each basecall. (2) The alignment module attempts to align tags first to the genome, and then to a library of known exon-junction sequences. If a tag fails to align, then the tag is truncated, and the process is repeated. (3) The optional tag rescue module is uses information derived from both single-mapping and multi-mapping tags to uniquely place multi-mapping tags. (4) Finally, UCSC genome browser compatible wiggle plots and BED files are generated.
Start End
Read
Configuration File
check quality?Genome/Junction
Alignment
tag aligned?Quality Check
Trim Tag
Yes
Yes
rescue multimappers?
Select Single Mapping Tags
Multimapping Tag Rescue
Create Wiggle Plot Files Create Junction BED files
Yes No
No
No 1
2
3
4
Part 1: Quality checking of the tag (optional)
Depending on the downstream applications of the matched data the quality of individual tags may need to be assessed before their inclusion in the mapping pipeline. To accommodate this, we have provided an optional tag quality module which assesses the tags by the number of basecalls with PHRED scores of less than 10. Tags that pass the QC are fed into the recursive alignment module. If this option is disabled, all tags are passed to the alignment module.
Part 2: Recursive alignment to the human or mouse genome
Alignment of the short tags to a reference genome is done using mapreads (https://www.wendangku.net/doc/2f17911552.html,/gf/project/mapreads/), an algorithm specifically designed for the rapid mapping of data from the Applied Biosystems SOLiD system (ie. color-space data). Tags are first matched against all chromosomes of the reference genome, and then against a library of known exon-junctions (hg18 and mm9 are currently supported). Tags that fail to map to the genome or junctions are chopped to user defined lengths, and the genomic mapping is restarted. In this way, tags that have adaptor sequence, or poor quality ends are recovered at their longest length. The number of mismatches between the reference and tag is user defined, and when mappings from all tags are collated into a single file, only the mappings at the highest level of stringency are retained.
Part 3: Multi-mapping tag rescue (optional)
For most downstream applications, tags are only informative if they can be placed uniquely within a genome. Tags that align to multiple places within a genome make up a sizeable proportion of transcriptome derived tags, primarily from the inherent redundancy of the genome, but also from CpG islands and genome wide repeat elements. Strategies to rescue ambiguous sequences have recently been applied to high-throughput sequencing data, and we have refined our previously published algorithm to work efficiently with large data sets. For every multi-mapping tag, the algorithm considers all tags that map near to each of the possible locations of the tag (within a user-specified window) to determine the most likely mapping position of the tag. Where a tag can not be unambiguously assigned, a fractional weighting to the relevant positions is assigned. In practice, between 40-60% of multi-mapping tags can be assigned a single position with ≥60% likelihood, depending on the relative sequence coverage. The recommended window size for shotgun sequencing is 10 (Cloonan, et al., 2008), whereas the window recommended for CAGE data and other disparate data sets is X (ref?).
Part 4: Creation of visualization files
Finally, UCSC genome browser compatible wiggle plots for genome mapped data, and BED files for exon-junction mapped data are generated automatically from the collated results. The wiggle plots are strand-specific, single-nucleotide resolution coverage plots, and directly represent the number of times an individual nucleotide has been seen in the sequencing data. BED files depict hits to junction sequences, and graphically display exon combinatorics. In addition, plots containing only start sites of tags are included to facilitate tag-counting applications.
Availability
All source code, documentation, and associated files described in this manual are freely available for download from:
https://www.wendangku.net/doc/2f17911552.html,.au/RNA-MATE/
or
https://www.wendangku.net/doc/2f17911552.html,/gf/project/rnamate/
Requirements
This pipeline is written predominantly in perl (with some python thrown in for good measure), and requires that you have version 5.8.8 of perl or later, and python version 2.4 or later. It is designed to run in a unix environment, with a PBS queue manager. The scripts can be modified to work with an LSF or SGE manager. It is not recommended to run this pipeline on a system without access to a cluster due to the large computational requirements of mapping to mammalian genomes – however, the scripts could potentially be modified to do this.
You will need to install the ForkManager.pm perl module if you do not already have it, as well as Path-Class-0.16. Both are available from CPAN.
The alignment section of this pipeline is dependant upon the mapreads tool, available from: https://www.wendangku.net/doc/2f17911552.html,/gf/project/mapreads/.
Installation
Simply unzip the tarball and add the path of the installation directory to @INC using the command:
export PERL5LIB=${PERL5LIB}:/[full path]/RNA-MATEv1.0/perl/
This can be added to the ~/.bash_profile or ~/.profile files for automatic loading, or it can be added to the default profile for all users. The script mask_schemas_mapreads.pl should be placed in the same folder as the mapreads program.
Scripts
Master script: rna-mate-v1.01.pl
This script will call the required modules in order. There is only one user-defined parameter for this script, which allows you to specify a configuration file containing all the required parameters for the entire mapping pipeline.
To run this script, use the following command:
[path]/rna-mate-v1.01.pl –c [configuration file]
Configuration file
The configuration file is a text file containing all the required parameters to run RNA-MATE. In this file, directory listings must end with a “/”, there must be no other punctuation at the end of the lines, and there should be no empty lines in this file. An example of the configuration file is given below:
max_length_tag=35
tag_length=35,30
num_mismatch=3
mask=11111111111111111111111111111111111
max_multimatch=10
expect_strand=+
rescue_window=10
exp_name=tag_20000_F3
chromosomes=chrM,chr2
chr_path=/data/matching/hg18_fasta/
junction=/data/matching/libraries/hg18_junctions_best_quality.fas
ta.cat
junction_index=/data/matching/libraries/hg18_junctions_best_quali
ty.fasta.index
output_root=/data/cxu/
output_dir=/data/cxu/tag_20000_F3/
raw_qual=/data/raw/tag20000.qual
raw_csfasta=/data/raw/tag20000.csfasta
status_out=/data/cxu/tag_20000_F3/total_rep2/map_status.out
raw_csfasta=/data/cxu/tag_20000_F3/total_rep2.csfasta
email=bob@https://www.wendangku.net/doc/2f17911552.html,
run_rescue=false
num_parallel_rescue=4
quality_check=true
script_chr_start=/data/matching/chr_start.pl
script_chr_wig=/data/matching/chr_wig.pl
f2m=/data/matching/f2m.pl
mapreads=/data/matching/mapreads
rescue=/data/matching/chr_rescueSOLiD.py
master_script=/data/matching/rna-mate-v1.01.pl
Configuration options
max_length_tag=35
This parameter defines the longest length of the tags contained in the csfasta file.
tag_length=35,30
This parameter defines the lengths at which matching will occur recursively. Lengths should be in multiples of 5nt. The minimum recommended length for transcriptome data is 30nt although this will depend on how many mismatches you allow. For up to 3 mismatches to a mammalian genome, the minimum length should be 30nt because even though you still get a large proportion of single mapping tags at this length, the specificity is very poor. If you allow up to 1 mismatch, then you can still achieve good sensitivity at 25nt. For other smaller genomes, the minimum acceptable matches should be determined on a case by case basis.
num_mismatch=3
The number of mismatches permissible between the tag and the reference sequence.
Currently this pipeline does not support the “valid adjacent error” feature of mapreads.
This upgrade is planned for a future release.
NOTE: Mapping schemas must be available to do the mapping at the specified length and number of mistmatches, or else the pipeline will fail. ie. in this example, the schemas required are:
schema_35_3
schema_30_3
Mapping schemas are available from https://www.wendangku.net/doc/2f17911552.html,/.
mask=11111111111111111111111111111111111
This setting allows you to ignore particular bases in the tag when computing the number of mismatches. 1 = consider this base, 0 = do not consider this base. The length of the mask should equal the length of the longest tags.
max_multimatch=10
Defines the maximum number of positions to be reported for multi-mapping tags. The higher this number, the more disk space is required to store the data, and the slower the program will run. Recommended size for most applications is 10.
expect_strand=+
This defines the strandedness of the data. For example, libraries made with the SREK protocol or other direct ligation protocols will have tags that are sequenced in the sense (+) strand relative to the expressed gene. Libraries made with the SQRL protocol will have tags that are sequenced in the antisense (-) relative to the expressed gene.
rescue_window=10
This parameter defines the window size used for multi-map tag rescue. The recommended setting for shotgun sequencing data is 10, whereas the recommended setting for CAGE and other disparate data sets is 100.
exp_name=tag_20000_F3
Set the experiment name with this parameter.
chromosomes=chrM,chr2
Defines the names of the chromosomes to map against. The filenames are expected to be: [chromosome_name].fa
chr_path=/data/matching/hg18_fasta/
The full path of the chromosome fasta files.
junction=/data/matching/libraries/hg18_junction.fasta.cat
The full path of the junction library against which you can map.
junction_index=/data/matching/libraries/hg18_junction.fasta.index The full path of the junction index file.
output_root=/data/cxu/
output_dir=/data/cxu/tag_20000_F3/
The full paths of the output root and output directories.
raw_qual=/data/raw/tag20000.qual
The full path of the QV file.
raw_csfasta=/data/raw/tag20000.csfasta
The full path of the csfasta file to be matched.
status_out=/data/cxu/tag_20000_F3/total_rep2/map_status.out
Not used in this implementation of RNA-MATE.
email=bob@https://www.wendangku.net/doc/2f17911552.html,
Not used in this implementation of RNA-MATE.
run_rescue=true
This parameter allows you to turn on or off the rescue of multi-mapping tags module. Acceptable values are “true” or “false”. True = run multi-map rescue, false = do not run multi-map rescue.
NOTE: multi-map rescue can be a very memory intensive process. Rescue for a single chromosome of a transcriptome dataset with > 100 million mappable tags can consume more than 20 Gb of resident memory. The amount of memory used will depend on the size of the data set, the number of multi-mapping tags versus single mapping tags, the underlying complexity of the data set, and the number of positions of each tag to be rescued.
num_parallel_rescue=4
This parameter allows you to adjust the number of rescue jobs that are run in parallel. The settings chosen here will depend on the amount of memory available on your system, the number of CPUs available, and the amount of memory consumed by the rescue (see the note above regarding multi-mapping tag rescue and memory usage).
quality_check=true
This parameter allows you to turn on or off the quality checking of tags module. Acceptable values are “true” or “false”. True = run quality check, False = do not run quality check.
script_chr_start=/data/matching/chr_start.pl
script_chr_wig=/data/matching/chr_wig.pl
f2m=/data/matching/f2m.pl
mapreads=/data/matching/mapreads
rescue=/data/matching/chr_rescueSOLiD.py
master_script=/data/matching/rna-mate-v1.01.pl
These parameters define the full path showing the location of the various scripts required to run RNA-MATE.
tools_mapping.pm
This module includes four functions: creating log files; checking whether the jobs on the queue are finished; creating new csfasta files”; and chopping tags for recursive mapping. tag_quality.pm
This module checks tag quality, making sure that each tag contains less then five nucleotides where the QV value for that basecall is less than 10 (PHRED scale). Currently this threshold is hardcoded. Future implementations will allow user defined values at this point.
mapping.pm
This module automatically arranges genome and junction mapping for different tag lengths.
single_selection.pm
This module attempts to select a single mapping position for each tag based on the mapping results at the highest stringency. For example, if a tag maps once with zero mismatches, and 3 times with one mismatch, then the tag is recorded as a single mapping tag at a stringency of zero mismatches.
new_rescue.pm
This module use new version rescue program which can parallel rescue for each chromosome and use less memory.
wiggle_plot.pm:
This module creates strand specific wiggle plot (or bedGraph) files for visualization in the UCSC genome browser. This module also creates “start site plots” which facilitates tag counting applications.
UCSC_junction.pm.
This module creates BED files for displaying exon-junction usage in the UCSC genome browser.
tag_20000_F3.log
This is an example of the output log file for the tag_20000_F3 experiment. Each status output includes two lines, the first line is system time and the second is what the system doing at that time.
Mon Nov 17 13:45:41 2008
[PROCESS]: Welcome to our mapping strategy system!
Mon Nov 17 13:45:42 2008
[SUCCESS]: Created csfasta file for different tag length, in which tag quality is checked!
Mon Nov 17 13:45:42 2008
[PROCESS]: mapping to all chromosomes
Mon Nov 17 13:45:42 2008
waiting for queue
Mon Nov 17 14:04:42 2008
[SUCCESS]: mapped to all chromosomes
Mon Nov 17 14:04:42 2008
[PROCESS]: collating genome mers:35
Mon Nov 17 14:04:42 2008
[SUCCESS]: collated genome mers:35
Mon Nov 17 14:04:42 2008
[PROCESS]: mapping to junction
Mon Nov 17 14:10:42 2008
[SUCCESS]: mapped to junction
Mon Nov 17 14:10:42 2008
[PROCESS]: collating junction mers:35
Mon Nov 17 14:10:42 2008
[SUCCESS]: collated junction mers:35
Mon Nov 17 14:10:42 2008
[PRPCESS]: chopping tag
Mon Nov 17 14:10:42 2008
[PROCESS]: mapping to all chromosomes
Mon Nov 17 14:10:42 2008
waiting for queue
Mon Nov 17 14:28:42 2008
[SUCCESS]: mapped to all chromosomes
Mon Nov 17 14:28:42 2008
[PROCESS]: collating genome mers:30
Mon Nov 17 14:28:42 2008
[SUCCESS]: collated genome mers:30
Mon Nov 17 14:28:42 2008
[PROCESS]: mapping to junction
Mon Nov 17 14:33:42 2008
[SUCCESS]: mapped to junction
Mon Nov 17 14:33:42 2008
[PROCESS]: collating junction mers:30
Mon Nov 17 14:33:42 2008
[SUCCESS]: collated junction mers:30
Mon Nov 17 14:33:42 2008
[PROCESS]: rescue multi mapped tags
Mon Nov 17 14:33:42 2008
[SUCCESS]: rescue tags are done!
Mon Nov 17 14:33:42 2008
[PROCESS]: prepare data for wiggle plot...
Mon Nov 17 14:33:42 2008
[SUCCESS]: prepared data file for parallel wig plot.
Mon Nov 17 14:33:42 2008
[SUCCEED]: all done! enjoy the data!
Module inputs and outputs
This section details the input and output files generated from each of the modules in this pipeline for the tag_20000_F3 experiment with the configuration file as above.
tools_mapping.pm
sub create_csfata
input:
output:
tag_20000_F3.mers30.unique.csfasta
mapping.pm (35mers)
sub genomic_mapping (35)
input:
output:
chrM.tag_20000_F3.mers35.unique.csfasta.ma.35.3
(35)
collate_genomic_matches
sub
input:
output:
tag_20000_F3.mers35.genomic.non_matched
sub junction_mapping (35)
input:
output:
ic.non_matched.ma.35.3
(35)
collate_junction_matches
sub
input:
ic.non_matched.ma.35.3
output:
tools_mapping.pm
sub chop_tag
input:
tag_20000_F3.mers35.junction.non_matched
tag_20000_F3.mers30.unique.csfasta
output:
tag_20000_F3.mers30.unique.csfasta
mapping.pm (30mers)
sub genomic_mapping (30)
input:
tag_20000_F3.mers30.unique.csfasta
output:
chr2.tag_20000_F3.mers30.unique.csfasta.ma.30.3
chrM.tag_20000_F3.mers30.unique.csfasta.ma.30.3
(30)
collate_genomic_matches
sub
input:
chr*30.*.3
output:
tag_20000_F3.mers30.genomic.collated
tag_20000_F3.mers30.genomic.non_matched
(30)
junction_mapping
sub
input:
tag_20000_F3.mers30.genomic.non_matched
output:
hg18_junctions_best_quality.tag_20000_F3.mers30.genom
ic.non_matched.ma.30.3
(30)
collate_junction_matches
sub
input:
hg18_junctions_best_quality.tag_20000_F3.mers30.genom
ic.non_matched.ma.30.3
output:
tag_20000_F3.mers30.junction.non_matched
single_select.pm
input:
tag_20000_F3.mers30.genomic.collated
output:
tag_20000_F3.mers35.genomic.stats
chr2.tag_20000_F3.for_wig.negative
chr2.tag_20000_F3.for_wig.positive
chrM.tag_20000_F3.for_wig.negative
chrM.tag_20000_F3.for_wig.positive
wiggle_plot.pm
sub paralle_wig_fork
input:
chr2.tag_20000_F3.for_wig.positive
chrM.tag_20000_F3.for_wig.negative
chrM.tag_20000_F3.for_wig.positive
output:
chr2.tag_20000_F3.for_wig.negative.sorted
chr2.tag_20000_F3.for_wig.negative.wig
chr2.tag_20000_F3.for_wig.negative.wig.success
chr2.tag_20000_F3.for_wig.positive.sorted
chr2.tag_20000_F3.for_wig.positive.wig
chr2.tag_20000_F3.for_wig.positive.wig.success
chrM.tag_20000_F3.for_wig.negative.sorted
chrM.tag_20000_F3.for_wig.negative.wig
chrM.tag_20000_F3.for_wig.negative.wig.success
chrM.tag_20000_F3.for_wig.positive.sorted
chrM.tag_20000_F3.for_wig.positive.wig
chrM.tag_20000_F3.for_wig.positive.wig.success
sub start_plot_fork
input:
chr2.tag_20000_F3.for_wig.negative
chr2.tag_20000_F3.for_wig.positive
chrM.tag_20000_F3.for_wig.negative
chrM.tag_20000_F3.for_wig.positive
output:
chr2.tag_20000_F3.for_wig.negative.starts
chr2.tag_20000_F3.for_wig.negative.starts.success
chr2.tag_20000_F3.for_wig.positive.starts
chr2.tag_20000_F3.for_wig.positive.starts.success
chrM.tag_20000_F3.for_wig.negative.starts
chrM.tag_20000_F3.for_wig.negative.starts.success
chrM.tag_20000_F3.for_wig.positive.starts
chrM.tag_20000_F3.for_wig.positive.starts.success
tag_20000_F3.negative.starts
tag_20000_F3.positive.starts
sub collect_data
input:
chr2.tag_20000_F3.for_wig.negative.starts
chr2.tag_20000_F3.for_wig.positive.starts
chr2.tag_20000_F3.for_wig.negative.wig
chr2.tag_20000_F3.for_wig.positive.wig
chrM.tag_20000_F3.for_wig.negative.wig
chrM.tag_20000_F3.for_wig.positive.wig
chrM.tag_20000_F3.for_wig.negative.starts
chrM.tag_20000_F3.for_wig.positive.starts
output:
tag_20000_F3.positive.wiggle
UCSC_junction.pm
sub
(35)
single_selection
input:
ic.non_matched.ma.35.3
output:
tag_20000_F3.junction35.positive.stats
tag_20000_F3.junction35.single_map.negative
tag_20000_F3.junction35.single_map.positive
sub search_junctionID (positive file)
input:
tag_20000_F3.junction35.single_map.positive
output:
tag_20000_F3.junction35.single_map.positiveID
sub search_junctionID (negative file)
input:
tag_20000_F3.junction35.single_map.negative
output:
tag_20000_F3.junction35.single_map.negativeID
sub single_selection (30)
input:
hg18_junctions_best_quality.tag_20000_F3.mers30.genom
ic.non_matched.ma.30.3
output:
tag_20000_F3.junction30.negative.stats
tag_20000_F3.junction30.positive.stats
tag_20000_F3.junction30.single_map.negative
tag_20000_F3.junction30.single_map.positive
sub search_junctionID (positive file)
input:
tag_20000_F3.junction30.single_map.positive output:
tag_20000_F3.junction30.single_map.positiveID
sub search_junctionID (negative file)
input:
tag_20000_F3.junction30.single_map.negative
output:
sub create_BED (positive file)
input:
tag_20000_F3.junction35.single_map.positiveID
output:
sub create_BED (negative file)
input:
tag_20000_F3.junction35.single_map.negativeID
output:
Modifying the pipeline to work with other queues
In order to make this program compatible with other queue managers the mapping.pm module will need to be edited. Specifically, lines in the genome_mapping and junction_mapping subroutines that contain:
$comm = "qsub -l walltime=48:00:00,ncpus=2 -o $mysh.out -e $mysh.err $mysh > $mysh.id ";
will need to be replaced with the appropriate job submission commands and parameters that are specific to your system.
Till Bayer (from the MPI for Evolutionary Biology in Germany) has kindly provided instructions on modifying this script to work on SGE systems. The line above should be changed to: $comm = "qsub -l s_rt=48:00:00,s_cpu=2 -o $mysh.out -e $mysh.err $mysh > $mysh.id";
In addition to modifying the lines above, lines 76 and 133 which read:
print OUT $comm;
should be changed to include the “#$/bin/sh” line, and a newline after the actual command needs to be inserted.
Optimizing performance on your cluster
The script as written asks for two CPUs per mapping job. As mapreads is not parallelized, this is an inefficient, but necessary throttle if you are running an NFS file transfer protocol. The entire pipeline (including mapreads) is very I/O intensive, and depending on the setup, users may find that NFS will timeout if too much is asked of it. For systems using less archaic protocols this will not be necessary, and the script can be modified to request a single CPU.