Workflows - Data Preparation
Camera workflows provide several tools to reduce sequencing artifacts that may be present in unassembled raw read data. Each tool addresses problems characteristic of a particular sequencing technology, such as 454 or Illumina.
QC (Quality Control) Filter:
Each base in a read has an associated quality score, “Q”, defined as Q = -10*log10(p), where “p” is the probability that the base call is incorrect. For example, Q=20 corresponds to an error probability of 1 in 100. The average score across a read gives an overall sense of that read's quality. The QC filter takes either FASTA and QUAL files or a FASTQ file as input, calculates the average quality score for each read, keeps the high-quality reads, discards reads shorter than the minimum read length, and generates summary statistics for the input reads.
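The filtering steps described above can be sketched in a few lines of Python. This is only an illustration, not the workflow's actual implementation: the function names, the thresholds (a minimum average score of 20 and a minimum length of 50), and the Phred+33 encoding of FASTQ quality strings are all assumptions made for the example.

```python
import math

def phred_from_error_prob(p):
    """Q = -10 * log10(p), where p is the probability that a base call is wrong."""
    return -10.0 * math.log10(p)

def decode_phred33(quality_string):
    """Decode a FASTQ quality string, assuming the Phred+33 ASCII offset."""
    return [ord(ch) - 33 for ch in quality_string]

def mean_quality(scores):
    """Average quality score over all bases of one read."""
    return sum(scores) / len(scores)

def passes_qc(sequence, scores, min_mean_q=20.0, min_length=50):
    """Keep a read only if it is long enough and its average score is high enough.

    Both thresholds are hypothetical defaults chosen for this sketch.
    """
    if not scores or len(sequence) < min_length:
        return False
    return mean_quality(scores) >= min_mean_q

# Toy input: (name, sequence, per-base quality scores)
reads = [
    ("read1", "ACGT" * 20, [30] * 80),  # long enough, high quality
    ("read2", "ACGT" * 20, [10] * 80),  # long enough, low quality
    ("read3", "ACGT" * 5, [35] * 20),   # high quality, too short
]
kept = [name for name, seq, quals in reads if passes_qc(seq, quals)]
print(kept)
```

Only the first read passes both checks in this toy input; the second fails on average quality and the third on length.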
454 Duplicate Clustering:
This workflow identifies duplicates among 454 reads, including both exact duplicates and near-identical duplicates. In metagenomic samples these duplicates are mostly sequencing artifacts and should therefore be removed. Most duplicates in transcriptomic reads, however, are not artificial, so running this workflow on transcriptomic datasets is not recommended. Duplicates are either exactly identical or meet the following criteria:
- They start at the same position.
- Their lengths can differ, but the shorter read must be fully aligned with the longer one (the seed).
- They can have at most 4% mismatches (insertions, deletions, and substitutions) *
- Only 1 base is allowed per insertion or deletion *
* These parameters can be adjusted by clicking on the “Advanced Parameters” tab on the workflow submission form.
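As a rough illustration of the criteria above, the following Python sketch tests whether one read is a near-identical duplicate of another. It is not how cd-hit-454 is implemented. The function names are hypothetical, the 4% limit is assumed to be measured against the length of the shorter read, and the sketch counts every inserted or deleted base as one mismatch without enforcing the one-base-per-insertion-or-deletion rule.

```python
def prefix_edit_distance(short, long):
    """Fewest substitutions, insertions, and deletions needed to turn
    `short` into some prefix of `long`. Both reads are anchored at their
    first base, which models the "same start position" criterion."""
    prev = list(range(len(long) + 1))  # empty read vs. long[:j] costs j edits
    for i, a in enumerate(short, start=1):
        curr = [i]  # short[:i] vs. empty prefix costs i edits
        for j, b in enumerate(long, start=1):
            curr.append(min(
                prev[j] + 1,             # base present only in `short`
                curr[j - 1] + 1,         # base present only in `long`
                prev[j - 1] + (a != b),  # match or substitution
            ))
        prev = curr
    # The shorter read must be used in full; the seed may extend past it.
    return min(prev)

def is_duplicate(read, seed, max_mismatch_fraction=0.04):
    """True if the two reads are exact or near-identical duplicates
    under the simplified rules of this sketch."""
    short, long = sorted((read, seed), key=len)
    if long.startswith(short):
        return True  # exact duplicate or exact prefix of the seed
    allowed = int(max_mismatch_fraction * len(short))  # assumption: relative to shorter read
    return prefix_edit_distance(short, long) <= allowed

seed = "ACGT" * 25                               # 100-base seed
exact_prefix = seed[:50]
one_substitution = seed[:10] + "A" + seed[11:50]  # G replaced by A at position 10
one_deletion = seed[:20] + seed[21:51]            # one base missing
unrelated = "T" * 50

for candidate in (exact_prefix, one_substitution, one_deletion, unrelated):
    print(is_duplicate(candidate, seed))
```

With a 50-base read, the 4% limit allows two mismatches, so the reads with a single substitution or deletion are reported as duplicates of the seed, while the unrelated read is not.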
Workflow Components:
The Data Preparation workflows are composed of several components, including:
- cd-hit-454: Identifies natural and artificial duplicates from pyrosequencing reads.
Notes:
- Currently the data preparation workflows do not have a graphical output. Please download the results to your local computer for viewing.
