Workflows - Clustering
CAMERA provides several clustering workflows that generate non-redundant sets of DNA and/or protein sequences. Each workflow accepts a FASTA file as input and produces four output files: the non-redundant sequences; a cluster file listing each cluster and the sequences it contains; an image file showing the distribution of clusters by cluster size and/or percentage of sequences; and a cluster table describing the read-to-cluster relationship.
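As an illustration of how the cluster file relates to the cluster table, the sketch below reads a cluster listing and derives a read-to-cluster table plus a cluster-size distribution. It assumes the cluster file follows the standard cd-hit ".clstr" layout (a ">Cluster N" header followed by one member line per sequence, with the representative marked by "*"); the exact files CAMERA produces may differ. The function names and the sample records are hypothetical.

```python
import re
from collections import Counter

# Matches the sequence name between ">" and "..." on a member line.
MEMBER_RE = re.compile(r">(.+?)\.\.\.")

def parse_clstr(lines):
    """Yield (cluster_id, sequence_id, is_representative) per member line."""
    cluster_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 12"
            cluster_id = int(line.split()[1])
            continue
        match = MEMBER_RE.search(line)
        if match is None or cluster_id is None:
            continue  # skip anything that is not a member line
        yield cluster_id, match.group(1), line.endswith("*")

def cluster_size_distribution(rows):
    """Map cluster size -> number of clusters of that size."""
    sizes = Counter(cid for cid, _, _ in rows)
    return dict(sorted(Counter(sizes.values()).items()))

# Hypothetical sample in .clstr layout (not real CAMERA output).
SAMPLE = [
    ">Cluster 0",
    "0\t120aa, >seqA... *",
    "1\t118aa, >seqB... at 95.00%",
    ">Cluster 1",
    "0\t90aa, >seqC... *",
]

table = list(parse_clstr(SAMPLE))
distribution = cluster_size_distribution(table)
```

Here `table` holds one row per sequence (the read-to-cluster relationship), and `distribution` counts how many clusters have each size, which is the kind of summary the image file plots.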
DNA Clustering:
Use cd-hit-est to cluster DNA sequences.
Protein Clustering:
Use cd-hit (default sequence identity cutoff = 0.9) to cluster protein sequences in a single step.
Hierarchical protein clustering:
Use cd-hit to cluster protein sequences in two steps. First, cluster the input sequences at the default sequence identity cutoff of 0.9. Second, run cd-hit again on the results of the first step, this time at a default sequence identity cutoff of 0.6.
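The two-step procedure can be sketched as a pair of cd-hit command lines, where the second run takes the representative sequences written by the first run as its input. This is a minimal sketch, not the workflow's actual implementation: the options -i, -o, -c and -n are standard cd-hit options, the word sizes (5 for a cutoff of 0.9, 4 for a cutoff of 0.6) follow the cd-hit user guide's recommendations, and the file names and function names are hypothetical. The code only builds the commands and does not execute them.

```python
import shlex

def cd_hit_command(fasta_in, out_prefix, identity, word_size, program="cd-hit"):
    """Build the argument list for one cd-hit run (nothing is executed)."""
    return [
        program,
        "-i", fasta_in,        # input FASTA file
        "-o", out_prefix,      # representative sequences; clusters go to <out_prefix>.clstr
        "-c", str(identity),   # sequence identity cutoff
        "-n", str(word_size),  # word size suited to the cutoff
    ]

def hierarchical_commands(fasta_in, prefix="clusters"):
    """Return the two commands for 0.9 clustering followed by 0.6 clustering."""
    first_out = f"{prefix}_90"
    second_out = f"{prefix}_60"
    return [
        cd_hit_command(fasta_in, first_out, 0.9, 5),
        # The second step clusters the representatives from the first step.
        cd_hit_command(first_out, second_out, 0.6, 4),
    ]

if __name__ == "__main__":
    for cmd in hierarchical_commands("proteins.fasta"):
        print(shlex.join(cmd))
```

To run the commands locally, each list could be passed to `subprocess.run(cmd, check=True)`, assuming cd-hit is installed and on the PATH.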
Workflow Components:
The clustering workflows are composed of several components, including:
- cd-hit
- cd-hit-est
Notes:
- Currently the clustering workflows do not display their output graphically. Please download the results to your local computer for viewing.
