Workflows - Clustering

CAMERA has developed a number of clustering workflows that generate non-redundant DNA and/or protein sequences. Each workflow accepts a FASTA file as input and produces four output files: the non-redundant sequences; a cluster file listing each cluster and the sequences it contains; an image file plotting the cluster distribution by cluster size and/or percentage of sequences; and a cluster table describing the read-to-cluster relationship.
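The relationship between the cluster file and the cluster table can be illustrated with a short Python sketch. It assumes the cluster file follows the usual CD-HIT .clstr layout (a ">Cluster N" header followed by one line per member, with "*" marking the representative). The exact columns of CAMERA's cluster table are not described on this page, so the function names, the tuple layout and the sample sequence IDs below are hypothetical.

```python
import re
from collections import Counter

# Member lines look like "0\t120aa, >seqA... *" (representative)
# or "1\t118aa, >seqB... at 95.00%" (cd-hit-est uses "nt" instead of "aa").
MEMBER = re.compile(r"^\d+\s+\d+(?:aa|nt),\s+>(.+?)\.\.\.\s+(\*|at\s+.+)$")


def parse_clstr(lines):
    """Yield (cluster_id, sequence_id, is_representative) per member line."""
    cluster_id = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            cluster_id = int(line.split()[1])
            continue
        match = MEMBER.match(line)
        if match and cluster_id is not None:
            yield cluster_id, match.group(1), match.group(2) == "*"


def size_distribution(rows):
    """Map cluster size -> number of clusters of that size."""
    members_per_cluster = Counter(cid for cid, _, _ in rows)
    return Counter(members_per_cluster.values())


# Hypothetical sample data, for illustration only.
sample = """>Cluster 0
0\t120aa, >seqA... *
1\t118aa, >seqB... at 95.00%
>Cluster 1
0\t90aa, >seqC... *
""".splitlines()

rows = list(parse_clstr(sample))   # read-to-cluster table
dist = size_distribution(rows)     # data behind a cluster-size plot
```

Here rows plays the role of the read-to-cluster table, and dist holds the counts that a cluster-size distribution image would be drawn from.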

DNA Clustering:
Use cd-hit-est to cluster DNA sequences.
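Outside the workflow, a comparable DNA clustering run could be assembled as in the sketch below. This is not CAMERA's actual invocation: the function names, the file names and the 0.95 cutoff are assumptions (this page does not state the DNA cutoff), and the word sizes follow the ranges suggested in the CD-HIT user guide, which should be checked against the installed version. Only the standard cd-hit-est options -i, -o, -c and -n are used.

```python
def est_word_size(identity):
    """Pick a cd-hit-est word size (-n) for a given identity cutoff.

    Ranges as suggested in the CD-HIT user guide (assumption: verify
    against the version you have installed).
    """
    if identity >= 0.95:
        return 10
    if identity >= 0.90:
        return 8
    if identity >= 0.88:
        return 7
    if identity >= 0.85:
        return 6
    if identity >= 0.80:
        return 5
    if identity >= 0.75:
        return 4
    raise ValueError("identity cutoff below 0.75 is not supported here")


def cd_hit_est_command(fasta_in, out_prefix, identity):
    """Build the argument list for one cd-hit-est run (does not execute it)."""
    return [
        "cd-hit-est",
        "-i", fasta_in,        # input FASTA file
        "-o", out_prefix,      # output path for non-redundant sequences
        "-c", str(identity),   # sequence identity cutoff
        "-n", str(est_word_size(identity)),
    ]


# Hypothetical file names and cutoff, for illustration only.
cmd = cd_hit_est_command("reads.fasta", "reads_nr", 0.95)
```

Passing cmd to subprocess.run(cmd, check=True) would run the tool if cd-hit-est is on the PATH. CD-HIT writes the representative sequences to the -o path and the clusters to the same path with a .clstr suffix.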


Protein Clustering:
Use cd-hit to cluster protein sequences in a single step, with a default sequence identity cutoff of 0.9.


Hierarchical protein clustering:
Use cd-hit to cluster protein sequences in two steps. First, cluster the input with a default sequence identity cutoff of 0.9. Second, run cd-hit again on the clustering results of the first step, with a default sequence identity cutoff of 0.6.
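The two steps can be sketched as a pair of cd-hit command lines, where the second run takes the representative sequences written by the first as its input. This is a minimal illustration under stated assumptions, not the workflow's real implementation: the function names, file names and output suffixes are hypothetical, and the word sizes (5 for a 0.9 cutoff, 4 for a 0.6 cutoff) follow the CD-HIT user guide's recommendations.

```python
def cd_hit_command(fasta_in, out_prefix, identity, word_size):
    """Build the argument list for one cd-hit run (does not execute it)."""
    return [
        "cd-hit",
        "-i", fasta_in,          # input FASTA file
        "-o", out_prefix,        # output path for representative sequences
        "-c", str(identity),     # sequence identity cutoff
        "-n", str(word_size),    # word size matching the cutoff
    ]


def hierarchical_commands(fasta_in, out_prefix):
    """Return the two cd-hit commands for a 0.9 then 0.6 clustering."""
    step1_out = f"{out_prefix}_90"   # hypothetical naming scheme
    step2_out = f"{out_prefix}_60"
    return [
        cd_hit_command(fasta_in, step1_out, 0.9, 5),
        # Step 2 clusters the representatives produced by step 1.
        cd_hit_command(step1_out, step2_out, 0.6, 4),
    ]


# Hypothetical file names, for illustration only.
commands = hierarchical_commands("proteins.fasta", "proteins")
```

Running each list in order with subprocess.run(cmd, check=True) would reproduce the two-step scheme locally, provided cd-hit is installed. The single-step protein workflow corresponds to the first command alone.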

Workflow Components:


The clustering workflows are composed of several components, including:

  • cd-hit
  • cd-hit-est

Notes:

  • Currently the clustering workflows do not have a graphical output. Please download the results to your local computer for viewing.

Resources: