Q: How do I get the reads that are classified to a specific taxon?
Written by Denise Lynch
Updated over a month ago

After classifying your samples using the One Codex Database, you may be curious which reads were classified to a particular taxon. We provide a file that lists the taxon for each read (the "Read-Level Data" download button on a sample's results page), but you may want to pull the sequences for just those reads out of your original sequence file for further analysis. You can extract taxon-specific reads either through the web browser or with our command line tool.

Download via web browser

You will find a Complete Results Table at the bottom of your classification results page. Each row has a download button at its far right.

Clicking this button for your chosen taxon will open a pop-up asking whether you would like to include child taxa. Child taxa are more-specific taxa that fall under your chosen taxon. For instance, if you pick a genus-level taxon, including child taxa will include all of the species, subspecies, strains, etc. that fall under that genus.

Once you confirm, we will begin to extract the reads that map to your chosen taxon. You will receive an email with a link to download your new subsetted fasta/fastq file.

If the download button is not available to you, you can use our command line tool to download your reads.

Download via Command Line

There are a few things you need in order to do that.

  • The raw reads file. For paired-end data, this must be a single interleaved file: when separate R1 and R2 files are uploaded to the platform, we interleave them, so our scripts work with interleaved files. In an interleaved file, R1 and R2 are combined into one file, with sequence 1 from the R1 file followed by sequence 1 from the R2 file, then sequence 2 from the R1 file, and so on. If you are unsure how to interleave your files, feel free to reach out to us.

  • You’ll need the classification ID for the One Codex Database classification results for that sample. It can be found in the URL when viewing the classification results for that sample, which has the form https://app.onecodex.com/classification/<classification_UUID>

  • You’ll need the TaxID for the taxon of interest. It can be found in the Complete Results Table at the bottom of the results page for that sample. Note that you can include/exclude multiple taxa by passing the -t flag multiple times. You can also find the TaxID for any species in our reference list: https://app.onecodex.com/references

  • Lastly, you’ll need the onecodex command line interface installed. There are instructions to install that here.
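If your paired-end reads are still in separate R1 and R2 files, the interleaving described in the first item above can be sketched in shell as follows. This is a minimal illustration, not One Codex's own tooling: the function name interleave_fastq is hypothetical, and it assumes uncompressed FASTQ files with exactly four lines per record and the same reads in the same order in both files.

```shell
# Hypothetical helper: interleave two paired-end FASTQ files.
# Assumes uncompressed input, 4 lines per record, and matching read order.
interleave_fastq() {
  # $1 = R1 file, $2 = R2 file; the interleaved records go to stdout.
  awk -v r2="$2" '{
    print                              # copy the current R1 line
    if (NR % 4 == 0) {                 # an R1 record has just ended
      for (i = 0; i < 4; i++) {        # emit the matching 4-line R2 record
        if ((getline line < r2) > 0) print line
      }
    }
  }' "$1"
}
```

For example, interleave_fastq sample_R1.fastq sample_R2.fastq > sample_interleaved.fastq would write the combined file (the file names here are placeholders). For gzipped files, decompress them first.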

Once you have those, you can create a file of the taxon-specific reads using the following command:

onecodex scripts subset_reads -t <TaxID> --include-lowconf <classification_id> <fastq_filename>

Note that the --include-lowconf flag is optional; when present, the output also includes reads that were classified with low confidence. You can choose to exclude low confidence reads by removing this flag from the command.

If you wish to exclude reads for a given taxon from your results, you can do this by passing the --exclude-reads flag to the command. The resulting fasta/fastq file will then contain all reads except those classified to the indicated TaxID.

Another useful flag, --with-children, allows you to subset to the reads of your taxon of interest and any more-specific taxa that fall under it. For instance, if you want the reads for a genus and all species/strains that belong to it, you can provide the genus-level TaxID together with the --with-children flag to get all of the reads that classify at genus level, as well as reads that classify to species or strains in that genus.
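To make the flag variations above concrete, here is a small dry-run sketch. It only prints the commands so you can check them before running anything. The helper name subset_cmd is hypothetical, the TaxIDs 1279 and 1301 are example NCBI taxonomy IDs (the genera Staphylococcus and Streptococcus) used purely as placeholders, and the classification ID and file name are left as placeholders for your own values.

```shell
# Placeholder values: substitute your own classification ID and reads file.
CLASSIFICATION_ID="<classification_id>"
FASTQ="<fastq_filename>"

# Hypothetical dry-run helper: prints the command instead of running it.
# Remove "echo" to actually execute the command.
subset_cmd() {
  echo onecodex scripts subset_reads "$@" "$CLASSIFICATION_ID" "$FASTQ"
}

# Reads classified to a genus and to any species/strains under it.
subset_cmd -t 1279 --with-children

# Reads for several taxa, passing -t once per TaxID.
subset_cmd -t 1279 -t 1301

# Everything except the reads classified to the given TaxID.
subset_cmd -t 1279 --exclude-reads
```

Each printed line follows the same argument order as the command shown earlier: flags first, then the classification ID, then the reads file.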
