After classifying your samples against the One Codex Database, you may want to know which reads were assigned to a particular taxon. We provide a file that lists the taxon for each read (use the "Read-Level Data" download button on a sample's results page), but you may also want to pull the sequences for just those reads out of your original sequence file for further analysis. The onecodex command line interface can extract these taxon-specific reads. To use it, you will need the following:
The raw reads file, which should be interleaved. When paired-end data is uploaded to the platform as two separate files (R1 and R2), we interleave those files, so our scripts expect interleaved input. Interleaving combines the R1 and R2 files into one file: sequence 1 from the R1 file, followed by sequence 1 from the R2 file, then sequence 2 from the R1 file, and so on. If you are unsure how to interleave your files, feel free to reach out to us.
You’ll need the classification ID for the One Codex Database classification results for that sample. It appears in the URL when you view the classification results for that sample, for example: https://app.onecodex.com/classification/<classification_UUID>
You’ll need the TaxID for the species of interest. That can be found in the Complete Results Table at the bottom of the results page for that sample.
Lastly, you’ll need the onecodex command line interface installed. There are instructions to install that here.
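If your paired-end reads are still in two separate local files, the interleaving described above can be sketched in Python roughly as follows. This is a minimal illustration, not the code the platform runs: the function name `interleave_fastq` and the helper names are made up for this example, and it assumes standard four-line FASTQ records with R1 and R2 listed in the same order.

```python
import gzip
from itertools import islice


def _open_text(path, mode="rt"):
    """Open a plain or gzip-compressed file in text mode (chosen by the .gz suffix)."""
    path = str(path)
    return gzip.open(path, mode) if path.endswith(".gz") else open(path, mode)


def _records(handle):
    """Yield FASTQ records as lists of four lines (assumes four-line records)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("Truncated FASTQ record")
        # Normalize line endings so a missing final newline cannot join two records.
        yield [line.rstrip("\r\n") + "\n" for line in record]


def interleave_fastq(r1_path, r2_path, out_path):
    """Write read 1 of R1, read 1 of R2, read 2 of R1, ... and return the pair count."""
    pairs = 0
    with _open_text(r1_path) as r1, _open_text(r2_path) as r2, \
            _open_text(out_path, "wt") as out:
        r2_records = _records(r2)
        for rec1 in _records(r1):
            rec2 = next(r2_records, None)
            if rec2 is None:
                raise ValueError("R2 has fewer reads than R1")
            out.writelines(rec1)
            out.writelines(rec2)
            pairs += 1
        if next(r2_records, None) is not None:
            raise ValueError("R2 has more reads than R1")
    return pairs
```

For example, `interleave_fastq("sample_R1.fastq", "sample_R2.fastq", "sample_interleaved.fastq")` would write the combined file and return the number of read pairs. The file names here are placeholders.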
Once you have those, you can create a file of the taxon-specific reads using the following command:
onecodex scripts subset_reads -t <TaxID> --include-lowconf <classification_id> <fastq_filename>
Note that the --include-lowconf flag is optional. When it is present, the output also includes reads that were classified with low confidence. You can choose to exclude low-confidence reads by removing this flag from the command.
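If you run this for many samples, it can help to assemble the command programmatically. The sketch below pulls the classification ID out of a results URL of the form shown above and builds the argument list in the same order as the command. The helper names `classification_id_from_url` and `build_subset_command` are hypothetical and are not part of the onecodex package, and the ID in the usage example is a made-up placeholder.

```python
from urllib.parse import urlparse


def classification_id_from_url(url):
    """Return the path segment that follows 'classification' in a results URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    try:
        return parts[parts.index("classification") + 1]
    except (ValueError, IndexError):
        raise ValueError(f"No classification ID found in URL: {url}") from None


def build_subset_command(tax_id, classification_id, fastq_filename,
                         include_lowconf=True):
    """Build the argument list for the subset_reads command shown above."""
    cmd = ["onecodex", "scripts", "subset_reads", "-t", str(tax_id)]
    if include_lowconf:
        # Optional flag: also return reads classified with low confidence.
        cmd.append("--include-lowconf")
    cmd += [classification_id, fastq_filename]
    return cmd


if __name__ == "__main__":
    # Placeholder URL; substitute the URL of your own results page.
    url = "https://app.onecodex.com/classification/0123456789abcdef"
    cmd = build_subset_command(562, classification_id_from_url(url),
                               "sample_interleaved.fastq")
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the onecodex command line interface is installed. TaxID 562 (Escherichia coli) is used only as an example.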