What it does
Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.
Cleaning your data in this way is often required. Reads from small-RNA sequencing, such as microRNA or CRISPR data, contain the 3’ sequencing adapter because the read is longer than the molecule that is sequenced. Poly-A tails are useful for pulling out RNA from your sample, but often you don’t want them in your reads.
Cutadapt helps with these trimming tasks by finding the adapter or primer sequences in an error-tolerant way. It can also modify and filter reads in various ways. Cutadapt searches for the adapter in all reads and removes it when it finds it. Unless you use a filtering option, all reads that were present in the input file will also be present in the output file, some of them trimmed, some of them not. Even reads that were trimmed entirely (because the adapter was found in the very beginning) are output. All of this can be changed with options in the tool form above.
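The default behaviour described above can be illustrated with a minimal Python sketch. It is not Cutadapt's implementation: it uses exact string matching instead of error-tolerant alignment, and the function name `trim_3prime_adapter` and the example reads are made up for illustration.

```python
def trim_3prime_adapter(read, adapter):
    """Remove the first exact occurrence of `adapter` and everything after it.

    Simplification: real adapter search tolerates errors and partial matches.
    """
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]


ADAPTER = "AACCGGTT"  # hypothetical adapter sequence
reads = [
    "ACGTACGT" + ADAPTER + "TTT",  # adapter in the middle: read is shortened
    "ACGTACGTACGT",                # no adapter: read is left unchanged
    ADAPTER + "ACGT",              # adapter at the very beginning: read becomes empty
]

# Without a filtering option, every input read yields one output read,
# including the one that was trimmed to zero length.
trimmed = [trim_3prime_adapter(r, ADAPTER) for r in reads]
print(trimmed)  # ['ACGTACGT', 'ACGTACGTACGT', '']
```

The empty third read is the case that a filtering option such as a minimum length would remove.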
See the complete Cutadapt documentation for additional details.
If you use Cutadapt, please cite Martin, 2011 under Citations below.
Accepted input formats for the tool are:
To trim an adapter, enter the adapter sequence as plain text or supply it in a FASTA file, e.g. AACCGGTT. Add the characters $, ^ or ... to the sequence if the adapter is anchored or linked.
Option: Sequence
- 3’ (End) Adapter: ADAPTER
- Anchored 3’ Adapter: ADAPTER$
- 5’ (Front) Adapter: ADAPTER
- Anchored 5’ Adapter: ^ADAPTER
- 5’ or 3’ (Both possible): ADAPTER
- Linked Adapter - 3' (End) only: ADAPTER1...ADAPTER2
- Non-anchored Linked Adapter - 5' (Front) only: ADAPTER1...ADAPTER2
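As a rough illustration of the notation, the following Python sketch classifies an adapter string by the special characters it contains. The helper `describe_adapter` is hypothetical and only inspects the text; whether an unanchored sequence is treated as a 3' or 5' adapter depends on the option it is entered under, not on the string itself.

```python
def describe_adapter(spec):
    """Classify an adapter specification by its notation (illustrative helper)."""
    if "..." in spec:
        # Linked adapter: two sequences separated by three dots.
        first, second = spec.split("...", 1)
        return {"kind": "linked", "sequences": [first, second]}
    return {
        "kind": "single",
        "sequences": [spec.strip("^$")],         # sequence without anchor characters
        "anchored_5prime": spec.startswith("^"),  # ^ADAPTER
        "anchored_3prime": spec.endswith("$"),    # ADAPTER$
    }


for spec in ["AACCGGTT", "^AACCGGTT", "AACCGGTT$", "AACC...GGTT"]:
    print(spec, describe_adapter(spec))
```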
Below is an illustration of the allowed adapter locations relative to the read and depending on the adapter type:
Example: Illumina TruSeq Adapters
If you have reads containing Illumina TruSeq adapters, for example, follow these steps.
For Single-end reads as well as the first reads of Paired-end data:
Read 1
In the 3' (End) Adapters option above, insert A + the “TruSeq Indexed Adapter” prefix that is common to all Indexed Adapter sequences, e.g., insert:
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
For the second reads of Paired-end data:
Read 2
In the 3' (End) Adapters option above, insert the reverse complement of the “TruSeq Universal Adapter”:
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
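If you start from an adapter sequence as printed in a vendor document, the reverse complement can be computed with a few lines of Python. This is a minimal sketch; the function name `reverse_complement` is our own. Applied to the Read 2 sequence above, it should give back the forward sequence, which you can compare against the TruSeq Universal Adapter in your reference document.

```python
# Translation table for the four DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# Sequence entered for Read 2 in the example above.
read2_adapter = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT"

forward = reverse_complement(read2_adapter)
print(forward)

# Taking the reverse complement twice returns the original sequence.
assert reverse_complement(forward) == read2_adapter
```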
The adapter sequences can be found in the document Illumina TruSeq Adapters De-Mystified.
Paired Adapters
Normally, the tool looks for adapters on R1 and R2 reads independently. That is, the best matching R1 adapter of each type (3' End, 5' End, Anywhere) is removed from R1 and the best matching R2 adapter of each type is removed from R2.
To change this, you can use the Pairwise adapter search (--pair-adapters) option, which causes each R1 adapter to be paired up with its corresponding R2 adapter. The first R1 adapter of a given type that you specify will be paired up with the first R2 adapter of that type, and so on. The adapters are then always removed in pairs from a read pair.
For example, if you specify the following two 3'-end adapters for the R1 reads:
and these two 3'-end adapters for the R2 reads:
then, with this option enabled, the tool will trim a pair of reads only if:
Two limitations exist in this mode:
This mode is useful, for example, for demultiplexing Illumina unique dual indices (UDIs).
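The pairing rule can be sketched in Python as follows. This is only an illustration under simplifying assumptions: it uses exact substring matching rather than Cutadapt's error-tolerant search, and the function `find_adapter_pair` and all sequences are hypothetical.

```python
def find_adapter_pair(r1, r2, r1_adapters, r2_adapters):
    """Return the index of the first adapter pair found in both reads, or None.

    The i-th R1 adapter is paired with the i-th R2 adapter, mirroring the
    order in which the adapters were specified.
    """
    for i, (a1, a2) in enumerate(zip(r1_adapters, r2_adapters)):
        if a1 in r1 and a2 in r2:
            return i
    return None


# Hypothetical adapter sequences, two per read.
r1_adapters = ["AAAAACCCCC", "GGGGGTTTTT"]
r2_adapters = ["ACACACACAC", "GTGTGTGTGT"]

# First R1 adapter with first R2 adapter: a valid pair.
print(find_adapter_pair("TTGCA" + r1_adapters[0], "CATGA" + r2_adapters[0],
                        r1_adapters, r2_adapters))  # 0

# First R1 adapter with second R2 adapter: not a pair, so nothing is trimmed.
print(find_adapter_pair("TTGCA" + r1_adapters[0], "CATGA" + r2_adapters[1],
                        r1_adapters, r2_adapters))  # None
```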
Optionally, under Output Options you can choose to output
- Report
- Info file
Report
Cutadapt can output per-adapter statistics if you choose to generate the report under Output Options.
Example:
    This is cutadapt 3.4 with Python 3.9.2
    Command line parameters: -j=1 -a AGATCGGAAGAGC -A AGATCGGAAGAGC --output=out1.fq.gz --paired-output=out2.fq.gz --error-rate=0.1 --times=1 --overlap=3 --action=trim --minimum-length=30:40 --pair-filter=both --cut=0 bwa-mem-fastq1_assimetric_fq_gz.fq.gz bwa-mem-fastq2_assimetric_fq_gz.fq.gz
    Processing reads on 1 core in paired-end mode ...
    Finished in 0.01 s (129 µs/read; 0.46 M reads/minute).

    === Summary ===

    Total read pairs processed:                 99
      Read 1 with adapter:                       2 (2.0%)
      Read 2 with adapter:                       4 (4.0%)
    Pairs that were too short:                   3 (3.0%)
    Pairs written (passing filters):            96 (97.0%)

    Total basepairs processed:        48,291 bp
      Read 1:        24,147 bp
      Read 2:        24,144 bp
    Total written (filtered):         48,171 bp (99.8%)
      Read 1:        24,090 bp
      Read 2:        24,081 bp
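If you want to check the summary counts of such a report programmatically, a small Python sketch like the one below may help. It assumes the "label: count" layout of the example report above; the layout may differ in other Cutadapt versions, and the function `parse_summary` is our own.

```python
import re

# Summary lines copied from the example report above.
report = """\
Total read pairs processed:                 99
  Read 1 with adapter:                       2 (2.0%)
  Read 2 with adapter:                       4 (4.0%)
Pairs that were too short:                   3 (3.0%)
Pairs written (passing filters):            96 (97.0%)
"""


def parse_summary(text):
    """Map each 'label: count' line to an integer count."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"\s*(.+?):\s+([\d,]+)(?:\s|$)", line)
        if m:
            counts[m.group(1)] = int(m.group(2).replace(",", ""))
    return counts


counts = parse_summary(report)
total = counts["Total read pairs processed"]
written = counts["Pairs written (passing filters)"]

# Pairs that were filtered out plus pairs written add up to the total.
assert counts["Pairs that were too short"] + written == total
print(f"{100 * written / total:.1f}% of pairs written")  # 97.0% of pairs written
```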
Info file
The info file contains information about the found adapters. The output is a tab-separated text file. Each line corresponds to one read of the input file.
Columns contain the following data:
- 1st: Read name
- 2nd: Number of errors
- 3rd: 0-based start coordinate of the adapter match
- 4th: 0-based end coordinate of the adapter match
- 5th: Sequence of the read to the left of the adapter match (can be empty)
- 6th: Sequence of the read that was matched to the adapter
- 7th: Sequence of the read to the right of the adapter match (can be empty)
- 8th: Name of the found adapter
- 9th: Quality values corresponding to sequence left of the adapter match (can be empty)
- 10th: Quality values corresponding to sequence matched to the adapter (can be empty)
- 11th: Quality values corresponding to sequence to the right of the adapter (can be empty)
The concatenation of columns 5-7 yields the full read sequence. Column 8 identifies the found adapter. Adapters without a name are numbered starting from 1. Fields 9-11 are empty if quality values are not available. Concatenating them yields the full sequence of quality values.
If no adapter was found, the format is as follows:
- Read name
- The value -1
- The read sequence
- Quality values
When parsing the file, be aware that additional columns may be added in the future. Note also that some fields can be empty, resulting in consecutive tabs within a line.
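A minimal Python sketch of a parser for this format is shown below. The function `parse_info_line` and the two example lines are made up for illustration. The sketch splits on tabs without stripping empty fields and ignores any columns beyond the eleventh, in line with the advice above.

```python
def parse_info_line(line):
    """Parse one line of the info file into a dictionary."""
    # Strip only the newline so that empty trailing fields are preserved.
    fields = line.rstrip("\n").split("\t")
    record = {"read_name": fields[0], "errors": int(fields[1])}

    if record["errors"] == -1:
        # No adapter found: read name, -1, read sequence, quality values.
        record["sequence"] = fields[2]
        record["qualities"] = fields[3] if len(fields) > 3 else ""
        return record

    record.update(
        start=int(fields[2]),
        end=int(fields[3]),
        left=fields[4],
        match=fields[5],
        right=fields[6],
        adapter_name=fields[7],
    )
    # Columns 9-11 hold quality values; pad with empty strings if missing.
    quals = fields[8:11]
    quals += [""] * (3 - len(quals))
    record["qual_left"], record["qual_match"], record["qual_right"] = quals
    return record


# Hypothetical example lines (not real Cutadapt output).
with_adapter = "read1\t0\t8\t16\tACGTACGT\tAACCGGTT\t\t1\tIIIIIIII\tIIIIIIII\t\n"
without_adapter = "read2\t-1\tACGTACGTACGT\tIIIIIIIIIIII\n"

rec = parse_info_line(with_adapter)
print(rec["left"] + rec["match"] + rec["right"])  # ACGTACGTAACCGGTT
print(parse_info_line(without_adapter)["sequence"])  # ACGTACGTACGT
```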
If the --times option is used with a value greater than 1, each read can appear more than once in the info file. There will be one line for each found adapter, all with identical read names. Only for the first of those lines will the concatenation of columns 5-7 be identical to the original read sequence (and accordingly for columns 9-11). For subsequent lines, the sequences shown are the ones that were used in subsequent rounds of adapter trimming, that is, they get successively shorter.
The --rename option expects a template string such as {id} extra_info {adapter_name} as a parameter. It can contain regular text and placeholders that consist of a name enclosed in curly braces ({placeholdername}).
The read name will be set to the template string in which the placeholders are replaced with the actual values relevant for the current read.
The following placeholders are currently available for single-end reads:
- {header} – the full, unchanged header
- {id} – the read ID, that is, the part of the header before the first whitespace
- {comment} – the part of the header after the whitespace following the ID
- {adapter_name} – the name of the adapter that was found in this read, or no_adapter if there was no adapter match. If you use --times to do multiple rounds of adapter matching, this is the name of the last found adapter.
- {match_sequence} – the sequence of the read that matched the adapter (including errors). If there was no adapter match, this is set to an empty string. If you use a linked adapter, this is set to the two matching strings, separated by a comma.
- {cut_prefix} – the prefix removed by the --cut (or -u) option (that is, when used with a positive length argument)
- {cut_suffix} – the suffix removed by the --cut (or -u) option (that is, when used with a negative length argument)
- {rc} – this is replaced with the string rc if the read was reverse complemented. This only applies when reverse complementing was requested.
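How a template is filled in can be sketched in Python as follows. This is not Cutadapt's code: the function `render_name` is hypothetical, it covers only a few of the placeholders listed above, and leaving unknown placeholders untouched is a choice made for this sketch.

```python
import re


def render_name(template, header, adapter_name="no_adapter", match_sequence=""):
    """Fill a rename template for a single-end read (illustrative only)."""
    # The ID is the part of the header before the first whitespace;
    # the comment is whatever follows that whitespace.
    parts = header.split(None, 1)
    values = {
        "header": header,
        "id": parts[0] if parts else "",
        "comment": parts[1] if len(parts) > 1 else "",
        "adapter_name": adapter_name,
        "match_sequence": match_sequence,
    }
    # Replace known placeholders; unknown ones are left as they are.
    return re.sub(
        r"\{(\w+)\}",
        lambda m: values.get(m.group(1), m.group(0)),
        template,
    )


print(render_name("{id} extra_info {adapter_name}", "read1 lane=3",
                  adapter_name="adapter_a"))
# read1 extra_info adapter_a
```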
If the --rename option is used with paired-end data, the template is applied separately to both R1 and R2. That is, for R1, the placeholders are replaced with values from R1, and for R2, the placeholders are replaced with values from R2. For example, {comment} becomes R1’s comment in R1 and it becomes R2’s comment in R2.
For paired-end data, the placeholder {rn} is available (“read number”), and it is replaced with 1 in R1 and with 2 in R2.
In addition, it is possible to write a placeholder as {r1.placeholdername} or {r2.placeholdername}, which always takes the replacement value from R1 or R2, respectively. The {r1.placeholder} and {r2.placeholder} notation is available for all placeholders except {rn} and {id} because the read ID needs to be identical for both reads.
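The paired-end rules can be sketched in the same spirit. The function `render_paired` below is hypothetical and takes the per-read placeholder values as plain dictionaries; it only illustrates how {rn} and the {r1.…}/{r2.…} prefixes select their values, not how Cutadapt implements them.

```python
import re

PLACEHOLDER = re.compile(r"\{(r[12]\.)?(\w+)\}")


def render_paired(template, r1_values, r2_values):
    """Apply one template to both reads of a pair; returns (R1 name, R2 name)."""

    def render(own_values, read_number):
        def replace(m):
            prefix, name = m.group(1), m.group(2)
            if prefix is None and name == "rn":
                return str(read_number)  # 1 in R1, 2 in R2
            if prefix == "r1.":
                source = r1_values       # always take the value from R1
            elif prefix == "r2.":
                source = r2_values       # always take the value from R2
            else:
                source = own_values      # value from the read being renamed
            return source.get(name, m.group(0))

        return PLACEHOLDER.sub(replace, template)

    return render(r1_values, 1), render(r2_values, 2)


# Hypothetical placeholder values for one read pair.
r1 = {"id": "read1", "adapter_name": "fwd"}
r2 = {"id": "read1", "adapter_name": "no_adapter"}

names = render_paired("{id}/{rn} {r1.adapter_name}+{r2.adapter_name}", r1, r2)
print(names)  # ('read1/1 fwd+no_adapter', 'read1/2 fwd+no_adapter')
```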
Galaxy Wrapper Development
Original author: Lance Parsons <lparsons@princeton.edu>