Module 3 | Lesson 3 | Fly-CURE - Trimming and Filtering

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How can I get rid of sequence data that doesn’t meet my quality standards?

Objectives
  • Clean FASTQ reads using Trimmomatic.

  • Select and set multiple options for command-line bioinformatic tools.

  • Write for loops with two variables.

Recorded Lesson:

Module 3 | Lesson 3 | Fly-CURE - Trimming and Filtering

Create local directory

We are going to install a tool called GoCommands to facilitate working in a local directory and moving the data to and from the data store.

Since the tool is not yet built into the app, the first step will be to install GoCommands. Video directions in the alignment lesson.

$ cd
$ GOCMD_VER=$(curl -L -s https://raw.githubusercontent.com/cyverse/gocommands/main/VERSION.txt); \
curl -L -s https://github.com/cyverse/gocommands/releases/download/${GOCMD_VER}/gocmd-${GOCMD_VER}-linux-amd64.tar.gz | tar zxvf -

The second step is to connect GoCommands to your account.

$ ./gocmd init (hit enter)
iRODS Host [data.cyverse.org]: (hit enter)
iRODS Port [1247]: (hit enter)
iRODS Zone [iplant]: (hit enter)  
iRODS Username: (enter your CyVerse username)  
iRODS Password: (enter your CyVerse password and hit enter. *You will not see the text as you are typing.)  

The last step is to copy the data you need from the datastore to a local directory. To minimize the data transfer, each lesson will contain the path to grab only the data or directory you need. Although it seems like more work, it will take less time to upload using this method.

$ cd
$ mkdir -p FlyCURE/scripts
$ mkdir -p FlyCURE/fastq_joined
$ ./gocmd sync i:/iplant/home/your_username/FlyCURE/fastq_joined ./FlyCURE/

All paths in this lesson assume you are working in the local FlyCURE/ directory.

Cleaning Reads

In the previous lesson, we took a high-level look at the quality of each of our samples using FastQC. We visualized per-base quality graphs showing the distribution of read quality at each base across all reads in a sample and extracted information about which samples fail which quality checks. Some of our samples failed quite a few quality metrics used by FastQC. This doesn’t mean, though, that our samples should be thrown out! It’s very common to have some quality metrics fail, and this may or may not be a problem for your downstream application. For our variant calling workflow, we will be removing some of the low quality sequences to reduce our false positive rate due to sequencing error.

We will use a program called Trimmomatic to filter poor quality reads and trim poor quality bases from our samples.

Download adapters for Trimmomatic

mkdir -p FlyCURE/adapters
cd FlyCURE/adapters

curl -O https://data.cyverse.org/dav-anon/iplant/home/kbieser/fastq_joined/custom_PE.fa

Trimmomatic Options

Trimmomatic has a variety of options to trim your reads. If we run the following command, we can see some of our options.

$ trimmomatic

Which will give you the following output:

Usage:
       PE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>...
   or:
       SE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] <inputFile> <outputFile> <trimmer1>...
   or:
       -version

This output shows us that we must first specify whether we have paired end (PE) or single end (SE) reads. Next, we specify what flag we would like to run. For example, you can specify threads to indicate the number of processors on your computer that you want Trimmomatic to use. In most cases using multiple threads (processors) can help to run the trimming faster. These flags are not necessary, but they can give you more control over the command. The flags are followed by positional arguments, meaning the order in which you specify them is important. In paired end mode, Trimmomatic expects the two input files, and then the names of the output files. These files are described below. While, in single end mode, Trimmomatic will expect 1 file as input, after which you can enter the optional settings and lastly the name of the output file.

option meaning
<inputFile1> Input reads to be trimmed. Typically the file name will contain an _1 or _R1 in the name.
<inputFile2> Input reads to be trimmed. Typically the file name will contain an _2 or _R2 in the name.
<outputFile1P> Output file that contains surviving pairs from the _1 file.
<outputFile1U> Output file that contains orphaned reads from the _1 file.
<outputFile2P> Output file that contains surviving pairs from the _2 file.
<outputFile2U> Output file that contains orphaned reads from the _2 file.

The last thing trimmomatic expects to see is the trimming parameters:

step meaning
ILLUMINACLIP Perform adapter removal.
SLIDINGWINDOW Perform sliding window trimming, cutting once the average quality within the window falls below a threshold.
LEADING Cut bases off the start of a read, if below a threshold quality.
TRAILING Cut bases off the end of a read, if below a threshold quality.
CROP Cut the read to a specified length.
HEADCROP Cut the specified number of bases from the start of the read.
MINLEN Drop an entire read if it is below a specified length.
TOPHRED33 Convert quality scores to Phred-33.
TOPHRED64 Convert quality scores to Phred-64.

We will use only a few of these options and trimming steps in our analysis. It is important to understand the steps you are using to clean your data. For more information about the Trimmomatic arguments and options, see the Trimmomatic manual.

However, a complete command for Trimmomatic will look something like the command below. This command is an example and will not work, as we do not have the files it refers to:

$ trimmomatic PE -threads 4 SRR_1056_1.fastq SRR_1056_2.fastq  \
              SRR_1056_1.trimmed.fastq SRR_1056_1un.trimmed.fastq \
              SRR_1056_2.trimmed.fastq SRR_1056_2un.trimmed.fastq \
              ILLUMINACLIP:SRR_adapters.fa SLIDINGWINDOW:4:20

In this example, we’ve told Trimmomatic:

code meaning
PE that it will be taking a paired end file as input
-threads 4 to use four computing threads to run (this will speed up our run and should equal the number of cores)
SRR_1056_1.fastq the first input file name
SRR_1056_2.fastq the second input file name
SRR_1056_1.trimmed.fastq the output file for surviving pairs from the _1 file
SRR_1056_1un.trimmed.fastq the output file for orphaned reads from the _1 file
SRR_1056_2.trimmed.fastq the output file for surviving pairs from the _2 file
SRR_1056_2un.trimmed.fastq the output file for orphaned reads from the _2 file
ILLUMINACLIP:SRR_adapters.fa to clip the Illumina adapters from the input file using the adapter sequences listed in SRR_adapters.fa
SLIDINGWINDOW:4:20 to use a sliding window of size 4 that will remove bases if their phred score is below 20

Multi-line commands

Some of the commands we ran in this lesson are long! When typing a long command into your terminal, you can use the \ character to separate code chunks onto separate lines. This can make your code more readable.

Running Trimmomatic for a single sample

Now we will run Trimmomatic on our data. To begin, navigate to your fastq_joined data directory:

$ cd FlyCURE/fastq_joined

We are going to run Trimmomatic on one of our paired-end samples and then build the pieces of a script to run it on all of our samples. At the start of this lesson you downloaded adapters Trimmomatic will use. These adapters include a polyG sequence (present in some of the samples as seen viewing FastQC results), NexteraPE, TruSeq2, and TruSeq3 adapter sequences that come standard with the installation of Trimmomatic.

We will use a leading and trailing threshold quality of 3. The entire read will be dropped if the sequence length is less than 36 bp (the MINLEN setting). This command will take about 10 minutes to run.

$ trimmomatic PE -threads 4 \
G-3-4_R1.fastq.gz G-3-4_R2.fastq.gz \
G-3-4_R1.trim.fastq.gz G-3-4_R1.orphan.fastq.gz \
G-3-4_R2.trim.fastq.gz G-3-4_R2.orphan.fastq.gz \
ILLUMINACLIP:../adapters/custom_PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
TrimmomaticPE: Started with arguments:
 -threads 8 G-3-4_R1.fastq.gz G-3-4_R2.fastq.gz G-3-4_R1.fastq.gz G-3-4_R1.orphan.fastq.gz G-3-4_R2.trim.fastq.gz G-3-4_R2.orphan.fastq.gz ILLUMINACLIP:../adapters/custom_PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
Using PrefixPair: 'AGATGTGTATAAGAGACAG' and 'AGATGTGTATAAGAGACAG'
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT'
Using Long Clipping Sequence: 'GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'TTTTTTTTTTAATGATACGGCGACCACCGAGATCTACAC'
Using Long Clipping Sequence: 'TTTTTTTTTTCAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT'
Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTGACGCTGCCGACGA'
Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
ILLUMINACLIP: Using 2 prefix pairs, 15 forward/reverse sequences, 1 forward only sequences, 1 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 19671500 Both Surviving: 18935218 (96.26%) Forward Only Surviving: 1 (0.00%) Reverse Only Surviving: 0 (0.00%) Dropped: 736281 (3.74%)
TrimmomaticPE: Completed successfully

You may have noticed that Trimmomatic automatically detected the quality encoding of our sample. It is always a good idea to double-check this or to enter the quality encoding manually.

We can confirm that we have our output files:

$ ls -lh G-3-4*
-rwx------ 1 gea_user gea_user 1.3G May 30 20:27 G-3-4_R1.fastq.gz
-rwx------ 1 gea_user gea_user 1.3G May 30 20:29 G-3-4_R2.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.3G May 31 18:45 G-3-4_R1.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.3G May 31 18:45 G-3-4_R2.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user  12M May 31 18:45 G-3-4_R1.orphan.fastq.gz
-rw-r--r-- 1 gea_user gea_user 282K May 31 18:45 G-3-4_R2.orphan.fastq.gz

The output files are also FASTQ files. The output .trim files should be smaller than our input file, because we’ve removed reads. The output orphan files are the smallest because those are sequences that we removed because their mate was a terrible read. The absolute worst reads are now gone. It’s partner ended up in those orphan files.

We’ve just successfully run Trimmomatic on one of our FASTQ files! However, there is some bad news. Trimmomatic can only operate on one sample at a time and we have more than one sample. The good news is that we can use a for loop to iterate through our sample files quickly!

Running Trimmomatic using the script method

To build our Trimmomatic for loop, let’s start by reviewing the pieces of a for loop.

$ for infile in *_R1.fastq.gz
> do
> echo ${infile}
> done

Here, we defined our variable infile to call our samples that we will run Trimmomatic on.

A44_R1.fastq.gz
B-2-13_R1.fastq.gz
B-2-16_R1.fastq.gz
Control_R1.fastq.gz
F-1-4_R1.fastq.gz
H22_R1.fastq.gz
L31_R1.fastq.gz
L-3-2_R1.fastq.gz
N-1-1_R1.fastq.gz
N-1-4_R1.fastq.gz
B-1-3_R1.fastq.gz
E-2-2_R1.fastq.gz
G-3-4_R1.fastq.gz
H-3-2_R1.fastq.gz
O-2-2_R1.fastq.gz

Next, continue working through the steps of building the for loop.

$ for infile in *_R1.fastq.gz
> do
> echo ${infile}
> echo "this is a line that isn't a filename"
> done
A44_R1.fastq.gz
this is a line that isn't a filename
B-1-3_R1.fastq.gz
this is a line that isn't a filename
B-2-13_R1.fastq.gz
this is a line that isn't a filename
B-2-16_R1.fastq.gz
this is a line that isn't a filename
Control_R1.fastq.gz
this is a line that isn't a filename
F-1-4_R1.fastq.gz
this is a line that isn't a filename
E-2-2_R1.fastq.gz
this is a line that isn't a filename
G-3-4_R1.fastq.gz
this is a line that isn't a filename
H22_R1.fastq.gz
this is a line that isn't a filename
H-3-2_R1.fastq.gz
this is a line that isn't a filename
L31_R1.fastq.gz
this is a line that isn't a filename
L-3-2_R1.fastq.gz
this is a line that isn't a filename
N-1-1_R1.fastq.gz
this is a line that isn't a filename
N-1-4_R1.fastq.gz
this is a line that isn't a filename

Consider the original Trimmomatic command, it had 2 inputs and 4 output files and our loop currently is only aware of 1 of our 2 input files. We need to think of a way to make it aware of all of the others. What do the file names have in common?

$ ls -lh

It looks like there is a common suffix or base to the name. So we can manipulate the content of the variable that we already know called infile. We can manipulate the variable called infile and save those manipulated contents to a new variable.

Let’s use basename to demonstrate that we can remove the suffix.

$ basename G-3-4_R1.fastq.gz _R1.fastq.gz
G-3-4

How can we get the output of basename to get stored in a new variable? We can use a special syntax with $() to start a command, run something, and then get back to finishing what it started. Let’s make a new variable and do a subshell of it to see how it works. The output of what is run in () gets stored in prefix.

$ prefix=$(basename G-3-4_R1.fastq.gz _R1.fastq.gz)
$ echo ${prefix}
G-3-4_S1

We can combine the basename command with our existing variable. Let’s go to the script we are writing.

$ for infile in *_R1.fastq.gz
> do
>  echo ${infile}
>  base=$(basename ${infile} _R1.fastq.gz)
>  echo ${base}
> done
A44_R1.fastq.gz
A44
B-1-3_R1.fastq.gz
B-1-3_S1
B-2-13_R1.fastq.gz
B-2-13_S1
B-2-16_R1.fastq.gz
B-2-16_S2
Control_R1.fastq.gz
Control
F-1-4_R1.fastq.gz
F-1-4
E-2-2_R1.fastq.gz
E-2-2_S4
G-3-4_R1.fastq.gz
G-3-4_S2
H22_R1.fastq.gz
H22
H-3-2_R1.fastq.gz
H-3-2_S3
L31_R1.fastq.gz
L31
L-3-2_R1.fastq.gz
L-3-2_S3
N-1-1_R1.fastq.gz
N-1-1_S4
N-1-4_R1.fastq.gz
N-1-4_S5

Remember your history command? We had a pretty complicated Trimmomatic command that we used and we want to substitute the variable in there. Let’s remember what that command was. Who remembers what command shows us things we’ve already done? Look for the Trimmomatic command in your history (or go back to the beginning of this lesson to find it).

$ history | grep trimmomatic

Copy your Trimmomatic command and open nano.

$ nano

Nano is unfortunately bad with copying and pasting. You will need to scroll by using arrow keys and use a \ return to get the lines in a readable format.

$ trimmomatic PE -threads 4 \
G-3-4_R1.fastq.gz G-3-4_R2.fastq.gz \
G-3-4_R1.trim.fastq.gz G-3-4_R1.orphan.fastq.gz \
G-3-4_R2.trim.fastq.gz G-3-4_R2.orphan.fastq.gz \
ILLUMINACLIP:../adapters/custom_PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36

Once it is copied in you are now going to convert it to a for loop by replacing the explicit parts with variables (see the Exercise below). Expect that the Trimmomatic script will run for 1 to 1.5 hours.

Tips to view progress of a script

As a tip to view the progress of a script and make sure it is running, you can do a few things. In a new terminal you can type the top command and you should see trimmomatic and trim.sh in the list (or whatever process you may be running). Another useful method is to watch file sizes change in the directory they are being created. Navigate to fastq_joined and type ls -lh. What this will do is show you that the .trim and .orphan files are being generated and that the file sizes of these should change over time. If you refresh the command over time you will see new files being created and the file sizes changing. Once Trimmomatic is complete, you can view fastq_trimmed and fastq_trimmed_orphans and the files should be there if you built the mkdir and mv commands into your script.

Exercise

Using the steps we tested above, create a script and for loop to run Trimmomatic on all of our fastq files. The first line of every script will be #!/bin/bash, so that the script is run in bash. Since you have the Trimmomatic command pasted into nano, comment out each line and use it to help you build the for loop. Save the text file as trim.sh and move to your scripts directory. If you really want to challenge yourself, you can add in commands to make a new fastq_trimmed/ and move the .trim and .orphan files to there own new directory’s all within the script. Comment out notes within the text to help yourself. Don’t forget you have to make the script executable in order to run it. Run the script.

Solution

I moved into scripts/ and used nano to open a trim.sh There are different variations that could be used which will still work. This is what I did.

#!/bin/bash
# Script to run trimmomatic
# Run me from fastq_joined

# Example of trimmomatic command to run on a single sample
# trimmomatic PE -threads 4 \
# A44_R1.fastq.gz A44_R2.fastq.gz \
# A44_R1.trim.fastq.gz A44_R1.orphan.fastq.gz \
# A44_R2.trim.fastq.gz A44_R2.orphan.fastq.gz \
# ILLUMINACLIP:../adapters/custom_PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36

# Script to run trimmomatic on all samples
for infile in *_R1.fastq.gz
do
base=$(basename ${infile} _R1.fastq.gz)
trimmomatic PE -threads 4 \
${infile} ${base}_R2.fastq.gz \
${base}_R1.trim.fastq.gz ${base}_R1.orphan.fastq.gz \
${base}_R2.trim.fastq.gz ${base}_R2.orphan.fastq.gz \
ILLUMINACLIP:../adapters/custom_PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
done

mkdir -p FlyCURE/results/fastq_trimmed
mkdir -p FlyCURE/results/fastq_trimmed_orphans
mv *.trim* ../results/fastq_trimmed
mv *.orphan* ../results/fastq_trimmed_orphans

If you didn’t create the script within the scripts directory, then:

$ mv trim.sh  FlyCURE/scripts

Make the script executable

$ chmod +x trim.sh

Run the script.

$ cd  FlyCURE/fastq_joined
$ ../scripts/trim.sh &
Using PrefixPair: 'AGATGTGTATAAGAGACAG' and 'AGATGTGTATAAGAGACAG'
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT'
Using Long Clipping Sequence: 'GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'TTTTTTTTTTAATGATACGGCGACCACCGAGATCTACAC'
Using Long Clipping Sequence: 'TTTTTTTTTTCAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT'
Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTGACGCTGCCGACGA'
Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
ILLUMINACLIP: Using 2 prefix pairs, 15 forward/reverse sequences, 1 forward only sequences, 1 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 19671500 Both Surviving: 18658198 (94.85%) Forward Only Surviving: 276410 (1.41%) Reverse Only Surviving: 39388 (0.20%) Dropped: 697504 (3.55%)
TrimmomaticPE: Completed successfully

You will need to see TrimmomaticPE: Completed successfully for each of the mutants to know that it has completed for all samples. anticipate that this will take a minimum of 30 minutes or longer depending upon how many samples you are trimming.

$ cd FlyCURE/results/fastq_trimmed

Confirm that all of your files were created and of the size you anticipate with an ls -lh.

-rw-r--r-- 1 gea_user gea_user 897M Feb 24 21:02 A44_R1.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 920M Feb 24 21:02 A44_R2.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.3G Feb 24 21:02 B-1-3_R1.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.3G Feb 24 21:03 B-1-3_R2.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.3G Feb 24 21:03 B-2-13_R1.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.4G Feb 24 21:03 B-2-13_R2.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.3G Feb 24 21:03 B-2-16_R1.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.3G Feb 24 21:04 B-2-16_R2.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.2G Feb 24 21:04 Control_R1.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.2G Feb 24 21:04 Control_R2.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.6G Feb 24 21:08 F-1-4_R1.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.6G Feb 24 21:09 F-1-4_R2.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.2G Feb 24 21:04 E-2-2_R1.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.2G Feb 24 21:04 E-2-2_R2.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.3G Feb 24 21:05 G-3-4_R1.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.3G Feb 24 21:05 G-3-4_R2.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.7G Feb 24 21:05 H22_R1.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.7G Feb 24 21:06 H22_R2.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.3G Feb 24 21:05 H-3-2_R1.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.3G Feb 24 21:05 H-3-2_R2.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.7G Feb 24 21:06 L31_R1.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.7G Feb 24 21:07 L31_R2.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.4G Feb 24 21:06 L-3-2_R1.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.4G Feb 24 21:06 L-3-2_R2.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.4G Feb 24 21:07 N-1-1_R1.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.4G Feb 24 21:07 N-1-1_R2.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.4G Feb 24 21:07 N-1-4_R1.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.5G Feb 24 21:08 N-1-4_R2.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.2G Feb 24 21:08 O-2-2_R1.trim.fastq.gz
-rw-r--r-- 1 gea_user gea_user 1.3G Feb 24 21:08 O-2-2_R2.trim.fastq.gz

Exercise

Now that our samples have gone through quality control, they should perform better on the quality tests run by FastQC. Go ahead and re-run FastQC on your trimmed FASTQ files and visualize the HTML files to see whether your per base sequence quality is higher after trimming. Modify your fastqc.sh script to run on the trim.fastq.gz. You can choose to use flags in the fastqc command to move the files > to a new directory or you can move the files after they have been created with a mv command. Name the new script fastqc_trimmed.sh. Include comments for yourself.

Solution

First edit your fastqc.sh and save as fastqc_trimmed.sh in your scripts/.

#!/bin/bash
# script to run fastqc on trimmed reads and move output to a new directory
# Run me in  FlyCURE/results/fastq_trimmed or where the fastq_trimmed reads are located

mkdir -p FlyCURE/results/fastqc_trimmed_reads

fastqc -t 10 -o ../fastqc_trimmed_reads *.trim.fastq*
$ chmod +x fastqc_trimmed.sh
$ cd FlyCURE/results/fastq_trimmed
$ ../../scripts/fastqc_trimmed.sh &

Again, top should show that fastqc_trimmed is running and the terminal where the script is running will show you that the analysis » has started. Any other message indicates an error.

Save analysis back to the data datastore

Since we were working locally, you will want to save your work to the datastore for future access. DO NOT FORGET THIS STEP.

$ cd
$ ./gocmd sync --single_threaded FlyCURE/ i:/iplant/home/your_username/

Note: The --single_threaded flag reduces the number of TCP connections (aka communication between devices over a network) to 1 thread which will create 4 TCP connections. The assumption is the amount of TCP connections we are creating is overwhelming our environment. By reducing the number of threads and TCP connections this “should” prevent the syncs from failing. This is also dependent upon how many people are trying to do this at the same time. Fewer threads means the sync will be slower. If you want, you can also use the flag --thread_num 0 and replace the number with 1 or 2 or 3 or 4 to increase the number of threads and therefore the processing speed, but you increase failure by increasing threads. Each thread creates 4 TCP connections.

Key Points

  • The options you set for the command-line tools you use are important!

  • Data cleaning is an essential step in a genomics workflow.