Module 3 | Lesson 6 | Fly-CURE - bcftools

Overview

Teaching: 13 min
Exercises: 240+ min

Questions

How do I identify unique SNPs?

Objectives

Call unique SNPs by genomic coordinate for each mutant.

Write for loops for bcftools mpileup and calls.

Recorded Lesson:

Copying data to local directory

If this is a continuing analysis and you still have your FlyCURE/ directory, skip to Review Markdup Files.

If this is a new analysis, complete the following commands before proceeding. These are the only directories needed for the next lesson.

$ cd
$ GOCMD_VER=$(curl -L -s https://raw.githubusercontent.com/cyverse/gocommands/main/VERSION.txt); \
curl -L -s https://github.com/cyverse/gocommands/releases/download/${GOCMD_VER}/gocmd-${GOCMD_VER}-linux-amd64.tar.gz | tar zxvf -

The second step is to connect GoCommands to your account.

$ ./gocmd init (hit enter)
iRODS Host [data.cyverse.org]: (hit enter)
iRODS Port [1247]: (hit enter)
iRODS Zone [iplant]: (hit enter)  
iRODS Username: (enter your CyVerse username)  
iRODS Password: (enter your CyVerse password and hit enter. *You will not see the text as you are typing.)  

The last step is to copy the data you need from the datastore to a local directory. To minimize the data transfer, each lesson will contain the path to grab only the data or directory you need. Although it seems like more work, it will take less time to upload using this method.

$ mkdir -p FlyCURE/results/clean_bams
$ mkdir -p FlyCURE/scripts
$ ./gocmd sync i:/iplant/home/your_username/FlyCURE/results/clean_bams ./FlyCURE/results
$ ./gocmd sync i:/iplant/home/your_username/FlyCURE/ref_genome ./FlyCURE

All paths in this lesson assume you are working in the local FlyCURE/ directory.

Review the markdup files

Let’s first review the outputs from our bam_factory script.

$ cd FlyCURE/results/clean_bams
$ less G-3-4.markdup.flags.log

59629295 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
63703 + 0 supplementary
0 + 0 duplicates
59629295 + 0 mapped (100.00% : N/A)
59436664 + 0 paired in sequencing
29718332 + 0 read1
29718332 + 0 read2
58651556 + 0 properly paired (98.68% : N/A)
59436664 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
537202 + 0 with mate mapped to a different chr
194938 + 0 with mate mapped to a different chr (mapQ>=5)

This output is the final alignment statistics after completing all of the prior samtools scripts. From the first line, you can see that ~59 million reads passed quality control and 0 reads failed. If you continue down the list, 100% of the reads mapped to the genomic reference and 98.87% of both read 1 and read 2 paired and mapped properly. This tells us we can have high confidence in our alignment. If you review each of the .markdup.flags.log files, you should see that each of them have equally high percentages.

Run bcftools for making base calls

We now have the full alignments after running samtools. The next step is to use a program called bcftools. Bcftools will take the alignment files and produce a list of unique SNPs identified in each of the genomes we are analyzing. The first step will be to run pileup which takes our multiple alignment files (.bam) and generates a VCF containing genotype likelihoods. The second step will be to make variant base calls. Some of the flags we utilize will be program specific while others will be our selections based upon the type of mutation predictions we have. The output file type will be a VCF file which stands for the Variant Call Format.

Link to bcftools

Similar to before, we are going to construct a script containing 2 for loops. This is an example if you were running the scripts on a single mutant sample.

#!/bin/bash
# what I do:

# Loop 1: mpileup on each sample
# Example of command for a single sample
# bcftools mpileup -Q 30 -C 50 -P Illumina \
# -Ov -B -f ref_genome/bdgp6.32.fa \
# -o vcfs/A44.pileup.vcf \
# clean_bams/A44.markdup.bam &

# Loop 2: sample call for each sample
# Example of a command for a single sample
# bcftools call -c -Ov -v -V 'indels' \
# -o vcfs/A44.calls.vcf vcfs/A44.pileup.vcf &

# Run me in clean_bams where the *.markdup.bam files are

What the bcftools mpileup and bcftools call scripts do

If we break down the script, you should notice there are 2 for loops. In Loop 1, we use bcftools mpileup to take the multiple alignment files (.bam) and generate a VCF (Variant Call Format) containing genotype likelihoods. The VCF file will then be used as the input for our bcftools call program. Bcftools call will make variant base calls to help us begin our identification of a causative mutation in the fly.

These are the options we are utilizing in Loop 1 for bcftools mpileup:

-Q : 30 (minimum base quality for a base to be considered)
-C : 50 (recommended value for BWA)
-P : A file with tab-delimited sample pairs to compare.
Illlumina : Sequencing platform used
-Ov : Output (O) type as uncompressed VCF (v) file
-B : BAQ is the Phred-scaled probability of a read base being misaligned. Applying this option greatly helps to reduce false SNPs caused by misalignments.
-f : The faidx-indexed reference file in the FASTA format.
-o : Output directory

These are the options you may utilize in Loop 2 for bcftools call:

-c : The original samtools/bcftools calling method
-Ov : Our output is an uncompressed vcf
-v : Output variant sites only
-V indels : Skip indel sites (indel = insertions and deletions). *Although EMS doesn’t classicly result in indels, we have identified some mutants with them.

There are additional options you could use which are viewable at the bcftools website. These are the standard options for our pipeline.

Exercise

Now that you have an example of the 2 for loops provided above, turn these into for loops that will run all of our samples. Name this new script bcftools_mpileup.sh. Be sure to include the output directory at the top of the script as follows: outdir=../vcfs (The vcfs directory will be created in results and where the output files for the 2 for loops will be placed.)

When you are confident in your script, make it executable and run. this script will take a few hours as it is identifying SNPs (single nucleotide polymorphisms or variants) in each genome.
Solution
cd FlyCURE/scripts
nano bcftools_mpileup.sh
# Run me in clean_bams where the *.markdup.bam files are

outdir='../vcfs'
mkdir -p $outdir

# loop #1 bcftools mpileup
for i in *.markdup.bam; do
 prefix=$(basename $i .markdup.bam)
 bcftools mpileup -Q 30 -C 50 -P Illumina \
 -Ov -B -f ../../ref_genome/bdgp6.46.fa \
 -o $outdir/${prefix}.pileup.vcf \
 ${i} &
done
wait

# loop #2 bcftools call
for i in *.markdup.bam; do
 prefix=$(basename $i .markdup.bam)
 bcftools call -c -Ov -v \
 -o $outdir/${prefix}.calls.vcf $outdir/${prefix}.pileup.vcf &
done
Note: I chose to include indel calls for my script. You always have the option to exlude them now and re-do the analysis to include them later. The choice is yours!

Make the script executable!
$ chmod +x bcftools_mpileup.sh
Run the script!
$ cd FlyCURE/results/clean_bams
$ ../../scripts/bcftools_mpileup.sh &

This will likely take a few hours (~2+) to run. I utilize top in one terminal to monitor that bcftools is running. I also utilize a second terminal to occasionally use ls -lh to see that files are being created and that the file sizes are changing. (If the screen starts scrolling really fast control C and try the script again). Once your output is complete and looks comparable to the image below, you may proceed to reduce the file sizes, by zipping your *.pileup.vcf.

$ cd FlyCURE/results/vcfs
$ ls -lh

When the script is finished here is what you should see (note .gz files won’t appear until after zipping). In top, bcftools will no longer be present.

-rw-r--r-- 1 gea_user gea_user 110M Apr 15 22:46 A44.calls.vcf
-rw-r--r-- 1 gea_user gea_user 2.8G Apr 15 22:46 A44.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 116M Apr 15 22:47 B-1-3.calls.vcf
-rw-r--r-- 1 gea_user gea_user  17G Apr 15 22:48 B-1-3.pileup.vcf
-rw-r--r-- 1 gea_user gea_user 3.0G Apr 15 22:50 B-1-3.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 119M Apr 15 22:50 B-2-13.calls.vcf
-rw-r--r-- 1 gea_user gea_user 3.1G Apr 15 22:51 B-2-13.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 116M Apr 15 22:51 B-2-16.calls.vcf
-rw-r--r-- 1 gea_user gea_user 3.0G Apr 15 22:51 B-2-16.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 117M Apr 15 22:52 Control.calls.vcf
-rw-r--r-- 1 gea_user gea_user  22G Apr 15 22:52 Control.pileup.vcf
-rw-r--r-- 1 gea_user gea_user 3.0G Apr 15 22:54 Control.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 132M Apr 15 23:12 F-1-4.calls.vcf
-rw-r--r-- 1 gea_user gea_user 3.3G Apr 15 23:12 F-1-4.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 112M Apr 15 22:54 E-2-2.calls.vcf
-rw-r--r-- 1 gea_user gea_user 2.9G Apr 15 22:55 E-2-2.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 114M Apr 15 22:55 G-3-4.calls.vcf
-rw-r--r-- 1 gea_user gea_user 3.0G Apr 15 22:55 G-3-4.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 128M Apr 15 22:56 H22.calls.vcf
-rw-r--r-- 1 gea_user gea_user  17G Apr 15 22:57 H22.pileup.vcf
-rw-r--r-- 1 gea_user gea_user 3.4G Apr 15 22:59 H22.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 130M Apr 15 22:55 H-3-2.calls.vcf
-rw-r--r-- 1 gea_user gea_user 3.0G Apr 15 22:56 H-3-2.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 125M Apr 15 23:01 L31.calls.vcf
-rw-r--r-- 1 gea_user gea_user 3.3G Apr 15 23:01 L31.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 125M Apr 15 22:59 L-3-2.calls.vcf
-rw-r--r-- 1 gea_user gea_user  17G Apr 15 23:00 L-3-2.pileup.vcf
-rw-r--r-- 1 gea_user gea_user 3.2G Apr 15 23:01 L-3-2.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 121M Apr 15 23:02 N-1-1.calls.vcf
-rw-r--r-- 1 gea_user gea_user  22G Apr 15 23:03 N-1-1.pileup.vcf
-rw-r--r-- 1 gea_user gea_user 3.1G Apr 15 23:05 N-1-1.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 118M Apr 15 23:06 N-1-4.calls.vcf
-rw-r--r-- 1 gea_user gea_user  22G Apr 15 23:07 N-1-4.pileup.vcf
-rw-r--r-- 1 gea_user gea_user 3.2G Apr 15 23:09 N-1-4.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 119M Apr 15 23:09 O-2-2.calls.vcf
-rw-r--r-- 1 gea_user gea_user  17G Apr 15 23:10 O-2-2.pileup.vcf
-rw-r--r-- 1 gea_user gea_user 3.0G Apr 15 23:11 O-2-2.pileup.vcf.gz

Reduce file size by zipping the *.pileup.vcf. Again, this will take some time.

$ cd FlyCURE/results/vcfs
$ sudo apt-get install pigz
$ pigz *.pileup.vcf

Save analysis back to the data datastore

Since we were working locally, you will want to save your work to the datastore for future access. DO NOT FORGET THIS STEP.

$ cd
$ ./gocmd sync --single_threaded FlyCURE/results/vcfs i:/iplant/home/your_username/FlyCURE/results

Note: The --single_threaded flag reduces the number of TCP connections (aka communication between devices over a network) to 1 thread which will create 4 TCP connections. The assumption is the amount of TCP connections we are creating is overwhelming our environment. By reducing the number of threads and TCP connections this “should” prevent the syncs from failing. This is also dependent upon how many people are trying to do this at the same time. Fewer threads means the sync will be slower. If you want, you can also use the flag --thread_num 0 and replace the number with 1 or 2 or 3 or 4 to increase the number of threads and therefore the processing speed, but you increase failure by increasing threads. Each thread creates 4 TCP connections.

Key Points

Bioinformatic command line tools are collections of commands that can be used to carry out bioinformatic analyses.

To use most powerful bioinformatic tools, you’ll need to use the command line.

There are many different file formats for storing genomics data. It’s important to understand what type of information is contained in each file, and how it was derived.

previous episode

Fly-CURE: Shell Genomics, Data Wrangling, and SNP Analyses

next episode

Module 3 | Lesson 6 | Fly-CURE - bcftools

Overview

Recorded Lesson:

Copying data to local directory

Review the markdup files

Run bcftools for making base calls

What the bcftools mpileup and bcftools call scripts do

Exercise

Solution

Save analysis back to the data datastore

Key Points

previous episode

next episode