Module 3 | Lesson 6 | Fly-CURE - bcftools
Overview
Teaching: 13 min
Exercises: 240+ minQuestions
How do I identify unique SNPs?
Objectives
Call unique SNPs by genomic coordinate for each mutant.
Write for loops for bcftools mpileup and calls.
Recorded Lesson:
Module 3 | Lesson 6 | Fly-CURE - bcftools
Copying data to local directory
If this is a continuing analysis and you still have your FlyCURE/ directory, skip to Review Markdup Files.
If this is a new analysis, complete the following commands before proceeding. These are the only directories needed for the next lesson.
$ cd
$ GOCMD_VER=$(curl -L -s https://raw.githubusercontent.com/cyverse/gocommands/main/VERSION.txt); \
curl -L -s https://github.com/cyverse/gocommands/releases/download/${GOCMD_VER}/gocmd-${GOCMD_VER}-linux-amd64.tar.gz | tar zxvf -
The second step is to connect GoCommands to your account.
$ ./gocmd init (hit enter)
iRODS Host [data.cyverse.org]: (hit enter)
iRODS Port [1247]: (hit enter)
iRODS Zone [iplant]: (hit enter)
iRODS Username: (enter your CyVerse username)
iRODS Password: (enter your CyVerse password and hit enter. *You will not see the text as you are typing.)
The last step is to copy the data you need from the datastore to a local directory. To minimize the data transfer, each lesson will contain the path to grab only the data or directory you need. Although it seems like more work, it will take less time to upload using this method.
$ mkdir -p FlyCURE/results/clean_bams
$ mkdir -p FlyCURE/scripts
$ ./gocmd sync i:/iplant/home/your_username/FlyCURE/results/clean_bams ./FlyCURE/results
$ ./gocmd sync i:/iplant/home/your_username/FlyCURE/ref_genome ./FlyCURE
All paths in this lesson assume you are working in the local FlyCURE/ directory.
Review the markdup files
Let’s first review the outputs from our bam_factory script.
$ cd FlyCURE/results/clean_bams
$ less G-3-4.markdup.flags.log
59629295 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
63703 + 0 supplementary
0 + 0 duplicates
59629295 + 0 mapped (100.00% : N/A)
59436664 + 0 paired in sequencing
29718332 + 0 read1
29718332 + 0 read2
58651556 + 0 properly paired (98.68% : N/A)
59436664 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
537202 + 0 with mate mapped to a different chr
194938 + 0 with mate mapped to a different chr (mapQ>=5)
This output is the final alignment statistics after completing all of the prior samtools scripts. From the first line, you can see that ~59 million reads passed quality control and 0 reads failed. If you continue down the list, 100% of the reads mapped to the genomic reference and 98.87% of both read 1 and read 2 paired and mapped properly. This tells us we can have high confidence in our alignment. If you review each of the .markdup.flags.log files, you should see that each of them have equally high percentages.
Run bcftools for making base calls
We now have the full alignments after running samtools. The next step is to use a program called bcftools. Bcftools will take the alignment files and produce a list of unique SNPs identified in each of the genomes we are analyzing. The first step will be to run pileup which takes our multiple alignment files (.bam) and generates a VCF containing genotype likelihoods. The second step will be to make variant base calls. Some of the flags we utilize will be program specific while others will be our selections based upon the type of mutation predictions we have. The output file type will be a VCF file which stands for the Variant Call Format.
Similar to before, we are going to construct a script containing 2 for loops
. This is an example if you were running the scripts on a single mutant sample.
#!/bin/bash
# what I do:
# Loop 1: mpileup on each sample
# Example of command for a single sample
# bcftools mpileup -Q 30 -C 50 -P Illumina \
# -Ov -B -f ref_genome/bdgp6.32.fa \
# -o vcfs/A44.pileup.vcf \
# clean_bams/A44.markdup.bam &
# Loop 2: sample call for each sample
# Example of a command for a single sample
# bcftools call -c -Ov -v -V 'indels' \
# -o vcfs/A44.calls.vcf vcfs/A44.pileup.vcf &
# Run me in clean_bams where the *.markdup.bam files are
What the bcftools mpileup and bcftools call scripts do
If we break down the script, you should notice there are 2 for loops
. In Loop 1, we use bcftools mpileup to take the multiple alignment files (.bam) and generate a VCF (Variant Call Format) containing genotype likelihoods. The VCF file will then be used as the input for our bcftools call program. Bcftools call will make variant base calls to help us begin our identification of a causative mutation in the fly.
These are the options we are utilizing in Loop 1 for bcftools mpileup:
- -Q : 30 (minimum base quality for a base to be considered)
- -C : 50 (recommended value for BWA)
- -P : A file with tab-delimited sample pairs to compare.
- Illlumina : Sequencing platform used
- -Ov : Output (O) type as uncompressed VCF (v) file
- -B : BAQ is the Phred-scaled probability of a read base being misaligned. Applying this option greatly helps to reduce false SNPs caused by misalignments.
- -f : The faidx-indexed reference file in the FASTA format.
- -o : Output directory
These are the options you may utilize in Loop 2 for bcftools call:
- -c : The original samtools/bcftools calling method
- -Ov : Our output is an uncompressed vcf
- -v : Output variant sites only
- -V
indels
: Skip indel sites (indel = insertions and deletions). *Although EMS doesn’t classicly result in indels, we have identified some mutants with them.
There are additional options you could use which are viewable at the bcftools website. These are the standard options for our pipeline.
Exercise
Now that you have an example of the 2
for loops
provided above, turn these intofor loops
that will run all of our samples. Name this new scriptbcftools_mpileup.sh
. Be sure to include the output directory at the top of the script as follows: outdir=../vcfs
(The vcfs directory will be created in results and where the output files for the 2for loops
will be placed.)When you are confident in your script, make it executable and run. this script will take a few hours as it is identifying SNPs (single nucleotide polymorphisms or variants) in each genome.
Solution
cd FlyCURE/scripts nano bcftools_mpileup.sh
# Run me in clean_bams where the *.markdup.bam files are outdir='../vcfs' mkdir -p $outdir # loop #1 bcftools mpileup for i in *.markdup.bam; do prefix=$(basename $i .markdup.bam) bcftools mpileup -Q 30 -C 50 -P Illumina \ -Ov -B -f ../../ref_genome/bdgp6.46.fa \ -o $outdir/${prefix}.pileup.vcf \ ${i} & done wait # loop #2 bcftools call for i in *.markdup.bam; do prefix=$(basename $i .markdup.bam) bcftools call -c -Ov -v \ -o $outdir/${prefix}.calls.vcf $outdir/${prefix}.pileup.vcf & done
Note: I chose to include indel calls for my script. You always have the option to exlude them now and re-do the analysis to include them later. The choice is yours!
Make the script executable!
$ chmod +x bcftools_mpileup.sh
Run the script!
$ cd FlyCURE/results/clean_bams $ ../../scripts/bcftools_mpileup.sh &
This will likely take a few hours (~2+) to run. I utilize top
in one terminal to monitor that bcftools is running. I also utilize a second terminal to occasionally use ls -lh
to see that files are being created and that the file sizes are changing. (If the screen starts scrolling really fast control C
and try the script again). Once your output is complete and looks comparable to the image below, you may proceed to reduce the file sizes, by zipping your *.pileup.vcf
.
$ cd FlyCURE/results/vcfs
$ ls -lh
When the script is finished here is what you should see (note .gz files won’t appear until after zipping). In top
, bcftools will no longer be present.
-rw-r--r-- 1 gea_user gea_user 110M Apr 15 22:46 A44.calls.vcf
-rw-r--r-- 1 gea_user gea_user 2.8G Apr 15 22:46 A44.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 116M Apr 15 22:47 B-1-3.calls.vcf
-rw-r--r-- 1 gea_user gea_user 17G Apr 15 22:48 B-1-3.pileup.vcf
-rw-r--r-- 1 gea_user gea_user 3.0G Apr 15 22:50 B-1-3.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 119M Apr 15 22:50 B-2-13.calls.vcf
-rw-r--r-- 1 gea_user gea_user 3.1G Apr 15 22:51 B-2-13.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 116M Apr 15 22:51 B-2-16.calls.vcf
-rw-r--r-- 1 gea_user gea_user 3.0G Apr 15 22:51 B-2-16.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 117M Apr 15 22:52 Control.calls.vcf
-rw-r--r-- 1 gea_user gea_user 22G Apr 15 22:52 Control.pileup.vcf
-rw-r--r-- 1 gea_user gea_user 3.0G Apr 15 22:54 Control.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 132M Apr 15 23:12 F-1-4.calls.vcf
-rw-r--r-- 1 gea_user gea_user 3.3G Apr 15 23:12 F-1-4.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 112M Apr 15 22:54 E-2-2.calls.vcf
-rw-r--r-- 1 gea_user gea_user 2.9G Apr 15 22:55 E-2-2.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 114M Apr 15 22:55 G-3-4.calls.vcf
-rw-r--r-- 1 gea_user gea_user 3.0G Apr 15 22:55 G-3-4.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 128M Apr 15 22:56 H22.calls.vcf
-rw-r--r-- 1 gea_user gea_user 17G Apr 15 22:57 H22.pileup.vcf
-rw-r--r-- 1 gea_user gea_user 3.4G Apr 15 22:59 H22.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 130M Apr 15 22:55 H-3-2.calls.vcf
-rw-r--r-- 1 gea_user gea_user 3.0G Apr 15 22:56 H-3-2.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 125M Apr 15 23:01 L31.calls.vcf
-rw-r--r-- 1 gea_user gea_user 3.3G Apr 15 23:01 L31.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 125M Apr 15 22:59 L-3-2.calls.vcf
-rw-r--r-- 1 gea_user gea_user 17G Apr 15 23:00 L-3-2.pileup.vcf
-rw-r--r-- 1 gea_user gea_user 3.2G Apr 15 23:01 L-3-2.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 121M Apr 15 23:02 N-1-1.calls.vcf
-rw-r--r-- 1 gea_user gea_user 22G Apr 15 23:03 N-1-1.pileup.vcf
-rw-r--r-- 1 gea_user gea_user 3.1G Apr 15 23:05 N-1-1.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 118M Apr 15 23:06 N-1-4.calls.vcf
-rw-r--r-- 1 gea_user gea_user 22G Apr 15 23:07 N-1-4.pileup.vcf
-rw-r--r-- 1 gea_user gea_user 3.2G Apr 15 23:09 N-1-4.pileup.vcf.gz
-rw-r--r-- 1 gea_user gea_user 119M Apr 15 23:09 O-2-2.calls.vcf
-rw-r--r-- 1 gea_user gea_user 17G Apr 15 23:10 O-2-2.pileup.vcf
-rw-r--r-- 1 gea_user gea_user 3.0G Apr 15 23:11 O-2-2.pileup.vcf.gz
Reduce file size by zipping the *.pileup.vcf
. Again, this will take some time.
$ cd FlyCURE/results/vcfs
$ sudo apt-get install pigz
$ pigz *.pileup.vcf
Save analysis back to the data datastore
Since we were working locally, you will want to save your work to the datastore for future access. DO NOT FORGET THIS STEP.
$ cd
$ ./gocmd sync --single_threaded FlyCURE/results/vcfs i:/iplant/home/your_username/FlyCURE/results
Note: The --single_threaded
flag reduces the number of TCP connections (aka communication between devices over a network) to 1 thread which will create 4 TCP connections. The assumption is the amount of TCP connections we are creating is overwhelming our environment. By reducing the number of threads and TCP connections this “should” prevent the syncs from failing. This is also dependent upon how many people are trying to do this at the same time. Fewer threads means the sync will be slower. If you want, you can also use the flag --thread_num 0
and replace the number with 1 or 2 or 3 or 4 to increase the number of threads and therefore the processing speed, but you increase failure by increasing threads. Each thread creates 4 TCP connections.
Key Points
Bioinformatic command line tools are collections of commands that can be used to carry out bioinformatic analyses.
To use most powerful bioinformatic tools, you’ll need to use the command line.
There are many different file formats for storing genomics data. It’s important to understand what type of information is contained in each file, and how it was derived.