Fly-CURE: Shell Genomics, Data Wrangling, and SNP Analyses

This course is divided into 3 modules.

Module 1: Introduction to the Command Line for Genomics
Module 2: Data Wrangling and Processing for Genomics
Module 3: Genomic Analysis of Mutated Drosophila melanogaster and SNP Identification

In module 1, you will be introduced to the command-line interface (the OS shell) and the graphical user interface (GUI), two different ways of interacting with a computer’s operating system. The shell is a program that presents a command-line interface, which lets you control your computer with commands typed at a keyboard instead of operating a GUI with a mouse/keyboard combination.

There are quite a few reasons to start learning about the shell:

Most bioinformatics tools have no graphical interface, so you have to use the shell. If you want to work in metagenomics or genomics, you’re going to need it.
The shell gives you power. The command line lets you do your work more efficiently and more quickly. When you need to do something tens to hundreds of times, knowing how to use the shell is transformative.
To use remote computers or cloud computing, you need to use the shell.

A lot of genomics analysis is done using command-line tools for three reasons:

1) you will often be working with a large number of files, and working at the command line rather than through a graphical user interface (GUI) allows you to automate repetitive tasks,
2) you will often need more computing power than is available on your personal computer, and connecting to and interacting with remote computers requires a command-line interface, and
3) you will often need to customize your analyses, and command-line tools often allow more customization than the corresponding GUI tools (if a GUI tool exists at all).
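As a minimal sketch of the first reason, the bash snippet below repeats one small task over every file in a directory. The file names, the sequence data, and the `count_reads` helper are made up for illustration and are not part of the course materials. The helper assumes well-formed FASTQ files in which every record spans exactly four lines.

```shell
# Create a scratch directory with two tiny, made-up FASTQ files (hypothetical data).
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$workdir/sample_A.fastq"
printf '@read1\nGGCA\n+\nIIII\n' > "$workdir/sample_B.fastq"

# count_reads (hypothetical helper): a FASTQ record spans four lines,
# so the number of reads is the line count divided by four.
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# One loop handles every matching file, whether there are two or two hundred.
for fq in "$workdir"/*.fastq; do
    echo "$(basename "$fq"): $(count_reads "$fq") reads"
done
```

The same loop would need dozens of mouse clicks per file in a GUI. At the command line, adding more files to the directory requires no change to the commands.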

In module 1, you will learn how to use the bash shell to interact with your computer through a command-line interface. In module 2, you will apply this new knowledge to carry out a common genomics workflow: identifying variants among sequencing samples taken from multiple individuals within a population. We will start with a set of sequenced reads (.fastq files), perform some quality control steps, align those reads to a reference genome, and end by identifying and visualizing variations among these samples. In module 3, you will use these same tools to analyze genomes from Drosophila melanogaster mutated with EMS (ethyl methanesulfonate) and identify the SNP responsible for a mutant phenotype.

As you progress through the modules, keep in mind that, even if you aren’t going to be doing this same workflow in your research, you will be learning some very important lessons about using command-line bioinformatic tools. What you learn here will enable you to use a variety of bioinformatic tools with confidence and greatly enhance your research efficiency and productivity.

Getting Started

The lessons in module 1 assume no prior experience with the tools covered in the module. However, learners are expected to have some familiarity with biological concepts, including the concept of genomic variation within a population. Participants should bring their laptops and plan to participate actively.

Module 2 assumes a working understanding of the bash shell. If you haven’t already completed the lessons in module 1 and aren’t familiar with the bash shell, please review those materials before starting module 2. Module 2 also assumes some familiarity with biological concepts, including the structure of DNA, nucleotide abbreviations, and the concept of genomic variation within a population.

Completing modules 1 and 2 will prepare you for module 3.

Schedule

	Setup	Download files required for the lesson
00:00	1. Module 1 | Lesson 1 | Getting Started	What is a command shell and why would I use one? What programs will I be using in class? How can I find the terminal and what is it?
00:26	2. Module 1 | Lesson 2 | Introducing the Shell	What is a command shell and why would I use one? How can I move around on my computer? How can I see what files and directories I have? How can I specify the location of a file or directory on my computer?
01:26	3. Module 1 | Lesson 3 | Navigating Files and Directories	How can I perform operations on files outside of my working directory? What are some navigational shortcuts I can use to make my work more efficient?
02:06	4. Module 1 | Lesson 4 | Working with Files and Directories	How can I view and search file contents? How can I create, copy and delete files and directories? How can I control who has permission to modify a file? How can I repeat recently used commands?
03:46	5. Module 1 | Lesson 5 | Redirection	How can I search within files? How can I combine existing commands to do new things?
05:46	6. Module 1 | Lesson 6 | Writing Scripts and Working with Data	How can we automate a commonly used set of commands?
06:34	7. Module 1 | Lesson 7 | Project Organization	How can I organize my file system for a new bioinformatics project? How can I document my work?
06:54	8. Module 2 | Lesson 1 | Next-Generation Sequencing Methods	How does NGS sequencing work?
06:54	9. Module 2 | Lesson 2 | Background and Metadata	What data are we using? Why is this experiment important?
06:54	10. Module 2 | Lesson 3 | Assessing Read Quality	How can I describe the quality of my data?
08:14	11. Module 2 | Lesson 4 | Trimming and Filtering	How can I get rid of sequence data that doesn’t meet my quality standards?
08:14	12. Module 2 | Lesson 5 | Variant Calling Workflow	How do I find sequence variants between my sample and a reference genome?
08:14	13. Module 2 | Lesson 6 | Automating a Variant Calling Workflow	How can I make my workflow more efficient and less error-prone?
08:14	14. Module 3 | Lesson 1 | Fly-CURE - Project Overview	What is the Fly-CURE?
08:14	15. Module 3 | Lesson 2 | Fly-CURE - Assessing Read Quality	How can I describe the quality of my data?
08:14	16. Module 3 | Lesson 3 | Fly-CURE - Trimming and Filtering	How can I get rid of sequence data that doesn’t meet my quality standards?
08:14	17. Module 3 | Lesson 4 | Fly-CURE - Alignment	How do I find sequence variants between my sample(s) and a reference genome?
14:34	18. Module 3 | Lesson 5 | Fly-CURE - Converting, Sorting, and Indexing bam files	How do I convert sam to bam files to sort and index the alignment?
18:48	19. Module 3 | Lesson 6 | Fly-CURE - bcftools	How do I identify unique SNPs?
23:01	20. Module 3 | Lesson 7 | Fly-CURE - SnpEff and SnpSift	How do I call unique SNPs for each of my mutants? Do any of the unique SNPs affect protein function?
23:01	21. Module 3 | Lesson 8 | Fly-CURE - Final Identification of SNPs	Can you identify the SNP causing the mutant phenotype? What changes would you make to the bioinformatics pipeline? What additional experiments would you conduct to verify the presumptive mutant SNP?
23:01	Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.