Full Comp Micro Lab Manual
User Manual:
Open the PDF directly: View PDF .
Page Count: 134
Download | |
Open PDF In Browser | View PDF |
Computational Microbiology Lab Manual (Biology 2003/3004) Tonya Ward Contributions by: Elise Morton, Gabe Al-Ghalith, Joseph Guhlin, Andrew Honsey, Jen Teshera-Levye, Rachel Soble, & Maxwell Kramer Getting Started Overview of Computational Microbiology Workflow During the computational microbiology section of this course you will use your own computer, and the supercomputers at the Minnesota Supercomputing Institute (MSI) to analyze microbiome data. Below is a diagram explaining which resources and programs will be found on your computer and on MSI. In order to analyze the microbiome datasets available, we must install some programs on your computer. These programs are listed in the table below. Links for these programs are also available on the course Moodle for your convenience. ## Warning: package 'knitr' was built under R version 3.3.2 Program Use Download Cost Putty Connect to MSI (Windows only) Mac: Not Needed Windows: http://www.putty.org free FileZilla Transfer files to/from MSI Mac & Windows:https:FileZilla-project.org free Sublime Text Writing text files Mac & Windows: www.sublimetext.com/3 free RStudio Analyze & plot data Mac and Windows: First: [https://cran.r-project.org & www.rstudio.com/products/rstudio/download free PuTTY PuTTY allows people with Windows to access the computers on MSI from the command line. It is a terminal application that we will use specifically for MSI. Mac and Linux users do not need to this, and will access MSI using their terminal instead. FileZilla FileZilla will let you connect your computer to the supercomputers at MSI https://FileZillaproject.org. This is a point and click program with a visual user interface. You can install it and we will set up the MSI-specifics later when we first log in to MSI. Sublime Text Sublime Text is a text editor https://www.sublimetext.com/3. Just like we might use Microsoft Word for writing our papers and assignments, we need a text editor to write the type of text files we use in computational microbiology. We need a special text editor because the computer will be reading in and processing our files directly, so they must be formatted properly. We cannot use something like Word, or Notepad because they embed special modifications into the files they create. For the ease of troubleshooting, everyone is required to use Sublime Text 3. This is a point and click (and type) program with a visual user interface. You can install and use it right away. RStudio R is the programming language we will use to analyze and plot our data. We can think of it like Microsoft Excel. R can do everything we would need to do in Excel and a lot more. To download R, we will need to download it for your particular operating system. Choose your appropriate link from here: http://cran.us.r-project.org/. After this downloads, you will probably get two R icons which open a dialogue which looks similar to the terminal. We will never be using these because it is kind of a clunky way to use R. Instead we will be installing an Integrated Development Environment (IDE) for R called RStudio. It works as a wrapper for R to create somewhat of a personalized console with different panes containing different information about what you are working on. To download it go to this link and choose the right installer for your operating system: http://www.rstudio.com/products/rstudio/download/. Accessing MSI What is MSI? MSI stands for the Minnesota Supercomputing Institute. It is an academic unit of the University of Minnesota. They offer numerous services, such a batch high performance computing (HPC), interactive HPC and data storage. For the context of this course, you can think of MSI as a resource of many large computers that we can use to perform our computational tasks. Serial versus Parallel Computing When we use MSI we can complete our tasks faster if we run them in parallel. This is similar to thinking about electricity. Look at the example below with light bulbs. On the left the circuit is set up in series (bulbs one after another), on the right the circuit is set up in parallel (electricity reaching all bulbs at the same time). We can do the same type of thing with computers. Below is an example of serial computation. When run a command on MSI we are taking a big dataset and executing a bunch of instructions. The instructions have to be done in a certain order and only one can be done at a time. If we do this in series, it can take a long time. For example, if we want to align millions of DNA sequences to a reference, also known as OTU picking (we will talk about what this is later in the course), it takes a really long time if we want to do it accurately. Below is an example of parallel computation. We can break up our problem into smaller chunks and run them at the same time on multiple processors. This way we can get to our results faster. In the example we are using OTU picking, but many of our computational microbiology commands can be run in parallel. So really, we use MSI so that we can quickly complete tasks that would take our own computers a really long time. Accessing MSI If you want to use your personal computer, you would physically interact with the keyboard and mouse attached to it. The computers at MSI are not located where we are, so we must access them remotely. To understand how, we should first learn some terminology: term Definition Client The computer where the user is sitting and where the connection is initiated from Server The remote computer that accepts the connection and provides the service IP Address Internet Protocol address. 4 8-bit number that all computers connected to the internet have, telling us where/who the computer is. MSI: @login.msi.umn.edu SSH Secure Shell. Enables command-line connection by creating an encrypted connection between your computer and the remote server To access MSI you must be using the UMN secure wifi, eduroam, or you must be logged into the VPN If you don't have the VPN, follow instructions here for on Moodle. Logging In: Mac/Linux: Open your terminal and type (test): ssh x500@login.msi.umn.edu It will then ask for your password, which is your x500 password. It will not show up as you type it. Windows: open PuTTY: First Time: Open PuTTY and do the following. Once you hit open, it will ask for your x500 password. Every Other time: click on your biol1961 PuTTY shortcut. Logout: exit Servers Available at MSI When you first access MSI you are on the login server. This server can be used to look at which files are there and make simple text files. To perform any other tasks, you must move to a server that is capable of performing larger tasks. The server we will use is 'lab'. Move to Lab Server ssh lab Directory Structure on MSI We can see the exact directory structure of MSI using FileZilla. Generally it looks like this: /home/ biol1961/ home_directory/ shared/ public/ Connecting to MSI with FileZilla Type in the host IP for MSI: login.msi.umn.edu Type in your username: x500 Type in your password Click Quickconnect Commenting Code "Programs must be written for people to read, and only incidentally for machines to execute." -Structure and Interpretation of Computer Programs, 1985 What is commenting? When we write and execute code we are telling the computer to do something. The computer reads our code, interprets what to do and executes it. As people, we write comments in the code for other people to read. These comments include explanations as to what the code is doing, instructions, what the inputs and outputs should be and other important pieces of information. Comments are not read by the computer. We ensure they are not read by the computer by placing special symbols before the comment text that stops the computer from reading that text and trying to interpret it. There are different types of comment symbols depending on which language you are writing in. For this course, we will be using the following symbol: # The pound symbol, or hashtag, tells the computer to not read any text that follows the symbol. The symbol is line specific, so once a new line is started the computer can again read the text. Example: # this line is a comment, not read by the computer cd file_path/to/DNA_sequences # the line above is NOT a comment and is read and executed by the computer Summary: '#' = a comment for us, and not the computer. Comments on Comments Commenting is also important for later in the course when we begin to write our own code. You will be expected to comment any code you write so that it can be interpreted by your instructors and peers. For this class, please consider the following points: 1. The value of a comment is proportional to the distance between the comment and the code. Good comments stay as close as possible to the code they're referencing. As distance increases, the comment becomes misleading. 2. Comments should be clear and concise 3. Comments should capture intent. Because we are learning in this course, what we want our code to do and what it actually does might be two different things. Commenting your intent can help your instructors help you! Adapted from https://blog.codinghorror.com/when-good-comments-go-bad/ The Command Line Why Command Line? We need to be able to interact with the Minnesota Supercomputing Institute's (MSI) supercomputers. Some of our computers work with the Windows or Mac iOS systems, while the supercomputers at MSI run UNIX. To work with UNIX, you have to learn how to enter commands from the terminal to tell the computer what to do. This will likely start out as trial and error for you, so don't get discouraged! You should try these commands and combinations of these commands to get a better understanding. Repetition is essential to learning these commands and how they work. Much of working with the command-line is memory. So copy-and-pasting commands is discouraged. Typing them on the command-line yourself will accelerate your learning and the command line provides instant-feedback. If you make a mistake, you can use the up arrow on the keyboard to correct your command, or you can simply re-type it and try to correct your error. When working from the command line, the case of the letter is important. Lowercase 'a' does not equal uppercase 'A'. So not only do you have to be careful of typos, you also have to be careful of the case of the letters! Please note that in this manual command line test will be in a different font. Inputs (commands) that you can actually try out will be bolded. With the new font, you should be able to tell the difference between many common characters. Characters such as 0 and a capital O are distinct. As are lower-case l and the number 1. The pipe | and l and 1 are also different. This font is called Monaco. Additional Help The following website can help with learning UNIX. You can type in a command, with options, and it will provide a description of that command and any options you supplied. Try it with ls -a Help website: http://www.explainshell.com/ Commands A command in UNIX/Linux can be as simple as ls. But can also be more advanced and have flags, affecting how the program is run, and arguments, telling the program where to work or which file to work with, depending on the command. Let's consider the following command: ls -a ~/../sample/ This command lists the files and directories, including hidden ones, in the sample directory. Notice the following: • The command, flags, and argument are all separated by a space • The first set of characters before any space occurs are the command (ls). If we forgot a space, and typed ls-a, the computer would try to run a command that is literally named 'lsa' which does not exist, and you will get command not found. Anatomy of a command (as it looks in the terminal): tward@login02 [~] % ls -a test_directory • ls is the command itself – • -a is a flag to the command, affecting how the program functions – • ls stands for list. It tells the computer to list the contents of something Flags modify the command. -a means all, and is telling the computer to list all files (even hidden ones) test_directory is the argument to the command – It is telling the computer what to list: List is all the files in the directory ~/Desktop/biol1961/file/ • tward@login02 [~] % is the terminal output – It tells us where we are (logged in under tward, in the login server login02). What's inside the [] tells us which directory we are in (~ means "home"), and % tells you when the command starts. We will leave this part out for all future examples. Commands may have a zero, one, or more flags and zero, one, or more arguments. For example, the cp command which copies a file will always have two arguments, but rarely uses flags. cp test_directory/sample.fastq test_directory/my_sample.fastq • cp is the command itself • test_directory/sample.fastq is the first argument, the name and location of the file to be copied • test_directory/my_sample.fastq is the second argument, the name and location to copy the file to This will copy the sample.fastq file to the same directory, and the name of the new file will be my_sample.fastq. Where Are We Working? • When working from the command line, we first start in our home directory (a directory is another name for a folder) • From our home directory we can reference other directories or paths to files • Commands are performed in the directory you are in • When you change directories (cd) you move from one location to another • Commands default to working in your current directory, but you can tell them to work somewhere else using a specified path Think of your own computer. When you open your "Finder" (Mac) or Windows Explorer (Windows), you start in your home directory. From there you have other directories you can enter. Inside those directories you might have even more subdirectories. Let's use an example of a pictures folder on your desktop. If you want to access Picture_1 from 2014, you would go to your Desktop, then into Pictures, then into Pictures_2014 and click picture_1. Your Home Directory Your home directory is your default starting place when you log in to MSI. On MSI, my home directory is '/home/biol1961/tward'. I can check this by opening my terminal. logging in to MSI, and typing pwd (print working directory). pwd /home/biol1961/tward I can also test this by writing using cd ~. With this we move to our home directory (cd = change directory, ~ = home). We then check what directory we are in with pwd (print working directory). cd ~ pwd /home/biol1961/tward We can also check using echo, which will print the output of the ~ to the screen. echo ~ /Users/tward Changing Directories • The command cd changes directories • Use cd ~ to change to your home directory • Use cd .. to move into the parent directory (up one from where you are) • Use cd shared to move into a directory called 'shared' cd ~/../shared/ means to change directories to: 1. My home directory using ~ (which is home/biol1961/tward) 2. Go up to the parent directory using ../ 3. And the to the shared directory using /shared If I want to move back to my home directory from here, I can type: cd ~ which means to move to my home directory. note that cd ~ will bring you home no mater where you are. Try moving around on MSI from your home directory to the main biol1961 directory, to the shared directory, to the public directory... Telling Commands Where Files Are • Commands & programs work from where you currently are (see: pwd) • When telling commands how to get somewhere, you can give a start position, such as in the following command: cp ~/../shared/File.txt ~ This tells the cp command to start at your home directory (~), and then move into the parent directory (..), and into the shared folder and copy the File.txt file to your home directory (~). Our Biology 1961 directory structure should look like this: home/ biol1961/ x500/ shared/ public/ If we want to work from our home directory we can reference it, regardless of where we currently are, like this /home/biol1961/x500. Note: a slash in front means the absolute path from the root of the directory tree (home). Relative path: biol1961/x500 Absolute path: /home/biol1961/x500 or ~ **remember ~ is short for your home directory. Absolute versus Relative File Paths Notice that in comparison to what we talk about above, here we are talking about referencing the path to a file, not the path itself. But the rules are still the same. An absolute file path is the file's address from the root of the computer's file system. For example: /home/biol1961/x500/my-file.txt is an absolute file path, and will work from any directory. A relative file path could be my-file.txt The command receiving this file path will only work if that file is in the same directory the command is running from. Another relative file path could be: ../my-file.txt This example would only work if the file is backward one directory (parent directory) from your current directory. You'll get a "No such file or directory" error when you have not referenced the location of the file properly. Relative file paths are used to reduce typing and make things easier. Absolute file paths will almost always work. Remember, the way to tell a relative path from an absolute path is by a leading forward slash /. Listing Files and Directories • The command ls will allow you to list files and directories • The default location it looks is your present working directory, which you can identify with the command pwd • You can also specify other locations to examine; by using ls ~/sample_directory/ you are asking to list files and directories found in the sample_directory folder • You can list hidden files by using the flag -a and you can see file size, date, modification, and permissions by using -l, you can get file sizes in human readable format with -h • You can combine these, and use ls -lah ~/Desktop/ to find out this information for files in the shared folder ls -lah /home/biol1961/tward/sample_directory total 1.0G drwxr-s---. drwxr-xr-x. -rw-r--r--. -rw-r--r--. -rw-r--r--. -rw-------. -rw-r--r--. -rw-r--r--. 3 tward 55 root 1 tward 1 tward 1 tward 1 tward 1 tward 1 tward biol1961 biol1961 biol1961 biol1961 biol1961 biol1961 biol1961 biol1961 4.0K Oct 3 13:04 . 12K Sep 6 09:30 .. 3.1K Sep 23 12:04 alpha_div.txt 0 Sep 27 13:39 alpha_otu.txt 0 Sep 27 13:39 alpha_taxa.txt 5.8K Oct 3 13:02 .bash_history 403 Oct 3 13:04 .bashrc 2.4M Sep 28 11:14 Gever_100.biom Try listing all of the files in your home directory Tab Complete If you want to write a path or file path, you can use the tab button to help you. Let's say we want to type the following /home/biol1961/tward/sample_directory/my-file.txt. If we start typing the first couple of letters /home/biol1961/twa, we can then use tab to complete the rest for us. If there is only one option, tab will fill in the word. If there is more than one option, hitting tab once will not fill in the word, but hitting it twice will list all our options. Try writing the path to your desktop by using the tab button. Moving/Copying Files To move files, use mv. To copy files, use cp. If we are in the shared/ directory, this command moves a file called seq.txt from the shared directory to our home directory. mv seq.txt ~ To copy the same file, we would have used cp instead and we would have a copy in both locations. Try copying a file from named copy_me.txt from the shared/ directory into your home directory. Creating a Directory To create a new directory, use mkdir. In this example, we use pwd to show we are in a new directory on the desktop. We use ls to show that there are no files or directories in the current directory. We then create a new directory with mkdir, and can now see it with ls, and then we change into that directory with cd. pwd Users/tward/Desktop/new_directory ls mkdir new_sub_directory ls new_sub_directory cd new_sub_directory/ Try making a new directory in your home directory. Trying copying a file from the shared/ folder (copy_me.txt) into the new directory you made. Counting Lines, words and characters In computational microbiology we will be using multiple types of text files. Some of these files will include sequences of DNA (all the A,T,G,C's), tables of how many bacteria are in each sample, and other files with information about diversity. Sometimes it is handy to be able to pull snippets of information from these files. For example: • we can count the number of lines, words, and characters in a file with the command wc • "Words" are defined here by groups of characters (strings) separated by a space. In this example, we get only the number of lines in the my_file.txt file using the -l flag with wc. Next, we get the number of lines, words, and number of characters in the file when we don't use the -l flag. wc -l ~/my_file.txt 552544 /home/biol1961/tward/my_file.txt wc ~/Desktop/biol1961/my_file.fastq 552544 552544 90114322 /home/biol1961/tward/my_file.txt Try counting the number of lines in the copy_me.txt file Getting lines from a file (beginning or end) • To get lines from the beginning of a file or end of a file, we can use the commands head or tail • By default, these commands will give us 10 lines, but we can change this with the flag -n K where K is the number of lines you would like to receive Here is an example, searching a file called sample.fastq. This file contains all the sequences of DNA for a sample in a dataset. So, the top 2 lines from a file would be head -n 2 sample.fastq head -n 2 ~/Desktop/biol1961/sample.fastq @M00784_000000000-A8BPP:1:1101:14310:1364#0/2 CGACAACCATGCATCACCTGTCACTTCTGTCCCCGAAGGGAAAAATGCGATTAGGCATCGGTCAAAAGGATCTCACC CTTCGCTCATCTTCTTCGCGTTGCTTCTAATTCCACCACATGCTCCCCTACTTCTCCGCCTCCCCCTCACTTCCTTT GAGTTTCACTCTTGCGAGCGTACTTCCCAGGCGTAGTACTTAATGCTTTCGCTGCGCCACCGTCGCGCTTCCCCCCC CACCCCTCCTTCCCATCTTTTCCTCCCTCCCCCTCCCGCGTCTCCCATCCCCCTCCCCTTCTCCCCCACC Searching a File • We can quickly search files with the grep command. • To use grep you supply a string to search for, followed by a file to search in • grep will send the result to standard output (typically the screen), which is the entire line that matches the search string you've provided. Here are some examples, searching sample.fastq. This file contains all the sequences of DNA for a sample in a dataset. grep GGGGGGATGAT ~/Desktop/biol1961/sample.fastq CGACGGCCATGCAACACCTCCACAGGCGCCCCGAAGGGCCTCATCATCTCTGAAACATTCGCCTACAGTTCAAGCTC CGGTAAGGTTCCTCGCGTATCATCCAATTAAACCCCCAGTTCCTCCGCTTTTGCCGGCCCCCGTCAATTCCTTTGAG GTTCTACCCTGCCGGCGTACTCCCCCGGGGGGATGATTCATGCCTTCGCTTGGCCGCTTACGACAGACGCAACCAAC GATCAACCATCATTTACGGCGTGCACTACACGGCTCACGATTCTCACTCCTCTCATCTATCACCACTCCC CGACAACCATGCAGCACCTGTATCAGTATCCCCGAAGGGACTATGTAACTTTACAGGAATTACTGGAAGGCAAGACC TGGGAAGGGTCCTCGCGTTGCTACGAAATAAAACAAAAGCTCCGCAGCCTGTGCGGGCCCCCGTCAATTACATTGAG GTTCAAACTTGCGGCCGTACTCACCAGGGGGGATGATTAATGTGTTTACTTCGGAAAAGAAGGGGTCGATACCCAAT ACACCTAGCAGCAATCGTTTACAGTGTGGACTACAAGGGTATCTAGTCACCTGTATCTTATACAAATCTG CGACAACCATGCAGCACCTGCAAAGAGAGTACGAAGGAAGAGATAGTATTCAAAAGGGGCCACTGCAATTCAAGCAC GGGGAAGGGTCCTCGGCGATCATTGAATTAAAACACATGGTCCTACGGTTGTGACGGGCCCCGTCAATTTCTTTGAG GTTCACTGTTGCCGGAGTTATCCCCAGGGGGGATGATTAATGATTTTGCTGGGCCGCTCGAATGGTCTGGACAACAC AGGGACTCGACATTATACGTTGAGGCGTGCCAGGGACACGAACACACGGTCATTTGTCATCAACACACCC The lines returned are very long and are printed on multiple lines. However, they are the same length and you can see we find three lines that contain GGGGGGATGAT Redirecting command output (to another command) • We can use the output of one command as the input of another • The pipe, | , is a key typically above the enter key. It is a vertical line • The pipe redirects output from one command to input of another command For example, we can see how many lines in the sample.fastq file contains the sequence 'GATTACA'. grep counts the number of lines 'GATTACA' appears in the file, and feeds it to wc to count the number of lines. grep GATTACA ~/Desktop/biol1961/sample.fastq | wc -l 151 We can also see the last 2 lines that match GATTACA: grep GATTACA ~/Desktop/biol1961/sample.fastq | tail -2 CGACGGCCATGCACCACCTCGGCCTCCGTCCGAAGAGCCACCCATCTCTGGGTGTTTCAGGCGCCGTTCGAGCCCGT GTAAGGTTTCTTGCGTTTCATTGAATTTAACCACCTGTTTCTACGCCTGTTCGGGCCCCCCTCCAATTCCTTGAGGT TTCACGCTTCCGATGTTCCTCCCAGGTGGATGTACTATTGCTGTCGCCTGGGCACCGACAGGGTTCCGCCGGCGGAC ACCCATTATTCCTTGTTGAGTGGATTACATGGCAAGCTAATCACCCGTCTGTGTCTCTTCACACTCGCTC CGACGGCCATGCAACACATGTTTTCATGTCCCCGAAGGGAAAGCTCCATCTCTGGAGCGGTCAATCAATGTCAAGCC TTGGTAAGGTTCTTCGCGTTGCGTCGAATTAAACCACATACTCCACCGCTTGTGCGGGCCCCCGTAAATTCCTTTGA GGTTCATCCTTGCGGACGTACTCCCCAGGCGGGGTACTTATTGCGTTAACTCCGGCACAGAAGGGGTCGATACCTCC TACACCGAGTACCCATCGTTTACGGCAAGGACTACCGGGGATTACAACTCCCTGTCGCCTCTACCAATCT We can also string multiple grep commands together, for example we could search for lines containing 'ATG' that also contain 'TAG' that also contain 'GATTACA', and count the number of matches. grep ATG ~/Desktop/biol1961/sample.fastq | grep TAG | grep GATTACA | wc -l 138 Redirecting command output (to a file) • We can also redirect the output of a command to a file • Output can be directed to a new file with > (this will replace existing content if something is already there!) • You can add to the end of a file with >> (this will NOT replace existing content, but adds to it instead) We can use the same example as above, where we used grep to search for lines containing 'ATG' that also contain 'TAG' that also contain 'GATTACA', and direct the output to a file called 'many_grep.txt'. grep ATG ~/Desktop/biol1961/sample.fastq | grep TAG | grep GATTACA | wc -l > many_grep.txt We can write direct text to a file using echo. For example, we can write "the number of lines that contain ATG, TAG and GATTACA:" to the end of the many_grep.txt file. echo "This is the number of lines that contain ATG, TAG, GATTACA" >> many_grep.txt Try writing "I will master computational microbiology" to a new file called 'mantra.txt' on your desktop. Exploring .txt files from terminal • We can open and look at text files from the command line with the command nano • nano file.txt opens the file • Arrow keys move up and down • Directions for quitting the file are located at the bottom of the screen We can explore what the sample.fastq file looks like by typing the following the command, and then quitting by typing x. nano sample.fastq Try looking into your mantra.txt file on your Desktop. For more commands you can use, please see the Bash_commands file on Moodle. Please use this file and update it with commands you find useful as you complete the computational microbiology section. Modifying '.bashrc' on MSI The following instructions are for modifying your unmask setting on MSI. This setting is listed within your .bashrc file. It controls who has access to the files you create on MSI. The default setting is 077, which means all files and directories are private. We want to set it to 027, so that people in our group (e.g., TAs and instructors) can have access to the files you create. 1. Log onto MSI ssh x500@login.msi.umn.edu PuTTY users: Just click on your MSI PuTTY shortcut 2. Copy the .bashrc file from the shared directory to your home directory cp /home/biol1961/shared/.bashrc ~ 3. Source your file to make it active cd ~ source .bashrc Job Submission on MSI Why submit jobs MSI uses job queues to efficiently and fairly manage when computations are executed. The queuing system that MSI uses is called PBS, which stands for Portable Batch System. We submit a script to the supercomputer to be run at a later time (when the resources are available). This is the alternative to waiting around for hours, potentially longer, for the resources you need to become available. What is a script? A script is a file containing commands to be run in order. The script itself can then be run like a command and will execute all the tasks outlined in the file. There are many types of scripts. For example, all the commands we will run in QIIME are scripts that execute many tasks for us. A PBS job script is a type of script we use with the MSI supercomputers. It is a small plain text file containing information about what resources a job requires - including time, number of nodes and memory. The PBS script also contains the commands needed to begin the desired computation. Below is an example of a PBS script that we will use repeatedly. For our purposes, the text in black will remain the same for all jobs submitted to MSI. The text in red will vary, depending on the user and the job to be submitted. #! /bin/bash -l #PBS -l nodes=1:ppn=16,mem=2Gb,walltime=3:00:00 #PBS -m abe #PBS -M your_email #PBS -o job_name_stdout #PBS -e job_name_stderr cd /home/biol1961/x500 module load name_of_software command_X_Y_Z #!/bin/bash -l The first line in the PBS script defines which type of shell the script will be read with. We need the BASH shell and it will be read line by line (-l). #PBS -l nodes=1:ppn=16,mem=2Gb,walltime=3:00:00 The second line contains the PBS resource request. The sample job will require 3 hours, 1 node, each with 1 processor core per node (ppn), and 2 gigabytes of RAM (mem). Note that commands for the PBS queuing system begin with #PBS. Other commands to be run do not have the '#PBS'. #PBS -m abe #PBS -M your_email The third and fourth lines are both commands having to do with sending message emails to the user. The first of these lines instructs the PBS system to send a message email when the job aborts, begins, or ends. The second command specifies the email address to be used. #PBS -o job_name_stdout #PBS -e job_name_stderr The fifth and sixth lines specify the names of the files to which the job's output and errors should be written, respectively. You can change the name (job_name) to specify the job that is run (e.g. OTU_picking_stdout). cd home/biol1961/x500 A PBS script should also contain the appropriate change directory commands to get to the job execution location (in this case the computer will move to the user's home directory). module load name_of_software The script also needs to contain module load commands for any software modules that the calculation might need. 'Module load' is effectively the same as opening an application on your computer. This could be something like: module load qiime/1.8.0 command_X_Y_Z The last lines of a PBS script contain commands used to execute the job. This could be something like: pick_otus.py -i sequences.fasta -o otus. You must write your PBS scripts in your plain text editor (Sublime Text 3) and save them with the extension .pbs Check a Job Status there are two commands used to to check the progress of our job: qstat and showq. For our own jobs, we need to use the -u username flag to specify. If there is no flag for qstat, it will show all of the jobs currently running or waiting on the specified machine. showq will show more information about the jobs, including the starting time, expected finishing time, and usage of computational resources (processors, nodes). qstat -u x500 showq -u x500 Kill a Job When we check the job status, there is a Job ID starting with a number. This number can be used to kill the job. For example, if the Job ID is 4327.node1081.locald, type qdel 4327 or qdel 4327.node1081.locald We can kill multiple jobs in a row by specifying the Job IDs, or use all, if all the jobs need to be killed. QIIME What is QIIME? QIIME stands for Quantitative Insights Into Microbial Ecology. It is pronounced 'chime'. It is pipeline for preforming microbiome analysis from raw DNA sequences. Some of the things QIIME can do for us includes: • Quality filtering • OTU picking • Assigning taxonomy • Diversity analysis • Visualizations • Statistics QIIME uses a mix of other existing softwares and algorithms to perform its tasks. Because of this we call it a 'wrapper'. That means it wraps up many other existing tools and algorithms in a package that works as one cohesive unit. How Do We Use QIIME? As mentioned above, QIIME is a wrapper for many different components. This means installing QIIME can be extremely challenging because it requires MANY dependencies (other programs and algorithms). For this reason (and the computer power of MSI), we use the 1.8.0 version of QIIME installed on MSI. When we are on MSI and logged into the lab server, we can turn QIIME 'on' by typing: module load qiime/1.8.0 module load means out of all the modules (programs installed on MSI), load the one we are going to specify. There are different versions of QIIME on MSI, so we must specify that we want version 1.8.0 with qiime/1.8.0. If you wanted to see all the different modules available on MSI you could do so with module avail. If you wanted to turn off a module you had loaded, you can type module unload followed by the name of the module. QIIME has many commands (files that contain ordered lists of commands to be run), and these commands can be found at the link below. This is how we complete different steps in our analysis. QIIME Commands: http://qiime.org/scripts/ Try going to this page and click some of the commands links, for example click summarize_taxa.py and read what this command does, what the inputs are and the examples. Notice that all commands end in '.py'. This is because all the commands are written in a language called Python, and commands in Python end in '.py'. QIIME Workflow Here is a flowchart of the how QIIME works: Notice the initial inputs in green: Input Definition Sequencing Output This is all of the raw DNA sequences from the sequencer (.fastq files) Metadata This is another word for mapping file. It's a tab-delimited text file, where the sample IDs are rows and the columns are different categories of data. For example, which primers were used for sequencing, which body site the sample is from, clinical data (like if the sample came from a person with the disease or a control), etc. Many times we actually make these files in Excel, and export them as .txt files. The rest of the steps in this pipeline will be covered in detail throughout the rest of the course. Running Commands To determine how to run a command, we have to look up the documentation. We can either go to the command page (for QIIME) mentioned above and click on the command we want, or we can type the following: qiime_command.py -h By specifying -h we are saying 'HELP!'. The command will list the documentation associated with it. This output will not be as comprehensive as what is available online, but will at least tell us all the possible inputs and outputs. Example collapse_samples.py Online: This tells us what the command does: Uses the mapping file to collapse the OTU table. The minimum we need to specify (required) is: • The OTU table we care about in .biom format • The mapping file we care about • The output file path for our collapsed OTU table • The output file path for our collapsed OTU table • The field we would like to collapse by Optionally, we can also determine: + The collapse mode + To normalize or not Terminal: collapse_samples.py -h This will tell us (more or less) the same information, but prints it to the terminal screen. Quality Control of Sequence Data When we get our DNA sequences from the sequencer, there is some quality control that must be done. For example, the barcodes and primers used in the sequencing reaction should be removed. Also any sequences that are too low in quality should be discarded. If we are analyzing 16S amplicon data, then we should also trim our sequences to the expected amplicon size. If we used paire-end sequencing we can also stitch our pairs together to make our sequences longer and of higher quality for alignment. We can trim and filter our raw sequences using many different tools, but in this course we will be using SHI7 (pronouned shizen). This program will do all the quality control we need and produce a final combined file sequence file that we will use as the input for OTU picking (to determine which bacteria are in our samples). The input that SHI7 requires is a directory of .fastq files, where each sample as its own fastq file. What's a fastq file? A .fastq file contains our DNA sequences as well as other information regarding the quality of the sequencing reaction. Each sequence within a .fastq has four lines of information: • Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description • Line 2 is the raw sequence letters (A,T,G,C...) • Line 3 begins with a '+' character and is optionally followed by the same sequence identifier • Line 4 encodes the quality values for the sequence in Line 2, and contains the same number of symbols as letters in the sequence What's a fasta file? A .fasta or .fna file contains our DNA sequences only. Each sequence within a .fasta has two lines of information: • Line 1 begins with a '>' character and is followed by a sequence identifier and an optional description • Line 2 is the raw sequence letters (A,T,G,C...) The .fasta or .fna format is the input format required for OTU picking. When we quality control our sequences we also convert them from .fastq to .fna. How Do We Quality Control Sequences? SHI7 is currently installed in the shared directory, and our .bashrc file contains information to have MSI use this program as if it were installed as a module. The full documentation for SHI7 is located here: https://github.com/knights-lab/shi7. To run SHI7 picking, we will use the main SHI7 command with the following parameters: shi7.py -i directory_with_fastqs/ -o qc_reads_output The input file path is to the sequences you want to process (-i). This should be a directory with one fastq per sample. The name of each fastq should be the sample ID followed by the .fastq file extension. The output directory is where you want your final clean sequence file to be (-o). There are some optional parameters to use depending on the dataset. The include: -SE # # # # # # This will use single-end mode. If you don't have paired reads, use -SE -trim_q 32 Trim sequences based on quality, default is 20, increase to 32 if sequencing run is old (before 2015) --adaptor Nextera You can specifically take out the adapter that was used In most recent sequencing it's Nextera adapters --strip_underscore T # You can process the file names to keep the first part For the full list of options, you can type: shi7.py -h This is telling SHI7 to print the help page. To run this command, we must submit a job file. Your job file should look like this: #!/bin/bash -l #PBS -l nodes=1:ppn=16,mem=2Gb,walltime=6:00:00 #PBS -m abe #PBS -M x500@umn.edu #PBS -o job_name_stout #PBS -e job_name_stderr cd /home/biol1961/x500 module load python shi7.py -i directory_with_fastqs/ -o qc_reads_output Of course, you have to modify the file to specify YOUR file paths and the outputs to what you want. Things that cannot change include: • The first 3 lines. The nodes and ppn must be 1 and 16 for the lab queue • The __module load python (SHI7 requires python to run) All the parameters for the actual command must be all on one line in the job file, with a space between the parameter letter and value. In the above example they are on separate lines so that you can read them easily. The files and paths must be specific to you. As mentioned above, the script will run its own jobs after it has started. You will know quallity control is done when you have the following files in your output directory (-o): shi7.log (all the information about the quality control) combined_seqs.fna (your combined and cleaned sequences in .fna format) You will also have some files that were generated by the jobs submitted by this script. They include: job_name_stout (the standard out captured by the job submission) job_name_stderr (the errors captured by the job submission) Picking OTUs What is OTU picking? OTU picking is how we take our 16S DNA sequences and assign them to an OTU identifier. An operational taxonomic unit (OTU) is a cluster of similar 16S sequence variants. Each cluster is meant to represent a taxonomic unit of bacteria (species, genus, phylum..) depending on the sequence similarity threshold. OTU clusters are usually defined by a 97% identity threshold of the 16S gene sequence variants. There are three main types of OTU picking that we can do. De-novo • Doesn't use a reference database • Majority of the reads are clustered • Very slow • Erroneous reads get clustered • Cannot assign taxonomy Closed Reference • Reference database is quality filtered • Faster because you can use parallel computation • No new OTUs can be observed • Reference database bias • Uses the GreenGenes database of all known 16S • Can assign taxonomy Open Reference • Combines the two approaches • No data is thrown out • De-novo clustered OTUs cannot be assigned taxonomy How do we actually pick OTUs? Depending on the OTU picker you choose to use, OTU picking can be very computionally heavy. This means it can require a lot of time and resources. Thanks to the development efforts of microbiome researchers, we have been able to speed up this process immensely. The OTU pickers within QIIME are currently not the gold standard, so we will use a different OTU picker that is installed seperately. For this course we will used a closed reference OTU picker called NINJA, which stands for NINJA Is Not Just Another aligner. NINJA is currently installed in the shared directory, and our .bashrc file contains information to have MSI use this program as if it were installed as a module. The full documentation for NINJA is located here: https://github.com/GabeAl/NINJAOPS. To run OTU picking, we will use the main NINJA command with the following parameters: ninja.py -i combined_seqs.fna -o ninja_otus -m normal -p 4 -z -d 2 The input file path is to the sequences you want to align (-i). These should be the output of the quality control we did earlier. The output directory is where you want your final OTU table to be (-o). The -m parameter set to normal tells NINJA to run at medium sensitivity (to maximize the speed to accuracy ratio). We will use 4 threads (-p), and we will search both DNA strands (z). We will also set denoising to 2 (-d), which means we will discard any sequences that appear less than 2 times. To run this command, we must submit a job file. Your job file should look like this: #!/bin/bash -l #PBS -l nodes=1:ppn=16,mem=2Gb,walltime=6:00:00 #PBS -m abe #PBS -M x500@umn.edu #PBS -o job_name_stout #PBS -e job_name_stderr cd /home/biol1961/x500 module load python bowtie2 ninja.py -i combined_seqs.fna -o ninja_otus -m normal -p 4 -z -d 2 Of course, you have to modify the file to specify YOUR file paths and the outputs to what you want. Things that cannot change include: • The first 3 lines. The nodes and ppn must be 1 and 16 for the lab queue • The __module load python bowtie2 (NINJA requires python and bowtie2 to run) All the parameters for the actual command must be all on one line in the job file, with a space between the parameter letter and value. In the above example they are on separate lines so that you can read them easily. The files and paths must be specific to you. As mentioned above, the script will run its own jobs after it has started. You will know OTU picking is done when you have the following files in your output directory (-o): ninja_log.txt (all the information about the alignment) ninja_otutable.biom (your otu table) You will also have some files that were generated by the jobs submitted by this script. They include: job_name_stout (the standard out captured by the job submission) job_name_stderr (the errors captured by the job submission) If you want to pick otus again, you should delete these files prior to submitting another OTU picking job. What's an OTU table? The output of ninja.py is an OTU table in .biom format. When in .biom format the file is not in a form we can read easily. We must first convert the table to .txt file to view it. Converting .biom files A .biom file is a way to package a lot of information in a way that doesn't take up too much space. Because all the information is compact, it makes the file not human readable. If you call head on a .biom file, the output will look mostly like gibberish. What is important is that QIIME and other microbiome softwares use .biom files because they are smaller and fast to work with. If you want to put your OTU table in a human-readable format you have to convert it to a tabdelimited file. We will cover this later. Biom Summaries OTU Tables in .biom format We store out OTU tables in two different formats, either as a tab-delimited text file (.txt) or as a compact, human non-readable biom format (.biom). When we store the table as a biom file, we cannot easily look in the file to see how many OTUs or samples there are, but we can access a summary of the file using some biom commands through QIIME. Biom Summary We can summarize our OTU table with the biom summarize-table command while using QIIME interactively: ssh lab cd /home/biol1961/x500 module load qiime/1.8.0 biom summarize-table -i file/path/to/otu_table.biom -o OTU_summary.txt The input file would be the file path the YOUR OTU table and the output can be whatever you would like to name the summary file. This command will make a text file that contains a summary of your OTU table. We can look at the nonctents of the .txt file by using nano (the text editor on MSI). nano OTU_summary.txt You can see an example of the summary file above. It tells us: * The number of samples * THe number of observations (OTUs) * The minimum, maximum, median, mode and standard deviations of of the number of counts per sample * The taxonomy is stored as the observation metadata * A list of how many counts are in each sample Rarefaction What is Rarefaction? In microbiome research, diversity represents the number OTUs within in a data set. This number can be greatly impacted with different sequencing depths. For example, the deeper you sequence the more species you will find. This is a problem, especially if you sequence 50,000 reads from one sample and only 100 reads from another sample. You would likely find more species in the sample that is deeply sequenced (50,000 reads) in comparison to the one that was shallowly sequenced (100 reads). Below is an example where we are going to sequence one sample three times. Each colored dot represents a microbe and each color represents a different species. Through the process of DNA isolation, 16S PCR amplification, sequencing, quality trimming and OTU picking we can lose information or sequences. To prevent any bias we may see in our diversity analysis we can rarefy our data. A rarefaction is a random collection of sequences from a sample, with a specified number of sequences (depth). For example, a rarefaction with a depth of 1000 reads per sample is a simulation of what your sequencing results would look like if you sequenced exactly 1000 reads from each sample. By rarefying our OTU table we can fairly measure alpha diversity across samples. Exploring Rarefied Data with Alpha Diversity In QIIME, we can first explore our data by looking at alpha diversity across multiple different sequencing depths. This task is performed using the alpha_rarefaction.py command that takes your OTU table and makes a directory full of many OTU tables, all of which are repeats of rarefactions at specfifc depths. The output of this command allows us to visualize how measurements in alpha diversity will change across a range of sequence depths per sample. Once we know how the diversity changes with depth, we can create one final rarefied table to use for our alpha diversity and beta diversity calculations. The command is run using these parameters: alpha_rarefaction.py -i input_file_path -o output_directory -t tree_file_path -m metadata_file_path The input file path is to the otu table in .biom format that you want to rarefy (-i). The output directory is where you want your final results to be (-o). The tree file path is the location of the GreenGenes 97 percent phylogeny tree (-r). The alpha_rarefaction.py command will do multiple things: (1) Create multiple rarefied OTU tables at ten step increments (+ ten sequences each time) starting at a minimum level of ten sequences and stopping at the median number of sequences per sample (2) Run alpha_diversity.py on each of the rarefied OTU tables using the chao1, observed_species, and phylogenetic distance metrics (3) Collates the results for each metric at the various depths into one table per metric, within the alpha_div_collated/ subdirectory (4) Plots the different metrics for each category in the metadata file and places those within the alpha_rarefaction_plots/ subdirectory (5) Deletes all of intermediate the OTU tables it had generated to do the analysis (6) Creates a log file and overall rarefaction plot within the main output directory To run this QIIME command, we must submit a job. Your job file should look like this: #!/bin/bash -l #PBS -l nodes=1:ppn=16,mem=2Gb,walltime=6:00:00 #PBS -m abe #PBS -M x500@umn.edu #PBS –o job_name_stout #PBS -e job_name_stderr cd /home/bioltrm1/x500 module load qiime/1.8.0 alpha_rarefaction.py -i file/path/to/otu_table.biom -o file/path/to/whatever_you_want -t /home/biol1961/shared/97_otus.tre -m /home/biol1961/shared/map.txt Of course, you have to modify the file to specify YOUR file paths and the outputs to what you want. Things that cannot change include: * The first 3 lines. The nodes and ppn must be 1 and 16 for the lab queue * The module load qiime/1.8.0 We must always load QIIME and it must be qiime/1.8.0 * The –t must be to the 97 percent GreenGenes tree All the parameters for the actual command must be all on one line in the job file, with a space between the parameter letter and value. In the above example they are on separate lines so that you can read them easily. As mentioned above, the script will run its own jobs after it has started. You will know alpha_rarefaction.py is done when you have the following files in your output directory (-o): alpha_div_collated/ (one table per metric in here) alpha_rarefaction_plots/ (plots per metadata column) log_##.txt (log file) rarefaction_plots.html (overall plot) You will also have some files that were generated by the jobs submitted by this script. They include: job_name_stout (the standard out captured by the job submission) job_name_stderr (the errors captured by the job submission) If you want to run the command again, you should delete these files prior to submitting another job. To look at your plots, you must transfer the entire alpha_rarefaction.py output folder from MSI to your computer. The rarefaction_plots.html file needs other information supplied within the subfolders. You can move through all of the plots by selecting different categories and diversity metrics. Creating a Rarefied OTU Table How do you pick a depth? You want to pick a depth that: (1) Keeps as many samples as possible (isn’t too high) (2) Isn’t so low that samples aren’t representative of the total diversity In the above examples the left plot shows all samples and the right plot shows the mean number of observed species according to body site. If we chose a depth of 2000 sequences, we would loose some the orange samples on the right (saliva). We know this because on the group plots, the line will stop where at least one sample in that group no longer has that many sequences. The depth you choose is entirely up to you! But you should be able to justify why you pick it. Normally we try to pick a depth where the rarefaction curves begin to level off. This means for each increase in the number of sequences, we are not detecting any (or very few) new OTUs. In the above example we would probably rarefy at no lower than 1,500. Creating your Rarefied OTU Table Once you have picked a depth based on the alpha_rarefaction.py outputs, you are ready to create a rarefied OTU table. To do this, we use the single_rarefaction.py command with the following parameters: single_rarefaction.py -i input_file_path -o output_file_path -d number_of_sequences The input file path is to the OTU table in .biom format that you want to rarefy (-i). The output file path is where you want your final rarefied OTU table to be (-o, make sure the name is different from the original!). Both the input and output will be .biom files. The last parameter, (d) is the depth you have chosen based on the plots generated by alpha_rarefaction.py. Remember, all the parameters for the actual command must be all on one line in the job file, with a space between the parameter letter and value and must be specific to you. The output of this command will be the OTU table you use for alpha and beta diversity calculations. Filtering OTU Tables Why do we Filter Samples From an OTU Table? Filtering low depth samples from an OTU table can be used as an alternative for rarefying an OTU table. Rarefying results in taking only a small fraction of the original data. It causes an increase in two types of error: Type I * "Decreased specificity" or an increased likelihood in saying two groups are different when they aren't * Caused by rarefied samples remaining over-dispersed (a small number of sequences come from a variety of sources) Type II * "Loss of power" or "decreased sensitivity" to detect real differences between groups * Caused by valuable information about diversity being thrown out Let's discuss rarefaction by looking at the biom summary for an original OTU table and the rarefied version. Biom Summary of Original OTU table | Biom summary of rarefied OTU table (2000 seqs) Notice how in the above example, the first samples in the original OTU table are low in sequence number. One sample has only 4 sequences. The last sample listed has over 7,000 sequences, and that is still low compared to the rest of the sequences in the dataset (the median is 12,121 sequences). The table on the right is the same OTU table rarefied to 2,000 sequences. Notice that the samples below 2000 sequences are thrown out, and that the remaining samples are subsampled to an even depth of 2,000. That means most samples have lost about 10,000 sequences! That's is a lot of information to throw out! We should note, that rarefying our data is still the gold standard when measuring alpha and beta diversity. When looking for specific taxa, however, we can filter low-depth samples from our OTU table and keep the full depth of sequences for the rest of the samples. We can then account for differences in sequencing depth by transforming the data later. We accomplish the filtering of low-depth samples through the filter_samples_from_otu_table.py command in QIIME. Normally, we keep samples that have >1,000 sequences and throw out the others. You can choose to go higher or lower depending on the sequencing results. How do We Filter an OTU Table? In QIIME, this task is performed on your OTU table. The QIIME command filter_samples_from_otu_table.py takes your OTU table and makes a new version of both based on the filtering parameters you set. The command is run using these parameters: filter_samples_from_otu_table.py -i input_file_path -o output_file_path -n number_of_sequences The input file path is to the original OTU table in .biom format that you want to filter (-i). The output file path is where you want your filtered OTU table to be (-o). The minimum number of sequences a sample must have to remain in the OTU table is set with the last parameter (-n). There are other options you can use to filer your OTU table, such as: -s valid_states --sample_id_fp path_to_text_file --negate_sample_id_fp path_to_text_file -m max_sequence_count The valid states let you specify a mapping column and values in that column that a sample must be associated with to remain in the OTU table (-s). For example, if we sampled people from different locations in the Twin Cities and the collection location was under a header called 'Location' in the mapping file. If we wanted to keep only samples collected from Uptown and Downtown and not St Paul, you could use: -s Location:Uptown,Downtown. You could also use a sample ID text file, with one sample ID per line, to list which samples to keep (--sample_id_fp), or which to be remove from the OTU table (--negate_sample_id_fp) . You can also filter using the maximum number of sequences a sample can have to remain in the OTU (-m). For the full list of parameters and how to use them, you can look at the command page on the QIIME webpage: http://qiime.org/scripts/filter_samples_from_otu_table.html The filter_samples_from_otu_table.py command will create a new OTU table in biom format containing only the samples that meet the filtering criteria. To run this QIIME command, we can use QIIME interactively. ``` { r, eval=F} ssh lab cd /home/biol1961/x500 module load qiime/1.8.0 filter_samples_from_otu_table.py -i file/path/to/otu_table.biom -o file/path/to/whatever_you_want.biom -m home/biol1961/shared/map.txt -output_mapping_fp file/path/to/whatever_you_want.txt -n number_of_sequences ``` Of course, you have to modify the file to specify YOUR file paths and the outputs to what you want. For the HMP dataset in the examples, the file path for the map would remain the same. All the parameters for the actual command must be all on one line in the job file, with a space between the parameter letter and value. In the above example they are on separate lines so that you can read them easily. We should use a depth no lower than 1000 sequences. Converting Table Types ##.biom to .txt We can convert .biom files to.txt using biom convert on MSI. To do so, QIIME should be loaded for interactive use. This command will convert the .biom file to a tab-delimited .txt file. It will take in the OTU table (in .biom format, -i) and will output a new .txt file (-o). Specifying -b means we are going from .biom to .txt and --header-key specifies it has taxonomy. Like other biom commands, this command must be run when QIIME has been loaded in interactive mode. Again, this command should be run all in one line, with the parameters separated by one space. ssh lab cd /home/biol1961/x500 module load qiime/1.8.0 biom convert -i table.biom -o table_from_biom_w_taxonomy.txt -b --header-key taxonomy This text file of your OTU table will look something like this. The columns in the OTU table are the samples. The rows are the OTU IDs. The column header for the OTU IDs is always #OTU ID. In the text file, each column is separated with a 'tab'. When we open this tab-delimited text file in Excel, Excel knows to read each tab as a new column. The values in the OTU table are the counts for that OTU ID in each sample. For example, if we look at the first OTU (189503) we can see it occurs 34 times in sample A, 19 times in sample B, and so on. Notice that the last column is not a sample, it is the taxonomy. .txt to .biom We can convert out OTU tables from .txt to .biom using biom convert on MSI. This might be important to do after we filter and normalize our tables (if we want to use them with QIIME again). To do so, QIIME should be loaded for interactive use. ssh lab cd /home/biol1961/x500 module load qiime/1.8.0 biom convert -i normalized_table.txt -o normalized_table.biom --table-type "OTU table" --header-key taxonomy (This should be written all on one line!) Alpha Diveristy What is alpha diversity? Alpha diversity measures how many different things are within a particular area or ecosystem, and is usually expressed by the number of species (i.e., species richness) in that ecosystem. In our case, the ecosystem in question is the sample type we are analyzing (stool, soil, skin.). It's important to remember that alpha diversity is within a sample, which is what makes it different than beta diversity, which we will talk about later in the course. The amount of diversity in any community is extremely important in determining ecological dynamics (e.g. community productivity, stability, and resilience). For humans, there is a great deal of data demonstrating that the ancestral human gut microbiome is more diverse than the modern one, and that this lower diversity is highly correlated with a number of important diseases. Therefore, alpha diversity is an important phenotype in microbiome research. Different metrics have been developed to calculate alpha diversity. Some of these include: Richness: A measure of the number of OTUs present in a sample Evenness: How many of each OTU is present in a sample Phylogenetic relationship: Accounts for taxonomy and phylogenetic relationships Richness and evenness for one sample in a microbiome study. Example of a phylogenetic tree. What are Diveristy Metrics? Below are some common alpha diversity metrics use in microbiome research. There are numerous other metrics available in QIIME, but we don't need to cover all of them for Biology 2002. Observed Species The simplest diversity index; it is just the number of OTUs. Chao1 estimator This is commonly used, and is based upon the number of rare OTUs found in a sample The problem with this metric is that if a sample contains many singletons (OTUs that happen just once, usually by sequencing error) the Chao 1 index will estimate greater species richness than it would for a sample without rare OTUs. This problem is avoided if we first filter the rare OTUs from out OTU table. In the equation Sobs is the number of species in the sample, F1 is the number of singletons (i.e., the number of species with only a single occurrence in the sample) and F2 is the number of doubletons, which is the number of species with exactly two occurrences in the sample (Colwell, et al. 1994). Shannon index This index accounts for both abundance and evenness of the species present. It assumes all species are represented in a sample. In the Shannon index, p is the proportion (n/N) of individuals of one particular species found (n) divided by the total number of individuals found (N), ln is the natural log, ?? is the sum of the calculations, and s is the number of species (CISN. 2010). Simpson index The Simpson index is actually a similarity index, so the higher the value the lower diversity in the sample. It gives more weight to common or dominant species. In the Simpson index, p is the proportion (n/N) of individuals of one particular species found (n) divided by the total number of individuals found (N), ?? is still the sum of the calculations, and s is the number of species (CISN,. 2010). Phylogenetic Distance (PD Whole Tree) The phylogenetic distance metric used most often is PD whole tree. It is the sum of all phylogenetic branches connecting OTUs together within a community. PD is the sum of ED for each species (i) in the sample. ED is the evolutionary distinctiveness. It is calculated by the second equation where for species i in tree (T), ED is the sum of edge of length ??? in the set s(T,i,r) connecting species i to the root (r) and Se is the number of species that descend from edge e (Cadotte, et al. 2010). Below is a figure showing the components of a phylogenetic tree for reference. The components of a phylogenetic tree (Vellend, et al. 2011.) So, as you can see, each one of the diversity metrics is slightly different, each with it's advantages and disadvantages. In terms of measuring richness and evenness, each metric is summarized below. Summary of Diversity Metrics Metric Measurement Observed Species Richness Chao1 Richness & Evenness Shannon Richness & Evenness Simpson Richness & Evenness PD Whole Tree Phylogeny How do we calculate alpha diversity in QIIME? In QIIME, we can use our rarefied or filtered OTU table to calculate alpha diversity. Earlier we used alpha diversity metrics to determine a reasonable rarefaction depth or a reasonable sequencing depth as filtering cutoff. Now we will use the alpha_diversity.py command in QIIME to make a final alpha diversity calculation for each sample. The command is run using these parameters: alpha_diversity.py -i file/path/to/otu_table.biom -o file/path/to/alpha_diversity.txt -m metrics,to,use -t file/path/to/tree The input file path is to the filtered or rarefied otu table in .biom format (-i). The output file path is where you want your alpha diversity table to be (-o). The metrics are what you would like to use as an estimate of diversity, and should be a comma separated list with no spaces (m). The tree file path is to the GreenGenes 97% OTU tree (-t). For a full list of the metrics available and how to spell them, you can type: alpha_diversity.py -s The output is: Known metrics are: ace, berger_parker_d, brillouin_d, chao1, chao1_ci, dominance, doubles, enspie, equitability, esty_ci, fisher_alpha, gini_index, goods_coverage, heip_e, kempton_taylor_q, margalef, mcintosh_d, mcintosh_e, menhinick, michaelis_menten_fit, observed_otus, observed_species, osd, simpson_reciprocal, robbins, shannon, simpson, simpson_e, singles, strong, PD_whole_tree For the full list of parameters and how to use them, you can look at the command page on the QIIME webpage: http://qiime.org/scripts/alpha_diversity.html How do we run aplha_diversity.py? To run this QIIME command, we can use QIIME interactively. Note: You should be using your rarefied OTU table! ssh lab cd /home/biol1961/x500 module load qiime/1.8.0 alpha_diversity.py -i file/path/to/otu_table.biom -o file/path/to/alpha_diversity.txt -m shannon,simpson,choa1,PD_whole_tree -t /home/biol1961/shared/97_otus.tree All the parameters for the alpha_diveristy.py command must be all on one line, with a space between the parameter letter and value. In the above example they are on separate lines so that you can read them easily. What does alpha_diversity.py give us? The output of the alpha_diversity.py command is a table, where the columns are the different diversity metrics and the rows are samples. We can then use this table to make alpha diversity plots, to visual our findings. We can also test to see if the alpha diversity is significantly different between and across different sample types. References Cadotte MW, et al. 2010. Ecology Letters.13: 96-105. Colwell, R.K. and Coddington, J.A. 1994. Philosophical Transactions of the Royal Society: Biological Sciences. 345:101-118. Community Invasive Species Network (CISN). 2010. How to Calculate Biodiversity. http://www.protectingusnow.org Vellend M, et al. 2011. Biological diversity: frontiers in measurement and assessment. Oxford University Press. Beta Diversity What is Beta Diversity? In his 1972 publication in Taxon, "Evolution and Measurement of Species Diversity", R. H. Whittaker laid out the terms and concepts for how we think about and define biodiversity. His idea was that the total species diversity in a landscape (𝛾 or gamma-diversity) (e.g. ALL human gastrointestinal (GI) tracts) is determined by two different things: | --------|----------------------------- 1) Alpha diversity | the mean species diversity at the habitat level | 𝛼 | e.g. one person's GI tract 2) Beta diversity | the differentiation among habitats | 𝛽 | e.g. different people's GI tracts The total diversity, gamma, is alpha multiplied by beta: 𝛾 = 𝛼 *𝛽 We have already discussed alpha diversity and have compared the average alpha diversity of samples across body sites. We found that indeed, there are significant differences in alpha diversity between body sites. Now, we are interested in looking at the difference (the ecological distance) in the community members between samples (e.g. individuals) and groups of samples (e.g. body sites). For example, let's say you are comparing the biological communities of a 20m2 patch of the Great Barrier Reef (right) and a 20m2 of the Amazon rainforest (left). Both of these habitats have very high alpha (𝛼) diversity. However, despite similarly high alpha diversity, if you were to compare the composition these two communities at the macroscopic level, they are almost completely non-overlapping. Therefore, they would also have a very high beta diversity (𝛽). This however, is a very extreme example. Let's say that instead of comparing a single patch of coral reef and a single patch of rainforest, you compare multiple patches of 5 different coral reefs to each other and multiple patches of 5 different rainforests to each other. You might find that the average alpha diversity is about the same for coral reefs and rainforests, but beta diversity is significantly higher for rainforests than for coral reefs. That would mean that different rainforests have species that differ from each other. In another example, you sequence the GI tract microbiota of 100 healthy adults. Fifty individuals have been taking regular low doses of aspirin for the past 30 days. The other half of study subjects have been taking a placebo. You find that the alpha diversity for the treatment group is not significantly different from the control group. However, the beta diversity for the treatment group is significantly higher. What would that mean? What are Beta Diversity Metrics? If you remember there were multiple diversity metrics that we used for alpha diversity. Similarly, there are multiple beta diversity metrics. Below we will cover the most widely used distance metrics for beta diversity. UniFrac Distance This is the most widely used index. The unique fraction metric, or UniFrac, measures the phylogenetic distance between sets of taxa in a phylogenetic tree. It counts the branch lengths of the tree that lead to taxa from either one environment or the other, but not both (Lozupone 2005). In the equation, UAB is the UniFrac distance between sample A and B, where unique = total unique branch length (cumulative branch lengths that lead to OTUs observed only in sample A or sample B) and observed = total branch length (cumulative branch lengths that leads to all OTUs in samples A or B). This metric is sensitive but also has emphasis on minor differences in the tree (Fukuyama, 2012). Weighted Unifrac Distance In the above example, the relative abundances of taxa is not taken into consideration (referred to as unweighted UniFrac distance). There is a second metric known as weighted UniFrac distance, that weights each OTU based on it's relative abundance. Both metrics are criticized for giving either too much (unweighted) or too little (weighted) value to rare taxa, but both have value in showing different aspects of community diversity. Bray-Curis Dissimilarity Bray-Curtis takes the sum of the differences in OTU abundances over the sum of the total OTU abundances between samples. In the equation xi is the abundance of OTU x in sample i, and xj is the abundance of OTU x in sample j. If an OTU is absent then its abundance should be recorded as zero. The Bray-Curtis metric ranges from 0 to 1, where 0 means the two samples have the same composition and 1 means the two samples do not share any OTUs (Gardener, 2016). This metric does not take relatedness of the otus into consideration (phylogeny). So as you can see, each one of the diversity metrics is slightly different, each with it's advantages and disadvantages. In terms of measuring abundance and phylogenetic differences, each metric is summarized below. Summary of Beta Diversity Metrics Metric | Phylogeny? | Abundance ------------ | ------------ | ------------ Unweighted UniFrac | yes | no Weighted UniFrac | yes | yes Bray-Curtis | no | yes How Do We Calculate Beta Diversity in QIIME? We will use the beta_diversity.py command in QIIME to calculate beta diversity metrics between samples and groups. This command will return a matrix of the distances of all samples to all other samples. This can be visualized as a graph of points, a network, or any other creative method you can come up with. We should note that sequencing depth can have an effect on beta diversity analysis, just as it does on alpha diversity. beta_diversity_through_plots.py -i file/path/to/otu_table.biom -o file/path/to/beta_diversity -m file/path/to/mapping_file.txt -t file/path/to/tree -p file/path/to/parameters_file.txt -a run_parallel -O job_to_run -e sequences_per_sample The input file path is to the filtered or rarefied OTU table in .biom format (-i). The output file path is where you want your beta diversity output to be (-o). The tree file path is to the GreenGenes 97% OTU tree (-t). The metrics are what you would like to use as an estimate of beta diversity are supplied in the parameters file (-p). To run this in parallel (-a), we must specify the number of jobs (-O, the lab queue max is 6). If we didn't rarefy our OTU table, but want an even depth for all the samples we could also specify the depth (-e). For the full list of parameters and how to use them, you can look at the script page on the QIIME webpage: http://qiime.org/scripts/beta_diversity_through_plots.html The beta_diversity_through_plots.py command will do multiple things: 1. Create jobs within the jobs/ folder it creates, as well as output (.o##) and error files (.e##), and a pbs_nodefile.txt file (just like the otu picking script) 2. Randomly subsample otu_table.biom to even number of sequences per sample (specified with -e) 3. Run beta_diversity.py for the diversity metrics wanted (specified with the parameters file via -p) and create distance matrices in the main output directory (metric_dm.txt) 4. Perform a principal coordinates analysis on the result of Step 3 in the main output directory (metric_pc.txt) 5. Generate a 2D and 3D plots for all mapping fields in the metric_emperor_pcoa_plot/ subdirectories 6. Deletes all of intermediate generated to do the analysis 7. Creates a log file and overall rarefaction plot within the main output directory To run this QIIME command, we can use QIIME interactively on MSI: ssh lab module load qiime/1.8.0 beta_diversity_through_plots.py -i file/path/to/otu_table.biom -o file/path/to/whatever_you_want -t /home/biol1961/shared/97_otus.tree -m /home/biol1961/shared/map.txt -p /home/biol1961/shared/parameters.txt Of course, you have to modify the file to specify YOUR file paths and the outputs to what you want and it should all be on one line. All the parameters for the actual command must be all on one line in the job file, with a space between the parameter letter and value. In the above example they are on separate lines so that you can read them easily. As mentioned above, the command will run its own jobs after it has started. You will know beta_diversity_through_plots.py is done when you have the following files in your output directory (-o): metric_pc.txt (one table per metric, 3 total) metric_dm.txt (one table per metric, 3 total) metric_emperor_pcoa_plot / (one per metric, 3 total) log_##.txt (log file) To look at your plots, you must transfer the entire plot folder from MSI to your computer. The plot file needs other information supplied within the subfolders. Manipulating 3D Plots Once you have moved the entire plot folder for your metric of choice to your computer, you can click on the .html file with the directory to load the plot. There are many parameters about the plot you can change. | ------------------ | ---------------------------- Colors | Change what covariate to color by Visibility | Make some samples more transparent Scaling | Make some samples larger or smaller Labels | Add sample labels Axes | Change which principal coordinates are plotted Options | Change backgroung/axes colors and save as image References Fukuyama J, et al. 2012. Pacific Symposium on Biocomputing. 2012:213-24. Gardener M. 2016. DataAnalytics.org.uk. (http://www.dataanalytics.org.uk/Publications/Writers%20Bloc/Distance%20metrics.htm). Lozupone C and Knight R. 2005. Applied and Environmental Microbiology. 71:8228-8235. Ordination When we want to look at high-demension data, one way to easily visualize similarities and differences is ordination. The type of ordination plots we will learn about and generate are Principal Component (PCA) and Coordinate Analyses (PCoA). What is PCA and PCoA? It is a way of identifying patterns in data and expressing data in such a way as to highlight their similarities and differences. Since our data can be of high dimensions, finding the patterns can be hard and this is where PCA and PCoA are powerful tools for analyzing data. The other main advantage of PCA/PCoA is that once you have found these patterns in the data you can compress the data by reducing the number of dimensions and visualize it. This concept of dimension reduction can be very tricky to grasp. Understanding all the math behind PCA and PCoA is out of the scope of this class. We will, however, try to understand how the data is reduced, what we are actually plotting, and how to to accomplish this in R. Below are some tutorials to help us understand PCA. 1. Please read the following website: https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummieseigenvectors-eigenvalues-and-dimension-reduction 2. Please read the following website: http://setosa.io/ev/principal-component-analysis PCA vs PCoA From the websites listed above we have learned about PCA. So what’s PCoA? PCoA is similar PCA, however, PCoA can handle distances generated from any similarity or dissimilarity measure, such as Bray–Curtis and both weighted and unweighted UniFrac metrics. PCoA can also handle quantitative, semi-quantitative, qualitative, and mixed variables. Similar to PCA, PCoA produces a set of uncorrelated axes to summarize the variability in the data set. Each axis has an eigenvalue whose magnitude indicates the amount of variation captured in that axis. The proportion of a given eigenvalue to the sum of all eigenvalues reveals the relative 'importance' of each axis. A successful PCoA will generate a few (2-3) axes with relatively large eigenvalues, capturing most of the variation in the input data, with all other axes having small eigenvalues. Interpretation of a PCoA plot is straightforward: objects closer to one another are more similar than those further away. Similarity or dissimilarity is defined by the measure used in the construction of the (dis)similarity matrix used as input. PCoA can handle a wide range of data, but the original variables cannot be recovered. This is because PCoA takes a matrix derived from the original data as input and not the original variables themselves. The beta_diversity_through_plots.py command gives us PCoA plots of our data. Later in the course we will also learn how to generate these plots ourselves in R. Summarizing Taxa What are Taxa Summaries? Summarizing taxa is a way to visualize which taxa are found in our samples. When we summarize taxa we can use the various levels of taxonomy. The following levels are those denoted by GreenGenes for taxonomy. Level | Taxonomy | Example ------ | ------------------ | ------------------------- 1 | Kingdom | Bacteria 2 | Phylum | Actinobacteria 3 | Class | Actinobacteria 4 | Order | Actinomycetales 5 | Family | Streptomycetacaea 6 | Genus | Streptomyces 7 | Species | mirabilis In QIIME, we can use levels 2-6 to summarize taxa. We can't use level 1 because that would result in no summary (all of our OTUs are bacteria). We also cannot use level 7, because using 97% identity of a 16S gene cannot resolve species from one another (for the most part). We summarize taxa with the summarize_taxa_through_plots.py command in QIIME. How do we actually summarize taxa? In QIIME, this task is performed on your rarefied or filtered OTU table. It must be a COUNT table, not relative abundance. The QIIME command summarize_taxa_through_plots.py takes your OTU table and collapses the table into the various taxonomic levels. It will then plot the taxa summarize for us. The command is run using these parameters: summarize_taxa_through_plots.py -i file/path/to/otu_table.biom -o file/path/to/summary_output -m file/path/to/mapping_file.txt -c category_to_use The input file path is to the filtered or rarefied OTU table in .biom format (-i). The output file path is where you want your taxa summary directory to be (-o). The mapping file path is the location of the mapping file (-m). The category you would like to use for the summarize must be a column header in the mapping file (-c). If you leave this parameter out, QIIME will make the summaries and plots using the entire OTU table. For the full list of parameters and how to use them, you can look at the command page on the QIIME webpage: http://qiime.org/scripts/summarize_taxa.html The summarize_taxa_through_plots.py command will: 1. Create an output directory named whatever you specified for -o 2. Create OTU tables collapsed at each taxonomic level (2-6) with samples grouped according to your -c parameter 3. Create taxa summary plots in a subdirectory called taxa_summary_plots Below is an example of the contents of the output directory. The taxa summary was produced using the category 'sex' from the mapping file. If you want to look at the taxa summary plots, you must move that entire subdirectory to your personal computer to view the html plots. To run this QIIME command, we must submit a job file. Your job file should look like this: #!/bin/bash -l #PBS -l nodes=1:ppn=16,mem=2Gb,walltime=6:00:00 #PBS -m abe #PBS -M x500@umn.edu #PBS -o job_name_stout #PBS -e job_name_stderr cd /home/biol1961/x500 module load qiime/1.8.0 summarize_taxa_through_plots.py -i file/path/to/otu_table.biom -o file/path/to/whatever_you_want -m file/path/to/mapping_file.txt -c category_to_use All the parameters for the actual command must be all on one line in the job file, with a space between the parameter letter and value. In the above example they are on separate lines so that you can read them easily. Below is an example of what the taxa summary plots from QIIME look like. This example is using the category 'sex' from the mapping file. The plots produced by QIIME are not pretty or easy to read. To make easier to interpret taxa summary plots, we can use R. Example taxa summary from QIIME Plotting and Statistics "There is no statistical tool that is as powerful as a well chosen graph" -Chanmbers et al. 1983 When we begin to analyze our data it is important to be able to visualize our observations. Why is this? • Plots are more effective in creating interest and in appealing the attention of others • Visual relationships are more easily grasped and remembered • Plots save time, since trends and differences can be visualized at a glance • Plots can bring out hidden trends and relationships and aid in analytical thinking Variables and Data Types Before we try and visual our data we need to identify our variable and data types. This list is not exhaustive, but includes the main characteristics we need to think about. When we talk about our data, usually the dependent variable will be a measurement of the OTUs or taxa (for example, alpha diversity measurements) and the independent variable will be something from our mapping file. Sometimes we will call the independent variable in the mapping file a covariate, which is a variable that might be predictive of the outcome of the study. Variables Dependent Variables are what we measured in the experiment and what were affected during the experiment. The dependent variable responds to the independent variable. You cannot have a dependent variable without an independent variable. On a graph, this is the y variable. Independent Variables are the variables we have control over, what we can choose and manipulate. They are usually what we think will affect the dependent variable. In some cases, we may not be able to manipulate the independent variable. It may be something observational that is already there and is fixed (sex, disease status, color). On a graph, this is the x variable. Data Types Qualitative data is descriptive, it is observed and not measured. It is often categorical (color, smell, taste). Quantitative data is numeric and can be counted or measured (length, height, volume, weight). Discrete data can only take on a finite number of values, and is counted. All qualitative variables are discrete. Some quantitative variables are discrete, such as disease score rated as 1,2,3,4, or day sampled if people were only sampled on a specific finite number of days (day 1 and 15 only). Continuous data can take on any value in a certain range. No measured variable is truly continuous, however, discrete variables measured with enough precision can often be considered continuous for practical purpose (like age measured per day, or weight). Types of Plots There are numerous types of plots we can use. Here are a few very common types of plots, and a brief explanation as to what type of data they use, what they display, and when we should use them. Pie Chart Pie charts are used with discrete independent variables. Pie charts are best to use when you are trying to compare parts of a whole (percentage or proportional data). Pie charts should be used for displaying data with no more than 6 categories. They do not show changes over time. They are not used often in scientific research. Bar Chart Bar graphs are used with discrete independent variables. Bar graphs can be horizontal (x axis on side) or vertical (x axis is on the bottom). The height of each bar (dependent variable, y variable) are scaled according to their values and the bars can be compared to each other. Bar graphs have a space between each bar. Stacked bar charts can be used to compare overall quantities across items while illustrating the contribution of each category to the total. Histogram Histograms are used with continuous independent variables. Histograms can be horizontal (x axis on side) or vertical (x axis is on the bottom). The height of each bar (dependent variable, y variable) are scaled according to their values and the bars can be compared to each other. Histograms do not have a space between each bar. Figure 1. Examples of bar charts and a pie chart encoding the same data. (a) Values in different categories are difficult to compare in pie charts. (b) Stacked bar charts enable comparison of overall values across items. (c) Layered bar charts support comparison of values within categories. (d) Grouped histograms allow comparison of values across categories (Streit and Gehlenborg, 2014). Boxplot Box plots are used with discrete independent variables. Box plots can be horizontal or vertical. Box plots show the full range of variation (from min to max), the likely range of variation (the interquartile range, IQR), a typical value (the median) and outliers (values 3 times the IQR). They provide more information than a bar chart. Figure 2. A comparison of bar graphs and box plots. (a) Bar chart showing sample means with standard-deviation error bars. (b) Box plot with whiskers extending to ? 1.5 times the interquartile range. (c) Distributions of the different data sets. (Streit and Gehlenborg, 2014). Scatter Plot Scatter plots are used when both the independent and dependent variables are quantitative. They show how much one variable is affected by another, also called their correlation. The closer the data points come when plotted to making a straight line, the higher the correlation between the two variables, or the stronger the relationship. Scatter plots can also help us see data that cluster together in certain areas of the scatter plot. Figure 3. An example scatter plot of the percent of the microbiome that is H. Pylori and obesity (Lender et al, 2014). Line Graph Line graphs are used when both the independent and dependent variables are quantitative. Line graphs are like scatter plots except a line is created connecting each data point together. This emphasizes local changes from one point to the next. Unlike scatter plots, line graphs do not usually help us detect correlations, as the line emphasizes point-to-point changes. Figure 3. An example of a line graphs. These graphs emphasize changes in specific bacterial taxa with and without probiotic supplementation (Rutten et al, 2015). Statistical Analysis "Humans aren't too good at discerning subtle patterns that are really there, but equally so at imaging them when they are all together absent." - Everitt and Hothorn, 2010 Summary Statisitcs Mode - data value that occurs most frequently Median - Data value that occurs at the precise middle of all data points Mean - Numerical average of all the data points Variance - Measure of the spread of the data, and is the average of the squared differences from the mean Standard Deviations Square root of the variance Interquartile Range - Where the 'middle' fifty percent of the data are located Distribution - Listing or function showing all the possible values of the data and how often they occur. Includes normal, skewed, uniform, Poisson and others (See Figure 5) Figure 5. Examples of various distributions with different standard deviations. Statistical Tests Parametric Statistics Parametric analyses are the oldest and most commonly used type of analysis. We will cover the three most common: correlation, t-test, and analysis of variance. All parametric statistics have three common assumptions that must be met before proceeding. In Figure 5, the top example could be tested with a parametric test. 1. All observations are independent of other observations (product of experimental design, no test needed). 2. The data are normally distributed (easily tested by examining the distribution). 3. The variances in the different treatment groups are the same. (Requires a test, such as the FMax Test) Student's t-Test This analysis is used when you are comparing two different samples. A Student's t-test will report a t-statistic and a probability value (p-value). If the p-value is greater than or equal to our our alpha (usually 0.05) we reject our null hypothesis that there is a significant difference between the groups. Analysis of Variance (ANOVA) Analysis of variance is used to determine if differences exist between more than two treatment groups. The assumptions of ANOVA are identical to the t-test and the calculated statistic is called an F-value, with a corresponding p-value. As with the t-test, if our probability value is less than 0.05 we reject our null hypothesis (in this case that there is no difference among the treatment groups). This p-value only tells us if there are significant differences among our groups. It does not tell us where these differences are. Regression Regression is used to determine whether two variables are related. A highly used regression method is Pearson's r. The r statistic has a range of values from -1.00 (a perfect negative correlation) to 1.00 (a perfect positive correlation). A negative correlation means that as one variable increases in size, the other decreases. A positive correlation means that as one variable increases so does the other. When r=0.00 there is no relationship between the two variables. This test has the same three assumptions as other parametric analyses, but it also has the additional assumption that the relationship between the two variables is linear. A regression analysis also gives a coefficient of variation (R2). The coefficient of variation has a range of values from 0%-100% and explains how much of the variation in the dependent variable is because of the independent variable. Nonparametric statistics Most nonparametric statistics are simple to use, do not require large data sets, and have few underlying assumptions. They are not as powerful as parametric statistics (i.e. they are not very good at detecting small differences between groups), Non-parametric tests all assume independence of observations. In general, these tests should be chosen over parametric alternatives when sample sizes are small (less than 10-20 replicates). We will use three nonparametric tests in this course. Wilcoxin's Rank Sums Test and the Mann-Whitney U Test These analyses are used to test for differences between two treatment groups and are analogous to a t-test. Kruskall-Wallis Test or adonis These tests for differences between more than two different treatment groups. They're basically nonparametric ANOVAs. Spearman's Correlation This analysis is a non-parametric regression analysis. Choosing a Test Below is a flowchart we will use to help us pick which test to use (Gerwien, 2016). References Everitt BS and Hothorn T. 2010. A handbook of statistical analyses using R. CRC Press. Gerwien R. 2016. A painless guide to statistics. http://abacus.bates.edu/~ganderso/biology/resources/statistics.hmtl Lender N, et al. 2014. Alimentary Pharmacology and Therapeutics. 40:24-31. Rutten RBMM, et al. 2015. PLoS ONE. 10: e0137681. Streit M and Gehlenborg N. 2014. Nature Methods.11:117. Using R Studio This tutorial will help us learn how to use RStudio. RStudio interface The first time we open RStudio we are greeted by three panels. The left half of the screen is the console. The upper right corner is the variable inspector, and the lower right corner can show you different things depending on which tab is selected. The default for this panel is a file viewer. Let's go through each panel more in depth. This is what RStudio looks like Console The console is like our terminal. Here we can type commands and R will perform them. The types of commands we use in the R console are, for the most part, specific to the R coding language. This language is different than the how you would write commands for your terminal. Environment/Variable inspector As we load and manipulate data, we can store the data as a 'variable'. The environment window shows us which variables we have created, and we can actually look to see what they are. Notice this panel also has a history tab where you can see all of the recent commands you have performed. File System This is just like the 'finder' on a Mac or the 'windows explorer' on a PC. This lists our files and folders. Notice this panel has other tabs as well. The plots tab will show us what our plots look like as we create them, the packages tab shows us all of the packages (sets of functions) you can import when running your analysis. The viewer tab is for more advanced interactive graphics and won't be used in this course. Dropdown menus These are the options listed across the top of RStudio (just like most other programs on our computers). These menus include many options you might need. For example, File, Save will save our work. Using RStudio To begin using Rstudio all we have to do is type a command into the console. For example, let's make a variable called test_variable. And in this variable we will store some data. The data will be a set of words ("bird", "dog", "cat"). We also call words strings, because the string of characters doesn't necessarily have to be an actual word. For example, "bird" is a string and so is "abcd". test_variable <- c("bird", "dog", "cat") Notice a couple of things: 1. the arrow <- is what assigns the data to a variable. You can read the whole command like this: "test_variable gets a vector containing the strings bird, dog, cat" 2. When we want to group things into one variable we can use c(), which is a function that combines values into a vector or list. 3. When we want to store strings we have to specify it's a string using quotes "". We can use either double "" or single '' quotes ("bird" or 'bird'). If we tried to use c(bird, dog, cat) instead of c("bird", "dog", "cat") R would read and try to interpret bird, dog and cat to be variables, not strings. Once you have created this variable you will be able to see it in your environment panel on the right side of RStudio. It tells us that in test_variable contains characters (chr), there are three character values stored [1:3], and they are "bird", "dog" and "cat". Using Existing R Code In addition to typing things directly into the console, we can run R code from existing files. To do this we need to open a file that contains R code. Let's open the Intro-R.r file in R studio. To do that you can go to File, Open File, and find the 'Intro-R.r' file that you can download from Moodle. This file will open in a new panel above the console. To run the code from this file we have two options. 1. Copy and paste the code into the console and press enter 2. Run the code directly from the file. To do this we can place our cursor on the line of the file we want to run. So, place your cursor on the test_variable2 <- c("bacteria", "fungi") line. For Mac users, you can run this command by holding down your 'command' key and pressing enter. For PC users, you can run this command by holding down your 'control' key and pressing enter. Another option is to place our cursor on the line we want to run and then pressing the Run button on the upper right side of the panel. We can run many lines of code, one line at a time by highlighting all the code we want to run in the file, and using the key combinations mentioned above. Saving R Code When we do an analysis in R, we definitely want to save our code so we can use it in the future. If we are working from an existing file, we can open the file and add new code as we write it. If we are starting a new file, we can use File, New File, R script to create a new file. As we run commands, they get stored in the 'History' tab of our 'Environment/Variable Inspector' (upper right panel). If we run a command, and it successfully does what we want it to, we can click on the command in the 'History' tab, and then click 'To Source' to add it to our file. As we write and save our code, we must remember to comment it. Our comments will be used by anyone reading the code to figure out what was done. Our comments should: 1. Be close to the code we are specifically commenting (not just at the top of the file) 2. Be clear and concise 3. Capture intent Remember that the comment symbol is #, and it is line-specific. Here is an example of commented code: # This stores the sum of 2,4,6 and 8 as a variable 'sum_numbers' sum_numbers <- sum(2,4,6,8) # This stores the square root of sum_numbers as a # variable 'sqrt_numbers' sqrt_numbers <- sqrt(sum_numbers) Your comments don't have to be every line, but should be easily interpreted. If there are tricky parameters in your functions this is a good way to remind yourself why you have to specify certain things. Also, when naming variables, make sure to use a descriptive name that reflects what the variable is storing. To save our R code files, we can use File, Save or 'command' s (Mac), or 'control' s (PC). Important Concepts Objects Objects are the pieces of data stored as variables in R. There are different types of objects. We already mentioned one type, 'character', which includes letters and strings. Other types of objects we will use in this class include 'logical', which are either True or False, 'integer', which are integers, and many others. Variables Variables are what we store our data as in R. We name each variable (in our first example, it was 'test_variable'), and there are different types of variables. Some types include the following: Vectors These can be considered a group of data. There are different types of vectors, some that we will use in this course include: 'logical' (trues or falses), 'integer' (numbers), and 'character' (strings). vector_1 <- c(1,2,5.3,6,-2,4) # numeric vector vector_2 <- c("one","two","three") # character vector vector_3 <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) #logical vector Lists These are kind of like vectors, but the objects stored in the list do not have to be the same type. list_1 <- c("one",2,TRUE) Functions These are variables that perform a task. For example c() is a function that combines objects into a vector. R comes with many functions, and we write them with their name followed by parentheses. length() #this is a function that will tell us the length of something NULL Variables that are NULL contain nothing, and are not of a specific type. If we create a NULL variable, it will be listed in our environment but it will have no attributes. nothing <- NULL Matrices A matrix is a table, where all the columns in the matrix must have the same mode (numeric, character, etc.) and the same length. # generates 5 x 4 numeric matrix test_matrix <- matrix(1:20, nrow=5,ncol=4) Data Frames A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.). # This will make a dataframe where the columns are the filled with # the vectors 'd', 'e' and 'f' d <- c(1,2,3,4) e <- c("red", "white", "red", NA) f <- c(TRUE,TRUE,TRUE,FALSE) test_dataframe <- data.frame(d,e,f) Factors We can tell R that a variable is nominal by making it a factor. The factor stores the nominal values as a vector of integers and an internal vector of character strings (the original values) mapped to these integers. # Let's say there is a group of people, 3 female and 2 male # Let's make a vector that stores how many female (F) and # male (M) people there are gender <- c("F", "F", "M", "M", "F") # stores gender as a factor where 1=female, 2=male gender <- factor(gender) # R now treats gender as a nominal variable Operators Data is manipulated in programs using operators and functions. R has many built-in operators, the most commonly used include: Arithmetic operators * Numerical calculations (preserving the order of operations) + Addition + + Subtraction/change sign + Multiplication * + Division / Relational operators * Comparing values + Less than < + Less than or equal to <= + Greater than > + Greater than or equal to >= + Equal to == + Not equal to != Assignment operators * Assigning values to objects + Global (you will generally use this one) <+ Local (often used within functions) = Logical operators * Conjunctions for combining/excluding terms + AND && + OR || + NOT ! Colon operator * Creating regular sequences (often of numbers) + : example: 3:7 produces the output [1] 3 4 5 6 7 Now use the rest of the code in the 'Intro-R.r' file to make and inspect different variable types and operators. Loading Tables in R If you remember we generated 3 types of tables with QIIME: * OTU table (.biom and .txt versions - rarefied and low-depth filtered) * Alpha diversity table (.txt) * Beta diversity tables (.txt) OTU Table Format The first two lines include a spacer line detailing how the file was once a .biom format, and the column headers. Note that these lines start with a '#', which usually represents a comment line (something the computer doesn't read), so we will have to pay attention to how R reads our OTU table. Rows OTU ID, which is a unique ID for each set of sequences that are 97% identical. Columns 1 through the second last Each column represents a sample. The numbers in each row correspond to the number of reads that mapped to the specified OTU ID in the first column. Last Column The assigned taxonomic identity for each OTU (e.g. For k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Prevotellaceae; g__Prevotella; s__copri). k = kingdom, the p = phylum, c = class, o = order, f = family, g = genus and s= species. See example of the first 5 lines of an OTU table that is in the required format: alt text Loading OTU Table First, use the function read.table() to read in your OTU table. These various arguments are all set specifically for the format of your OTU table in .txt format. comment = is telling R what should be interpreted as a comment versus as a line of code. The default for this is the pound sign '#' but since we want the column header information we turn off the interpretation of comments using the option comment = '' header = is telling R whether the first line of code should be assigned as row 1 or as the column names. We set this to TRUE or T. sep = defines the field separator character which in this case is a tab, so sep = '\t' skip = tells R how many rows to skip when reading in the table. The default for this is 0, but in this case, we want to ignore the first line '# Constructed from biom file' so we skip the first line. as.is = controls the interpretation of character variables as a character string vs. as a factor. To avoid having thousands of levels associated with our taxonomy column, we specify as.is=T check.names = determines whether the names of variable in the data frame are syntactically valid. Because our sample names in our data set start with numbers, which would cause problems in R, we have to set check.names=F row = will tell R if we would like to set one of the columns to be the row names. In this case we would like to set the first column, which is the OTU IDs to be the row names. (row=1) You will have to change the name of the OTU table to be the name of your table. # Now we can read in the table - This the the rarefied one otu <- read.table("otu_rare2000.txt", comment="", header=TRUE, sep="\t", skip=1, as.is=TRUE, check.names=F, row=1) # Read in the low depth removed OTU table otu_low <- read.table("otu_rare2000.txt", comment="", header=TRUE, sep="\t", skip=1, as.is=TRUE, check.names=F, row=1) Remember, you can always find out more about a function by using the help() function or the ?. So, to find out about the read.table() function, you could do the following: ?read.table() To find out information about our table we can use different functions. For example, we can find out the row names and column names using rownames() and colnames(), respectively. We can find the dimensions with dim(), and we can print the first couple of lines (default is 10) with head(). We can also click on our table in the Environment panel to view the whole table. # View first 2 lines using head() head(otu, n=2) # View dimensions dim(otu) # Print row names (which are OTU IDs) row.names(otu) #Print column names (which are samples IDs the taxonomy header) colnames(otu) Alpha Diversity File Prior to plotting in R, we need to generate an alpha diversity table in QIIME. This file will be the output of alpha_diversity.py, and will be a tab-delimited, plain text file. The format for the alpha diversity file is the following: Format Rows The rows are the sample IDs. Columns Each column represents a diversity metric (e.g. PD_whole_tree, simpson, shannon, or observed_species). The numbers in each row correspond to alpha diversity estimate for the associated sample. Loading Alpha Diversity # Read in the alpha diversity table alpha <- read.table("Alpha_Div.txt", sep='\t', header=TRUE, as.is=TRUE, check.names=FALSE, row=1) Notice: * We set the header to be the first row (alpha diversity metrics) * We set the rownames to be the first column (sample IDs) Beta Diversity File Prior to plotting in R, we need to generate an a distance matrix generated by with QIIME. This file will be the output of beta_diversity.py, and will be a tab-delimited, plain text file. The format for the beta diversity file is the following: Format Rows The rows are the sample IDs. Columns Each column is also a sample ID and the distances from one sample to another are the values. You should have one for each metric you used (Unweighted UniFrac, Weighted UniFrac, and Bray-Curtis). # Load the beta diversity matrix, notice that we use read.table(), # but then change from a dataframe to a matrix with as.matrix() beta <- as.matrix(read.table("unweighted_unifrac_dm.txt", sep = "\t", header=T, row = 1, as.is = T, check.names = F)) Notice: * We set the header to be the first row (These are sample IDs) * We set the rows names to be the first column (These are also sample IDs) Metadata File Your metadata file (also called a mapping file) is a data table containing information about the samples in your dataset. In order to assess how taxa correlate with variables of interest (e.g. country, body site, species, ecoregion, BMI, etc.), we need to have that information about our samples accessible. The metadata file for our data set is HMP_5BS_metadata.txt. Format Rows The actual mapping file starts with '#SampleID' as the first header. This contains a the sample IDs, which are unique IDs for each sample in the dataset. To work in QIIME, this must have a '#' at the start. Remember that '#' usually represents a comment line (something the computer doesn't read), so we will have to pay attention to how R reads in this file. Columns 1 - last column Each column represents a description of the sample. It can be anything including details about the patient, person, animal or location the sample was taken from. This file should contain no spaces or empty columns/rows. Loading Metadata File We load the metadata table just like the OTU table, but notice that the skip parameter is left out, because the metadata table doesn't have the additional first line that the OTU table has. metadata <- read.table('HMP_5BS_metadata.txt', header=T, sep='\t', check.names=F, comment='', row=1) Notice: * We set the header to be the first row (These are sample IDs) * We set the rows names to be the first column (These are also sample IDs) * We told R to ignore the '#' in the first line What are the dimensions of the metadata file? How would you find this out? We went over this in the 'Normalizing OTU Table tutorial' What variables do we have available for this data set? They are the column headers. You can find this out using colnames(). colnames(metadata) ## [1] "BarcodeSequence" ## [4] "BodySite" ## [7] "Description" "LinkerPrimerSequence" "Sex" "SRS_SampleID" "FASTA_FILE" "Age" The mapping file is what we use in the majority of our QIIME commands so it contains information about the sequencing files (e.g. BarcodeSequence, LinkerPrimerSequence, FASTA_FILE, and SRS_SampleID), that are not necessary for our analysis. We need the sample IDs to match our variables to the microbial abundance information contained in our OTU table. Each column in this file is a variable (also called a covariate), which can be defined as being continuous or categorical. Categorical variables are described as factors, the levels of which are the categories within it. You can view the number and identity of levels for a categorical variable by calling it, or using the str() function. # View the 'Sex' column of the mapping file dataframe metadata[,'Sex'] ## [1] female ## [11] female ## [21] ## [31] female ## [41] ## [51] female female female female male female female male male male male male male male male female male female male female male male male male male male female female female female female male female female male male male female male female female female male female female female male female female male male female female female female male male ## [61] male male ## [71] male male female ## [81] female male ## [91] male female female ## [101] female female ## Levels: female male female female male female male male female female female female male female male male male male female female female male female female female male male male male male female male male female Notice how we wrote the command to access the 'Sex' column.[,] is a way to specify rows and columns or a matrix or dataframe. Inside the square brackets, the first index specified is the row, and the second (after the comma) is the column. So what we wrote was, "display all the rows (left blank), in the 'Sex' column, or metadata[,'Sex']. We can also use the row number or column number metadata[,4]. Because our mapping file is loaded as a dataframe, we can also do this using the "$". # Notice that using "$" only works for dataframes and not matrices metadata$Sex ## [1] female female female ## [11] female female female ## [21] male male ## [31] male male female ## [41] female male ## [51] female female ## [61] male male ## [71] male male female ## [81] female male ## [91] male female female ## [101] female female ## Levels: female male female female male male male female male female male male male female female male male female male male male female female female male female female male male female male female female female female male male male female female female female male female male female female female male male female female female male female female female male male male female female female male female male female male male male female male male male male female male # The class function will also tell you whether your variable is # a factor, numeric, character, etc. class(metadata[,'Sex']) ## [1] "factor" male male Formatting Your Data In order to assess relationships between sample information in our OTU table, alpha diversity and beta diversity, we need to match the order of our data frames. For that we will use the intersect() function. Because we have potentially removed one or more samples from our OTU table during rarefaction, filtering or other manipulations, we can first define the subset of samples in all of our tables. intersect() can retain all the sample IDs that are in the OTU table and also in the metadata file. We can then subset all of our tables to keep just those samples. # First, define all the samples in the OTU table. # Remember, when we load in the OTU table, samples are columns # Remember, the last column in the OTU table is taxonomy, so ommit the last column samples1 <- colnames(otu)[1:(ncol(otu)-1)] # Now let's see what the intersect with the metadata row names are IDs_Keep <- intersect(samples1, rownames(metadata)) # Now let's filter the metadata to keep only those samples # We do this by telling R to make a new data frame that only has the rows we want metadata <- metadata[IDs_Keep,] # Now let's filter the OTU table to keep just the intersecting samples # We will store it as a new otu table (incase we need the old one) # Remember, OTU table has columns as samples! # This will also remove the taxonomy, because it's not a sample ID we want otu2 <- otu[,IDs_Keep] #for rarefied otu_low2 <- otu_low[, IDs_Keep] #for low depth removed # To add the taxonomy back, we can use the taxonomy info from # the orignal table otu2$taxonomy <- otu$taxonomy otu_low2$taxonomy <- otu_low$taxonomy # Now let's filer the alpha diversity table to keep those samples too # Alpha diversity has the samples as row names alpha <- alpha[IDs_Keep, ] # Now let's filer the beta diversity table to keep those samples too # Beta diversity has the samples as row names AND column names # We must filter both the rows and columns beta <- beta[IDs_Keep,IDs_Keep] #Let's check to make sure the samples match as.character(rownames(metadata)) == colnames(otu2)[1:(ncol(otu2)-1)] ## [1] TRUE ## [15] TRUE ## [29] TRUE ## [43] TRUE ## [57] TRUE ## [71] TRUE ## [85] TRUE ## [99] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE #Let's see how many samples are in the otu table (columns) and mapping ncol(otu) #There should be one more here because there is also a taxonomy row ## [1] 103 nrow(metadata) ## [1] 102 Plotting in R ggplot To visualize our data in R we will use the package ggplot2(). This package allows us to make detailed and specific visualization needed to best show our results. Let's start with the packages we need to load. If these are not installed you can install them first with install.packages(). library(ggplot2) ## Warning: package 'ggplot2' was built under R version 3.3.2 ggplot input ggplot likes to have all the data for the plot in one table. Specifically, it like to be able to access the information needed by using columns. Let's use alpha diversity as an example. We will use our alpha diversity measurements as the input data for our examples. First we have to combine our alpha diversity results with the metadata that tells us which body site each samples comes from. Make sure you have your metadata and alpha diversity tables loaded and that the samples are subsetted and in the correct order for both tables. # We will make a copy of our metadata to work with combined_alphadata <- metadata # Because our sample order is the same, we can make a new column in the table # This column will contain all the Shannon index measurements for the samples combined_alphadata$shannon <- alpha$shannon ggplot Format ggplot creates plots in layers. First we make the base layer with ggplot() and then add on different types of plotting types and aesthetics. Aesthetic Mapping In ggplot, aesthetic means something you can see. For example: * position (i.e., on the x and y axes) * color (“outside” color) * fill (“inside” color) * shape (of points) * linetype * size Geometic Objects (geom) Geometric objects are the actual characters we put on a plot. For example: * points (e.g. geom_point, for scatter plots) * lines (e.g. geom_line, for line graphs) * boxplot (e.g. geom_boxplot, for box plots) Plotting Before we plot our data, we need to think about what type it is. In the first example the independent variable will be "BodySite", which is discrete. Our dependent variable, alpha diversity (Shannon Index), falls in a range of values. We could use a bar chart, but a box plot will tell us more about the dataset. Box Plots If we remember, Box plots consist of several features: the box, which extends from the first quartile (Q1) to the third quartile (Q3), respectively, with the median (Q2) depicted by a vertical line within the box; whiskers, which extending vertically from the box and indicate the range of variability outside of the upper and lower quartiles; and outliers, which are individual points outside of the box and whiskers. It is important to note that box plots are good for non-parametric data. They display variation in samples of a statistical population without making any assumptions of the underlying statistical distribution. The spacing between the interquartile range of the box indicates the degree of dispersion (spread) and skewness in the data. Plotting Example 1 - Discrete X-Variable ggplot() + geom_boxplot(data=combined_alphadata, aes(x= BodySite, y= shannon)) Notice: * We created the base layer with ggplot() * We add the next layer with+* We added a boxplot withgeom_boxplot()* We specified which table to use withdata=* We specified the aesthetics withinaes()` * The dependent variable (y) is called by its column name * The independent variable (x) is called by its column name Scatter Plots In the second example the independent variable will be "Age", which is continuous. Our dependent variable, alpha diversity (Shannon Index), falls in a range of values. Therefore we can use a scatter plot to look for trends. Plotting Example 2 - Continuous X-Variable ggplot() + geom_point(data=combined_alphadata, aes(x= Age, y= shannon)) Notice: * We created the base layer with ggplot() * We add the next layer with+* We added a scatter plot withgeom_point()* We specified which table to use withdata=* We specified the aesthetics withinaes()` * The dependent variable (y) is called by its column name * The independent variable (x) is called by its column name Other Plotting Examples Plotting Example 3 - Wrong Plot Types ggplot() + geom_boxplot(data=combined_alphadata, aes(x= Age, y= shannon)) ## Warning: Continuous x aesthetic -- did you forget aes(group=...)? Notice: * We tried to make a boxplot with continuous data. It clearly doesn't work well! Plotting Example 4 - Adding in more aesthetics # Specifying the color changes the outline color ggplot() + geom_boxplot(data=combined_alphadata, aes(x= BodySite, y= shannon, color=BodySite)) # Specifying the fill changes the interior color ggplot() + geom_boxplot(data=combined_alphadata, aes(x= BodySite, y= shannon, fill= BodySite)) # Specifying the fill to a color changes the interior color for all ggplot() + geom_boxplot(data=combined_alphadata, aes(x= BodySite, y= shannon, fill= BodySite)) # Specifying the fill outside of aes() changes it for all x values # You must pick an exact color if you'd like to do this ggplot() + geom_boxplot(data=combined_alphadata, fill = "red", aes(x= BodySite, y= shannon)) # Specifying the theme changes the background ggplot() + geom_boxplot(data=combined_alphadata, fill="red", aes(x= BodySite, y= shannon)) + theme_bw() #this is the black and white theme # We can pick any colors we want to fill by ggplot() + geom_boxplot(data=combined_alphadata, aes(x= BodySite, y= shannon, fill=BodySite)) + theme_bw() + scale_fill_manual(values= c("blue", "green", "pink", "grey", "yellow")) Plotting Example 5 - Adding Layers of Plots # Add a scatter on top of the boxplots # To do this, we have to specify the data frame for each layer and the aes() for each layer ggplot() + geom_boxplot(data=combined_alphadata, aes(x= BodySite, y= shannon, color=BodySite)) + geom_jitter(data=combined_alphadata, aes(x=BodySite, y=shannon, color=BodySite)) # We can decrease the jitter on the scatter plot with width= ggplot() + geom_boxplot(data=combined_alphadata, aes(x= BodySite, y= shannon, color=BodySite)) + geom_jitter(data=combined_alphadata, width= 0.1, aes(x=BodySite, y=shannon, color=BodySite)) # If all the data for your entire plot will use the same data frame you can specify that in the ggplot() ggplot(data=combined_alphadata) + geom_boxplot(aes(x= BodySite, y= shannon, color=BodySite)) + geom_jitter(width= 0.1, aes(x=BodySite, y=shannon, color=BodySite)) # If all the data and aes() for your entire plot will be the same, you can specify it in ggplot() ggplot(data=combined_alphadata, aes(x=BodySite, y=shannon, color=BodySite)) + geom_boxplot() + geom_jitter(width= 0.1) # You can make this specific to just one plot layer # We are using color ONLY in the geom_jitter here ggplot(data=combined_alphadata, aes(x=BodySite, y=shannon)) + geom_boxplot() + geom_jitter(width= 0.1, aes(color=BodySite)) Alpha Diversity Differences in R Input data Alpha Diversity Table and Metadata Your alpha diversity table and metadata table should be loaded. Remember to subset the tables so that the samples IDs are correct and in the same order. In the "Loading Tables in R" section we saved these tables as alpha and metadata Now we need to pick which covariate we would like to use for the plot, and which alpha diversity metric we would like to visualize. We will use "Sex" and the "shannon" diversity metric. Testing for Differences t-Tests A t-test can be used to determine if two sets of data are significantly different from each other based on the population means. It assumes the data are normally distributed. Although it is typically assumed that data of a large enough sample size are normally distributed, this is not always the case. Testing for Normality Let's see if our data are normally distributed using the hist() function. We will do this for each group in the covariate we are testing (in this case, 'sex'). # We will find the samples that are male in the metadata males.ix <- metadata$Sex == 'male' # And subset the alpha table to include only those, and store it as 'males' males <- alpha[males.ix,] # We will do the same for females females.ix <- metadata$Sex == 'female' females <- alpha[females.ix,] # Now we can plot the histograms hist(females$shannon, xlab="Alpha Diversity", main='Females') hist(males$shannon, xlab="Alpha Diversity", main='Males') Do those look normally distributed? For the most part they do, but it is sometimes hard to tell. The best way to determine if your data are normally distributed is to do a statistical test. The Shapiro-Wilk Normality Test This test can be run with the shapiro.test() function in R. It will generate an approximate pvalue, which is adequate in assessing normality. In this case, p-values less than 0.1 indicate the data are significantly different from normal distribution. We will run this test for each group in our covariate of interest. shapiro.test(females$shannon) ## ## Shapiro-Wilk normality test ## ## data: females$shannon ## W = 0.94434, p-value = 0.01548 shapiro.test(males$shannon) ## ## Shapiro-Wilk normality test ## ## data: males$shannon ## W = 0.89861, p-value = 0.0004951 Based on the results, should you run a t-test? Mann-Whitney U Test Based on these results, it is better to test for differences in our data using a statistical test that does not require normal distributions. The Mann-Whitney U test (aka Wilcoxen-rank_sum test) is smilar to the t-test. The null hypothesis of this test would be that that groups means differ only by chance. We can do a Mann-Whitney U test using the wilcox.test() function in R. wilcox.test(females$shannon, males$shannon, na.rm=TRUE) ## ## Wilcoxon rank sum test with continuity correction ## ## data: females$shannon and males$shannon ## W = 973, p-value = 0.0295 ## alternative hypothesis: true location shift is not equal to 0 # This is telling are to use a wilcox test to comapre females # and males for the diversity metric we set earlier. It is also # telling R to remove any NAs (missing data) # Note that you would run a t-test in the same way, only using # the t.test() function The p-value here, which is less than an alpha of 0.05. This means that we reject the null hypothesis that these two body sites are not significantly different for this metric of alpha diversity. We can also see what the box plot looks like. alpha2 <- alpha alpha2$Sex <- metadata$Sex ggplot(data=alpha2, aes(x=Sex, y= shannon)) + geom_boxplot(outlier.color = NA) + # removes outlier points becuase we add in the jitter anyways geom_jitter(width= 0.1, aes(color=Sex)) + theme_bw() + guides(color=F) #because the x-axis is already labeled In our example we only have two groups, 'male' and 'female'. If we used a different covariate, like 'BodySite' (which contains 5 groups) we would have 10 tests to do pairwise. Luckily, for loops can come to the rescue! This loop will perform the wilcox.test() on every unique combination of groups in the covariate. # First let's set all the groups available for the variable we care about # In this case we will use BodySite instead of what we set as cov1 # (because sex only has two values) groups <- unique(metadata$BodySite) # We create empty vectors to store the pair-wise pvalues and the # groups tested (names) pw.pvalues <- NULL pw.names <- NULL # We set two counters, 'i' starts at 1 and goes until one less than # the number of groups. 'j' will start at 2, and go until the full # number of groups. This will end up comparing: 1 vs 2, 2 vs 3, # 3 vs 4, and so on. for(i in 1:(length(groups) - 1)){ for(j in (i+1):length(groups)){ #we use this to pick the groups assigned to 'i' ix.metric.i <- metadata$BodySite == groups[i] #and this for 'j' ix.metric.j <- metadata$BodySite == groups[j] #this stores the pvalue from the test pvalue <- wilcox.test(alpha[ix.metric.i,"shannon"], alpha[ix.metric.j,"shannon"])$p.value #appends the new p-value to the list pw.pvalues <- c(pw.pvalues, pvalue) #sets the names of the groups tested test.name <- paste(groups[i], "_vs_", groups[j],sep='') #appends the names of the groups tested to the list pw.names <- c(pw.names, test.name) } } names(pw.pvalues) <- pw.names pw.pvalues ## Mid_vagina_vs_Left_Retroauricular_crease ## 6.925356e-02 ## Mid_vagina_vs_Saliva ## 7.431382e-12 ## Mid_vagina_vs_Subgingival_plaque ## 3.715691e-12 ## Mid_vagina_vs_Stool ## 2.600984e-11 ## Left_Retroauricular_crease_vs_Saliva ## 3.482133e-10 ## Left_Retroauricular_crease_vs_Subgingival_plaque ## 1.066403e-10 ## Left_Retroauricular_crease_vs_Stool ## 2.117572e-09 ## Saliva_vs_Subgingival_plaque ## 5.100514e-01 ## Saliva_vs_Stool ## 4.815795e-02 ## Subgingival_plaque_vs_Stool ## 2.298969e-01 False Discovery Rate Correction When we use the 'sex' covariate, we only have one test to perform. If we are comparing more than two groups and we are running multiple tests we have to correct of the number of comparisons we are making. We do this with the p.adjust() function. This will correct for type I errors, which are rejections of a true null hypothesis (also known as a false positive). # We will correct using 'fdr', which is the false discovery rate fdr.pvalues <- p.adjust(pw.pvalues,'fdr') fdr.pvalues ## Mid_vagina_vs_Left_Retroauricular_crease ## 8.656695e-02 ## Mid_vagina_vs_Saliva ## 3.715691e-11 ## Mid_vagina_vs_Subgingival_plaque ## 3.715691e-11 ## Mid_vagina_vs_Stool ## 8.669946e-11 ## Left_Retroauricular_crease_vs_Saliva ## 6.964267e-10 ## Left_Retroauricular_crease_vs_Subgingival_plaque ## 2.666008e-10 ## Left_Retroauricular_crease_vs_Stool ## 3.529287e-09 ## Saliva_vs_Subgingival_plaque ## 5.100514e-01 ## Saliva_vs_Stool ## 6.879708e-02 ## Subgingival_plaque_vs_Stool ## 2.554410e-01 Now we can view the relative p-values for each pairwise comparison, and we can save this table as a file. # sink() will write whatever is listed below it to a file. # You close that file by listing sink() again. sink("alpha_stats.txt") cat("\nNumber of samples in each group:\n") print(table(metadata$BodySite)) #This prints a table of the number of samples at each body site cat("\nMean Alpha Diversity:\n") print(tapply(alpha$shannon, metadata$BodySite, mean)) # This will get the mean of alpha diversity at each body site # by using tapply() to apply the mean function across the alpha # table (subsetted into body site groups) cat("\nMedian Alpha Diversity:\n") print(tapply(alpha$shannon, metadata$BodySite, median)) # This will get the median of alpha diversity at each body site cat("\nStandard Deviation:\n") print(tapply(alpha$shannon, metadata$BodySite, sd)) # This will get the standard deviations of alpha diversity at # each body site cat("\nPairwise Mann-Whitney-Wilcoxon Tests were performed.\n") cat("Pairwise p-values are:\n") print(pw.pvalues) cat("\nFDR-corrected pairwise p-values are:\n") print(p.adjust(pw.pvalues,'fdr')) sink() We can also see what the box plot looks like. alpha2 <- alpha alpha2$BodySite <- metadata$BodySite ggplot(data=alpha2, aes(x=BodySite, y= shannon)) + geom_boxplot() + geom_jitter(width= 0.1, aes(color=BodySite)) + theme_bw() We can also print this plot to a pdf with the pdf() function followed by dev.off() to close the pdf. plot_output <- ggplot(data=alpha2, aes(x=BodySite, y= shannon)) + geom_boxplot() + geom_jitter(width= 0.1, aes(color=BodySite)) + theme_bw() + scale_x_discrete(labels=c("ear fold", "vagina", "saliva", "stool", "plaque")) + guides(color=F) #because they are labeled at the x- axis pdf("Alpha_Diversity.pdf", height=4, width=6) plot(plot_output) dev.off() ## quartz_off_screen ## 2 Taxa Summary Plots in R Visualizing which taxa are in your samples can be an effective way to see patterns in the data. Here we will learn how to make taxa summary plots based on your input OTU table, a covariate of interest, and other specified parameters. Input data OTU table and Metadata Your OTU table should be loaded and we can use the rarefied version. Your metadata file should also be loaded. Remember to subset the tables so that the samples IDs are correct and in the same order. In the "Loading Tables in R" section we saved these tables as otu2 and metadata Manage Taxonomy We want to set a taxa level by number: 1 = kingdom 2 = phylum 3 = class 4 = order 5 = family 6 = genus 7 = species Let's work with phylum (level 2). The taxa are listed with a letter representing the level followed by two underscores, and a semicolon separating each level (k__kingdom; p__phylum; ...). We will do some string parsing to replace the full taxonomy label with the appropriate level. # In this example we are using 2 (or phylum). # This can be for any level you want. level = 2 # First we make an empty table (array) for our new names # The array will have the number of rows equal to the number of OTUs in the table # and one column for each taxonomy level names_split <- array(dim=c(length(otu2$taxonomy), level)) # We will store our taxonomy as a list of names otu_names <- as.character(otu2$taxonomy) # # # # Then we run through each name and split based on the level we are interested in. We make a for loop to split every name stored in otu_names. strsplit() splits the string (otu_names[i]) at ";". This retains all the levels as separate strings. head() takes the # first items (the total will be the number you specified with level) # from the string split output and stores it in the names_split # array at the specied row. for (i in 1:length(otu_names)){ names_split[i,] <- head(strsplit(otu_names[i], "; ", fixed=T)[[1]], n=level) } # Now we will collapse the strings together into one string otu_names <- apply(names_split, 1, function(x) paste(x[1:level], sep = "", collapse = ";")) # Replace the old taxonomy with the truncated version otu2$taxonomy <- otu_names Now we want to consolidate our OTU table by the taxa levels we've set, just like we learned in the loading and manipulating tutorial. We will use the aggregate() function. # Get the number of samples (the last column is taxonomy) sample_no <- ncol(otu2)-1 # Collapse the otu table and save it as a new table otu3 <- aggregate(otu2[,1:sample_no], by=list(otu2$taxonomy), FUN=sum) # Name the first column taxonomy because R stores the column # we told it to aggregate by as the first column names(otu3)[1] <- "taxonomy" # We can see that the consolidating worked by checking how many rows we # now have - that's how many phyla there are (level=2) nrow(otu3) ## [1] 17 Let's replace the rownames with the taxonomy, and get rid of the taxonomy column. # Set rownames as taxonomy rownames(otu3) <- otu3$taxonomy # Keep all columns in the otu table that do NOT (!) have the column # header "taxonomy" otu3 <- otu3[,!names(otu3) == "taxonomy"] Filtering OTUs and Samples Let's filter the OTU table to keep only OTUs that are in at least 5 people, and that have at least 100 counts. #Set the number of samples cut off nsamples <- 5 # `otu > 0` tells R to take all values and see if they are greater # than 0. If so it will store it as TRUE, if not greater than 0 they # get a false. Then we take the `rowSums()` of that value, where # TRUE=1 and FALSE=0. Then we ask if the row sums are greater than # then number of samples we set as the cut off. It will store # TRUE/FALSE values for each row. cutoff_nsamples <- rowSums(otu3 > 0) > nsamples # Keep only samples that are 'TRUE' (meet the cutoff value) otu3 <- otu3[cutoff_nsamples,] ncounts <- 99 # This cutoff is different than the previous. We care about how MANY # counts each taxon has. We only want to keep those with a minimum # of 100 counts across all samples (greater than 99) cutoff_ncounts <- rowSums(otu3) > ncounts #Keep only taxa that meet the cutoff otu3 <- otu3[cutoff_ncounts,] Calculating relative abundances We took out some taxa when filtering, so we need to convert count into the relative abundance of that sample. To do this, we will use a for loop # We want to use all the columns (since we already took out taxonomy) for(i in 1:ncol(otu3)){ otu3[,i] <- otu3[,i]/sum(otu3[,i]) } In order to make our OTU table and results more easily compatible with our metadata, we want the sample IDs as the rows and the taxa as the columns. We'll use the function t() to transpose the data frame. We can then make a column SampleID that will be useful later. # Transpose as a data frame otu3 <- data.frame(t(otu3)) # Make a column that is the Sample IDs (which are the rownames) otu3$SampleID <- rownames(otu3) # Let's save a backup of this filtered OTU table otu_backup <- otu3 If you remember, ggplot likes to have all the data in one table. Now we will use a function called melt() from the library reshape2 to convert our data frame into three columns: one that has the sample ID, one that has taxa IDs, and one that has the relative abundances of the taxa in our sample. We'll also use the package plyr for it's function ddply() to aggregate our data nicely. Before we can move forward, you must get the packages needed to run the functions we will use. You can install packages with the install.packages() function. #You'll want to install these packages if you don't already have them library(reshape2) library(plyr) otu3 <- melt(otu3, id.vars = "SampleID", variable.name = "Taxa", value.name = "RelativeAbundance") Plotting Now we have a filtered table with three columns with which we can make a basic taxa summary plot (just grouped by sample ID). We'll use ggplot. library(ggplot2) # This will make a plot with the OTU table (otu), using the column # headers specified ggplot(otu3, aes(x = SampleID, y = RelativeAbundance, fill= Taxa)) + geom_bar(stat = "identity", position="fill") + # This makes it a bar plot (geom_bar()) scale_x_discrete(labels = NULL) #This takes off the x-labels (too hard to read) This plot is kind of messy! There are so many samples you can't easily see one sample from another. Adding Metadata Let's try piloting by a covariate. The easiest way to do that is to simply add our metadata values to our table. More columns means more potential variables to plot by. Let's go back to our full, filtered OTU table, melt it and then add metadata. otu3 <- otu_backup otu3 <- melt(otu3, id.vars = "SampleID", variable.name = "Taxa", value.name = "RelativeAbundance") Now we can add in our metadata, using the function merge(). First let's make sure we have the covariates/header names we think we do, and we can rename any that aren't right, and only keep the ones we're interested in. If you're looking at real metadata, you'll have a much longer list than the tutorial files. colnames(metadata) ## [1] "BarcodeSequence" ## [4] "BodySite" ## [7] "Description" "LinkerPrimerSequence" "Sex" "SRS_SampleID" "FASTA_FILE" "Age" # We only want to keep "Sex", "BodySite", and # "Description", which is the area of the body #This will keep only the columns with the headers we want columns_keep <- c("Sex","BodySite","Description") metadata2 <- metadata[,columns_keep] # Now we merge covariates to sample ids # First we need to make a column that is the sample IDs in the metadata2$SampleID <- rownames(metadata2) # This will drop any samples in the mapping file that aren't in the OTU table otu3 <- merge(otu3, metadata2, by="SampleID") Now we can plot according to body site. ggplot(otu3, aes(x=BodySite, y=RelativeAbundance, fill=Taxa)) + # using position="fill" makes sure it sums to 1 geom_bar(stat ="identity", position="fill") # We will want to shorten the x-labels. # We can even split our data up by sex using this method, # using an option called facet_grid(): ggplot(otu3, aes(x=BodySite, y=RelativeAbundance, fill=Taxa)) + geom_bar(stat ="identity", position="fill") + facet_grid(.~Sex) + # This will separate by sex scale_x_discrete(labels=c("LRC", "MV", "Sal.","Stool","SGP")) relabels the x axis # This Plot Specific Taxa We can also plot just specific taxa. For that, we can use the aggregated relative abundance table otu from the sex in the example above, and pull out a subset of the taxa we're specifically interested in. You'll need the exact taxa labels from the table to match. Say we want to look at Firmicutes and Actinobacteria: # If we don't remember the spelling, we can print all the taxa and # copy and paste: unique(otu3$Taxa) ## [1] k__Bacteria.p__Cyanobacteria k__Bacteria.p__Tenericutes ## [3] k__Bacteria.p__Fusobacteria k__Bacteria.p__Bacteroidetes ## [5] k__Bacteria.p__Verrucomicrobia k__Bacteria.p__TM7 ## [7] k__Bacteria.p__Actinobacteria k__Bacteria.p__Spirochaetes ## [9] k__Bacteria.p__Proteobacteria k__Bacteria.p__Firmicutes ## 10 Levels: k__Bacteria.p__Actinobacteria ... k__Bacteria.p__Verrucomicrobia # Let's subset to just Bacteroidetes and Actinobacteria taxaList <- c("k__Bacteria.p__Bacteroidetes", "k__Bacteria.p__Actinobacteria") # Let's make a new subsetted table that is just those phyla filtered <- subset(otu3, is.element(otu3$Taxa, taxaList)) We plot things the same, making sure not to use the option position="fill", since our abundances now should not add up to 1. Let's make our labels a little nicer, as well. ggplot(filtered, aes(x = Sex, y = RelativeAbundance, fill=Taxa)) + geom_bar(stat="identity") + labs(y = "Relative Abundance") + scale_fill_discrete(labels = c("Actinobacteria", "Bacteroidetes")) + scale_x_discrete(labels = c("Female", "Male")) There are almost unlimited parameters that you can play with to change the actual look of your plots. Below we use theme_bw() to make the background white, and modified the colors by making a color variable cols that we use to color the different taxa with scale_fill_manual(). cols <- c("purple","yellow") #Note that we have to use scale_fill_manual() instead of scale_fill_discrete # to specify colors ggplot(filtered, aes(x = Sex, y = RelativeAbundance, fill=Taxa)) + geom_bar(stat="identity") + labs(y = "Relative Abundance") + scale_x_discrete(labels = c("Female", "Male")) + theme_bw() + scale_fill_manual(labels = c("Actinobacteria", "Bacteroidetes"), values=cols) Differentiated OTUs in R We can test for taxa or OTU that are differentially abundant across sample types. To do this, we need to first transform our data out of the simplex. This means we want to go from working with compositional data to non-compositional data. Inputs The input data you need include the metadata and your OTU table that has low depth samples removed. Don't use the rarefied OTU table. The tables should be subsetted and ordered for sample ID. First we will take the taxonomy out of the OTU table, filter low abundant OTUs and low occurring OTUs: # We can store taxonomy and which OTUs they are to use for later # drop=F makes sure it stays as a table taxonomy_table <- otu_low2[,"taxonomy",drop=F] #Keep only the samples, drop taxonomy from table otu_low3 <- otu_low2[, ! names(otu_low2) =="taxonomy"] #Filter OTUs that are in low abundance #Change those less than 1/1 millionth of read depth to 0 otu_low3[otu_low3 < sum(colSums(otu_low3))/1000000] <- 0 #Change singletons to 0 (needed for low depth OTU tables) otu_low3[otu_low3 < 2] <- 0 #Filter the OTU table to keep OTUs in at least 5% of samples otu_low3 <- otu_low3[rowSums(otu_low3 > 0) > (0.05*ncol(otu_low3)),] Now we will transform the data using a centered log-ratio transformation. This needs the robCompositions package. library(robCompositions) ## Warning: package 'robCompositions' was built under R version 3.3.2 ## Loading required package: robustbase ## Warning: package 'robustbase' was built under R version 3.3.2 ## Loading required package: data.table ## ## Attaching package: 'data.table' ## The following objects are masked from 'package:reshape2': ## ## dcast, melt ## Loading required package: e1071 ## Warning: package 'e1071' was built under R version 3.3.2 ## Loading required package: pls ## Warning: package 'pls' was built under R version 3.3.2 ## ## Attaching package: 'pls' ## The following object is masked from 'package:stats': ## ## loadings ## sROC 0.1-2 loaded ## ## Attaching package: 'robCompositions' ## The following object is masked from 'package:robustbase': ## ## alcohol #Convert any 0 to 0.65 to allow for CLR transform #Ref: Palarea-Albaladejo J, et al. 2014. JOURNAL OF CHEMOMETRICS. A bootstrap estimation scheme for chemical compositional data with nondetects. 28;7:585– 599. otu_low3[otu_low3 == 0] <- 0.65 #Centered log-ratio transform for compositions #Ref: Gloor GB, et al. 2016. ANNALS OF EPIDEMIOLOGY. It's all relative: analyzing microbiome data as compositions. 26;5:322-329. #convert to samples as rows otu_table <- t(otu_low3) #Centered log-ratio tranform the data otu_table <- cenLR(otu_table)$x.clr Test For Differences Now our otu table has samples as rows and OTUs as samples. We can now loop through the OTUs and test for differences according to our metadata. Lets test for differences by bodysite. Because we transformed our data, we can now use parametric tests to look for differentiated OTUs. We can use ANOVA or t-test depending on the number of groups to test. # Let's test the first OTU (first column) in the OTU table # What is the name of this OTU? We can look it up in our table # We pick the row we want using the otu id in the column this_taxa <- taxonomy_table[colnames(otu_table)[1],"taxonomy"] this_taxa ## [1] "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus; s__" # Now lets run the test using the first column and according to bodysites in the metadata aov_test <- aov(otu_table[,1] ~ metadata$BodySite) summary(aov_test) ## ## ## ## ## Df Sum Sq Mean Sq F value Pr(>F) metadata$BodySite 4 47.77 11.942 27.95 1.91e-15 *** Residuals 97 41.44 0.427 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 The output from aov() is more complicated than kruskal.test(). aov() output is a list that includes information about the degrees of freedom (Df), the Sum of Squares (Sum Sq), the Mean Square (Mean Sq), the F statistic/ratio (F value); and the P-value (Pr(>F)). For now, we are interested in the p-value, which can be indexed from summary(aov_test) with summary(aov_test)[[1]][1,5]. summary(aov_test)[[1]][1,5] ## [1] 1.913061e-15 Let's plot this example and see what it looks like. # Because ggplot likes to have all the data in one table, let's make a new table to plot with plot_table <- data.frame(otu_table) #Note that this will store and x infront of all the numerical column names plot_table$BodySite <- metadata$BodySite #store which column (header) you want to plot this_otu <- colnames(plot_table)[1] # We can also store its name # will split the taxonomy based on the ";" # Then take the last two values (genus and species) to shorten the name name = strsplit(this_taxa, ";", fixed=T)[[1]] names_tail = tail(name, n=2) # This will plot the transformed abundnces for each body site # Note that we have to use aes_string() because we are filling in the y column header with a string ggplot(plot_table) + geom_boxplot(aes_string(x="BodySite", y=this_otu, fill="BodySite")) + scale_fill_manual(values = c("tomato", "darkorchid4", "gold","tan4","dodgerblue")) + scale_x_discrete(labels=c("ear fold", "vagina", "saliva", "stool", "plaque")) + labs(y=names_tail) Test All Taxa Using for-loops we can apply this test to all of the OTUs in our table. In the loop, the transformed abundance of each OTU for all samples will be assigned to the variable 'y', the dependent variable. The next line in the loop calls the aov() function, and the last line assigns each p-value to a pvals vector. #The first step is to make an empty vector that will store our p-values. pvals <- c() #Loop through each column except the last (because it's body site) for(i in 1:(ncol(plot_table)-1)){ aov_out <- aov(plot_table[,i] ~ plot_table$BodySite) pvals[i] <- summary(aov_out)[[1]][1,5] } Find Significant p-Values Let's see how many p-values are significant for each covariate. We will use an alpha of 0.05. sum(pvals < 0.05) ## [1] 579 False Discovery Rate Because we did so many statistical comparisons, we need to correct for type I errors (rejection of a true null hypothesis, also known as a false positive). Controlling the false discovery rate helps to control the expected proportion of false positives. To do this we use the p.adjust() function with the 'fdr parameter. pvals.fdr = p.adjust(pvals, "fdr") Let's see how many p-values are significant for each covariate after the false discovery rate correction. sum(pvals.fdr < 0.05) ## [1] 577 Plotting Significant OTUs If we wanted to plot all the significantly different taxa we could do so with a for loop. We will plot the first three significantly different taxa across the body site. # Index just the first three significantly different OTUs # which() tells us the position of the values that are true (< 0.05), and [1:3] # takes the first 3. first_three <- which(pvals.fdr < 0.05)[1:3] # This loops through the significant OTUs, stores their name # and makes a box plot of the transformed abundances of the taxa # We then store the plots in a list plot_list <- list() for(i in 1:length(first_three)){ index <- first_three[[i]] this_otu <- colnames(plot_table)[i] this_taxa <- taxonomy_table[i,"taxonomy"] name <- strsplit(this_taxa, ";", fixed=T)[[1]] taxon <- paste(name[6], name[7], sep=" ") # Note that we have to use aes_string() because we are filling in the y column header with a string plot_out <- ggplot(plot_table) + geom_boxplot(aes_string(x="BodySite", y=this_otu, fill="BodySite")) + scale_fill_manual(values = c("tomato", "darkorchid4", "gold","tan4","dodgerblue")) + scale_x_discrete(labels=c("ear fold", "vagina", "saliva", "stool", "plaque")) + labs(y=taxon) plot_list[[i]] <- plot_out } # Now lets print the three plots to a pdf # each plot will be a new page in the pdf pdf("Diff_taxa.pdf", height=4, width=6) for(i in 1:length(plot_list)){ plot(plot_list[[i]]) } dev.off() ## quartz_off_screen ## 2 PCoA in R We use QIIME to calculate our distance matrices using beta_diversity.py or beta_diversity_through_plots.py command. We then can use R to make 2D PCoA plots of this data. Let's start with the packages we need to load. If these are not installed you can install them first with install.packages(). library(ape) library(vegan) library(ggplot2) Load Data You'll need to have your beta diversity and metadata files loaded and subsetted to the correct number and order of samples. Principal Coordinates Analysis Now we can use the function pcoa() from the R package ape to actually calculate our principal coordinate vectors. To make plotting easier, we save the vectors as a data frame, set up new column titles, and add a column of sample IDs. # Run the pcoa() function on the beta diversity table, # and store the vectors generated as a dataframe PCOA <- data.frame(pcoa(beta)$vectors) # If you look at the PCOA table, you'll see the column names # are the 'axes' and the row names are sample IDs. We want them to # be labeled "PC" instead of "axis" # We will make a vector with place holders new_names <- rep("", ncol(PCOA)) # Fill in first with PC followed by the number (e.g. PC1, PC2, PC3...) for(i in 1:ncol(PCOA)){ new_names[i] <- paste("PC",i, sep="") } # Replace the column names of PCOA names(PCOA) <- new_names # Create a column that is SampleIDS for PCOA PCOA$SampleID <- rownames(PCOA) #Create a column that is SampleIDs for the metadata metadata$SampleID <- rownames(metadata) # Merge the metadata and beta diversity PCOA <- merge(PCOA, metadata, by = "SampleID") Plotting the PCoA Now you have a data frame that has all of your PCOA vectors and all the relevant metadata, matched up by sample ID. In this example we will plot the first two principal coordinates (PC1 and PC2). If you remember, the first principal coordinates should explain the majority of the variation in the data. These will be pretty simple scatter plots. # Note that geom_point() makes it a scatter plot where the points # are colored according to BodySite ggplot(PCOA) + geom_point(aes(x = PC1, y = PC2, color = BodySite)) + labs(title="PCoA Plot") # Now let's add some clusters. This makes it look great, but can # also be misleading and make us think there are groups when there # aren't. Note that we are using BodySite to color the points and body # AREA to fill the clusters ggplot(PCOA) + geom_point(aes(x = PC1, y = PC2, color = BodySite)) + labs(title="PCoA and Clusters") + stat_ellipse(alpha = 0.3, geom="polygon", linetype="blank", aes(x = PC1, y = PC2, fill = Description)) Notice that the color of the ellipses don't really match the color of the points they are clustering. The colors are determined by which order the body area is factored by. We can make this order line up with the order of the body sites. # Check order of levels of body area (Description) levels(PCOA$Description) ## [1] "Gastrointestinal_tract" "Oral" ## [3] "Skin" "Urogenital_tract" # Check order of levels in BodySite levels(PCOA$BodySite) ## [1] "Left_Retroauricular_crease" "Mid_vagina" ## [3] "Saliva" "Stool" ## [5] "Subgingival_plaque" # Reset levels of Bodysite to match levels of body area PCOA$BodySite <- factor(PCOA$BodySite, levels = c("Stool", "Saliva", "Subgingival_plaque", "Left_Retroauricular_crease", "Mid_vagina")) #Replot ggplot(PCOA) + geom_point(aes(x = PC1, y = PC2, color = BodySite)) + labs(title="PCoa and Clusters") + stat_ellipse(alpha = 0.3, geom="polygon", linetype="blank", aes(x = PC1, y = PC2, fill = Description)) Changing Plotting Parameters The following long command throws a whole pile of customization bells and whistles at ggplot the fill colors are tweaked, the points are a bit bigger, the font sizes are bigger. This is just to give you a taste of all the different aesthetic options you can play around with. You should try modifying each parameter and see what it does to the plot. ggplot(PCOA) + stat_ellipse(alpha = 0.3, geom="polygon", linetype="blank", aes(x = PC1, y = PC2, fill = Description)) + geom_point(size = 3, aes(x = PC1, y = PC2, color = BodySite)) + labs(title="Human Microbiome Betadiversity") + scale_color_discrete(name = "Body Site", labels = c("Ear Fold","Vagina", "Saliva", "Stool","Plaque")) + scale_fill_hue(h.start = 20, name = "Body Area", labels = c("GI Tract", "Oral","Skin", "Urogenital Tract")) + theme(panel.background = element_rect(color = "grey97"), plot.title = element_text(size = 16), axis.title = element_text(size = 14), axis.text = element_text(size = 12), legend.title = element_text(size = 14), legend.text = element_text(size = 12)) + theme_bw() + guides(color = guide_legend(override.aes = list(fill = "grey97", size = 4)), fill = guide_legend(override.aes=list(shape = NA)) ) Testing for Signifcant Differences Just like we did for alpha diversity, we can test for significant differences in beta diversity. For example, let's say we want to test for significant differences between body sites with the Unweighted UniFrac data. We already loaded that data when ran it through the pcoa() function above. But since we are learning let's load it again so we can get comfortable with the code. adonis adonis is a non-parametric statistical test, which means it uses permutations of the data to determine the p-value, or statistical significance. It requires: * a distance matrix file, such as a UniFrac distance matrix * a mapping file, and a category in the mapping file to determine sample grouping from It computes an R2 value (effect size) which shows the percentage of variation explained by the supplied mapping file category, as well as a p-value to determine the statistical significance. More information of the adonis test can be found here: http://qiime.org/tutorials/category_comparison.html, http://cc.oulu.fi/~jarioksa/softhelp/vegan/html/adonis.html # Turn the beta table into resemblance matrix using as.dist() beta_dist = as.dist(beta) # Test for a significant difference across all groups. # This will run an ADONIS test. ad = adonis(beta_dist ~ metadata[,"BodySite"], data=metadata, permutations=999) ad ## ## Call: ## adonis(formula = beta_dist ~ metadata[, "BodySite"], permutations = 999) ## ## Permutation: free ## Number of permutations: 999 ## ## Terms added sequentially (first to last) ## ## Df SumsOfSqs MeanSqs F.Model ## metadata[, "BodySite"] 4 12.725 3.1811 15.789 ## Residuals 97 19.544 0.2015 ## Total 101 32.269 ## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' data = metadata, R2 Pr(>F) 0.39433 0.001 *** 0.60567 1.00000 0.1 ' ' 1 Note: Pr indicates that at an alpha of 0.05, the grouping of samples by 'BodySite' is statistically significant. The R2 value indicates that approximately 39% of the variation in distances is explained by this grouping. It's important because a p-value can indicate significance but we must also notice how much of the variation the input variables contribute. Now let's write our output to a file. # This takes just the analysis of variance table (aoc.tab) # from the output a.table <- ad$aov.tab # This writes it to a file write.table(a.table, file="analysis.txt", quote=FALSE, sep="\t", col.names = NA)
Source Exif Data:
File Type : PDF File Type Extension : pdf MIME Type : application/pdf Linearized : No Page Count : 134 PDF Version : 1.4 Title : Microsoft Word - Full_Comp_Micro_lab_manual.docx Producer : Mac OS X 10.11.6 Quartz PDFContext Creator : Word Create Date : 2017:10:04 22:21:04Z Modify Date : 2017:10:04 22:21:04ZEXIF Metadata provided by EXIF.tools