Main Manual
User Manual: Pdf
Open the PDF directly: View PDF .
Page Count: 35
Download | |
Open PDF In Browser | View PDF |
BioPype: A crash course in bioinformatics and custom pipelines Ethan Gniot May 2018 Todo list Still need to add information about the inflammatory bowel disease dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Make URLs footnotes instead of appendix entries . . . . . . . . . . . . 8 Fix bibliography entries in this paragraph and in general. They’re not correctly referencing even though bib file has entries. . . . . . . . 8 Add further software that you end up using (e.g., USEARCH). ALSO, make sure to update the section that mentions installing packages if you do so. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 add 3rd column to table with link to documentation/download source 13 Write a blurb explaining the benefits of using a virtual environment. . 14 Link to resource for further reading on virtual environments . . . . . . 14 Write blurb about what packages are. . . . . . . . . . . . . . . . . . . 15 Link to resource for further reading about packages. . . . . . . . . . . 15 Solve the issue with failing to center figures when they’re on their own page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Talk about setting up the sra-tools workspace (https://trace.ncbi. nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=std) 16 PYPIPER IS ONLY COMPATIBLE WITH MAC AND LINUX. Start by coding with just subprocess module commands. If the scripts work on Windows computers, then forget about using pypiper. But if the subprocess scripts don’t work on windows, then we’ll be developing exclusively for Mac anyways, so you could use pypiper without any worries. In that case, talk about installing pypiper here. OTHERWISE, delete this section. . . . . . . . . . 16 show image of what the prompt looks like when it’s done initializing. . 16 Reference the figure pypiper-already-installed . . . . . . . . . . . . . . 18 Figure: pypiper-already-installed . . . . . . . . . . . . . . . . . . . . . 18 reference figure pypiper-wrong-location . . . . . . . . . . . . . . . . . . 18 Figure: pypiper-wrong-location . . . . . . . . . . . . . . . . . . . . . . 18 Talk about choosing between PyCharm and other options . . . . . . . 19 (Talk about putting all the tools in the same path/directory) . . . . . 19 *****Show the user what the result of the commands are. Right now we show the commands, but not the results. e.g., when we run a command to get a list of accession numbers, follow it up by typing the command that just prints the list of numbers in the terminal. Then the user can see what their results should look like if they did everything correctly. . . . . . . . . . . . . . . . . 21 Figure: highlight where to click to download the runinfo table . . . . . 22 Either shorten this margin-label, or put it in a footnote. . . . . . . . . 22 Give a specific name to the RunInfo Table file. People might get confused if they needs to fill in the name of the file themselves. . 22 Find resource for explaining the .sra file format that we download from the SRA Database . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2 Contents 1 Foreword 5 1.1 Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2 How to use this book . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Pre-requisites . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2 Source Code 7 3 Sample Information 8 4 Setting the scene 9 5 Microbiome Analysis 10 5.1 The Gut Microbiome . . . . . . . . . . . . . . . . . . . . . . . 10 5.2 Relative Abundance Analysis . . . . . . . . . . . . . . . . . . 10 5.3 Metagenomics . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 5.4 Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 6 How to Find Tools 11 6.1 Finding Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 6.2 Finding Software . . . . . . . . . . . . . . . . . . . . . . . . . 11 7 The Plan 12 8 Software and Set-up 13 8.1 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 8.2 Set-up and Install Dependencies 8.3 . . . . . . . . . . . . . . . . 13 8.2.1 Install Anaconda . . . . . . . . . . . . . . . . . . . . . 13 8.2.2 Create a New Virtual Environment . . . . . . . . . . . 14 8.2.3 Install packages . . . . . . . . . . . . . . . . . . . . . . 15 8.2.4 Integrated Development Environment (IDE) . . . . . . 19 8.2.5 PATH . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Analysis Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . 19 9 The Dataset 9.1 Download Dataset . . . . . . . . . . . . . . . . . . . . . . . . 21 9.1.1 9.2 20 Creating the code . . . . . . . . . . . . . . . . . . . . 24 Perform Quality Control on Dataset . . . . . . . . . . . . . . 24 10 Relative Abundance Analysis 25 11 Predict ORFs 26 12 Create Non-redundant Gene Sets 27 13 Align Genes 28 14 Get GenBank Accession Numbers 29 15 Find COG Functional Classes 30 16 New Analysis 31 A Web Links 32 B Referenced Studies 34 4 1 1.1 Foreword Goal This tutorial aims to improve your general understanding of bioinformatics through several methods: • Define technical terms commonly used in bioinformatics methods and found in the literature. • Provide a collection of various useful resources, including... 1. Resources for finding tools, data, and background information that can help answer your research questions. 2. Resources that explain details about bioinformatics concepts and techniques in beginner-friendly language. • Demonstrate how Python can be used to answer your research questions by combining existing bioinformatics tools and automating repetitive or time-consuming tasks. 1.2 How to use this book Most reference materials are consolidated in the appendices at the end of the book. The main text of this book is written in the right-hand margin. The left-hand margin contains special markers and important notes to the reader. The book can be used as a self-paced tutorial with the help of the markers described below. ——————————- This is a margin label. I will write things here to further explain the main text, define jargon, etc. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas eu felis sodales, interdum purus nec, interdum ex. Integer at nunc ultricies, tempus nibh eget, egestas risus. Suspendisse aliquam, lorem at dictum accumsan, dolor elit euismod velit, a gravida risus libero non felis. Donec a tortor tempor, scelerisque sapien posuere, volutpat erat. Morbi in imperdiet velit. Vivamus sagittis, massa sit amet venenatis euismod, elit eros aliquam diam, molestie faucibus lectus nisi euismod erat. Nulla id dictum mi. ! → The arrow that points to this line is an "Attention" marker. It will indicate key pieces of information that you should pay special attention to. Curabitur egestas aliquam nisl, pharetra finibus mauris placerat nec. Nam in diam risus. Nulla mollis purus quis feugiat tristique. Vestibulum et sollicitudin ante, at sagittis ipsum. Nunc hendrerit ante sed massa semper eleifend. → This kind of annotation will reference appendix entries that you can consult for moredetailed information about the main text. Ut ultrices eros velit, at faucibus ante rutrum eget. Pellentesque a molestie diam. Curabitur mattis dui a risus lacinia fringilla. Phasellus porttitor elit nec neque euismod, id ultrices elit lobortis. Aliquam molestie sem. Curabitur sit amet urna faucibus, vulputate arcu sit amet, fermentum ipsum. Nulla facilisi. Nam ullamcorper eget leo id malesuada. 1.3 Pre-requisites Things to know Computer requirements • At least 4GB RAM bare minimum, 16GB RAM if possible (basically, the more RAM, the better) • Must be able to run macOS 10.12 Sierra, preferably macOS 10.13 High Sierra • At least 500GB storage (ideally several TB) 6 2 Source Code (Source code can be found at github.com/EthanGniot/LU-microbiome) 3 Sample Information Still need to add information about the inflammatory bowel disease dataset While we are building the LU-microbiome pipeline, we will need a dataset that we can use to test our pipeline throughout the process and make sure it is working as intended. In order to do this, we must use a dataset that’s already been analyzed so that we know what the results should look like. When we test our pipeline, if our results match the results of the original analysis, then we will know that our tool is working correctly. There are several test datasets available to us for testing the feature that analyzes the relative abundance of bacteria in the microbiota. Make URLs footnotes instead of appendix entries → Caporaso et al, 2011 [?] → THIS CITATION NEEDS TO BE FIXED [?] Fix bibliography entries in this paragraph and in general. They’re not correctly referencing even though bib file has entries. The first is the dataset used in both the QIIME “Illumina Overview Tutorial" (A.1) and the QIIME 2 “Moving Pictures" tutorial (A.2) derived from the Moving Pictures of the Human Microbiome study, where two human subjects collected daily samples from four body sites: the tongue, the palm of the left hand, the palm of the right hand, and the gut (via fecal samples obtained by swapping used toilet paper). These data were sequenced using the barcoded amplicon sequencing protocol described in Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. A more recent version of this protocol that can be used with the Illumina HiSeq 2000 and MiSeq can be found here. (Here is information about the untested inflammatory bowel disease dataset that we will analyze using the completed pipeline) 4 Setting the scene ("Here" is a hypothetical situation/research question that a student may have. This is the research question that will be answered by the pipeline we are creating.) 5 Microbiome Analysis (This chapter will give some general background information about the topics listed below. More-detailed information will be provided in later chapters when we are actually creating the pipeline) 5.1 The Gut Microbiome 5.2 Relative Abundance Analysis 5.3 Metagenomics 5.4 Python 6 6.1 How to Find Tools Finding Data (Here is where we talk about various databases that users can use to find general information, data files, study results, public datasets, etc.) 6.2 Finding Software (Here is where we talk about ways/places that people can look for software programs that can help answer their research question.) 7 The Plan (Break down the sub-tasks required to accomplish the two main tasks: Relative abundance analysis and metagenomic analysis) 1. What is my research question? 2. Is there an existing tool that I can use to directly answer my research question? If not, proceed to Step 3. 3. What is the step-by-step process required to answer my research question? 4. What existing tools are available that can help me accomplish each of these steps? 5. How do I write code that uses these tools to accomplish the steps? 8 8.1 Software and Set-up Software Add further software that you end up using (e.g., USEARCH). ALSO, make sure to update the section that mentions installing packages if you do so. (Table of software name, name in PATH, version number, function of the software for each one we’re gonna use) add 3rd column to table with link to documentation/download source Software name Anaconda Biopython BLAT matplotlib pandas sra-tools trim-galore Version number 5.1 1.70 35 2.2.2 0.22.0 2.8.2 0.4.5 Table 8.1: Software used to create the tutorial pipeline. 8.2 Set-up and Install Dependencies Before we write any code, there are several steps that must be completed to prep your machine for the tasks we will be performing in this tutorial. Without these prerequisites, the code you write during this tutorial will not work correctly: 1. Install and open Anaconda 2. Create a new virtual environment 3. Install packages 8.2.1 Install Anaconda → RESOURCE FOR LEARNING ABOUT PYTHON PACKAGES The Anaconda program will play a key role in this tutorial. Anaconda is essentially Python and a lot of scientific computing tools bundled together, along with many popular add-ons to Python called packages. Downloading all of these tools individually can be difficult, as the quirks of one package may conflict with another when they’re installed manually; using Anaconda to install packages greatly simplifies the process because Anaconda can smoothly handle all of the minute details that cause manual installations to fail. Install and Open: 1. Go to the download page for the Anaconda distribution at https://www.anaconda.com/download. 2. Select your preferred operating system from the Windows, macOS, or Linux tabs, then select the Download option for the Python 3.6 version (Figure 8.1) and follow the installation instructions. Figure 8.1: The Anaconda download options provided on the Anaconda distribution website at https://www.anaconda.com/download . 3. After installation is complete, open the application named "AnacondaNavigator" (the icon looks like ). After a brief start-up period, you should see the following window (Figure 8.2): Figure 8.2: The window displayed to the user upon opening AnacondaNavigator. 8.2.2 Create a New Virtual Environment Write a blurb explaining the benefits of using a virtual environment. Link to resource for further reading on virtual environments 14 Figure 8.3: The Environments window of the Anaconda-Navigator. Make sure the computer has an internet connection while completing this section, otherwise Anaconda will not let you create a virtual environment. 1. On the left side of the Anaconda-Navigator window, click on the tab labeled Environments. (Figure 8.3) 2. Click the Create button on the bottom of the center panel. A new window titled "Create new environment" will appear. (Figure 8.4) 3. Enter a Name for the environment. You may choose any name you want, but for the sake of this tutorial we will name the new environment "BioPype". 4. Select the box labeled Python next to the Packages heading. 5. Choose the latest version of Python from the adjacent drop-down menu (Python 3.6 is the most current version at the time of this writing, so we choose 3.6). 6. Click the Create button within the "Create new environment window". 8.2.3 Install packages Write blurb about what packages are. Link to resource for further reading about packages. 1. Change Anaconda’s current environment from the root environment by selecting the BioPype tab in the middle panel of the Environments window. 2. Click on the drop-down menu in the right-hand panel that says "Installed" and change it to "All". 15 Figure 8.4: The "Create new environment" window. 3. In the "Search Packages" box, enter "biopython". The search should return a package named "biopython". Select the checkbox to the left of the name. (Figure 8.5) • A pair of green and red boxes (reading "Apply" and "Clear", respectively) will appear in the bottom-right of the window once the package is selected. Do not click these just yet. Solve the issue with failing to center figures when they’re on their own page 4. Use the search bar to find and select the other packages listed in Table 8.1. Once all packages have been selected, click the green "Apply" button in the bottom right corner of the window, then select "Apply" again within the "Install Packages" window that appears. (Figure 8.6) Anaconda will now install the selected packages. 5. 6. Talk about setting up the sra-tools workspace (https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view= toolkit_doc&f=std) PYPIPER IS ONLY COMPATIBLE WITH MAC AND LINUX. Start by coding with just subprocess module commands. If the scripts work on Windows computers, then forget about using pypiper. But if the subprocess scripts don’t work on windows, then we’ll be developing exclusively for Mac anyways, so you could use pypiper without any worries. In that case, talk about installing pypiper here. OTHERWISE, delete this section. (a) Open a terminal window in the BioPype environment by clicking the "play" button on the BioPype environment tab and then selecting "Open Terminal". (b) Wait for the terminal window to finish opening. You’ll know it’s finished when you see show image of what the prompt looks like when it’s done initializing. (c) Install pypiper by typing the following at the command prompt, followed by pressing return/enter: p i p i n s t a l l −−u s e r p y p i p e r 16 Figure 8.5: Searching for a package. When a package is selected, the checkbox next to the package’s name will be green. Figure 8.6: The window displaying the packages and dependencies that will be installed. 17 (d) Check that the package was installed correctly by executing the following in the command prompt: conda l i s t This will generate a list of all the packages installed in the current environment. If you see the pypiper package listed, the installation was successful and you may skip the rest of this section. If not, proceed with the following steps. (e) Execute the install command from step (c) again. This time, the Terminal should return a message similar to the one displayed in Reference the figure pypiper-already-installed . The line that reads "Requirement already satisfied: pypiper in ...." tells us the location where the package was (incorrectly) installed. Missing figure pypiper-already-installed (f) Open a Terminal window, and navigate to the location indicated by the message from the previous step. For my example, I need to start at my home directory and walk through the following folders: .local | lib | python3.6 | site-packages. • The folders along the path to the pypiper installation may be hidden. On a Mac, these hidden folders are preceded by a "." If the path to the pypiper installation includes hidden locations, reveal them by pressing "Cmd + Shift + ." in the Finder window. • Once you find the site-packages folder containing two pypiper folders reference figure pypiper-wrong-location , copy those folders and their contents and paste them into the /anaconda/envs/BioPype/lib/python3.6/site-packages directory. The package should now be installed correctly. 18 Missing figure 8.2.4 pypiper-wrong-location Integrated Development Environment (IDE) Talk about choosing between PyCharm and other options 8.2.5 PATH (Talk about putting all the tools in the same path/directory) 8.3 Analysis Pipeline (Use figures to illustrate the stages of the pipeline) 19 9 The Dataset Recap In the previous chapter, we set up our machine so that it has all of the software BioPype needs in order to function. In this chapter, we will use BioPype to download experimental data, and then prepare them for analysis via a process called "quality control". We will also discuss how the BioPype functions used in this workflow were created. As described in Chapter 4, we want to investigate if there are any differences in the gut microbiomes of younger patients with IBD compared to older patients with IBD. Using the methods described in Chapter 6, we found a study in the SRA database with sequencing data that are useful to us. The webpage for the SRA Study is reproduced in Figure 9.1 and can be accessed at https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP115494 . Figure 9.1: The "SRA Study" webpage. The dataset is reproduced in Figure 9.2 and can be found by clicking on "Runs" under "Related SRA data" on the right side of the SRA Study webpage, or by going to https://trace.ncbi.nlm.nih.gov/Traces/study/ ?acc=SRP115494# . Figure 9.2: The sequencing data collected by the Longitudinal Multi’omics of the Human Microbiome in Inflammatory Bowel Disease study. (Note: the table’s columns extend beyond the right side of the page because there are 36 metadata categories.) To learn more about how to use these pages to find information about the study and its methods, please refer to Chapter 6. 9.1 Download Dataset *****Show the user what the result of the commands are. Right now we show the commands, but not the results. e.g., when we run a command to get a list of accession numbers, follow it up by typing the command that just prints the list of numbers in the terminal. Then the user can see what their results should look like if they did everything correctly. 1) Choose sample criteria Note: The dataset has a metadata category called "BioSample." This is not to be confused with the generic term "samples", which in this case are equivalent to one run (i.e., one row) of the data table. 1. First, we must decide what criteria we will use to choose our experimental samples. Remember: our research question investigates the effects of age on the gut microbiomes of inflammatory bowel disease patients. Looking at the study abstract from Figure 9.1, we can see that the researchers investigated two types of IBD: Crohn’s disease and ulcerative colitis. So one of the metadata categories from the dataset (i.e., the columns from the dataset in Figure 9.2) that we will use to select our samples 21 is the "host_disease" category. We will also select our samples based on the "host_age" metadata. 2. Now we want to select our samples based on these categories. Download the RunInfo Table file found on the study’s sample page. The RunInfo table is the browser’s sample table in a .txt file format. Missing figure highlight where to click to download the runinfo table 3. Open the command line interface for your machine’s operating system. On a mac, this is the Terminal application. 4. Navigate to your project’s folder and then confirm that you are in the right location by using the the following commands at the command prompt: $ cd$ pwd 5. Activate the Python interpreter by typing the following into the command prompt: $ python3 6. Import BioPype to the current Python environment: >>> import BioPype ! → • If not done already, stop and follow the instructions in Chapter 8 to configure the BioPype working directory and the SRA Toolkit Workspace Location. 7. Import BioPype.cmds.runtable to gain access to the RunTable class: Either shorten this margin-label, or put it in a footnote. Using "as" in an import statement lets you reference the full name of the imported package by some shorter name. It is generally considered better practice to not import packages "as" some other name, because it makes the code less explicit (which goes against the purpose of Python), but we do it in this manual for the sake of space. >>> import BioPype . cmds . r u n t a b l e a s r u n t a b l e 8. Create a RunTable object called "my_table" out of the RunInfo Table .txt file: Give a specific name to the RunInfo Table file. People might get confused if they needs to fill in the name of the file themselves. >>> my_table = r u n t a b l e . RunTable ( ' r u n i n f o t a b l e _ f i l e n a m e . t x t ' ) 22 • With the my_table object created, you can remind yourself what the metadata categories are by using the following command: >>> my_table . column_names [ ' Assay_Type ' , ' AvgSpotLen ' , ' BioSample ' , ' BioSampleModel ' , ' Experiment ' , ' I n s t r u m e n t ' , ' L i b r a r y S e l e c t i o n ' , ' L i b r a r y S o u r c e ' , ' Library_Name ' , ' LoadDate ' , ' MBases ' , ' MBytes ' , ' Organism ' , ' R e l e a s e D a t e ' , ' Run ' , ' SRA_Sample ' , ' Sample_Name ' , ' c o l l e c t i o n _ d a t e ' , ' env_biome ' , ' e n v _ f e a t u r e ' , ' e n v _ m a t e r i a l ' , ' host_age ' , ' h o s t _ d i s e a s e ' , ' host_sex ' , ' h o s t _ s u b j e c t _ i d ' , ' l a t _ l o n ' , ' B i o P r o j e c t ' , ' Consent ' , ' DATASTORE_filetype ' , ' DATASTORE_provider ' , ' I n s e r t S i z e ' , ' L i b r a r y L a y o u t ' , ' P l a t f o r m ' , ' SRA_Study ' , ' geo_loc_name ' , ' h o s t ' ] 9. We want to first group the samples based on the patients’ disease... >>>uc_samples=my_table . f i l t e r _ d a t a ( " h o s t _ d i s e a s e " , >>>cd_samples=my_table . f i l t e r _ d a t a ( " h o s t _ d i s e a s e " , ! → '== ' , ' u l c e r a t i v e c o l i t i s ' ) '== ' , " Crohn ' ' s d i s e a s e " ) • Note that the "Crohn’s disease" string within the parentheses in the second line of code has two single-quotes/apostrophes between the "n" and the "s" of "Crohn’s". This is because the value we input must exactly match the value that the original study’s researchers input for the host_disease category, and (for whatever reason) they entered the value for Crohn’s disease as: "Crohn’’s disease" This is an example of why it’s important to check the metadata carefully before you begin the analysis. If the argument does not exactly match the value of the metadata category, it won’t select the sample. 10. ... then we want to group them based on age: >>>uc_young >>>uc_old = >>>cd_young >>>cd_old = ! → = uc_samples . f i l t e r _ d a t a ( " host_age " , '<= ' , 2 1 ) uc_samples . f i l t e r _ d a t a ( " host_age " , '>= ' , 6 0 ) = cd_samples . f i l t e r _ d a t a ( " host_age " , '<= ' , 2 1 ) cd_samples . f i l t e r _ d a t a ( " host_age " , '>= ' , 6 0 ) • Note that the third argument passed to the filter_data() method is not a string like the previous two arguments. It is simply an integer (no quotation marks around it). 11. Now that we have groups of samples selected, we want to get a list of SRA accession numbers for each group. The accession numbers are what we will use to specify the files we want to download from the SRA database. >>>uc_young_nums >>>uc_old_nums = >>>cd_young_nums >>>cd_old_nums = → To learn more about .sra file format, see Appendix A.7. = uc_young . get_accession_numbers ( ) uc_old . get_accession_numbers ( ) = cd_young . get_accession_numbers ( ) cd_old . get_accession_numbers ( ) 12. We can now use the lists of accession numbers to download .sra files from the SRA database. However, these are large files that can easily take up hundreds of gigabytes of memory. They also take time to download (times will vary based on network speed). To account for hardware limitations that would severely bottleneck the analysis at 23 → The "random_sample_ subset()" method is preceded by "RunTable." in this case because it is a "@staticmethod". To learn more, see Appendix A.3. file-handling steps like these, we will randomly select a subset of only 3 accession numbers from each list we created in the previous step. The analysis will go much faster if we analyze fewer samples, so these subsets are what we will analyze going forward. >>>rs_uc_young >>>rs_uc_old = >>>rs_cd_young >>>rs_cd_old = = RunTable . random_sample_subset ( uc_young_nums , n=3) RunTable . random_sample_subset ( uc_old_nums , n=3) = RunTable . random_sample_subset ( cd_young_nums , n=3) RunTable . random_sample_subset ( cd_old_nums , n=3) 13. Now we need to download the .sra files linked to these accession numbers and get them into .fastq format. First, we’ll need to import the necessary functions: >>>from BioPype . cmds . downloadsra import download_sra , >>>from BioPype . cmds . downloadsra import c o n v e r t _ s r a _ t o _ f a s t q 14. Use the download_sra() function to download the .sra files: >>>download_sra ( rs_uc_young ) >>>download_sra ( rs_uc_old ) >>>download_sra ( rs_cd_young ) >>>download_sra ( rs_cd_old ) ! → • The .sra files will not be downloaded to the correct location if the SRA Toolkit Workspace Location has not been properly configured (see Chapter 8 for details). → To learn more about .fastq format, see Appendix A.4. 15. Use the convert_sra_to_fastq() function to convert the .sra files to .fastq format. → To learn more about what threads are, and how many you should use, see Appendix A.5. ! → >>>c o n v e r t _ s r a _ t o _ f a s t q ( t h r e a d s =4) • BioPype will not know where to look for the downloaded .sra files if the SRA Toolkit Workspace Location has not been properly configured (see Chapter 8 for details). • The more threads your computer can divide this process between, the better (it’s more complicated than that, but for now let’s go with it). The question is: how many threads is your computer capable of using? For that, you’ll have to do some Googling. You can find the resources I used to figure out how many threads my MacBook Pro can run (4) in Appendix A.6. To simplify: my computer has 1 CPU, the CPU has 2 cores, and each core can run 2 threads. 1 x 2 x 2 = 4 threads. 9.1.1 9.2 Creating the code Perform Quality Control on Dataset 24 10 Relative Abundance Analysis (How to compare relative abundance of bacterial taxa between experimental conditions using QWRAP.) 11 Predict ORFs (how to predict the ORFs of the sequencing reads) 12 Create Non-redundant Gene Sets (How to create non-redundant gene sets using the predicted ORFs and what kind of information they provide) (Align reads using BLAT) 13 Align Genes 14 Get GenBank Accession Numbers 15 Find COG Functional Classes 16 New Analysis (Walk user through analysis of new, untested dataset looking for age-related differences in patients with Inflammatory Bowel Disease.) Appendix A A.1 Web Links QIIME Illumina Overview Tutorial http://nbviewer.jupyter.org/github/biocore/qiime/blob/1.9.1/examples/ipynb/illumina_overview_ tutorial.ipynb A.2 QIIME 2 Moving Pictures Tutorial https://docs.qiime2.org/2018.2/tutorials/moving-pictures/ A.3 Explanation of @staticmethod decorator vs @classmethod decorator in Python The Basics: Static methods make code easier to read and let you use a class’ methods without needing to have an object of that class first. This is useful when it makes logical sense to place a function within a class (because the function is related to the other tasks that the class handles), but the function doesn’t need to operate on an object/the data of that class. Normal methods are called by typing: my_object.method. Static methods are called by typing: MyClass().method or MyClass.method. Basic explanation with slight background on class methods: https://julien.danjou.info/guide-python-static-class-abstract-methods/ More technical explanations: https://stackoverflow.com/questions/12179271/meaning-of-classmethod-and-staticmethod-for-beginner This Stack Overflow question has some basic, easy to understand explanations mixed in with more in-depth explanations. The top-voted answer isn’t the only one that is useful; each of the answers presents their explanation with a different degree of simplicity. Make sure to check several answers if the top ones don’t seem helpful. A.4 FASTQ Format https://galaxyproject.org/tutorials/ngs/ This link contains: 1. An explanation of the FASTQ file format 2. An explanation of PHRED quality scores with an accompanying figure (Fig 4) 3. The following quote: "Fastq format is not strictly defined and its variations will always cause headache for you. See https://www.ncbi.nlm.nih.gov/books/NBK242622/ for more information." • From the NCBI link: "Text formats, such as FASTQ, are supported, but are not the preferred submission medium. Poorly defined specifications and high variability within these formats tend to lead to a higher frequency of failed or problematic submissions." A.5 How many threads to use? https://www.jstorimer.com/blogs/workingwithcode/7970125-how-many-threads-is-too-many A.6 How many threads can my computer run? https://superuser.com/questions/1101311/how-many-cores-does-my-mac-have A.7 SRA/.sra file format Find resource for explaining the .sra file format that we download from the SRA Database 33 Appendix B Referenced Studies
Source Exif Data:
File Type : PDF File Type Extension : pdf MIME Type : application/pdf PDF Version : 1.5 Linearized : No Page Count : 35 Page Mode : UseOutlines Author : Title : Subject : Creator : LaTeX with hyperref package Producer : pdfTeX-1.40.18 Create Date : 2018:05:06 16:48:48-05:00 Modify Date : 2018:05:06 16:48:48-05:00 Trapped : False PTEX Fullbanner : This is pdfTeX, Version 3.14159265-2.6-1.40.18 (TeX Live 2017) kpathsea version 6.2.3EXIF Metadata provided by EXIF.tools