PyMS User Guide

Copyright © Andrew Isaac, Sean O'Callaghan and Vladimir Likić

PyMS version 1.0

A Python toolkit for processing of chromatography–mass spectrometry data

Contents

1 Introduction
  1.1 The PyMS project
  1.2 PyMS installation
    1.2.1 Downloading PyMS source code
  1.3 External Libraries
    1.3.1 Package 'NumPy' (required for all aspects of PyMS)
    1.3.2 Package 'pycdf' (required for reading ANDI-MS files)
    1.3.3 Package 'Pycluster' (required for peak alignment by dynamic programming)
    1.3.4 Package 'scipy.ndimage' (required for TopHat baseline corrector)
    1.3.5 Package 'matplotlib' (required for plotting)
    1.3.6 Package 'mpi4py' (required for parallel processing)
  1.4 Notes on PyMS installation on Linux
  1.5 PyMS installation tutorial (current)
    1.5.1 Extra Dependencies
  1.6 PyMS installation tutorial (old dependencies)
  1.7 Troubleshooting
    1.7.1 Pycdf import error
  1.8 PyMS tutorial and examples
  1.9 Using PyMS on Windows
    1.9.1 PyMS installation on Windows
    1.9.2 Example processing GC-MS data on Windows
2 GC-MS Raw Data Model
  2.1 Introduction
  2.2 Creating a "GCMS data" object
    2.2.1 Reading JCAMP GC-MS data
    2.2.2 Reading ANDI GC-MS data
  2.3 A GCMS data object
    2.3.1 Methods
    2.3.2 A Scan data object
    2.3.3 Exporting data and obtaining information about a data set
    2.3.4 Comparing two GC-MS data sets
3 GC-MS data derived objects
  3.1 IntensityMatrix Object
    3.1.1 Discussion of Binning Boundaries
    3.1.2 Build intensity matrix
    3.1.3 Build intensity matrix parameters
    3.1.4 Build integer mass intensity matrix
  3.2 MassSpectrum object
  3.3 IonChromatogram object
    3.3.1 Writing IonChromatogram object to a file
  3.4 Saving data
  3.5 Importing ASCII data
4 Data filtering
  4.1 Introduction
  4.2 Time strings
  4.3 Intensity Matrix resizing
    4.3.1 Retention time range
    4.3.2 Mass spectrum range and entries
  4.4 Noise smoothing
    4.4.1 Window averaging
    4.4.2 Window Averaging on Intensity Matrix
    4.4.3 Savitzky-Golay noise filter
    4.4.4 Savitzky-Golay Noise filtering of Intensity Matrix Object
  4.5 Baseline correction
    4.5.1 Tophat Baseline correction on an Intensity Matrix object
  4.6 Pre-processing the IntensityMatrix
5 Peak detection and representation
  5.1 Peak Object
    5.1.1 Creating a Peak Object
    5.1.2 Peak Object properties
    5.1.3 Modifying a Peak Object
  5.2 Peak detection
    5.2.1 Sample processing and Peak detection
  5.3 Filtering Peak Lists
  5.4 Noise analysis for peak filtering
  5.5 Peak area estimation
    5.5.1 Individual Ion Areas
    5.5.2 Reading the area of a single ion in a peak
6 Peak alignment by dynamic programming
  6.1 Preparation of multiple experiments for peak alignment by dynamic programming
    6.1.1 Creating an Experiment
    6.1.2 Multiple Experiments
  6.2 Dynamic programming alignment of peak lists from multiple experiments
  6.3 Between-state alignment of peak lists from multiple experiments
  6.4 Common Ion Area Quantitation
    6.4.1 Using the Common Ion Algorithm
7 The Display module
  7.1 Displaying the TIC
  7.2 Displaying multiple ion chromatogram objects
  7.3 Displaying the mass spectrum
  7.4 Displaying detected peaks on the TIC plot
    7.4.1 User interaction with the plot window
8 Parallel processing with PyMS
  8.1 Requirements
    8.1.1 Installation of 'mpich2'
    8.1.2 Installation of 'mpi4py'
  8.2 Background to using PyMS in parallel
  8.3 Using PyMS in parallel
9 GCMS Simulator
  9.1 Setting up input data for the simulator
    9.1.1 Providing input to the simulator
  9.2 Running the simulator
  9.3 Adding noise to the simulated data
    9.3.1 Constant scale noise
    9.3.2 Variable scale noise
  9.4 Adding noise to the whole simulated dataset
    9.4.1 Constant scale gaussian noise
    9.4.2 Variable scale gaussian noise
  9.5 Detecting Peaks in the Simulated data
Appendix
A Processing GC-MS data using PyMS
  A.1 Processing GC-Quad data
    A.1.1 Instrument
    A.1.2 Raw Data
    A.1.3 Processing using PyMS
  A.2 Processing GC-TOF data
    A.2.1 Instrument
    A.2.2 Raw Data
    A.2.3 Processing using PyMS

Chapter 1

Introduction

PyMS is software for processing of chromatography–mass spectrometry data. PyMS is written in the Python programming language [1], and is released as open source under the GNU General Public License version 2.

The driving idea behind PyMS is to provide a framework and a set of components for rapid development and testing of methods for processing chromatography–mass spectrometry data. PyMS provides interactive processing capabilities through any of the various interactive Python front ends ("shells"). PyMS is essentially a Python library of chromatography–mass spectrometry processing functions, and individual commands can be collected into scripts which can then be run non-interactively when it is preferable to process data in batch mode.

PyMS functionality consists of modules which are loaded when needed. For example, one such module provides display capabilities, and can be used to display time-dependent data (e.g. the total ion chromatogram), mass spectra, and signal peaks. This module is loaded only when visualisation is needed. If one is interested only in noise smoothing, only the noise filtering functions are loaded into the Python environment and used to smooth the data, while the visualisation module (and other modules) need not be loaded at all. This modularity is supported on all levels of PyMS.
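The deferred-loading behaviour described above can be sketched with a generic Python pattern (this is an illustration of the idea only, not PyMS's actual loading mechanism; the helper name is hypothetical):

```python
import importlib

def load_when_needed(name):
    """Import a module only at the point it is first required;
    return None if the (optional) dependency is missing."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None

# A script that only smooths noise never triggers the import of a
# plotting library; it is loaded on demand, if at all.
plotting = load_when_needed("matplotlib")  # may be None on a minimal system
```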
For example, if one is interested in noise filtering with the Savitzky-Golay filter, only the sub-module for the Savitzky-Golay filter needs to be loaded from the noise smoothing module, leaving other modules, as well as other noise smoothing sub-modules, untouched. This organisation into hierarchical modules ensures that the processing pipeline is put together from well-defined modules, each responsible for a specific function. Furthermore, different functionalities are completely decoupled from one another, so that new functionality (such as a test or prototype of a new algorithm) can be implemented efficiently without breaking any existing functionality.

1.1 The PyMS project

The PyMS project consists of three parts, each of which exists as a project in the Google Code repository and can be downloaded separately. The three parts are:

• pyms – The PyMS code (http://code.google.com/p/pyms/)
• pyms-docs – The PyMS documentation (http://code.google.com/p/pyms-docs/)
• pyms-test – Examples of how to use PyMS (http://code.google.com/p/pyms-test/)

The project 'pyms' contains the source code of the Python package PyMS. The project 'pyms-docs' contains the PyMS style guide (relevant for those who contribute to the PyMS code) and the User Guide (this document). The project 'pyms-test' contains tests and examples showing how to use various PyMS features. These examples are explained in detail in subsequent chapters of this User Guide. In addition, the current PyMS API documentation (relevant for those who are interested in PyMS internals) is available here: http://bioinformatics.bio21.unimelb.edu.au/pyms.api/index.html

1.2 PyMS installation

PyMS is written in Python, an extensible and modular object-oriented scripting language [1]. Python is a highly portable, cross-platform programming language which works well on all major modern operating systems (Linux, MacOS X, Windows).
PyMS is written in pure Python, and therefore works on all platforms to which Python has been ported. PyMS is essentially a Python library (a 'package' in Python parlance, which consists of several 'sub-packages'), and some of its functionality depends on several Python libraries, such as 'numpy' (the Python library for numerical computing) or 'matplotlib' (the Python library for plotting). These also need to work on the operating system of your choice for the functionality they provide to PyMS to be available. In general, the libraries that PyMS uses are available for all operating systems. The exception is 'pycdf', a Python interface to the Unidata netCDF library written by Andre Gosselin of the Institut Maurice-Lamontagne, Canada (http://pysclint.sourceforge.net/pycdf/). This library works only under Linux/Unix, and therefore the PyMS functionality which depends on it is available only under Unix operating systems.

There are several ways to install PyMS, depending on your computer configuration and personal preferences. PyMS has been developed on Linux, and detailed installation instructions for Linux are given below. PyMS should work on all major Linux distributions, and has been tested extensively on Red Hat Linux.

1.2.1 Downloading PyMS source code

The PyMS source code resides on publicly accessible Google Code servers, and can be accessed at the following URL: http://code.google.com/p/pyms/. Under the section "Source" one can find instructions for downloading the source code. The same page provides a link, "This project's Subversion repository can be viewed in your web browser", which allows one to browse the source code on the server without actually downloading it.

Regardless of the target operating system, the first step towards PyMS installation is to download the PyMS source code. The Google Code server maintains the source code with the program 'subversion' (an open-source version control system).
To download the source code one needs to use a subversion client. Several subversion clients are available: some are open source, some freeware, and some commercial (for more information see http://subversion.tigris.org/). Subversion client programs are available for all operating systems. For example, on Linux we use the svn client program which ships with most Linux systems, called simply 'svn' (on Linux CentOS 4, for instance, it ships as part of the RPM package 'subversion-1.3.2-1.rhel4.i386.rpm'). A well known svn client for Windows is TortoiseSVN (http://tortoisesvn.tigris.org/). TortoiseSVN provides a graphical user interface, and is tightly integrated with Windows. TortoiseSVN is open source and can be downloaded from the project web page. There are also several commercial svn clients for Windows.

Subversion has extensive functionality, but only very basic functionality is needed to download the PyMS source code. For more information about subversion please consult the book freely available at http://svnbook.red-bean.com/.

Provided the computer is connected to the internet and the subversion client 'svn' is installed, on Linux the following command will download the latest PyMS source code:

$ svn checkout http://pyms.googlecode.com/svn/trunk/ pyms
A    pyms/Peak
A    pyms/Peak/__init__.py
A    pyms/Peak/List
A    pyms/Peak/List/__init__.py
... [ further output deleted ] ...
Checked out revision 71.

1.3 External Libraries

In addition to the source code, several external libraries are required for the full PyMS functionality.

1.3.1 Package 'NumPy' (required for all aspects of PyMS)

The package NumPy provides numerical capabilities to Python. This package is used throughout PyMS (and is also required by some external packages used in PyMS), so its installation is mandatory.
The NumPy web site http://numpy.scipy.org/ provides installation instructions and a link to the source code.

1.3.2 Package 'pycdf' (required for reading ANDI-MS files)

The pycdf (a Python interface to the Unidata netCDF library) source and installation instructions can be downloaded from http://pysclint.sourceforge.net/pycdf/. Follow the installation instructions to install pycdf.

1.3.3 Package 'Pycluster' (required for peak alignment by dynamic programming)

Peak alignment by dynamic programming is located in the subpackage pyms.Peak.List.DPA. This subpackage uses the Python package 'Pycluster' as the clustering engine. Pycluster and its installation instructions can be found here: http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/index.html

1.3.4 Package 'scipy.ndimage' (required for TopHat baseline corrector)

If the full SciPy package is installed, 'ndimage' will be available. However, SciPy contains extensive functionality and its installation is somewhat involved. Sometimes it is preferable to install only the subpackage 'ndimage'. This subpackage is provided in the PyMS-dependencies gzipped file available for download from the PyMS web page (see below).

1.3.5 Package 'matplotlib' (required for plotting)

Displaying information such as ion chromatograms and detected peaks requires the package matplotlib. The matplotlib package can be downloaded from http://matplotlib.sourceforge.net/

1.3.6 Package 'mpi4py' (required for parallel processing)

This package is required for parallel processing with PyMS. It can be downloaded from http://code.google.com/p/mpi4py/

1.4 Notes on PyMS installation on Linux

It is recommended to compile your own Python installation before installing PyMS. PyMS installation involves placing the PyMS code directory (pyms/) into a location visible to the Python interpreter.
This can be the standard place for third-party software (the directory site-packages/); alternatively, if the PyMS code is placed in a non-standard location, the Python interpreter needs to be made aware of it before it is possible to import PyMS modules. For more on this please consult the Python command sys.path.append().

In order for the existing test scripts provided with the 'pyms-test' project to run without any modification, PyMS should be installed in the /x/PyMS/ directory. To create this directory, enter on a command line:

$ mkdir -p /x/PyMS/

1.5 PyMS installation tutorial (current)

Currently PyMS is being developed on Linux with the following set of packages:

Python-2.7.2
numpy-1.6.1
netcdf-4.1.3
pycdf-0.6-3b
scipy-ndimage.tar.gz
scipy-misc.tar.gz
Pycluster-1.5.0

We provide these packages bundled together in the archive 'PyMS-Linux-deps-2.0.tar.gz', which can be downloaded from the Bio21 Institute web server at the University of Melbourne: http://bioinformatics.bio21.unimelb.edu.au/pyms.html

In addition to the dependencies bundle, one also needs to download the PyMS source code as explained in Section 1.2.1. Below is a tutorial showing how to install all the packages on Red Hat Linux 6.

1. 'Python' installation:

$ tar xvf Python-2.7.2.tgz
$ cd Python-2.7.2
$ ./configure
$ make
$ make install

This installs Python in /usr/local/lib/python2.7. It is recommended to make sure that the python called from the command line is the one just compiled and installed. Starting python from the command line should report version 2.7:

$ python
Python 2.7.2 (default, Feb 15 2012, 14:00:58)
[GCC 4.4.4 20100726 (Red Hat 4.4.4-13)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

2. Package 'NumPy' (required for all aspects of PyMS)

The package NumPy provides numerical capabilities to Python.
This package is used throughout PyMS (and is also required by some external packages used in PyMS), so its installation is mandatory.

$ tar xvf numpy-1.6.1.tar.gz
$ cd numpy-1.6.1
$ python setup.py install

3. Package 'pycdf'

In order to install pycdf, netcdf must first be installed. We suggest installing netcdf without the HDF5 dependency, as this is not needed for PyMS.

$ tar xvf netcdf-4.1.3.tar.gz
$ cd netcdf-4.1.3
$ ./configure --disable-netcdf-4
$ make
# make install

Note that by default, the netcdf libraries are installed in '/usr/local/lib'. Now the pycdf library can be installed:

$ tar xvf pycdf-0.6-3b.tar.gz
$ cd pycdf-0.6-3b
$ python setup.py install

It is useful to run the 'pycdf' package example to verify that the 'netcdf' and 'pycdf' libraries are installed correctly. In the folder 'pycdf-0.6-3b/examples' run:

$ python compr.py
Running with pycdf configured for: numpy
step 1, obs_table
[[ 1  3 -4  5 10]
 [ 0 10  8  7  4]
 [-1  4  5  9 13]]
should be
[[ 1  3 -4  5 10]
 [ 0 10  8  7  4]
 [-1  4  5  9 13]]
...more output omitted...

If the pycdf example produces an error similar to this:

$ python compr.py
Traceback (most recent call last):
  File "compr.py", line 7, in ...
ImportError: libnetcdf.so.7: cannot open shared object file: No such file or directory

this indicates that pycdf cannot find the netcdf shared libraries. For example, the netcdf library may be installed in '/usr/local/lib', but not yet cached and ready for use. This is normally achieved with the command 'ldconfig'. If the 'netcdf' library is properly installed and ready to use, the output of the 'ldconfig' command should be similar to the following:

# /sbin/ldconfig -v | grep netcdf
libnetcdf.so.7 -> libnetcdf.so.7.1.1
libnetcdff.so.5 -> libnetcdff.so.5.1.0
libnetcdf_c++.so.4 -> libnetcdf_c++.so.4.1.0

If this command produces no output, consult the 'ldconfig' configuration options.
In particular, the directory where 'netcdf' was installed ('/usr/local/lib' by default) must be included in the library path. This can be achieved either by modifying '/etc/ld.so.conf.d' or by using the LD_LIBRARY_PATH environment variable.

4. Package 'Pycluster'

Pycluster is used by PyMS for peak alignment by dynamic programming.

$ tar xvf Pycluster-1.50.tar.gz
$ cd Pycluster-1.50
$ python setup.py install

5. Package 'SciPy'

The module 'ndimage', used for baseline correction in PyMS, is part of a much larger library called SciPy. To save installing such a large library, we install only the parts we require: the ndimage library, and a dependency of ndimage, SciPy's 'misc' library. Modules which require 'ndimage' expect it to be part of the scipy library, so some extra steps are necessary to allow ndimage to be accessed properly by Python. First:

$ tar xvf scipy-ndimage.tar.gz
$ cd ndimage
$ python setup.py install
$ tar xvf scipy-misc.tar.gz
$ cd misc
$ python setup.py install

The ndimage and misc libraries should now be installed in /usr/local/lib/python2.7/site-packages (if your version of Python is 2.7). Observe the screen output from the 'python setup.py install' command to see precisely where they have been installed. Python programs will expect ndimage to be at /usr/local/lib/python2.7/site-packages/scipy/ndimage, so we manually put it there, and add an __init__.py file which tells Python that a library is present at that location. We must also move the module 'misc' so that 'ndimage' finds it:

$ cd /usr/local/lib/python2.7/site-packages
$ mkdir scipy
$ touch scipy/__init__.py
$ mv ndimage scipy
$ mv misc scipy

1.5.1 Extra Dependencies

In addition to the dependencies listed above, PyMS also makes use of some other Python packages for specific needs. The first, matplotlib, is used to display graphical output (such as ion chromatograms and peak locations).
mpich2 and mpi4py together provide the ability to run PyMS scripts in parallel.

1. Package 'matplotlib'

$ tar xvf matplotlib-0.99.1.2.tar.gz
$ cd matplotlib-0.99.1.1
$ python setup.py build
$ python setup.py install

2. Installation of the package 'mpi4py'

Before 'mpi4py' can be installed, the MPI libraries for parallel computation (mpich2) must be installed. From the mpich2 project web site download the current distribution of mpich2 (in our case the file 'mpich2-1.2.1p1.tar.gz'), or use the version included with PyMS-deps versions 2.0 and above.

Prepare the directory for the mpich2 installation. In this example we have chosen to use /usr/local/mpich2/. Our version of mpich2 is 1.2.1, and to allow for the installation of a different version later, we create a subdirectory "1.2.1":

$ mkdir -vp /usr/local/mpich2/1.2.1

The above command will make the directory /usr/local/mpich2/ and also /usr/local/mpich2/1.2.1. Note that /usr/local is usually owned by root, so the above command may require root privileges.

Unpack the mpich2 source archive and change to the source code directory:

$ tar xvf mpich2-1.2.1p1.tar.gz
$ cd mpich2-1.2.1p1

Configure, compile, and install mpich2:

$ ./configure --prefix=/usr/local/mpich2/1.2.1 --enable-sharedlibs=gcc
$ make
$ make install

If /usr/local/mpich2/1.2.1 is owned by root, the last command may require root privileges.

With this completed, one can install the package 'mpi4py'. From the mpi4py project web site download the current distribution of mpi4py (in our case the file 'mpi4py-1.2.1.tar.gz'), or use the distribution included in PyMS-Deps 2.0 and above. Unpack the source archive and change to the source code directory:

$ tar xvf mpi4py-1.2.1.tar.gz
$ cd mpi4py-1.2.1

Edit the file 'mpi.cfg', which can be found in this directory, to reflect the location of mpich2.
In our case this file after editing contained the following:

# MPICH2
[mpi]
mpi_dir = /usr/local/mpich2/1.2.1
mpicc   = %(mpi_dir)s/bin/mpicc
mpicxx  = %(mpi_dir)s/bin/mpicxx

Install the package 'mpi4py':

$ python setup.py install

Check that mpi4py works:

$ python
Python 2.7.2 (default, Feb 15 2012, 14:00:58)
[GCC 4.4.4 20100726 (Red Hat 4.4.4-13)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import mpi4py
>>>

If the above import produced no output, mpi4py is installed properly and ready to use.

1.6 PyMS installation tutorial (old dependencies)

Prior to 2011, PyMS was being developed on Linux with the following set of packages:

Python-2.5.2
numpy-1.1.1
netcdf-4.0
pycdf-0.6-3b
ndimage.zip
Pycluster-1.41
matplotlib-0.99.1.2
mpi4py-1.2.1.tar.gz
mpich2-1.2.1p1.tar.gz

We provide these packages bundled together in the archive 'PyMS-Linux-deps-1.1.tar.gz', which can be downloaded from the Bio21 Institute web server at the University of Melbourne: http://bioinformatics.bio21.unimelb.edu.au/pyms.html

In addition to the dependencies bundle, one also needs to download the PyMS source code as explained in Section 1.2.1. Below is a tutorial showing how to install all the packages on Red Hat Linux 5.

1. 'Python' installation:

$ tar xvfz Python-2.5.2.tgz
$ cd Python-2.5.2
$ ./configure
$ make
$ make install

This installs Python in /usr/local/lib/python2.5. It is recommended to make sure that the python called from the command line is the one just compiled and installed.

2. 'NumPy' installation:

$ tar xvfz numpy-1.1.1.tar.gz
$ cd numpy-1.1.1
$ python setup.py install

3. 'pycdf' installation

Pycdf has two dependencies: the Unidata netcdf library and NumPy. The NumPy installation is described above.
To install pycdf, the netcdf library must be downloaded (http://www.unidata.ucar.edu/software/netcdf/index.html), compiled, and installed first:

$ tar xvfz netcdf.tar.gz
$ cd netcdf-4.0
$ ./configure
$ make
$ make install

The last step will create several binary 'libnetcdf*' files in /usr/local/lib. Then 'pycdf' should be installed as follows:

$ tar xvfz pycdf-0.6-3b.tar.gz
$ cd pycdf-0.6-3b
$ python setup.py install

4. 'Pycluster' installation

$ tar xvfz Pycluster-1.42.tar.gz
$ cd Pycluster-1.42
$ python setup.py install

5. 'ndimage' installation:

$ unzip ndimage.zip
$ cd ndimage
$ python setup.py install --prefix=/usr/local

Since 'ndimage' was installed outside the scipy package, this requires some manual tweaking:

$ cd /usr/local/lib/python2.5/site-packages
$ mkdir scipy
$ touch scipy/__init__.py
$ mv ndimage scipy

6. 'matplotlib' installation:

$ tar xvfz matplotlib-0.99.1.2.tar.gz
$ cd matplotlib-0.99.1.1
$ python setup.py build
$ python setup.py install

The 'pyms.Display' module uses the TKAgg backend for matplotlib. If this is not your default backend, the matplotlibrc file may be edited. To locate the file 'matplotlibrc', in an interactive Python session:

>>> import matplotlib
>>> matplotlib.matplotlib_fname()

Open the matplotlibrc file in a text editor and adjust the 'backend' parameter to 'TKAgg'.

7. 'mpi4py' installation:

This package is required for running PyMS processing on multiple processors (CPUs). Instructions on how to install this package and run PyMS processing in parallel are given in Section 8.1.

1.7 Troubleshooting

The most likely problem with a PyMS installation is a problem with installing one of the PyMS dependencies.
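Since most installation problems come down to a dependency that is missing or broken, a quick first check is to try importing each one. The snippet below is a generic diagnostic (not part of PyMS); the module names are the dependencies listed in Section 1.3:

```python
def check_imports(names):
    """Try to import each named module; return a dict mapping
    module name -> True if it imported, False otherwise."""
    status = {}
    for name in names:
        try:
            __import__(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status

# Dependency names from Section 1.3:
deps = ["numpy", "pycdf", "Pycluster", "scipy.ndimage", "matplotlib", "mpi4py"]
for name, ok in check_imports(deps).items():
    print(name + (": OK" if ok else ": MISSING"))
```

Any module reported as MISSING points to the installation step that needs to be revisited.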
1.7.1 Pycdf import error

On Red Hat Linux 5, SELinux is enabled by default, and this causes the following error when trying to import a properly installed pycdf:

$ python
Python 2.5.2 (r252:60911, Nov  5 2008, 16:25:39)
[GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pycdf
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.5/site-packages/pycdf/__init__.py", line 22, in <module>
    from pycdf import *
  File "/usr/local/lib/python2.5/site-packages/pycdf/pycdf.py", line 1096, in <module>
    import pycdfext as _C
  File "/usr/local/lib/python2.5/site-packages/pycdf/pycdfext.py", line 5, in <module>
    import _pycdfext
ImportError: /usr/local/lib/python2.5/site-packages/pycdf/_pycdfext.so:
cannot restore segment prot after reloc: Permission denied

This problem is remedied by disabling SELinux (log in as 'root', open the menu Administration → Security Level and Firewall, select the SELinux tab, and change the setting from 'Enforcing' to 'Disabled'). This problem is likely to occur on Red Hat derivative distributions such as CentOS.

1.8 PyMS tutorial and examples

This document provides an extensive tutorial on the use of PyMS, and the corresponding examples can be downloaded from the publicly accessible Google Code project 'pyms-test' (http://code.google.com/p/pyms-test/). The data used in the PyMS documentation and examples is available from the Bio21 Institute server at the University of Melbourne. Please follow the link from the PyMS web page http://bioinformatics.bio21.unimelb.edu.au/pyms or go directly to http://bioinformatics.bio21.unimelb.edu.au/pyms-data/

A tutorial illustrating various PyMS features in detail is provided in subsequent chapters of this User Guide.
The commands executed interactively are grouped together by example, and provided as Python scripts in the project 'pyms-test' (this is a Google Code project, similar to the project 'pyms' which contains the PyMS source code).

The setup used in the examples below is as follows. The projects 'pyms', 'pyms-test', 'pyms-docs', and 'data' are all in the same directory, /x/PyMS. In the project 'pyms-test' there is a directory corresponding to each example, coded with the chapter number (i.e. pyms-test/21a/ corresponds to Example 21a from Chapter 2). In each example directory there is a script named 'proc.py' which contains the commands given in the example. Provided that the paths to 'pyms' and 'pyms-data' are set properly, these scripts can be run with the following command:

$ python proc.py

Before running each example, the Python interpreter was made aware of the PyMS location with the following commands:

import sys
sys.path.append("/x/PyMS")

For brevity these commands will not be shown in the examples below, but they are included in the 'pyms-test' example scripts. The above path may need to be adjusted to match your own directory structure.

1.9 Using PyMS on Windows

Python is highly cross-platform compatible, and PyMS works seamlessly on Windows. The only exception is reading data in the ANDI-MS (NetCDF) format, a widely used standard format for storing raw chromatography–mass spectrometry data (see Section 2.1). This capability in PyMS depends on the 'pycdf' library, which is not supported on Windows (see Subsection 1.3.2). Therefore, at present the ability to read ANDI-MS files is limited to Linux. All other PyMS functionality is available under Windows.

We use Linux for the development and deployment of PyMS, and this User Guide largely assumes that PyMS is used under Linux. In this section we give some pointers on how to use PyMS under Windows.

1.9.1 PyMS installation on Windows

1. Install Python, NumPy, SciPy, and matplotlib for Windows.
The bundle of these software packages, tested under Windows XP and Windows 7, can be downloaded from the PyMS project page at the Bio21 Institute, University of Melbourne:

http://bioinformatics.bio21.unimelb.edu.au/pyms.html

2. Download the latest PyMS code from the PyMS Google Code project page:

http://code.google.com/p/pyms/

3. Unpack the PyMS code and place it in a standard place for Python libraries, or adjust the PYTHONPATH variable to include the path to PyMS. If Python is installed in C:\Python25, the standard place for Python libraries is C:\Python25\Lib\site-packages.

4. Start IDLE or another Python shell. If PyMS is installed properly, the following command will not return any warning or error messages:

>>> import pyms

1.9.2 Example processing GC-MS data on Windows

The example data can be downloaded from http://bioinformatics.bio21.unimelb.edu.au/pyms/data/. We will use the data file in the JCAMP-DX format, named gc01_0812_066.jdx. Once downloaded, this data file should be placed in the folder C:\Data for the example below.

The Python environment can be accessed from several Python shells. The default shell that comes with the Python 2.5 installation is IDLE. In this example we first load the raw data:

>>> from pyms.GCMS.IO.JCAMP.Function import JCAMP_reader
>>> jcamp_file = "C:\Data\gc01_0812_066.jdx"
>>> data = JCAMP_reader(jcamp_file)
 -> Reading JCAMP file 'C:\Data\gc01_0812_066.jdx'

The intensity matrix object is built by binning:
>>> from pyms.GCMS.Function import build_intensity_matrix_i
>>> im = build_intensity_matrix_i(data)

In this example, we show how to obtain the dimensions of the newly created intensity matrix, then loop over all ion chromatograms, and for each ion chromatogram apply the Savitzky-Golay noise filter and top-hat baseline correction:

>>> n_scan, n_mz = im.get_size()
>>> from pyms.Noise.SavitzkyGolay import savitzky_golay
>>> from pyms.Baseline.TopHat import tophat
>>> for ii in range(n_mz):
...     print "working on IC", ii
...     ic = im.get_ic_at_index(ii)
...     ic1 = savitzky_golay(ic)
...     ic_smooth = savitzky_golay(ic1)
...     ic_base = tophat(ic_smooth, struct="1.5m")
...     im.set_ic_at_index(ii, ic_base)

The resulting noise and baseline corrected ion chromatogram is saved back into the intensity matrix.

As this example illustrates, PyMS depends on Python and can be used in exactly the same way under any operating system that supports Python. For more advanced use refer to subsequent chapters.

Chapter 2

GC-MS Raw Data Model

2.1 Introduction

PyMS can read gas chromatography-mass spectrometry (GC-MS) data stored in the Analytical Data Interchange for Mass Spectrometry (ANDI-MS) and Joint Committee on Atomic and Molecular Physical Data (JCAMP-DX) formats. These formats are essentially recommendations, and it is up to individual vendors of mass spectrometry processing software to properly implement "export to ANDI-MS" or "export to JCAMP-DX" features in their software. It is also possible to use third-party converters. The information contained in the exported data files can vary significantly, depending on the instrument, the vendor's software, or the conversion utility. PyMS makes the following minimum set of assumptions about the information contained in the data file:

• The data contain the m/z and intensity value pairs across a scan.
• Each scan has a retention time.

Internally, PyMS stores the raw data from ANDI files or JCAMP files as a GCMS data object.
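The two assumptions above can be made concrete with a small sketch. The stand-in classes below are illustrative only (they are not the PyMS class definitions): one retention time per scan, and parallel (m/z, intensity) lists within each scan.

```python
# Hedged sketch of the raw-data model just described; these stand-in
# classes are for illustration and are NOT the PyMS definitions.
class Scan:
    def __init__(self, mass_list, intensity_list):
        # parallel lists: one intensity per recorded m/z value
        self.mass_list = list(mass_list)
        self.intensity_list = list(intensity_list)

class GCMSData:
    def __init__(self, time_list, scan_list):
        # one retention time (in seconds) per Scan object
        assert len(time_list) == len(scan_list)
        self.time_list = list(time_list)
        self.scan_list = list(scan_list)

# two scans recording different m/z values, as the model allows
data = GCMSData([305.58, 305.96],
                [Scan([50.1, 73.0], [1200.0, 9500.0]),
                 Scan([50.2, 73.1, 147.0], [1100.0, 9100.0, 300.0])])
print(len(data.scan_list))
```

Note that consecutive scans need not record the same m/z values; this is exactly why the intensity matrix of the next chapter is needed.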
2.2 Creating a "GCMS data" object

2.2.1 Reading JCAMP GC-MS data

[ This example is in pyms-test/20a ]

(ANDI-MS was developed by the Analytical Instrument Association. JCAMP-DX is maintained by the International Union of Pure and Applied Chemistry.)

The PyMS package pyms.GCMS.IO.JCAMP provides capabilities to read raw GC-MS data stored in the JCAMP-DX format. The file 'gc01_0812_066.jdx' (located in 'data') is a GC-MS experiment converted from the Agilent ChemStation format to the JCAMP format using File Translator Pro (ChemSW, Inc.). This file can be loaded in Python as follows:

>>> from pyms.GCMS.IO.JCAMP.Function import JCAMP_reader
>>> jcamp_file = "/x/PyMS/data/gc01_0812_066.jdx"
>>> data = JCAMP_reader(jcamp_file)
 -> Reading JCAMP file '/x/PyMS/data/gc01_0812_066.jdx'
>>>

The above command creates the object 'data', which is an instance of the class GCMS data.

2.2.2 Reading ANDI GC-MS data

[ This example is in pyms-test/20b ]

The PyMS package pyms.GCMS.IO.ANDI provides capabilities to read raw GC-MS data stored in the ANDI-MS format. The file 'gc01_0812_066.cdf' is a GC-MS experiment converted to the ANDI-MS format from Agilent ChemStation (the same data as in example 20a above). This file can be loaded as follows:

>>> from pyms.GCMS.IO.ANDI.Function import ANDI_reader
>>> andi_file = "/x/PyMS/data/gc01_0812_066.cdf"
>>> data = ANDI_reader(andi_file)
 -> Reading netCDF file '/x/PyMS/data/gc01_0812_066.cdf'
>>>

The above command creates the object 'data', which is an instance of the class GCMS data.

2.3 A GCMS data object

2.3.1 Methods

[ Examples below are in pyms-test/20a and pyms-test/20b ]

The object 'data' (from the two previous examples) stores the raw data as a GCMS data object. Within the GCMS data object, raw data are stored as a list of Scan objects and a list of retention times. There are several methods available to access data and attributes of the GCMS data and Scan objects.
The GCMS data object's methods relate to the raw data. The main properties relate to the masses, retention times, and scans. For example, the minimum and maximum mass from all of the raw data can be returned by the following:

>>> data.get_min_mass()
>>> data.get_max_mass()

A list of all retention times can be returned by:

>>> time = data.get_time_list()

The index of a specific retention time (in seconds) can be returned by:

>>> data.get_index_at_time(400.0)

Note that this returns the index of the retention time in the data closest to the given retention time of 400.0 seconds.

The method get_tic() returns the total ion chromatogram (TIC) of the data as an IonChromatogram object:

>>> tic = data.get_tic()

The IonChromatogram object is explained in a later chapter.

2.3.2 A Scan data object

A Scan object contains a list of masses and a corresponding list of intensity values from a single mass-spectrum scan in the raw data. Typically only non-zero (or non-threshold) intensities and the corresponding masses are stored in the raw data.

[ The following examples are the same in pyms-test/20a and pyms-test/20b ]

A list of all the raw Scan objects can be returned by:

>>> scans = data.get_scan_list()

A list of all masses in a scan (e.g. the first scan) is returned by:

>>> scans[0].get_mass_list()

A list of all corresponding intensities in a scan is returned by:

>>> scans[0].get_intensity_list()

The minimum and maximum mass in an individual scan (e.g. the first scan) are returned by:

>>> scans[0].get_min_mass()
>>> scans[0].get_max_mass()

2.3.3 Exporting data and obtaining information about a data set

[ This example is in pyms-test/20c ]

Often it is of interest to find out some basic information about the data set, e.g. the number of scans, the retention time range, the m/z range, and so on. The GCMS data class provides a method info() that can be used for this purpose.
>>> from pyms.GCMS.IO.ANDI.Function import ANDI_reader
>>> andi_file = "/x/PyMS/data/gc01_0812_066.cdf"
>>> data = ANDI_reader(andi_file)
 -> Reading netCDF file '/x/PyMS/data/gc01_0812_066.cdf'
>>> data.info()
 Data retention time range: 5.093 min -- 66.795 min
 Time step: 0.375 s (std=0.000 s)
 Number of scans: 9865
 Minimum m/z measured: 50.000
 Maximum m/z measured: 599.900
 Mean number of m/z values per scan: 56
 Median number of m/z values per scan: 40
>>>

The entire raw data can be exported to a file with the method write():

>>> data.write("output/data")
 -> Writing intensities to 'output/data.I.csv'
 -> Writing m/z values to 'output/data.mz.csv'

This method takes a string ("output/data" in this example) and writes two CSV files. One has the extension ".I.csv" and contains the intensities ("output/data.I.csv" in this example), and the other has the extension ".mz.csv" and contains the corresponding table of m/z values ("output/data.mz.csv" in this example). In general, these are not two-dimensional matrices, because different scans may have different numbers of m/z values recorded.

2.3.4 Comparing two GC-MS data sets

[ This example is in pyms-test/20d ]

Occasionally it is useful to compare two data sets. For example, one may want to check the consistency between a data set exported in the netCDF format from the manufacturer's software and the same data exported in the JCAMP format from third-party software. For example:

>>> from pyms.GCMS.IO.JCAMP.Function import JCAMP_reader
>>> from pyms.GCMS.IO.ANDI.Function import ANDI_reader
>>> andi_file = "/x/PyMS/data/gc01_0812_066.cdf"
>>> jcamp_file = "/x/PyMS/data/gc01_0812_066.jdx"
>>> data1 = ANDI_reader(andi_file)
 -> Reading netCDF file '/x/PyMS/data/gc01_0812_066.cdf'
>>> data2 = JCAMP_reader(jcamp_file)
 -> Reading JCAMP file '/x/PyMS/data/gc01_0812_066.jdx'

To compare the two data sets:

>>> from pyms.GCMS.Function import diff
>>> diff(data1, data2)
 Data sets have the same number of time points.
 Time RMSD: 1.80e-13
 Checking for consistency in scan lengths ... OK
 Calculating maximum RMSD for m/z values and intensities ...
 Max m/z RMSD: 1.03e-05
 Max intensity RMSD: 0.00e+00

If the data sets cannot be compared, for example because of a different number of scans or an inconsistent number of m/z values between two scans, diff() will report the difference. For example:

>>> data2.trim(begin=1000, end=2000)
 Trimming data to between 1000 and 2000 scans
>>> diff(data1, data2)
 -> The number of retention time points different.
    First data set: 9865 time points
    Second data set: 1001 time points
    Data sets are different.

Chapter 3

GC-MS data derived objects

In the raw GC-MS data, consecutive scans do not necessarily contain the same mass per charge (mass) values. For data processing, it is often necessary to convert the data to a matrix with a set number of masses and scans. In PyMS, the resulting object is called an intensity matrix. In this chapter the methods for converting the raw GC-MS data to an intensity matrix object are illustrated.

3.1 IntensityMatrix Object

The general scheme for converting raw mass values is to bin intensity values based on the interval the corresponding mass belongs to. The general procedure is as follows:

• Set the interval between bins, and the lower and upper bin boundaries.
• Calculate the number of bins needed to cover the range of all masses.
• Centre the first bin at the minimum mass found in all the raw data.
• Sum intensities whose masses fall in a given bin.

A mass, m, is considered to belong to a bin when c − l ≤ m < c + u, where c is the centre of the bin, l is the lower boundary, and u is the upper boundary of the bin. The default bin interval is one, with lower and upper boundaries of ±0.5. A function to bin masses to the nearest integer is also available; its default bin interval is one, with a lower boundary of −0.3 and an upper boundary of +0.7 (as per the NIST library).
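The binning procedure described above can be sketched in plain Python. This is an illustration of the scheme only, not the PyMS implementation, and the helper name bin_scan is an assumption:

```python
# Sketch of the binning scheme: intensities are summed into bins of a
# fixed interval, where a mass m belongs to the bin with centre c when
# c - l <= m < c + u. Illustrative only, not PyMS's code.
def bin_scan(masses, intensities, min_mass, interval=1.0, lower=0.5, upper=0.5):
    n_bins = int((max(masses) - min_mass) / interval) + 1
    bins = [0.0] * n_bins
    for m, i in zip(masses, intensities):
        idx = int(round((m - min_mass) / interval))  # nearest bin centre
        centre = min_mass + idx * interval
        if centre - lower <= m < centre + upper:
            bins[idx] += i
    return bins

# masses 50.1 and 50.4 fall into the bin centred at 50.0; 51.2 at 51.0
print(bin_scan([50.1, 50.4, 51.2], [10.0, 5.0, 7.0], min_mass=50.0))
```

With the default ±0.5 boundaries every mass lands in the bin with the nearest centre; narrower boundaries (such as −0.3/+0.7) deliberately exclude masses that fall outside the asymmetric window around each centre.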
3.1.1 Discussion of Binning Boundaries

For any chemical element X, let w(X) be the atomic weight of X, and

    δ(X) = (w(X) − {w(X)}) / w(X),    (3.1)

where {a} is the integer value of a (rounded to the nearest integer). For example, for hydrogen δ(1H) = (1.007825032 − 1)/1.007825032 = 0.0076. Similarly, δ(12C) = 0, δ(14N) = 0.00022, δ(16O) = −0.00032, etc.

Let also ∆(X) = w(X) − {w(X)}. Then −0.023 < ∆(31P), ∆(28Si) < 0.

Let a compound undergo GC-MS, and let Y be one of its fragments. If Y consists of k1 atoms of type X1, k2 atoms of type X2, ..., kr atoms of type Xr, then ∆(Y) = k1·∆(X1) + k2·∆(X2) + ... + kr·∆(Xr). The fragment will usually not contain more than 2 or 3 P or Si atoms, and if its molecular weight is less than 550 it may not contain more than 35 O atoms, so ∆(Y) ≥ −0.023 · 5 − 0.00051 · 35 = −0.133.

On the other hand, if Y contains k H atoms and m N atoms, then ∆(Y) ≤ k · 0.00783 + m · 0.00051. Since for each two hydrogen atoms at least one carbon (or heavier) atom is needed, a molecular weight below 550 gives the limit of no more than 80 hydrogen atoms. Therefore in this case (i.e. H and C atoms only) ∆(Y) ≤ 80 · 0.00783 = 0.63. If carbon is replaced by any heavier atom, at least 2 hydrogen atoms will be eliminated and ∆(Y) will become even smaller.

If the molecular weight of Y does not exceed 550 (typically the largest mass scanned for in a GC-MS setup), then −0.133 ≤ ∆(Y) ≤ 0.63. This means that if we set our binning boundaries to (−0.3, 0.7) or (−0.2, 0.8), the chance of having a fragment whose molecular weight falls very close to a bin boundary is minimised. Since the resolution of MS is at least 0.1 dalton, we may assume that its error does not exceed 0.05, and MS accuracy will not cause additional problems.

3.1.2 Build intensity matrix

[ This example is in pyms-test/30a ]

An intensity matrix on the raw GC-MS data can be built using the following function. First the raw data is imported as before.
>>> from pyms.GCMS.IO.JCAMP.Function import JCAMP_reader
>>> jcamp_file = "/x/PyMS/data/gc01_0812_066.jdx"
>>> data = JCAMP_reader(jcamp_file)
 -> Reading JCAMP file '/x/PyMS/data/gc01_0812_066.jdx'
>>>

Then the data can be converted to an intensity matrix using the functions build_intensity_matrix() and build_intensity_matrix_i(), available in "pyms.GCMS.Function". The default operation of build_intensity_matrix() is to use a bin interval of one and treat the masses as floating point numbers. The default intensity matrix can be built as follows:

>>> from pyms.GCMS.Function import build_intensity_matrix
>>> im = build_intensity_matrix(data)

The size, as the number of scans and the number of bins, is returned by:

>>> im.get_size()
(9865, 551)

There are 9865 scans and 551 bins in this example. The raw masses have been binned into new mass units based on the minimum mass in the raw data and the bin size. A list of the new masses can be obtained as follows:

>>> masses = im.get_mass_list()
>>> print masses[:10]
[50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0]

The last command prints the first ten masses. The methods get_min_mass() and get_max_mass() return the minimum and maximum mass:

>>> print im.get_min_mass()
50.0
>>> print im.get_max_mass()
600.0

It is also possible to search for a particular mass by finding the index of the binned mass closest to the desired mass. For example, the index of the binned mass closest to a mass of 73.3 m/z can be found by using the method get_index_of_mass():

>>> index = im.get_index_of_mass(73.3)
>>> print index
23

The value of the closest mass can be returned by the method get_mass_at_index():

>>> print im.get_mass_at_index(index)
73.0

A mass of 73.0 is returned in this example.
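The index arithmetic implied by these results can be sketched as follows. This is an assumption about the underlying maths for illustration, not PyMS's own code:

```python
# With a known minimum mass and a fixed bin interval, the closest bin
# is found by rounding; the bin's mass is recovered from its index.
# Sketch only -- not the PyMS implementation.
def index_of_mass(mass, min_mass, interval=1.0):
    return int(round((mass - min_mass) / interval))

def mass_at_index(index, min_mass, interval=1.0):
    return min_mass + index * interval

print(index_of_mass(73.3, 50.0))  # -> 23, as in the example above
print(mass_at_index(23, 50.0))    # -> 73.0
```

The same arithmetic with a 0.5 bin interval gives index 47 and a binned mass of 73.5, consistent with the behaviour shown for non-default bin parameters.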
3.1.3 Build intensity matrix parameters

[ This example is in pyms-test/30b ]

The bin interval can be set to values other than one, and the binning boundaries can also be adjusted. In the example below, to fit the 0.5 bin interval, the upper and lower boundaries are set to ±0.25:

>>> im = build_intensity_matrix(data, 0.5, 0.25, 0.25)

The size of the intensity matrix will reflect the change in the number of bins:

>>> im.get_size()

In this example there are 9865 scans (as before), but 1101 bins. The index and binned mass of the mass closest to 73.3 should also reflect the different binning:

>>> index = im.get_index_of_mass(73.3)
>>> print im.get_mass_at_index(index)

A mass of 73.5 is returned in this example.

3.1.4 Build integer mass intensity matrix

[ This example is in pyms-test/30c ]

It is also possible to build an intensity matrix with integer masses and a bin interval of one. The default range for the binning is −0.3 to +0.7 mass units. The function is imported from "pyms.GCMS.Function":

>>> from pyms.GCMS.Function import build_intensity_matrix_i
>>> im = build_intensity_matrix_i(data)

The masses are now integers:

>>> index = im.get_index_of_mass(73.3)
>>> print im.get_mass_at_index(index)

A mass of 73 is returned in this example. The lower and upper bounds can be adjusted with build_intensity_matrix_i(data, lower, upper).

3.2 MassSpectrum object

[ This example is in pyms-test/31 ]

A MassSpectrum object contains two attributes, mass_list and mass_spec, a list of mass values and the corresponding intensities, respectively. A MassSpectrum is returned by the IntensityMatrix method get_ms_at_index(index).

For example, the properties of the first MassSpectrum object of an IntensityMatrix, im, can be investigated by:

>>> ms = im.get_ms_at_index(0)
>>> print len(ms)
>>> print len(ms.mass_list)
>>> print len(ms.mass_spec)

The length of all attributes should be the same.
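The invariant that len(ms), len(ms.mass_list), and len(ms.mass_spec) agree can be sketched with a stand-in class; this is an illustration, not the PyMS class definition:

```python
# Stand-in MassSpectrum illustrating the two parallel attributes
# described above; a sketch only, not pyms.GCMS.Class.MassSpectrum.
class MassSpectrum:
    def __init__(self, mass_list, mass_spec):
        if len(mass_list) != len(mass_spec):
            raise ValueError("mass_list and mass_spec must have equal length")
        self.mass_list = list(mass_list)
        self.mass_spec = list(mass_spec)

    def __len__(self):
        # len(ms) == len(ms.mass_list) == len(ms.mass_spec)
        return len(self.mass_list)

ms = MassSpectrum([50, 51, 52], [1200.0, 300.0, 0.0])
print(len(ms), len(ms.mass_list), len(ms.mass_spec))
```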
3.3 IonChromatogram object

[ This example is in pyms-test/31 ]

An IonChromatogram object is a one-dimensional vector containing mass intensities as a function of retention time. This can be either individual m/z channel intensities (for example, the ion chromatogram at m/z = 73), or cumulative intensities over all measured m/z (the TIC). An IonChromatogram for the TIC, and for a given mass or index, can be obtained as follows:

>>> tic = data.get_tic()
>>> ic = im.get_ic_at_index(0)
>>> ic = im.get_ic_at_mass(73)

These will return, respectively: the TIC; the ion chromatogram of the first mass; and the ion chromatogram of the mass closest to 73.

An ion chromatogram object has a method is_tic() which returns "True" if the ion chromatogram is a TIC, "False" otherwise:

>>> print "'tic' is a TIC:", tic.is_tic()
'tic' is a TIC: True
>>> print "'ic' is a TIC:", ic.is_tic()
'ic' is a TIC: False

3.3.1 Writing IonChromatogram object to a file

[ This example is in pyms-test/31 ]

The method write() of the IonChromatogram object allows one to save the ion chromatogram to a file:

>>> tic.write("output/tic.dat", minutes=True)
>>> ic.write("output/ic.dat", minutes=True)

The flag minutes=True indicates that the retention time will be saved in minutes. The ion chromatogram saved with the write() method is a plain ASCII file which contains one (retention time, intensity) pair per line:

$ head tic.dat
5.0930 2.222021e+07
5.0993 2.212489e+07
5.1056 2.208650e+07
5.1118 2.208815e+07
5.1181 2.200635e+07
5.1243 2.200326e+07
5.1306 2.202363e+07
5.1368 2.198357e+07
5.1431 2.197408e+07
5.1493 2.193351e+07

3.4 Saving data

[ This example is in pyms-test/32 ]

A matrix of intensity values can be saved to a file with the function save_data() from pyms.Utils.IO. A matrix of intensity values can be returned from an IntensityMatrix with the method get_matrix_list().
For example:

>>> from pyms.Utils.IO import save_data
>>> mat = im.get_matrix_list()
>>> save_data("output/im.dat", mat)

It is also possible to save the list of masses (from im.get_mass_list()) and the list of retention times (from im.get_time_list()) using the save_data() function. For convenience, the intensity values, the mass list, and the time list can all be saved with the method export_ascii(). For example:

>>> im.export_ascii("output/data")

will create "data.im.dat", "data.mz.dat", and "data.rt.dat", which are the intensity matrix, the m/z vector, and the retention time vector, respectively. By default the data is saved as space-separated data with a ".dat" extension. It is also possible to save the data as comma-separated data with a ".csv" extension with the command im.export_ascii("output/data", "csv").

Additionally, the entire IntensityMatrix can be exported to the LECO CSV format. This facility is useful for import into other analytical software packages. The format has a header line specifying the column headings as "scan, retention time, mass1, mass2, ...", followed by one row of intensity data per scan:

>>> im.export_leco_csv("output/data_leco.csv")

3.5 Importing ASCII data

[ This example is in pyms-test/32 ]

The LECO CSV data format can be used to import ASCII data directly into an IntensityMatrix object. The data must follow the format outlined above. For example, the file saved above can be read and compared to the original:

>>> from pyms.GCMS.Class import IntensityMatrix
>>> iim = IntensityMatrix([0], [0], [[0]])
>>> iim.import_leco_csv("output/data_leco.csv")
>>> print im.get_size()
>>> print iim.get_size()

The line "IntensityMatrix([0],[0],[[0]])" is required to create an empty IntensityMatrix object.

Chapter 4

Data filtering

4.1 Introduction

This chapter covers filtering techniques that allow pre-processing of GC-MS data for analysis and for comparison to other pre-processed GC-MS data.
4.2 Time strings

Before considering the filtering techniques, the mechanism for representing retention times is outlined here. A time string is the specification of a time interval that takes the format 'NUMBERs' or 'NUMBERm' for a time interval in seconds or minutes, respectively. For example, these are valid time strings: '10s' (10 seconds) and '0.2m' (0.2 minutes).

4.3 Intensity Matrix resizing

Once an IntensityMatrix has been constructed from the raw GC-MS data, the entries of the matrix can be modified. These modifications can operate on the entire matrix, or on individual masses or scans.

4.3.1 Retention time range

[ This example is in pyms-test/40a ]

A basic operation on the GC-MS data is to select a specific time range for processing. In PyMS, any data outside the chosen time range is discarded. The trim() method operates on the raw data, so any subsequent processing only refers to the trimmed data. Given a previously loaded raw GC-MS data file, data, the data can be trimmed to specific scans:

>>> data.trim(1000, 2000)
>>> print data.info()

or to specific retention times (in "seconds" or "minutes"):

>>> data.trim("6.5m", "21m")
>>> print data.info()

4.3.2 Mass spectrum range and entries

[ This example is in pyms-test/40b ]

An IntensityMatrix object has a set mass range and interval that is derived from the data at the time of building the intensity matrix. The range of mass values can be cropped. This is done primarily to ensure that the range of masses used is consistent when comparing samples. Given a previously loaded raw GC-MS data file that has been converted into an IntensityMatrix, im, the mass range can be "cropped" to a new (smaller) range:

>>> im.crop_mass(60, 400)
>>> print im.get_min_mass(), im.get_max_mass()

It is also possible to set all intensities for a given mass to zero. This is useful for ignoring masses associated with sample preparation.
The mass can be "nulled" via:

>>> im.null_mass(73)
>>> print sum(im.get_ic_at_mass(73).get_intensity_array())

4.4 Noise smoothing

The purpose of noise smoothing is to remove high-frequency noise from the data, and thereby increase the contribution of the signal relative to the contribution of the noise.

4.4.1 Window averaging

[ This example is in pyms-test/41a ]

A simple approach to noise smoothing is moving-average window smoothing. In this approach a window of a fixed size (2N + 1 points) is moved across the ion chromatogram, and the intensity value at each point is replaced with the mean intensity calculated over the window. The example below illustrates smoothing of the TIC by window averaging.

Load the data and get the TIC:

>>> andi_file = "/x/PyMS/data/gc01_0812_066.cdf"
>>> data = ANDI_reader(andi_file)
 -> Reading netCDF file '/x/PyMS/data/gc01_0812_066.cdf'
>>> tic = data.get_tic()

Apply mean window smoothing with a 5-point window:

>>> from pyms.Noise.Window import window_smooth
>>> tic1 = window_smooth(tic, window=5)
 -> Window smoothing (mean): the wing is 2 point(s)

Apply median window smoothing with a 5-point window:

>>> tic2 = window_smooth(tic, window=5, median=True)
 -> Window smoothing (median): the wing is 2 point(s)

Apply mean window smoothing, but specify the window as a time string (in this example, 7 seconds):

>>> tic3 = window_smooth(tic, window='7s')
 -> Window smoothing (mean): the wing is 9 point(s)

Time strings are explained in Section 4.2.

4.4.2 Window Averaging on Intensity Matrix

[ This example is in pyms-test/41b ]

In the previous section, window averaging was applied to an IonChromatogram object (in that case, a TIC). Where filtering is to be performed on all ion chromatograms, the window_smooth_im() function may be used instead. The use of this function is identical to the ion chromatogram window_smooth() function, except that an Intensity Matrix is passed to it.
For example, to perform window smoothing on an IntensityMatrix object with a 5-point window and mean window smoothing:

>>> from pyms.Noise.Window import window_smooth_im
... im is a PyMS IntensityMatrix object
>>> im_smooth = window_smooth_im(im, window=5, median=False)

4.4.3 Savitzky-Golay noise filter

[ This example is in pyms-test/41c ]

A more sophisticated noise filter is the Savitzky-Golay filter. Given the data loaded as above, this filter can be applied as follows:

>>> from pyms.Noise.SavitzkyGolay import savitzky_golay
>>> tic1 = savitzky_golay(tic)
 -> Applying Savitzky-Golay filter
      Window width (points): 7
      Polynomial degree: 2

In this example the default parameters were used.

4.4.4 Savitzky-Golay Noise filtering of Intensity Matrix Object

[ This example is in pyms-test/41d ]

The savitzky_golay() function described in the previous section acts on a single ion chromatogram. Where it is desired to perform Savitzky-Golay filtering on the whole intensity matrix, the function savitzky_golay_im() may be used as follows:

>>> from pyms.Noise.SavitzkyGolay import savitzky_golay_im
... im is a PyMS IntensityMatrix object
>>> im_smooth = savitzky_golay_im(im)

4.5 Baseline correction

[ This example is in pyms-test/42a ]

Baseline distortion originating from instrument imperfections and the experimental setup is often observed in mass spectrometry data, and off-line baseline correction is often an important step in data pre-processing. There are many approaches to baseline correction. One advanced approach is based on the top-hat transform developed in mathematical morphology [2], used extensively in digital image processing for tasks such as image enhancement. Top-hat baseline correction was previously applied in proteomics-based mass spectrometry [3]. PyMS currently implements only the top-hat baseline corrector, using the SciPy package 'ndimage'.
For this feature to be available, either SciPy (Scientific Tools for Python [4]) must be installed, or a local version of SciPy's ndimage must be installed. For the SciPy/ndimage installation instructions please see Section 1.3.4.

Application of the top-hat baseline corrector requires the size of the structural element to be specified. The structural element needs to be larger than the features one wants to retain in the spectrum after the top-hat transform. In the example below, the top-hat baseline corrector is applied to the TIC of the data set 'gc01_0812_066.cdf', with a structural element of 1.5 minutes:

>>> from pyms.GCMS.IO.ANDI.Function import ANDI_reader
>>> andi_file = "/x/PyMS/data/gc01_0812_066.cdf"
>>> data = ANDI_reader(andi_file)
 -> Reading netCDF file '/x/PyMS/data/gc01_0812_066.cdf'
>>> tic = data.get_tic()
>>> from pyms.Noise.SavitzkyGolay import savitzky_golay
>>> tic1 = savitzky_golay(tic)
 -> Applying Savitzky-Golay filter
      Window width (points): 7
      Polynomial degree: 2
>>> from pyms.Baseline.TopHat import tophat
>>> tic2 = tophat(tic1, struct="1.5m")
 -> Top-hat: structural element is 239 point(s)
>>> tic.write("output/tic.dat", minutes=True)
>>> tic1.write("output/tic_smooth.dat", minutes=True)
>>> tic2.write("output/tic_smooth_bc.dat", minutes=True)

In the interactive session shown above, the data set is first loaded, then Savitzky-Golay smoothing is applied, followed by baseline correction. Finally, the original, smoothed, and smoothed-and-baseline-corrected TICs are saved in the directory 'output/'.

4.5.1 Tophat Baseline correction on an Intensity Matrix object

[ This example is in pyms-test/42b ]

The tophat() function outlined above acts on a single ion chromatogram. To perform baseline correction on an IntensityMatrix object (i.e. on all ion chromatograms), the tophat_im() function may be used.
Using the same definition for "struct" as above, the tophat_im() function is used as follows:

>>> from pyms.Baseline.TopHat import tophat_im
... im is an IntensityMatrix object
>>> im_base_corr = tophat_im(im, struct="1.5m")

4.6 Pre-processing the IntensityMatrix

[ This example is in pyms-test/43 ]

The entire noise smoothing and baseline correction can be applied to each ion chromatogram in the intensity matrix:

>>> jcamp_file = "/x/PyMS/data/gc01_0812_066.jdx"
>>> data = JCAMP_reader(jcamp_file)
>>> im = build_intensity_matrix(data)
>>> n_scan, n_mz = im.get_size()
>>> for ii in range(n_mz):
...     print "Working on IC#", ii+1
...     ic = im.get_ic_at_index(ii)
...     ic_smooth = savitzky_golay(ic)
...     ic_bc = tophat(ic_smooth, struct="1.5m")
...     im.set_ic_at_index(ii, ic_bc)

Alternatively, the filtering may be performed on the Intensity Matrix without using a for loop, as outlined in the sections above. However, filtering by ion chromatogram in a for loop as described here is much faster.

The resulting IntensityMatrix object can be "dumped" to a file for later retrieval. There are general-purpose object file handling methods in pyms.Utils.IO. For example:

>>> from pyms.Utils.IO import dump_object
>>> dump_object(im, "output/im-proc.dump")

Chapter 5

Peak detection and representation

5.1 Peak Object

Fundamental to GC-MS analysis is the identification of individual components of the sample mix. The basic component unit is represented as a signal peak. In PyMS, a signal peak is represented as a 'Peak' object (the class is defined in pyms.Peak.Class). PyMS provides functions to detect peaks and create Peak objects (discussed at the end of the chapter).

A Peak object stores a minimal set of information about a signal peak, namely the retention time at which the peak apex occurs and the mass spectrum at the apex. Additional information, such as the peak width, TIC, and individual ion areas, can be filtered from the GC-MS data and added to the Peak object information.
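The minimal Peak state just described, including the seconds/minutes convention, can be sketched with a small stand-in class. This is an illustration only, not the pyms.Peak.Class definition:

```python
# Hedged sketch of a Peak: apex retention time plus apex spectrum, with
# placeholders for extras filled in later. NOT the PyMS class itself.
class Peak:
    def __init__(self, rt, mass_spectrum, minutes=False):
        # by convention, retention time is stored internally in seconds
        self.rt = rt * 60.0 if minutes else rt
        self.mass_spectrum = mass_spectrum
        self.area = None      # optional extras, added by later
        self.ion_areas = {}   # processing steps

peak = Peak(31.17, None, minutes=True)
print(round(peak.rt, 1))  # 1870.2 seconds
```

A retention time given as 31.17 minutes is stored as 1870.2 seconds, matching the convention that output methods report seconds.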
5.1.1 Creating a Peak Object

[ This example is in pyms-test/50 ]

A Peak object can be created for a scan at a given retention time by providing the retention time (in minutes or seconds) and the MassSpectrum object of the scan. In the example below, first a file is loaded and an IntensityMatrix, im, is built; then a MassSpectrum, ms, is selected at a given time (31.17 minutes in this example):

>>> from pyms.GCMS.Function import build_intensity_matrix_i
>>> from pyms.GCMS.IO.ANDI.Function import ANDI_reader
>>> andi_file = "/x/PyMS/data/gc01_0812_066.cdf"
>>> data = ANDI_reader(andi_file)
>>> im = build_intensity_matrix_i(data)
>>> index = im.get_index_at_time(31.17*60.0)
>>> ms = im.get_ms_at_index(index)

Now a Peak object can be created for the given retention time and MassSpectrum:

>>> from pyms.Peak.Class import Peak
>>> peak = Peak(31.17, ms, minutes=True)

By default the retention time is assumed to be in seconds. The parameter minutes can be set to True if the retention time is given in minutes. As a matter of convention, PyMS internally stores retention times in seconds, so the minutes parameter ensures that the input and output of the retention time are in the same units.

5.1.2 Peak Object properties

[ This example is in pyms-test/50 ]

The retention time of the peak can be returned with get_rt(); the retention time is returned in seconds with this method. The mass spectrum can be returned with get_mass_spectrum().

The Peak object constructs a unique identification (UID) based on the spectrum and retention time. This helps in managing lists of peaks (covered in the next chapter). The UID can be returned with get_UID(). The UID consists of the masses of the two most abundant ions in the spectrum, the ratio of the abundances of the two ions, and the retention time (in the same units as given when the Peak object was created). The format is Mass1-Mass2-Ratio-RT.
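How such a Mass1-Mass2-Ratio-RT identifier can be composed from a spectrum is sketched below. The exact field encoding (e.g. the ratio as an integer percentage) is an assumption for illustration, not the exact PyMS internals:

```python
# Sketch of composing a Mass1-Mass2-Ratio-RT identifier; the field
# details here are illustrative assumptions, not PyMS's exact code.
def make_uid(mass_list, intensities, rt):
    # indices of ions sorted by decreasing abundance
    order = sorted(range(len(intensities)),
                   key=lambda i: intensities[i], reverse=True)
    m1, m2 = mass_list[order[0]], mass_list[order[1]]
    # ratio of the two largest abundances, as an integer percentage
    ratio = int(100 * intensities[order[1]] / intensities[order[0]])
    return "%d-%d-%d-%.2f" % (m1, m2, ratio, rt)

print(make_uid([73, 147, 319], [7400.0, 2000.0, 10000.0], 31.17))
```

For the hypothetical spectrum above, the two most abundant ions are 319 and 73, their abundance ratio is 74%, and the retention time is 31.17 minutes, yielding "319-73-74-31.17".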
For example,

>>> print peak.get_rt()
1870.2
>>> print peak.get_UID()
319-73-74-31.17

5.1.3 Modifying a Peak Object

[ This example is in pyms-test/51 ]

The Peak object has methods for modifying the mass spectrum. The mass range can be cropped to a smaller range with crop_mass(), and the intensity values for a single ion can be set to zero with null_mass(). For example, the mass range can be set from 60 to 450 m/z, and the ions related to sample preparation can be ignored by setting their intensities to zero:

>>> peak.crop_mass(60, 450)
>>> peak.null_mass(73)
>>> peak.null_mass(147)

The UID is automatically updated to reflect the changes:

>>> print peak.get_UID()
319-205-54-31.17

It is also possible to change the peak mass spectrum by calling the method set_mass_spectrum().

5.2 Peak detection

The general use of a Peak object is to extract peaks from the GC-MS data and build a list of them. In PyMS, the function for peak detection is based on the method of Biller and Biemann (1974) [5]. The basic process is to find all maximising ions in a pre-set window of scans, for a given scan. The ions that maximise at a given scan are taken to belong to the same peak.

The function is BillerBiemann() in pyms.Deconvolution.BillerBiemann.Function. The function has parameters for the window width for detecting the local maxima (points), and the number of scans across which neighbouring, apexing, ions are combined and considered as belonging to the same peak. The number of neighbouring scans to combine is related to the likelihood of detecting a peak apex at a single scan or several neighbouring scans. This is more likely when there are many scans across the peak. It is also possible, however, when there are very few scans across the peak. The scans are combined by taking all apexing ions to have occurred at the scan that had the greatest TIC prior to combining scans.
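The core of the approach can be illustrated with a simplified standalone sketch. This is not the PyMS implementation: here the data is a plain list of scans, each a dict mapping m/z to intensity, and the combining of neighbouring scans is omitted:

```python
def detect_peaks(scans, points=3):
    """Simplified Biller-Biemann-style detection: an ion 'apexes' at a scan
    if its intensity is the maximum over a window of `points` scans; all
    ions apexing at the same scan are grouped into one candidate peak."""
    n_scan = len(scans)
    half = points // 2
    all_mz = sorted({mz for scan in scans for mz in scan})
    apexing = [dict() for _ in range(n_scan)]
    for mz in all_mz:
        ic = [scan.get(mz, 0) for scan in scans]   # ion chromatogram for mz
        for i in range(n_scan):
            window = ic[max(0, i - half):i + half + 1]
            if ic[i] > 0 and ic[i] == max(window):
                apexing[i][mz] = ic[i]
    # each scan with at least one apexing ion becomes a candidate peak
    return [(i, ions) for i, ions in enumerate(apexing) if ions]

scans = [{73: 1}, {73: 5}, {73: 9, 147: 4}, {73: 5}, {73: 1}]
print(detect_peaks(scans, points=3))   # both ions apex at scan 2
```

Note how noise makes this simple rule over-report: any locally maximal blip in any single ion channel becomes a candidate peak, which is why the filtering functions described later are needed.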
5.2.1 Sample processing and Peak detection

[ This example is in pyms-test/52 ]

The process for detecting peaks is to pre-process the data by performing noise smoothing and baseline correction on each ion (as in pyms-test/52). The first steps then are:

>>> from pyms.GCMS.IO.ANDI.Function import ANDI_reader
>>> from pyms.GCMS.Function import build_intensity_matrix
>>> from pyms.Noise.SavitzkyGolay import savitzky_golay
>>> from pyms.Baseline.TopHat import tophat
>>> andi_file = "/x/PyMS/data/gc01_0812_066.cdf"
>>> data = ANDI_reader(andi_file)
>>> im = build_intensity_matrix(data)
>>> n_scan, n_mz = im.get_size()
>>> for ii in range(n_mz):
...     ic = im.get_ic_at_index(ii)
...     ic_smooth = savitzky_golay(ic)
...     ic_bc = tophat(ic_smooth, struct="1.5m")
...     im.set_ic_at_index(ii, ic_bc)

Now the Biller and Biemann based technique can be applied to detect peaks.

>>> from pyms.Deconvolution.BillerBiemann.Function import BillerBiemann
>>> peak_list = BillerBiemann(im)
>>> print len(peak_list)
9845

Note that this is nearly as many peaks as there are scans in the data (9865 scans). This is due to noise and the simplicity of the technique. The number of detected peaks can be constrained by the selection of better parameters. Parameters can be determined by counting the number of points across a peak, and examining where peaks are found. For example, the peak list can be found with the parameters of a window of 9 points and by combining 2 neighbouring scans if they apex next to each other:

>>> peak_list = BillerBiemann(im, points=9, scans=2)
>>> print len(peak_list)
3698

The number of detected peaks has been reduced, but there are still many more than would be expected from the sample. Functions to filter the peak list are covered in the next section.

5.3 Filtering Peak Lists

[ This example is in pyms-test/53 ]

There are two functions to filter the list of Peak objects.
The first, rel_threshold(), modifies the mass spectrum stored in each peak so any intensity that is less than a given percentage of the maximum intensity for the peak is removed. The second, num_ions_threshold(), removes any peak that has less than a given number of ions above a given threshold.

Once the peak list has been constructed, the filters can be applied by:

>>> from pyms.Deconvolution.BillerBiemann.Function import \
...     rel_threshold, num_ions_threshold
>>> pl = rel_threshold(peak_list, percent=2)
>>> new_peak_list = num_ions_threshold(pl, n=3, cutoff=10000)
>>> print len(new_peak_list)
146

The number of detected peaks is now much closer to what would be expected in the test sample.

5.4 Noise analysis for peak filtering

[ This example is in pyms-test/54 ]

In the previous section the cutoff parameter for peak filtering was set by the user. This can work well for individual data files, but can cause problems when applied to large experiments with many individual data files. Where experimental conditions have changed slightly between experimental runs, the ion intensity over the GC-MS run may also change. This means that an inflexible cutoff value can work for some data files, while excluding too many, or including too many, peaks in other files.

An alternative to manually setting the value for cutoff is to use the PyMS function window_analyzer(). This function examines a Total Ion Chromatogram (TIC) and computes a value for the median absolute deviation in troughs between peaks. This gives an approximate threshold value above which false peaks from noise should be filtered out.

To compute this noise value:

>>> from pyms.Noise.Analysis import window_analyzer
... data is a GCMS data object
>>> tic = data.get_tic()
>>> noise_level = window_analyzer(tic)

Now the usual peak deconvolution steps are performed, and the peak list is filtered using this noise value as the cutoff:
... pl is a peak list, n is number of ions above threshold
>>> peak_list = num_ions_threshold(pl, n, noise_level)

5.5 Peak area estimation

[ This example is in pyms-test/55 ]

The Peak object does not contain any information about the width or area of the peak when it is created. This information can be added after the instantiation of a Peak object. The area of the peak can be set by the set_area() or set_ion_areas() method of the peak object.

The total peak area can be obtained by the peak_sum_area() function in pyms.Peak.Function. The function determines the total area as the sum of the ion intensities for all masses that apex at the given peak. To calculate the peak area of a single mass, the intensities are added from the apex of the mass peak outwards. Edge values are added until the following conditions are met: the added intensity adds less than 0.5% to the accumulated area; or the added intensity starts increasing (i.e. when the ion is common to co-eluting compounds). To avoid noise effects, the edge value is taken at the midpoint of three consecutive edge values.

Given a list of peaks, areas can be determined and added as follows:

>>> from pyms.Peak.Function import peak_sum_area
>>> for peak in peak_list:
...     area = peak_sum_area(intensity_matrix, peak)
...     peak.set_area(area)
...

5.5.1 Individual Ion Areas

[ This example is in pyms-test/56 ]

While the previous approach uses the sum of all areas in the peak to estimate the peak area, the user may also choose to record the area of each individual ion in each peak. This can be useful when the intention is to later perform quantitation based on the area of a single characteristic ion for a particular compound. It is also essential if using the Common Ion Algorithm for quantitation, outlined in section 6.4.
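The outward summation rule used for a single-ion area in section 5.5 can be sketched in isolation. This is a simplified standalone version, not the PyMS implementation: the smoothing of edge values over three consecutive points is omitted, leaving only the raw 0.5%-growth and increasing-intensity stopping conditions:

```python
def sum_from_apex(ic, apex):
    """Sum intensities outwards from the apex of a single-ion peak,
    stopping on each side when the next point adds less than 0.5% to
    the accumulated area or when intensities start increasing again."""
    area = float(ic[apex])
    for step in (-1, 1):              # walk left, then right, from the apex
        i = apex + step
        prev = ic[apex]
        while 0 <= i < len(ic):
            if ic[i] > prev:          # intensity rising: co-eluting compound
                break
            if ic[i] < 0.005 * area:  # adds less than 0.5% to the area
                break
            area += ic[i]
            prev = ic[i]
            i += step
    return area

print(sum_from_apex([0, 1, 10, 100, 10, 1, 0], 3))   # -> 122.0
```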
To set the area of each ion for each peak, the following code is used:

>>> from pyms.Peak.Function import peak_top_ion_areas
>>> for peak in peak_list:
...     area_dict = peak_top_ion_areas(intensity_matrix, peak)
...     peak.set_ion_areas(area_dict)
...

This will set the areas of the 5 most abundant ions in each peak. If it is desired to record more than the top five ions, the argument num_ions=x should be supplied, where x is the number of most abundant ions to be recorded, e.g.

>>> area_dict = peak_top_ion_areas(intensity_matrix, peak, num_ions=10)

will record the 10 most abundant ions for each peak.

The individual ion areas can be set instead of, or in addition to, the total area for each peak.

5.5.2 Reading the area of a single ion in a peak

If the individual ion areas have been set for a peak, it is possible to read the area of an individual ion for the peak, e.g.

>>> peak.get_ion_area(101)

will return the area of the m/z value 101 for the peak. If the area of that ion has not been set (i.e. it was not one of the most abundant ions), the function will return None.

Chapter 6

Peak alignment by dynamic programming

PyMS provides functions to align GC-MS peaks by dynamic programming [6]. The peak alignment by dynamic programming uses both peak apex retention time and mass spectra. This information is determined from the raw GC-MS data by applying a series of processing steps to produce data that can then be aligned and used for statistical analysis. The details are described in this chapter.

6.1 Preparation of multiple experiments for peak alignment by dynamic programming

6.1.1 Creating an Experiment

[ This example is in pyms-test/60 ]

Before aligning peaks from multiple experiments, the peak objects need to be created and encapsulated into PyMS experiment objects.
During this process it is often useful to pre-process the peaks in some way, for example to null certain m/z channels and/or to select a certain retention time range.

To capture the data and related information prior to peak alignment, an Experiment object is used. The Experiment object is defined in pyms.Experiment.Class.

The procedure is to proceed as described in the previous chapter. Namely: read a file; bin the data into fixed mass values; smooth the data; remove the baseline; deconvolute peaks; filter the peaks; set the mass range; remove uninformative ions; and estimate peak areas. The process is given in the following program listing.

import sys, os
sys.path.append("/x/PyMS")

from pyms.GCMS.IO.ANDI.Function import ANDI_reader
from pyms.GCMS.Function import build_intensity_matrix_i
from pyms.Noise.SavitzkyGolay import savitzky_golay
from pyms.Baseline.TopHat import tophat
from pyms.Peak.Class import Peak
from pyms.Peak.Function import peak_sum_area

from pyms.Deconvolution.BillerBiemann.Function import BillerBiemann, \
    rel_threshold, num_ions_threshold

# deconvolution and peak list filtering parameters
points = 9; scans = 2; n = 3; t = 3000; r = 2;

andi_file = "/x/PyMS/data/a0806_077.cdf"

data = ANDI_reader(andi_file)

# integer mass
im = build_intensity_matrix_i(data)

# get the size of the intensity matrix
n_scan, n_mz = im.get_size()

# smooth data
for ii in range(n_mz):
    ic = im.get_ic_at_index(ii)
    ic1 = savitzky_golay(ic)
    ic_smooth = savitzky_golay(ic1)
    ic_base = tophat(ic_smooth, struct="1.5m")
    im.set_ic_at_index(ii, ic_base)

# do peak detection on pre-trimmed data

# get the list of Peak objects
pl = BillerBiemann(im, points, scans)

# trim by relative intensity
apl = rel_threshold(pl, r)

# trim by threshold
peak_list = num_ions_threshold(apl, n, t)
print "Number of Peaks found:", len(peak_list)

# ignore TMS ions and set mass range
for peak in peak_list:
    peak.crop_mass(50, 540)
    peak.null_mass(73)
    peak.null_mass(147)
    # find area
    area = peak_sum_area(im, peak)
    peak.set_area(area)

The resulting list of peaks can now be stored as an Experiment object.

from pyms.Experiment.Class import Experiment
from pyms.Experiment.IO import store_expr

# create an experiment
expr = Experiment("a0806_077", peak_list)

# set time range for all experiments
expr.sele_rt_range(["6.5m", "21m"])

store_expr("output/a0806_077.expr", expr)

Once an experiment has been defined, it is possible to limit the peak list to a desired retention time range using sele_rt_range(). The resulting experiment object can then be stored for later alignment.

6.1.2 Multiple Experiments

[ This example is in pyms-test/61a ]

This example considers the preparation of three GC-MS experiments for peak alignment. The experiments are named 'a0806_077', 'a0806_078', 'a0806_079', and represent separate GC-MS sample runs from the same biological sample.

The procedure is the same as above, and repeated for each experiment. For example:

...
# define path to data files
base_path = "/x/PyMS/data/"

# define experiments to process
expr_codes = [ "a0806_077", "a0806_078", "a0806_079" ]

# loop over all experiments
for expr_code in expr_codes:

    print "Processing", expr_code

    # define the names of the peak file and the corresponding ANDI-MS file
    andi_file = os.path.join(base_path, expr_code + ".cdf")
    ...
    ...
    # create an experiment
    expr = Experiment(expr_code, peak_list)

    # use same time range for all experiments
    expr.sele_rt_range(["6.5m", "21m"])

    store_expr("output/"+expr_code+".expr", expr)

[ This example is in pyms-test/61b ]

The previous set of data all belong to the same experimental condition.
That is, they represent one group and any comparison between the data is a within-group comparison. For the original experiment, another set of GC-MS data was collected for a different experimental condition. This group must also be stored as a set of experiments, and can be used for between-group comparison.

The experiments are named 'a0806_140', 'a0806_141', 'a0806_142', and are processed and stored as above (see pyms-test/61b).

6.2 Dynamic programming alignment of peak lists from multiple experiments

• This example is in pyms-test/62

• This example uses the subpackage pyms.Peak.List.DPA, which in turn uses the Python package 'Pycluster'. For 'Pycluster' installation instructions see Section 1.3.3.

In this example the experiments 'a0806_077', 'a0806_078', and 'a0806_079' prepared in pyms-test/61a will be aligned, and therefore the script pyms-test/61a/proc.py must be run first, to create the files 'a0806_077.expr', 'a0806_078.expr', 'a0806_079.expr' in the directory pyms-test/61a/output/. These files contain the post-processed peak lists from the three experiments.

A script for running the dynamic programming alignment on these experiments is given below.
"""proc.py """ import sys, os sys.path.append("/x/PyMS/") from pyms.Experiment.IO import load_expr from pyms.Peak.List.DPA.Class import PairwiseAlignment from pyms.Peak.List.DPA.Function import align_with_tree, exprl2alignment # define the input experiments list exprA_codes = [ "a0806_077", "a0806_078", "a0806_079" ] # within replicates alignment parameters Dw = 2.5 # rt modulation [s] Gw = 0.30 # gap penalty # do the alignment print ’Aligning expt A’ expr_list = [] expr_dir = "../61a/output/" for expr_code in exprA_codes: file_name = os.path.join(expr_dir, expr_code + ".expr") expr = load_expr(file_name) expr_list.append(expr) F1 = exprl2alignment(expr_list) T1 = PairwiseAlignment(F1, Dw, Gw) A1 = align_with_tree(T1, min_peaks=2) A1.write_csv(’output/rt.csv’, ’output/area.csv’) The script reads the experiment files from the directory where they were stored (61a/output), and creates a list of the loaded Experiment objects. Each experiment object is converted into an Alignment object with exprl2alignment(). In this example, there is only one experimental condition so the alignment object is only for 49 50 Chapter 6. Peak alignment by dynamic programming within group alignment (this special case is called 1-alignment). The variable F1 is a Python list containing three alignment objects. The pairwise alignment is then performed. The parameters for the alignment by dynamic programming are: Dw, the retention time modulation in seconds; and Gw, the gap penalty. These parameters are explained in detail in [6]. PairwiseAlignment(), defined in pyms.Peak.List.DPA.Class is a class that calculates the similarity between a ll peaks in one sample with those of another sample. This is done for all possible pairwise alignments (2-alignments). The output of PairwiseAlignment() (T1) is an object which contains the dendrogram tree that maps the similarity relationship between the input 1-alignments, and also 1-alignments themselves. 
The function align_with_tree() takes the object T1 and aligns the individual alignment objects according to the guide tree. In this example, the individual alignments are three 1-alignments, and the function align_with_tree() first creates a 2-alignment from the two most similar 1-alignments and then adds the third 1-alignment to this to create a 3-alignment. The parameter min_peaks=2 specifies that any peak column of the data matrix that has less than two peaks in the final alignment will be dropped. This is useful to clean up the data matrix of accidental peaks that are not truly observed over the set of replicates.

Finally, the resulting 3-alignment is saved by writing alignment tables containing peak retention times ('rt.csv') and the corresponding peak areas ('area.csv'). These two files are plain ASCII files in CSV format, and are saved in the directory 62/output/. The file 'area.csv' contains the data matrix where the corresponding peaks are aligned in the columns and each row corresponds to an experiment. The file 'rt.csv' is useful for manually inspecting the alignment in some GUI driven program.

6.3 Between-state alignment of peak lists from multiple experiments

[ This example is in pyms-test/63 ]

In the previous example the list of peaks were aligned within a single experiment with multiple replicates ("within-state alignment"). In practice, it is of more interest to compare the two experimental states. In a typical experimental setup there can be multiple replicate experiments on each experimental state or condition. To analyze the results of such an experiment statistically, the list of peaks need to be aligned within each experimental state and also between the states. The result of such an alignment would be the data matrix of integrated peak areas. The data matrix contains a row for each sample and the number of columns is determined by the number of unique peaks (metabolites) detected in all the experiments.
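The layout of such a data matrix can be sketched in a standalone way. This is purely illustrative; alignment_to_matrix and its column representation are hypothetical names, not part of the PyMS API. Each aligned column maps an experiment name to a peak area, with None marking a peak missing from that run:

```python
def alignment_to_matrix(columns, expr_names):
    """Build a row-per-experiment, column-per-aligned-peak area matrix."""
    return [[col.get(name) for col in columns] for name in expr_names]

# two aligned peak columns over three replicate experiments
columns = [
    {"a0806_077": 1200.0, "a0806_078": 1100.0, "a0806_079": 1300.0},
    {"a0806_077": 800.0, "a0806_079": 750.0},   # peak missing in one run
]
matrix = alignment_to_matrix(columns, ["a0806_077", "a0806_078", "a0806_079"])
```

A min_peaks-style filter, as used above, would drop any column whose number of non-None entries falls below the threshold.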
In principle, all experiments could be aligned across conditions and replicates in the one process. However, a more robust approach is to first align experiments within each set of replicates (within-state alignment), and then to align the resulting alignments (between-state alignment) [6].

This example demonstrates how the peak lists from two cell states are aligned. Cell state A consists of the three experiments aligned in pyms-test/61a ('a0806_077', 'a0806_078', 'a0806_079') and cell state B consists of the three experiments aligned in pyms-test/61b ('a0806_140', 'a0806_141', and 'a0806_142'). The between-group alignment can be performed by the following alignment commands.

# between replicates alignment parameters
Db = 10.0 # rt modulation
Gb = 0.30 # gap penalty

print 'Aligning input {1,2}'
T9 = PairwiseAlignment([A1,A2], Db, Gb)
A9 = align_with_tree(T9)

A9.write_csv('output/rt.csv', 'output/area.csv')

where A1 and A2 are the results of the within-group alignments (as above) for groups A and B, respectively. In this example the retention time tolerance for between-state alignment is greater than the retention time tolerance for the within-state alignment, as we expect less fidelity in retention times between the states. The same functions are used for the within-state and between-state alignment. The result of the alignment is saved to a file as the area and retention time matrices (described above).

6.4 Common Ion Area Quantitation

[ This example is in pyms-test/64 ]

The area.csv file produced in the preceding section lists the total area of each peak in the alignment. The total area is the sum of the areas of each of the individual ions in the peak. While this approach produces broadly accurate results, it can result in errors where neighbouring peaks or unfiltered noise add to the peak in some way.
One alternative to this approach is to pick a single ion which is common to a particular peak (compound), and to report only the area of this ion for each occurrence of that peak in the alignment. Using the method common_ion() of the PyMS class Alignment, PyMS can select an ion for each aligned peak which is both abundant and occurs most often for that peak. We call this the 'Common Ion Algorithm' (CIA).

To implement this in PyMS, it is essential that the individual ion areas have been set (see section 5.5.1).

6.4.1 Using the Common Ion Algorithm

When using the CIA for area quantitation, a different method of the class Alignment is used to write the area matrix: write_common_ion_csv(). This requires a list of the common ions for each peak in the alignment. This list is generated using the Alignment class method common_ion().

Continuing from the previous example, the following invokes common ion filtering on the previously created alignment object A9:

>>> common_ion_list = A9.common_ion()

The variable common_ion_list is a list of the common ion for each peak in the alignment. This list is the same length as the alignment. To write peak areas using common ion quantitation:

>>> A9.write_common_ion_csv('output/area_common_ion.csv', common_ion_list)

Chapter 7

The Display module

PyMS has graphical capabilities to display information such as ion chromatogram objects (ICs), total ion chromatograms (TICs), and detected lists of peaks. This functionality is provided by the package matplotlib, which must be installed in order to use the Display module and to run the scripts referred to in this chapter. The installation instructions are provided in Section 1.3.5.

7.1 Displaying the TIC

[ This example is in pyms-test/70a ]

The TIC must first be extracted from the data (see section 3.3). Once the TIC object is created, a simple call to the function plot_ics() displays the TIC in a graphical window.
In addition to the TIC, two strings may be passed to plot_ics() which label the data (in this case 'TIC'), and the overall plot ('TIC for gc01_0812_066').

>>> plot_ics(tic, 'TIC', 'TIC for gc01_0812_066')

The window in figure 7.1 should now be displayed. To zoom in on a portion of the plot, select the zoom button, then hold down the left mouse button while dragging a rectangle over the area of interest. To return to the original view, click on the home button. The pan button allows panning across the zoomed plot.

Figure 7.1: Graphics window displayed by the script 70a/proc.py

7.2 Displaying multiple ion chromatogram objects

[ This example is in pyms-test/70b ]

The Display module can plot multiple ICs and the TIC on the same figure, as shown in the following example. First, a list of ion chromatogram objects is created:

>>> tic = data.get_tic()
>>> ic = im.get_ic_at_mass(73)
>>> ic1 = im.get_ic_at_mass(147)
>>> ics = [tic, ic, ic1]

This list is passed to the function plot_ics(), along with a list of strings to label each ion chromatogram:

>>> plot_ics(ics, ['TIC', '73', '147'], 'TIC and ICs for m/z = 73 & 147')

This results in figure 7.2 being displayed.

Figure 7.2: Graphics window displayed by the script 70b/proc.py

7.3 Displaying the mass spectrum

[ This example is in pyms-test/70c ]

The PyMS Display module can also be used to display individual mass spectra. The following selects the mass spectrum of interest:

... im is an intensity matrix
>>> ms = im.get_ms_at_index(1024)

The PyMS function plot_ms() can then be called to display the mass spectrum to the screen.

>>> plot_ms(ms, 'Mass Spectrum at index 1024')

The resulting window is shown in figure 7.3.

7.4 Displaying detected peaks on the TIC plot

[ This example is in pyms-test/71 ]
Figure 7.3: Graphics window displayed by the script 70c/proc.py

The Display class is a more powerful implementation of plotting which allows plotting of detected peaks and user interaction with the plot figure.

In order to plot a list of peaks, the peaks must be created first. The example pyms-test/71 contains the script proc_save_peaks.py which produces such a peak list. For more information on detecting peaks see section 5.2. The function store_peaks() in proc_save_peaks.py stores the peaks, while load_peaks() in proc.py loads them for the Display class to use.

When using the Display class, a series of plots is added to the figure one at a time before the final plot is displayed. The first step is always creating an instance of the Display class:

>>> display = Display()

Next some ICs and the TIC are plotted. Unlike in simple plotting, the TIC is plotted using a separate function. This function ensures that the TIC is always plotted in blue for easy reference. The legend for each IC is supplied to the functions, but the overall figure label is not supplied at this time.

>>> display.plot_ics(ics, ['73', '147'])
>>> display.plot_tic(tic, 'TIC')

The function plot_peaks() adds the PyMS detected peaks to the figure.

>>> display.plot_peaks(peak_list, 'Peaks')

Finally, the function do_plotting() is called to draw the figure with labels and display the plot.

>>> display.do_plotting('TIC, ICs, and PyMS Detected Peaks')

This should result in figure 7.4 being displayed.

Figure 7.4: Graphics window displayed by the script 71/proc.py

7.4.1 User interaction with the plot window

When using the Display class, the resulting figure window can be used to access data about the displayed peaks. Clicking on a peak causes a list of the 5 highest intensity eluting ions at that peak to be written to the screen in order.
Clicking a mouse button over one of the peaks should result in output similar to the following:

mass     intensity
273      1003678.47619
73       625396.428571
147      526953.333333
363      255903.714286
347      241031.333333

Figure 7.5: The mass spectrum displayed by PyMS when a peak in the graphics window of Figure 7.4 is clicked on

In addition, clicking a mouse button other than the left button on a peak displays the mass spectrum at the peak in a new window (figure 7.5). Clicking on other peaks will display further mass spectra for those peaks in new windows.

Chapter 8

Parallel processing with PyMS

8.1 Requirements

Using the PyMS parallel capabilities requires installation of the package 'mpi4py', which provides bindings of the Message Passing Interface (MPI) for the Python programming language. This package can be downloaded from http://code.google.com/p/mpi4py/. Since 'mpi4py' provides only Python bindings, it requires an MPI implementation. We recommend using mpich2:

http://www.mcs.anl.gov/research/projects/mpich2/

We show the installation of 'mpich2' and 'mpi4py' on a Linux system from software distributions downloaded from the projects' web sites.

8.1.1 Installation of 'mpich2'

1. From the mpich2 project web site download the current distribution of mpich2 (in our case the file 'mpich2-1.2.1p1.tar.gz').

2. Prepare the directory for mpich2 installation. In this example we have chosen to use /usr/local/mpich2/. Our version of mpich2 is 1.2.1, and to allow for the installation of a different version later, we create a subdirectory "1.2.1":

$ mkdir -vp /usr/local/mpich2/1.2.1

The above command will make the directory /usr/local/mpich2/ and also /usr/local/mpich2/1.2.1. Note that /usr/local is usually owned by root, and the above commands may require root privileges.

3. Unpack this file and change to the source code directory:

$ tar xvfz mpich2-1.2.1p1.tar.gz
$ cd mpich2-1.2.1p1

4.
Configure, compile, and install mpich2:

$ ./configure --prefix=/usr/local/mpich2/1.2.1 --enable-sharedlibs=gcc
$ make
$ make install

If /usr/local/mpich2/1.2.1 is owned by root, the above commands may require root privileges.

8.1.2 Installation of 'mpi4py'

1. From the mpi4py project web site download the current distribution of mpi4py (in our case the file 'mpi4py-1.2.1.tar.gz').

2. Unpack this file and change to the source code directory:

$ tar xvfz mpi4py-1.2.1.tar.gz
$ cd mpi4py-1.2.1

3. Edit the file 'mpi.cfg' to reflect the location of mpich2. In our case this file after editing contained the following:

# MPICH2
[mpi]
mpi_dir = /usr/local/mpich2/1.2.1
mpicc = %(mpi_dir)s/bin/mpicc
mpicxx = %(mpi_dir)s/bin/mpicxx

4. Install mpi4py:

$ python setup.py install

5. Check that mpi4py works:

$ python
Python 2.5.2 (r252:60911, Sep 10 2008, 14:39:22)
[GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import mpi4py
>>>

If the above import produced no output, mpi4py is installed properly and ready to use.

8.2 Background to using PyMS in parallel

Any processing that loops through ion chromatograms or mass spectra can be performed in parallel, by distributing the processing of individual ion chromatograms or mass spectra to different CPUs via the efficient MPI mechanism. Before the parallel processing can be deployed, the data needs to be binned to produce an IntensityMatrix object, as described in Section 3.1. This is essentially a two-dimensional matrix, with ion chromatograms along one dimension and mass spectra along the other. Consider the processing which applies a noise smoothing function to each ion chromatogram.
We first read the raw data:

andi_file = "/x/PyMS/data/gc01_0812_066.cdf"
data = ANDI_reader(andi_file)

Then build the intensity matrix, and get its dimensions:

im = build_intensity_matrix_i(data)
n_scan, n_mz = im.get_size()

The last command sets the variables n_scan and n_mz to the number of scans and the number of m/z values present in the data, respectively. Processing of ion chromatograms with the noise smoothing function requires fetching each ion chromatogram from the data, and applying the noise smoothing function. This can be achieved with a simple loop:

for ii in range(n_mz):
    print ii+1,
    ic = im.get_ic_at_index(ii)
    ic_smooth = window_smooth(ic, window=7)

This example epitomizes the typical processing required on the GC-MS data. Another, equally important kind of processing is that of individual mass spectra. In this case the same logic can be applied, except that one would loop over the other dimension of the IntensityMatrix object im. That is, one would loop over all the scan indices, and use the method get_ms_at_index() to fetch individual mass spectra:

for ii in range(n_scan):
    print ii+1,
    ms = im.get_ms_at_index(ii)
    # here do something with the mass spectrum 'ms'

Processing of data in this fashion is computationally intensive. A typical data set may consist of 3,000-10,000 scans and 500 m/z values. If complex processing algorithms are applied to each ion chromatogram (or mass spectrum), the processing will quickly become computationally prohibitive. The type of calculation illustrated above is an ideal candidate for parallelization, because each ion chromatogram (or mass spectrum) is processed independently. PyMS takes advantage of this and allows one to harvest the power of multiple CPUs to speed up the processing. To achieve this, PyMS can distribute the loop from above (either type, i.e. over ion chromatograms or mass spectra) over the available CPUs, achieving a linear speed-up with the number of CPUs.
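The index-distribution idea can be sketched in a standalone way. The round-robin scheme below is purely illustrative (the actual strategy used by PyMS may differ), and distribute_indices is a hypothetical name; in a real MPI run, rank and size would come from the MPI communicator:

```python
def distribute_indices(n, rank, size):
    """Indices that MPI process `rank` of `size` processes would handle
    under a round-robin distribution of n loop iterations."""
    return list(range(rank, n, size))

# with 3 processes and 10 ion chromatograms:
work = [distribute_indices(10, rank, 3) for rank in range(3)]
print(work)   # -> [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Because each iteration is independent, every index is handled by exactly one process and no communication is needed during the loop itself.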
8.3 Using PyMS in parallel

Using PyMS in parallel requires minimal intervention: one only needs to
invoke a special method of the IntensityMatrix object in the for loop
described above. For looping over all ion chromatograms in parallel:

    for ii in im.iter_ic_indices():
        print ii+1,
        ic = im.get_ic_at_index(ii)
        ic_smooth = window_smooth(ic, window=7)

The only change is that

    for ii in range(n_mz):

is replaced with

    for ii in im.iter_ic_indices():

The corresponding change for looping over all mass spectra is to replace

    for ii in range(n_scan):

with

    for ii in im.iter_ms_indices():

The special constructs 'for ii in im.iter_ic_indices():' and 'for ii in
im.iter_ms_indices():' distribute the calculation in parallel if MPI
capability is available (i.e. mpi4py is installed on the system and
multiple CPUs are available). If MPI capability is not available, the
processing is performed in serial mode. Running in parallel also requires
some prior preparation, as explained below. Consider the following script,
which performs the noise smoothing example described above (named
'proc.py'). This script can be run in either serial or parallel mode.
"""proc.py """ import sys sys.path.append("/x/PyMS") from pyms.GCMS.IO.ANDI.Function import ANDI_reader from pyms.GCMS.Function import build_intensity_matrix_i from pyms.Noise.Window import window_smooth # read the raw data as a GCMS_data object andi_file = "/x/PyMS/data/gc01_0812_066.cdf" data = ANDI_reader(andi_file) # build the intensity matrix im = build_intensity_matrix_i(data) # get the size of the intensity matrix n_scan, n_mz = im.get_size() print "Size of the intensity matrix is (n_scans, n_mz):", n_scan, n_mz # loop over all m/z values, fetch the corresponding IC, and perform # noise smoothing for ii in im.iter_ic_indices(): print ii+1, ic = im.get_ic_at_index(ii) ic_smooth = window_smooth(ic, window=7) A simple running of this script will produce a serial run without any warning messages: $ python proc.py -> Reading netCDF file ’/x/PyMS/data/gc01_0812_066.cdf’ Size of the intensity matrix is (n_scans, n_mz): 9865 551 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 ... [ further output deleted ] ... 63 64 Chapter 8. Parallel processing with PyMS Figure 8.1: The xterm output of the program ’top’ with PyMS running in serial mode on the computer with multiple CPUs Inspection of the CPU usage during the execution of the program shows that only one CPU is utilised 100% (although multiple CPUs are available) as shown in Figure 8.2. To run the above script in parallel, one needs first to start the mpitch2 process launcher, called ’mpd’ (this is a program, in this example located in /usr/local/mpich2/1.2.1/bin/mpd, see the section ). This can be achieved as follows: $ /usr/local/mpich2/1.2.1/bin/mpd --daemon The above command start ’mpd’ as a daemon (the program runs in the background, without a controlling terminal). 
A common problem that causes the above command to fail is the absence of
the .mpd.conf file, which 'mpd' requires to be present in the home
directory of the user who is starting the process. Here is an excerpt from
the 'mpd' help page:

    A file named .mpd.conf file must be present in the user's home
    directory with read and write access only for the user, and must
    contain at least a line with MPD_SECRETWORD=
    To run mpd as root, install it while root and instead of a .mpd.conf
    file use mpd.conf (no leading dot) in the /etc directory.

Fixing this problem is simple, and requires creating the file ~/.mpd.conf,
which in our case contains only one line:

    MPD_SECRETWORD=euSe0veo

Figure 8.2: The xterm output of the program 'top' with PyMS running in
serial mode on a computer with multiple CPUs

After this 'mpd' can be launched. Running the PyMS script in parallel mode
requires the 'mpirun' command:

    $ /usr/local/mpich2/1.2.1/bin/mpirun -np 2 python proc.py

The above command runs 'python proc.py' in parallel, in this case using
two CPUs (-np 2). The execution produces the following output:

    $ /usr/local/mpich2/1.2.1/bin/mpirun -np 2 python proc.py
    -> Reading netCDF file '/x/PyMS/data/gc01_0812_066.cdf'
    -> Reading netCDF file '/x/PyMS/data/gc01_0812_066.cdf'
    Size of the intensity matrix is (n_scans, n_mz):Size of the intensity
    matrix is (n_scans, n_mz): 9865 551
    276 9865 551 1 277 2 278 3 279 4 280 5 281 6 282 7 283 8 284 9 285 10
    286 11 287 12 288 13 289 14 290 15 291 16 292 17 293 18 294 19 295 20
    296 21 297 22 298 23 299 24 300 25 301 26 302 27 303 28 304 2
    ... [ further output deleted ] ...

The above shows that two processes are active (each reading its own copy
of the data), and that the distribution of processing between the two
processes has been achieved automatically by PyMS. Since both processes
were started from the same terminal,
their output is intermingled.

Figure 8.3: The xterm output of the program 'top' with PyMS running in
parallel mode with two CPUs

This time the processing uses two CPUs, as can be seen from inspection of
the CPU utilisation shown in Figure 8.3, and the execution of the script
'proc.py' is now two times faster. This simple example shows how to speed
up PyMS processing on a common workstation with two CPUs. MPI also allows
PyMS processing to run on specialised computers with many CPUs, such as
Linux clusters, and the MPI implementation allows PyMS to be used in such
distributed computing environments much as in the example above. We have
tested PyMS on Linux clusters, and the resulting speed-up is nearly linear
with the number of CPUs employed (Figure 8.4).

Figure 8.4: The speed-up in PyMS processing when run in parallel on a
Linux cluster, as a function of the number of CPUs deployed

Chapter 9
GCMS Simulator

The PyMS GCMS data simulator is a useful tool for creating datasets with
known peaks.

9.1 Setting up input data for the simulator

The PyMS GCMS data simulator package provides an Intensity Matrix object
which can be processed by PyMS in the same way as any Intensity Matrix
object derived from real data. As input, the simulator requires a list of
peaks, a list of times and a list of m/z channels. In this example we will
use the time list, mass list and PyMS-detected peak list from an existing
real GCMS dataset. For the purpose of demonstrating the simulator, only a
portion of this dataset is examined. The following code selects the area
of interest in the dataset and limits subsequent processing to this
interval:
    ... data is a GCMS_data object
    >>> data.trim(4101, 4350)

9.1.1 Providing input to the simulator

[ This example is in pyms-test/90 ]

The peaks are detected using the PyMS function BillerBiemann(), and
subsequently filtered to remove those which do not meet certain
requirements. This procedure is fully detailed in Section 5.2.

Next, the time list and mass list of the original data are copied to be
used in the simulator:

    >>> mass_list = real_im.get_mass_list()
    >>> time_list = real_im.get_time_list()

where real_im is the Intensity Matrix object derived from the real data.
(For more information about Intensity Matrix objects and data conversion,
refer to Section 3.1.)

9.2 Running the simulator

The function gcms_sim() takes three inputs: time_list, mass_list and
peak_list. It returns an Intensity Matrix object which can be manipulated
in PyMS.

    >>> sim_im = gcms_sim(time_list, mass_list, peak_list)

Using the PyMS display functionality described in Chapter 7, the results
of the simulator can be viewed. Figure 9.1 shows the IonChromatograms of
the simulated Intensity Matrix.

Figure 9.1: Graphics window displayed by the script 90/proc.py

9.3 Adding noise to the simulated data

Noise can be added to either the full Intensity Matrix, or to individual
Ion Chromatograms. Currently two noise models have been implemented in
PyMS. These are:

1. Gaussian noise drawn from a normal distribution with constant scale
   across all scans.

2. Gaussian noise drawn from a normal distribution with scale varying
   with the intensity value at that point.

9.3.1 Constant scale noise

[ This example is in pyms-test/91 ]

The Python package NumPy contains routines for pseudo-random number
generation. These "random" values are drawn from a normal distribution
with a user-defined scale, over the same number of samples as the number
of scans in a GC-MS experiment.
The resulting list of numbers is a good approximation to gaussian noise on
an m/z channel in a GC-MS experiment. To add gaussian noise to an IC, the
following code is used:

    >>> ic = sim_im.get_ic_at_mass(73)
    >>> scale = 1000
    >>> add_gaussc_noise_ic(ic, scale)

The normal distribution from which the noise values are drawn in this
example has a scale of 1000. This noisy IC can be displayed using the PyMS
package Display; the resulting figure is shown in Figure 9.2.

Figure 9.2: Graphics window displayed by the script 91/proc.py

9.3.2 Variable scale noise

[ This example is in pyms-test/92 ]

In reality, noise from a GC-MS experiment tends to have higher values in
areas where peaks occur than in the valleys between peaks. To model this,
a variable scale noise function has been implemented in PyMS. For a given
time-point in an Ion Chromatogram, if the intensity at that time-point is
greater than a user-defined threshold value, the scale of the normal
distribution from which the noise value is drawn is proportional to the
intensity at that point. The actual scale of the distribution at that
point is scale * (intensity * proportion) if the intensity is above the
cutoff value, and scale if below. To add variable scale gaussian noise to
an IC, the following code is used:

    >>> ic = sim_im.get_ic_at_mass(73)
    >>> scale = 1000
    >>> cutoff = 10000
    >>> prop = 0.0003
    >>> add_gaussv_noise_ic(ic, scale, cutoff, prop)

The resulting noisy IC is shown in Figure 9.3.

9.4 Adding noise to the whole simulated dataset

Often, the user may wish to add noise to all ICs in the dataset. This can
be accomplished easily using the functions add_gaussc_noise() and
add_gaussv_noise(). The noise models in question are exactly the same as
for the single-IC noise functions above.

9.4.1 Constant scale gaussian noise

[ This example is in pyms-test/93 ]

For constant scale gaussian noise:

    ... sim_im is a simulated intensity matrix object
Figure 9.3: Graphics window displayed by the script 92/proc.py

    >>> scale = 1000
    >>> add_gaussc_noise(sim_im, scale)

This code adds noise to all Ion Chromatograms in the Intensity Matrix.
These can be viewed in the PyMS package Display, with the resulting figure
shown in Figure 9.4.

Figure 9.4: Graphics window displayed by the script 93/proc.py

9.4.2 Variable scale gaussian noise

[ This example is in pyms-test/94 ]

To apply variable scale gaussian noise to the whole simulated dataset:

    >>> scale = 1000
    >>> cutoff = 10000
    >>> prop = 0.0003
    >>> add_gaussv_noise(sim_im, scale, cutoff, prop)

The arguments supplied to add_gaussv_noise() relate to how the gaussian
distribution supplying the noise values changes with the intensity. These
parameters are discussed further in sub-section 9.3.2 above.

9.5 Detecting Peaks in the Simulated data

[ This example is in pyms-test/95 ]

The simulated intensity matrix may be processed in PyMS in the same way as
an intensity matrix derived from real data. Applying the same techniques
as outlined in Section 5.2, the peaks of the simulated IM with the added
noise from 9.4.2 above can be found and displayed. The found peaks are
shown in Figure 9.6.

Figure 9.5: Graphics window displayed by the script 94/proc.py

Figure 9.6: Graphics window displayed by the script 95/proc.py

Appendix A
Processing GC-MS data using PyMS

This appendix presents different test examples where PyMS is used to
process GC-MS data from different instruments.

A.1 Processing GC-Quad data

[ This example is in pyms-test/A1 ]

A.1.1 Instrument

• Maker: Agilent
• Software: Chemstation

A.1.2 Raw Data

• File: gc01_0812_066.cdf
• Location: PyMS/data/
• Nature: Metabolite mix
• Processing: This is the raw cdf file generated by the instrument; no
  further processing was done using Chemstation.
A.1.3 Processing using PyMS

As detailed in the previous chapters, PyMS processes this data by binning
the data, smoothing the data, removing the baseline, deconvoluting peaks,
filtering peaks, setting the mass range, removing uninformative ions and
estimating peak areas. GC-Quad data contains approximately 3 scans per
second. As a result, the values of the following parameters in PyMS are
specific to this dataset.

• points in the Biller Biemann algorithm
  The parameter points in the Biller Biemann algorithm defines the width
  of the window that scans across the ion chromatogram in order to define
  a peak. GC-Quad data has only a few scans per second, and hence the
  peaks form over only a small number of scans, resulting in a lower value
  for the window width. In this example, this value is set to 3.

• scans in the Biller Biemann algorithm
  The parameter scans in the Biller Biemann algorithm defines the number
  of consecutive scans across which a peak can be defined. The near
  optimal value of this parameter for GC-Quad data, used in this example,
  is 2.

• threshold in peak filtering
  In peak filtering, the parameter threshold (t) defines the intensity
  above which a certain number of ions must be represented for a peak to
  be retained rather than filtered out. The near optimal value of this
  parameter for GC-Quad data, used in this example, is 10000.

A.2 Processing GC-TOF data

[ This example is in pyms-test/A2 ]

A.2.1 Instrument

• Maker: Leco Pegasus
• Software: ChromaTOF

A.2.2 Raw Data

• File: MM-10.0 1 no processing.cdf
• Location: PyMS/data/
• Nature: Metabolite mix
• Processing: This is the raw cdf file generated by the instrument; no
  further processing was done using ChromaTOF.
A.2.3 Processing using PyMS

PyMS processes this data through the following steps: binning the data,
smoothing the data, removing the baseline, deconvoluting peaks, filtering
peaks, setting the mass range, removing uninformative ions and estimating
peak areas. GC-TOF data contains nearly 10 scans per second. As a result,
the values of the following parameters in PyMS are specific to this
dataset.

• points in the Biller Biemann algorithm
  The GC-TOF produces more scans per second, and as a result the data is
  denser than the GC-Quad data. The peaks therefore form over a larger
  number of scans, resulting in a higher value for the window width. In
  this example, this value is set to 15.

• scans in the Biller Biemann algorithm
  Due to the dense nature of the data, a peak can hold its maximum value
  over more than one scan. In this example, this value is set to 3.

• threshold in peak filtering
  The near optimal value of this parameter for GC-TOF data, used in this
  example, is 4000.
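For quick reference, the parameter values quoted above for the two instruments can be collected in one place, together with a toy illustration of the role of the points parameter. The dictionary values come from the text; the local_maxima() function is a simplified sketch of a windowed local-maximum test, not the real BillerBiemann() function, which works across all ion channels at once:

```python
# Per-instrument parameter choices quoted in Appendix A
# (a hypothetical configuration table, not a PyMS data structure):
BILLER_BIEMANN_PARAMS = {
    "GC-Quad": {"points": 3,  "scans": 2, "threshold": 10000},
    "GC-TOF":  {"points": 15, "scans": 3, "threshold": 4000},
}

def local_maxima(ic, points):
    """Sketch of what 'points' controls: a scan is a candidate apex only
    if it is the maximum of a window of 'points' scans centred on it.
    Simplified, single-ion illustration only."""
    half = points // 2
    apexes = []
    for i in range(len(ic)):
        window = ic[max(0, i - half):i + half + 1]
        if ic[i] > 0 and ic[i] == max(window):
            apexes.append(i)
    return apexes

ic = [0, 2, 7, 3, 1, 0, 4, 9, 5, 1]
print(local_maxima(ic, BILLER_BIEMANN_PARAMS["GC-Quad"]["points"]))  # [2, 7]
```

The sketch makes the instrument difference tangible: the sparse GC-Quad data uses a narrow 3-scan window, while the dense GC-TOF data needs a 15-scan window so that a single broad peak is not reported as several apexes.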