Guide
User Manual: Pdf
Open the PDF directly: View PDF .
Page Count: 10
Toy Model of pH Subtraction for FSCV Data
Vincent Toups
Nov 1, 2010
Abstract
This is a quick document describing the use of some tools for chemometrics
in matlab.
0.1 Front Matter
This is a quick document describing the use of some tools for chemometrics
in matlab. It should come with a set of library files in this directory under
matlab. These library files should be loaded before running the examples. If
you start Matlab in this directory, the file startup.m should execute automati-
cally on startup. If it doesn’t, you can execute addpath(genpath(path-to-library-
directory)) to load them. This document should have come in a zip file, which,
when expanded, should create a directory. If you start Matlab in this directory,
everything should “just work.”
0.2 Generating CVs From Raw Data
The first step in analyzing data from a flow cell experiment is to generate a
single CV for each trial from the flow cell. This process involves separating
the scans which contain the sample from the background, estimating the back-
ground, removing it from the sample trials, and finally averaging some subset
of the sample scans to produce the background subtracted CV. Because each of
these steps involve some discretion, and because differences in each step produce
different CVs at the end of the process, it makes sense to automate the process
entirely.
Before the raw data can be analyzed by this library, it has to be exported
from Tar Heel CV as a color data file, with care taken to make sure there is no
background subtraction or averaging performed. This Matlab library will take
care of all that itself. In order to export a raw data file, you should load the data
into THCV, click Analyze Data, and set “#scans to average for background,”
“# of scans to average for CV,” and “# of points to average for I vs T” to zero.
Then click “Write out Color Plot.” See Figure 1.
In the simplest case, you can now load Matlab and execute a single command
to generate a CV.
f i l e n a m e = ’ path−to−raw−f i l e ’ ;
cv = autoCV ( f i l e n a m e ) ;
This just calls the function “autoCV” on the file. You can also load the file
into Matlab, and then call the function on the data itself:
f i l e n a m e = ’ path−to−raw−f i l e ’ ;
data = load (filename );
cv = autoCV ( d at a ) ;
However, this is apt to provide less informative error messages during batch
processing, because “autoCV” will not know which file has caused a possible
error. To process a batch of files located in a single directory, you can write
something like:
1

Figure 1:
f i l e n a m e s = ddirnames ( ’ /some−d i r e c t o r y / ’ ) ; % c a l l a Vincent l i b r a r y
% f u n c t i o n t o g e n er a t e a l i s t o f a l l f i l e n a m e s .
for f i =1: length(filenames )
f i l e n a m e=f i l e n a m e s {f i };
cv = autoCV ( d at a ) ; % g e n e r a t e t h e CV.
newName = strrep( fil e n a me , ’ . tx t ’ , ’−cv . t x t ’ ) ; % g e n e r a t e a new name
% to save t h e cv to .
save (newName , ’−a s c i i ’ , ’ cv ’ ) ; % sav e i t .
end
0.3 How it Works
“AutoCV” is a function which does a few things. The first task, in broad
terms, is to locate the scans in the data set corresponding to the presence of
the sample in the flow cell. This process is automated based on the rather
modest assumption that these scans will be different than the background scans
(this assumption fails when a blank is injected, but then there is no way to tell
the difference between sample and background). More formally speaking, each
scan from the data set is treated as a vector, and then the Euclidian Distances
between these vectors are used to find when the sample appears in the flow cell,
and when it leaves. It does this by trying to find the locations for the entrance
2
and exit time which maximize the distances between putative “sample” scans
and “background” scans.
Once we’ve found the division between background and sample trials, we
can estimate the background and remove it. We’d like to be conservative about
background estimation, so we pad the boundaries of the sample before finding
background trials to estimate.
“AutoCV” can perform background subtraction in a variety of ways. It
takes optional arguments which allow the user to customize how it performs
background subtraction. If you open the function definition file, you’ll see:
function [ cv , aux ] = autoCV ( data , v a r a r g i n )
% [ cv , aux ] = AUTOCV(DATA,VARARGIN) fi nd a CV in th e c o l o r p l o t in DATA
%
% This f u nc t io n at t e mp t s to c a l c u l a t e a CV f o r a f lo w c e l l da ta s e t .
% I t i s o l a t e s t he sample from t he background using d i f f e r e n c e s i n the
% CVs i n each column o f DATA and then c a l c u l a t e s t h e back ground
% trend and s u b t r a c t s i t .
% I t s up po r t s t he f o l l o w i n g o p t i o n a l arguments which modify t he d e f a u l t
% b e h a v i o r .
%
% OPTIONAL ARG |D e f a u l t Value
% marginPct = 2;
% A ft er t he sample CVs are d e l i n e a t e d from t he
% b u f f e r CVs , pad th e r e gi o n by t h i s p er c en t ag e b e f o r e
% c a l c u l a t i n g t h e background . Two he re c orresp on ds t o
% a 200% pad .
% f i l e n a m e = ’UNSPECIFIED ’ ;
% Use t h i s as t h e f i l e n a m e i f raw d at a was pa ss e d i n and
% t h e r e i s an e r r o r .
% f l a t t e n A r g s = {} ;
% Arguments t o be pa s sed to
% flattenBackg ro un dP ol y , i f i t i s used to s u b t r a c t th e
% background .
% trialSelectionMethod = ’peak ’;
% S p e c i f i e s t h e method used t o s e l e c t t h e t r i a l s use f o r t h e CV.
% ’ peak ’ means t h a t t h e maximum power CV i n t h e sa mp le r e gi on ,
% a f t e r ba ckgr ound s u b t r a c t , forms t he c e n t e r o f t h e CV r e gi o n .
% aroundPeak = 10;
% Number o f CVs around th e t r i a l S e l e c t e d f o r th e sample CV to
% av e rage to produce th e f i n a l sample CV.
% deFli c k e r D a t a = True ;
% S p e c i f i e s whether t o d e f l i c k e r th e data b e f or e a n a l y s i s .
% This removes ” pops and s naps ” i n t h e da ta b e f o r e any p r o c e s s i n g .
% Sh ould be mo st ly h arm le ss f o r d ata t h a t doesn ’ t
% have t h i s k i n d o f n o i s e .
3
% deFlickerArgs = {} ;
% Arguments t o pa ss t o th e d e f l i c k e r fu n c t i o n , i n a d di t io n to
% t h e d e f a u l t ones .
% partitionArgs = {};
% Arguments t o p as s t o t h e f u n c t i o n which c a l c u l a t e s t h e sample
% and background t r i a l s .
% biasTowardsPre = 0;
% E nable s b i a s i n g t owar ds e a r l y CVs in the dat a s e t as
% b ackg rou nd CVs .
% useSimpleMean = 0;
% D i sa b le po ly no mial backgr ound s u b t r a c t i o n and sim pl y u se
% a mean o f t he background t r i a l s as th e back ground to s u b t r a c t .
% saveBackground = 0;
% Turn t h i s on t o sa ve t h e background CVs , which you might
% want t o examine t o d e t e c t l o n g term t r e n d s i n t h e d ata .
% b ac kg r ou nd D ir e ct or y = ’ . / au toBack groun ds / ’
% I f you e n a b l e b ack gro und CV s av in g , t h i s i s w here t h e y go .
% au t o SignF i x = True ;
% Attempt to a u t o m a t i c a l l y f l i p d ata t h a t has been i n v e r t e d
% as a c on se que nc e o f u s i n g UEI mode v s non UEI mode .
This shows the degree to which AutoCV can be customized. I’ve tried to set
up defaults which work for most CVs. Optional arguments are passed in like
standard Matlab functions, so if you want to enable simple mean background
subtraction but disable deflickering you would call autoCV like this:
cv = autoCV ( ’ a−c o l o r −plot−f i l e . tx t ’ , ’ useSimpleMean ’ , True , . . .
’ d e F l i ckerD a t a ’ , F a l s e ) ;
AutoCV is pretty smart - I’d estimate it gets the CV right more than 80
percent of the time, but you should always double check the output before
proceeding with further analysis steps.
0.4 In Vivo Data
Automatically extracting CVs from flow cell data is substantially easier than
similar analysis of in-vivo data. The critical step in extracting a sample CV is
finding the CVs in a recording which correspond to the sample and the ones
which correspond to the background. In the flow cell, we know that the trials
start with background CVs, at some point the sample enters the flow cell, after
which we have sample CVs until the sample leaves the cell, at which point
buffer should resume flowing. The job of selecting the sample CVs simplifies
into selecting just two points: onset and offset. This is a problem domain which
can be exhaustively searched (although that is not how autoCV works, in most
cases). In Vivo, particularly for transients, there are no guarantees of single
chemical events or rapid onsets or offsets. As a consequence, I’ve developed
4
alternative tools to deal with this case. Rather than trying to find a single CV
in an in vivo set, it makes sense to find multiple candidates, and return a few
of them for subsequent analysis.
At the moment, in vivo automatic extraction is more of an aid in finding CVs
rather than a bulletproof method. You can extract CVs from in vivo color plots
(exported without background subtraction) with the function “autoCVInVivo”.
As most of the lab is doing in vitro experiments, and this is alpha-level tech,
that is probably all the useful documentation for this code at this time.
0.5 Performing PCR
Above you learned about extracting CVs from in vitro (or in vivo) data auto-
matically. Usually the next step in our analysis pipeline is to calibrate a linear
model which is able to take a CV and predict a concentration. This can be
approached in several ways.
For single analyte runs, I’ve found its almost always possible to simply pick
the peak associated with the analyte that has the highest variability with respect
to concentration and perform a regression between concentration and just those
points. Flow cell data tends to be of sufficiently low variability that there is
more than enough information in a single voltage point to get very high r2
values.
0.5.1 A note on file naming conventions
People in the lab may have noticed I tend to use crazy long file names. This
is for a great reason: it makes writing scripts to manipulate and analyze data
much, much easier. Typically, my filenames look something like this:
bu f fer Ph=740=samplePh=000748=sampleHPO=000160=sampleDA=001000= co u r s e=001= t r i a l =005=newRank=4=newRank=4. txt
bu f fer Ph=740=samplePh=000748=sampleHPO=000160=sampleDA=001000= co u r s e=001= t r i a l =004=newRank=1=newRank=1. txt
bu f fer Ph=740=samplePh=000748=sampleHPO=000160=sampleDA=001000= co u r s e=001= t r i a l =003=newRank=2=newRank=2. txt
bu f fer Ph=740=samplePh=000748=sampleHPO=000160=sampleDA=001000= co u r s e=001= t r i a l =002=newRank=3=newRank=3. txt
bu f fer Ph=740=samplePh=000748=sampleHPO=000160=sampleDA=001000= co u r s e=001= t r i a l =001=newRank=5=newRank=5. txt
bu f fer Ph=740=samplePh=000748=sampleHPO=000160=sampleDA=001000= co u r s e=001= t r i a l =000=newRank=6=newRank=6. txt
bu f fer Ph=740=samplePh=000748=sampleHPO=000160=sampleDA=000500= co u r s e=001= t r i a l =005=newRank=3=newRank=3. txt
bu f fer Ph=740=samplePh=000748=sampleHPO=000160=sampleDA=000500= co u r s e=001= t r i a l =004=newRank=4=newRank=4. txt
bu f fer Ph=740=samplePh=000748=sampleHPO=000160=sampleDA=000500= co u r s e=001= t r i a l =003=newRank=2=newRank=2. txt
bu f fer Ph=740=samplePh=000748=sampleHPO=000160=sampleDA=000500= co u r s e=001= t r i a l =002=newRank=1=newRank=1. txt
The critical thing to notice here is that these file names tell you everything
you want to know about each file in a simple to parse format. This library in-
cludes code to transform a file name in the format above into a matlab structure
which can be used programmatically.
>> d s f ( ’ cvs / t r i a l =000005=sampleHpo=000100=sampleDa=000125=samplePh=000748=newRank=4. tx t ’ )
5
d s f ( ’ c vs / t r i a l =000005=sampleHpo=000100=sampleDa=000125=samplePh=000748=newRank=4. tx t ’ )
ans =
trial : 5
sampleHpo : 100
sampleDa : 125
samplePh : 748
newRank : 4
“dsf” stands for “destructure file”. It is a very simple function which simply
splits a file name (after discarding the directory) into a pieces at the “=” signs,
and then converts those pieces into name/value pairs, finally loading those into
a structure, which it returns.
Suppose you wanted to get all the files in a directory which corresponding
to trials with only changes in pH. You could write a script like this in Matlab,
assuming you files had the right names.
f i l e n a m e s = ddirnames ( ’ . / cv s / ’ ) ; % g e t a l l t he f i l e s i n a d i r e c t o r y
% c a l l e d ’ cvs ’
ph Onl yF ile s = { } ;% i n i t i a l i z e an empty l i s t f o r h o l d i n g t h e pH f i l e n a m e s
for f i =1: length(filenames ) % l o o p ov er a l l t h e f i l e s
f i l e n a m e=f i l e n a m e s {f i };
f i l e I n f o = d s f ( f i l e n a m e ) ;
i f f i l e I n f o . sampleDa == 0 && f i l e I n f o . sampleHpo == 0
ph Onl yF ile s {end+1}= filename ;
end
end
It is extremely common to take subsets of your data like this, and so the
library has a lot of code to make these kinds of operations easy to write. For
instance, I would code the above like this:
qu ery Fu nct ion s ; % l oa d a s e t o f f u n c t i o n s t o h e l p o r g a n i z e f i l e n a m e s .
phOnlyTest = fAnd ( mkMatcher ( ’ sampleDa ’ , 0 ) , . . .
mkMatcher ( ’ sampleHpo ’ , 0 ) ) ;
% mkMatcher means ”make matcher ” . I t r e t u r n s a f u n c t i o n which a c c e p t s
% a f i l e name and which r e t ur n s t r u e o nly when t h e f i e l d s p e c i f i e d
% i s t h e v al ue s p e c i f i e d
% fAnd t a k e s two f u n c t i o n s and r e t u r n s a new f u n c t i o n which i s t r u e
% o n l y when b o t h o f t h e i n p u t f u n c t i o n s a re t r u e on t h e i n p u t s t o t h e
% new f u n c t i o n .
% So t h e above code read s as :
% c r e a t e a f u n c t i o n c a l l e d pHOnlyTest w hich a c c e pt s a f i le n a me as
% i n p u t and o n ly r e t u r n s t r u e when t h a t f i l e has z e ro f o r sampleDa
6
% and z er o f o r sampleHpo .
ph Onl yF ile s = f i l t ( phOnlyTest , ddirnames ( ’ . / cvs / ’ ) ) ;
% t h i s l i n e o f code means”
% f i l t e r t h e l i s t o f f i l e s r e t u r n e d by ddirnames ( ’ . / cv s / ’ )
% r e t u r n o n ly t h o s e f i l e s w hich p a ss ” phOnlyT es t ”
Without comments, the above is substantially shorter than the other version.
Less code means fewer opportunities for mistakes!
Anyway, you can use this library with any files you want. Chemometrics
and CV extraction don’t depend on specific filename conventions. But if you
want to do data analysis in Matlab more easily, choosing a good convention for
filenames is pretty helpful.
0.6 Performing Standard PCR
Performing PCR is simple once you’ve examined the above examples. Here is a
complete listing for a PCR in one of my directories:
a l l F i l e s = ddirnames ( [ ’ . / cvs / ’ ] ) ; % g e t t he cv names
n o t O u t l i e r = fAnd (@( x ) d s f ( x , [ ] , ’ newRank ’ , 0 ) <4 , . . .
@( x ) d s f ( x , [ ] , ’ b la c k ’ , 0) == 0); % remove o u t l i e r s
% and b l a c k l i s t e d data
t r a i n i n g F i l e s = unique ( f i l t ( not O u t l i e r , [ f i l t ( daTraining , a l l F i l e s ) ;
f i l t ( hpoTraining , a l l F i l e s ) ;
f i l t ( phTraining , a l l F i l e s ) ] ) ) ; % select
% t r a i n i n g da ta
t e s t i n g F i l e s = unique ( f i l t ( n o t O u t l i e r , f i l t ( t es t in g , a l l F i l e s ) ) ) ;
% s e l e c t t e s t i n g da ta
clear p c r S t r u c t ; % make sure t he s t r u c t u r e c on t ai n in g pcr da ta i s c l e a r
for t i =1: length(trainingFiles)
t r a i n i n g F i l e=t r a i n i n g F i l e s {t i };
props = ds f ( t r a i n i n g F i l e ) ;
p c r S t r u c t ( t i ) . da ta = load ( t r a i n i n g F i l e ) ;
p cr St ru c t ( t i ) . da = ds f ( t r a i n i n g F i l e , [ ] , ’ sampleDA ’ ) ;
p cr St ru c t ( t i ) . hpo = ds f ( t r a i n i n g F i l e , [ ] , ’ sampleHPO ’ ) ;
p cr St ru c t ( t i ) . ph = d s f ( t r a i n i n g F i l e , [ ] , ’ samplePh ’ ) ;
end
% t he above f i l l s t he pcr s t r u c t .
t e s t i n g D a t a = l o a d F i l e s ( t e s t i n g F i l e s ) ;
% l o a d t h e data f or t e s t i n g .
[ models , aux ] = doPCR( pcrStr u c t , te s t i ngDa t a , ccm ) ;
% perform PCR
7
The function doPCR returns a struct of models for each of the analytes
indicated in the pcrStruct (in our example, da, hpo and ph). Models.da, for in-
stance, contains the predicted and actual Dopamine concentrations in an array.
The order corresponds to the training and then the testing data passed in, in
their orders.
You can plot the results for the training data like so.
das = map( dag , a l l F i l e s ) ; % dag i s a f u n c t i o n d e f i n e d i n
% qu er yFun ct ions which g e t s t he r ec or ded
% dopamine c o n c e n t r a t i o n from a f i l e
% name . Map a p p l i e s a f u n c t i o n t o a l i s t
% o f t h i n g s and r e t u r n s a l i s t o f t h e
% r e s u l t s . So t h e above l i n e r e t ur ns a
% l i s t o f a l l t h e p re pa red dopamine c o n c e n t r a t i o n s .
nTr = 1 : length( t r a i n i n g F i l e s ) ;
plotWithRedundantXs ( das ( nTr ) , models . da . p r e d i c t e d ( nTr ) ) ;
% plotWithRedundantXs makes a s c a t t e r p l o t o f t h e d at a w here Xs may b e r e p e a t e d .
Further documentation is forthcoming, but an intrepid analyst could start
from these examples and go!
8