Libagf Guide
User Manual:
Open the PDF directly: View PDF .
Page Count: 27
Download | |
Open PDF In Browser | View PDF |
User Guide Peter Mills peteymills@hotmail.com October 5, 2018 1 CONTENTS 1 INSTALLATION 3 2 FILE FORMATS 4 3 COMMAND PARAMETERS 8 4 COMMAND LINE EXECUTABLES 4.1 Direct routines . . . . . . . . . . . . 4.2 Binary classification . . . . . . . . . . 4.3 Multi-class classification . . . . . . . 4.4 Testing/validation . . . . . . . . . . . 4.5 Clustering . . . . . . . . . . . . . . . 4.6 Experimental . . . . . . . . . . . . . 4.7 Pre-/post-processing . . . . . . . . . 4.8 File conversion . . . . . . . . . . . . 4.9 Commands used by other commands 4.10 Deprecated . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 ERROR CODES AND DIAGNOSTICS 12 12 12 13 13 13 14 14 15 15 15 16 6 MULTI-BORDERS CLASSIFICATION 18 6.1 Multi-borders classification the easy way . . . . . . . . . . . . 18 6.2 Multi-borders classification the hard way . . . . . . . . . . . . 19 6.3 Multi-borders with external binary classifiers . . . . . . . . . . 21 7 EXAMPLES 7.1 examples/class borders . . . . . . . . . . . . . . . . . . . . . . 7.2 examples/humidity data . . . . . . . . . . . . . . . . . . . . . 7.3 examples/Landsat . . . . . . . . . . . . . . . . . . . . . . . . . 26 26 26 27 BIBLIOGRAPHY 27 2 1 INSTALLATION First, make sure you have the required dependencies. You will need the GNU Scientific Library (GSL). If it is not already installed on your system, download and install it. Take note of the location of the compiled libraries and include files. You will also need to install libpetey, another, much smaller numerical library–this now comes bundled with libAGF. Download the source from Github (https://github/peteysoft/libmsci) or your favourite mirror. Unzip and unpack the tarball to the desired directory. Modify the makefile with the appropriate directories, compiler names and compiler flags. Type make from inside the libagf directory to perform the build and make install to copy the executables, libraries and include files to the appropriate directories. A configure script is not supplied: please refer to the Peteysoft coding standards coding_standards.txt in the base of the installation directory (https://github.com/peteysoft/libmsci/blob/master/doc/coding_standards. txt). Make sure the following macros have been set properly in the make file: LIB_PATH = location of your compiled libraries BIN_PATH = location of binary executables INCLUDE_PATH = location of your include files GSL_LIB = GSL_INCLUDE = location of the GSL libraries (if different from above) location of the GSL include files (if different from above) CC = name of your C++ compiler There are two ways to change the optimisation of the binaries. If you would like multiple versions with different levels of optimisation, be sure to change the OPT VER macro. This will append the optimisation flag to all the object files, libraries and executables. If you only want one version, then modify the CFLAGS macro in the usual manner. To test the resulting executables, type, make test. To see the resulting comparison summary, it may be necessary to type it twice. Overall accuracies of all the classification algorithms should be around 90 % and the uncertainty coefficients should be over 0.5, while the confidence ratings should roughly match the accuracies. When comparing one algorithm to another, accuracies should be around 99% and uncertainty coefficients around 0.92, while the confidence ratings should be 1. or close to it, regardless of accuracy. Prob3 ability density comparisons should be around r = {0.98, 0.99, 0.97, 0.99} for KNN and AGF for the first class and KNN and AGF for the second class, respectively. 2 FILE FORMATS The programs accept three types of files with very simple, binary formats. Although portability is sometimes an issue, the binary formats are easy to generate and allow for rapid access and compact storage. For vector data, use the extension, .vec. Vector files have a four byte header (by default, see next paragraph) that indicates the dimensionality of the data. After that, each vector is stored one after the other as arrays of four byte floats. To write such a file in C++, you can use the following commands: #include#include int main() { FILE *fs; float ** data; //array of vectors int32_t n; //number of vectors int32_t D; //number of dimensions n=100; //(e.g.) D=2; data=new float * [n]; for (int32_t i=0; i #include int main() { FILE *fs; int32_t *data; int32_t n; n=100; //(e.g) data=new long[n]; //... commands to fill array ... fs=fopen("data.cls", "w"); fwrite(data, sizeof(int32_t), n, fs); fclose(fs); delete [] data; } In IDL: data=lonarr(100) ;... openw, 1, "data.cls" writue, 1, data close, 1 In Fortran: parameter(n=100) integer data(n) ... open(10, file="data.cls", form="binary") write(10) (data(i), i=1,n) 6 close(10) Note that the class indices must go from 0 to Nc − 1, where Nc is the number of classes. The classification routines will output two binary files containing scalar data: one with the extension, .cls containing class estimates, and one with the extension .con containing “confidence ratings” which are simply conditional probabilities re-scaled in the following manner: c= Nc P (c|~x) − 1 Nc − 1 (1) where P (c|~x) is the conditional probability of the “winning” class. A confidence rating of one indicates theoretically perfect knowledge of the true class, while a value of zero indicates that the result is little better than chance. If conditional or joint probabilities are needed for all the classes, they are written to standard out. To read the output files, examine the routines read_clsfile for reading class data, read_vecfile for reading vector data and read_datfile for reading scalar floating point data, contained in the source file, agf_util.cc, or you can use them directly. Note that vector data is allocated in one contiguous block, meaning that it can also be read and written in one contiguous block. It can also be deleted very simply using two commands: float ** data; nel_ta n; dim_ta D; data=read_vecfile(data, n, D); delete [] data[0]; delete [] data; In addition, the sparse satellite library contains a utility called sparse_calc that can operate on both types of binary files containing floating point data (but not class data). To read in and print a file of vector data called, foo.vec type the following commands: $>sparse_calc %s%c>print full(foo.vec) 7 To read in and print a file of scalar data called, bar.dat, type the following: %s%c>print vector(bar.dat) Some users may prefer to use ASCII file formats to store their data. If this is the case, there are a number of file conversion utilities included for comparison with other algorithms. See Section 4, COMMAND LINE EXECUTABLES. Sometimes it’s desireable to normalize the data before performing classifications. Normalization in libagf has now been generalized to linear transformation and includes singular value decomposition as well as feature selection. The command, agf_precondition, is used to perform linear transformations on test and training (features) data and store the resulting transformation matrix in the same binary format for vector data as described above. The default name is the same as before with the extension .std appended to the output base file name. Features data can still be transformed directly within the machine learning routines using the command options -n, -S and -a: see Section 3, COMMAND PARAMETERS. By default, pre-trained model data is stored in the normalized coordinates; to store it in the non-normalized coordinates, use the command switch, -u. 3 COMMAND PARAMETERS Most of the commands have the following syntax: command [options] model [test_data] output where model is the base name of the files containing either a set of training data or a pre-trained model, test_data is the file containing test data and output is the base name of the files in which to store the final estimates. To get a summary of the syntax and parameters, simply type the command name with no arguments, e.g.: > classify_b Syntax: classify_b [-n] [-u] [-a normfile] border test output where: 8 border test output files containing class borders: .brd for vectors .bgd for gradients .std for variable normalisations (unless -a specified) file containing vector data to be classified files containing the results of the classification: .cls for classes, .con for confidence ratings options: -n option to normalise the data -u normalize borders data (stored in un-normalized coords) -a normfile file containing normalization data Here is a list of all the options, although not all commands support all options: 9 Option – -+ -ˆ -0 -1 -a -A -b -B -c -C -d -D -e -E -f -F -g -h Function External command to train a binary classifier. Extra options to pass to binary training command. External command to convert data formats for above. Read from stdin. Write to stdout. Name of file containing normalization data. Use ASCII files instead of binary. Short output (cls_comp_stats) Sort by class/ordinate. Algorithm selection: nfold cross-validation: 0=AGF borders (default) classification 1=AGF classification 2=KNN classification 3=AGF interpolation 4=KNN interpolation 5=AGF borders multi-class PDF validation: 6=AGF PDF estimation 7=KNN PDF estimation No class data. Number of divisions in cross-validation scheme. Generate ROC curve by varying the discrimination border. Return error estimates. Value for missing data. For validation schemes: fraction of training data to use for testing. Select features. Use lograthmic progression. Relative difference for calculating numerical derivatives. 10 -H -i -I -j -k -K -l -L -m -M -n -N -o -O -p -P -q -Q Omit header data. Class borders: maximum number of iterations in supernewton. Weights calculation: maximum number of iterations in supernewton. Print joint instead of conditional probabilities to stdout. Number of k-nearest-neighbours to use in the estimate Keep temporary files (do not delete on exit). Tolerance of W. Floating point ordinates. Type of metric to use. Only works for knn classification. Get min. and max./LIBSVM format. Takes averages and standard deviations of the training data and normalizes both the training and test data. Maximum number of iterations in supernewton. Name of log file. External command for binary classification prediction. Threshold density in clustering algorithm. Calculate correlation/covariance matrix. Number of trials/divisions. Algorithm selection: optimal AGF: how to calculate the filter variances 0=halving max filter variance (default); 1=filter variance min and max; 2=total weight min and max non-hierarchical multi-class classification: 0=constrained inverse (not always optimal) 1=linear least squares, no constraints or re-normalization 2=voting from pdf, no re-normalization 3=voting from class label, no re-normalization 4=voting from pdf overrides least squares, conditional probabilities are adjusted and re-normalized 5=voting from class overrides least squares, conditional probabilities are adjusted and re-normalized 6=experimental 7=constrained inverse (may be extremely inefficient) 11 -r -R -s -S -t -T -u -U -v -V -w -W -x -X -z -Z 4 Class borders: value of conditional prob. at discrimination border. For random sampling. Number of times to sample the class border. Singular value decomposition: number of singular values to keep. Desired tolerance when searching for class borders. Class threshold for class-borders calculation. Store borders data in non-normalized coordinates. Re-label classes to go from [0, nc). First filter variance bracket. Second filter variance bracket/initial filter variance. Lower bound for W in AGF optimal/constraint weight Parameter Wc–equivalent to k in a k-nearest-neighbours scheme. See paper describing the theory. Run in background. Ratio between sizes of sample classes (P (2)/P (1)). Randomize data. In-house LIBSVM codes. COMMAND LINE EXECUTABLES This is a list of commands and their function: 4.1 Direct routines Direct, non-parametric classification, interpolation/regression, pdf estimation using variable-bandwidth kernel estimation: agf Uses adaptive Gaussian filtering knn Uses k-nearest neighbours All the direct kernel estimation operations have been collected into two executables: one for AGF called, agf, and one for k-nearest-neighbours called, knn. Both of them take an extra argument which specifies which operation to perform: classify, interpol or pdf. 4.2 Binary classification Two-class classification by training a model: class_borders Searches for the class borders using AGF. classify_b Performs classifications using a set of border samples. 12 4.3 Multi-class classification Multi-class classification using a series of class-borders: multi_borders Trains a multi-class model based on a control file. classify_m: Performs classifications using model output from multi_borders. svm_accelerate: Trains a multi-borders model from a native LIBSVM model. Multi-borders classification represents a significant step forward with this library. The operation of these programs, called from the command line using multi_borders and classify_m is quite involved and is described in detail in Section, 6, MULTI-BORDERS CLASSIFICATION. 4.4 Testing/validation Calculates the accuracy of classification results. Performs n-fold cross-validation for the three different classification algorithms or for the interpolation routines. agf_preprocess: Splits a dataset for validation schemes. pdf_sim Generates a synthetic dataset with the same approximate PDF as the training data. validate_pdf.sh Validates probability density function (PDF) estimates. agf_correlate Correlates two binary files containing scalar floating point data. roc_curve Computes the receiver operatoring characteristic (ROC) using three different methods. cls_comp_stats nfold 4.5 Clustering Uses k-nearest neighbours and a threshold density to perform a clustering analysis. See description, below. browse_cluster_tree Create a dendrogram and manually browse through it and assign classes. The first clustering algorithm works by calculating the density of each training samples. If the density is larger than a certain threshold, a class number is assigned and the program recursively calculates the densities of all the samples in the vicinity, based on its k neighbours, and assigns the same class number to each if they also exceed the threshold. Samples lower than the threshold are assigned the class label of 0, while clusters are assigned consecutive values starting at 1. This is far simpler than a dendrogram, but cluster_knn 13 somewhat less general although the final result should be similar. But, we’ve added a dendrogram anyway. 4.6 Experimental Continuum extension : Programs that extend the classification algorithms to work with continuous data: c_borders.sh Trains a continuum classification model. classify_c Returns coninuum extension estimates. Discrete Bayesian modelling : programs that model the probability distributions through binning. sort_discrete_vectors Sorts a set of discrete vectors. search_discrete_vectors Searches within a set of discrete vectors and estimates the probability density. 4.7 Pre-/post-processing agf_procondition Performs linear transformations on coordinate (feature) data and outputs the transformation matrix. Normalization, singular-value-decomposition (SVD) and feature selection are supported. agf_preprocess Processes both coordinate data and class data. Supports class selection, re-labelling and partitioning, file splitting for validation and cross-correlation calculation. 14 4.8 File conversion Converts the LVQPAK (ASCII) file format to the binary format accepted by libagf. Many users may find these ASCII formats easier to work with. svm2agf LIBSVM format to libAGF. svmout2agf Converts (ASCII) output from LIBSVM to binary libAGF format. agf2ascii Converts binary libAGF format to ASCII (LVQPAK or LIBSVM) format. C2R Converts a two-class classification to a difference in conditional probabilites. The equation is: R = (2c − 1)C where c is the class and C is the confidence rating. mbh2mbm: Collates a multi-file multi-borders model into a single ASCII file. float_to_class: Converts floating point data into class data by binning it into discrete ranges. class_to_float: Converts class data into floating point. The LVQPAK file format is probably the easiest to work with. There is one header with the number of dimensions, followed by a listing of the vectors, one vector per line, each column is a dimension except for the last one which is the class. The sample_class program prints its results to standard out in an LVQPAK-compatible format–see Section 7, EXAMPLES. lvq2agf 4.9 Commands used by other commands Note that some commands call other commands, therefore these latter commands must be in your path. agf_precondition Called by all the machine learning programs if one or more of the -a, -n or -S switches are used pdf_agf, pdf_knn still used by validate_pdf.sh 4.10 Deprecated classify_a, classify_knn, int_agf, int_knn, test_classify, test_classify_b, test_classify_knn To compile and install older routines, type, make old. 15 5 ERROR CODES AND DIAGNOSTICS All commands will return “0” upon successful completion or one of the following error codes: Code Meaning 1 Wrong or unsufficient number of arguments. 101 Unable to open file for reading. 111 File read error. 256 File read warning. 303 Allocation failure. 201 Unable to open file for writing. 211 File write error. 512 File write warning. 401 Dimension mismatch. 411 Sample count mismatch. 501 Numerical error. 511 Syntax error. 503 Maximum number of iterations exceeded. 768 Parameter out of range. 1024 Command option parse error. 21 Fatal command option parse error. 901 Internal error. 911 Other error. 1280 Other warning. Note that warnings are always divisible by 256 so that when passed back to the command line return a 0 exit status so as not to interupt running scripts. Error codes are also listed in the include file, error_codes.h in the libpetey distribution. Note that non-fatal errors reduce to ’0’ if passed to the command line. AGF routines also return a set of diagnostics, for example: diagnostic parameter min iterations in supernewton:** tolerance of samples: max 4 0 ** total number of iterations: number of bisection steps: 1178 276 16 52 3.53e-05 average 11.8 2.59e-06 number of convergence failures: 0 diagnostic parameter min iterations in agf_calc_w: value of f: value of W: 2 2.6e-13 20.00 max 6 0.284 20.00 average 3.73 0.00146 20.00 total number of calls: 1430 The first parameter indicates the number of iterations required to reach the correct value for the total of the weights, W . For efficiency reasons, this should be as small as possible, however the super-linear convergence of the root-finding algorithm (“supernewton”) means that the brackets can be quite far away without much effect on efficiency. To change the values of the filter variance used to bracket the root, use -v for the lower bracket and -V for the upper bracket. Normally the defaults should work just fine but to decrease the number of iterations, they should be narrowed. If they fail to bracket the root, the offending bracket is pushed outward. These changes are “sticky”, so the brackets can, in fact, be set arbritrarily narrow and even quite far from the root. To change the tolerance of W , use the -l switch. To change the maximum number of iterations, use the -i or -I switch. The second parameter is the number of iterations required to converge to the class border, excluding convergence failures. This is only applicable when searching for the class borders. To change the maximum number of iterations (default is 100) in the root-finding algorithm, use the -i or -N parameter. The third parameter is the value of the minimum filter weight divided by the maximum and is only applicable when the -k option (selecting a number of nearest neighbours) has been set. Ideally, it should be as small as possible, although values as large as 0.2 often produce reasonable results. Be sure to experiment with your own particular problem. To decrease it, increase the number of nearest neighbours used in the calculations which has the undesirable side effect of increasing computation time. In general, k should be a fair bit larger than W . 17 The tolerance of the samples only applies when searching for class borders. It shows how close the value of R = P (2|~x)−P (1|~x) (difference in conditional probabilities) is to zero at each border sample. This is set using the -t option and the diagnostic should be close to the set value. It is sometimes larger because the convergence test of the root-finding routine looks at tolerance along the independent variable as well as the dependent variable–whichever is better. To best understand how to interpret the diagnostics and use these to set the operational parameters, be sure to read the paper entitled, “Adaptive Guassian Filters: a powerful new method for supervised learning”, in the doc/ sub-directory of this installation or Mills (2011) which is a more fleshedout version of it. 6 MULTI-BORDERS CLASSIFICATION Multi-borders is an algorithm that generalizes the AGF-borders binary classification algorithm to multiple classes. It is a means of specifying the configuration of the class borders using a recursive control language. The multi_borders command is used for training a multi-borders model and has the following syntax: multi_borders [options] control-in train model control-out where: • control-in is the input or training control file; • train is the base-name of the binary files containing the training data • model is the base-name for the files that will contain each of the binary classification models and • control-out is the output or classification control file which is passed to the classification program, classify_m. 6.1 Multi-borders classification the easy way To train a four-class one-versus-one multi-class class borders classification model, use the print_control command to generate one of several standard multi-class models and pass it directly to the multi_borders command: 18 $>print_control -Q 6 4 | multi_borders foo bar foobar.txt The training data is contained in foo; the trained binary classification models are output in a series of files (six in this case) starting with bar. The output control file, foobar.txt, tells the prediction program, classify_m how the binary classifiers are configured: $>classify_m foobar.txt test.vec results Where test.vec contains the test points while the results are stored in result.cls and result.con. The -Q options determines the multi-class model used: type print_control with no arguments to get a list of supported models. 6.2 Multi-borders classification the hard way Consider the following control file (test_multi5.txt): "-s 100" { "." {0 1} "" 0 1 / 2; "-s 75" 0 / 1 2; { "-s 50" {2 3} 4 5 } } When passed to multi_borders as follows: $>multi_borders -n -s 125 test_multi5.txt foo bar foobar.txt runs the following set of statements: agf_precondition -a bar.std -n foo.vec bar.139397543805655.vec cp foo.cls bar.139397543805655.cls class_borders -s 100 bar.139397543805655 bar.00 0 / 1 class_borders -s 50 bar.139397543805655 bar.01.00 2 / 3 class_borders -s 125 bar.139397543805655 bar.01-00 2 3 4 / 5 class_borders -s 75 bar.139397543805655 bar.01-01 2 3 / 4 5 class_borders -s 100 bar.139397543805655 bar 0 1 / 2 3 4 5 rm bar.139397543805655.vec rm bar.139397543805655.cls 19 Initial versions printed these commands to standard out with running them the job of the user. Now they are executed directly with the option of running them in the background using the -x switch. It is still possible to print out the commands only using the -0 or -K switch. The contents of foobar.txt are as follows: bar { bar.00 { 0 1 } bar.01-00 0 1 / 2; bar.01-01 0 / 1 2; { bar.01.00 { 2 3 } 4 5 } } Note that each of the names in this file correspond to pairs of files generated by the commands in the previous script. To classify the data in a file named test.vec use the following command: $>classify_m -n foobar.txt test.vec result There are two types of multi-borders classification: hierarchical and nonhierarchical. In non-hieararchical multi-borders classification, all the classes are partitioned in multiple ways using a binary classifier and the equations relating the conditional probabilities of each class to those of the binary classifiers are solved returning all of the conditional probabilities. In the hierarchical method (also called a decision tree), the classes are partitioned using either a binary classifier or a non-hierarchical multi-classifier, then each of those partions are partitioned again and so on until there is only one class left in each of the partions. Hierarchical classification returns only the conditional probability of the winning class. The classify_m method 20 automatical detects whether a control file uses the hierarchical method or only the non-hierarchical method(that is it has only one level) and prints out conditional probabilities as appropriate. The parameters for training each binary classifier are contained in double quotes in the training control file. To take parameters from the command line, use the null string ("") while a period (".") tells the program to use the last set of parameters at the same level or below. A series of statements for training each of the binary classifiers are generated and executed using the “system” command. Once the commands are run, these binary classifiers will be stored in a pair of uniquely named files, the base-names of which replace, in the final control file, the quoted parameter lists from the training control file. In the control language, going up a level in the hierarchical scheme is denoted by a left curly brace ({) while going down is denoted by a right curly brace (}). In a non-hierarchical model, we specify the parameters (file name) of each of the binary partitions followed by the partitions themselves: two lists of classes separated by a forward slash (/). Class labels for nonhierarchical partitions are relative, that is they go from 0 to the number of classes in the non-hierarchical model less one. Class labels in the top level partitions are absolute, therefore must be unique. These should also go from 0 to one less than the total number of classes in the over-all model, but need not. In this example, there are six classes. They are first partioned into a group of two and a group of four. The group of four is partitioned into three parts, the first of which is partitioned into two classes and the second and third of which are single classes. The print_control command can be used to generate common control files. A good way to understand the “multi-borders” paradigm is to look at the example cases in the examples/humidity data directory. There is also a draft paper contained in the docs/ sub-directory. 6.3 Multi-borders with external binary classifiers The multi-borders routines can now interface with external binary classification software, specifically, either LIBSVM, or a pair of programs that have the same calling conventions. For training, use the -- switch to pass the command name to multi_borders. Consider the above control file as an example, but without any of the control switches: 21 "" { "" "" "" { {0 1} 0 1 / 2; 0 / 1 2; "" {2 3} 4 5 } } We pass the LIBSVM command, svm-train, to multi_borders, as follows: $>multi_borders -M -- "svm-train -b 1" -+ "-h 0 -c 25" \ test_multi.txt foo.svm bar foobar.txt which will execute the following statements: agf_preprocess -A -M foo.svm bar.00.2367545368.tmp 0 / 1 svm-train -b 1 -h 0 -c 25 bar.00.2367545368.tmp bar.00 rm -f bar.00.2367545368.tmp agf_preprocess -A -M foo.svm bar.01.00.2367545368.tmp 2 / 3 svm-train -b 1 -h 0 -c 25 bar.01.00.2367545368.tmp bar.01.00 rm -f bar.01.00.2367545368.tmp agf_preprocess -A -M foo.svm bar.01-00.2367545368.tmp 2 3 4 / 5 svm-train -b 1 -h 0 -c 25 bar.01-00.2367545368.tmp bar.01-00 rm -f bar.01-00.2367545368.tmp agf_preprocess -A -M foo.svm bar.01-01.2367545368.tmp 2 3 / 4 5 svm-train -b 1 -h 0 -c 25 bar.01-01.2367545368.tmp bar.01-01 rm -f bar.01-01.2367545368.tmp agf_preprocess -A -M foo.svm bar.2367545368.tmp 0 1 / 2 3 4 5 svm-train -b 1 -h 0 -c 25 bar.2367545368.tmp bar rm -f bar.2367545368.tmp Several things should be noted: • the -b switch is required and tells svm-train to generate probability estimates • the -+ option passes any other switches to use as “defaults” 22 • when used in this way, training data must be in ASCII format: -M switch for the same format as LIBSVM, otherwise it uses the same format as LVQPAC (see File Conversion, above) • the output control file is the same as before The training command, passed through --, should have the following syntax: train [options] data model/ where: • train is the training command • options are a set of options passed through the control file, through the -+ option, or directly from the end of the command line or through the command name • data is the training data in LVQ or SVM ASCII format • model is the output model reconizable to the prediction command, see below. Once the training has completed, classifications are performed by passing the prediction command, svm-predict, from LIBSVM to classify_m using the -O option: $>classify_m -M -O "svm-predict -b 1" foobar.txt test.svm output.svmout Once again, when paired with an external command, classify_m operates on ASCII files as opposed to the native AGF binary format. Output file format is the same as LIBSVM: a header consisting of the word, “labels”, followed by a list of labels, then one class label per line, followed by the conditional probabilities in the same order as the class labels in the header. Since “hierarchical” classification generates only one probability per estimate, in this case only one is written per line. While this is not LIBSVM conformant, the file conversion utility, svmout2agf, nonetheless recognizes it. The prediction command, passed by -O, should have the following syntax: predict test model output 23 where: • predict is the command name–if there are options they must be passed directly as part of this name; • test is the test data in LVQ or SVM ASCII format; • model is the binary classification model; • output are the predicted classes plus both conditional probabilities in the format described above. Note that the order of the first two arguments are reversed as compared to the libAGF convention in classify_b and classify_m. LIBSVM tends to be slow, especially if you have a lot of training data. There are currently three ways of converting LIBSVM models into borders models: the first two use the multi_borders command and work on LIBSVM/multi-borders hybrid models. First, you can used the -O switch to pass the prediction command (svm-predict -b 1 for LIBSVM models) to multi_borders so that it can train a faster multi-borders model. This solution has the advantage that it can work with any external binary classifier, not just LIBSVM. It has the disadvantage, however, that gradient vectors are calculated numerically, so is not very accurate. A better solution is the -Z switch which tells multi_borders to use the “in-house” SVM codes. To accelerate the previous model, pass the output classification control file from the previous pass: $>multi_borders -Z foobar.txt foo bar2 foobar2.txt which executes the following statements: class_borders class_borders class_borders class_borders class_borders -Z -Z -Z -Z -Z bar.00 foo bar2.00 bar.01.00 foo bar2.01.00 bar.01-00 foo bar2.01-00 bar.01-01 foo bar2.01-01 bar foo bar2 Note how we’re using class_borders to perform the training: it uses SVM as a source of conditional probabilities and finds a series of roots (zeroes) to sample the border between the two classes, just as it would with AGF estimates. To facilitate this, an extra parameter is included in the 24 class_borders command: in addition to the output file name in the last parameter, the first parameter is the name of the model used to predict the probabilities, while the second parameter is a file containing training data which is used to sample the space while searching for roots. Output is in the normal, AGF binary format and classify_m can be used as normal for classification: $>classify_m foobar2.txt test.vec output Another way of accelerating LIBSVM models, which only works with “native” LIBSVM models, is with the svm_accelerate command. Suppose we’ve trained a LIBSVM model, model.svmmod, as follows: $>svm-train foo.svm model.svmmod We can convert it to a multi-borders model as follows: $>svm2agf foo.svm foo $>svm_accelerate model.svmmod foo model.mod where foo.svm contains the training data in LIBSVM format and foo is the training data in libAGF format. Note that the multi-borders model in this case is stored in ASCII format all in one file but is still accepted by classify_m: $>classify_m model.mod test.vec output If a multi-file, multi-borders model is in one of the following configurations: one-vs-one, one-vs-the-rest, or partitioning of adjacent classes, then you can convert it to a single ASCII file using the mbh2mbm command. The file format fairly simple. There is a four line header with the following information: type of configuration (“1v1”, “1vR”, ”ADJ”), number of classes, list of class labels, and “polarities” of each of the binary classifiers. Next, all the binary classifiers are listed as pairs of matrices: first the border vectors, then the gradient vectors. Each matrix has a one line header with the number of vectors plus the size of each vector. Running the commands, print_control, multi_borders, classify_m, class_borders, svm_accelerate, or mbh2mbm without arguments prints a generous help screen including a description of most of the features explained in this section. There is also an example contained in the second half of the makefile under the examples/humidity_data directory (see, Section 7, EXAMPLES, above). 25 7 EXAMPLES The examples sub-directory collects together a number of test suites for comparison, validation and application of AGF. The makefiles in these test cases can give you a good idea of how to use the various components of the library. 7.1 examples/class borders The directory examples/class_borders includes a number of routines for testing the algorithms and comparing them with other, popular classification algorithms. The validation exercise is performed on a pair of twodimensional, synthetic test classes and is described in Mills (2011). To test the classification algorithms, type, make test. To test the pdf estimation routines, type, make test_pdf. To test both from the bottom level directory, type make test. All tests are done on a pair of synthetic test classes, described in the paper. To generate samples and analytical estimates for these classes, use the following commands sample_class Generates random samples of the synthetic test classes. classify_sc Returns class estimates using analytic/semi-analytic estimates of the class pdfs; classifications should therefore be close to the best possible for any supervised algorithm. pdf_sc1 Generates analytic pdf estimates for the first class. pdf_sc2 Generates semi-anaylitic (using quadrature) pdf estimates for the second class. sc_borders Find the border between the sample classes using the same algorithm as that employed for the class_borders command. To compare the algorithms with LIBSVM (Chang and Lin, 2011) and Kohonen’s LVQ algorithm (Kohonen, 2000), type, make compare. You must have both modules installed and the executables in your path. 7.2 examples/humidity data The test suite in examples/humidity_data tests the multi-borders paradigm on a discretized sub-set of the satellite humidity data described in Mills (2009). Use this directory for examples on how to use the multi-borders multi-class classification method. 26 7.3 examples/Landsat In the Landsat directory there are a number of scripts for performing surface classifications using Landsat data. In particular, there is a simple app that allows the user to classify pixels by hand. It opens a window with a Landsat scene in it; clicking one of the three mouse buttons allows you to classify pixels in the scene. There are three files that already contain hand-classified forest clearcut data. BIBLIOGRAPHY Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27. Kohonen, T. (2000). Self-Organizing Maps. Springer-Verlag, 3rd edition. Michie, D., Spiegelhalter, D. J., and Tayler, C. C., editors (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood Series in Artificial Intelligence. Prentice Hall, Upper Saddle River, NJ. Available online at: http://www.amsta.leeds.ac.uk/~charles/statlog/. Mills, P. (2009). Isoline retrieval: An optimal method for validation of advected contours. Computers & Geosciences, 35(11):2020–2031. Mills, P. (2011). Efficient statistical classification of satellite measurements. International Journal of Remote Sensing, 32(21):6109–6132. Mills, P. (2014). arxiv:1404.4095. Multi-borders classification. Technical Report Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., and Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201. Terrell, D. G. and Scott, D. W. (1992). Variable kernel density estimation. Annals of Statistics, 20:1236–1265. 27
Source Exif Data:
File Type : PDF File Type Extension : pdf MIME Type : application/pdf PDF Version : 1.5 Linearized : No Page Count : 27 XMP Toolkit : XMP toolkit 2.9.1-13, framework 1.6 About : uuid:cb398371-0122-11f4-0000-a981a33a0bd1 Producer : GPL Ghostscript 9.22 Modify Date : 2018:10:05 21:22:09-04:00 Create Date : 2018:10:05 21:22:09-04:00 Creator Tool : UnknownApplication Document ID : uuid:cb398371-0122-11f4-0000-a981a33a0bd1 Format : application/pdf Title : UntitledEXIF Metadata provided by EXIF.tools