Miss Ranger Manual
Package ‘missRanger’
June 14, 2019
Title Fast Imputation of Missing Values
Version 2.0.1
Description Alternative implementation of the beautiful 'MissForest' algorithm used to impute
mixed-type data sets by chaining random forests, introduced by Stekhoven, D.J. and
Buehlmann, P. (2012) <doi:10.1093/bioinformatics/btr597>. Under the hood, it uses the
lightning fast random jungle package 'ranger'. Between the iterative model fitting, we offer
the option of using predictive mean matching. This firstly avoids imputation with values not
already present in the original data (like a value 0.3334 in a 0-1 coded variable). Secondly,
predictive mean matching tries to raise the variance in the resulting conditional
distributions to a realistic level. This makes it possible, for example, to do multiple
imputation by repeating the call to missRanger().
Depends R (>= 3.5.0)
License GPL(>= 2)
Encoding UTF-8
LazyData true
Type Package
Date 2019-06-14
Imports stats, FNN (>= 1.1), ranger (>= 0.10)
Author Michael Mayer [aut, cre, cph]
Maintainer Michael Mayer <mayermichael79@gmail.com>
RoxygenNote 6.1.1
NeedsCompilation no
R topics documented:
allVarsTwoSided
generateNA
imputeUnivariate
missRanger
pmm
allVarsTwoSided Extraction of Variable Names from Two-Sided Formula.
Description
Takes a formula and a data frame and returns all variable names in both the lhs and the rhs. lhs and
rhs are evaluated separately. This is relevant if both sides contain a "." (= all variables).
Usage
allVarsTwoSided(formula, data)
Arguments
formula A two-sided formula object.
data A data.frame. Primarily used to deal with "." in the formula.
Value
A list with two character vectors of variable names.
Examples
allVarsTwoSided(Species + Sepal.Width ~ Petal.Width, iris)
allVarsTwoSided(. ~ ., iris)
allVarsTwoSided(.-Species ~ Sepal.Width, iris)
allVarsTwoSided(. ~ Sepal.Width, iris)
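The separate evaluation of the two sides can be illustrated with a base R sketch. It is not the package's implementation: the helper names side_vars and two_sided_vars are hypothetical, and the sketch assumes that every term is a plain column name, relying on terms() to expand "." against data.

```r
# Hypothetical helper: expand one side of a formula against `data`.
side_vars <- function(expr, data) {
  one_sided <- eval(call("~", expr))  # builds e.g. ~ . - Species
  attr(terms(one_sided, data = data), "term.labels")
}

# Hypothetical stand-in for allVarsTwoSided(): each side is expanded on its
# own, so a "." on the left does not depend on what appears on the right.
two_sided_vars <- function(formula, data) {
  stopifnot(inherits(formula, "formula"), length(formula) == 3L)
  list(lhs = side_vars(formula[[2]], data),
       rhs = side_vars(formula[[3]], data))
}

two_sided_vars(. - Species ~ Sepal.Width, iris)
```

Here the lhs holds the four numeric columns of iris and the rhs holds only Sepal.Width.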
generateNA Adds Missing Values to a Data Set
Description
Takes a data frame and randomly replaces a proportion of the values with missing values.
Usage
generateNA(data, p = 0.1, seed = NULL)
Arguments
data A data.frame.
p Proportion of missing values to approximately add to each column of data.
seed An integer seed.
Value
data with missing values.
Examples
head(generateNA(iris))
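What the function does can be sketched in base R as follows. The name add_na is hypothetical, and blanking out exactly round(p * n) positions per column is an assumption made for illustration; the package may select positions differently.

```r
# Hypothetical stand-in for generateNA(): set round(p * n) randomly chosen
# positions to NA in every column.
add_na <- function(data, p = 0.1, seed = NULL) {
  stopifnot(is.data.frame(data), p >= 0, p <= 1)
  if (!is.null(seed)) set.seed(seed)
  n <- nrow(data)
  data[] <- lapply(data, function(col) {
    col[sample.int(n, round(p * n))] <- NA
    col
  })
  data
}

ir <- add_na(iris, p = 0.1, seed = 1)
colSums(is.na(ir))  # 15 per column, since round(0.1 * 150) = 15
```

Assigning through data[] keeps the data.frame structure and the column types, so Species remains a factor.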
imputeUnivariate Univariate Imputation
Description
Fills missing values of a vector of any type by sampling with replacement from the non-missing
values. Requires at least one non-missing value to run.
Usage
imputeUnivariate(x, seed = NULL)
Arguments
x A vector of any type possibly containing missing values.
seed An integer seed.
Value
A vector of the same length and type as x but without missing values.
Examples
imputeUnivariate(c(NA, 0, 1, 0, 1))
imputeUnivariate(c("A", "A", NA))
imputeUnivariate(as.factor(c("A", "A", NA)))
# Impute a whole data set univariately
ir <- generateNA(iris)
head(imputed <- do.call(data.frame, lapply(ir, imputeUnivariate)))
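Sampling with replacement from the non-missing values can be sketched in a few lines of base R. The name impute_by_sampling is hypothetical and this is an illustration, not the package code.

```r
# Hypothetical stand-in for imputeUnivariate().
impute_by_sampling <- function(x, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  miss <- is.na(x)
  obs <- x[!miss]
  if (length(obs) == 0L) stop("Need at least one non-missing value.")
  # Sample positions rather than values: sample(5, 1) would draw from 1:5.
  x[miss] <- obs[sample.int(length(obs), sum(miss), replace = TRUE)]
  x
}

impute_by_sampling(c(NA, 5))                 # both values are 5
impute_by_sampling(factor(c("A", "A", NA)))  # all three values are "A"
```

Indexing by sampled positions keeps the type of x and avoids the special behaviour of sample() when only one numeric value is observed.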
missRanger Fast Imputation of Missing Values by Chained Random Forests
Description
Uses the "ranger" package [1] to do fast missing value imputation by chained random forests, see
[2] and [3]. Between the iterative model fitting, it offers the option of predictive mean matching.
This firstly avoids imputation with values not present in the original data (like a value 0.3334 in a
0-1 coded variable). Secondly, predictive mean matching tries to raise the variance in the resulting
conditional distributions to a realistic level and, as such, makes it possible to do multiple imputation
by repeating the call to missRanger(). The iterative chaining stops as soon as maxiter is reached or
the average out-of-bag estimate of performance stops improving. In the latter case, except for the
first iteration, the second to last (i.e. best) imputed data is returned.
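The chaining and the stopping rule can be sketched in base R under strong simplifying assumptions. The function name chain_impute is hypothetical, only numeric columns are handled, lm() stands in for ranger(), and the in-sample value of 1 minus R-squared stands in for the out-of-bag error. It shows the control flow only, not the package's implementation.

```r
# Minimal sketch of chained imputation for numeric columns only.
# Assumptions: lm() replaces ranger(); in-sample 1 - R^2 replaces the OOB error.
chain_impute <- function(data, maxiter = 10L, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  miss <- lapply(data, is.na)  # remember where values were missing
  to_impute <- names(data)[vapply(miss, any, logical(1))]

  # Start values: sample from the observed values of each column
  for (v in to_impute) {
    obs <- data[[v]][!miss[[v]]]
    data[[v]][miss[[v]]] <- obs[sample.int(length(obs), sum(miss[[v]]), replace = TRUE)]
  }

  best_err <- Inf
  best_data <- data
  for (iter in seq_len(maxiter)) {
    err <- numeric(0)
    for (v in to_impute) {
      # Fit on rows where v was observed, using current values of the others
      train <- data[!miss[[v]], , drop = FALSE]
      fit <- lm(reformulate(setdiff(names(data), v), response = v), data = train)
      data[[v]][miss[[v]]] <- predict(fit, newdata = data[miss[[v]], , drop = FALSE])
      err[v] <- 1 - summary(fit)$r.squared
    }
    # No improvement in the average error: return the previous data
    if (mean(err) >= best_err) return(best_data)
    best_err <- mean(err)
    best_data <- data
  }
  data  # maxiter reached
}

d <- iris[1:4]
set.seed(1)
for (v in names(d)) d[[v]][sample.int(nrow(d), 15)] <- NA
imputed <- chain_impute(d, seed = 2)
anyNA(imputed)  # FALSE
```

Originally observed values are never overwritten, because predictions are written only to the positions recorded in miss.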
Usage
missRanger(data, formula = . ~ ., pmm.k = 0L, maxiter = 10L,
seed = NULL, verbose = 1, returnOOB = FALSE, case.weights = NULL,
...)
Arguments
data A data.frame or tibble with missing values to impute.
formula A two-sided formula specifying variables to be imputed (left hand side) and
variables used to impute (right hand side). Defaults to . ~ ., i.e. use all variables
to impute all variables. If e.g. all variables (with missings) should be imputed
by all variables except variable "ID", use . ~ . - ID. Note that a "." is evaluated
separately for both sides of the formula. Note that variables with missings must
appear in the left hand side if they should be used on the right hand side.
pmm.k Number of candidate non-missing values to sample from in the predictive mean
matching step. 0 to avoid this step.
maxiter Maximum number of chaining iterations.
seed Integer seed to initialize the random generator.
verbose Controls how much info is printed to screen. 0 to print nothing. 1 (default) to
print a "." per iteration and variable, 2 to print the OOB prediction error per
iteration and variable (1 minus R-squared for regression).
returnOOB Logical flag. If TRUE, the final average out-of-bag prediction error is added to
the output as attribute "oob".
case.weights Vector with weight per observation in the data set used in fitting the random
forests.
... Arguments passed to ranger. If the data set is large, it is better to use fewer trees (e.g.
num.trees = 100) and/or a low value of sample.fraction. The following arguments
are incompatible: data, write.forest, probability, split.select.weights,
dependent.variable.name, and classification.
Value
An imputed data.frame.
References
[1] Wright, M. N. & Ziegler, A. (2016). ranger: A Fast Implementation of Random Forests for High
Dimensional Data in C++ and R. Journal of Statistical Software, in press. http://arxiv.org/abs/1508.04409.
[2] Stekhoven, D.J. and Buehlmann, P. (2012). 'MissForest - nonparametric missing value imputation
for mixed-type data', Bioinformatics, 28(1) 2012, 112-118. https://doi.org/10.1093/bioinformatics/btr597.
[3] Van Buuren, S., Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained
Equations in R. Journal of Statistical Software, 45(3), 1-67. http://www.jstatsoft.org/v45/i03/
Examples
irisWithNA <- generateNA(iris)
irisImputed <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100)
head(irisImputed)
head(irisWithNA)
# With extra trees algorithm
irisImputed_et <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100, splitrule = "extratrees")
head(irisImputed_et)
# Do not impute Species. Note: Since this variable contains missings, it cannot be used
# to impute the other variables as well.
irisImputed <- missRanger(irisWithNA, . - Species ~ ., pmm.k = 3, num.trees = 100)
# Impute univariately only.
irisImputed <- missRanger(irisWithNA, . ~ 1)
# Use Species and Petal.Length to impute Species and Petal.Length.
irisImputed <- missRanger(irisWithNA, Species + Petal.Length ~ Species + Petal.Length,
pmm.k = 3, num.trees = 100)
pmm Predictive Mean Matching
Description
This function is used internally only but might help others to implement an efficient way of doing
predictive mean matching on top of any prediction based missing value imputation. It works as
follows: For each predicted value of a vector xtest, the closest k predicted values of another vector
xtrain are identified by k-nearest neighbour. Then, one of those neighbours is randomly picked
and its corresponding observed value in ytrain is returned.
Usage
pmm(xtrain, xtest, ytrain, k = 1L, seed = NULL)
Arguments
xtrain Vector with predicted values in the training data set.
xtest Vector with predicted values in the test data set.
ytrain Vector with observed response in the training data set.
k Number of nearest neighbours to choose from. Set k = 0 if no predictive mean
matching is to be done.
seed Integer random seed.
Value
Vector with predicted values in the test data set based on predictive mean matching.
Examples
pmm(xtrain = c(0.2, 0.2, 0.8), xtest = 0.3, ytrain = c(0, 0, 1), k = 1) # 0
pmm(xtrain = c(0.2, 0.2, 0.8), xtest = 0.3, ytrain = c(0, 0, 1), k = 3) # 0 or 1
pmm(xtrain = c("A", "A", "B"), xtest = "B", ytrain = c("B", "A", "B"), k = 1) # B
pmm(xtrain = c("A", "A", "B"), xtest = "B", ytrain = c("B", "A", "B"), k = 2) # A or B
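The matching step described above can be sketched in base R for numeric predictions. The name pmm_sketch is hypothetical, a brute-force distance search replaces the k-nearest-neighbour lookup, and non-numeric inputs are not handled, so this is an illustration and not the package's implementation.

```r
# Hypothetical numeric-only stand-in for pmm().
pmm_sketch <- function(xtrain, xtest, ytrain, k = 1L, seed = NULL) {
  stopifnot(length(xtrain) == length(ytrain), k >= 1L)
  if (!is.null(seed)) set.seed(seed)
  k <- min(k, length(xtrain))
  picked <- vapply(xtest, function(xt) {
    nearest <- order(abs(xtrain - xt))[seq_len(k)]  # k closest training predictions
    nearest[sample.int(k, 1L)]                      # pick one donor at random
  }, integer(1))
  ytrain[picked]  # return the donors' observed values
}

pmm_sketch(xtrain = c(0.2, 0.2, 0.8), xtest = c(0.3, 0.9),
           ytrain = c(0, 0, 1), k = 1)  # 0 1
```

Because the returned values are taken from ytrain, the result can only contain values that were actually observed.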