Miss Ranger Manual


Package ‘missRanger’

June 14, 2019

Title Fast Imputation of Missing Values

Version 2.0.1

Description Alternative implementation of the beautiful 'MissForest' algorithm used to impute mixed-type data sets by chaining random forests, introduced by Stekhoven, D.J. and Buehlmann, P. (2012) <doi:10.1093/bioinformatics/btr597>. Under the hood, it uses the lightning fast random jungle package 'ranger'. Between the iterative model fitting, we offer the option of using predictive mean matching. This firstly avoids imputation with values not already present in the original data (like a value 0.3334 in a 0-1 coded variable). Secondly, predictive mean matching tries to raise the variance in the resulting conditional distributions to a realistic level. This would allow, e.g., multiple imputation by repeating the call to missRanger().

Depends R (>= 3.5.0)

License GPL(>= 2)

Encoding UTF-8

LazyData true

Type Package

Date 2019-06-14

Imports stats, FNN (>= 1.1), ranger (>= 0.10)

Author Michael Mayer [aut, cre, cph]

Maintainer Michael Mayer <mayermichael79@gmail.com>

RoxygenNote 6.1.1

NeedsCompilation no

R topics documented:

allVarsTwoSided
generateNA
imputeUnivariate
missRanger
pmm


allVarsTwoSided Extraction of Variable Names from Two-Sided Formula.

Description

Takes a formula and a data frame and returns all variable names in both the lhs and the rhs. lhs and

rhs are evaluated separately. This is relevant if both sides contain a "." (= all variables).

Usage

allVarsTwoSided(formula, data)

Arguments

formula A two-sided formula object.

data A data.frame. Primarily used to deal with "." in the formula.

Value

A list with two character vectors of variable names.

Examples

allVarsTwoSided(Species + Sepal.Width ~ Petal.Width, iris)

allVarsTwoSided(. ~ ., iris)

allVarsTwoSided(.-Species ~ Sepal.Width, iris)

allVarsTwoSided(. ~ Sepal.Width, iris)
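The separate evaluation of "." on each side can be illustrated with base R alone. The following is a minimal sketch, not the package's code: the helper name two_sided_vars is hypothetical, and it assumes that expanding each side as its own one-sided formula with stats::terms() matches the behaviour described above.

```r
# Hypothetical helper illustrating per-side expansion of "." (not part of missRanger).
two_sided_vars <- function(formula, data) {
  stopifnot(inherits(formula, "formula"), length(formula) == 3L)
  sides <- as.character(formula)[2:3]  # text of the lhs and of the rhs
  lapply(sides, function(side) {
    one_sided <- stats::reformulate(side)  # e.g. ~ . - Species
    # terms() expands "." against the columns of data and drops removed terms
    attr(stats::terms(one_sided, data = data), "term.labels")
  })
}

two_sided_vars(. - Species ~ Sepal.Width, iris)
```

Because each side is expanded on its own, removing Species on the left does not affect what "." would mean on the right.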

generateNA Adds Missing Values to a Data Set

Description

Takes a data frame and randomly replaces a proportion of the values in each column with missing values.

Usage

generateNA(data, p = 0.1, seed = NULL)

Arguments

data A data.frame.

p Proportion of missing values to approximately add to each column of data.

seed An integer seed.

Value

data with missing values.

Examples

head(generateNA(iris))
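To make the column-wise behaviour concrete, here is a minimal base R sketch of the idea. The function name add_na is hypothetical, and blanking out exactly round(p * n) entries per column is an assumption made for illustration; the package's own implementation may differ.

```r
# Hypothetical sketch of column-wise NA injection (not the package's code).
add_na <- function(data, p = 0.1, seed = NULL) {
  stopifnot(is.data.frame(data), p >= 0, p <= 1)
  if (!is.null(seed)) set.seed(seed)
  data[] <- lapply(data, function(column) {
    n <- length(column)
    column[sample.int(n, round(p * n))] <- NA  # blank out about p * n entries
    column
  })
  data
}

# Share of missing values per column
colMeans(is.na(add_na(iris, p = 0.2, seed = 1)))
```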


imputeUnivariate Univariate Imputation

Description

Fills missing values of a vector of any type by sampling with replacement from the non-missing

values. Requires at least one non-missing value to run.

Usage

imputeUnivariate(x, seed = NULL)

Arguments

x A vector of any type, possibly containing missing values.

seed An integer seed.

Value

A vector of the same length and type as x but without missing values.

Examples

imputeUnivariate(c(NA, 0, 1, 0, 1))

imputeUnivariate(c("A", "A", NA))

imputeUnivariate(as.factor(c("A", "A", NA)))

# Impute a whole data set univariately

ir <- generateNA(iris)

head(imputed <- do.call(data.frame, lapply(ir, imputeUnivariate)))
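The sampling step described above can be sketched in a few lines of base R. This is an illustrative sketch under the stated description only; the name impute_by_sampling is hypothetical and not part of the package.

```r
# Hypothetical sketch: fill NAs by sampling observed values with replacement.
impute_by_sampling <- function(x, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  miss <- is.na(x)
  stopifnot(!all(miss))  # at least one non-missing value is required
  observed <- x[!miss]
  # Sample indices rather than values, so a single observed number is not
  # misread by sample() as the range 1:x.
  x[miss] <- observed[sample.int(length(observed), sum(miss), replace = TRUE)]
  x
}

impute_by_sampling(c(NA, 0, 1, 0, 1), seed = 1)
impute_by_sampling(as.factor(c("A", "A", NA)))
```

Indexing into the observed values keeps the type and, for factors, the levels of the input.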

missRanger Fast Imputation of Missing Values by Chained Random Forests

Description

Uses the "ranger" package [1] to do fast missing value imputation by chained random forests, see [2] and [3]. Between the iterative model fitting, it offers the option of predictive mean matching. This firstly avoids imputation with values not present in the original data (like a value 0.3334 in a 0-1 coded variable). Secondly, predictive mean matching tries to raise the variance in the resulting conditional distributions to a realistic level and, as such, allows multiple imputation by repeating the call to missRanger(). The iterative chaining stops as soon as maxiter is reached or when the average out-of-bag estimate of performance stops improving. In the latter case, except for the first iteration, the second-to-last (i.e. best) imputed data is returned.
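The chaining and stopping rule described above can be sketched roughly as follows. This is a simplified illustration, not missRanger's implementation: the function name chained_rf_sketch is hypothetical, predictive mean matching and formula handling are left out, and the sketch assumes a data.frame whose columns are numeric or factor and each contain at least one observed value.

```r
library(ranger)

# Simplified sketch of chained random-forest imputation (not missRanger's code).
chained_rf_sketch <- function(data, maxiter = 10L, num.trees = 100L, seed = 1L) {
  set.seed(seed)
  na_pos <- lapply(data, is.na)
  to_impute <- names(data)[vapply(na_pos, any, logical(1))]
  if (length(to_impute) == 0L) return(data)

  # Start from a univariate fill: sample observed values with replacement.
  for (v in to_impute) {
    obs <- data[[v]][!na_pos[[v]]]
    n_miss <- sum(na_pos[[v]])
    data[[v]][na_pos[[v]]] <- obs[sample.int(length(obs), n_miss, replace = TRUE)]
  }

  previous <- data
  previous_error <- Inf
  for (iter in seq_len(maxiter)) {
    errors <- numeric(0)
    for (v in to_impute) {
      miss <- na_pos[[v]]
      # Fit on rows where v was observed, then predict the originally missing rows.
      fit <- ranger(dependent.variable.name = v,
                    data = data[!miss, , drop = FALSE],
                    num.trees = num.trees)
      data[[v]][miss] <- predict(fit, data[miss, , drop = FALSE])$predictions
      # OOB error: 1 - R-squared for regression, misclassification rate otherwise.
      err <- if (fit$treetype == "Regression") 1 - fit$r.squared else fit$prediction.error
      errors <- c(errors, err)
    }
    current_error <- mean(errors)
    # No improvement: return the data from the previous (best) iteration.
    if (current_error >= previous_error) return(previous)
    previous <- data
    previous_error <- current_error
  }
  data
}

# Usage on iris with 15 values per column blanked out
set.seed(42)
ir <- iris
ir[] <- lapply(ir, function(col) { col[sample.int(length(col), 15)] <- NA; col })
head(chained_rf_sketch(ir))
```

Because the previous error starts at infinity, the stopping test can never trigger in the first iteration, which mirrors the exception noted in the description.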

Usage

missRanger(data, formula = . ~ ., pmm.k = 0L, maxiter = 10L,

seed = NULL, verbose = 1, returnOOB = FALSE, case.weights = NULL,

...)


Arguments

data A data.frame or tibble with missing values to impute.

formula A two-sided formula specifying variables to be imputed (left hand side) and

variables used to impute (right hand side). Defaults to . ~ ., i.e. use all variables

to impute all variables. If e.g. all variables (with missings) should be imputed

by all variables except variable "ID", use . ~ . - ID. Note that a "." is evaluated

separately for both sides of the formula. Note that variables with missings must

appear in the left hand side if they should be used on the right hand side.

pmm.k Number of candidate non-missing values to sample from in the predictive mean

matching step. 0 to avoid this step.

maxiter Maximum number of chaining iterations.

seed Integer seed to initialize the random generator.

verbose Controls how much info is printed to screen. 0 to print nothing. 1 (default) to

print a "." per iteration and variable, 2 to print the OOB prediction error per

iteration and variable (1 minus R-squared for regression).

returnOOB Logical flag. If TRUE, the final average out-of-bag prediction error is added to the output as attribute "oob".

case.weights Vector with weight per observation in the data set used in fitting the random forests.

... Arguments passed to ranger. If the data set is large, consider using fewer trees (e.g. num.trees = 100) and/or a low value of sample.fraction. The following arguments are incompatible: data, write.forest, probability, split.select.weights, dependent.variable.name, and classification.

Value

An imputed data.frame.

References

[1] Wright, M. N. and Ziegler, A. (2016). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, in press. http://arxiv.org/abs/1508.04409.

[2] Stekhoven, D. J. and Buehlmann, P. (2012). MissForest - nonparametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118. https://doi.org/10.1093/bioinformatics/btr597.

[3] Van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. http://www.jstatsoft.org/v45/i03/

Examples

irisWithNA <- generateNA(iris)

irisImputed <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100)

head(irisImputed)

head(irisWithNA)

# With extra trees algorithm

irisImputed_et <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100, splitrule = "extratrees")

head(irisImputed_et)

# Do not impute Species. Note: Since this variable contains missings, it cannot be used

# to impute the other variables as well.


irisImputed <- missRanger(irisWithNA, . - Species ~ ., pmm.k = 3, num.trees = 100)

# Impute univariately only.

irisImputed <- missRanger(irisWithNA, . ~ 1)

# Use Species and Petal.Length to impute Species and Petal.Length.

irisImputed <- missRanger(irisWithNA, Species + Petal.Length ~ Species + Petal.Length,

pmm.k = 3, num.trees = 100)
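The description mentions multiple imputation by repeating the call to missRanger(). A minimal sketch of that workflow, using only the arguments documented above, could look like this. The choice of five imputations, the seeds, and the linear model are arbitrary assumptions for illustration, and averaging coefficients is a simplification rather than a full pooling procedure.

```r
library(missRanger)

irisWithNA <- generateNA(iris, p = 0.1, seed = 1)

# Five imputed copies, each from a separate call with its own seed.
# pmm.k > 0 is used so that repeated calls produce varying imputations.
imputations <- lapply(1:5, function(s) {
  missRanger(irisWithNA, pmm.k = 3, num.trees = 100, seed = s, verbose = 0)
})

# Fit the same model on every imputed copy and compare the coefficients.
fits <- lapply(imputations, function(d) lm(Sepal.Length ~ Petal.Length, data = d))
coefs <- sapply(fits, coef)  # one column per imputed data set
rowMeans(coefs)              # simple average across imputations
```

The spread of the columns in coefs gives a rough impression of how much the analysis depends on the imputed values.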

pmm Predictive Mean Matching

Description

This function is used internally only but might help others to implement an efficient way of doing predictive mean matching on top of any prediction based missing value imputation. It works as follows: For each predicted value of a vector xtest, the closest k predicted values of another vector xtrain are identified by k-nearest neighbour. Then, one of those neighbours is randomly picked and its corresponding observed value in ytrain is returned.

Usage

pmm(xtrain, xtest, ytrain, k = 1L, seed = NULL)

Arguments

xtrain Vector with predicted values in the training data set.

xtest Vector with predicted values in the test data set.

ytrain Vector with observed response in the training data set.

k Number of nearest neighbours to choose from. Set k = 0 if no predictive mean matching is to be done.

seed Integer random seed.

Value

Vector with predicted values in the test data set based on predictive mean matching.

Examples

pmm(xtrain = c(0.2, 0.2, 0.8), xtest = 0.3, ytrain = c(0, 0, 1), k = 1) # 0

pmm(xtrain = c(0.2, 0.2, 0.8), xtest = 0.3, ytrain = c(0, 0, 1), k = 3) # 0 or 1

pmm(xtrain = c("A", "A", "B"), xtest = "B", ytrain = c("B", "A", "B"), k = 1) # B

pmm(xtrain = c("A", "A", "B"), xtest = "B", ytrain = c("B", "A", "B"), k = 2) # A or B
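For numeric predictions, the matching step described above can be sketched in base R without a nearest-neighbour library. This is an illustration only: the name pmm_sketch is hypothetical, it uses a brute-force search over absolute differences where the package relies on k-nearest neighbour, and it does not cover the character or k = 0 cases.

```r
# Hypothetical brute-force sketch of predictive mean matching (numeric case only).
pmm_sketch <- function(xtrain, xtest, ytrain, k = 1L, seed = NULL) {
  stopifnot(is.numeric(xtrain), is.numeric(xtest),
            length(xtrain) == length(ytrain), k >= 1L)
  if (!is.null(seed)) set.seed(seed)
  k <- min(k, length(xtrain))
  picked <- vapply(xtest, function(x0) {
    nearest <- order(abs(xtrain - x0))[seq_len(k)]  # k closest training predictions
    nearest[sample.int(k, 1L)]                      # pick one of them at random
  }, integer(1))
  ytrain[picked]  # return the observed responses of the picked neighbours
}

pmm_sketch(xtrain = c(0.2, 0.2, 0.8), xtest = 0.3, ytrain = c(0, 0, 1), k = 1)
pmm_sketch(xtrain = c(0.2, 0.2, 0.8), xtest = 0.3, ytrain = c(0, 0, 1), k = 3)
```

With k = 1 the result is deterministic; larger k trades closeness of the match for more variation in the imputed values.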

