The tree has fallen

This blog post promotes and explains the use of forestFloor to interpret random forest model fits.

forestFloor on CRAN – forestFloor development on GitHub

Two examples from the Cross Validated forum: [1] pima_indians, [2] simulated_data

Here’s a teaser on how to visualize the structure of an RF model fitted to the PimaIndiansDiabetes dataset; see the R code below:

Variables are sorted by variable importance (reading direction). The y-axis shows the cross-validated feature contributions (the change in predicted probability attributed to the variable), and the x-axis shows the variable values. The color gradient colors all samples by glucose, so the gradient reveals how the other variables interact with glucose. The fitted line (leave-one-out kNN with a Gaussian kernel) describes how well each variable's effect is explained by the variable itself, and R2 quantifies the goodness-of-fit when the variable effect is visualized as a main effect.

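Each panel of the plot is one additive component of the model mapping. For a binary target fitted with binary_reg=TRUE, the cross-validated feature contributions of a sample should, together with the grand mean of the out-of-bag votes, sum to approximately its out-of-bag predicted probability. Below is a minimal sketch of this sanity check, assuming the fitted object from the code at the end of this post stores the contributions in ff$FCmatrix:

#sketch: feature contributions (plus the grand mean) should roughly recompose
#the OOB predicted probability; ff and rf.robust come from the code below,
#and FCmatrix is assumed to hold one column of contributions per predictor
pred.oob   = rf.robust$votes[,2]                    #OOB vote fraction for class "pos"
recomposed = mean(pred.oob) + rowSums(ff$FCmatrix)  #grand mean + sum of contributions
plot(pred.oob, recomposed); abline(0,1,col=2)       #points should fall near the identity line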

useR! 2015 presentation

Title: forestFloor: a package to visualize and comprehend the full curvature of random forests. Søren Havelund Welling, DTU Compute; Novo Nordisk, Insulin Pharmacology Research:

forestFloor is an add-on to the randomForest[1] package. It enables users to explore the curvature of a random forest model fit. In general, for any problem where a random forest has superior prediction performance, it is of great interest to learn its model mapping. Even within statistical fields where random forest is far from standard practice, such insight from a data-driven analysis can inspire how a given model-driven analysis could be improved. forestFloor is machine learning to learn from the machine! The mapping function of a random forest model is most often high-dimensional and therefore difficult to visualize and interpret. However, with a new concept, feature contributions[2-3], it is possible to split the random forest mapping function into additive components and understand its full curvature. Hereby the forestFloor package provides greatly extended functionality compared to the original partial dependence plot in the randomForest package. Exploring the curvature of random forests through series of 2D/3D plots with color gradients is fun and quite intuitive. forestFloor relies, amongst others, on the Rcpp, rgl and kknn packages to produce visualizations fast and smoothly.
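
To make the comparison with partial dependence plots concrete, the sketch below sets the classical randomForest::partialPlot next to the corresponding forestFloor plot for the most important variable (glucose in this fit). It reuses rf.robust, X, ff and Col from the example code at the end of this post; plot_seq is assumed here to select which variables (ordered by importance) are plotted.

#sketch: classical partial dependence vs. cross-validated feature contributions
#for the most important variable; objects are taken from the code below
partialPlot(rf.robust, X, "glucose", which.class="pos")  #averaged marginal effect only
plot(ff, plot_seq=1, col=Col, plot_GOF=TRUE)             #per-sample effects, colored by glucose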
code for the pima_indians example:

#load packages and the Pima Indians diabetes data
rm(list=ls())
set.seed(1)
library(mlbench)
library(randomForest)
library(forestFloor)
library(AUC)
data(PimaIndiansDiabetes)
y = PimaIndiansDiabetes$diabetes     #binary target: neg/pos diabetes diagnosis
X = PimaIndiansDiabetes
X = X[,!names(X)=="diabetes"]        #predictors only

#train the default model and a strongly regularized model with similar predictive performance
rf.default = randomForest(X,y,ntree=5000)
rf.robust = randomForest(X,y,sampsize=25,ntree=5000,mtry=4,
                         keep.inbag = T,keep.forest = T) #keep.inbag/keep.forest are required by forestFloor
#verify similar performance
plot(roc(rf.default$votes[,2],y),main="ROC: default black, robust is red")
plot(roc(rf.robust$votes[,2],y),col=2,add = T)
auc(roc(rf.default$votes[,2],y))
auc(roc(rf.robust$votes[,2],y))

#compute cross-validated feature contributions
ff = forestFloor(rf.robust,X,binary_reg = T,calc_np=T)
#color gradient along the most important variable (glucose); outlier.lim limits the influence of outliers
Col = fcol(ff,cols=1,outlier.lim = 2.5)

#the plot in this blog
plot(ff,col=Col,plot_GOF = T)
#some 3d plots
show3d(ff,c(1,5),5,col=Col,plot_GOF = T)
library(rgl); rgl.snapshot("3dPressure.png")
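
A natural follow-up is to rerun the plot with the color gradient running along another variable, to inspect interactions from that variable's point of view. A small sketch, assuming cols in fcol indexes the variables by importance as in the call above (so cols=2 picks the second most important variable):

#recolor by the second most important variable and re-plot the feature contributions
Col2 = fcol(ff,cols=2,outlier.lim = 2.5)
plot(ff,col=Col2,plot_GOF = T)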