This blog post promotes and explains the use of forestFloor to interpret random forest model fits.

forestFloor on cran – forestFloor development on GitHub

Two examples from the forum *Cross Validated*: [1] pima_indians, [2] simulated_data

Here’s a teaser showing how to visualize the structure of an RF model fit on the PimaIndiansDiabetes dataset; see the R code below:

Title: forestFloor: a package to visualize and comprehend the full curvature of random forests. Søren Havelund Welling, DTU Compute; Novo Nordisk, Insulin Pharmacology Research:

forestFloor is an add-on to the randomForest[1] package. It enables users to explore the curvature of a random forest model fit. In general, for any problem where a random forest has superior prediction performance, it is of great interest to learn its model mapping. Even within statistical fields where random forest is far from standard practice, such insight from a data-driven analysis can give inspiration to how a given model-driven analysis could be improved. forestFloor is machine learning to learn from the machine! The mapping function of a random forest model is most often high-dimensional and therefore difficult to visualize and interpret. However, with a new concept, feature contributions[2-3], it is possible to split the random forest mapping function into additive components and understand the full curvature. Hereby the forestFloor package provides greatly extended functionality compared to the original partial dependence plot provided in the randomForest package. Exploring the curvature of random forests through series of 2D/3D plots with color gradients is fun and quite intuitive. forestFloor relies on, among other packages, Rcpp, rgl and kknn to produce visualizations fast and smoothly.
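To make the feature-contribution idea concrete, here is a minimal base-R sketch (not the forestFloor API): along a tree's decision path, each split changes the current node's mean prediction, and that change is credited to the feature used in the split. Summing the baseline and the per-feature changes reconstructs the prediction additively. The data and the hand-rolled depth-2 path below are invented purely for illustration.

```r
set.seed(1)
x1 <- runif(100); x2 <- runif(100)
y  <- 2 * x1 + (x2 > 0.5) + rnorm(100, sd = 0.1)

root_mean <- mean(y)                     # baseline: prediction at the root

# hand-rolled decision path for a sample with x1 = 0.8, x2 = 0.2:
# split 1 on x1 > 0.5, split 2 on x2 <= 0.5
node1 <- y[x1 > 0.5]                     # samples after the x1 split
node2 <- y[x1 > 0.5 & x2 <= 0.5]         # samples after the x2 split

contrib_x1 <- mean(node1) - root_mean    # change credited to x1
contrib_x2 <- mean(node2) - mean(node1)  # change credited to x2
prediction <- mean(node2)                # leaf prediction

# the baseline plus the per-feature contributions equal the prediction
stopifnot(isTRUE(all.equal(root_mean + contrib_x1 + contrib_x2, prediction)))
```

In a real forest the same bookkeeping is averaged over all trees, which is what yields one additive contribution curve per feature.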

Code for the pima_indians example:

rm(list=ls())
set.seed(1)
library(mlbench)
library(randomForest)
library(forestFloor)
library(AUC)

data(PimaIndiansDiabetes)
y = PimaIndiansDiabetes$diabetes
X = PimaIndiansDiabetes
X = X[,!names(X)=="diabetes"]

#train default model and the most regularized model with same predictive performance
rf.default = randomForest(X,y,ntree=5000)
rf.robust  = randomForest(X,y,sampsize=25,ntree=5000,mtry=4,
                          keep.inbag=T,keep.forest=T)

#verify similar performance
plot(roc(rf.default$votes[,2],y),main="ROC: default black, robust is red")
plot(roc(rf.robust$votes[,2],y),col=2,add=T)
auc(roc(rf.default$votes[,2],y))
auc(roc(rf.robust$votes[,2],y))

#compute feature contributions
ff = forestFloor(rf.robust,X,binary_reg=T,calc_np=T)
Col = fcol(ff,cols=1,outlier.lim=2.5)

#the plot in this blog
plot(ff,col=Col,plot_GOF=T)

#some 3d plots
show3d(ff,c(1,5),5,col=Col,plot_GOF=T)
library(rgl)
rgl.snapshot("3dPressure.png")