This figure from the paper uses a simulated data set (1) [y = -X1^2 – cos(-X2) + noise], which is modelled with a random forest. (2) shows the out-of-bag predictions and (3) the predictions for a test set. The feature contribution method is used to decompose (2) and (3) into (2a/2b) and (3a/3b). Projections of the feature contributions (grey surfaces) split the model into non-linear main effects and interactions, a useful approach for investigating the structure of a learned random forest model. Goodness-of-visualization: it is possible to test how well a visualization describes the random forest model structure by testing how well the high-dimensional model structure can be reconstructed from the low-dimensional visualizations.
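A minimal sketch of how the simulated example above could be reproduced, assuming the randomForest and forestFloor packages are installed from CRAN (the sample size, noise level and uniform feature ranges are my assumptions, not taken from the paper):

```r
library(randomForest)
library(forestFloor)

# simulate the data set from the figure caption: y = -X1^2 - cos(-X2) + noise
set.seed(1)
n = 1000
X = data.frame(X1 = runif(n, -2, 2), X2 = runif(n, -2, 2))
y = -X$X1^2 - cos(-X$X2) + rnorm(n, sd = 0.1)

# keep.inbag = TRUE is needed so forestFloor can compute feature contributions
rf = randomForest(X, y, keep.inbag = TRUE, ntree = 500)

# decompose the out-of-bag predictions into per-feature contributions
ff = forestFloor(rf, X)
plot(ff, plot_GOF = TRUE)  # one panel per feature: its estimated main effect
```

With this data-generating process, the X1 panel should trace out the -X1^2 parabola and the X2 panel the -cos(-X2) curve.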


forestFloor on CRAN – forestFloor development on GitHub

Two examples from the forum *Cross Validated*: [1] pima_indians, [2] simulated_data

Here’s a teaser on how to visualize the structure of an RF model fitted to the PimaIndiansDiabetes dataset; see the R code below:

Title: forestFloor: a package to visualize and comprehend the full curvature of random forests, Søren Havelund Welling, DTU Compute; Novo Nordisk, Insulin Pharmacology Research:

forestFloor is an add-on to the randomForest[1] package. It enables users to explore the curvature of a random forest model fit. In general, for any problem where a random forest has superior prediction performance, it is of great interest to learn its model mapping. Even within statistical fields where random forest is far from standard practice, such insight from a data-driven analysis can inspire improvements to a given model-driven analysis. forestFloor is machine learning to learn from the machine! The mapping function of a random forest model is most often high-dimensional and therefore difficult to visualize and interpret. However, with a new concept, feature contributions[2-3], it is possible to split the random forest mapping function into additive components and understand the full curvature. Hereby the forestFloor package provides greatly extended functionality compared to the partial dependence plot in the randomForest package. Exploring the curvature of random forests through series of 2D/3D plots with colour gradients is fun and quite intuitive. forestFloor relies on, among others, the Rcpp, rgl and kknn packages to produce visualizations fast and smoothly.
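The additive decomposition mentioned above can be checked numerically: each out-of-bag prediction is (approximately) a baseline plus the sum of that observation's feature contributions. A hedged sketch, assuming the randomForest and forestFloor packages from CRAN and the `FCmatrix` slot documented for forestFloor objects:

```r
library(randomForest)
library(forestFloor)

# small regression toy problem (my own choice, not from the paper)
set.seed(1)
X = data.frame(x1 = rnorm(200), x2 = rnorm(200))
y = X$x1 + X$x2^2 + rnorm(200, sd = 0.1)

rf = randomForest(X, y, keep.inbag = TRUE, ntree = 500)
ff = forestFloor(rf, X)

# reconstruct predictions: baseline (roughly the training mean) plus
# the row-wise sum of feature contributions
recon = rowSums(ff$FCmatrix) + mean(y)

# should correlate very strongly with the out-of-bag predictions
cor(recon, rf$predicted)
```

The correlation quantifies how exactly the additive components recover the forest's predictions; the same idea underlies the goodness-of-visualization test described for the figure.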

Code for the pima_indians example:

rm(list=ls())
set.seed(1)
library(mlbench)
library(randomForest)
library(forestFloor)
library(AUC)

data(PimaIndiansDiabetes)
y = PimaIndiansDiabetes$diabetes
X = PimaIndiansDiabetes
X = X[,!names(X)=="diabetes"]

#train default model and the most regularized model with same predictive performance
rf.default = randomForest(X,y,ntree=5000)
rf.robust  = randomForest(X,y,sampsize=25,ntree=5000,mtry=4,keep.inbag=T,keep.forest=T)

#verify similar performance
plot(roc(rf.default$votes[,2],y),main="ROC: default black, robust is red")
plot(roc(rf.robust$votes[,2],y),col=2,add=T)
auc(roc(rf.default$votes[,2],y))
auc(roc(rf.robust$votes[,2],y))

#compute feature contributions
ff = forestFloor(rf.robust,X,binary_reg=T,calc_np=T)
Col = fcol(ff,cols=1,outlier.lim=2.5)

#the plot in this blog
plot(ff,col=Col,plot_GOF=T)

#some 3D plots
show3d(ff,c(1,5),5,col=Col,plot_GOF=T)
library(rgl)
rgl.snapshot("3dPressure.png")