The aim of this guide is to give a brief introduction and explanation to the functions contained inside the `vivid`

package. `vivid`

(variable importance and variable interaction displays) is used for investigating relationships within a machine learning model fit. All of the visualisations in this vignette are highly customizable (see the long-form `vivid`

vignette to see examples) and return `ggplot`

objects which can be customized using the normal `ggplot`

options.

Note: For the purposes of speed, the grid size (i.e., `gridSize`

- the size of the gid on which the evaluations are made) and the number of rows subsetted (`nmax`

) are small. This achieve more accurate results, incerease both the grid size and the number of rows used.

Some of the plots used by `vivid`

are built upon the `zenplots`

package which requires the `graph`

package from BioConductor. To install the `graph`

and `zenplots`

packages use:

`if (!requireNamespace("graph", quietly = TRUE)){`

`install.packages("BiocManager")`

`BiocManager::install("graph")`

`}`

`install.packages("zenplots")`

Now we can install and load `vivid`

by using:

`install.packages("vivid")`

We then load the other required packages.

```
library(vivid) # for visualisations
library(randomForest) # for model fit
library(ranger) # for model fit
library(ggplot2)
```

The data used in the following examples is simulated from the Friedman benchmark problem 1^{1}. This benchmark problem is commonly used for testing purposes. The output is created according to the equation:

Create the data:

```
set.seed(101)
<- function(noFeatures = 10,
genFriedman noSamples = 100,
sigma = 1) {
# Set Values
<- noSamples # no of rows
n <- noFeatures # no of variables
p <- rnorm(n, sd = sigma)
e
# Create matrix of values
<- matrix(runif(n * p, 0, 1), nrow = n) # Create matrix
xValues colnames(xValues) <- paste0("x", 1:p) # Name columns
<- data.frame(xValues) # Create dataframe
df
# Equation:
# y = 10sin(πx1x2) + 20(x3−0.5)^2 + 10x4 + 5x5 + ε
<- (10 * sin(pi * df$x1 * df$x2) + 20 * (df$x3 - 0.5)^2 + 10 * df$x4 + 5 * df$x5 + e)
y # Adding y to df
$y <- y
df
df
}
<- genFriedman(noFeatures = 9, noSamples = 350, sigma = 1) myData
```

In the following examples, we use a `ranger`

random forest model fit on the data, with the importance set to `permutation`

.

```
set.seed(101)
<- randomForest(y ~ ., data = myData) fit
```

Next, we create the ‘vivi-matrix’, which will contain variable importance on the diagonal and variable interactions in the upper and lower triangle. This matrix can then be supplied to the `vivid`

plotting functions.

```
set.seed(101)
<- vivi(
viFit fit = fit,
data = myData,
response = "y",
gridSize = 10,
importanceType = NULL,
nmax = 100,
reorder = TRUE,
class = 1,
predictFun = NULL
)
```

#Section 2: Visualizing the results

To create a heatmap of the vivi-matrix, we use:

`viviHeatmap(mat = viFit) + ggtitle("random forest fit heatmap")`

To create a network graph of the vivi-matrix, we use:

`viviNetwork(mat = viFit)`

In this plot we use a generalized pairs plot matrix style layout (which we call GPDP) to display partial dependence plots (PDPs) in the upper triangle, individual conditional exception curves (along with the aggregated 1-way partial dependence) on the diagonal and a scatterplot in the lower triangle.

To create the plot we supply the model fit to the plotting function.

```
set.seed(1701)
pdpPairs(data = myData, fit = fit, response = "y", nmax = 50, gridSize = 10)
#> Generating ice/pdp fits... waiting...
#> Finished ice/pdp
```

For this plot, we calculate the bivariate partial dependence and display them in a *zenplots* layout, which we call (ZPDP). The ZPDP is based on graph Eulerians and focuses on key subsets. ‘Zenplots’ create a zigzag expanded navigation plot (‘zenplot’) of the partial dependence values. This results in an alternating sequence of two-dimensional plots laid out in a zigzag structure, as shown in Fig 4.0 below and can be used as a useful space-saving plot that displays the most influential variables.

```
set.seed(1701)
pdpZen(data = myData, fit = fit, response = "y", nmax = 50, gridSize = 10)
#> Generating ice/pdp fits... waiting...
#> Finished ice/pdp
```

In Fig 4.0, we can see PDPs laid out in a zigzag structure, with the most influential variable pairs displayed at the top. As we move down the plot, we also move down in influence of the variable pairs.

Using the `zpath`

argument, we can filter out any interactions below a set value. `zpath`

takes the vivi matrix as a function argument and then, using `cutoff`

, we can filter out any interactions below the chosen value. For example:

```
set.seed(1701)
<- zPath(viv = viFit, cutoff = 0.1)
zpath pdpZen(data = myData, fit = fit, response = "y", nmax = 50, gridSize = 10, zpath = zpath)
#> Generating ice/pdp fits... waiting...
#> Finished ice/pdp
```

In this section, we briefly describe how to apply the above visualisations to a classification example using the `iris`

data set.

To begin we fit a `ranger`

random forest model with “Species” as the response and create the vivi matrix setting the category for classification to be “setosa” using `class`

.

```
set.seed(1701)
<- ranger(Species ~ .,
rfClassif data = iris, probability = T,
importance = "impurity"
)
set.seed(101)
<- vivi(
viviClassif fit = rfClassif,
data = iris,
response = "Species",
gridSize = 10,
importanceType = NULL,
nmax = 50,
reorder = TRUE,
class = "setosa",
predictFun = NULL
)#> Embedded impurity variable importance method used.
#> Calculating interactions...
```

Next we plot the heatmap and network plot of the iris data.

```
set.seed(1701)
viviHeatmap(mat = viviClassif)
```

```
set.seed(1701)
viviNetwork(mat = viviClassif)
```

As PDPs are evaluated on a grid, they can extrapolate where there is no data. To solve this issue we calculate a convex hull around the data and remove any points that fall outside the convex hull. This can be seen in the GPDP in Fig 3.2 below.

```
set.seed(1701)
pdpPairs(data = iris, fit = rfClassif, response = "Species", class = "setosa", convexHull = T, gridSize = 10, nmax = 50)
#> Generating ice/pdp fits... waiting...
#> Finished ice/pdp
```

Finally, a ZPDP for the random forest fit on the iris data with extrapolated data removed:

```
set.seed(1701)
pdpZen(data = iris, fit = rfClassif, response = "Species", class = "setosa", convexHull = T, gridSize = 10, nmax = 50)
#> Generating ice/pdp fits... waiting...
#> Finished ice/pdp
```

Friedman, Jerome H. (1991) Multivariate adaptive regression splines. The Annals of Statistics 19 (1), pages 1-67.↩︎