Topological data analysis is a relatively new area of data science
which can compare and contrast data sets via non-linear global
structure. The main tool of topological data analysis, *persistent
homology* (Edelsbrunner, Letscher, and
Zomorodian 2000; Zomorodian and Carlsson 2005), builds on
techniques from the field of algebraic topology to describe shape
features present in a data set (stored in a “persistence diagram”).
Persistent homology has been used in a number of applications,
including

- predicting stock market crashes from the topology of stock price correlations over time (Yen and Cheong 2021),
- finding non-trivial and complex topological structure in local patches of naturalistic images (G. E. Carlsson et al. 2007),
- translating sentences in one language into another language (from a set of candidate sentences) using the persistence diagrams of their word embeddings (Haim Meirom and Bobrowski 2022),
- improving model predictions of various chemical properties of molecules by including topological features (Krishnapriyan 2021), and
- distinguishing between the topology of human brain function of healthy control subjects vs. subjects with a neurological disorder (Gracia-Tabuenca et al. 2020; Hyekyoung et al. 2014; Lee et al. 2011).

For a broad introduction to the mathematical background and main tools of topological data analysis with applications, see (Chazal and Michel 2017; G. Carlsson and Vejdemo-Johansson 2021).

Traditional data science pipelines in academia and industry are focused on machine learning and statistical inference for structured (tabular) data, and can answer questions like:

- How well can a label variable be predicted from feature variables?
- What are the latent subgroups/dimensions of a dataset?
- Are the subgroups of a dataset, defined by factor features, distinguishable?

While persistence diagrams have been found to be a useful summary of datasets in many domains, they are not structured data and therefore require special analysis methods. Some papers (for example (Robinson and Turner 2017; Le and Yamada 2018)) have described post-processing pipelines for analyzing persistence diagrams built on distance (Kerber, Morozov, and Nigmetov 2017) and kernel (Le and Yamada 2018) calculations; however, these papers lack publicly available implementations in R (and Python), and many more data science methods are possible using such calculations (Murphy 2012; Scholkopf, Smola, and Muller 1998; Gretton et al. 2007; Cox and Cox 2008; Dhillon, Guan, and Kulis 2004).

**TDApplied** is the first R package which provides
applied analysis implementations of published methods for analyzing
persistence diagrams using machine learning and statistical inference.
Its functions contain highly optimized and scalable code (see the
package vignette “Benchmarking and Speedups”) and have been tested and
validated (see the package vignette “Comparing Distance Calculations”).
**TDApplied** can interface with other data science
packages to perform powerful and flexible analyses (see the package
vignette “Personalized Analyses with **TDApplied**”), and
an example usage of **TDApplied** on real data has been
demonstrated (see the package vignette “Human Connectome Project
Analysis”).

This vignette documents the background of **TDApplied**
functions and the usage of those functions on simulated data, by
considering a typical data analysis workflow for topological data
analysis:

- Computing and comparing persistence diagrams.
- Visualizing and interpreting persistence diagrams.
- Analyzing statistical properties of groups of persistence diagrams.
- Finding latent structure in groups of persistence diagrams.
- Predicting labels from persistence diagram structure.

To start we must load the **TDApplied** package:

`library("TDApplied")`

Let’s get started!

## `PyH` Function

The main tool of topological data analysis is called *persistent
homology* (Edelsbrunner, Letscher, and
Zomorodian 2000; Zomorodian and Carlsson 2005). Persistent
homology starts with either data points and a distance function, or a
distance matrix storing distance values between data points. It assumes
that these points arise from a dataset with some kind of “shape”. This
“shape” has certain features that exist at various scales, but sampling
induces noise in these features. Persistent homology aims to describe
certain mathematical features of this underlying shape, by forming
approximations to the shape at various distance scales. The mathematical
features which are tracked include clusters (connected components),
loops (ellipses) and voids (spheres), which are examples of
*cycles* (i.e. different types of holes). The *homological
dimension* of these features is 0, 1 and 2, respectively. What is interesting about these particular mathematical features is that they can tell us where our data is *not* – important information which most other data analysis methods cannot provide.

The persistent homology algorithm proceeds in the following manner:
first, if the input is a dataset and distance metric, then the distance
matrix, storing the distance metric value of each pair of points in the
dataset, is computed. Next, a parameter \(\epsilon \geq 0\) is grown starting at 0,
and at each \(\epsilon\) value we
compute a shape approximation of the dataset \(C_{\epsilon}\), called a *simplicial
complex* or in this case a *Rips complex*. We construct \(C_{\epsilon}\) by connecting all pairs of
points whose distance is at most \(\epsilon\). To encode higher-dimensional
structure in these approximations, we also add a triangle between any
triple of points which are all connected (note that no triangles are
formally shaded on the above diagram, even though there are certainly
triples of connected points), a tetrahedron between any quadruple of
points which are all connected, etc. Note that this process of forming a
sequence of skeletal approximations is called a *Rips-Vietoris
filtration*, and other methods exist for forming the
approximations.
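As a toy illustration of this construction, the edge set of a Rips complex at a fixed \(\epsilon\) can be computed from a distance matrix in a few lines of base R. This sketch is for intuition only and is not how **TDApplied** builds filtrations (that is delegated to **ripser**):

```
# sketch: edges of a Rips complex at scale eps, from a dataset X
# (illustration only -- not part of TDApplied)
rips_edges <- function(X, eps){
  D <- as.matrix(stats::dist(X)) # pairwise Euclidean distances
  pairs <- which(D <= eps & upper.tri(D), arr.ind = TRUE)
  as.data.frame(pairs) # each row is an edge (i,j) with d(i,j) <= eps
}
```

Running `rips_edges` with a growing sequence of `eps` values mimics the filtration: every edge present at a smaller scale is still present at a larger one.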

At any given \(\epsilon\) value,
some topological features will exist in \(C_{\epsilon}\). As \(\epsilon\) grows, the \(C_{\epsilon}\)’s will contain each other,
i.e. if \(\epsilon_{1} <
\epsilon_{2}\), then every edge (triangle, tetrahedron etc.) in
\(C_{\epsilon_1}\) will also be present
in \(C_{\epsilon_2}\). Each topological
feature of interest will be “born” at some \(\epsilon_{birth}\) value, and “die” at some \(\epsilon_{death}\) value –
certainly each feature will die once the whole dataset is connected and
has trivial shape structure. Consider the example of a loop – a loop
will be “born” when the last connection around the circumference of the
loop is connected (at the \(\epsilon\)
value which is the largest distance between consecutive points around
the loop), and the loop will “die” when enough connections across the
loop fill in its hole. Since the topological features are tracked across
multiple scales, we can estimate their (static) location in the data,
i.e. find the points lying on these structures, by calculating what are
called *representative cycles*.

The output of persistent homology, a *persistence diagram*,
has one 2D point for each topological feature found in the filtration
process in each desired homological dimension, where the \(x\)-value of the point is the birth \(\epsilon\)-value and the \(y\)-value is the death \(\epsilon\)-value. Hence every point in a
persistence diagram lies above the diagonal – features die after they
are born! The difference between a point’s \(y\)- and \(x\)-values, \(y-x\), is called the “persistence” of the
corresponding topological feature. Points which have high (large)
persistence likely represent real topological features of the dataset,
whereas points with low persistence likely represent topological
noise.
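Since **TDApplied** stores persistence diagrams as data frames with `dimension`, `birth` and `death` columns, the persistence of each feature is simply the difference of the last two columns. A small sketch with a made-up diagram:

```
# compute the persistence of each feature in a toy diagram data frame
toy_diag <- data.frame(dimension = c(0, 1),
                       birth = c(0, 0.5),
                       death = c(0.3, 1.7))
toy_diag$persistence <- toy_diag$death - toy_diag$birth # 0.3 and 1.2
```

Features with large values in the `persistence` column are the candidates for real topological structure.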

For a more practical and scalable computation of persistence diagrams, a method called *persistent cohomology* (Silva and Vejdemo-Johansson 2011b, 2011a) has been introduced, which calculates the exact same output as persistent homology (with “representative co-cycles” analogous to persistent homology’s representative cycles), just much faster. Persistent cohomology is implemented in the C++ library **ripser** (Bauer 2015), which is wrapped in R in the **TDAstats** package (Wadhwa et al. 2019) and in Python in the
**ripser** module. However, it was observed in simulations
that the Python implementation of **ripser** seemed faster,
even when called in R via the reticulate package (Ushey, Allaire, and Tang 2022) (see the package
vignette “Benchmarking and Speedups”). Therefore, the
**TDApplied** `PyH` function has been implemented as a wrapper of the Python **ripser** module for fast calculations of persistence diagrams.

There are three prerequisites that must be satisfied in order to use
the `PyH` function:

- The reticulate package must be installed.
- Python must be installed and configured to work with reticulate.
- The **ripser** Python module must be installed, via `reticulate::py_install("ripser")`. Some Windows machines have had issues with recent versions of the **ripser** module, but version 0.6.1 has been tested and does work on Windows, so Windows users may try `reticulate::py_install("ripser==0.6.1")`.
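A minimal sketch for checking these prerequisites from R, assuming nothing about your particular Python configuration, could look like:

```
# sketch: check the PyH prerequisites before use
if (requireNamespace("reticulate", quietly = TRUE)) {
  if (reticulate::py_available(initialize = TRUE)) {
    # TRUE if the ripser module is importable from the configured Python
    reticulate::py_module_available("ripser")
  }
}
```

If `py_module_available("ripser")` returns `FALSE`, run the `py_install` command above before calling `PyH`.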

For a sample use of `PyH` we will use the following pre-loaded dataset called “circ” (which is stored as a data frame in this vignette):
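The exact “circ” data is pre-loaded in the vignette, but a similar dataset can be simulated by sampling points from the unit circle (a sketch; the variable name `circ_sim` is ours, not part of the package):

```
# simulate a dataset like "circ": 50 points sampled around the unit circle
set.seed(123)
theta <- stats::runif(50, min = 0, max = 2*pi)
circ_sim <- data.frame(x = cos(theta), y = sin(theta))
```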

We would then calculate the persistence diagram as follows:

```
# import the ripser module
ripser <- import_ripser()
# calculate the persistence diagram
diag <- PyH(X = circ,maxdim = 1,thresh = 2,ripser = ripser)
# view last five rows of the diagram
diag[47:51,]
```

```
#> dimension birth death
#> 47 0 0.0000000 0.2545522
#> 48 0 0.0000000 0.2813237
#> 49 0 0.0000000 0.2887432
#> 50 0 0.0000000 2.0000000
#> 51 1 0.5579783 1.7385925
```

In the package vignette “Benchmarking and Speedups”, simulations are
used to demonstrate the practical advantages of using `PyH` to calculate persistence diagrams compared to other approaches.

Note that the installation status of Python for `PyH` is checked using the function `reticulate::py_available()`, which according to several online forums does not always behave as expected. If error messages occur when using **TDApplied** functions regarding Python not being installed, then we recommend consulting online resources to ensure that the `py_available` function returns `TRUE` on your system. Due to the complicated dependencies required to use the `PyH` function, it is only an optional function in the **TDApplied** package, and therefore the reticulate package is only suggested in the **TDApplied** namespace.

## `diagram_to_df` Function

The most typical data structure used in R for data science is a data
frame. The output of the `PyH` function is a data frame (unless representatives are calculated, in which case the output is a list containing a data frame and other information), but the persistence diagrams calculated with the popular R packages **TDA** (B. T. Fasy et al. 2021) and **TDAstats** (Wadhwa et al. 2019) are not stored in data frames, making subsequent machine learning and inference analyses of these diagrams challenging. To solve this problem, the **TDApplied** function `diagram_to_df` can convert **TDA**/**TDAstats** persistence diagrams into data frames:

```
# convert TDA diagram into data frame
diag1 <- TDA::ripsDiag(circ,maxdimension = 1,maxscale = 2,library = "dionysus")
diag1_df <- diagram_to_df(diag1)
class(diag1_df)
```

`#> [1] "data.frame"`

```
# convert TDAstats diagram into data frame
diag2 <- TDAstats::calculate_homology(circ,dim = 1,threshold = 2)
diag2_df <- diagram_to_df(diag2)
class(diag2_df)
```

`#> [1] "data.frame"`

When a persistence diagram is calculated with either `PyH`, `ripsDiag` or `alphaComplexDiag` and contains representatives, `diagram_to_df` only returns the persistence diagram data frame (i.e. the representatives are ignored).

## `diagram_distance` and `diagram_kernel` Functions

Earlier we mentioned that persistence diagrams do not form structured data, and now we will give an intuitive argument for why this is the case. A persistence diagram \(\{(x_1,y_1),\dots,(x_n,y_n)\}\) containing \(n\) topological features can be represented by a vector of length \(2n\), \((x_1,y_1,x_2,y_2,\dots,x_n,y_n)\). However, we cannot easily combine vectors calculated in this way into a table with a fixed number of feature columns because

- persistence diagrams can contain different numbers of features, meaning their vectors would be of different lengths, and
- the ordering of the features is arbitrary, calling into question what the right way to compare the vectors would be.

Fortunately, we can still apply many machine learning and inference techniques to persistence diagrams provided we can quantify how near (similar) or far (distant) they are from each other, and these calculations are possible with distance and kernel functions.

There are several ways to compute distances between persistence
diagrams in the same homological dimension. The most common two are
called the *2-wasserstein* and *bottleneck* distances
(Kerber, Morozov, and Nigmetov 2017; Edelsbrunner
and Harer 2010). These techniques find an optimal matching of the
2D points in their input two diagrams, and compute a cost of that
optimal matching. A point from one diagram is allowed either to be
paired (matched) with a point in the other diagram or its diagonal
projection, i.e. the nearest point on the diagonal line \(y=x\) (matching a point to its diagonal
projection is essentially saying that feature is likely topological
noise because it died very soon after it was born).

Allowing points to be paired with their diagonal projections both allows for matchings of persistence diagrams with different numbers of points (which is almost always the case in practice) and formalizes the idea that some points in a persistence diagram represent noise. The “cost” value associated with a matching is given by either (i) the maximum of the infinity-norm distances between paired points, or (ii) the square root of the sum of squared infinity-norm distances between paired points. The cost of the optimal matching under loss (i) is called the bottleneck distance of persistence diagrams, and the cost of the optimal matching under loss (ii) is called the 2-wasserstein metric of persistence diagrams. Both distance metrics have been used in a number of applications, but the 2-wasserstein metric is able to find more fine-scale differences in persistence diagrams than the bottleneck distance. The problem of finding an optimal matching can be solved with the Hungarian algorithm, which is implemented in the R package **clue** (Hornik 2005).

We will introduce three new simple persistence diagrams, D1, D2 and D3, for examples in this section (and future ones):
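The construction code for D1, D2 and D3 is not reproduced here; one possible reconstruction as data frames, consistent with the wasserstein and bottleneck distance values computed later in this section (our sketch, not necessarily the vignette's exact code), is:

```
# three simple dimension-0 persistence diagrams as data frames
D1 <- data.frame(dimension = 0, birth = 2, death = 3)
D2 <- data.frame(dimension = 0, birth = c(2, 5), death = c(3.3, 5.5))
D3 <- data.frame(dimension = 0, birth = c(2, 5), death = c(3.5, 5.5))
```

Intuitively, D2's first point is closer to D1's single point than D3's first point is, and the extra point in D2 and D3 is low-persistence and can be matched to the diagonal.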

Here is a plot of the optimal matchings between D1 and D2, and between D1 and D3:

In the picture we can see that there is a “better” matching between D1 and D2 compared to D1 and D3, so the (wasserstein/bottleneck) distance value between D1 and D2 would be smaller than that of D1 and D3.

Another distance metric between persistence diagrams, which will be
useful for kernel calculations, is called the *Fisher information
metric*, \(d_{FIM}(D_1,D_2,\sigma)\) (Le and Yamada 2018). The idea is to represent
the two persistence diagrams as probability density functions, with a
2D-Gaussian point mass centered at each point in both diagrams
(including the diagonal projections of the points in the opposite
diagram), all of variance \(\sigma^2 >
0\), and calculate how much those distributions disagree on their
pdf value at each point in the plane (called their *Fisher
information metric*).

Points in the rightmost plot which are close to white in color have the most similar pdf values in the two distributions, and would not contribute to a large distance value; however, having more points with a red color would contribute to a larger distance value.

The wasserstein and bottleneck distances have been implemented in the **TDApplied** function `diagram_distance`. We can confirm that the distance between D1 and D2 is smaller than that between D1 and D3 for both distances:

```
# calculate 2-wasserstein distance between D1 and D2
diagram_distance(D1,D2,dim = 0,p = 2,distance = "wasserstein")
#> [1] 0.3905125
# calculate 2-wasserstein distance between D1 and D3
diagram_distance(D1,D3,dim = 0,p = 2,distance = "wasserstein")
#> [1] 0.559017
# calculate bottleneck distance between D1 and D2
diagram_distance(D1,D2,dim = 0,p = Inf,distance = "wasserstein")
#> [1] 0.3
# calculate bottleneck distance between D1 and D3
diagram_distance(D1,D3,dim = 0,p = Inf,distance = "wasserstein")
#> [1] 0.5
```

There is a generalization of the 2-wasserstein distance for any \(p \geq 1\), the *p-wasserstein* distance, which can also be computed using the `diagram_distance` function by varying the parameter `p`, although \(p = 2\) seems to be the most popular parameter choice.
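For instance, setting `p = 3` computes the 3-wasserstein distance between two diagrams (we have not verified the numeric output, so none is shown):

```
# 3-wasserstein distance between D1 and D2 (p may be any value >= 1)
diagram_distance(D1, D2, dim = 0, p = 3, distance = "wasserstein")
```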

The `diagram_distance` function can also calculate the Fisher information metric between persistence diagrams:

```
# Fisher information metric calculation between D1 and D2 for sigma = 1
diagram_distance(D1,D2,dim = 0,distance = "fisher",sigma = 1)
#> [1] 0.02354779
# Fisher information metric calculation between D1 and D3 for sigma = 1
diagram_distance(D1,D3,dim = 0,distance = "fisher",sigma = 1)
#> [1] 0.08821907
```

Again, D1 and D2 are less different than D1 and D3 using the Fisher information metric.

A fast approximation to the Fisher information metric was described in (Le and Yamada 2018), and C++ code in the GitHub repository https://github.com/vmorariu/figtree was used to calculate this approximation in Matlab. Using the Rcpp package (Eddelbuettel and Francois 2011), this code is included in **TDApplied**, and the approximation can be calculated by providing the positive `rho` parameter:

```
# Fisher information metric calculation between D1 and D2 for sigma = 1
diagram_distance(D1,D2,dim = 0,distance = "fisher",sigma = 1)
```

`#> [1] 0.02354779`

```
# fast approximate Fisher information metric calculation between D1 and D2 for sigma = 1
diagram_distance(D1,D2,dim = 0,distance = "fisher",sigma = 1,rho = 0.001)
```

`#> [1] 0.02354779`

Now we will explore calculating similarity of persistence diagrams
using kernel functions. A kernel function is a special (positive
semi-definite) symmetric similarity measure between objects in some
complicated space which can be used to project data into a space
suitable for machine learning (Murphy
2012). Some examples of machine learning techniques which can be
“kernelized” when dealing with complicated data are *k-means*
(kernel k-means), *principal components analysis* (kernel PCA),
and *support vector machines* (SVM) which are inherently based on
kernel calculations.

There have been, to date, four main kernels proposed for persistence
diagrams. In **TDApplied** the persistence Fisher kernel
(Le and Yamada 2018) has been implemented
because of its practical advantages over the other kernels – smaller
cross-validation SVM error on a number of test data sets and a faster
method for cross validation. For information on the other three kernels
see (Kusano, Fukumizu, and Hiraoka 2018;
Carriere, Cuturi, and Oudot 2017; Reininghaus et al. 2014).

The persistence Fisher kernel is computed directly from the Fisher information metric between two persistence diagrams: let \(\sigma > 0\) be the parameter for \(d_{FIM}\), and let \(t > 0\). Then the persistence Fisher kernel is defined as \(k_{PF}(D_1,D_2) = \exp(-t \cdot d_{FIM}(D_1,D_2,\sigma))\).
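This relationship can be checked directly in R (a sketch, assuming the diagrams D1 and D2 from above):

```
# the kernel value is an exponential transformation of the
# Fisher information metric: k_PF = exp(-t * d_FIM)
d <- diagram_distance(D1, D2, dim = 0, distance = "fisher", sigma = 2)
exp(-2 * d) # should equal diagram_kernel(D1, D2, dim = 0, sigma = 2, t = 2)
```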

Computing the persistence Fisher kernel can be achieved with the `diagram_kernel` function in **TDApplied**:

```
# calculate the kernel value between D1 and D2 with sigma = 2, t = 2
diagram_kernel(D1,D2,dim = 0,sigma = 2,t = 2)
#> [1] 0.9872455
# calculate the kernel value between D1 and D3 with sigma = 2, t = 2
diagram_kernel(D1,D3,dim = 0,sigma = 2,t = 2)
#> [1] 0.9707209
```

As before, D1 and D2 are more similar than D1 and D3, and if desired a fast approximation to the kernel value can be computed.
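Assuming `diagram_kernel` forwards the `rho` parameter to the underlying Fisher information metric calculation, as `diagram_distance` does, the fast approximation would be requested like so (a sketch; output not shown):

```
# fast approximate kernel value between D1 and D2, via rho
diagram_kernel(D1, D2, dim = 0, sigma = 2, t = 2, rho = 0.001)
```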

## `plot_diagram` Function

Persistence diagrams can contain any number of points representing different types of topological features from different homological dimensions. However, we can easily view this information in a two-dimensional plot of the birth and death values of the topological features, where each homological dimension has a unique point shape and color, using **TDApplied’s** `plot_diagram` function:

`plot_diagram(diag,title = "Circle diagram")`