`OTclust` is an R package for computing a mean partition of an ensemble of clustering results by optimal transport alignment (OTA) and for assessing uncertainty at the levels of both partition and individual clusters. To measure uncertainty, set relationships between clusters in multiple clustering results are revealed. Functions are provided to compute the Covering Point Set (CPS), Cluster Alignment and Points based (CAP) separability, and Wasserstein distance between partitions.

Here, we illustrate the usage of `OTclust` for ensemble clustering on a simulated toy example:

```
# load the package and the simulated toy data.
library(OTclust)
data(sim1)
# the number of clusters.
C = 4
# generate an ensemble of perturbed partitions.
# if perturb_method is 1, the data are perturbed by bootstrap resampling; if it is 0, by adding Gaussian noise.
ens.data = ensemble(sim1$X, nbs=100, clust_param=C, clustering="kmeans", perturb_method=1)
```
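The idea of a bootstrap-perturbed ensemble can be sketched in base R. This is only an illustration of the general technique, not the implementation behind `ensemble`: the toy data, the helper `nearest_center`, and the points-by-partitions layout of `ens` are all assumptions made for this example.

```
set.seed(1)
# Toy data (hypothetical, not sim1): two well-separated Gaussian blobs in 2D.
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
n <- nrow(X)   # 100 points
k <- 2         # number of clusters
nbs <- 20      # number of perturbed partitions

# Hypothetical helper: assign every row of X to its nearest center
# (squared Euclidean distance).
nearest_center <- function(X, centers) {
  d <- sapply(seq_len(nrow(centers)),
              function(j) rowSums(sweep(X, 2, centers[j, ])^2))
  max.col(-d, ties.method = "first")
}

# Fit k-means on a bootstrap resample, then label all original points,
# so that every partition covers the same n points.
ens <- sapply(seq_len(nbs), function(b) {
  idx <- sample(n, n, replace = TRUE)
  fit <- kmeans(X[idx, , drop = FALSE], centers = k, nstart = 5)
  nearest_center(X, fit$centers)
})
dim(ens)  # one row per point, one column per partition
```

Cluster labels are arbitrary across columns of `ens`, which is why the partitions have to be aligned before they can be summarized.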

To find a consensus partition, the function

```
# find the mean partition and uncertainty statistics
# (assumes the package's otclust function applied to the ensemble above).
ota = otclust(ens.data)
# compute a baseline clustering for comparison.
kcl = kmeans(sim1$X,C)
# align clustering results for convenience of comparison.
compar = align(cbind(sim1$z,kcl$cluster,ota$meanpart))
lab.match = lapply(compar$weight,function(x) apply(x,2,which.max))
kcl.algnd = match(kcl$cluster,lab.match[[1]])
ota.algnd = match(ota$meanpart,lab.match[[2]])
```
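The relabeling step (`which.max` over the columns of a weight matrix, followed by `match`) can be traced on a small example. The sketch below substitutes a plain contingency table for the optimal transport weights returned by `align`, and it assumes rows index the clusters of the partition being relabeled and columns index the reference clusters; both choices are assumptions for illustration.

```
# Hypothetical reference labels and a second partition of the same 9 points.
ref <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
new <- c(2, 2, 2, 3, 3, 1, 1, 1, 1)

# Overlap counts: rows are clusters of `new`, columns are clusters of `ref`.
w <- table(new, ref)

# For each reference cluster, the label in `new` that overlaps it most.
lab <- apply(w, 2, which.max)

# Rename each label in `new` to the reference cluster it was matched with.
new.algnd <- match(new, lab)

new.algnd
mean(new.algnd == ref)  # share of points that agree after relabeling
```

After relabeling, the two partitions disagree only on the sixth point, which makes side-by-side comparison straightforward.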

As cluster-wise uncertainty measures, we briefly introduce the topological relationship statistics of the mean partition, cluster alignment and points based (CAP) separability, and covering point sets (CPS). Detailed definitions of these statistics can be found in [1]. Moreover, if you want to carry out CPS analysis, please see the next two sections.

```
# distance between ground truth and each partition
wassDist(sim1$z,kmeans(sim1$X,C)$cluster) # baseline method
#> [1] 0.254152
wassDist(sim1$z,ota$meanpart) # mean partition by OTclust
#> [1] 0.2498597
# Topological relationships between mean partition and ensemble clusters
t(ota$match)
#>       C1 C2 C3 C4
#> match 82 98 82 82
#> split  0  1  0  0
#> merge  0  0  1  1
#> l.c.  18  1 17 17
# Cluster Alignment and Points based (CAP) separability
ota$cap
#>           C1        C2        C3        C4
#> C1 0.0000000 0.9121878 0.9993012 1.0000000
#> C2 0.9121878 0.0000000 1.0000000 0.9967047
#> C3 0.9993012 1.0000000 0.0000000 0.9392917
#> C4 1.0000000 0.9967047 0.9392917 0.0000000
```
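One way to read these tables is to turn the counts into proportions and to look for the least separated pair of clusters. The sketch below copies the numbers from the printed output above rather than recomputing them, and it assumes that a larger CAP value means better separation, as the name suggests.

```
# Counts copied from the printed t(ota$match) output (100 ensemble partitions).
rel <- matrix(c(82, 98, 82, 82,
                 0,  1,  0,  0,
                 0,  0,  1,  1,
                18,  1, 17, 17),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("match", "split", "merge", "l.c."),
                              paste0("C", 1:4)))
colSums(rel)                               # every column adds up to nbs = 100
prop <- sweep(rel, 2, colSums(rel), "/")   # proportions per mean-partition cluster
prop["match", ]                            # share of partitions with a matching cluster

# CAP values copied from the printed ota$cap output.
cap <- matrix(c(0.0000000, 0.9121878, 0.9993012, 1.0000000,
                0.9121878, 0.0000000, 1.0000000, 0.9967047,
                0.9993012, 1.0000000, 0.0000000, 0.9392917,
                1.0000000, 0.9967047, 0.9392917, 0.0000000),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("C", 1:4), paste0("C", 1:4)))
off <- cap
diag(off) <- NA                            # ignore self-comparisons
which(off == min(off, na.rm = TRUE), arr.ind = TRUE)  # least separated pair
```

In this output, C2 is matched in 98 of the 100 partitions, while C1 and C2 form the pair with the smallest CAP value.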

```
# Covering Point Set (CPS)
otplot(sim1$X,ota$cps[lab.match[[2]][1],],legend.labels=c('','CPS'),add.text=FALSE,title='CPS for C1')
#> Warning: Removed 2 rows containing missing values (geom_text).
otplot(sim1$X,ota$cps[lab.match[[2]][2],],legend.labels=c('','CPS'),add.text=FALSE,title='CPS for C2')
#> Warning: Removed 2 rows containing missing values (geom_text).
otplot(sim1$X,ota$cps[lab.match[[2]][3],],legend.labels=c('','CPS'),add.text=FALSE,title='CPS for C3')
#> Warning: Removed 2 rows containing missing values (geom_text).
otplot(sim1$X,ota$cps[lab.match[[2]][4],],legend.labels=c('','CPS'),add.text=FALSE,title='CPS for C4')
#> Warning: Removed 2 rows containing missing values (geom_text).
```

The red areas in the above plots indicate the covering point set (CPS) of each cluster. The details of CPS analysis are addressed in the next section.

The functions that are going to be used in this section are

```
# CPS analysis on selection of visualization methods
data(vis_pollen)
c=visCPS(vis_pollen$vis, vis_pollen$ref)
```

After the computation, we have the return list c, which would be the input of function

Furthermore, if you want to see the statistics, you can simply view the return of `visCPS`:

```
# overall tightness
c$tight_all
#> [1] 0.5166624
# cluster-wise tightness
c$tight
#>                                   1         2 3         4         5
#> Tightness of each cluster 0.2134804 0.7115383 1 0.6092218 0.9272868
#>                                   6        7         8         9        10
#> Tightness of each cluster 0.4363253 0.435473 0.2177813 0.1285714 0.4454768
#>                                  11
#> Tightness of each cluster 0.5581313
```
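The cluster-wise values can be ranked to find the clusters that are least tight. The sketch below copies the numbers from the printed output above. In this particular output the overall tightness equals the unweighted mean of the cluster-wise values; whether the package defines it that way in general is an assumption.

```
# Cluster-wise tightness copied from the printed c$tight output.
tight <- c(0.2134804, 0.7115383, 1, 0.6092218, 0.9272868,
           0.4363253, 0.435473, 0.2177813, 0.1285714, 0.4454768,
           0.5581313)
names(tight) <- seq_along(tight)

# The three least tight clusters.
sort(tight)[1:3]

# Compare the unweighted mean with the printed overall tightness.
tight_all <- 0.5166624
abs(mean(tight) - tight_all) < 1e-6
```

Here clusters 9, 1, and 8 have the lowest tightness, so they are natural candidates for a closer look.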

In this section, the relevant functions are

```
# CPS Analysis on validation of clustering result
data(YAN)
y=clustCPS(YAN, k=7, l=FALSE, pre=FALSE, noi="after", cmethod="kmeans", dimr="PCA", vis="tsne")
#> Warning in min(ref): no non-missing arguments to min; returning Inf
#> sigma summary: Min. : 0.323162264525782 |1st Qu. : 0.686532727791371 |Median : 0.840637685950217 |Mean : 0.832540338898672 |3rd Qu. : 0.996223616580691 |Max. : 1.26695806934483 |
#> Epoch: Iteration #100 error is: 14.568977374827
#> Epoch: Iteration #200 error is: 0.485179542650453
#> Epoch: Iteration #300 error is: 0.47141108056016
#> Epoch: Iteration #400 error is: 0.422772036027473
#> Epoch: Iteration #500 error is: 0.422283242087265
#> Epoch: Iteration #600 error is: 0.42178674345771
#> Epoch: Iteration #700 error is: 0.421785891226059
#> Epoch: Iteration #800 error is: 0.421785890574734
#> Epoch: Iteration #900 error is: 0.421785890574302
#> Epoch: Iteration #1000 error is: 0.421785890574302
# visualization of the results
mplot(y,4)
cplot(y,4)
# point-wise stability assessment
p=pplot(y)
p$v
```

If you want to try other clustering method rather than

[1] Jia Li, Beomseok Seo, and Lin Lin. “Optimal transport, mean partition, and uncertainty assessment in cluster analysis.” Statistical Analysis and Data Mining: The ASA Data Science Journal 12.5 (2019): 359-377.

[2] Lixiang Zhang, Lin Lin, and Jia Li. “CPS analysis: self-contained validation of biomedical data clustering.” Bioinformatics 36.11 (2020): 3516-3521.