# How to use weighted hamming on sequence data for cell lineage reconstruction?

## What is Cell Lineage Reconstruction?

The research question is that "Can we reconstruct the lineage of how cells differentiate?". McKenna et al Science (2016) shows the possibility that lineage can be reconstructed through the edited barcode of CRISPR/Cas9 target sites. Scientists can reconstruct the cell lineange based on the information from barcodes of each cell.

The next question is how accurately can we recover the cell lineage? The Allen Institute Cell Lineage Reconstruction DREAM Challenge is hosted to answer the question and search for useful approaches.

## Who We are?

We are Il-Youp Kwak and Wuming Going who participated this competition as a team (Kwak_Gong). We would like to share our methods and experience. Hope our findings helpful to everyone! :)

## The Simulation data

We tried to generate the sequence data, the way Salvador-Martinez and Grillo et al. (2019) described in 'General description of the simulation' from Result section. Basically, it simulate binary structure of cell lineage. For a simple example, given the tree structure of 7 cells below,

    1
2   3
4 5 6 7

The barcode of cells generated as

1: '0000000000'
2: 'E00A000000'
3: '0000C00000'
4: 'EA0A000E00'
5: 'E00A000000'
6: 'E0A0CD0000'
7: '0000C---00'

Here, '0' stands for the initial state, '-' stands for interval dropout, and character letter stands for mutational outcomes.

Here is a code to generate simulation data.

m = 30  # number of targets
acell = as.integer(rep(1,m)) ## mother cell with an unmutated state
mu_d = 0.1  ## mutation rate (ex 0.01 to 0.3)
d = 3  ## number of cell division
n_s = 5 ##  the number of possible mutational outcomes (1:n_s)
p_d = 0.05 ## dropout probability

nmstrings = c( '0', '-', LETTERS[1:n_s] )
sim_tree = list()
sim_tree[[1]] = acell

k = 2
while(k < 2^d) {
## codes for point mutation
mother_cell = sim_tree[[k%/%2]]
mu_loc = runif(m) < mu_d
mutation_cites = (mother_cell == 1) &  mu_loc
n_mut = sum(mutation_cites)
if (n_mut > 0) {
mother_cell[mutation_cites] = as.integer(sample(n_s, n_mut, replace = T)+2)
}

## codes for dropout
dropout_cites = runif(m) < p_d
if (sum(dropout_cites) > 2 ) {
dropout_between = sample(which(dropout_cites), 2 )
mother_cell[dropout_between[1]:dropout_between[2]] = as.integer(2)
}

sim_tree[[k]] = mother_cell
k = k+1
}

Simulated cells are shown below:

1:7 %>% map(~paste(nmstrings[sim_tree[[.]]], collapse=""))
## [[1]]
## [1] "000000000000000000000000000000"
##
## [[2]]
## [1] "0000000000000000000000000CB000"
##
## [[3]]
## [1] "0000E000000000000000000C000C00"
##
## [[4]]
## [1] "00E000000A000000000E00000CB000"
##
## [[5]]
## [1] "00D000AB00000000000000000CB000"
##
## [[6]]
## [1] "00C0EB0E000A0000A000000C0D0C00"
##
## [[7]]
## [1] "A000E000B00000000000000C000CA0"

By extending the code, we made our code sim_seqdata for simulating data.

set.seed(1)
mu_d1 = c( 30, 20, 10, 5, 5, 1, 0.01, 0.001)
mu_d1 = mu_d1/sum(mu_d1)
simn = 100 # number of cell samples
m = 200  ## number of targets
mu_d = 0.03 # mutation rate
d = 12 # number of cell division
p_d = 0.005 # dropout probability

To generate 100 number of cells with 200 number of targets, 0.03 mutation rate, 12 number of cell divisions, 0.005 dropout probability, and 8 number of outcome states with outcome probability of 0.4224698, 0.2816465, 0.1408233, 0.0704116, 0.0704116, 0.0140823, 1.408232510^{-4}, 1.408232510^{-5}, we can run the code below.

sD = sim_seqdata(sim_n = simn, m = m, mu_d = mu_d, d = d, n_s = length(mu_d1), outcome_prob = mu_d1, p_d = p_d )

The result of sim_seqdata function, sD, is a list of two object, 'seqs' and 'tree'.

'seqs' is a sequence data of 'phyDat' object,

sD$seqs ## 100 sequences with 200 character and 200 different site patterns. ## The states are 0 - A B C D E F G H and 'tree' is ground truth tree structure of 'phylo' object. class(sD$tree)
## [1] "phylo"

## Distance estimation using hamming

One easy way to construct a tree is to calculate distance among cells and construct a tree from the distance matrix. One of widely used method for the sequence distance is hamming distance.

distH = dist.hamming(sD$seqs) With Neighbor Joining or fastme by Desper, R. and Gascuel, O. (2002), we can construct tree from the distance matrix. TreeNJ = NJ(distH) TreeFM = fastme.ols(distH) We calculate Robinson-Foulds distance, one of evaluation metric used in the competition, to evaluate performance. print( RF.dist(TreeNJ, sD$tree, normalize = TRUE) )
## [1] 0.5670103
print( RF.dist(TreeFM, sD$tree, normalize = TRUE) ) ## [1] 0.5773196 ## Ideas ### Weighted Hamming Distance I Hamming distance simply count the number of different characters given two sequences. Say we have two sequences '00AB0' and '0-CB0'. Hamming distance is simply 2. The second position, we have '0' and '-', and the third position, we have 'A' and 'C'. Would it be reasonable to put equal weights for these two differences? If not, what difference would be farther? To account for this consideration, we proposed weights for '0' (initial state), '-' (interval dropout), ''(point dropout) and each outcome state (A-Z). Say $$w_1$$ is weight for '0', $$w_2$$ is for '-', $$w_3$$ is for '', and $$w_a, \cdots, w_z$$ are for outcome states A to Z. So, for the given example, '00AB0' and '0-CB0', weighted hamming distance is $$w_1 w_2 + w_a w_c$$ . ### Initcial choice of weights. #### Information entropy weights for $$w_a, \cdots, w_z$$ We can estimate mutation probability for each outcome state from the data by their frequencies. Less frequent outcomes are less likely to observe. Less likely outcomes would corresponds with farther distances. Thus one idea for outcome states is with their information entropy. $$w_a = -log P(\text{base is A}), \cdots, w_z = -log P(\text{base is Z})$$. #### Constant weights for $$w_a, \cdots, w_z$$ Or, we may assign simple constant weight for $$w_a, \cdots, w_z$$. #### Weights for $$w_1, w_2,$$ and $$w_3$$ We set weight for initial state, $$w_1$$, as 1. We can search for weights $$w_2$$ and $$w_3$$ that minimize averaged RF score on train set with simple grid search. ### Weighted Hamming Distance II that considering interval dropout The proposed weighted hamming didn't account for interval dropout. It maybe better to consider interval dropout event in distance calculation. Previously, we assigned weights to each target site. Our extended idea is to assign weight on dropout interval. Say we have two sequences '00AB0' and '00- - -'. The Hamming distance score would be 3, and the weighted hamming score would be $$w_a w_2 + w_b w_2 + w_1 w_2$$. However, this interval dropout maybe one event of interval missing. So, with the new algorithm, weighted hamming distance II is just $$w_2^*$$. ### Example code for Weighted Hamming I Here's how we specify weights. InfoW = -log(mu_d1) ## entropy as their weights InfoW[1:2] = 1 ## weight for initial state is 1 , weight for '-' is set to 1. InfoW[3:7] = 4.5 ## Constant weight for outcome states dist_wh1 = WH(sD$seqs, InfoW)
#dist_wh1 = dist_weighted_hamming(sD$seqs, InfoW, dropout = FALSE) We can construct tree using NJ or fastme. TreeNJ_wh1 = NJ(dist_wh1) TreeFM_wh1 = fastme.ols(dist_wh1) RF performance is as below. print( RF.dist(TreeNJ_wh1, sD$tree, normalize = TRUE) )
## [1] 0.371134
print( RF.dist(TreeFM_wh1, sD$tree, normalize = TRUE) ) ## [1] 0.3195876 ### Example code for Weighted Hamming II Here's how we specify weights. InfoW = -log(mu_d1) ## entropy as their weights InfoW[1] = 1 # weight for initial state is 1 InfoW[2] = 12 # weight for interval dropout, '----', is set to 12. InfoW[3:7] = 3 # Constant weight for outcome states dist_wh2 = WH(sD$seqs, InfoW, dropout=TRUE)

We can construct tree using NJ or fastme.

TreeNJ_wh2 = NJ(dist_wh2)
TreeFM_wh2 = fastme.ols(dist_wh2)

RF performance is as below.

print( RF.dist(TreeNJ_wh2, sD$tree, normalize = TRUE) ) ## [1] 0.3298969 print( RF.dist(TreeFM_wh2, sD$tree, normalize = TRUE) )
## [1] 0.3814433