Example code for subchallenge 2

In [1]:
library(tidyverse)
library(stringr) ## for str_pad() function
library(purrr) ## map
library(phangorn)
library(phylosim)
library(parallel)
library(rBayesianOptimization)
library(DCLEAR)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.2     ✔ dplyr   1.0.0
✔ tidyr   1.1.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Loading required package: ape

Loading required package: R.oo

Loading required package: R.methodsS3

R.methodsS3 v1.8.0 (2020-02-14 07:10:20 UTC) successfully loaded. See ?R.methodsS3 for help.

R.oo v1.23.0 successfully loaded. See ?R.oo for help.


Attaching package: ‘R.oo’


The following object is masked from ‘package:R.methodsS3’:

    throw


The following objects are masked from ‘package:methods’:

    getClasses, getMethods


The following objects are masked from ‘package:base’:

    attach, detach, load, save


Loading required package: compoisson

Loading required package: MASS


Attaching package: ‘MASS’


The following object is masked from ‘package:dplyr’:

    select



Attaching package: ‘DCLEAR’


The following object is masked from ‘package:stats’:

    simulate


Data Loading

Assume the subchallenge 2 data are in the ../data/subC2/ folder. Let's load the first file.

In [2]:
states <- c('0', '-', LETTERS)  ## '0' = unedited, '-' = interval dropout, A-Z = editing outcomes
sequence_length <- 200L
prefix <- '../data/subC2/SubC2_train_0001'
txt_file <- sprintf('%s.txt', prefix)
nw_file <- sprintf('%s_REF.nw', prefix)

print(txt_file)
[1] "../data/subC2/SubC2_train_0001.txt"

Print the first 5 rows to see what the data looks like.

In [3]:
x <- read.table(txt_file, header = FALSE, sep = '\t')
tip_names <- x[, 1]
print(x[1:5,])
      V1
1 c_0011
2 c_0170
3 c_0021
4 c_0020
5 c_0541
                                                                                                                                                                                                        V2
1 00BCAA0A00A0000A00A000A0A0000B0000A00A00000A0AA00A0A0A0000AA0AA0AD000AB0B0BAB0000A00000I000000000A00000A0A0000G000BA0A0BA0A000HA----------------CAA0C0B0D0A000A00D000000A00AC000A0000000A-------------A0
2 00A000D0A0A00A-AAAA-------------AA000BAAA0000BA0D0A0A00AA00C000C0A000A0000CE00B0AA00A0A0BAAA-BA0A00000AEABAA000ACA--------------A00C000000B00AB0AA0AA0000AAAD00B0A0000000FA0A00AD0A0000A0FA00A0A0BA000A0
3 00B0A0000D00A00A0A00AC00AA00A0000A00EB000000A0AAA00000AB0B0A00IA00A0ABAA0AI000A0A000C0AI00A0000G00000A000000B0G000000A0A000AAED000A00000E0G----------------B0AA0AE00AD000000AA000AAA0BAAA000A0000BA0AAAA
4 00B0A0A0A0AAAAAA00B000AC0B0B0A00000B00E0FB0000A0AB0AA0000BA0B00A00AA000AAE0B0A0F00A0000I0A00AAA000A0B00000A000GAA000B000000AAAA000AE00000BAB0AA0000A0A00BAA0AAA00A------------AA0AC00A0ABA00A00A000AB0A0
5 000000AD0000A000B000000I00000AA00A00000C0A00HA000AC00A0AFA0A00CAA00AA00D0A00A00000AB0BAAA0A00A000AB000B0B0AA0A00AAA00B0A0000000000A00J0A0F0A00000A0000A------------AA00AA0B0C0CAAA00AB0A0AAA0AI0AA0G000A

Process the data and store it as a phyDat object.

In [4]:
states2num = 1:length(states)
names(states2num) = states

## split each barcode string into a cells-by-sites character matrix
x <- do.call('rbind', strsplit(as.character(x[, 2]), ''))
rownames(x) <- tip_names

x <- x %>% phyDat(type = 'USER', levels = states)
y <- read.tree(nw_file)

x is a phyDat object.

In [5]:
x
100 sequences with 200 character and 200 different site patterns.
The states are 0 - A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 

And y is a phylo object holding the reference tree.

In [6]:
plot(y)

Our algorithm is divided into two parts: distance calculation and tree reconstruction.

For distance calculation, we will compare 1) the Hamming distance, 2) weighted Hamming with inverse-information (entropy) weights, and 3) weighted Hamming with weights found by Bayesian optimization.

For tree reconstruction, we will simply use a fast distance-based method. (The code below uses fastme.bal(), the balanced minimum-evolution method from ape, in the role usually played by Neighbor Joining (NJ).)
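
As a minimal sketch of the two-step pipeline on the case loaded above (dist.hamming() and RF.dist() come from phangorn, fastme.bal() from ape):

In [ ]:
## minimal sketch: distances -> tree -> score against the reference
D <- dist.hamming(x)                ## step 1: pairwise distances
t1 <- fastme.bal(D)                 ## step 2: distance-based tree reconstruction
RF.dist(t1, y, normalize = TRUE)    ## normalized RF distance to the true tree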

Distance calculation

The simplest one is the Hamming distance.

In [7]:
D_h = x %>% dist.hamming()

D_h is a dist object holding the pairwise distances among the 100 cells: choose(100, 2) = 4950 pairs.

In [8]:
length(D_h)
4950

Proposed Methods

We now explain how to build the dist object with the weighted Hamming method. There are two ways to choose the weights: the first uses inverse information as weights, and the second estimates the weights with Bayesian optimization.
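
For reference, DCLEAR's dist_weighted_hamming() takes the phyDat object and a weight vector, where the first entry weighs the unedited state '0', the second the dropout state '-', and the rest the editing outcomes. A quick sketch on the case loaded above, with uniform placeholder weights (the two methods below choose better ones):

In [ ]:
## weighted Hamming with uniform placeholder weights
w <- rep(1, 23)
D_wh <- dist_weighted_hamming(x, w, dropout = FALSE)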

Prepare training data first

In [9]:
set.seed(99)
idx1 = sample(100)
train_idx = idx1[1:70]
test_idx = idx1[71:100]

Use 70% of the data as the training set.

In [10]:
SDs_train = NULL

for(i in 1:length(train_idx)) {
    idx = train_idx[i]

    states <- c('0', '-', LETTERS)
    sequence_length <- 200L
    prefix <- paste('../data/subC2/SubC2_train_',str_pad(idx,4, pad = "0"), sep="")
    txt_file <- sprintf('%s.txt', prefix)
    nw_file <- sprintf('%s_REF.nw', prefix)

    x <- read.table(txt_file, header = FALSE, sep = '\t')
    tip_names <- x[, 1]

    states2num = 1:length(states)
    names(states2num) = states

    x <- do.call('rbind', strsplit(as.character(x[, 2]), ''))
    rownames(x) <- tip_names

    x <- x %>% phyDat(type = 'USER', levels = states)
    y <- read.tree(nw_file)    
    
    item1 = list(seq=x, tree=y)    
    SDs_train[[i]] = item1
}

Use the remaining 30% of the data as the test set.

In [11]:
SDs_test = NULL

for(i in 1:length(test_idx)) {
    idx = test_idx[i]

    states <- c('0', '-', LETTERS)
    sequence_length <- 200L
    prefix <- paste('../data/subC2/SubC2_train_',str_pad(idx,4, pad = "0"), sep="")
    txt_file <- sprintf('%s.txt', prefix)
    nw_file <- sprintf('%s_REF.nw', prefix)

    x <- read.table(txt_file, header = FALSE, sep = '\t')
    tip_names <- x[, 1]

    states2num = 1:length(states)
    names(states2num) = states

    x <- do.call('rbind', strsplit(as.character(x[, 2]), ''))
    rownames(x) <- tip_names

    x <- x %>% phyDat(type = 'USER', levels = states)
    y <- read.tree(nw_file)    
    
    item1 = list(seq=x, tree=y)    
    SDs_test[[i]] = item1
}
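
The two loading loops above are identical except for the index vector. As an aside, a small helper like the hypothetical load_case() below would remove the duplication (a sketch, assuming the same file layout as above):

In [ ]:
## hypothetical helper: load one simulated case (sequences + reference tree)
load_case <- function(idx, dir = '../data/subC2', states = c('0', '-', LETTERS)) {
    prefix <- sprintf('%s/SubC2_train_%s', dir, str_pad(idx, 4, pad = "0"))
    x <- read.table(sprintf('%s.txt', prefix), header = FALSE, sep = '\t')
    seqs <- do.call('rbind', strsplit(as.character(x[, 2]), ''))
    rownames(seqs) <- x[, 1]
    list(seq = phyDat(seqs, type = 'USER', levels = states),
         tree = read.tree(sprintf('%s_REF.nw', prefix)))
}

## equivalent to the two loops above:
## SDs_train <- lapply(train_idx, load_case)
## SDs_test  <- lapply(test_idx, load_case)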

eval_test is a function for evaluating a given weight vector on the test set: for each case it builds a tree from the weighted Hamming distances, writes the constructed and true trees to disk, and returns the mean normalized Robinson-Foulds (RF) distance.

In [14]:
eval_test <- function(InfoW, SDs_test, dir = 'res', fnm = "v1") {
    Dwh = NULL

    dir.create(dir, showWarnings = FALSE)

    for (i in 1:length(SDs_test)) {
        x = SDs_test[[i]]$seq
        y = SDs_test[[i]]$tree
        ## weighted Hamming distances -> tree -> compare to the reference tree
        tree1 = x %>% dist_weighted_hamming(InfoW, dropout = FALSE) %>% fastme.bal()
        dist_wh = tree1 %>% RF.dist(y, normalize = TRUE)
        write.tree(tree1, paste(dir, '/Constructed_', i, '_', fnm, '.newick', sep = ""))
        write.tree(y, paste(dir, '/true_', i, '_', fnm, '.newick', sep = ""))
        ## plain-Hamming alternative:
        ## dist_h = x %>% dist.hamming() %>% fastme.bal() %>% RF.dist(y, normalize = TRUE)
        Dwh = c(Dwh, dist_wh)
    }
    return(mean(Dwh))
}
1) Evaluate Hamming distance

With every weight set to 1, the weighted Hamming distance reduces to the plain Hamming distance, so this serves as the baseline.
In [15]:
InfoW = rep(1,23)   ## all 23 weights equal to 1: reduces to the plain Hamming distance
eval_test(InfoW, SDs_test, fnm = 'hamming')
0.742268041237113
2) Inverse information as weights

A function to calculate inverse-information weights from the data:

In [16]:
getInfoW <- function(SDs_train, outcome_pos=3:25, upper_bound = 7, W0 = 1, W_ = 1 ) {
    n_t = length(SDs_train)
    seqs <- rep(0,30)
    ## count occurrences of each state code across all training cases
    for(i in 1:n_t) {
        tb = table(unlist(SDs_train[[i]]$seq))
        for(j in names(tb)) {
            seqs[as.numeric(j)] = seqs[as.numeric(j)] + tb[j]
        }
    }
    seqs = seqs[outcome_pos]                    ## keep counts for the outcome states only
    ws = seqs / sum(seqs)                       ## relative frequency of each outcome
    InfoW = -log(ws)                            ## inverse information: rare outcomes get large weights
    InfoW[InfoW > upper_bound] = upper_bound    ## cap extreme weights
    InfoW = c(W0,W_,InfoW)                      ## prepend fixed weights for '0' (unedited) and '-' (dropout)

    return(InfoW)
}

With the inverse-information method, each outcome state's weight is its surprisal, -log of its relative frequency, so rarer editing outcomes count more toward the distance. The evaluation score is below:

In [17]:
InfoW = getInfoW(SDs_train, outcome_pos = 3:25, upper_bound = 7, W0 = 1, W_ = 1)
eval_test(InfoW, SDs_test, fnm = 'v1_1')
0.672508591065292
3) Estimating weights using Bayesian optimization

Assume two free weights: Wdropout for the interval-dropout state ('-') and Wothers for all other outcome states. WHfit scores a candidate pair by the negative mean normalized RF distance over the training set, which BayesianOptimization then maximizes.

In [20]:
WHfit <- function( Wdropout, Wothers ) {

    ## weight vector: index 1 is the unedited state '0', index 2 is the
    ## dropout state '-', indices 3:23 are the other outcome states
    ## (assigning past the end extends the vector in R)
    InfoW = rep(1,7)
    InfoW[1] = 1
    InfoW[2] = Wdropout
    InfoW[3:23] = Wothers

    ## Score = negative mean normalized RF distance over the training cases;
    ## BayesianOptimization maximizes, so a smaller RF distance is better
    ds = NULL
    for(i in 1:length(SDs_train)){
        D_wh = dist_weighted_hamming(SDs_train[[i]]$seq, InfoW, dropout = FALSE)
        tree_wh = fastme.bal(D_wh)
        ds = c(ds, -RF.dist(tree_wh, SDs_train[[i]]$tree, normalize = TRUE) )
    }

    ## rBayesianOptimization expects a list with elements Score and Pred
    result = list( Score = mean(ds), Pred = 0 )
    return(result)
}
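
## (aside) a single candidate pair can be scored by hand, e.g.
##   WHfit(Wdropout = 0.5, Wothers = 2)$Score
## each call evaluates all 70 training cases (roughly 2 s per call in the log below)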


## search bounds for the two weights
search_bound <- list(Wdropout = c(0.1,1),
                     Wothers = c(1, 15)  )

## 30 random starting points used to seed the optimizer
set.seed(123)
search_grid <- data.frame(Wdropout = runif(30, 0.1, 1.5),
                          Wothers = runif(30, 1.5, 15))

head(search_grid)
A data.frame: 6 × 2
   Wdropout   Wothers
      <dbl>     <dbl>
1 0.5026085 14.500827
2 1.2036272 13.681037
3 0.6725677 10.824521
4 1.3362244 12.238810
5 1.4166542  1.832285
6 0.1637791  7.950246
In [21]:
set.seed(123)
bayes_WH <- BayesianOptimization(FUN = WHfit, bounds = search_bound,
                     init_points = 0, init_grid_dt = search_grid,
                     n_iter = 30, acq = "ucb")
elapsed = 2.52	Round = 1	Wdropout = 0.5026	Wothers = 14.5008	Value = -0.8021 
elapsed = 2.35	Round = 2	Wdropout = 1.2036	Wothers = 13.6810	Value = -0.7948 
elapsed = 2.42	Round = 3	Wdropout = 0.6726	Wothers = 10.8245	Value = -0.7885 
elapsed = 2.30	Round = 4	Wdropout = 1.3362	Wothers = 12.2388	Value = -0.7901 
elapsed = 1.86	Round = 5	Wdropout = 1.4167	Wothers = 1.8323	Value = -0.7065 
elapsed = 2.38	Round = 6	Wdropout = 0.1638	Wothers = 7.9502	Value = -0.7772 
elapsed = 2.48	Round = 7	Wdropout = 0.8393	Wothers = 11.7392	Value = -0.7920 
elapsed = 2.22	Round = 8	Wdropout = 1.3494	Wothers = 4.4215	Value = -0.7078 
elapsed = 2.31	Round = 9	Wdropout = 0.8720	Wothers = 5.7954	Value = -0.7411 
elapsed = 2.25	Round = 10	Wdropout = 0.7393	Wothers = 4.6269	Value = -0.7214 
elapsed = 1.99	Round = 11	Wdropout = 1.4396	Wothers = 3.4278	Value = -0.6854 
elapsed = 2.34	Round = 12	Wdropout = 0.7347	Wothers = 7.0964	Value = -0.7611 
elapsed = 2.36	Round = 13	Wdropout = 1.0486	Wothers = 7.0853	Value = -0.7555 
elapsed = 2.20	Round = 14	Wdropout = 0.9017	Wothers = 6.4794	Value = -0.7549 
elapsed = 2.15	Round = 15	Wdropout = 0.2441	Wothers = 3.5580	Value = -0.7065 
elapsed = 1.99	Round = 16	Wdropout = 1.3598	Wothers = 3.3739	Value = -0.6865 
elapsed = 2.25	Round = 17	Wdropout = 0.4445	Wothers = 4.6460	Value = -0.7270 
elapsed = 2.29	Round = 18	Wdropout = 0.1589	Wothers = 7.7905	Value = -0.7758 
elapsed = 2.18	Round = 19	Wdropout = 0.5591	Wothers = 5.0906	Value = -0.7321 
elapsed = 2.39	Round = 20	Wdropout = 1.4363	Wothers = 13.0807	Value = -0.7882 
elapsed = 1.82	Round = 21	Wdropout = 1.3454	Wothers = 2.1187	Value = -0.6866 
elapsed = 2.39	Round = 22	Wdropout = 1.0699	Wothers = 7.4697	Value = -0.7585 
elapsed = 2.35	Round = 23	Wdropout = 0.9967	Wothers = 12.2855	Value = -0.7920 
elapsed = 2.00	Round = 24	Wdropout = 1.4920	Wothers = 3.1456	Value = -0.6820 
elapsed = 2.38	Round = 25	Wdropout = 1.0180	Wothers = 9.0728	Value = -0.7736 
elapsed = 2.19	Round = 26	Wdropout = 1.0919	Wothers = 4.2882	Value = -0.7066 
elapsed = 2.05	Round = 27	Wdropout = 0.8617	Wothers = 3.2217	Value = -0.6791 
elapsed = 2.44	Round = 28	Wdropout = 0.9318	Wothers = 11.6697	Value = -0.7922 
elapsed = 2.38	Round = 29	Wdropout = 0.5048	Wothers = 13.5831	Value = -0.8019 
elapsed = 2.32	Round = 30	Wdropout = 0.3060	Wothers = 6.5552	Value = -0.7610 
elapsed = 1.99	Round = 31	Wdropout = 0.9995	Wothers = 2.7991	Value = -0.6594 
elapsed = 2.02	Round = 32	Wdropout = 0.3440	Wothers = 2.6643	Value = -0.6739 
elapsed = 1.91	Round = 33	Wdropout = 0.7458	Wothers = 2.6994	Value = -0.6604 
elapsed = 2.17	Round = 34	Wdropout = 0.1000	Wothers = 3.0080	Value = -0.6982 
elapsed = 1.99	Round = 35	Wdropout = 0.1000	Wothers = 2.0411	Value = -0.6751 
elapsed = 2.08	Round = 36	Wdropout = 0.1000	Wothers = 1.5006	Value = -0.6884 
elapsed = 2.13	Round = 37	Wdropout = 0.4597	Wothers = 1.0000	Value = -0.7028 
elapsed = 1.93	Round = 38	Wdropout = 1.0000	Wothers = 2.5898	Value = -0.6521 
elapsed = 2.00	Round = 39	Wdropout = 0.5456	Wothers = 2.3489	Value = -0.6523 
elapsed = 2.09	Round = 40	Wdropout = 0.1000	Wothers = 2.3258	Value = -0.6806 
elapsed = 2.03	Round = 41	Wdropout = 0.4625	Wothers = 1.8618	Value = -0.6465 
elapsed = 1.87	Round = 42	Wdropout = 0.5502	Wothers = 1.5095	Value = -0.6630 
elapsed = 1.96	Round = 43	Wdropout = 0.4609	Wothers = 2.0500	Value = -0.6487 
elapsed = 2.07	Round = 44	Wdropout = 0.3861	Wothers = 1.8135	Value = -0.6517 
elapsed = 1.91	Round = 45	Wdropout = 0.6293	Wothers = 2.0589	Value = -0.6452 
elapsed = 1.88	Round = 46	Wdropout = 0.5835	Wothers = 1.8797	Value = -0.6430 
elapsed = 2.22	Round = 47	Wdropout = 0.6787	Wothers = 3.7693	Value = -0.6990 
elapsed = 1.98	Round = 48	Wdropout = 0.5838	Wothers = 1.8443	Value = -0.6454 
elapsed = 2.06	Round = 49	Wdropout = 0.5421	Wothers = 3.1763	Value = -0.6835 
elapsed = 2.01	Round = 50	Wdropout = 0.3021	Wothers = 1.4084	Value = -0.6717 
elapsed = 2.04	Round = 51	Wdropout = 0.2370	Wothers = 2.1040	Value = -0.6700 
elapsed = 1.98	Round = 52	Wdropout = 0.8909	Wothers = 2.6161	Value = -0.6566 
elapsed = 1.97	Round = 53	Wdropout = 0.7567	Wothers = 1.0000	Value = -0.7256 
elapsed = 1.88	Round = 54	Wdropout = 0.5812	Wothers = 1.9596	Value = -0.6436 
elapsed = 1.94	Round = 55	Wdropout = 0.5275	Wothers = 2.0774	Value = -0.6493 
elapsed = 1.87	Round = 56	Wdropout = 0.7054	Wothers = 2.3754	Value = -0.6515 
elapsed = 2.04	Round = 57	Wdropout = 0.4600	Wothers = 2.0294	Value = -0.6507 
elapsed = 1.97	Round = 58	Wdropout = 0.5236	Wothers = 2.1701	Value = -0.6473 
elapsed = 1.93	Round = 59	Wdropout = 0.5310	Wothers = 2.0294	Value = -0.6457 
elapsed = 2.01	Round = 60	Wdropout = 0.5580	Wothers = 2.1510	Value = -0.6476 

 Best Parameters Found: 
Round = 46	Wdropout = 0.5835	Wothers = 1.8797	Value = -0.6430 

With Bayesian hyperparameter optimization, we get estimates for Wdropout and Wothers. Plug these estimates in as weights and measure the test-set performance.
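
Instead of copying the numbers by hand, the optimum can also be read from the returned object; rBayesianOptimization stores it in Best_Par and Best_Value:

In [ ]:
## read the optimum programmatically
bayes_WH$Best_Par     ## named vector with Wdropout and Wothers
bayes_WH$Best_Value   ## best Score (negative mean normalized RF distance)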

In [23]:
InfoW = rep(1,7)
InfoW[1] = 1
InfoW[2] = 0.5835    ## best Wdropout (Round 46 above)
InfoW[3:23] = 1.8797 ## best Wothers (Round 46 above)

eval_test(InfoW, SDs_test, fnm = 'v1_2')
0.637457044673539
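
To summarize the test-set results: the mean normalized RF distance drops from 0.742 with the plain Hamming distance to 0.673 with inverse-information weights, and to 0.637 with Bayesian-optimized weights.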