Performing MTaSPUsSet (aSPUs) for multiple genes

We mapped 34 rs-ids from Mapping Snp to Gene tutorial. myrsidout.genes.annot

I prepared 6 multi-trait summary statistics data for 34 rs-ids, exdat.txt.

bash-3.2$ head exdat.txt
rs3121561       1       990380  0.38    0.14    0.012   0.070   0.64    0.81
rs2799064       1       957898  0.58    0.18    0.61    0.79    0.83    0.73
rs3128126       1       962210  0.076   0.12    0.081   0.17    0.15    0.81
rs3766186       1       1162435 0.29    0.57    0.32    0.73    0.52    0.63
rs11721 1       1152631 0.15    0.34    0.13    0.68    0.17    0.25
rs2887286       1       1156131 0.29    0.81    0.19    0.30    0.22    0.78
rs3813200       1       1151300 0.14    0.41    0.46    0.43    0.26    0.77
rs7515488       1       1163804 0.13    0.87    0.11    0.96    0.42    0.11
rs3813199       1       1158277 0.20    0.64    0.27    0.65    0.40    0.65
rs6603781       1       1158631 0.33    0.68    0.90    0.37    0.35    0.79

I made a Perl program, processGene.pl, to process magma outcome, myrsidout.genes.annot. The Usage is : ./processGene.pl “Magma Result” “Folder Name”

bash-3.2$ ./processGene.pl myrsidout.genes.annot exGenes
SSU72  25

It also shows a Gene name with maximum number of SNPs are mapped. The biggest gene in our example is SSU72 which have 25 SNPs mapped. The program make exGenes directory and save results there.

bash-3.2$ cd exGenes
bash-3.2$ ls
AGRN.txt  SDF4.txt  SSU72.txt  TMEM240.txt
bash-3.2$ head AGRN.txt
rs3121561
rs2799064
rs3128126

As you can check from the code above. AGRN.txt file have rs-ids mapped.

After running this perl code, I write a R function as follows. Take gene AGRN for example to explain. You can change gene name in the code for other genes.

library(aSPU)
g <- "AGRN"

Assume we already calculated correlation between phenotypes from the whole population. exvarPhe.txt

varPhe <- as.matrix(read.table("exvarPhe.txt"))
varPhe

##                V1    V2    V3    V4   V5    V6
##        [1,]  1.00 -0.01  0.19  0.30 0.66  0.03
##        [2,] -0.01  1.00 -0.01 -0.01 0.04 -0.01
##        [3,]  0.19 -0.01  1.00  0.29 0.21 -0.02
##        [4,]  0.30 -0.01  0.29  1.00 0.29  0.32
##        [5,]  0.66  0.04  0.21  0.29 1.00  0.02
##        [6,]  0.03 -0.01 -0.02  0.32 0.02  1.00

Using awk command I took subdata for AGRN on subdatAGRN.txt from exdat.txt.

    tof <- paste("subdat",g,".txt",sep ="")
    cmd2run <- paste("awk 'NR==FNR {h[$1] = $1; next} h[$1] !=\"\" {print $0}' exGenes/",g,".txt exdat.txt >", tof, sep="")
    system(cmd2run)

    sdat <- read.table(file=tof, colClasses=c("character", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "character", "character")  )
        system(paste("rm",tof) )

Again, using awk command, I took subdata for AGRN, tempdataAGRN.txt, from reference panel, SubsetCh1.txt.

    snpids <- sdat[,1]
    tempf = paste("tempids",g,".txt" , sep="")
    write.table(snpids,  file = tempf, row.names=FALSE, col.names=FALSE, quote=FALSE )
    tof <- paste("tempdata",g,".txt", sep="")

    system( paste("awk 'NR==FNR {h[$1] = $0; next} h[$1] !=\"\" {print h[$1]}' SubsetCh1.txt ", tempf," > ", tof, sep="") )

    locdat <- read.table(file=tof)
    datn <- t(matrix(as.numeric(as.matrix(locdat[,3:381])), dim(locdat)[1],379))

    colnames(datn) <- locdat[,1]

(Note: The reason I used awk is because of the speed. It is small example so reading data in R and processing it in R is fast. However, the 1000 human genome project data have 81.2 million polymorphic markers. It takes really long time to read data in R. I tried it once and terminated R. On the other hand, awk do the same job in a few seconds. )

Save used allele in reference panel.

    usedallele <- as.character(locdat[,2])
    system(paste("rm",tof))

If any column have all 0’s, 1’s or 2’s, erase it.

    snpids1 <- snpids[(snpids %in% colnames(datn))]

    a = max(apply( datn == 0, 2, sum )) == 379
    b = max(apply( datn == 1, 2, sum )) == 379
    c = max(apply( datn == 2, 2, sum )) == 379

    if( a | b | c ) {
        eraseit = 0
        eraseit <- c(eraseit, which( apply( datn == 0, 2, sum ) == 112 ),
                                which(apply( datn == 1, 2, sum ) == 112),
                                which(apply( datn == 2, 2, sum ) == 112) )

        datn <- datn[, -eraseit]
        snpids1 <- snpids1[-eraseit]
        usedallele <- usedallele[-eraseit]
    }

Compare allele used for 0,1,2 coding for original data and reference panel data. Alter 0 and 2 if different alleles are used.

    rownames(sdat) <- sdat[,1]
    allele <- toupper(sdat[snpids1, 11])
    difal <- ifelse( allele == usedallele, 1, -1)
    nsdat <- sdat[snpids1,]

    subdat2 <- as.matrix(datn)
    temp <- subdat2[, difal == -1 ]
    temp[ temp == 0] = 3
    temp[ temp == 2] = 0
    temp[ temp == 3] = 2
    subdat2[, difal == -1 ]  <- temp

Calculate correlation matrix

    datn <- as.matrix(subdat2)
    datn <- as.matrix(datn)
    corSNP <- cor(as.matrix(datn), use="pairwise.complete.obs")

    Ps <- as.matrix(nsdat[,c(4:9)])

Erase rs-ids with high correlation, \(>\) 0.95 and \(<\) -0.95.

    while( sum( corSNP < -0.95 ) > 0 ) {
       toerase <- which( apply( corSNP < -.95, 1, sum) == max( apply( corSNP < -.95, 1, sum) ) )
       corSNP <- corSNP[-toerase[1], -toerase[1]]
       Ps <- Ps[-toerase[1],]
       if( is.matrix(corSNP) == FALSE) break
    }

    corSNP = as.matrix(corSNP)
    while( sum( corSNP > 0.95 ) > ncol(corSNP) ) {
       toerase <- which( apply( corSNP > 0.95, 1, sum) == max( apply( corSNP > 0.95, 1, sum) ) )
       corSNP <- corSNP[-toerase[1], -toerase[1]]
       Ps <- Ps[-toerase[1],]
       if( is.matrix(corSNP) == FALSE) break
    }

    Ps = as.matrix(Ps)

Perform MTaSPUsSet test.

#    out <- MTaSPUsSet(Ps, corSNP=corSNP, corPhe = varPhe, pow=c(1,2,4,8),  pow2 = c(1,2,4,8), n.perm=10^3, Ps=TRUE)
    out <- MTaSPUsSetC(Ps, corSNP=corSNP, corPhe = varPhe, pow=c(1,2,4,8),  pow2 = c(1,2,4,8), n.perm=10^3, Ps=TRUE)
out

## MTSPUs1,1 MTSPUs2,1 MTSPUs4,1 MTSPUs8,1 MTSPUs1,2 MTSPUs2,2 MTSPUs4,2 MTSPUs8,2
##     0.145     0.174     0.247     0.313     0.140     0.201     0.253     0.272
## MTSPUs1,4 MTSPUs2,4 MTSPUs4,4 MTSPUs8,4 MTSPUs1,8 MTSPUs2,8 MTSPUs4,8 MTSPUs8,8
##     0.161     0.203     0.202     0.193     0.177     0.168     0.165     0.164
##   MTaSPUs
##     0.236

Perform aSPUs test.

Out <-NULL
for(i in 1:6) {
   Out[[i]] <- aSPUs(Ps[,i], corrSNP = corSNP, pow=c(1,2,4,8), n.perm = 100, Ps=TRUE)$pvs
}
Out

## [[1]]
##     SPUs1     SPUs2     SPUs4     SPUs8     aSPUs
## 0.2400000 0.2200000 0.2100000 0.1800000 0.2376238
## [[2]]
##      SPUs1      SPUs2      SPUs4      SPUs8      aSPUs
## 0.05000000 0.09000000 0.11000000 0.19000000 0.06930693
## [[3]]
##      SPUs1      SPUs2      SPUs4      SPUs8      aSPUs
## 0.07000000 0.03000000 0.01000000 0.01000000 0.02970297
## [[4]]
##     SPUs1     SPUs2     SPUs4     SPUs8     aSPUs
## 0.1700000 0.1400000 0.1500000 0.1500000 0.1683168
## [[5]]
##     SPUs1     SPUs2     SPUs4     SPUs8     aSPUs
## 0.5900000 0.5000000 0.4300000 0.3900000 0.4554455
## [[6]]
##     SPUs1     SPUs2     SPUs4     SPUs8     aSPUs
## 0.9200000 0.9300000 0.9600000 0.9700000 0.9306931

You can change g to calculate MTaSPUsSet/aSPUs P-values for other genes.

Performing MTaSPUsSet (aSPUs) for multiple genes

Il-Youp Kwak

2016-5-3