We mapped 34 rs-ids from Mapping Snp to Gene tutorial. myrsidout.genes.annot
I prepared 6 multi-trait summary statistics data for 34 rs-ids, exdat.txt.
bash-3.2$ head exdat.txt
rs3121561 1 990380 0.38 0.14 0.012 0.070 0.64 0.81
rs2799064 1 957898 0.58 0.18 0.61 0.79 0.83 0.73
rs3128126 1 962210 0.076 0.12 0.081 0.17 0.15 0.81
rs3766186 1 1162435 0.29 0.57 0.32 0.73 0.52 0.63
rs11721 1 1152631 0.15 0.34 0.13 0.68 0.17 0.25
rs2887286 1 1156131 0.29 0.81 0.19 0.30 0.22 0.78
rs3813200 1 1151300 0.14 0.41 0.46 0.43 0.26 0.77
rs7515488 1 1163804 0.13 0.87 0.11 0.96 0.42 0.11
rs3813199 1 1158277 0.20 0.64 0.27 0.65 0.40 0.65
rs6603781 1 1158631 0.33 0.68 0.90 0.37 0.35 0.79
I made a Perl program, processGene.pl, to process magma outcome, myrsidout.genes.annot. The Usage is : ./processGene.pl “Magma Result” “Folder Name”
bash-3.2$ ./processGene.pl myrsidout.genes.annot exGenes
SSU72 25
It also shows a Gene name with maximum number of SNPs are mapped. The biggest gene in our example is SSU72 which have 25 SNPs mapped. The program make exGenes
directory and save results there.
bash-3.2$ cd exGenes
bash-3.2$ ls
AGRN.txt SDF4.txt SSU72.txt TMEM240.txt
bash-3.2$ head AGRN.txt
rs3121561
rs2799064
rs3128126
As you can check from the code above. AGRN.txt file have rs-ids mapped.
After running this perl code, I write a R function as follows. Take gene AGRN for example to explain. You can change gene name in the code for other genes.
library(aSPU)
g <- "AGRN"
Assume we already calculated correlation between phenotypes from the whole population. exvarPhe.txt
varPhe <- as.matrix(read.table("exvarPhe.txt"))
varPhe
## V1 V2 V3 V4 V5 V6
## [1,] 1.00 -0.01 0.19 0.30 0.66 0.03
## [2,] -0.01 1.00 -0.01 -0.01 0.04 -0.01
## [3,] 0.19 -0.01 1.00 0.29 0.21 -0.02
## [4,] 0.30 -0.01 0.29 1.00 0.29 0.32
## [5,] 0.66 0.04 0.21 0.29 1.00 0.02
## [6,] 0.03 -0.01 -0.02 0.32 0.02 1.00
Using awk command I took subdata for AGRN on subdatAGRN.txt from exdat.txt.
tof <- paste("subdat",g,".txt",sep ="")
cmd2run <- paste("awk 'NR==FNR {h[$1] = $1; next} h[$1] !=\"\" {print $0}' exGenes/",g,".txt exdat.txt >", tof, sep="")
system(cmd2run)
sdat <- read.table(file=tof, colClasses=c("character", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "character", "character") )
system(paste("rm",tof) )
Again, using awk command, I took subdata for AGRN, tempdataAGRN.txt, from reference panel, SubsetCh1.txt.
snpids <- sdat[,1]
tempf = paste("tempids",g,".txt" , sep="")
write.table(snpids, file = tempf, row.names=FALSE, col.names=FALSE, quote=FALSE )
tof <- paste("tempdata",g,".txt", sep="")
system( paste("awk 'NR==FNR {h[$1] = $0; next} h[$1] !=\"\" {print h[$1]}' SubsetCh1.txt ", tempf," > ", tof, sep="") )
locdat <- read.table(file=tof)
datn <- t(matrix(as.numeric(as.matrix(locdat[,3:381])), dim(locdat)[1],379))
colnames(datn) <- locdat[,1]
(Note: The reason I used awk is because of the speed. It is small example so reading data in R and processing it in R is fast. However, the 1000 human genome project data have 81.2 million polymorphic markers. It takes really long time to read data in R. I tried it once and terminated R. On the other hand, awk do the same job in a few seconds. )
Save used allele in reference panel.
usedallele <- as.character(locdat[,2])
system(paste("rm",tof))
If any column have all 0’s, 1’s or 2’s, erase it.
snpids1 <- snpids[(snpids %in% colnames(datn))]
a = max(apply( datn == 0, 2, sum )) == 379
b = max(apply( datn == 1, 2, sum )) == 379
c = max(apply( datn == 2, 2, sum )) == 379
if( a | b | c ) {
eraseit = 0
eraseit <- c(eraseit, which( apply( datn == 0, 2, sum ) == 112 ),
which(apply( datn == 1, 2, sum ) == 112),
which(apply( datn == 2, 2, sum ) == 112) )
datn <- datn[, -eraseit]
snpids1 <- snpids1[-eraseit]
usedallele <- usedallele[-eraseit]
}
Compare allele used for 0,1,2 coding for original data and reference panel data. Alter 0 and 2 if different alleles are used.
rownames(sdat) <- sdat[,1]
allele <- toupper(sdat[snpids1, 11])
difal <- ifelse( allele == usedallele, 1, -1)
nsdat <- sdat[snpids1,]
subdat2 <- as.matrix(datn)
temp <- subdat2[, difal == -1 ]
temp[ temp == 0] = 3
temp[ temp == 2] = 0
temp[ temp == 3] = 2
subdat2[, difal == -1 ] <- temp
Calculate correlation matrix
datn <- as.matrix(subdat2)
datn <- as.matrix(datn)
corSNP <- cor(as.matrix(datn), use="pairwise.complete.obs")
Ps <- as.matrix(nsdat[,c(4:9)])
Erase rs-ids with high correlation, \(>\) 0.95 and \(<\) -0.95.
while( sum( corSNP < -0.95 ) > 0 ) {
toerase <- which( apply( corSNP < -.95, 1, sum) == max( apply( corSNP < -.95, 1, sum) ) )
corSNP <- corSNP[-toerase[1], -toerase[1]]
Ps <- Ps[-toerase[1],]
if( is.matrix(corSNP) == FALSE) break
}
corSNP = as.matrix(corSNP)
while( sum( corSNP > 0.95 ) > ncol(corSNP) ) {
toerase <- which( apply( corSNP > 0.95, 1, sum) == max( apply( corSNP > 0.95, 1, sum) ) )
corSNP <- corSNP[-toerase[1], -toerase[1]]
Ps <- Ps[-toerase[1],]
if( is.matrix(corSNP) == FALSE) break
}
Ps = as.matrix(Ps)
Perform MTaSPUsSet test.
# out <- MTaSPUsSet(Ps, corSNP=corSNP, corPhe = varPhe, pow=c(1,2,4,8), pow2 = c(1,2,4,8), n.perm=10^3, Ps=TRUE)
out <- MTaSPUsSetC(Ps, corSNP=corSNP, corPhe = varPhe, pow=c(1,2,4,8), pow2 = c(1,2,4,8), n.perm=10^3, Ps=TRUE)
out
## MTSPUs1,1 MTSPUs2,1 MTSPUs4,1 MTSPUs8,1 MTSPUs1,2 MTSPUs2,2 MTSPUs4,2 MTSPUs8,2
## 0.145 0.174 0.247 0.313 0.140 0.201 0.253 0.272
## MTSPUs1,4 MTSPUs2,4 MTSPUs4,4 MTSPUs8,4 MTSPUs1,8 MTSPUs2,8 MTSPUs4,8 MTSPUs8,8
## 0.161 0.203 0.202 0.193 0.177 0.168 0.165 0.164
## MTaSPUs
## 0.236
Perform aSPUs test.
Out <-NULL
for(i in 1:6) {
Out[[i]] <- aSPUs(Ps[,i], corrSNP = corSNP, pow=c(1,2,4,8), n.perm = 100, Ps=TRUE)$pvs
}
Out
## [[1]]
## SPUs1 SPUs2 SPUs4 SPUs8 aSPUs
## 0.2400000 0.2200000 0.2100000 0.1800000 0.2376238
## [[2]]
## SPUs1 SPUs2 SPUs4 SPUs8 aSPUs
## 0.05000000 0.09000000 0.11000000 0.19000000 0.06930693
## [[3]]
## SPUs1 SPUs2 SPUs4 SPUs8 aSPUs
## 0.07000000 0.03000000 0.01000000 0.01000000 0.02970297
## [[4]]
## SPUs1 SPUs2 SPUs4 SPUs8 aSPUs
## 0.1700000 0.1400000 0.1500000 0.1500000 0.1683168
## [[5]]
## SPUs1 SPUs2 SPUs4 SPUs8 aSPUs
## 0.5900000 0.5000000 0.4300000 0.3900000 0.4554455
## [[6]]
## SPUs1 SPUs2 SPUs4 SPUs8 aSPUs
## 0.9200000 0.9300000 0.9600000 0.9700000 0.9306931
You can change g
to calculate MTaSPUsSet/aSPUs P-values for other genes.