This vignette illustrates the use of MTaSPUsSet test, an adaptive gene-multitrait association testing with GWAS summary statistics.
We first downloaded GWAS summary statistics of Genetic Investigation of ANthropometric Traits (GIANT) consortium data. We get the genomic coordinates of SNPs using SnpTracker software (hg19 used), and then we mapped SNPs to Genes using MAGMA software (build 37 used). We will consider testing a gene named SAMD11 for example.
Let’s load SAMD11 data first.
data(SAMD11)
attach(SAMD11)
ZsM
and PsM
are Z-scores and P-values for GIANT Man data mapped on gene SAMD11. (We used MAGMA software to map rs ids to genes. )
round(ZsM,3)
## BMIm HeightM HIPm WCm WeightM WHRm
## rs4951864 -1.150 0.789 -0.539 -0.706 -0.628 -0.468
## rs12132517 1.372 -0.755 0.690 0.824 0.842 0.279
## rs11240777 1.036 0.878 0.755 1.670 1.254 1.175
PsM
## BMIm HeightM HIPm WCm WeightM WHRm
## rs4951864 0.25 0.43 0.59 0.480 0.53 0.64
## rs12132517 0.17 0.45 0.49 0.410 0.40 0.78
## rs11240777 0.30 0.38 0.45 0.095 0.21 0.24
Both ZsM
and PsM
are 3 by 6 matrix. Row represent SNPs and column represent phenotypes. 3 SNPs are mapped on gene SAMD11 and 6 number of phenotypes are considered in testing.
corSNPM
and corPheM
are correlation among SNPs and Phenotypes respectively for GIANT Man data.
round(corSNPM,2)
## rs4951864 rs12132517 rs11240777
## rs4951864 1.00 -0.89 -0.49
## rs12132517 -0.89 1.00 0.44
## rs11240777 -0.49 0.44 1.00
round(corPheM,2)
## V4 V6 V8 V12 V16 V18
## V4 1.00 -0.02 0.12 0.27 0.56 0.07
## V6 -0.02 1.00 0.00 -0.01 0.08 0.00
## V8 0.12 0.00 1.00 0.27 0.16 -0.01
## V12 0.27 -0.01 0.27 1.00 0.25 0.34
## V16 0.56 0.08 0.16 0.25 1.00 0.03
## V18 0.07 0.00 -0.01 0.34 0.03 1.00
ZsF
and PsF
are Z-scores and P-values for GIANT Woman data mapped on gene SAMD11.
round(ZsF,3)
## BMIw HeightW HIPw WCw WeightW WHRw
## rs4951864 0.628 0.345 0.674 1.405 0.643 1.150
## rs12132517 -0.789 -0.412 -0.935 -1.555 -0.860 -1.227
## rs11240777 -0.385 0.215 -0.482 -0.496 -0.025 -0.305
PsF
## BMIw HeightW HIPw WCw WeightW WHRw
## rs4951864 0.53 0.73 0.50 0.16 0.52 0.25
## rs12132517 0.43 0.68 0.35 0.12 0.39 0.22
## rs11240777 0.70 0.83 0.63 0.62 0.98 0.76
corSNPF
and corPheF
are correlation among SNPs and Phenotypes respectively for GIANT Woman data.
round(corSNPF,2)
## rs4951864 rs12132517 rs11240777
## rs4951864 1.00 -0.89 -0.49
## rs12132517 -0.89 1.00 0.44
## rs11240777 -0.49 0.44 1.00
round(corPheF,2)
## V5 V7 V9 V13 V17 V19
## V5 1.00 -0.01 0.19 0.30 0.66 0.03
## V7 -0.01 1.00 -0.01 -0.01 0.04 -0.01
## V9 0.19 -0.01 1.00 0.29 0.21 -0.02
## V13 0.30 -0.01 0.29 1.00 0.29 0.32
## V17 0.66 0.04 0.21 0.29 1.00 0.02
## V19 0.03 -0.01 -0.02 0.32 0.02 1.00
You can see that corSNPM
and corSNPF
are same. Yes, they are calculated from the reference population (here, we used 1000 genome CEU panel). They are same under H0 no matter what phenotype is.
Now, let’s perform MTaSPUsSet
and MTaSPUsSetC
functions. MTaSPUsSetC
is C coded version it impemented using Rcpp and RcppArmadillo package. The C coded version is faster when n.perm < 10^4. However it used (n.perm \(\times\) npow1 \(\times\) npow2 space) in calculation while R version just used n.perm space in calculation (complexity of C version is higher). So R version is faster when n.perm is big like n.perm = \(10^6\). Thus, use C version for the 1st screening using n.perm = 1000 or \(10^5\) for all the genes, and increase the number of permutation for the genens with small p-value.
(outFZC <- MTaSPUsSetC(ZsF, corSNP=corSNPF, corPhe = corPheF,
pow=c(1,2,4,8), pow2 = c(1,2,4,8), n.perm=100, Ps=FALSE))
## MTSPUs1,1 MTSPUs2,1 MTSPUs4,1 MTSPUs8,1 MTSPUs1,2 MTSPUs2,2 MTSPUs4,2
## 0.50 0.94 0.94 0.95 0.77 0.76 0.77
## MTSPUs8,2 MTSPUs1,4 MTSPUs2,4 MTSPUs4,4 MTSPUs8,4 MTSPUs1,8 MTSPUs2,8
## 0.76 0.79 0.78 0.76 0.76 0.79 0.76
## MTSPUs4,8 MTSPUs8,8 MTaSPUs
## 0.76 0.76 0.81
(outFZ <- MTaSPUsSet(ZsF, corSNP=corSNPF, corPhe = corPheF,
pow=c(1,2,4,8), pow2 = c(1,2,4,8), n.perm=100, Ps=FALSE))
## MTSPUsSet1,1 MTSPUsSet1,2 MTSPUsSet1,4 MTSPUsSet1,8 MTSPUsSet2,1
## 0.5100000 0.9400000 0.9500000 0.9500000 0.7100000
## MTSPUsSet2,2 MTSPUsSet2,4 MTSPUsSet2,8 MTSPUsSet4,1 MTSPUsSet4,2
## 0.7600000 0.7800000 0.7600000 0.7900000 0.8000000
## MTSPUsSet4,4 MTSPUsSet4,8 MTSPUsSet8,1 MTSPUsSet8,2 MTSPUsSet8,4
## 0.7800000 0.7800000 0.7900000 0.7800000 0.7800000
## MTSPUsSet8,8 MTaSPUsSet
## 0.7800000 0.8415842
You can check that MTaSPUsSetC
is much faster here. The speed was similar when n.perm=10^4 for this example and R coded version is faster when n.perm > 10^4. If the speed matters, you can try python version which have same complexity as R version. it is a bit faster then R version. (like 1.5 times)
To use Python version, first save files in .txt
format.
#write.table(ZsF, quote=FALSE, row.names=FALSE, col.names=FALSE, file="ZsF.txt")
#write.table(corPheF, quote=FALSE, row.names=FALSE, col.names=FALSE, file="corPheF.txt")
#write.table(corSNPF, quote=FALSE, row.names=FALSE, col.names=FALSE, file="corSNPF.txt")
MTaSPUsSet.py
function in py/aSPU_py do the same job with ./MTaSPUsSet.py ZsF.txt corPheF.txt corSNPF.txt 100 outF.txt
.
Next, We can use the P-values for MTaSPUsSet test using MTaSPUsSetC
and MTaSPUsSet
functions. Put P-value matrix PsF
and set Ps=TRUE
in the option.
(outFPC <- MTaSPUsSetC(PsF, corSNP=corSNPF, corPhe = corPheF,
pow=c(1,2,4,8), pow2 = c(1,2,4,8), n.perm=100, Ps=TRUE))
## MTSPUs1,1 MTSPUs2,1 MTSPUs4,1 MTSPUs8,1 MTSPUs1,2 MTSPUs2,2 MTSPUs4,2
## 0.72 0.74 0.77 0.78 0.79 0.79 0.78
## MTSPUs8,2 MTSPUs1,4 MTSPUs2,4 MTSPUs4,4 MTSPUs8,4 MTSPUs1,8 MTSPUs2,8
## 0.76 0.83 0.82 0.80 0.75 0.83 0.84
## MTSPUs4,8 MTSPUs8,8 MTaSPUs
## 0.83 0.82 0.86
(outFP <- MTaSPUsSet(PsF, corSNP=corSNPF, corPhe = corPheF,
pow=c(1,2,4,8), pow2 = c(1,2,4,8), n.perm=100, Ps=TRUE))
## MTSPUsSet1,1 MTSPUsSet1,2 MTSPUsSet1,4 MTSPUsSet1,8 MTSPUsSet2,1
## 0.7100000 0.7600000 0.7900000 0.8000000 0.7800000
## MTSPUsSet2,2 MTSPUsSet2,4 MTSPUsSet2,8 MTSPUsSet4,1 MTSPUsSet4,2
## 0.8200000 0.8300000 0.8100000 0.8500000 0.8400000
## MTSPUsSet4,4 MTSPUsSet4,8 MTSPUsSet8,1 MTSPUsSet8,2 MTSPUsSet8,4
## 0.8200000 0.8200000 0.8500000 0.8500000 0.8400000
## MTSPUsSet8,8 MTaSPUsSet
## 0.8400000 0.8316832
Results are not much different.
Next, let’s try MTaSPUsSet test using GIANT Man data with input of P-values and Z-scores.
(outMPC <- MTaSPUsSetC(PsM, corSNP=corSNPM, corPhe = corPheM,
pow=c(1,2,4,8), pow2 = c(1,2,4,8), n.perm=100, Ps=TRUE))
## MTSPUs1,1 MTSPUs2,1 MTSPUs4,1 MTSPUs8,1 MTSPUs1,2 MTSPUs2,2 MTSPUs4,2
## 0.38 0.47 0.59 0.67 0.55 0.72 0.75
## MTSPUs8,2 MTSPUs1,4 MTSPUs2,4 MTSPUs4,4 MTSPUs8,4 MTSPUs1,8 MTSPUs2,8
## 0.75 0.76 0.79 0.79 0.79 0.78 0.78
## MTSPUs4,8 MTSPUs8,8 MTaSPUs
## 0.78 0.78 0.55
(outMZC <- MTaSPUsSetC(ZsM, corSNP=corSNPM, corPhe = corPheM,
pow=c(1,2,4,8), pow2 = c(1,2,4,8), n.perm=100, Ps=FALSE))
## MTSPUs1,1 MTSPUs2,1 MTSPUs4,1 MTSPUs8,1 MTSPUs1,2 MTSPUs2,2 MTSPUs4,2
## 0.01 0.17 0.29 0.38 0.50 0.60 0.69
## MTSPUs8,2 MTSPUs1,4 MTSPUs2,4 MTSPUs4,4 MTSPUs8,4 MTSPUs1,8 MTSPUs2,8
## 0.73 0.64 0.73 0.78 0.79 0.72 0.74
## MTSPUs4,8 MTSPUs8,8 MTaSPUs
## 0.74 0.74 0.04
This time we got smaller P-value using ZsM
input than PsM
input. Why is it so? This can be answered from corSNPM
, corPheM
and ZsM
.
round(ZsM,3)
## BMIm HeightM HIPm WCm WeightM WHRm
## rs4951864 -1.150 0.789 -0.539 -0.706 -0.628 -0.468
## rs12132517 1.372 -0.755 0.690 0.824 0.842 0.279
## rs11240777 1.036 0.878 0.755 1.670 1.254 1.175
round(corSNPM,2)
## rs4951864 rs12132517 rs11240777
## rs4951864 1.00 -0.89 -0.49
## rs12132517 -0.89 1.00 0.44
## rs11240777 -0.49 0.44 1.00
round(corPheM,2)
## V4 V6 V8 V12 V16 V18
## V4 1.00 -0.02 0.12 0.27 0.56 0.07
## V6 -0.02 1.00 0.00 -0.01 0.08 0.00
## V8 0.12 0.00 1.00 0.27 0.16 -0.01
## V12 0.27 -0.01 0.27 1.00 0.25 0.34
## V16 0.56 0.08 0.16 0.25 1.00 0.03
## V18 0.07 0.00 -0.01 0.34 0.03 1.00
In corSNPM
, correlation between rs4951864
and others are negative. The second column of ZsM
looks a bit odd. It could be probable since the p-value is not very much significant and rs11240777
might be related a bit with Height
. However It might be related to the coding error as well. The original data set did not provide Z-scores. P-values for each SNP and beta estimate was provided in the download page of GIANT data. I transformed P-values to Z-scores by absZs = qnorm(1 - (Ps)/2)
and then multiplied the sign of beta estimates to recover the original Z-scores. Maybe I am wrong somewhere or provided beta estimates are not all correct. In any case, it would be safe to analyze GIANT data using P-values rather than using Z-scores.
So far, we demostrated how we can use the software. In this section we will visualize some identified genes.
Gene LCORL and RASA2 are two of identified genes by MGAS using KGG software.
In the Figure, we draw image of \(-\log_{10}\) transformed P-values for each SNP and trait. Gene LCORL was the most significant gene. As we can see from the Figure, there are many very significant SNPs for trait Height in Gene LCORL. No doubt MGAS and MTaSPUsSet identified this gene.
MGAS use extended Simes procedure so it considers top few p-values in their algorithm. Thus, MGAS works well with sparse signal such as RASA2. Since the signal is sparse, MTaSPUsSet detect this gene with larger \(\gamma_1\) and \(\gamma_2\). With (\(\gamma_1, \gamma_2\)) = (1,8), (2,8), (4,8) and (8,8), MTaSPUsSet P-values were less than 1e-7 with \(10^7\) permutations.
MTaSPUsSet test have advantage in detecting genes loke STK33 and RPGRIP1L. We can see that signals are widely spread. MTaSPUsSet test have higher power with smaller \(\gamma_1\) and \(\gamma_2\). As we can see from the figure, each P-values are not very big. These genes would be hard to detect using SNP based, or gene - single trait association testing. However, MTaSPUsSet test can detect these genes by aggregating signals for all traits and SNPs.