This vignette illustrates the use of MTaSPUsSet test, an adaptive gene-multitrait association testing with GWAS summary statistics.

Data preparation

We first downloaded GWAS summary statistics of Genetic Investigation of ANthropometric Traits (GIANT) consortium data. We get the genomic coordinates of SNPs using SnpTracker software (hg19 used), and then we mapped SNPs to Genes using MAGMA software (build 37 used). We will consider testing a gene named SAMD11 for example.

Let’s load SAMD11 data first.

data(SAMD11)
attach(SAMD11)

ZsM and PsM are Z-scores and P-values for GIANT Man data mapped on gene SAMD11. (We used MAGMA software to map rs ids to genes. )

round(ZsM,3)
##              BMIm HeightM   HIPm    WCm WeightM   WHRm
## rs4951864  -1.150   0.789 -0.539 -0.706  -0.628 -0.468
## rs12132517  1.372  -0.755  0.690  0.824   0.842  0.279
## rs11240777  1.036   0.878  0.755  1.670   1.254  1.175
PsM
##            BMIm HeightM HIPm   WCm WeightM WHRm
## rs4951864  0.25    0.43 0.59 0.480    0.53 0.64
## rs12132517 0.17    0.45 0.49 0.410    0.40 0.78
## rs11240777 0.30    0.38 0.45 0.095    0.21 0.24

Both ZsM and PsM are 3 by 6 matrix. Row represent SNPs and column represent phenotypes. 3 SNPs are mapped on gene SAMD11 and 6 number of phenotypes are considered in testing.

corSNPM and corPheM are correlation among SNPs and Phenotypes respectively for GIANT Man data.

round(corSNPM,2)
##            rs4951864 rs12132517 rs11240777
## rs4951864       1.00      -0.89      -0.49
## rs12132517     -0.89       1.00       0.44
## rs11240777     -0.49       0.44       1.00
round(corPheM,2)
##        V4    V6    V8   V12  V16   V18
## V4   1.00 -0.02  0.12  0.27 0.56  0.07
## V6  -0.02  1.00  0.00 -0.01 0.08  0.00
## V8   0.12  0.00  1.00  0.27 0.16 -0.01
## V12  0.27 -0.01  0.27  1.00 0.25  0.34
## V16  0.56  0.08  0.16  0.25 1.00  0.03
## V18  0.07  0.00 -0.01  0.34 0.03  1.00

ZsF and PsF are Z-scores and P-values for GIANT Woman data mapped on gene SAMD11.

round(ZsF,3)
##              BMIw HeightW   HIPw    WCw WeightW   WHRw
## rs4951864   0.628   0.345  0.674  1.405   0.643  1.150
## rs12132517 -0.789  -0.412 -0.935 -1.555  -0.860 -1.227
## rs11240777 -0.385   0.215 -0.482 -0.496  -0.025 -0.305
PsF
##            BMIw HeightW HIPw  WCw WeightW WHRw
## rs4951864  0.53    0.73 0.50 0.16    0.52 0.25
## rs12132517 0.43    0.68 0.35 0.12    0.39 0.22
## rs11240777 0.70    0.83 0.63 0.62    0.98 0.76

corSNPF and corPheF are correlation among SNPs and Phenotypes respectively for GIANT Woman data.

round(corSNPF,2)
##            rs4951864 rs12132517 rs11240777
## rs4951864       1.00      -0.89      -0.49
## rs12132517     -0.89       1.00       0.44
## rs11240777     -0.49       0.44       1.00
round(corPheF,2)
##        V5    V7    V9   V13  V17   V19
## V5   1.00 -0.01  0.19  0.30 0.66  0.03
## V7  -0.01  1.00 -0.01 -0.01 0.04 -0.01
## V9   0.19 -0.01  1.00  0.29 0.21 -0.02
## V13  0.30 -0.01  0.29  1.00 0.29  0.32
## V17  0.66  0.04  0.21  0.29 1.00  0.02
## V19  0.03 -0.01 -0.02  0.32 0.02  1.00

You can see that corSNPM and corSNPF are same. Yes, they are calculated from the reference population (here, we used 1000 genome CEU panel). They are same under H0 no matter what phenotype is.

Data analysis

Now, let’s perform MTaSPUsSet and MTaSPUsSetC functions. MTaSPUsSetC is C coded version it impemented using Rcpp and RcppArmadillo package. The C coded version is faster when n.perm < 10^4. However it used (n.perm \(\times\) npow1 \(\times\) npow2 space) in calculation while R version just used n.perm space in calculation (complexity of C version is higher). So R version is faster when n.perm is big like n.perm = \(10^6\). Thus, use C version for the 1st screening using n.perm = 1000 or \(10^5\) for all the genes, and increase the number of permutation for the genens with small p-value.

(outFZC <- MTaSPUsSetC(ZsF, corSNP=corSNPF, corPhe = corPheF,
           pow=c(1,2,4,8),  pow2 = c(1,2,4,8), n.perm=100, Ps=FALSE))
## MTSPUs1,1 MTSPUs2,1 MTSPUs4,1 MTSPUs8,1 MTSPUs1,2 MTSPUs2,2 MTSPUs4,2 
##      0.50      0.94      0.94      0.95      0.77      0.76      0.77 
## MTSPUs8,2 MTSPUs1,4 MTSPUs2,4 MTSPUs4,4 MTSPUs8,4 MTSPUs1,8 MTSPUs2,8 
##      0.76      0.79      0.78      0.76      0.76      0.79      0.76 
## MTSPUs4,8 MTSPUs8,8   MTaSPUs 
##      0.76      0.76      0.81
(outFZ <- MTaSPUsSet(ZsF, corSNP=corSNPF, corPhe = corPheF,
           pow=c(1,2,4,8),  pow2 = c(1,2,4,8), n.perm=100, Ps=FALSE))
## MTSPUsSet1,1 MTSPUsSet1,2 MTSPUsSet1,4 MTSPUsSet1,8 MTSPUsSet2,1 
##    0.5100000    0.9400000    0.9500000    0.9500000    0.7100000 
## MTSPUsSet2,2 MTSPUsSet2,4 MTSPUsSet2,8 MTSPUsSet4,1 MTSPUsSet4,2 
##    0.7600000    0.7800000    0.7600000    0.7900000    0.8000000 
## MTSPUsSet4,4 MTSPUsSet4,8 MTSPUsSet8,1 MTSPUsSet8,2 MTSPUsSet8,4 
##    0.7800000    0.7800000    0.7900000    0.7800000    0.7800000 
## MTSPUsSet8,8   MTaSPUsSet 
##    0.7800000    0.8415842

You can check that MTaSPUsSetC is much faster here. The speed was similar when n.perm=10^4 for this example and R coded version is faster when n.perm > 10^4. If the speed matters, you can try python version which have same complexity as R version. it is a bit faster then R version. (like 1.5 times)

To use Python version, first save files in .txt format.

#write.table(ZsF, quote=FALSE, row.names=FALSE, col.names=FALSE, file="ZsF.txt")
#write.table(corPheF, quote=FALSE, row.names=FALSE, col.names=FALSE, file="corPheF.txt")
#write.table(corSNPF, quote=FALSE, row.names=FALSE, col.names=FALSE, file="corSNPF.txt")

MTaSPUsSet.py function in py/aSPU_py do the same job with ./MTaSPUsSet.py ZsF.txt corPheF.txt corSNPF.txt 100 outF.txt.

Next, We can use the P-values for MTaSPUsSet test using MTaSPUsSetC and MTaSPUsSet functions. Put P-value matrix PsF and set Ps=TRUE in the option.

(outFPC <- MTaSPUsSetC(PsF, corSNP=corSNPF, corPhe = corPheF,
           pow=c(1,2,4,8),  pow2 = c(1,2,4,8), n.perm=100, Ps=TRUE))
## MTSPUs1,1 MTSPUs2,1 MTSPUs4,1 MTSPUs8,1 MTSPUs1,2 MTSPUs2,2 MTSPUs4,2 
##      0.72      0.74      0.77      0.78      0.79      0.79      0.78 
## MTSPUs8,2 MTSPUs1,4 MTSPUs2,4 MTSPUs4,4 MTSPUs8,4 MTSPUs1,8 MTSPUs2,8 
##      0.76      0.83      0.82      0.80      0.75      0.83      0.84 
## MTSPUs4,8 MTSPUs8,8   MTaSPUs 
##      0.83      0.82      0.86
(outFP <- MTaSPUsSet(PsF, corSNP=corSNPF, corPhe = corPheF,
           pow=c(1,2,4,8),  pow2 = c(1,2,4,8), n.perm=100, Ps=TRUE))
## MTSPUsSet1,1 MTSPUsSet1,2 MTSPUsSet1,4 MTSPUsSet1,8 MTSPUsSet2,1 
##    0.7100000    0.7600000    0.7900000    0.8000000    0.7800000 
## MTSPUsSet2,2 MTSPUsSet2,4 MTSPUsSet2,8 MTSPUsSet4,1 MTSPUsSet4,2 
##    0.8200000    0.8300000    0.8100000    0.8500000    0.8400000 
## MTSPUsSet4,4 MTSPUsSet4,8 MTSPUsSet8,1 MTSPUsSet8,2 MTSPUsSet8,4 
##    0.8200000    0.8200000    0.8500000    0.8500000    0.8400000 
## MTSPUsSet8,8   MTaSPUsSet 
##    0.8400000    0.8316832

Results are not much different.

Next, let’s try MTaSPUsSet test using GIANT Man data with input of P-values and Z-scores.

(outMPC <- MTaSPUsSetC(PsM, corSNP=corSNPM, corPhe = corPheM,
           pow=c(1,2,4,8),  pow2 = c(1,2,4,8), n.perm=100, Ps=TRUE))
## MTSPUs1,1 MTSPUs2,1 MTSPUs4,1 MTSPUs8,1 MTSPUs1,2 MTSPUs2,2 MTSPUs4,2 
##      0.38      0.47      0.59      0.67      0.55      0.72      0.75 
## MTSPUs8,2 MTSPUs1,4 MTSPUs2,4 MTSPUs4,4 MTSPUs8,4 MTSPUs1,8 MTSPUs2,8 
##      0.75      0.76      0.79      0.79      0.79      0.78      0.78 
## MTSPUs4,8 MTSPUs8,8   MTaSPUs 
##      0.78      0.78      0.55
(outMZC <- MTaSPUsSetC(ZsM, corSNP=corSNPM, corPhe = corPheM,
           pow=c(1,2,4,8),  pow2 = c(1,2,4,8), n.perm=100, Ps=FALSE))
## MTSPUs1,1 MTSPUs2,1 MTSPUs4,1 MTSPUs8,1 MTSPUs1,2 MTSPUs2,2 MTSPUs4,2 
##      0.01      0.17      0.29      0.38      0.50      0.60      0.69 
## MTSPUs8,2 MTSPUs1,4 MTSPUs2,4 MTSPUs4,4 MTSPUs8,4 MTSPUs1,8 MTSPUs2,8 
##      0.73      0.64      0.73      0.78      0.79      0.72      0.74 
## MTSPUs4,8 MTSPUs8,8   MTaSPUs 
##      0.74      0.74      0.04

This time we got smaller P-value using ZsM input than PsM input. Why is it so? This can be answered from corSNPM, corPheM and ZsM.

round(ZsM,3)
##              BMIm HeightM   HIPm    WCm WeightM   WHRm
## rs4951864  -1.150   0.789 -0.539 -0.706  -0.628 -0.468
## rs12132517  1.372  -0.755  0.690  0.824   0.842  0.279
## rs11240777  1.036   0.878  0.755  1.670   1.254  1.175
round(corSNPM,2)
##            rs4951864 rs12132517 rs11240777
## rs4951864       1.00      -0.89      -0.49
## rs12132517     -0.89       1.00       0.44
## rs11240777     -0.49       0.44       1.00
round(corPheM,2)
##        V4    V6    V8   V12  V16   V18
## V4   1.00 -0.02  0.12  0.27 0.56  0.07
## V6  -0.02  1.00  0.00 -0.01 0.08  0.00
## V8   0.12  0.00  1.00  0.27 0.16 -0.01
## V12  0.27 -0.01  0.27  1.00 0.25  0.34
## V16  0.56  0.08  0.16  0.25 1.00  0.03
## V18  0.07  0.00 -0.01  0.34 0.03  1.00

In corSNPM, correlation between rs4951864 and others are negative. The second column of ZsM looks a bit odd. It could be probable since the p-value is not very much significant and rs11240777 might be related a bit with Height. However It might be related to the coding error as well. The original data set did not provide Z-scores. P-values for each SNP and beta estimate was provided in the download page of GIANT data. I transformed P-values to Z-scores by absZs = qnorm(1 - (Ps)/2) and then multiplied the sign of beta estimates to recover the original Z-scores. Maybe I am wrong somewhere or provided beta estimates are not all correct. In any case, it would be safe to analyze GIANT data using P-values rather than using Z-scores.

Data Visualization of some detected genes

So far, we demostrated how we can use the software. In this section we will visualize some identified genes.

Gene LCORL and RASA2 are two of identified genes by MGAS using KGG software.

plot of chunk plot_MGAS

In the Figure, we draw image of \(-\log_{10}\) transformed P-values for each SNP and trait. Gene LCORL was the most significant gene. As we can see from the Figure, there are many very significant SNPs for trait Height in Gene LCORL. No doubt MGAS and MTaSPUsSet identified this gene.

MGAS use extended Simes procedure so it considers top few p-values in their algorithm. Thus, MGAS works well with sparse signal such as RASA2. Since the signal is sparse, MTaSPUsSet detect this gene with larger \(\gamma_1\) and \(\gamma_2\). With (\(\gamma_1, \gamma_2\)) = (1,8), (2,8), (4,8) and (8,8), MTaSPUsSet P-values were less than 1e-7 with \(10^7\) permutations.

plot of chunk plot_MT

MTaSPUsSet test have advantage in detecting genes loke STK33 and RPGRIP1L. We can see that signals are widely spread. MTaSPUsSet test have higher power with smaller \(\gamma_1\) and \(\gamma_2\). As we can see from the figure, each P-values are not very big. These genes would be hard to detect using SNP based, or gene - single trait association testing. However, MTaSPUsSet test can detect these genes by aggregating signals for all traits and SNPs.