For a simple example, I prepared 23 SNPs mapped on gene SSU72, SSU72.txt
bash-3.2$ head SSU72.txt
rs4970458
rs2031709
rs3128342
rs3766169
rs2296715
rs2296716
rs3118509
rs3766177
rs3766178
rs3766180
I also prepared small subset of 1000 human genome project data which contains 33 SNPs for 379 subjects. Assume this is our reference panel data. SubsetCh1.txt
We can take sub data for SSU72 using awk command below
bash-3.2$ awk 'NR==FNR {h[$1] = $0; next} h[$1] !="" {print h[$1]}' SubsetCh1.txt SSU72.txt > SubsetSSU72.txt
Among 23 SNPs in SSU72, 22 SNPs were in 1000 human genome panel and sub data saved in SubsetSSU72.txt.
The reason I do this in bash command is that it is simple and fast. reading 1000 human genome project data in R and mapping it in R takes a lot of time. On the other hand, it can be done in few seconds using bash.
Now, we can use R to estimate correlation among SNPs.
locdat <- read.table(file="SubsetSSU72.txt")
datn <- t(matrix(as.numeric(as.matrix(locdat[,3:381])), dim(locdat)[1],379))
colnames(datn) <- locdat[,1]
dim(datn)
## [1] 379 22
We also need to save minor alleles used in 0,1,2 coding because we have to alter 0 and 2 if minor alleles used are different.
usedallele <- as.character(locdat[,2])
usedallele
## [1] "G" "C" "C" "A" "C" "C" "T" "T" "T" "T" "C" "T" "C" "T" "C" "C" "G" "A" "G"
## [20] "G" "A" "A"
Check whether there are SNPs with all 0s.
max(apply( datn == 0, 2, sum )) == 379
## [1] TRUE
if( max(apply( datn == 0, 2, sum )) == 379 ) {
er_list = 0
er_list <- c(er_list, which( apply( datn == 0, 2, sum ) == 379 ) )
datn <- datn[, -er_list]
usedallele <- usedallele[-er_list]
}
dim(datn)
## [1] 379 21
One SNP errased.
Correlation among SNPs can be earned by
corSNP <- cor(as.matrix(datn), use="pairwise.complete.obs")
You maybe want to further delete highly correlated SNPs, greater than .95 and less than -.95
while( sum( corSNP > 0.95 ) > ncol(corSNP) ) {
toerase <- which( apply( corSNP > 0.95, 1, sum) == max( apply( corSNP > 0.95, 1, sum) ) )
corSNP <- corSNP[-toerase[1], -toerase[1]]
if( is.matrix(corSNP) == FALSE) break
}
while( sum( corSNP < -0.95 ) > 0 ) {
toerase <- which( apply( corSNP < -.95, 1, sum) == max( apply( corSNP < -.95, 1, sum) ) )
corSNP <- corSNP[-toerase[1], -toerase[1]]
if( is.matrix(corhap) == FALSE) break
}
Which gives you 12 by 12 correlation matrix among SNPs mapped on SSU72
> corSNP
## rs4970458 rs2031709 rs3128342 rs2296716 rs3118509
## rs4970458 1.00000000 -0.04084146 -0.11782642 -0.03750708 -0.05942565
## rs2031709 -0.04084146 1.00000000 0.14012201 -0.06018247 0.03567443
## rs3128342 -0.11782642 0.14012201 1.00000000 0.46098632 0.14086027
## rs2296716 -0.03750708 -0.06018247 0.46098632 1.00000000 0.08782511
## rs3118509 -0.05942565 0.03567443 0.14086027 0.08782511 1.00000000
## rs3766180 -0.08970164 0.23626920 0.74324439 0.55738899 0.08025057
## rs7519837 0.17811089 0.21915775 0.57968667 0.41502048 0.06321917
## rs7531530 0.21823251 0.02542996 0.52629416 0.60316415 0.09660614
## rs11260611 -0.02390300 0.58526295 0.04830153 -0.06766329 0.02087892
## rs1887284 -0.03825834 0.93675229 0.12558069 -0.04945293 0.03341810
## rs7540231 0.04403638 0.04046505 0.35820590 0.19971246 0.28462121
## rs9439468 0.18605204 0.20408257 0.63447443 0.52033462 0.07004870
## rs3766180 rs7519837 rs7531530 rs11260611 rs1887284 rs7540231
## rs4970458 -0.08970164 0.17811089 0.21823251 -0.02390300 -0.03825834 0.04403638
## rs2031709 0.23626920 0.21915775 0.02542996 0.58526295 0.93675229 0.04046505
## rs3128342 0.74324439 0.57968667 0.52629416 0.04830153 0.12558069 0.35820590
## rs2296716 0.55738899 0.41502048 0.60316415 -0.06766329 -0.04945293 0.19971246
## rs3118509 0.08025057 0.06321917 0.09660614 0.02087892 0.03341810 0.28462121
## rs3766180 1.00000000 0.83387857 0.74297399 0.07120456 0.19508262 0.25553017
## rs7519837 0.83387857 1.00000000 0.78678281 0.06272010 0.17965214 0.27158556
## rs7531530 0.74297399 0.78678281 1.00000000 0.08539244 0.03009128 0.28301160
## rs11260611 0.07120456 0.06272010 0.08539244 1.00000000 0.62477878 0.02626193
## rs1887284 0.19508262 0.17965214 0.03009128 0.62477878 1.00000000 0.02799795
## rs7540231 0.25553017 0.27158556 0.28301160 0.02626193 0.02799795 1.00000000
## rs9439468 0.90145417 0.93769555 0.86371585 0.05521614 0.16604648 0.27761963
## rs9439468
## rs4970458 0.18605204
## rs2031709 0.20408257
## rs3128342 0.63447443
## rs2296716 0.52033462
## rs3118509 0.07004870
## rs3766180 0.90145417
## rs7519837 0.93769555
## rs7531530 0.86371585
## rs11260611 0.05521614
## rs1887284 0.16604648
## rs7540231 0.27761963
## rs9439468 1.00000000