In some of my tutorials (T1, T2), I used awk because it is faster. Here I run some experiments to see how much faster it is. I compared five methods: 1) old R functions, 2) dplyr and readr, 3) data.table, 4) awk, and 5) perl.
The file GiantdatAl.txt contains multi-trait GWAS summary statistics for 2,719,717 SNPs. The data have an rs-id, a chromosome, a genomic coordinate, 18 GWAS summary statistics, and allele information. Part of the data looks like this:
bash-3.2$ head -n 3 GiantdatAl.txt
rs4747841 10 9918166 0.94 0.31 0.68 0.63 0.31 0.50 0.16 0.76 0.47 0.80 0.26 0.38 0.96 0.27 0.49 0.65 0.55 0.75 a g
rs4749917 10 9918296 0.94 0.31 0.68 0.63 0.31 0.50 0.16 0.75 0.47 0.80 0.26 0.38 0.96 0.27 0.49 0.65 0.55 0.75 t c
rs737656 10 98252982 0.70 0.28 0.25 0.27 0.28 0.67 0.25 0.59 0.70 0.74 0.94 0.29 0.34 0.49 0.54 0.35 0.97 0.38 a g
In this example, AGRN.txt has 3 SNPs mapped to the AGRN gene.
bash-3.2$ cat AGRN.txt
rs3121561
rs2799064
rs3128126
We would like to take the subset of these AGRN SNPs from GiantdatAl.txt. This can be done using built-in R functions.
## in R (Old) ###
t1 <- proc.time()
dat1 <- read.table("GiantdatAl.txt", colClasses=c("character", "character",
"numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
"numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
"numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
"numeric", "character", "character") )
snps <- read.table("AGRN.txt", header=FALSE)
sdat <- dat1[ dat1[,1] %in% snps$V1, ]
t2 <- proc.time()
sdat
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
## 1165433 rs2799064 1 1022518 0.76 0.580 0.32 0.18 0.66 0.610 0.72 0.31 0.87
## 1166049 rs3128126 1 1026830 0.53 0.076 0.83 0.12 0.94 0.081 0.67 0.39 0.59
## 1168776 rs3121561 1 1055000 0.45 0.380 0.78 0.14 0.73 0.012 0.40 0.11 0.98
## V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23
## 1165433 0.79 0.96 0.93 0.40 0.83 0.97 0.73 0.82 0.56 t g
## 1166049 0.17 0.78 0.47 0.73 0.15 0.48 0.81 0.63 0.58 a g
## 1168776 0.07 0.12 0.33 0.54 0.64 0.98 0.81 0.48 0.59 t c
t2 - t1
## user system elapsed
## 26.844 1.484 28.340
Now, I tried the same job using dplyr and readr. These R packages are written in C++ using Rcpp.
library(readr)
library(dplyr)
### in R, using readr and dplyr ###
t1 <- proc.time()
dat1 <- read_delim("GiantdatAl.txt", delim=" ", col_types=cols("c", "c", "i", "d",
"d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d",
"d", "d", "c", "c"), col_names=FALSE )
snps <- read_delim("AGRN.txt", col_names=FALSE, delim=" ", col_types=cols("c") )
sdat <- filter(dat1, X1 %in% snps$X1)
t2 <- proc.time()
sdat
## Source: local data frame [3 x 23]
##
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
## (chr) (chr) (int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 rs2799064 1 1022518 0.76 0.580 0.32 0.18 0.66 0.610 0.72 0.31 0.87
## 2 rs3128126 1 1026830 0.53 0.076 0.83 0.12 0.94 0.081 0.67 0.39 0.59
## 3 rs3121561 1 1055000 0.45 0.380 0.78 0.14 0.73 0.012 0.40 0.11 0.98
## Variables not shown: X13 (dbl), X14 (dbl), X15 (dbl), X16 (dbl), X17 (dbl), X18
## (dbl), X19 (dbl), X20 (dbl), X21 (dbl), X22 (chr), X23 (chr)
t2 - t1
## user system elapsed
## 13.443 0.686 14.169
Using readr and dplyr, this data manipulation can be done about 2 times faster.
Let’s try data.table next. It is reported that the fread function in the data.table package is a bit faster, since it is coded in pure C.
library(data.table)
### in R using data.table ###
t1 <- proc.time()
dat1 <- fread("GiantdatAl.txt", colClasses=c("character", "character", "integer",
"double", "double", "double", "double", "double", "double", "double",
"double", "double", "double", "double", "double", "double", "double",
"double", "double", "double", "double", "character", "character") )
snps <- fread("AGRN.txt", header=FALSE)
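# %chin% is data.table's fast %in% operator for character vectors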
sdat <- dat1[ V1 %chin% snps$V1, ]
t2 <- proc.time()
t2 - t1
## user system elapsed
## 10.839 0.331 11.174
We can check that data.table is a bit faster.
Next, I used the awk command for the same data manipulation.
### using awk ###
t1 <- proc.time()
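# awk reads AGRN.txt first (NR==FNR), stores its SNP ids in array h, then prints the lines of GiantdatAl.txt whose first field is in h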
cmd2run <- "awk 'NR==FNR {h[$1] = $1; next} h[$1] !=\"\" {print $0}' AGRN.txt GiantdatAl.txt > exdat1.txt"
system(cmd2run)
sdat <- read.table("exdat1.txt", colClasses=c("character", "character", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "character", "character") )
t2 <- proc.time()
t2 - t1
## user system elapsed
## 2.771 0.361 3.142
Yes, it is super fast: more than 3 times faster than any other method I compared here.
The next method uses perl. I was just curious and tried it. I wrote a Perl script, TakeSubDat.pl, for this.
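The script itself is not shown in this post; a minimal sketch of what TakeSubDat.pl might look like (a hash lookup mirroring the awk one-liner; the exact code is an assumption) is:
#!/usr/bin/perl
# Hypothetical sketch of TakeSubDat.pl; usage: ./TakeSubDat.pl AGRN.txt GiantdatAl.txt
use strict;
use warnings;

my ($snpfile, $datfile) = @ARGV;

# store the wanted SNP ids in a hash
my %keep;
open(my $SNP, "<", $snpfile) or die "Cannot open $snpfile: $!";
while (<$SNP>) {
    my ($id) = split;
    $keep{$id} = 1 if defined $id;
}
close($SNP);

# print only the lines whose first field (rs-id) is a wanted SNP
open(my $DAT, "<", $datfile) or die "Cannot open $datfile: $!";
while (<$DAT>) {
    my ($id) = split;
    print if defined $id && $keep{$id};
}
close($DAT);
Calling it from R and timing it as before: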
t1 <- proc.time()
cmd2run <- "./TakeSubDat.pl AGRN.txt GiantdatAl.txt > exdat2.txt"
system(cmd2run)
sdat <- read.table("exdat2.txt", colClasses=c("character", "character", "numeric",
"numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
"numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
"numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
"character", "character") )
t2 <- proc.time()
t2 - t1
## user system elapsed
## 33.550 0.350 33.935
Using perl was even slower than the old R functions.
Next, I used the R microbenchmark package for a more exact speed comparison: f1() uses old R, f2() uses dplyr and readr, f3() uses data.table, f4() uses awk, and f5() uses perl. The same procedure was evaluated 273 times, and the summary table is provided below. We can check that awk is very fast compared to the other methods.
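The exact benchmarking code is not shown here; a minimal sketch of how Res might have been built (each f*() simply wraps one of the code blocks above, and times=273 matches the number of evaluations reported) is:
library(microbenchmark)
library(readr); library(dplyr); library(data.table)

### hypothetical benchmark setup: each function wraps one method shown above ###
cc <- c("character", "character", rep("numeric", 19), "character", "character")
f1 <- function() {   # old R
  dat1 <- read.table("GiantdatAl.txt", colClasses=cc)
  snps <- read.table("AGRN.txt", header=FALSE)
  dat1[ dat1[,1] %in% snps$V1, ]
}
f2 <- function() {   # readr + dplyr
  dat1 <- read_delim("GiantdatAl.txt", delim=" ", col_names=FALSE,
                     col_types=paste0("cci", strrep("d", 18), "cc"))
  snps <- read_delim("AGRN.txt", delim=" ", col_names=FALSE, col_types="c")
  filter(dat1, X1 %in% snps$X1)
}
f3 <- function() {   # data.table
  dat1 <- fread("GiantdatAl.txt", colClasses=cc)
  snps <- fread("AGRN.txt", header=FALSE)
  dat1[ V1 %chin% snps$V1, ]
}
f4 <- function() {   # awk, then read the extracted rows back into R
  system("awk 'NR==FNR {h[$1] = $1; next} h[$1] !=\"\" {print $0}' AGRN.txt GiantdatAl.txt > exdat1.txt")
  read.table("exdat1.txt", colClasses=cc)
}
f5 <- function() {   # perl, then read the extracted rows back into R
  system("./TakeSubDat.pl AGRN.txt GiantdatAl.txt > exdat2.txt")
  read.table("exdat2.txt", colClasses=cc)
}
Res <- microbenchmark(f1(), f2(), f3(), f4(), f5(), times=273)
Judging from the elapsed times reported earlier, the times in the summary below are in seconds.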
> summary(Res)
expr min lq mean median uq max neval
1 f1() 20.309837 24.543637 28.156107 26.497877 29.282222 184.19319 273
2 f2() 10.917517 12.781069 15.563997 14.476658 16.171709 101.48219 273
3 f3() 8.577753 10.611985 12.609191 11.639268 13.519316 68.04706 273
4 f4() 2.774250 3.392059 3.926914 3.625753 4.052245 18.72048 273
5 f5() 29.320233 34.100792 36.069089 34.621827 36.131966 50.47498 273
In conclusion, it would be good to use dplyr, readr, and data.table for plain data manipulation, since they are easy to use (simpler syntax). If speed matters, it would be good to write an awk script.
I thank Irucka Embry for the helpful suggestions to use %chin% and the microbenchmark package. If you have better opinions and suggestions, please e-mail me at ikwak@umn.edu. Thank you! :)