In some tutorials (T1, T2), I used awk because it is faster. Here I run a few experiments to see how much faster it actually is. I compared five methods: 1) old (base) R functions, 2) dplyr and readr, 3) data.table, 4) awk, and 5) perl.

The GiantdatAl.txt file contains multi-trait GWAS summary statistics for 2,719,717 SNPs. Each row has an rs-id, a chromosome, a genomic coordinate, 18 GWAS summary statistics, and allele information. Part of the data looks like this.

bash-3.2$ head -n 3 GiantdatAl.txt
rs4747841 10 9918166 0.94 0.31 0.68 0.63 0.31 0.50 0.16 0.76 0.47 0.80 0.26 0.38 0.96 0.27 0.49 0.65 0.55 0.75 a g
rs4749917 10 9918296 0.94 0.31 0.68 0.63 0.31 0.50 0.16 0.75 0.47 0.80 0.26 0.38 0.96 0.27 0.49 0.65 0.55 0.75 t c
rs737656 10 98252982 0.70 0.28 0.25 0.27 0.28 0.67 0.25 0.59 0.70 0.74 0.94 0.29 0.34 0.49 0.54 0.35 0.97 0.38 a g

In this example, AGRN.txt has 3 SNPs mapped.

bash-3.2$ cat AGRN.txt
rs3121561
rs2799064
rs3128126

We would like to extract the subset of SNPs in AGRN.txt from GiantdatAl.txt. This can be done with built-in R functions.

## in R (Old)  ###
t1 <- proc.time()
dat1 <- read.table("GiantdatAl.txt", colClasses=c("character", "character",
               "numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
               "numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
               "numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
               "numeric", "character", "character")  )
snps <- read.table("AGRN.txt", header=FALSE)
sdat <- dat1[ dat1[,1] %in% snps$V1, ]
t2 <- proc.time()
sdat
##                V1 V2      V3   V4    V5   V6   V7   V8    V9  V10  V11  V12
## 1165433 rs2799064  1 1022518 0.76 0.580 0.32 0.18 0.66 0.610 0.72 0.31 0.87
## 1166049 rs3128126  1 1026830 0.53 0.076 0.83 0.12 0.94 0.081 0.67 0.39 0.59
## 1168776 rs3121561  1 1055000 0.45 0.380 0.78 0.14 0.73 0.012 0.40 0.11 0.98
##          V13  V14  V15  V16  V17  V18  V19  V20  V21 V22 V23
## 1165433 0.79 0.96 0.93 0.40 0.83 0.97 0.73 0.82 0.56   t   g
## 1166049 0.17 0.78 0.47 0.73 0.15 0.48 0.81 0.63 0.58   a   g
## 1168776 0.07 0.12 0.33 0.54 0.64 0.98 0.81 0.48 0.59   t   c
t2 - t1
##   user  system elapsed
## 26.844   1.484  28.340

Now, I tried the same job using dplyr and readr. These R packages are implemented in C++ via Rcpp.

library(readr)
library(dplyr)
### in R , using readr and dplyr ###

t1 <- proc.time()
dat1 <- read_delim("GiantdatAl.txt", delim=" ", col_types=cols("c", "c", "i",
                   "d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d",
                   "d", "d", "d", "d", "d", "d", "c", "c"), col_names=FALSE )
snps <- read_delim("AGRN.txt", col_names=FALSE, delim=" ", col_types=cols("c") )
sdat <- filter(dat1, X1 %in% snps$X1)
t2 <- proc.time()
sdat
## Source: local data frame [3 x 23]
##
##          X1    X2      X3    X4    X5    X6    X7    X8    X9   X10   X11   X12
##       (chr) (chr)   (int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 rs2799064     1 1022518  0.76 0.580  0.32  0.18  0.66 0.610  0.72  0.31  0.87
## 2 rs3128126     1 1026830  0.53 0.076  0.83  0.12  0.94 0.081  0.67  0.39  0.59
## 3 rs3121561     1 1055000  0.45 0.380  0.78  0.14  0.73 0.012  0.40  0.11  0.98
## Variables not shown: X13 (dbl), X14 (dbl), X15 (dbl), X16 (dbl), X17 (dbl), X18
##    (dbl), X19 (dbl), X20 (dbl), X21 (dbl), X22 (chr), X23 (chr)
t2 - t1
##   user  system elapsed
## 13.443   0.686  14.169

Using readr and dplyr, this data manipulation can be done about twice as fast.
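
Incidentally, readr also accepts a compact string for col_types, one character per column ("c" = character, "i" = integer, "d" = double), so the cols() call above can be written more briefly as:

### equivalent compact column specification for readr ###
dat1 <- read_delim("GiantdatAl.txt", delim=" ", col_names=FALSE,
                   col_types="cciddddddddddddddddddcc")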

Let’s try data.table next. The fread function in the data.table package is reported to be a bit faster still, since it is written in pure C.

library(data.table)
### in R using data.table  ###

t1 <- proc.time()
dat1 <- fread("GiantdatAl.txt", colClasses=c("character", "character", "integer",
              "double", "double", "double", "double", "double", "double", "double",
              "double", "double", "double", "double", "double", "double", "double",
              "double", "double", "double", "double", "character", "character")  )
snps <- fread("AGRN.txt", header=FALSE)
sdat <- dat1[ V1 %chin% snps$V1, ]
t2 <- proc.time()
t2 - t1
##   user  system elapsed
## 10.839   0.331  11.174

We can see that data.table is indeed a bit faster.
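
As a side note, data.table can also do this subset as a keyed join using binary search; a minimal sketch of that alternative (not part of the timings here):

### alternative: keyed join (binary search) in data.table ###
setkey(dat1, V1)                    # sorts dat1 by rs-id, by reference
sdat <- dat1[snps$V1, nomatch=0]    # join the rs-id vector against the key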

Next, I used the awk command for the same data manipulation.

### using awk ###

t1 <- proc.time()
cmd2run <- "awk 'NR==FNR {h[$1] = $1; next} h[$1] !=\"\" {print $0}' AGRN.txt GiantdatAl.txt > exdat1.txt"
system(cmd2run)
sdat <- read.table("exdat1.txt", colClasses=c("character", "character", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "character", "character")  )
t2 <- proc.time()
t2 - t1
##  user  system elapsed
##  2.771   0.361   3.142
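
To explain the one-liner briefly: NR==FNR is true only while awk is reading the first file (AGRN.txt), so every rs-id gets stored in the array h; for the second file, a line is printed when its first field was stored. An equivalent, slightly more idiomatic form (a sketch; the timings here used the version above) is:

awk 'NR==FNR {h[$1]; next} $1 in h' AGRN.txt GiantdatAl.txt > exdat1.txt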

Yes, it is super fast: more than 3 times faster than any of the other methods compared here.

The next one uses perl. I was just curious and tried it. I wrote a perl script, TakeSubDat.pl, for this.
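
The script itself is not shown above; a minimal sketch of what TakeSubDat.pl could look like, using the same hash-lookup idea as the awk one-liner, is:

#!/usr/bin/perl
# TakeSubDat.pl (a sketch) -- usage: ./TakeSubDat.pl AGRN.txt GiantdatAl.txt
use strict;
use warnings;

my ($idfile, $datfile) = @ARGV;

# first pass: store the rs-ids to keep in a hash
my %keep;
open(my $ids, '<', $idfile) or die "cannot open $idfile: $!";
while (<$ids>) { chomp; $keep{(split)[0]} = 1; }
close($ids);

# second pass: print data lines whose first field is a stored rs-id
open(my $dat, '<', $datfile) or die "cannot open $datfile: $!";
while (<$dat>) { print if $keep{(split)[0]}; }
close($dat);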

t1 <- proc.time()
cmd2run <- "./TakeSubDat.pl AGRN.txt GiantdatAl.txt > exdat2.txt"
system(cmd2run)
sdat <- read.table("exdat2.txt", colClasses=c("character", "character", "numeric",
              "numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
              "numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
              "numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
              "character", "character")  )
t2 <- proc.time()
t2 - t1
##   user  system elapsed
## 33.550   0.350  33.935

The speed using perl was even slower than old R.

Next, I used the R microbenchmark package for a more precise speed comparison: f1() uses old R, f2() uses dplyr and readr, f3() uses data.table, f4() uses awk, and f5() uses perl. Each procedure was evaluated 273 times; a sketch of the setup and the summary table (times in seconds, consistent with the elapsed timings above) are given below. We can check that awk is very fast compared to the other methods.
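
The wrapper functions are not shown in full; a minimal sketch of the setup, assuming each fN() simply wraps the corresponding code block above (column specifications omitted for brevity), would be:

library(microbenchmark)

f1 <- function() {   # old R
  dat1 <- read.table("GiantdatAl.txt")
  snps <- read.table("AGRN.txt", header=FALSE)
  dat1[ dat1[,1] %in% snps$V1, ]
}
f2 <- function() {   # readr + dplyr
  dat1 <- read_delim("GiantdatAl.txt", delim=" ", col_names=FALSE)
  snps <- read_delim("AGRN.txt", delim=" ", col_names=FALSE)
  filter(dat1, X1 %in% snps$X1)
}
f3 <- function() {   # data.table
  dat1 <- fread("GiantdatAl.txt")
  snps <- fread("AGRN.txt", header=FALSE)
  dat1[ V1 %chin% snps$V1, ]
}
f4 <- function() {   # awk
  system("awk 'NR==FNR {h[$1] = $1; next} h[$1] !=\"\" {print $0}' AGRN.txt GiantdatAl.txt > exdat1.txt")
  read.table("exdat1.txt")
}
f5 <- function() {   # perl
  system("./TakeSubDat.pl AGRN.txt GiantdatAl.txt > exdat2.txt")
  read.table("exdat2.txt")
}

Res <- microbenchmark(f1(), f2(), f3(), f4(), f5(), times=273)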

> summary(Res)
  expr       min        lq      mean    median        uq       max neval
  1 f1() 20.309837 24.543637 28.156107 26.497877 29.282222 184.19319   273
  2 f2() 10.917517 12.781069 15.563997 14.476658 16.171709 101.48219   273
  3 f3()  8.577753 10.611985 12.609191 11.639268 13.519316  68.04706   273
  4 f4()  2.774250  3.392059  3.926914  3.625753  4.052245  18.72048   273
  5 f5() 29.320233 34.100792 36.069089 34.621827 36.131966  50.47498   273

In conclusion, dplyr, readr, and data.table are good choices for plain data manipulation since they are easy to use (simpler syntax). If speed really matters, it is worth writing an awk script.

I thank Irucka Embry for the helpful suggestions to use %chin% and the microbenchmark package. If you have better opinions or suggestions, please e-mail me at ikwak@umn.edu. Thank you! :)