The first step is to get pathway (gene set) data. I downloaded the KEGG gene sets with gene symbols, c2.cp.kegg.v5.1.symbols.gmt, from the GSEA website.
The file contains 186 KEGG pathways in total, one per line.
bash-3.2$ awk ' END {print NR} ' c2.cp.kegg.v5.1.symbols.gmt
186
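Here are the first three pathways. Each line holds the pathway name, a URL describing it, and then the gene symbols in the pathway.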
bash-3.2$ head -n 3 data/c2.cp.kegg.v5.1.symbols.gmt
KEGG_GLYCOLYSIS_GLUCONEOGENESIS http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_GLYCOLYSIS_GLUCONEOGENESIS ACSS2 GCK PGK2 PGK1 PDHB PDHA1 PDHA2 PGM2 TPI1 ACSS1 FBP1 ADH1B HK2 ADH1C HK1 HK3 ADH4 PGAM2 ADH5 PGAM1 ADH1A ALDOC ALDH7A1 LDHAL6B PKLR LDHAL6A ENO1 PKM2 PFKP BPGM PCK2 PCK1 ALDH1B1 ALDH2 ALDH3A1 AKR1A1 FBP2 PFKM PFKL LDHC GAPDH ENO3 ENO2 PGAM4 ADH7 ADH6 LDHB ALDH1A3 ALDH3B1 ALDH3B2 ALDH9A1 ALDH3A2 GALM ALDOA DLD DLAT ALDOB G6PC2 LDHA G6PC PGM1 GPI
KEGG_CITRATE_CYCLE_TCA_CYCLE http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_CITRATE_CYCLE_TCA_CYCLE IDH3B DLST PCK2 CS PDHB PCK1 PDHA1 LOC642502 PDHA2 LOC283398 FH SDHD OGDH SDHB IDH3A SDHC IDH2 IDH1 ACO1 ACLY MDH2 DLD MDH1 DLAT OGDHL PC SDHA SUCLG1 SUCLA2 SUCLG2 IDH3G ACO2
KEGG_PENTOSE_PHOSPHATE_PATHWAY http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_PENTOSE_PHOSPHATE_PATHWAY RPE RPIA PGM2 PGLS PRPS2 FBP2 PFKM PFKL TALDO1 TKT FBP1 TKTL2 PGD RBKS ALDOA ALDOC ALDOB H6PD LOC729020 PRPS1L1 PRPS1 DERA G6PD PGM1 TKTL1 PFKP GPI
I wrote a Perl script, processkegg.pl, to process this raw data. Given the GMT file and an output directory, the program saves all 186 KEGG pathways in separate files, kegg1.txt to kegg186.txt.
bash-3.2$ mkdir kegg
bash-3.2$ ./processkegg.pl data/c2.cp.kegg.v5.1.symbols.gmt kegg
bash-3.2$ cd kegg
bash-3.2$ head -n 5 kegg1.txt
KEGG_GLYCOLYSIS_GLUCONEOGENESIS
ACSS2
GCK
PGK2
PGK1
The first row of kegg1.txt is the name of the pathway; the gene symbols in the pathway follow, one per line.
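The Perl script itself is not shown here. As a rough stand-in, the following minimal Python sketch produces the same file layout (pathway name on the first line, then one gene symbol per line). It assumes the standard GMT layout of tab-separated fields: name, description or URL, then genes. The function name `split_gmt` and the `prefix` parameter are hypothetical and are not taken from processkegg.pl.

```python
from pathlib import Path


def split_gmt(gmt_path, out_dir, prefix="kegg"):
    """Write each gene set of a GMT file to its own text file.

    Returns the number of gene sets written. Assumes tab-separated
    fields: set name, description/URL, then gene symbols.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(gmt_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            name = fields[0]
            # fields[1] is the description/URL and is not written out
            genes = [g for g in fields[2:] if g]
            count += 1
            target = out / f"{prefix}{count}.txt"
            target.write_text("\n".join([name] + genes) + "\n")
    return count
```

Calling `split_gmt("data/c2.cp.kegg.v5.1.symbols.gmt", "kegg")` should write kegg1.txt through kegg186.txt if every line of the file is well-formed.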
To perform pathway-based tests using these 186 pathways, we need the corresponding gene.info and snp.info for each pathway.
gene.info is a gene information matrix for the corresponding pathway. The 1st column is the gene ID, the 2nd column is the chromosome number, and the 3rd and 4th columns give the start and end positions of the gene. You can take a subset of the gene database downloadable from the MAGMA website.
snp.info is a SNP information matrix for the corresponding pathway. The 1st column is the SNP ID, the 2nd column is the chromosome number, and the 3rd column is the SNP location.
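One way to build these two tables is sketched below in Python. This is only an illustration under stated assumptions. The gene table is taken to be already loaded as (gene ID, chromosome, start, end) rows, using the same gene identifiers as the pathway file. A SNP is assigned to the pathway when its position falls inside the start and end of one of the pathway's genes. That assignment rule, the function names, and the toy values are all hypothetical. The gene file you download may use different identifiers or extra columns, in which case a mapping step is needed first.

```python
def build_gene_info(pathway_genes, gene_rows):
    """Keep (gene, chr, start, end) rows whose gene is in the pathway."""
    wanted = set(pathway_genes)
    return [
        (gene, str(chrom), int(start), int(end))
        for gene, chrom, start, end in gene_rows
        if gene in wanted
    ]


def build_snp_info(gene_info, snp_rows):
    """Keep (snp, chr, pos) rows that lie inside any gene in gene_info.

    Assumption: a SNP belongs to the pathway if start <= pos <= end
    for some pathway gene on the same chromosome.
    """
    kept = []
    for snp, chrom, pos in snp_rows:
        chrom, pos = str(chrom), int(pos)
        if any(chrom == c and s <= pos <= e for _g, c, s, e in gene_info):
            kept.append((snp, chrom, pos))
    return kept


# Toy values for illustration only; these are not real coordinates.
toy_genes = [("GENE_A", "1", 100, 200), ("GENE_B", "2", 50, 80)]
toy_snps = [("snp1", "1", 150), ("snp2", "1", 250), ("snp3", "2", 60)]

gene_info = build_gene_info(["GENE_A"], toy_genes)
snp_info = build_snp_info(gene_info, toy_snps)
print(gene_info)  # [('GENE_A', '1', 100, 200)]
print(snp_info)   # [('snp1', '1', 150)]
```

Each list can then be written out as a delimited file and read into R as the gene.info and snp.info matrices.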
Once you have snp.info and gene.info for each pathway, read the vignette for aSPUs and aSPUsPath, which shows how to perform the aSPUsPath test. aSPUpath and MTaSPUsPath are used in a similar way.