jvarkit

VcfRegulomeDB

Last commit

Annotate a VCF with the Regulome data (http://regulome.stanford.edu/

Usage

Usage: vcfregulomedb [options] Files
  Options:
    -h, --help
      print help and exit
    --helpFormat
      What kind of help. One of [usage,markdown,xml].
    -o, --output
      Output file. Optional . Default: stdout
    --version
      print version and exit
    -T
      tag in vcf INFO.
      Default: REGULOMEDB
  * -b
       bed indexed with tabix. Format: chrom(tab)start(tab)end(tab)rank
    -r
      if defined, only accept the rank matching the regular expression
    -x
      (int) base pairs. look.for data around the variation +/- 'x'
      Default: 5

Compilation

Requirements / Dependencies

Download and Compile

$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ ./gradlew vcfregulomedb

The java jar file will be installed in the dist directory.

Source code

https://github.com/lindenb/jvarkit/tree/master/src/main/java/com/github/lindenb/jvarkit/tools/misc/VcfRegulomeDB.java

Contribute

License

The project is licensed under the MIT license.

Citing

Should you cite vcfregulomedb ? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md

The current reference is:

http://dx.doi.org/10.6084/m9.figshare.1425030

Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030

Building the Tabix File for regulomeDB:

(for S in 1 2 3 4 5 6 ; do curl -s "http://regulome.stanford.edu/downloads/RegulomeDB.dbSNP132.Category${S}.txt.gz"  | gunzip -c |  cut -f 1,2,5 | sed -e 's/^chrX/23/'  -e 's/^chr//'  | awk -F ' ' '{printf("%s\t%d\t%s\t%s\n",$1,int($2)-1,$2,$3);}' | uniq ; done)| LC_ALL=C sort -t ' ' -k1,1n -k2,2n -k3,3n |  sed 's/^23/X/' | bgzip -c > regulomeDB.bed.gz && tabix  -p bed -f regulomeDB.bed.gz 

Example

$   curl -kLs "https://raw.githubusercontent.com/arq5x/gemini/master/test/ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.snippet.snpEff.vcf" |\
   java -jar dist/vcfstripannot.jar -k '*' |\
   java -jar dist/vcfregulomedb.jar -b regulomeDB.bed.gz -x 10  |\
   grep -i REGU | head

##INFO=<ID=REGULOMEDB,Number=.,Type=String,Description="Format: Position|Distance|Rank">
##VcfRegulomeDBCmdLine=-b regulomeDB.bed.gz -x 10
##VcfRegulomeDBVersion=235a18b083ea15c4ad94060de512f1edc74cec42
1	10583	rs58108140	G	A	100	PASS	REGULOMEDB=10582|0|5
1	13327	rs144762171	G	C	100	PASS	REGULOMEDB=13326|0|6
1	13980	rs151276478	T	C	100	PASS	REGULOMEDB=13971|8|6,13979|0|6,13980|1|6
1	46402	.	C	CTGT	31	PASS	REGULOMEDB=46402|1|6
1	55164	rs3091274	C	A	100	PASS	REGULOMEDB=55163|0|6
1	55299	rs10399749	C	T	100	PASS	REGULOMEDB=55298|0|6
1	55313	rs182462964	A	T	100	PASS	REGULOMEDB=55321|9|6
1	55326	rs3107975	T	C	100	PASS	REGULOMEDB=55321|4|6,55325|0|6
1	55330	rs185215913	G	A	100	PASS	REGULOMEDB=55321|8|6,55325|4|6

Those SNPs can be seen at: