Transfer information from a BED to a VCF
Usage: java -jar dist/vcfbed.jar [options] Files
Usage: vcfbed [options] Files
Options:
--bcf-output
If this program writes a VCF to a file, The format is first guessed from
the file suffix. Otherwise, force BCF output. The current supported BCF
version is : 2.1 which is not compatible with bcftools/htslib (last
checked 2019-11-15)
Default: false
* -B, --bed, -m, --map
Tribble or Tabix bed file. Files must be indexed unless option --fast is
selected.
--bnd
for SVTYPE=BND do not try to extend the interval to the variant and the
mate breakend.
Default: false
-e, --expr, --jexl, --format
[20180124]A JEXL Expression returning a string JEXL stands for Java
EXpression Language. See
https://commons.apache.org/proper/commons-jexl/reference/syntax.html .
The variable 'bed' is the current observed BedLine (see https://github.com/lindenb/jvarkit/blob/7bddffca3899196e568fb5e1a479300c0038f74f/src/main/java/com/github/lindenb/jvarkit/util/bio/bed/BedLine.java
). The variable 'ctx' or 'variant' is the current observed variant. The
variable 'line' is the original bed line
Default: bed.get(0)+":"+bed.get(1)+"-"+bed.get(2)
--generate-vcf-md5
Generate MD5 checksum for VCF output.
Default: false
-h, --help
print help and exit
--helpFormat
What kind of help. One of [usage,markdown,xml].
-ignoreFiltered, --ignoreFiltered
[20171031] Ignore FILTERed Variants (should be faster)
Default: false
--fast, --memory
Load files in memory (faster than tribble/tabix but memory consumming)
Default: false
-mofb, --min-overlap-bed-fraction
Minimum overlap required as a fraction of BED record. A decimal number
between 0.0 and 1.0. If the value ends with '%' it is interpretted as a
percentage eg. '1%' => '0.01'. A slash '/' is interpretted as a ratio.
e.g: '1/100' => '0.01'.
-mofr, --min-overlap-fraction
Require that the minimum fraction be satisfied for VCF OR BED.A decimal
number between 0.0 and 1.0. If the value ends with '%' it is
interpretted as a percentage eg. '1%' => '0.01'. A slash '/' is
interpretted as a ratio. e.g: '1/100' => '0.01'.
-mofv, --min-overlap-vcf-fraction
Minimum overlap required as a fraction of VCF record. A decimal number
between 0.0 and 1.0. If the value ends with '%' it is interpretted as a
percentage eg. '1%' => '0.01'. A slash '/' is interpretted as a ratio.
e.g: '1/100' => '0.01'.
-o, --out
Output file. Optional . Default: stdout
-T, --tag
Name for the INFO tag name
Default: VCFBED
--version
print version and exit
-extend, --extend, --within-distance
Variant and BED must be within 'x' bp.A distance specified as a positive
integer.Commas are removed. The following suffixes are interpreted :
b,bp,k,kb,m,mb,g,gb
Default: 0
${PATH}
. Setting JAVA_HOME is not enough : (e.g: https://github.com/lindenb/jvarkit/issues/23 )$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ ./gradlew vcfbed
The java jar file will be installed in the dist
directory.
20180406
The project is licensed under the MIT license.
Should you cite vcfbed ? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md
The current reference is:
http://dx.doi.org/10.6084/m9.figshare.1425030
Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030
Map the NCBI biosystems to a BED file using the following script: https://gist.github.com/6024788
$ gunzip -c ~/ncbibiosystem.bed.gz | head
1 69091 70008 79501 106356 30 Signaling_by_GPCR
1 69091 70008 79501 106383 50 Olfactory_Signaling_Pathway
1 69091 70008 79501 119548 40 GPCR_downstream_signaling
1 69091 70008 79501 477114 30 Signal_Transduction
1 69091 70008 79501 498 40 Olfactory_transduction
1 69091 70008 79501 83087 60 Olfactory_transduction
1 367640 368634 26683 106356 30 Signaling_by_GPCR
1 367640 368634 26683 106383 50 Olfactory_Signaling_Pathway
1 367640 368634 26683 119548 40 GPCR_downstream_signaling
1 367640 368634 26683 477114 30 Signal_Transduction
Now, annotate a remote VCF with the data of NCBI biosystems.
curl "https://raw.github.com/arq5x/gemini/master/test/test1.snpeff.vcf" |\
sed 's/^chr//' |\
java -jar dist/vcfbed.jar -B ~/ncbibiosystem.bed.gz -T NCBIBIOSYS -f '($4|$5|$6|$7)' |\
grep -E '(^#CHR|NCBI)'
##INFO=<ID=NCBIBIOSYS,Number=.,Type=String,Description="metadata added from /home/lindenb/ncbibiosystem.bed.gz . Format was ($4|$5|$6|$7)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1094PC0005 1094PC0009 1094PC0012 1094PC0013
1 69270 . A G 2694.18 . AC=40;AF=1.000;AN=40;DP=83;Dels=0.00;EFF=SYNONYMOUS_CODING(LOW|SILENT|tcA/tcG|S60|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=0.000;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.0598;MQ=31.06;MQ0=0;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=32.86 GT:AD:DP:GQ:PL ./. ./. 1/1:0,3:3:9.03:106,9,0 1/1:0,6:6:18.05:203,18,0
1 69511 . A G 77777.27 . AC=49;AF=0.875;AN=56;BaseQRankSum=0.150;DP=2816;DS;Dels=0.00;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=21.286;HRun=0;HaplotypeScore=3.8956;InbreedingCoeff=0.0604;MQ=32.32;MQ0=0;MQRankSum=1.653;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=27.68;ReadPosRankSum=2.261 GT:AD:DP:GQ:PL ./. ./. 0/1:2,4:6:15.70:16,0,40 0/1:2,2:4:21.59:22,0,40</h:pre>
Another example:
$ tabix -h dbsnp138_00-All.vcf.gz "19:58864565-58865165" | sed '/^[^#]/s/^/chr/' |\
java -jar dist/vcfbed.jar -m your.bed -e 'bed.get(0)+"|"+bed.get(1)+"|"+bed.get(2)+"|"+bed.get(3)+"&"+bed.get(4)'
##INFO=<ID=VCFBED,Number=.,Type=String,Description="metadata added from your.bed . Format was ${1}|${2}|${3}|${4}&${5}">
(...)
chr19 58864911 rs113760967 T C . . GNO;OTHERKG;R5;RS=113760967;RSPOS=58864911;SAO=0;SLO;SSR=0;VC=SNV;VCFBED=chr19|58864565|58865165|A1BG&58864865;VP=0x050100020001000102000100;WGT=1;dbSNPBuildID=132
chr19 58865054 rs893183 T C . . CAF=[0.1299,0.8701];COMMON=1;G5;GNO;HD;KGPROD;KGPhase1;KGPilot123;OTHERKG;PH3;R5;RS=893183;RSPOS=58865054;RV;SAO=0;SLO;SSR=0;VC=SNV;VCFBED=chr19|58864565|58865165|A1BG&58864865;VLD;VP=0x05010002000115051f000100;WGT=1;dbSNPBuildID=86
chr19 58865068 rs893182 T C . . CAF=[0.1299,0.8701];COMMON=1;G5;GNO;HD;KGPROD;KGPhase1;KGPilot123;OTHERKG;PH3;R5;RS=893182;RSPOS=58865068;RV;SAO=0;SLO;SSR=0;VC=SNV;VCFBED=chr19|58864565|58865165|A1BG&58864865;VLD;VP=0x05010002000115051f000100;WGT=1;dbSNPBuildID=86
chr19 58865082 rs893181 A T . . CAF=[0.1295,0.8705];COMMON=1;G5;GNO;HD;KGPROD;KGPhase1;KGPilot123;OTHERKG;PH3;R5;RS=893181;RSPOS=58865082;RV;SAO=0;SLO;SSR=0;VC=SNV;VCFBED=chr19|58864565|58865165|A1BG&58864865;VLD;VP=0x05010002000115051f000100;WGT=1;dbSNPBuildID=86
chr19 58865091 rs893180 A G . . CAF=[0.1299,0.8701];COMMON=1;G5;GNO;HD;KGPROD;KGPhase1;KGPilot123;OTHERKG;R5;RS=893180;RSPOS=58865091;RV;SAO=0;SLO;SSR=0;VC=SNV;VCFBED=chr19|58864565|58865165|A1BG&58864865;VLD;VP=0x05010002000115051e000100;WGT=1;dbSNPBuildID=86
chr19 58865112 rs188818621 C T . . CAF=[0.9954,0.004591];COMMON=1;KGPROD;KGPhase1;R5;RS=188818621;RSPOS=58865112;SAO=0;SSR=0;VC=SNV;VCFBED=chr19|58864565|58865165|A1BG&58864865;VP=0x050000020001000014000100;WGT=1;dbSNPBuildID=135
chr19 58865164 rs80109863 C T . . CAF=[0.9949,0.005051];COMMON=1;GNO;KGPROD;KGPhase1;OTHERKG;R5;RS=80109863;RSPOS=58865164;SAO=0;SSR=0;VC=SNV;VCFBED=chr19|58864565|58865165|A1BG&58864865;VP=0x050000020001000116000100;WGT=1;dbSNPBuildID=132