
Transfer information from a BED to a VCF
This program is now part of the main jvarkit tool. See jvarkit for compiling.
Usage: java -jar dist/jvarkit.jar vcfbed [options] Files
Usage: vcfbed [options] Files
Options:
--bcf-output
If this program writes a VCF to a file, The format is first guessed from
the file suffix. Otherwise, force BCF output. The current supported BCF
version is : 2.1 which is not compatible with bcftools/htslib (last
checked 2019-11-15)
Default: false
* -B, --bed, -m, --map
Tribble or Tabix bed file. Files must be indexed unless option --fast is
selected.
--bnd
for SVTYPE=BND do not try to extend the interval to the variant and the
mate breakend.
Default: false
-e, --expr, --jexl, --format
[20180124]A JEXL Expression returning a string JEXL stands for Java
EXpression Language. See
https://commons.apache.org/proper/commons-jexl/reference/syntax.html .
The variable 'bed' is the current observed BedLine (see https://github.com/lindenb/jvarkit/blob/7bddffca3899196e568fb5e1a479300c0038f74f/src/main/java/com/github/lindenb/jvarkit/util/bio/bed/BedLine.java
). The variable 'ctx' or 'variant' is the current observed variant. The
variable 'line' is the original bed line
Default: bed.get(0)+":"+bed.get(1)+"-"+bed.get(2)
--generate-vcf-md5
Generate MD5 checksum for VCF output.
Default: false
-h, --help
print help and exit
--helpFormat
What kind of help. One of [usage,markdown,xml].
-ignoreFiltered, --ignoreFiltered
Ignore FILTERed Variants (should be faster)
Default: false
--limit
limit number of annotations to that number. -1: unlimited
Default: -1
--fast, --memory
Load files in memory (faster than tribble/tabix but memory consumming)
Default: false
-mofb, --min-overlap-bed-fraction
Minimum overlap required as a fraction of BED record. A decimal number
between 0.0 and 1.0. If the value ends with '%' it is interpretted as a
percentage eg. '1%' => '0.01'. A slash '/' is interpretted as a ratio.
e.g: '1/100' => '0.01'.
-mofr, --min-overlap-fraction
Require that the minimum fraction be satisfied for VCF OR BED.A decimal
number between 0.0 and 1.0. If the value ends with '%' it is
interpretted as a percentage eg. '1%' => '0.01'. A slash '/' is
interpretted as a ratio. e.g: '1/100' => '0.01'.
-mofv, --min-overlap-vcf-fraction
Minimum overlap required as a fraction of VCF record. A decimal number
between 0.0 and 1.0. If the value ends with '%' it is interpretted as a
percentage eg. '1%' => '0.01'. A slash '/' is interpretted as a ratio.
e.g: '1/100' => '0.01'.
-M, --mutual-fraction
Mutual comparator Two SV have are the same if they share a fraction 'x'
of their bases. For very small SV the fraction can be quite small while
for large SV the fraction should be close to 1. The Syntax is the
following : (<MAX_SIZE_INCLUSIVE>:<FRACTION as double or percent>;)+ .
For example if the SV as a size of 99bp, the fraction used with be 0.6
for '10:0.1;100:0.6;1000:0.9'. For the smallest size, a simple overlap
is a positive match.. Example:"10:0.5;100:75%;1000:80%;10000:90%".
-o, --out
Output file. Optional . Default: stdout
-T, --tag
Name for the INFO tag name
Default: VCFBED
--version
print version and exit
-extend, --extend, --within-distance
Variant and BED must be within 'x' bp.A distance specified as a positive
integer.Commas are removed. The following suffixes are interpreted :
b,bp,k,kb,m,mb,g,gb
Default: 0
20180406
The project is licensed under the MIT license.
Should you cite vcfbed ? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md
The current reference is:
http://dx.doi.org/10.6084/m9.figshare.1425030
Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030
Map the NCBI biosystems to a BED file using the following script: https://gist.github.com/6024788
$ gunzip -c ~/ncbibiosystem.bed.gz | head
1 69091 70008 79501 106356 30 Signaling_by_GPCR
1 69091 70008 79501 106383 50 Olfactory_Signaling_Pathway
1 69091 70008 79501 119548 40 GPCR_downstream_signaling
1 69091 70008 79501 477114 30 Signal_Transduction
1 69091 70008 79501 498 40 Olfactory_transduction
1 69091 70008 79501 83087 60 Olfactory_transduction
1 367640 368634 26683 106356 30 Signaling_by_GPCR
1 367640 368634 26683 106383 50 Olfactory_Signaling_Pathway
1 367640 368634 26683 119548 40 GPCR_downstream_signaling
1 367640 368634 26683 477114 30 Signal_Transduction
Now, annotate a remote VCF with the data of NCBI biosystems.
curl "https://raw.github.com/arq5x/gemini/master/test/test1.snpeff.vcf" |\
sed 's/^chr//' |\
java -jar jvarkit.jar vcfbed -B ~/ncbibiosystem.bed.gz -T NCBIBIOSYS -f '($4|$5|$6|$7)' |\
grep -E '(^#CHR|NCBI)'
##INFO=<ID=NCBIBIOSYS,Number=.,Type=String,Description="metadata added from /home/lindenb/ncbibiosystem.bed.gz . Format was ($4|$5|$6|$7)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1094PC0005 1094PC0009 1094PC0012 1094PC0013
1 69270 . A G 2694.18 . AC=40;AF=1.000;AN=40;DP=83;Dels=0.00;EFF=SYNONYMOUS_CODING(LOW|SILENT|tcA/tcG|S60|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=0.000;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.0598;MQ=31.06;MQ0=0;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=32.86 GT:AD:DP:GQ:PL ./. ./. 1/1:0,3:3:9.03:106,9,0 1/1:0,6:6:18.05:203,18,0
1 69511 . A G 77777.27 . AC=49;AF=0.875;AN=56;BaseQRankSum=0.150;DP=2816;DS;Dels=0.00;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=21.286;HRun=0;HaplotypeScore=3.8956;InbreedingCoeff=0.0604;MQ=32.32;MQ0=0;MQRankSum=1.653;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=27.68;ReadPosRankSum=2.261 GT:AD:DP:GQ:PL ./. ./. 0/1:2,4:6:15.70:16,0,40 0/1:2,2:4:21.59:22,0,40</h:pre>
Another example:
$ tabix -h dbsnp138_00-All.vcf.gz "19:58864565-58865165" | sed '/^[^#]/s/^/chr/' |\
java -jar dist/vcfbed.jar -m your.bed -e 'bed.get(0)+"|"+bed.get(1)+"|"+bed.get(2)+"|"+bed.get(3)+"&"+bed.get(4)'
##INFO=<ID=VCFBED,Number=.,Type=String,Description="metadata added from your.bed . Format was ${1}|${2}|${3}|${4}&${5}">
(...)
chr19 58864911 rs113760967 T C . . GNO;OTHERKG;R5;RS=113760967;RSPOS=58864911;SAO=0;SLO;SSR=0;VC=SNV;VCFBED=chr19|58864565|58865165|A1BG&58864865;VP=0x050100020001000102000100;WGT=1;dbSNPBuildID=132
chr19 58865054 rs893183 T C . . CAF=[0.1299,0.8701];COMMON=1;G5;GNO;HD;KGPROD;KGPhase1;KGPilot123;OTHERKG;PH3;R5;RS=893183;RSPOS=58865054;RV;SAO=0;SLO;SSR=0;VC=SNV;VCFBED=chr19|58864565|58865165|A1BG&58864865;VLD;VP=0x05010002000115051f000100;WGT=1;dbSNPBuildID=86
chr19 58865068 rs893182 T C . . CAF=[0.1299,0.8701];COMMON=1;G5;GNO;HD;KGPROD;KGPhase1;KGPilot123;OTHERKG;PH3;R5;RS=893182;RSPOS=58865068;RV;SAO=0;SLO;SSR=0;VC=SNV;VCFBED=chr19|58864565|58865165|A1BG&58864865;VLD;VP=0x05010002000115051f000100;WGT=1;dbSNPBuildID=86
chr19 58865082 rs893181 A T . . CAF=[0.1295,0.8705];COMMON=1;G5;GNO;HD;KGPROD;KGPhase1;KGPilot123;OTHERKG;PH3;R5;RS=893181;RSPOS=58865082;RV;SAO=0;SLO;SSR=0;VC=SNV;VCFBED=chr19|58864565|58865165|A1BG&58864865;VLD;VP=0x05010002000115051f000100;WGT=1;dbSNPBuildID=86
chr19 58865091 rs893180 A G . . CAF=[0.1299,0.8701];COMMON=1;G5;GNO;HD;KGPROD;KGPhase1;KGPilot123;OTHERKG;R5;RS=893180;RSPOS=58865091;RV;SAO=0;SLO;SSR=0;VC=SNV;VCFBED=chr19|58864565|58865165|A1BG&58864865;VLD;VP=0x05010002000115051e000100;WGT=1;dbSNPBuildID=86
chr19 58865112 rs188818621 C T . . CAF=[0.9954,0.004591];COMMON=1;KGPROD;KGPhase1;R5;RS=188818621;RSPOS=58865112;SAO=0;SSR=0;VC=SNV;VCFBED=chr19|58864565|58865165|A1BG&58864865;VP=0x050000020001000014000100;WGT=1;dbSNPBuildID=135
chr19 58865164 rs80109863 C T . . CAF=[0.9949,0.005051];COMMON=1;GNO;KGPROD;KGPhase1;OTHERKG;R5;RS=80109863;RSPOS=58865164;SAO=0;SSR=0;VC=SNV;VCFBED=chr19|58864565|58865165|A1BG&58864865;VP=0x050000020001000116000100;WGT=1;dbSNPBuildID=132