Split VCF+VEP by gene/transcript.
This program is now part of the main jvarkit
tool. See jvarkit for compiling.
Usage: java -jar dist/jvarkit.jar vcfgenesplitter [options] Files
Usage: vcfgenesplitter [options] Files
Options:
-e, -E, --extractors
Gene Extractors Name. Space/semicolon/Comma separated. custom:tag is a
custom extractor extracting all the values for INFO/tag as one or more
gene name
Default: ANN/GeneId VEP/GeneId
-h, --help
print help and exit
--helpFormat
What kind of help. One of [usage,markdown,xml].
--ignore-filtered
Ignore FILTERED variant
Default: false
-l, --list
list all available extractors
-m, --manifest
Manifest Bed file output containing chrom/start/end of each gene
-M, --max-variant
Maximum number of variants required to write a vcf. don't write if
num(variant) > 'x' . '<=0' is ignore
Default: -1
-n, --min-variant
Minimum number of variants required to write a vcf. don't write if
num(variant) < 'x'
Default: 1
--open-max
Maximum number of opened VCF writers at the same time.
Default: 100
* -o, --output
An existing directory or a filename ending with the '.zip' or '.tar' or
'.tar.gz' suffix.
--version
print version and exit
20160310
The project is licensed under the MIT license.
Should you cite vcfgenesplitter ? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md
The current reference is:
http://dx.doi.org/10.6084/m9.figshare.1425030
Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030
java -jar dist/vcfgenesplitter.jar -o jeter.zip src/test/resources/rotavirus_rf.ann.vcf.gz -m jeter.mf
$ unzip -l jeter.zip
Archive: jeter.zip
Length Date Time Name
--------- ---------- ----- ----
1565 2019-05-27 11:26 2c/8fb9d2539e3f30d1d9b06f9ec54c4c/Gene_18_3284.vcf.gz
2278 2019-05-27 11:26 4e/4897c51fe2dd067a8b75c19f111477/Gene_1621_1636.vcf.gz
2278 2019-05-27 11:26 74/ca4273c3d5803c5865891c808234da/UniProtKB_Swiss-Prot:P12472.vcf.gz
2264 2019-05-27 11:26 23/6b59cfe4fdd33a5f4feeb55521dd34/Gene_50_2557.vcf.gz
2169 2019-05-27 11:26 b3/4bda8d8502e64e442fce077e45ded6/Gene_9_2339.vcf.gz
2106 2019-05-27 11:26 b7/83f96c410c7cd75bc732d44a1522a7/Gene_32_1507.vcf.gz
2023 2019-05-27 11:26 6f/8472e9f192c92bf46e4893b2367b7e/Gene_23_1216.vcf.gz
1862 2019-05-27 11:26 3c/513d82eaea18447dd5f621f92b40e6/Gene_0_1073.vcf.gz
1655 2019-05-27 11:26 84/977eac8cdef861cbd3109209675d21/Gene_0_1058.vcf.gz
1754 2019-05-27 11:26 db/aee9cc8f5c9c3d39c7af4cec63b7a5/Gene_0_1061.vcf.gz
1746 2019-05-27 11:26 b0/133c483f0ea676f8d29ab1f2daee5d/Gene_41_568.vcf.gz
1664 2019-05-27 11:26 59/0fd5c1e8d6d60a986a0021fe357514/Gene_20_616.vcf.gz
1663 2019-05-27 11:26 83/bc905cf311428ab80ce59aaf503838/Gene_78_374.vcf.gz
--------- -------
25027 13 files
$ cat jeter.mf
#chrom start end key path Count_Variants
RF01 969 970 ANN/GeneId Gene_18_3284 2c/8fb9d2539e3f30d1d9b06f9ec54c4c/Gene_18_3284.vcf.gz 1
RF02 250 1965 ANN/GeneId Gene_1621_1636 4e/4897c51fe2dd067a8b75c19f111477/Gene_1621_1636.vcf.gz 5
RF02 250 1965 ANN/GeneId UniProtKB/Swiss-Prot:P12472 74/ca4273c3d5803c5865891c808234da/UniProtKB_Swiss-Prot:P12472.vcf.gz 5
RF03 1220 2573 ANN/GeneId Gene_50_2557 23/6b59cfe4fdd33a5f4feeb55521dd34/Gene_50_2557.vcf.gz 8
RF04 886 1920 ANN/GeneId Gene_9_2339 b3/4bda8d8502e64e442fce077e45ded6/Gene_9_2339.vcf.gz 7
RF05 40 1339 ANN/GeneId Gene_32_1507 b7/83f96c410c7cd75bc732d44a1522a7/Gene_32_1507.vcf.gz 6
RF06 516 1132 ANN/GeneId Gene_23_1216 6f/8472e9f192c92bf46e4893b2367b7e/Gene_23_1216.vcf.gz 5
RF07 97 952 ANN/GeneId Gene_0_1073 3c/513d82eaea18447dd5f621f92b40e6/Gene_0_1073.vcf.gz 4
RF08 925 992 ANN/GeneId Gene_0_1058 84/977eac8cdef861cbd3109209675d21/Gene_0_1058.vcf.gz 2
RF09 293 414 ANN/GeneId Gene_0_1061 db/aee9cc8f5c9c3d39c7af4cec63b7a5/Gene_0_1061.vcf.gz 3
RF10 45 175 ANN/GeneId Gene_41_568 b0/133c483f0ea676f8d29ab1f2daee5d/Gene_41_568.vcf.gz 3
RF11 73 79 ANN/GeneId Gene_20_616 59/0fd5c1e8d6d60a986a0021fe357514/Gene_20_616.vcf.gz 1
RF11 73 79 ANN/GeneId Gene_78_374 83/bc905cf311428ab80ce59aaf503838/Gene_78_374.vcf.gz 1