
Split VCF+VEP by gene/transcript.
This program is now part of the main jvarkit tool. See jvarkit for compiling.
Usage: java -jar dist/jvarkit.jar vcfgenesplitter [options] Files
Usage: vcfgenesplitter [options] Files
Options:
--disable-hash-directory, --dhd
disable default which is to save each file in a checksum-based
directory-a-la-nextflow to avoid a large number of files in the same
directory.
Default: false
-e, -E, --extractors
Gene Extractors Name. Space/semicolon/Comma separated. custom:tag is a
custom extractor extracting all the values for INFO/tag as one or more
gene name. +x is a custom extractor using sliding windows of integer
size=x (e.g: '+10000' or '+1Mb' )
Default: ANN/GeneId VEP/GeneId
-h, --help
print help and exit
--helpFormat
What kind of help. One of [usage,markdown,xml].
--ignore-filtered
Ignore FILTERED variant
Default: false
-l, --list
list all available extractors
-m, --manifest
Manifest BED file output containing chrom/POS of each gene
--maxRecordsInRam
When writing files that need to be sorted, this will specify the number
of records stored in RAM before spilling to disk. Increasing this number
reduces the number of file handles needed to sort a file, and increases
the amount of RAM needed
Default: 50000
* -o, --output
An existing directory or a filename ending with the '.zip' or '.tar' or
'.tar.gz' suffix.
--prefix
prefix each output VCF file with this string
Default: <empty string>
--tmpDir
tmp working directory. Default: java.io.tmpDir
Default: []
--version
print version and exit
20160310
The project is licensed under the MIT license.
Should you cite vcfgenesplitter ? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md
The current reference is:
http://dx.doi.org/10.6084/m9.figshare.1425030
Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030
java -jar dist/vcfgenesplitter.jar -o jeter.zip src/test/resources/rotavirus_rf.ann.vcf.gz -m jeter.mf
$ unzip -l jeter.zip
Archive: jeter.zip
Length Date Time Name
--------- ---------- ----- ----
1565 2019-05-27 11:26 2c/8fb9d2539e3f30d1d9b06f9ec54c4c/Gene_18_3284.vcf.gz
2278 2019-05-27 11:26 4e/4897c51fe2dd067a8b75c19f111477/Gene_1621_1636.vcf.gz
2278 2019-05-27 11:26 74/ca4273c3d5803c5865891c808234da/UniProtKB_Swiss-Prot:P12472.vcf.gz
2264 2019-05-27 11:26 23/6b59cfe4fdd33a5f4feeb55521dd34/Gene_50_2557.vcf.gz
2169 2019-05-27 11:26 b3/4bda8d8502e64e442fce077e45ded6/Gene_9_2339.vcf.gz
2106 2019-05-27 11:26 b7/83f96c410c7cd75bc732d44a1522a7/Gene_32_1507.vcf.gz
2023 2019-05-27 11:26 6f/8472e9f192c92bf46e4893b2367b7e/Gene_23_1216.vcf.gz
1862 2019-05-27 11:26 3c/513d82eaea18447dd5f621f92b40e6/Gene_0_1073.vcf.gz
1655 2019-05-27 11:26 84/977eac8cdef861cbd3109209675d21/Gene_0_1058.vcf.gz
1754 2019-05-27 11:26 db/aee9cc8f5c9c3d39c7af4cec63b7a5/Gene_0_1061.vcf.gz
1746 2019-05-27 11:26 b0/133c483f0ea676f8d29ab1f2daee5d/Gene_41_568.vcf.gz
1664 2019-05-27 11:26 59/0fd5c1e8d6d60a986a0021fe357514/Gene_20_616.vcf.gz
1663 2019-05-27 11:26 83/bc905cf311428ab80ce59aaf503838/Gene_78_374.vcf.gz
--------- -------
25027 13 files
$ cat jeter.mf
#chrom POS key path Count_Variants
RF01 969 ANN/GeneId Gene_18_3284 2c/8fb9d2539e3f30d1d9b06f9ec54c4c/Gene_18_3284.vcf.gz 1
RF02 250 ANN/GeneId Gene_1621_1636 4e/4897c51fe2dd067a8b75c19f111477/Gene_1621_1636.vcf.gz 5
RF02 250 ANN/GeneId UniProtKB/Swiss-Prot:P12472 74/ca4273c3d5803c5865891c808234da/UniProtKB_Swiss-Prot:P12472.vcf.gz 5
RF03 1220 ANN/GeneId Gene_50_2557 23/6b59cfe4fdd33a5f4feeb55521dd34/Gene_50_2557.vcf.gz 8
RF04 886 ANN/GeneId Gene_9_2339 b3/4bda8d8502e64e442fce077e45ded6/Gene_9_2339.vcf.gz 7
RF05 40 ANN/GeneId Gene_32_1507 b7/83f96c410c7cd75bc732d44a1522a7/Gene_32_1507.vcf.gz 6
RF06 516 ANN/GeneId Gene_23_1216 6f/8472e9f192c92bf46e4893b2367b7e/Gene_23_1216.vcf.gz 5
RF07 97 ANN/GeneId Gene_0_1073 3c/513d82eaea18447dd5f621f92b40e6/Gene_0_1073.vcf.gz 4
RF08 925 ANN/GeneId Gene_0_1058 84/977eac8cdef861cbd3109209675d21/Gene_0_1058.vcf.gz 2
RF09 293 ANN/GeneId Gene_0_1061 db/aee9cc8f5c9c3d39c7af4cec63b7a5/Gene_0_1061.vcf.gz 3
RF10 45 ANN/GeneId Gene_41_568 b0/133c483f0ea676f8d29ab1f2daee5d/Gene_41_568.vcf.gz 3
RF11 73 ANN/GeneId Gene_20_616 59/0fd5c1e8d6d60a986a0021fe357514/Gene_20_616.vcf.gz 1
RF11 73 ANN/GeneId Gene_78_374 83/bc905cf311428ab80ce59aaf503838/Gene_78_374.vcf.gz 1