Split VCF+VEP by gene/transcript using a GTF file.
Usage: vcfgtfsplitter [options] Files
Options:
--bcf
Use bcf format
Default: false
-C, --contig, --chromosome
Limit to those contigs.
Default: []
--coding
Only use gene_biotype="protein_coding".
Default: false
--upstream, --downstream
length for upstream and downstream features. A distance specified as a
positive integer.Commas are removed. The following suffixes are
interpreted : b,bp,k,kb,m,mb
Default: 1000
--features
Features to keep. Comma separated values. A set of 'cds,exon,intron,transcript,utr,utr5,utr3,stop,start,upstream,downstream,splice'
Default: cds,exon,intron,transcript,cds_utr,cds_utr5,cds_utr3,utr5,utr3,stop,start
--force
Force writing a gene/transcript even if there is no variant.
Default: false
* -g, -G, --gtf
A GTF (General Transfer Format) file. See
https://www.ensembl.org/info/website/upload/gff.html . Please note that
CDS are only detected if a start and stop codons are defined.
-h, --help
print help and exit
--helpFormat
What kind of help. One of [usage,markdown,xml].
--ignore-filtered
Ignore FILTERED variant
Default: false
--index
index files
Default: false
-m, --manifest
Manifest Bed file output containing chrom/start/end of each gene
* -o, --output
An existing directory or a filename ending with the '.zip' suffix.
--splice
distance to splice site for 'splice' feature. A distance specified as a
positive integer.Commas are removed. The following suffixes are
interpreted : b,bp,k,kb,m,mb
Default: 5
-T, --transcript
split by transcript. (default is to split per gene)
Default: false
--version
print version and exit
--xannotate
Remove annotations. Variant Attribute cleaner. The syntax is the same as
'bcftools annotate'. e.g: 'INFO/AC,INFO/ANN' Empty string does nothing.
${PATH}
. Setting JAVA_HOME is not enough : (e.g: https://github.com/lindenb/jvarkit/issues/23 )$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ ./gradlew vcfgtfsplitter
The java jar file will be installed in the dist
directory.
20191118
The project is licensed under the MIT license.
Should you cite vcfgtfsplitter ? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md
The current reference is:
http://dx.doi.org/10.6084/m9.figshare.1425030
Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030
$ java -jar dist/vcfgtfsplitter.jar -m jeter.manifest --gtf input.gtf.gz -o jeter.zip src/test/resources/test_vcf01.vcf
$ unzip -l jeter.zip
Archive: jeter.zip
Length Date Time Name
--------- ---------- ----- ----
4741 2019-11-18 10:52 d2/a0884d9bf86378a2cd0bfacef19723/ENSG00000188157.vcf.gz
4332 2019-11-18 10:52 4d/11ff4f2413d4a369ee9a51192acad3/ENSG00000131591.vcf.gz
--------- -------
9073 2 files
$ column -t jeter.manifest
#chrom start end Gene-Id Gene-Name Gene-Biotype path Count_Variants
1 955502 991496 ENSG00000188157 AGRN protein_coding d2/a0884d9bf86378a2cd0bfacef19723/ENSG00000188157.vcf.gz 20
1 1017197 1051741 ENSG00000131591 C1orf159 protein_coding 4d/11ff4f2413d4a369ee9a51192acad3/ENSG00000131591.vcf.gz 6
$ java -jar dist/vcfgtfsplitter.jar -T --index --gtf jeter.gtf -m jeter.manifest -o jeter.zip src/test/resources/test_vcf01.vcf
$ unzip -l jeter.zip
Archive: jeter.zip
Length Date Time Name
--------- ---------- ----- ----
4749 2019-11-18 10:56 93/00d3aa560bdbe3b3ff964e24303107/ENST00000379370.vcf.gz
4749 2019-11-18 10:56 93/00d3aa560bdbe3b3ff964e24303107/ENST00000379370.vcf.gz.tbi
3108 2019-11-18 10:56 6d/001c37d11b192bcd77ac089ec258f2/ENST00000477585.vcf.gz
(...)
2969 2019-11-18 10:56 3a/8e697b04e03716942b362afbe7ee51/ENST00000472741.vcf.gz
2969 2019-11-18 10:56 3a/8e697b04e03716942b362afbe7ee51/ENST00000472741.vcf.gz.tbi
2969 2019-11-18 10:56 2c/4ced556d7722b978453c38d5525ee0/ENST00000480643.vcf.gz
2969 2019-11-18 10:56 2c/4ced556d7722b978453c38d5525ee0/ENST00000480643.vcf.gz.tbi
--------- -------
175030 46 files
$ column -t jeter.manifest
#chrom start end Gene-Id Gene-Name Gene-Biotype Transcript-Id path Count_Variants
1 955502 991496 ENSG00000188157 AGRN protein_coding ENST00000379370 93/00d3aa560bdbe3b3ff964e24303107/ENST00000379370.vcf.gz 20
1 969485 976105 ENSG00000188157 AGRN protein_coding ENST00000477585 6d/001c37d11b192bcd77ac089ec258f2/ENST00000477585.vcf.gz 3
1 970246 976777 ENSG00000188157 AGRN protein_coding ENST00000469403 a9/fe6f84e1a425797d5b58ceae3a8313/ENST00000469403.vcf.gz 2
1 983908 984774 ENSG00000188157 AGRN protein_coding ENST00000492947 55/b5a514c94417efe2632aba379d8247/ENST00000492947.vcf.gz 1
1 1017197 1051461 ENSG00000131591 C1orf159 protein_coding ENST00000379339 7b/26f9faff411f6e056fefc89cd15a24/ENST00000379339.vcf.gz 6
1 1017197 1051736 ENSG00000131591 C1orf159 protein_coding ENST00000448924 56/fad8194b90771b0c3e931bc4772be1/ENST00000448924.vcf.gz 6
1 1017197 1051736 ENSG00000131591 C1orf159 protein_coding ENST00000294576 df/3aa6bebac650a6c9f59409521ae17d/ENST00000294576.vcf.gz 6
(...)