Filtering VCF with dynamically-compiled java expressions
This program is now part of the main jvarkit
tool. See jvarkit for compiling.
Usage: java -jar dist/jvarkit.jar vcffilterjdk [options] Files
Usage: vcffilterjdk [options] Files
Options:
--bcf-output
If this program writes a VCF to a file, The format is first guessed from
the file suffix. Otherwise, force BCF output. The current supported BCF
version is : 2.1 which is not compatible with bcftools/htslib (last
checked 2019-11-15)
Default: false
--body
user's code is the whole body of the filter class, not just the 'apply'
method.
Default: false
-e, --expression
The java code expression.
-xf, --extra-filters
[20180716] extra FILTERs names that will be added in the VCF header and
that you can add in the variant using https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/variantcontext/VariantContextBuilder.html#filter-java.lang.String-
. Multiple separated by space/comma
Default: <empty string>
-F, --filter
If not empty, variants won't be discarded and this name will be used in
the FILTER column
Default: <empty string>
--generate-vcf-md5
Generate MD5 checksum for VCF output.
Default: false
-h, --help
print help and exit
--helpFormat
What kind of help. One of [usage,markdown,xml].
--nocode
Don't show the generated code
Default: false
-o, --output
Output file. Optional . Default: stdout
-p, --pedigree
Optional pedigree file. A pedigree file. tab delimited. Columns:
family,id,father,mother,
sex:(0|.|undefined|unknown:unknown;1|male|M:male;2|female|F:female),
phenotype
(-9|?|.:unknown;1|affected|case:affected;0|unaffected|control:unaffected)
-rc, --recalc
[20180716] recalc attributes like INFO/AF, INFO/AC, INFO/AN... if the
number of genotypes has been altered. Recal is not applied if there is
no genotype.
Default: false
--saveCodeInDir
Save the generated java code in the following directory
-f, --script
The java source code file.
-vn, --variable
[20180716] how to name the VariantContext in the code. htsjdk/gatk often
use 'vc'.
Default: variant
--version
print version and exit
20170705
The project is licensed under the MIT license.
Should you cite vcffilterjdk ? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md
The current reference is:
The user code is a piece of java code that will be inserted as the method apply, or as the body of a com.github.lindenb.jvarkit.tools.vcffilterjs.VcfFilterJdk.AbstractFilter class with implements java.util.function.Function<VariantContext,Object>
:
At the time of writing the documentation, the parent class AbstractFilter is defined as:
public static class AbstractFilter
extends com.github.lindenb.jvarkit.util.vcf.VcfTools
implements Function<VariantContext,Object>
{
protected final Map<String,Object> userData = new HashMap<>();
protected final VCFHeader header;
protected AbstractFilter(final VCFHeader header) {
super(header);
this.header = header;
}
@Override
public Object apply(final VariantContext variant) {
throw new IllegalStateException("apply(variant) for AbstractFilter is not implemented");
}
// if option --pedigree is defined. Returns an instance of com.github.lindenb.jvarkit.pedigree.Pedigree
public Pedigree getPedigree();
public boolean hasPedigree();
}
where
‘userData’ will be filled with the following properties:
<"first.variant",Boolean>
the current variant is the first variant in the VCF<"last.variant",Boolean>
the current variant is the last variant in the VCFIf the user puts <"STOP",Boolean.TRUE>
in userData
the scanning of the VCF will be aborted without error.
The user code will be inserted in the following java code:
1 import java.util.*;
2 import java.util.stream.*;
3 import java.util.function.*;
4 import htsjdk.samtools.util.*;
5 import htsjdk.variant.variantcontext.*;
6 import htsjdk.variant.vcf.*;
9 public class VcfFilterJdkCustom123 extends com.github.lindenb.jvarkit.tools.vcffilterjs.VcfFilterJdk.AbstractFilter {
10 public VcfFilterJdkCustom123(final VCFHeader header) {
11 super(header);
12 }
13 @Override
14 public Object apply(final VariantContext variant) {
15 // user code starts here
16 user's code is inserted here <===================
17 // user code ends here
18 }
19 }
When the option --body
is set : the user’s code is the whole body (but the constructor) of the class
The program is then compiled in memory.
The method apply
returns an object that can either:
At first, this tool is not safe for a public Galaxy server, because the javascript code can access the filesystem. But you can use the JVM parameter
-J-Djava.security.manager
to prevent it to access the filesystem. See http://stackoverflow.com/questions/40177810
I’m interested in finding all sites (regardless of genotype call heterozygous or homozygous) where at least one of the alternative alleles have an AD value (Allelic Depth) greater than 10,
see https://bioinformatics.stackexchange.com/questions/974/
java -jar dist/vcffilterjdk.jar -e 'return variant.getGenotypes().stream().filter(G->G.hasAD() && java.util.Arrays.stream(G.getAD()).skip(1).filter(AD->AD>10).findAny().isPresent()).findAny().isPresent();'
filter homozygotes for sample NA12878
$ curl -sL "https://raw.github.com/jamescasbon/PyVCF/master/vcf/test/gatk.vcf" |\
java -jar dist/vcffilterjdk.jar -e 'return variant.getGenotype("NA12878").isHom();' | grep CHROM -A2
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT BLANK NA12878 NA12891 NA12892 NA19238 NA19239 NA19240
chr22 42522755 . C G 36.98 . AC=1;AF=0.071;AN=14;BaseQRankSum=-14.866;DP=1527;DS;Dels=0.01;FS=0.000;HRun=0;HaplotypeScore=253.4254;MQ=197.36;MQ0=2;MQRankSum=-10.810;QD=0.15;ReadPosRankSum=-17.244 GT:AD:DP:GQ:PL 0/0:26,1:27:51:0,51,570 0/0:208,40:248:99:0,236,4169 0/0:192,56:249:99:0,114,4292 0/1:179,66:245:75:75,0,3683 0/0:214,32:246:99:0,172,4235 0/0:200,49:249:61:0,61,4049 0/0:195,50:246:32:0,32,3757
chr22 42523003 rs116917064 A G 7113.55 . AC=8;AF=0.571;AN=14;BaseQRankSum=6.026;DB;DP=1433;DS;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=101.7894;MQ=182.04;MQ0=0;MQRankSum=-2.501;QD=4.96;ReadPosRankSum=8.294 GT:AD:DP:GQ:PL 0/1:10,2:12:1:1,0,257 1/1:9,173:183:99:2385,273,0 0/1:153,95:249:99:355,0,2355 0/1:140,110:250:99:1334,0,2242 0/1:164,85:249:99:1070,0,2279 0/1:160,90:250:99:1245,0,2300 0/1:156,81:238:99:724,0,2764
first and second genotype are not the same:
java -jar dist/vcffilterjdk.jar -e 'return !variant.getGenotype(0).sameGenotype(variant.getGenotype(1));'
at least 3 samples have a DP greater than 30
java -jar dist/vcffilterjdk.jar -e 'return variant.getGenotypes().stream().filter(G->G.hasDP() && G.getDP()>30).limit(3).count()> 2;'
Variant is annotated with SO:0001818 or its children ( protein_altering_variant )
$ java -jar dist/vcffilterjdk.jar -e 'return this.hasSequenceOntologyLabel(variant,"protein_altering_variant");'
or
java -jar dist/vcffilterjdk.jar -e 'return this.hasSequenceOntologyAccession(variant,"SO:0001818");'
Unphase a VCF file
java -jar dist/vcffilterjdk.jar -e 'return new VariantContextBuilder(variant).genotypes(variant.getGenotypes().stream().map(G->new GenotypeBuilder(G).phased(false).make()).collect(Collectors.toList())).make();' input.vcf
Change haploid to diploid
note: things like ‘AF’ are not fixed.
$ wget -q -O - "https://raw.githubusercontent.com/CostaLab/practical_SS2015/598ea0dddf2ef073a55ae21bc6d39ac2172eb617/data_analysis/organisms/escherichia_coli/O157H7_Sakai/IonTorrentPGM_mem/SRX185723/SRR566635/SRR566635-snps.vcf" | java -jar dist/vcffilterjdk.jar -e 'return new VariantContextBuilder(variant).genotypes(variant.getGenotypes().stream().map(G->!G.isCalled()?GenotypeBuilder.createMissing(G.getSampleName(),2):G).map(G->G.isCalled() && G.getPloidy()==1?new GenotypeBuilder(G).alleles(Arrays.asList(G.getAllele(0),G.getAllele(0))).make():G).collect(Collectors.toList())).attribute("AC",variant.getGenotypes().stream().mapToInt(G->G.isCalled() && G.getPloidy()==1 && !G.getAllele(0).isReference()?2:G.getAlleles().size()).sum()).attribute("AN",variant.getGenotypes().stream().mapToInt(G->G.isCalled() && G.getPloidy()==1 ?2:G.getAlleles().size()).sum()).make();'
set DP missing in genotypes when only AD is present:
curl -s "http://www.icbi.at/svnsimplex/CommonBioCommands/tags/simplex-1.0/CommonBioCommands/testdata/vcf/AMS1_converted_filtered_short_chr1.vcf" |\
awk '/^#CHROM/ {printf("##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"\">\n");} {print;}' | \
java -jar dist/vcffilterjdk.jar -e 'List<Genotype> gl = variant.getGenotypes().stream().map(G->{int ad[]=G.getAD();if(ad==null || ad.length==0) return G; int c= Arrays.stream(ad).sum(); return new GenotypeBuilder(G).DP(c).make();}).collect(Collectors.toList());return new VariantContextBuilder(variant).genotypes(gl).make();'
How can I access the n-th item in the m-th sample”
$ java -jar dist/vcffilterjdk.jar -e 'Genotype G=variant.getGenotype(0); return G.hasAD() && G.getAD().length>1 && G.getAD()[1]>3;' input.vcf
Identifying variants differing between control/treatment
$ java -jar dist/vcffilterjdk.jar --body -e 'List<String> cases = null,controls=null; public Object apply(final VariantContext variant) { if(cases==null) try {cases=IOUtil.slurpLines(new java.io.File("treat.txt")) ; controls= IOUtil.slurpLines(new java.io.File("control.txt")); } catch(Exception e) {throw new RuntimeIOException(e);}; for(final String S1:cases) {final Genotype G1=variant.getGenotype(S1); if(G1==null ||!G1.isCalled()) continue;for(final String S2:controls) {final Genotype G2=variant.getGenotype(S2); if(G2==null ||!G2.isCalled()) continue; if(G1.sameGenotype(G2)) return false;}} return true;}' input.vcf```
Retain sites where atleast 80% of the individuals had at least depth DP >= 10 and GQ>=20 irrespective of the reference or non-reference allele
java -jar dist/vcffilterjdk.jar -e 'return variant.getGenotypes().stream().filter(G->G.getDP()>=10 && G.getGQ()>=20).count()/(double)variant.getNSamples() > 0.8;' input.vcf
Retain sites where at least one sample has the non-reference allele with DP>= 10 and GQ >= 20.
java -jar dist/vcffilterjdk.jar -e 'return variant.getGenotypes().stream().anyMatch(G->G.getDP()>=10 && G.getGQ()>=20 && G.getAlleles().stream().anyMatch(A->A.isCalled() && !A.isReference())) ;'
creating a simple GUI for vcffilterjdk using zenity
see https://gist.github.com/lindenb/4465c0e822b175f3428029526beef80c , https://www.biostars.org/p/296145/
updating AF and MAF fields:
$ gunzip -c input.vcf.gz |\
awk '/^#CHROM/ {printf("##INFO=<ID=MAF,Number=1,Type=Float,Description=\"Min Allele Frequency\">\n##INFO=<ID=AF,Number=A,Type=Float,Description=\"Allele Frequency\">\n");} {print}' |\
java -jar dist/vcffilterjdk.jar -e 'VariantContextBuilder vcb = new VariantContextBuilder(variant); float ac = variant.getAttributeAsInt("AN",0); if(ac>0) { List<Float> af = variant.getAttributeAsIntList("AC",0).stream().map(N->N/ac).collect(Collectors.toList());vcb.attribute("AF",af);vcb.attribute("MAF",af.stream().mapToDouble(X->X.floatValue()).min().orElse(-1.0) );} return vcb.make();'
Set attribute “AA=at | …” to upper case in vcf file |
java -jar dist/vcffilterjdk.jar -e 'if(!variant.hasAttribute("AA")) return variant; String AA= variant.getAttributeAsString("AA",""); int pipe=AA.indexOf("|"); AA= AA.substring(0,pipe).toUpperCase()+AA.substring(pipe); return new VariantContextBuilder(variant).attribute("AA",AA).make();'
select LUMPY-SV familial structural variations
run with the option ‘–body’
private final List<Set<String>> sampleSetList = Arrays.asList(
new HashSet<>(Arrays.asList("S1","S2")),
new HashSet<>(Arrays.asList("S3","S4","S5")),
new HashSet<>(Arrays.asList("S6","S7","S8","S9"))
);
private boolean isSV(final Genotype g) {
if(g.getAttributeAsInt("SU",0)>0) return true;
if(g.getAttributeAsInt("SR",0)>0) return true;
return false;
}
private boolean validateSet(final VariantContext V,final Set<String> sampleSet) {
if(sampleSet.stream().
map(N->V.getGenotype(N)).
anyMatch(G->!isSV(G)) )
{
return false;
}
if(V.getGenotypes().stream().
filter(G->!sampleSet.contains(G.getSampleName())).
anyMatch(G->isSV(G)))
{
return false;
}
return true;
}
@Override
public Object apply(final VariantContext V) {
for(final Set<String> sampleSet : this.sampleSetList)
{
if(validateSet(V,sampleSet)) return true;
}
return false;
}
What I want is to assign ./. missing genotypes for sample-level genotypes in a VCF, if they fail to pass defined AD ratio (so, keeping the variant itself still if it is present in at least one sample with desired AD ratio thresholds).
return new VariantContextBuilder(variant).
genotypes(
variant.
getGenotypes().
stream().
map( G->{
if(G.hasAD()) {
final int A[]= G.getAD();
if(A.length>1 && A[0]>0 && A[1]/(double)A[0]> 0.33) return G;
}
return GenotypeBuilder.createMissing(G.getSampleName(),2);
}).
collect(Collectors.toList())
).
make();
I’ve filtered variants based on quality and depth using bcftools, but I also want to filter them that were called within 5 base pairs of each other. Is there any tools that can do this?
$ java -jar dist/vcffilterjdk.jar --body -e 'String prev_contig=null;int prev_end=-1; public Object apply(final VariantContext vc) {boolean ret=!(vc.getContig().equals(prev_contig) && vc.getStart() - prev_end <=5); prev_contig=vc.getContig();prev_end=vc.getEnd();return ret; }' in.vcf
how to filter a multi-VCF for sex-specific genotypes?
see https://bioinformatics.stackexchange.com/questions/5518
java -jar dist/vcffilterjdk.jar -e 'return variant.getGenotypes().stream().allMatch(G->(G.getSampleName().endsWith("M") && G.isHet()) || (G.getSampleName().endsWith("F") && G.isHomRef())); ' input.vcf