jvarkit

VcfFilterJdk

Last commit

Filtering VCF with dynamically-compiled java expressions

Usage

This program is now part of the main jvarkit tool. See jvarkit for compiling.

Usage: java -jar dist/jvarkit.jar vcffilterjdk  [options] Files

Usage: vcffilterjdk [options] Files
  Options:
    --bcf-output
      If this program writes a VCF to a file, The format is first guessed from 
      the file suffix. Otherwise, force BCF output. The current supported BCF 
      version is : 2.1 which is not compatible with bcftools/htslib (last 
      checked 2019-11-15)
      Default: false
    --body
      user's code is the whole body of the filter class, not just the 'apply' 
      method. 
      Default: false
    -e, --expression
      The java code expression.
    -xf, --extra-filters
      [20180716] extra FILTERs names that will be added in the VCF header and 
      that you can add in the variant using https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/variantcontext/VariantContextBuilder.html#filter-java.lang.String- 
      . Multiple separated by space/comma
      Default: <empty string>
    -F, --filter
      If not empty, variants won't be discarded and this name will be used in 
      the FILTER column
      Default: <empty string>
    --generate-vcf-md5
      Generate MD5 checksum for VCF output.
      Default: false
    -h, --help
      print help and exit
    --helpFormat
      What kind of help. One of [usage,markdown,xml].
    --nocode
       Don't show the generated code
      Default: false
    -o, --output
      Output file. Optional . Default: stdout
    -p, --pedigree
      Optional pedigree file. A pedigree file. tab delimited. Columns: 
      family,id,father,mother, 
      sex:(0|.|undefined|unknown:unknown;1|male|M:male;2|female|F:female), 
      phenotype 
      (-9|?|.:unknown;1|affected|case:affected;0|unaffected|control:unaffected) 
    -rc, --recalc
      [20180716] recalc attributes like INFO/AF, INFO/AC, INFO/AN... if the 
      number of genotypes has been altered. Recal is not applied if there is 
      no genotype.
      Default: false
    --saveCodeInDir
      Save the generated java code in the following directory
    -f, --script
      The java source code file.
    -vn, --variable
      [20180716] how to name the VariantContext in the code. htsjdk/gatk often 
      use 'vc'.
      Default: variant
    --version
      print version and exit

Keywords

See also in Biostars

Creation Date

20170705

Source code

https://github.com/lindenb/jvarkit/tree/master/src/main/java/com/github/lindenb/jvarkit/tools/vcffilterjs/VcfFilterJdk.java

Unit Tests

https://github.com/lindenb/jvarkit/tree/master/src/test/java/com/github/lindenb/jvarkit/tools/vcffilterjs/VcfFilterJdkTest.java

Contribute

License

The project is licensed under the MIT license.

Citing

Should you cite vcffilterjdk ? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md

The current reference is:

About the script

The user code is a piece of java code that will be inserted as the method apply, or as the body of a com.github.lindenb.jvarkit.tools.vcffilterjs.VcfFilterJdk.AbstractFilter class with implements java.util.function.Function<VariantContext,Object>:

At the time of writing the documentation, the parent class AbstractFilter is defined as:

public static class AbstractFilter
	extends com.github.lindenb.jvarkit.util.vcf.VcfTools
	implements Function<VariantContext,Object>
	{
	protected final Map<String,Object> userData = new HashMap<>();
	protected final VCFHeader header;
	protected AbstractFilter(final VCFHeader header) {
		super(header);
		this.header = header;
		}
	@Override
	public Object apply(final VariantContext variant) {
		throw new IllegalStateException("apply(variant) for AbstractFilter is not implemented");
		}
	// if option --pedigree is defined. Returns an instance of com.github.lindenb.jvarkit.pedigree.Pedigree
	public Pedigree getPedigree();
    public boolean hasPedigree();
	}

where

‘userData’ will be filled with the following properties:

If the user puts <"STOP",Boolean.TRUE> in userData the scanning of the VCF will be aborted without error.

The user code will be inserted in the following java code:

 1  import java.util.*;
 2  import java.util.stream.*;
 3  import java.util.function.*;
 4  import htsjdk.samtools.util.*;
 5  import htsjdk.variant.variantcontext.*;
 6  import htsjdk.variant.vcf.*;
 9  public class VcfFilterJdkCustom123 extends com.github.lindenb.jvarkit.tools.vcffilterjs.VcfFilterJdk.AbstractFilter {
10    public VcfFilterJdkCustom123(final VCFHeader header) {
11    super(header);
12    }
13    @Override
14    public Object apply(final VariantContext variant) {
15     // user code starts here 
16     user's code is inserted here <===================
17     // user code ends here 
18     }
19  }

When the option --body is set : the user’s code is the whole body (but the constructor) of the class

The program is then compiled in memory.

The method apply returns an object that can either:

See also

About Galaxy

At first, this tool is not safe for a public Galaxy server, because the javascript code can access the filesystem. But you can use the JVM parameter

-J-Djava.security.manager

to prevent it to access the filesystem. See http://stackoverflow.com/questions/40177810

Examples

Example

I’m interested in finding all sites (regardless of genotype call heterozygous or homozygous) where at least one of the alternative alleles have an AD value (Allelic Depth) greater than 10,

see https://bioinformatics.stackexchange.com/questions/974/

java -jar dist/vcffilterjdk.jar -e 'return variant.getGenotypes().stream().filter(G->G.hasAD() && java.util.Arrays.stream(G.getAD()).skip(1).filter(AD->AD>10).findAny().isPresent()).findAny().isPresent();' 

Example

filter homozygotes for sample NA12878

$ curl -sL "https://raw.github.com/jamescasbon/PyVCF/master/vcf/test/gatk.vcf" |\
	java -jar dist/vcffilterjdk.jar  -e 'return variant.getGenotype("NA12878").isHom();' | grep CHROM -A2

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	BLANK	NA12878	NA12891	NA12892	NA19238	NA19239	NA19240
chr22	42522755	.	C	G	36.98	.	AC=1;AF=0.071;AN=14;BaseQRankSum=-14.866;DP=1527;DS;Dels=0.01;FS=0.000;HRun=0;HaplotypeScore=253.4254;MQ=197.36;MQ0=2;MQRankSum=-10.810;QD=0.15;ReadPosRankSum=-17.244	GT:AD:DP:GQ:PL	0/0:26,1:27:51:0,51,570	0/0:208,40:248:99:0,236,4169	0/0:192,56:249:99:0,114,4292	0/1:179,66:245:75:75,0,3683	0/0:214,32:246:99:0,172,4235	0/0:200,49:249:61:0,61,4049	0/0:195,50:246:32:0,32,3757
chr22	42523003	rs116917064	A	G	7113.55	.	AC=8;AF=0.571;AN=14;BaseQRankSum=6.026;DB;DP=1433;DS;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=101.7894;MQ=182.04;MQ0=0;MQRankSum=-2.501;QD=4.96;ReadPosRankSum=8.294	GT:AD:DP:GQ:PL	0/1:10,2:12:1:1,0,257	1/1:9,173:183:99:2385,273,0	0/1:153,95:249:99:355,0,2355	0/1:140,110:250:99:1334,0,2242	0/1:164,85:249:99:1070,0,2279	0/1:160,90:250:99:1245,0,2300	0/1:156,81:238:99:724,0,2764

Example

first and second genotype are not the same:

java -jar dist/vcffilterjdk.jar -e 'return !variant.getGenotype(0).sameGenotype(variant.getGenotype(1));' 

Example

at least 3 samples have a DP greater than 30

java -jar dist/vcffilterjdk.jar -e 'return variant.getGenotypes().stream().filter(G->G.hasDP() && G.getDP()>30).limit(3).count()> 2;' 

Example

Variant is annotated with SO:0001818 or its children ( protein_altering_variant )

$ java -jar dist/vcffilterjdk.jar -e 'return this.hasSequenceOntologyLabel(variant,"protein_altering_variant");' 

or

java -jar dist/vcffilterjdk.jar -e 'return this.hasSequenceOntologyAccession(variant,"SO:0001818");' 

Example

Unphase a VCF file

java -jar dist/vcffilterjdk.jar -e 'return new VariantContextBuilder(variant).genotypes(variant.getGenotypes().stream().map(G->new GenotypeBuilder(G).phased(false).make()).collect(Collectors.toList())).make();' input.vcf

Example

Change haploid to diploid

note: things like ‘AF’ are not fixed.

$ wget -q -O - "https://raw.githubusercontent.com/CostaLab/practical_SS2015/598ea0dddf2ef073a55ae21bc6d39ac2172eb617/data_analysis/organisms/escherichia_coli/O157H7_Sakai/IonTorrentPGM_mem/SRX185723/SRR566635/SRR566635-snps.vcf" | java -jar dist/vcffilterjdk.jar -e 'return new VariantContextBuilder(variant).genotypes(variant.getGenotypes().stream().map(G->!G.isCalled()?GenotypeBuilder.createMissing(G.getSampleName(),2):G).map(G->G.isCalled() && G.getPloidy()==1?new GenotypeBuilder(G).alleles(Arrays.asList(G.getAllele(0),G.getAllele(0))).make():G).collect(Collectors.toList())).attribute("AC",variant.getGenotypes().stream().mapToInt(G->G.isCalled() && G.getPloidy()==1 && !G.getAllele(0).isReference()?2:G.getAlleles().size()).sum()).attribute("AN",variant.getGenotypes().stream().mapToInt(G->G.isCalled() && G.getPloidy()==1 ?2:G.getAlleles().size()).sum()).make();'

Example

set DP missing in genotypes when only AD is present:

curl -s "http://www.icbi.at/svnsimplex/CommonBioCommands/tags/simplex-1.0/CommonBioCommands/testdata/vcf/AMS1_converted_filtered_short_chr1.vcf" |\
awk '/^#CHROM/ {printf("##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"\">\n");} {print;}' | \
java -jar dist/vcffilterjdk.jar  -e 'List<Genotype> gl = variant.getGenotypes().stream().map(G->{int ad[]=G.getAD();if(ad==null || ad.length==0) return G; int c= Arrays.stream(ad).sum(); return new GenotypeBuilder(G).DP(c).make();}).collect(Collectors.toList());return new VariantContextBuilder(variant).genotypes(gl).make();'

Example

How can I access the n-th item in the m-th sample”

$ java -jar dist/vcffilterjdk.jar -e 'Genotype G=variant.getGenotype(0); return G.hasAD() && G.getAD().length>1 &&  G.getAD()[1]>3;' input.vcf

Example

Identifying variants differing between control/treatment

$ java -jar dist/vcffilterjdk.jar --body -e 'List<String> cases = null,controls=null;  public Object apply(final VariantContext variant) { if(cases==null) try {cases=IOUtil.slurpLines(new java.io.File("treat.txt")) ; controls= IOUtil.slurpLines(new java.io.File("control.txt")); } catch(Exception e) {throw new RuntimeIOException(e);}; for(final String S1:cases) {final Genotype G1=variant.getGenotype(S1); if(G1==null ||!G1.isCalled()) continue;for(final String S2:controls) {final Genotype G2=variant.getGenotype(S2); if(G2==null ||!G2.isCalled()) continue;   if(G1.sameGenotype(G2)) return false;}} return true;}' input.vcf```

Example

Retain sites where atleast 80% of the individuals had at least depth DP >= 10 and GQ>=20 irrespective of the reference or non-reference allele

java -jar dist/vcffilterjdk.jar -e 'return variant.getGenotypes().stream().filter(G->G.getDP()>=10 && G.getGQ()>=20).count()/(double)variant.getNSamples() > 0.8;' input.vcf

Retain sites where at least one sample has the non-reference allele with DP>= 10 and GQ >= 20.

java -jar dist/vcffilterjdk.jar -e 'return variant.getGenotypes().stream().anyMatch(G->G.getDP()>=10 && G.getGQ()>=20 && G.getAlleles().stream().anyMatch(A->A.isCalled() && !A.isReference())) ;'

Example

creating a simple GUI for vcffilterjdk using zenity

see https://gist.github.com/lindenb/4465c0e822b175f3428029526beef80c , https://www.biostars.org/p/296145/

capture

Example

updating AF and MAF fields:

$ gunzip -c  input.vcf.gz |\
 awk '/^#CHROM/ {printf("##INFO=<ID=MAF,Number=1,Type=Float,Description=\"Min Allele Frequency\">\n##INFO=<ID=AF,Number=A,Type=Float,Description=\"Allele Frequency\">\n");} {print}' |\
 java -jar dist/vcffilterjdk.jar -e 'VariantContextBuilder vcb = new VariantContextBuilder(variant); float ac = variant.getAttributeAsInt("AN",0); if(ac>0) { List<Float> af = variant.getAttributeAsIntList("AC",0).stream().map(N->N/ac).collect(Collectors.toList());vcb.attribute("AF",af);vcb.attribute("MAF",af.stream().mapToDouble(X->X.floatValue()).min().orElse(-1.0) );} return vcb.make();'

Example

Set attribute “AA=at …” to upper case in vcf file
java -jar dist/vcffilterjdk.jar -e 'if(!variant.hasAttribute("AA")) return variant; String AA= variant.getAttributeAsString("AA",""); int pipe=AA.indexOf("|"); AA= AA.substring(0,pipe).toUpperCase()+AA.substring(pipe); return new VariantContextBuilder(variant).attribute("AA",AA).make();'

Example

select LUMPY-SV familial structural variations

run with the option ‘–body’

private final List<Set<String>> sampleSetList = Arrays.asList(
		new HashSet<>(Arrays.asList("S1","S2")),
		new HashSet<>(Arrays.asList("S3","S4","S5")),
		new HashSet<>(Arrays.asList("S6","S7","S8","S9"))
		);

private boolean isSV(final Genotype g) {
	if(g.getAttributeAsInt("SU",0)>0) return true;
	if(g.getAttributeAsInt("SR",0)>0) return true;
	return false;	
	}


private boolean validateSet(final VariantContext V,final Set<String> sampleSet) {
	if(sampleSet.stream().
		map(N->V.getGenotype(N)).
		anyMatch(G->!isSV(G)) )
		{
		return false;
		}

	if(V.getGenotypes().stream().
		filter(G->!sampleSet.contains(G.getSampleName())).
		anyMatch(G->isSV(G)))
		{
		return false;
		}
	return true;
	}


@Override
public Object apply(final VariantContext V) {
	for(final Set<String> sampleSet : this.sampleSetList)
		{
		if(validateSet(V,sampleSet)) return true;
		}
	return false;
	}

Example

What I want is to assign ./. missing genotypes for sample-level genotypes in a VCF, if they fail to pass defined AD ratio (so, keeping the variant itself still if it is present in at least one sample with desired AD ratio thresholds).

return new VariantContextBuilder(variant).
    genotypes(
        variant.
        getGenotypes().
        stream().
        map( G->{
            if(G.hasAD()) {
                final int A[]= G.getAD();
                if(A.length>1 &&  A[0]>0 && A[1]/(double)A[0]> 0.33) return G;
                }
            return  GenotypeBuilder.createMissing(G.getSampleName(),2);
            }).
        collect(Collectors.toList())
        ).
    make();

Example

I’ve filtered variants based on quality and depth using bcftools, but I also want to filter them that were called within 5 base pairs of each other. Is there any tools that can do this?

$ java -jar dist/vcffilterjdk.jar --body -e 'String prev_contig=null;int prev_end=-1; public Object apply(final VariantContext vc) {boolean ret=!(vc.getContig().equals(prev_contig)  && vc.getStart() - prev_end <=5); prev_contig=vc.getContig();prev_end=vc.getEnd();return ret; }'  in.vcf

Example

how to filter a multi-VCF for sex-specific genotypes?

see https://bioinformatics.stackexchange.com/questions/5518

java -jar dist/vcffilterjdk.jar -e 'return variant.getGenotypes().stream().allMatch(G->(G.getSampleName().endsWith("M") && G.isHet()) || (G.getSampleName().endsWith("F") && G.isHomRef())); ' input.vcf

Cited in: