Reconstruct SNP haplotypes from reads


Usage: java -jar dist/bam2haplotypes.jar  [options] Files
Usage: bam2haplotypes [options] Files
      How to handle ALT alleles that are not in the VCF: skip, warn (skip 
      and warn), error (raise an error), N (replace with 'N'), all (use 
      all alleles).
      Default: all
      Possible Values: [skip, warn, error, N, all]
      When looking for variants in a large VCF file, load the variants in 
      an interval of 'N' bases instead of doing a random access for each 
      Default: 1000
    -h, --help
      print help and exit
      What kind of help. One of [usage,markdown,xml].
      In paired mode, ignore discordant read-groups RG-ID.
      Default: false
      When writing files that need to be sorted, this specifies the number 
      of records stored in RAM before spilling to disk. Increasing this number 
      reduces the number of file handles needed to sort a file, and increases 
      the amount of RAM needed.
      Default: 50000
    -o, --out
      Output file. Optional. Default: stdout
      Activate paired-end mode. A variant can be supported by the read and/or 
      its mate. Input must be sorted on query name using for example 'samtools 
      Default: false
    -R, --reference
      Indexed FASTA reference file. This file must be indexed with samtools 
      faidx and with picard/gatk CreateSequenceDictionary or samtools dict
      Limit analysis to this interval. A source of intervals. The following 
      suffixes are recognized: vcf, vcf.gz, bed, bed.gz, gtf, gff, gff.gz, 
      gtf.gz. Otherwise it could be an empty string (no interval) or a list of 
      plain intervals separated by '[ \t\n;,]'
      Temporary working directory. Default: java.io.tmpdir
      Default: []
      SAM Reader Validation Stringency
      Default: LENIENT
      Possible Values: [STRICT, LENIENT, SILENT]
  * -V, --vcf
      Indexed VCF file. Only diallelic SNPs will be considered.
      print version and exit
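
The --vcf option above states that only diallelic SNPs are considered. As a rough illustration of what such a filter could look like, here is a minimal Python sketch. The function name is_diallelic_snp is hypothetical, and the exact rule applied by bam2haplotypes is not described here, so the definition used (one single-base REF, exactly one single-base ALT, both in ACGT) is an assumption.

```python
from typing import Sequence

BASES = set("ACGT")

def is_diallelic_snp(ref: str, alts: Sequence[str]) -> bool:
    """Return True for a site with one single-base REF and one single-base ALT.

    Assumption: this mirrors the usual meaning of 'diallelic SNP';
    it is not taken from the bam2haplotypes source code.
    """
    if len(alts) != 1:
        return False  # zero or several ALT alleles: not diallelic
    ref, alt = ref.upper(), alts[0].upper()
    if len(ref) != 1 or len(alt) != 1:
        return False  # indel or multi-nucleotide variant
    return ref in BASES and alt in BASES and ref != alt

print(is_diallelic_snp("A", ["G"]))       # single-base substitution
print(is_diallelic_snp("A", ["G", "T"]))  # multiallelic site
print(is_diallelic_snp("AT", ["A"]))      # deletion
```

Under this assumed definition, multiallelic sites and indels would be dropped before any haplotype is reconstructed.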


Download and Compile

$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ ./gradlew bam2haplotypes

The java jar file will be installed in the dist directory.

The project is licensed under the MIT license.


Should you cite bam2haplotypes? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md

The current reference is:


Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030


$ java -jar dist/bam2haplotypes.jar -V src/test/resources/rotavirus_rf.vcf.gz src/test/resources/S5.bam

RF03	1221	1242	11	2	1221	C	1242	C
RF03	1688	1708	5	2	1688	G	1708	T
RF04	1900	1920	4	2	1900	C	1920	A
RF06	517	543	9	2	517	C	543	G
RF06	668	695	4	2	668	G	695	T
RF08	926	992	2	2	926	C	992	G
RF09	294	317	6	2	294	T	317	A
RF10	139	175	1	2	139	T	175	G
RF10	139	175	3	2	139	T	175	C
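
The output above can be post-processed with a few lines of Python. The sketch below assumes the rows are tab-separated and laid out as: contig, first position, last position, a count, the number of variants, then one (position, base) pair per variant. That layout is inferred from the example rows, not from a documented specification; in particular, reading the fourth column as the number of supporting reads is an assumption. The names Haplotype and parse_line are hypothetical.

```python
from typing import List, NamedTuple, Tuple

class Haplotype(NamedTuple):
    contig: str
    start: int
    end: int
    count: int                      # assumed: number of supporting reads
    alleles: List[Tuple[int, str]]  # (position, base) pairs

def parse_line(line: str) -> Haplotype:
    """Parse one tab-separated output row into a Haplotype record."""
    fields = line.rstrip("\n").split("\t")
    contig = fields[0]
    start, end, count, n_variants = (int(x) for x in fields[1:5])
    rest = fields[5:]
    if len(rest) != 2 * n_variants:
        raise ValueError(f"expected {n_variants} position/base pairs: {line!r}")
    alleles = [(int(rest[i]), rest[i + 1]) for i in range(0, len(rest), 2)]
    return Haplotype(contig, start, end, count, alleles)

hap = parse_line("RF03\t1221\t1242\t11\t2\t1221\tC\t1242\tC")
print(hap.contig, hap.count, hap.alleles)
```

The length check makes a malformed or truncated row fail loudly instead of producing a silently misaligned record.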

In paired mode:

$ samtools collate -O -u src/test/resources/S5.bam TMP | java -jar dist/bam2haplotypes.jar --paired -V src/test/resources/rotavirus_rf.vcf.gz

RF02	251	578	1	2	251	A	578	G
RF03	1221	1688	1	2	1221	C	1688	G
RF03	1221	1242	7	2	1221	C	1242	C
RF03	1221	1688	1	2	1221	C	1688	G
RF03	1688	1708	1	2	1688	G	1708	T
RF03	1708	2150	1	2	1708	T	2150	T
RF03	1221	1708	1	3	1221	C	1688	G	1708	T
RF03	1221	1688	2	3	1221	C	1242	C	1688	G
RF03	1688	2150	1	3	1688	G	1708	T	2150	T
RF03	1221	1708	2	4	1221	C	1242	C	1688	G	1708	T
RF04	887	1241	1	2	887	A	1241	T
RF04	1900	1920	4	2	1900	C	1920	A
RF05	41	499	2	2	41	T	499	A
RF05	499	879	1	2	499	A	879	C
RF05	795	1297	2	2	795	A	1297	T
RF05	879	1297	2	2	879	C	1297	T
RF06	517	543	9	2	517	C	543	G
RF06	668	695	4	2	668	G	695	T
RF07	225	684	1	2	225	C	684	G
RF07	225	684	1	2	225	C	684	T
RF08	926	992	2	2	926	C	992	G
RF09	294	317	6	2	294	T	317	A
RF10	139	175	1	2	139	T	175	G
RF10	139	175	3	2	139	T	175	C
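
Several rows can describe the same set of positions with different bases, for example the two RF10 rows at positions 139 and 175. One way to summarize them is to compute, for each set of positions, the fraction of the count carried by each base combination. The Python sketch below does this under the same assumptions about the column layout as the output suggests (tab-separated; the fourth column is a count, which is an inference, not a documented fact). The function name haplotype_fractions is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Key = Tuple[str, Tuple[int, ...], str]  # (contig, positions, bases)

def haplotype_fractions(lines: Iterable[str]) -> Dict[Key, float]:
    """Fraction of the total count carried by each haplotype,
    computed separately for every (contig, positions) group."""
    counts: Dict[Key, int] = defaultdict(int)
    totals: Dict[Tuple[str, Tuple[int, ...]], int] = defaultdict(int)
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 7:
            continue  # skip blank or malformed rows
        contig, count = f[0], int(f[3])  # column 4: assumed supporting count
        pairs = f[5:]
        positions = tuple(int(p) for p in pairs[0::2])
        bases = "".join(pairs[1::2])
        counts[(contig, positions, bases)] += count  # identical rows are summed
        totals[(contig, positions)] += count
    return {k: c / totals[(k[0], k[1])] for k, c in counts.items()}

rows = [
    "RF10\t139\t175\t1\t2\t139\tT\t175\tG",
    "RF10\t139\t175\t3\t2\t139\tT\t175\tC",
]
for (contig, positions, bases), frac in haplotype_fractions(rows).items():
    print(contig, positions, bases, frac)
```

With the two RF10 rows, the T-G combination carries 1 of 4 counts and T-C carries 3 of 4. Rows repeated verbatim in the output are added together under this scheme.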