Getting a VCF file from a CLUSTALW or a FASTA alignment.
Deprecated: use https://github.com/sanger-pathogens/snp_sites instead.
This program is now part of the main jvarkit
tool. See jvarkit for compiling.
Usage: java -jar dist/jvarkit.jar msa2vcf [options] Files
Usage: msa2vcf [options] Files
Options:
-a, --allsites
print all sites
Default: false
--bcf-output
If this program writes a VCF to a file, the format is first guessed from
the file suffix. Otherwise, force BCF output. The currently supported BCF
version is 2.1, which is not compatible with bcftools/htslib (last
checked 2019-11-15)
Default: false
-c, --consensus
use this sequence as CONSENSUS
-f, --fasta
save computed fasta sequence in this file.
--generate-vcf-md5
Generate MD5 checksum for VCF output.
Default: false
-m, --haploid
haploid output
Default: false
-h, --help
print help and exit
--helpFormat
What kind of help. One of [usage,markdown,xml].
-N, --ignore-n-bases
ignore, to the extent possible, N-bases in the reads.
Default: false
-o, --output
Output file. Optional. Default: stdout
-R, --reference_contig_name
reference name used for the CHROM column. Optional
Default: chrUn
--version
print version and exit
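The options above can be combined on a single command line. As a minimal sketch (not part of jvarkit), the Python helper below assembles such a command from the long option names listed above. The function name `build_msa2vcf_command`, the output name `opuntia.vcf` and the contig name `chrOpuntia` are hypothetical, and the jar path assumes the `dist/jvarkit.jar` location shown in the usage line.

```python
import shlex


def build_msa2vcf_command(alignment_path, output=None, contig=None,
                          haploid=False, all_sites=False,
                          jar="dist/jvarkit.jar"):
    """Build an argument list for msa2vcf (hypothetical helper).

    Only option names documented in the help text above are used.
    """
    cmd = ["java", "-jar", jar, "msa2vcf"]
    if all_sites:
        cmd.append("--allsites")
    if haploid:
        cmd.append("--haploid")
    if contig is not None:
        cmd += ["--reference_contig_name", contig]
    if output is not None:
        cmd += ["--output", output]
    # The alignment file is passed as a positional argument ("Files").
    cmd.append(alignment_path)
    return cmd


# Hypothetical file and contig names, for illustration only.
command = build_msa2vcf_command(
    "opuntia.aln", output="opuntia.vcf", contig="chrOpuntia", haploid=True
)
print(shlex.join(command))
```

The returned list can be handed to `subprocess.run(command, check=True)` once the jar has been built.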
Creation date: 20151226
The project is licensed under the MIT license.
Should you cite msa2vcf? See https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md
The current reference is:
http://dx.doi.org/10.6084/m9.figshare.1425030
Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030
Deprecated: use https://github.com/sanger-pathogens/snp_sites instead, though some people told me they still use msa2vcf for miscellaneous reasons.
Getting a VCF file from a CLUSTALW alignment. See also http://www.biostars.org/p/94573/
The input is a CLUSTALW file such as: https://github.com/biopython/biopython/blob/master/Tests/Clustalw/opuntia.aln
$ curl https://raw.github.com/biopython/biopython/master/Tests/Clustalw/opuntia.aln
CLUSTAL W (1.81) multiple sequence alignment
gi|6273285|gb|AF191659.1|AF191 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273284|gb|AF191658.1|AF191 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273287|gb|AF191661.1|AF191 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273286|gb|AF191660.1|AF191 TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273290|gb|AF191664.1|AF191 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273289|gb|AF191663.1|AF191 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273291|gb|AF191665.1|AF191 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
******* **** *************************************
gi|6273285|gb|AF191659.1|AF191 TATATA----------ATATATTTCAAATTTCCTTATATACCCAAATATA
gi|6273284|gb|AF191658.1|AF191 TATATATA--------ATATATTTCAAATTTCCTTATATACCCAAATATA
gi|6273287|gb|AF191661.1|AF191 TATATA----------ATATATTTCAAATTTCCTTATATATCCAAATATA
gi|6273286|gb|AF191660.1|AF191 TATATA----------ATATATTTATAATTTCCTTATATATCCAAATATA
gi|6273290|gb|AF191664.1|AF191 TATATATATA------ATATATTTCAAATTCCCTTATATATCCAAATATA
gi|6273289|gb|AF191663.1|AF191 TATATATATA------ATATATTTCAAATTCCCTTATATATCCAAATATA
gi|6273291|gb|AF191665.1|AF191 TATATATATATATATAATATATTTCAAATTCCCTTATATATCCAAATATA
****** ******** **** ********* *********
gi|6273285|gb|AF191659.1|AF191 AAAATATCTAATAAATTAGATGAATATCAAAGAATCCATTGATTTAGTGT
gi|6273284|gb|AF191658.1|AF191 AAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGT
gi|6273287|gb|AF191661.1|AF191 AAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGT
gi|6273286|gb|AF191660.1|AF191 AAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGT
gi|6273290|gb|AF191664.1|AF191 AAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGT
gi|6273289|gb|AF191663.1|AF191 AAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTAT
gi|6273291|gb|AF191665.1|AF191 AAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGT
************************************ *********** *
gi|6273285|gb|AF191659.1|AF191 ACCAGA
gi|6273284|gb|AF191658.1|AF191 ACCAGA
gi|6273287|gb|AF191661.1|AF191 ACCAGA
gi|6273286|gb|AF191660.1|AF191 ACCAGA
gi|6273290|gb|AF191664.1|AF191 ACCAGA
gi|6273289|gb|AF191663.1|AF191 ACCAGA
gi|6273291|gb|AF191665.1|AF191 ACCAGA
******
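In the alignment above, columns without a `*` are the ones where the sequences differ, and these are the candidate variant sites. The Python sketch below illustrates that idea on the first 20 columns of three of the sequences. It is a simplified illustration, not msa2vcf's actual algorithm: it ignores gaps, multi-base events and the choice of a consensus, and the function name `variable_columns` and the shortened sequence labels are hypothetical.

```python
def variable_columns(alignment):
    """Return the 1-based columns where the aligned sequences differ."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    # zip(*seqs) walks the alignment column by column.
    return [i + 1 for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]


# First 20 columns of three sequences from the alignment above
# (labels shortened to the accession for readability).
alignment = {
    "AF191659": "TATACATTAAAGAAGGGGGA",
    "AF191660": "TATACATAAAAGAAGGGGGA",
    "AF191664": "TATACATTAAAGGAGGGGGA",
}
sites = variable_columns(alignment)
print(sites)  # [8, 13]
```

Columns 8 and 13 are the first two positions reported in the VCF produced by msa2vcf for this alignment.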
Generate the VCF:
$ curl "https://raw.github.com/biopython/biopython/master/Tests/Clustalw/opuntia.aln" |\
java -jar dist/jvarkit.jar msa2vcf
##fileformat=VCFv4.1
##Biostar94573CmdLine=
##Biostar94573Version=ca765415946f3ed0827af0773128178bc6aa2f62
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth.">
##contig=<ID=chrUn,length=156>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT gi|6273284|gb|AF191658.1|AF191 gi|6273285|gb|AF191659.1|AF191 gi|6273286|gb|AF191660.1|AF191 gi|6273287|gb|AF191661.1|AF191 gi|6273289|gb|AF191663.1|AF191 gi|6273290|gb|AF191664.1|AF191 gi|6273291|gb|AF191665.1|AF191
chrUn 8 . T A . . DP=7 GT:DP 0:1 0:1 1:1 0:1 0:1 0:1 0:1
chrUn 13 . A G . . DP=7 GT:DP 0:1 0:1 0:1 0:1 1:1 1:1 1:1
chrUn 56 . ATATATATATA ATA,A,ATATA . . DP=7 GT:DP 1:1 2:1 2:1 2:1 3:1 3:1 0:1
chrUn 74 . TCA TAT . . DP=7 GT:DP 0:1 0:1 1:1 0:1 0:1 0:1 0:1
chrUn 81 . T C . . DP=7 GT:DP 0:1 0:1 0:1 0:1 1:1 1:1 1:1
chrUn 91 . T C . . DP=7 GT:DP 1:1 1:1 0:1 0:1 0:1 0:1 0:1
chrUn 137 . T C . . DP=7 GT:DP 0:1 1:1 0:1 0:1 0:1 0:1 0:1
chrUn 149 . G A . . DP=7 GT:DP 0:1 0:1 0:1 0:1 1:1 0:1 0:1
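Each genotype in the output is an index into the allele list: `0` is REF and `1`, `2`, ... are the comma-separated ALT alleles in order. As a rough sketch of how to read one record, the Python snippet below maps every sample of the position 56 line to its allele. The function name `alleles_per_sample` is hypothetical, and the parsing assumes whitespace-separated columns and one haploid call per sample, as in the output shown above; for real files a dedicated VCF library is a safer choice.

```python
HEADER = (
    "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT "
    "gi|6273284|gb|AF191658.1|AF191 gi|6273285|gb|AF191659.1|AF191 "
    "gi|6273286|gb|AF191660.1|AF191 gi|6273287|gb|AF191661.1|AF191 "
    "gi|6273289|gb|AF191663.1|AF191 gi|6273290|gb|AF191664.1|AF191 "
    "gi|6273291|gb|AF191665.1|AF191"
)
RECORD = (
    "chrUn 56 . ATATATATATA ATA,A,ATATA . . DP=7 GT:DP "
    "1:1 2:1 2:1 2:1 3:1 3:1 0:1"
)


def alleles_per_sample(header_line, record_line):
    """Map each sample name to the allele its haploid genotype points to."""
    samples = header_line.split()[9:]          # sample names follow FORMAT
    fields = record_line.split()
    alleles = [fields[3]] + fields[4].split(",")   # index 0 = REF, 1.. = ALT
    gt_index = fields[8].split(":").index("GT")    # locate GT inside FORMAT
    calls = {}
    for name, sample_field in zip(samples, fields[9:]):
        gt = sample_field.split(":")[gt_index]
        calls[name] = None if gt == "." else alleles[int(gt)]
    return calls


calls = alleles_per_sample(HEADER, RECORD)
for name, allele in calls.items():
    print(name, allele)
```

Here only `gi|6273291|gb|AF191665.1|AF191` carries the full reference allele; the other sequences carry one of the three shorter alternatives, which matches the gapped block in the alignment.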