javascript version of awk for bioinformatics
This program is now part of the main jvarkit
tool. See jvarkit for compiling.
Usage: java -jar dist/jvarkit.jar bioalcidae [options] Files
Usage: bioalcidae [options] Files
Options:
-e, --expression
Javascript expression
-F, --format
force format: one of VCF BAM SAM FASTQ FASTA BLAST . BLAST is BLAST XML
version 1. INSDSEQ is XML output of NCBI EFetch rettype=gbc.
-h, --help
print help and exit
--helpFormat
What kind of help. One of [usage,markdown,xml].
-J, --json
Optional. Reads a JSON File using google gson
(https://google-gson.googlecode.com/svn/trunk/gson/docs/javadocs/index.html
) and injects it as 'userData' in the javascript context.
-o, --output
Output file. Optional . Default: stdout
-f, --scriptfile
Javascript file
--version
print version and exit
The project is licensed under the MIT license.
Should you cite bioalcidae ? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md
The current reference is:
Bioinformatics file javascript-based reformatter ( java engine http://openjdk.java.net/projects/nashorn/ ). Something like awk for VCF, BAM, SAM, FASTQ, FASTA etc…
As ‘bioalcidae’ looks like an ‘awk’ for bioinformatics, we used ‘Alcidae’, the taxonomic Family of the ‘auk’ species.
At first, this tool is not safe for a public Galaxy server, because the javascript code can access the filesystem. But you can use the JVM parameter
-J-Djava.security.manager
to prevent it to access the filesystem. See http://stackoverflow.com/questions/40177810
The program injects the following variables:
For VCF , the program injects the following variables:
public class Fasta
{
public String getSequence();
public String getName();
public void print();
public int getSize();
public char charAt(int i);
}
interface BlastIteration {
public int getNum();
public String getQueryId();
public String getQueryDef();
public int getQueryLen();
}
}
getting an histogram of the length of the reads
L={};
while(iter.hasNext()) {
var rec=iter.next();
if( rec.getReadUnmappedFlag() || rec.isSecondaryOrSupplementary()) continue;
var n= rec.getReadLength();
if(n in L) {L[n]++;} else {L[n]=1;}
}
for(var i in L) {
out.println(""+i+"\t"+L[i]);
}
“Creating a consensus based on ‘x’ number of fasta files” ( https://www.biostars.org/p/185162/#185168)
$ echo -e ">A_2795\nTCAGAAAGAACCTC\n>B_10\nTCAGAAAGCACCTC\n>C_3\nTCTGAAAGCACTTC" |\
java -jar ~/src/jvarkit-git/dist/jvarkit.jar bioalcidae -F fasta -e 'var consensus=[];while(iter.hasNext()) { var seq=iter.next();out.printlnseq.name+"\t"+seq.sequence);for(var i=0;i< seq.length();++i) {while(consensus.length <= i) consensus.push({}); var b = seq.charAt(i);if(b in consensus[i]) {consensus[i][b]++;} else {consensus[i][b]=1;} } } out.print("Cons.\t"); for(var i=0;i< consensus.length;i++) {var best=0,base="N"; for(var b in consensus[i]) {if(consensus[i][b]>best) { best= consensus[i][b];base=b;}} out.print(base);} out.println();'
A_2795 TCAGAAAGAACCTC
B_10 TCAGAAAGCACCTC
C_3 TCTGAAAGCACTTC
Cons. TCAGAAAGCACCTC
Reformating a VCF we want to reformat a VCF with header
CHROM POS REF ALT GENOTYPE_SAMPLE1 GENOTYPE_SAMPLE2 ... GENOTYPE_SAMPLEN
we use the following javascript file:
var samples = header.sampleNamesInOrder;
out.print("CHROM\tPOS\tREF\tALT");
for(var i=0;i< samples.size();++i)
{
out.print("\t"+samples.get(i));
}
out.println();
while(iter.hasNext())
{
var ctx = iter.next();
if(ctx.alternateAlleles.size()!=1) continue;
out.print(ctx.getContig() +"\t"+ctx.start+"\t"+ctx.reference.displayString+"\t"+ctx.alternateAlleles.get(0).displayString);
for(var i=0;i< samples.size();++i)
{
var g = ctx.getGenotype(samples.get(i));
out.print("\t");
if(g.isHomRef())
{
out.print("0");
}
else if(g.isHomVar())
{
out.print("2");
}
else if(g.isHet())
{
out.print("1");
}
else
{
out.print("-9");
}
}
out.println();
}
$ curl -s "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz" | \
gunzip -c | java -jar ./dist/jvarkit.jar bioalcidae -f jeter.js -F vcf | head -n 5 | cut -f 1-10
CHROM POS REF ALT HG00096 HG00097 HG00099 HG00100 HG00101 HG00102
22 16050075 A G 0 0 0 0 0 0
22 16050115 G A 0 0 0 0 0 0
22 16050213 C T 0 0 0 0 0 0
22 16050319 C T 0 0 0 0 0 0
for 1000 genome data, print CHROM/POS/REF/ALT/AF(europe):
$ curl "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5a.20130502.sites.vcf.gz" | gunzip -c |\
java -jar dist/jvarkit.jar bioalcidae -F VCF -e 'while(iter.hasNext()) {var ctx=iter.next(); if(!ctx.hasAttribute("EUR_AF") || ctx.alternateAlleles.size()!=1) continue; out.println(ctx.getContig()+"\t"+ctx.start+"\t"+ctx.reference.displayString+"\t"+ctx.alternateAlleles.get(0).displayString+"\t"+ctx.getAttribute("EUR_AF"));}'
1 10177 A AC 0.4056
1 10235 T TA 0
1 10352 T TA 0.4264
1 10505 A T 0
1 10506 C G 0
1 10511 G A 0
1 10539 C A 0.001
1 10542 C T 0
1 10579 C A 0
1 10616 CCGCCGTTGCAAAGGCGCGCCG C 0.994
(...)
$ cat ~/input.blastn.xml | java -jar dist/jvarkit.jar bioalcidae -F blast -e 'while(iter.hasNext())
{
var query = iter.getIteration();
var hit = iter.next();
out.println(query.getQueryDef()+" Hit: "+hit.getHitDef()+" num-hsp = "+hit.getHitHsps().getHsp().size());
}'
output:
$
Enterobacteria phage phiX174 sensu lato, complete genome Hit: Escherichia coli genome assembly FHI90 ,scaffold scaffold-6_contig-25.0_1_5253_[organism:Escherichia num-hsp = 2
Enterobacteria phage phiX174 sensu lato, complete genome Hit: Acinetobacter baumannii AC12, complete genome num-hsp = 2
Enterobacteria phage phiX174 sensu lato, complete genome Hit: Escherichia coli genome assembly FHI7 ,scaffold scaffold-5_contig-23.0_1_5172_[organism:Escherichia num-hsp = 2
Enterobacteria phage phiX174 sensu lato, complete genome Hit: Escherichia coli genome assembly FHI92 ,scaffold scaffold-6_contig-18.0_1_5295_[organism:Escherichia num-hsp = 2
Enterobacteria phage phiX174 sensu lato, complete genome Hit: Amycolatopsis lurida NRRL 2430, complete genome num-hsp = 2
Enterobacteria phage phiX174 sensu lato, complete genome Hit: Escherichia coli genome assembly FHI87 ,scaffold scaffold-4_contig-19.0_1_5337_[organism:Escherichia num-hsp = 2
Enterobacteria phage phiX174 sensu lato, complete genome Hit: Desulfitobacterium hafniense genome assembly assembly_v1 ,scaffold scaffold9 num-hsp = 2
Enterobacteria phage phiX174 sensu lato, complete genome Hit: Escherichia coli genome assembly FHI79 ,scaffold scaffold-4_contig-23.0_1_3071_[organism:Escherichia num-hsp = 1
Enterobacteria phage phiX174 sensu lato, complete genome Hit: Escherichia coli genome assembly FHI24 ,scaffold scaffold-8_contig-33.0_1_3324_[organism:Escherichia num-hsp = 2
Enterobacteria phage phiX174 sensu lato, complete genome Hit: Escherichia coli genome assembly FHI89 ,scaffold scaffold-8_contig-14.0_1_3588_[organism:Escherichia num-hsp = 2
Enterobacteria phage phiX174 sensu lato, complete genome Hit: Sphingobacterium sp. PM2-P1-29 genome assembly Sequencing method ,scaffold BN1088_Contig_19 num-hsp = 2
Enterobacteria phage phiX174 sensu lato, complete genome Hit: Escherichia coli genome assembly FHI43 ,scaffold scaffold-3_contig-14.0_1_2537_[organism:Escherichia num-hsp = 1
(...)
$ curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=25,26,27&rettype=gbc" |\
java -jar dist/jvarkit.jar bioalcidae -F INSDSEQ -e 'while(iter.hasNext()) {var seq= iter.next(); out.println(seq.getINSDSeqDefinition()+" LENGTH="+seq.getINSDSeqLength());}'
output:
Blue Whale heavy satellite DNA LENGTH=422
Blue Whale heavy satellite DNA LENGTH=416
B.physalus gene for large subunit rRNA LENGTH=518