jvarkit

BioAlcidaeJdk

Last commit

java-based version of awk for bioinformatics

Usage

This program is now part of the main jvarkit tool. See jvarkit for compiling.

Usage: java -jar dist/jvarkit.jar bioalcidaejdk  [options] Files

Usage: bioalcidaejdk [options] Files
  Options:
    --body
      user's code is the whole body of the filter class, not just the 'apply' 
      method. 
      Default: false
    -e, --expression
      inline java expression
    -F, --format
      define format if input is stdin or cannot be inferred from filename. 
      Must be one of: VCF,SAM,BAM,CRAM,FASTA,FASTQ,TEXT,GTF
    -h, --help
      print help and exit
    --helpFormat
      What kind of help. One of [usage,markdown,xml].
    --import
      [20180312] add/import those java packages/classes in the code header. 
      Multiple separated by space/colon/comma. .eg: 'java.util.StringBuilder 
      java.awt.*' . 	Useful if those packages are not already defined in the 
      default code.
      Default: <empty string>
    --nocode
       Don't show the generated code
      Default: false
    -o, --output
      Output file. Optional . Default: stdout
    -p, --pedigree
      Optional pedigree file. A pedigree file. tab delimited. Columns: 
      family,id,father,mother, 
      sex:(0|.|undefined|unknown:unknown;1|male|M:male;2|female|F:female), 
      phenotype 
      (-9|?|.:unknown;1|affected|case:affected;0|unaffected|control:unaffected) 
    -R, --reference
      [20190808]Indexed fasta Reference file. This file must be indexed with 
      samtools faidx and with picard/gatk CreateSequenceDictionary or samtools 
      dict For reading BAM files or to try to convert the chromosomes in a GTF 
      file in order to match the dictionary ('1' -> 'chr1').
    -f, --scriptfile
      java body file
    --version
      print version and exit

Keywords

sam
bam
vcf
java
jdk
gtf

Creation Date

20170712

Source code

https://github.com/lindenb/jvarkit/tree/master/src/main/java/com/github/lindenb/jvarkit/tools/bioalcidae/BioAlcidaeJdk.java

Unit Tests

https://github.com/lindenb/jvarkit/tree/master/src/test/java/com/github/lindenb/jvarkit/tools/bioalcidae/BioAlcidaeJdkTest.java

Contribute

Issue Tracker: http://github.com/lindenb/jvarkit/issues
Source Code: http://github.com/lindenb/jvarkit

License

The project is licensed under the MIT license.

Citing

Should you cite bioalcidaejdk ? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md

The current reference is:

“bioalcidae, samjs and vcffilterjs: object-oriented formatters and filters for bioinformatics files” . Bioinformatics, 2017. Pierre Lindenbaum & Richard Redon https://doi.org/10.1093/bioinformatics/btx734.

Bioinformatics file java-based reformatter. Something like awk for VCF, BAM, SAM…

This program takes as input a VCF or a BAM on stdin or as a file. The user provides a piece of java code that will be compiled at runtime an executed.

Why this name, ‘BioAlcidae’ ?

As ‘bioalcidae’ looks like an ‘awk’ for bioinformatics, we used ‘Alcidae’, the taxonomic Family of the ‘auk’ species.

History

2019-01 migrating to openjdk11: switched to in-memory compiling to external compiling.

Base classes

At the time of writing, we have:

    public static abstract class AbstractHandler
    	{
    	// output file
    	protected PrintStream out = System.out;
    	// input file name
    	protected String inputFile = null;
    	// called at begin
    	public void initialize() {}
    	// called at end
    	public void dispose() {}
    	// users MUST implement the body of that function
    	public abstract void execute() throws Exception;
    	
    	public void print(final Object o) { this.out.print(o);}
    	public void println() { this.out.println();}
    	public void println(final Object o) { this.print(o);this.println();}
		
		// if option pedigree was specified
		public Pedigree getPedigree();
		publc boolean hasPedigree();
    	}
    
    public static abstract class VcfHandler extends AbstractHandler
		{
    	protected VcfTools tools = null;
    	protected VCFHeader header = null;
    	protected VCFIterator iter = null;
		public Stream<VariantContext> stream()
			{
			return StreamSupport.stream(
					new IterableAdapter<VariantContext>(this.iter).spliterator(),
					false);
			}		}

    public static abstract class SAMHandler extends AbstractHandler
		{
		protected SamReader in=null;
		protected SAMFileHeader header=null;
		protected SAMRecordIterator iter=null;
		public Stream<SAMRecord> stream()
			{
			return StreamSupport.stream(
					new IterableAdapter<SAMRecord>(this.iter).spliterator(),
					false);
			}
		}
    public static abstract class FastqHandler extends AbstractHandler
		{
		protected FastqReader iter=null;
		public Stream<FastqRecord> stream()
			{
			return StreamSupport.stream(
					new IterableAdapter<FastqRecord>(this.iter).spliterator(),
					false);
			}
		}
		
    public static abstract class FastaHandler extends AbstractHandler
		{
		protected CloseableIterator<FastaSequence> iter=null;
		public Stream<FastaSequence> stream()
			{
			return StreamSupport.stream(
					new IterableAdapter<FastaSequence>(this.iter).spliterator(),
					false);
			}
		}

    public static abstract class SimpleLineHandler extends AbstractHandler
		{
		protected CloseableIterator<String> iter=null;
		public Stream<String> stream()
			{
			return StreamSupport.stream(
					new IterableAdapter<String>(this.iter).spliterator(),
					false);
			}
		}

VCF

when reading a VCF, a new class extending VcfHandler will be compiled. The user’s code will be inserted as:

import java.util.*;
import java.util.stream.*;
import java.util.function.*;
import htsjdk.samtools.*;
import htsjdk.variant.variantcontext.*;
import htsjdk.variant.vcf.*;
public class Custom1694491176 extends com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.VcfHandler {
  public Custom1694491176() {
  }
  @Override
  public void execute() throws Exception {
  
   // user's code is inserted here <=======================
   
   }
}

SAM

when reading a SAM/BAM/CRAM, a new class extending SAMHandler will be compiled. The user’s code will be inserted as:

import java.util.*;
import java.util.stream.*;
import java.util.function.*;
import htsjdk.samtools.*;
import htsjdk.variant.variantcontext.*;
import htsjdk.variant.vcf.*;
public class Custom1694491176 extends com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.SAMHandler {
  public Custom1694491176() {
  }
  @Override
  public void execute() throws Exception {
  
   // user's code is inserted here <=======================
   
   }
}

FASTQ

when reading a Fastq, a new class extending FastqHandler will be compiled. The user’s code will be inserted as:

import java.util.*;
import java.util.stream.*;
import java.util.function.*;
import htsjdk.samtools.*;
import htsjdk.samtools.util.*;
import htsjdk.variant.variantcontext.*;
import htsjdk.variant.vcf.*;
public class BioAlcidaeJdkCustom220769712 extends com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.FastqHandler {
  public BioAlcidaeJdkCustom220769712() {
  }
  @Override
  public void execute() throws Exception {
  
   // user's code is inserted here <=======================
   
   }
}

Fasta

when reading a Fasta, a new class extending FastaHandler will be compiled. The user’s code will be inserted as:

import java.util.*;
import java.util.stream.*;
import java.util.function.*;
import htsjdk.samtools.*;
import htsjdk.samtools.util.*;
import htsjdk.variant.variantcontext.*;
import htsjdk.variant.vcf.*;
import com.github.lindenb.jvarkit.util.bio.fasta.FastaSequence;
public class BioAlcidaeJdkCustom298960668 extends com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.FastaHandler {
  public BioAlcidaeJdkCustom298960668() {
  }
  @Override
  public void execute() throws Exception {
   // user's code starts here 
   
    //user's code ends here 
   }
}

GTF

when reading a Gtf, a new class extending GtfHandler will be compiled. The handler contains a stream of Gene ( https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/util/bio/structure/Gene.java ): The user’s code will be inserted as:

import java.util.*;
import java.util.stream.*;
import java.util.regex.*;
import java.util.function.*;
import htsjdk.samtools.*;
import htsjdk.samtools.util.*;
import htsjdk.variant.variantcontext.*;
import htsjdk.variant.vcf.*;
import com.github.lindenb.jvarkit.util.bio.fasta.FastaSequence;
import com.github.lindenb.jvarkit.math.RangeOfIntegers;
import com.github.lindenb.jvarkit.math.RangeOfDoubles;
import com.github.lindenb.jvarkit.util.Counter;
public class BioAlcidaeJdkCustom412261069 extends com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.GtfHandler {
  public BioAlcidaeJdkCustom412261069() {
  }
  @Override
  public void execute() throws Exception {
   // user's code starts here 
    
    //user's code ends here 
   }
}

GFF

experimental. script will contains some instances of Gff3Feature : https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/tribble/gff/Gff3Feature.java

Examples

Example:

count the SNP having a ID in a VCF file:

$ java -jar dist/jvarkit.jar bioalcidaejdk -e 'println(stream().filter(V->V.isSNP()).filter(V->V.hasID()).count());' input.vcf.gz

953

Example:

count the mapped SAM Reads in a BAM file:

$ java -jar dist/jvarkit.jar bioalcidaejdk -e 'println(stream().filter(R->!R.getReadUnmappedFlag()).count());' input.bam
5518 

Example:

count the Reads starting with “A’ in a FASTQ file:

$ java -jar dist/jvarkit.jar bioalcidaejdk -e 'println(stream().filter(R->R.getReadString().startsWith("A")).count());' in.fastq.gz
218 

Example

generate a matrix: first row is sample names, one row per variant, genotypes (0=HOM_REF,1=HET,2=HOM_VAR,other=-9)

java -jar dist/jvarkit.jar bioalcidaejdk -e 'println(String.join(",",header.getSampleNamesInOrder())); stream().forEach(V->println(V.getGenotypes().stream().map(G->{if(G.isHomRef()) return "0";if(G.isHet()) return "1"; if(G.isHomVar()) return "2"; return "-9";}).collect(Collectors.joining(","))));' input.vcf

Example

longest fasta length

/java -jar dist/jvarkit.jar bioalcidaejdk -e 'println(stream().mapToInt(S->S.length()).max().getAsInt());' input.fasta

Example

Which tool to calculate per site stats on vcf file?

java -jar dist/jvarkit.jar bioalcidaejdk -e 'print("POS\t");for(GenotypeType GT : GenotypeType.values()) print("\t"+GT); println(); stream().forEach(V->{print(V.getContig()+":"+V.getStart()+":"+V.getReference().getDisplayString());for(GenotypeType GT : GenotypeType.values()) print("\t"+V.getGenotypes().stream().filter(G->G.getType()==GT).count()); println();});' in.vcf | column -t

POS               NO_CALL  HOM_REF  HET  HOM_VAR  UNAVAILABLE  MIXED
rotavirus:51:A    0        3        0    1        0            0
rotavirus:91:A    0        3        1    0        0            0
rotavirus:130:T   0        3        1    0        0            0
(...)

Example

BED from cigar string with deletion >= 1kb

java -jar dist/bioalcidaejdk -e  'stream().filter(R->!R.getReadUnmappedFlag()).forEach(R->{ int refpos = R.getStart(); for(CigarElement ce:R.getCigar()) { CigarOperator op = ce.getOperator(); int len = ce.getLength(); if(len>=1000 && (op.equals(CigarOperator.N) || op.equals(CigarOperator.D))) { println(R.getContig()+"\t"+(refpos-1)+"\t"+(refpos+len)); } if(op.consumesReferenceBases())  refpos+=len; } }); ' in.bam

Example

Printing the aligned reads in a given region of the reference (indels ignored)

final String chrom = "rotavirus";
final int start=150;
final int end=200;

stream().filter(R->!R.getReadUnmappedFlag()).
    filter(R->R.getContig().equals(chrom) && !(R.getEnd()<start || R.getStart()>end)).
    forEach(R->{

    for(int pos=start; pos< end ;++pos)
        {
        char base='.';
        int ref=R.getStart();
        int read=0;
        for(CigarElement ce:R.getCigar())
            {
            CigarOperator op=ce.getOperator();
            switch(op)
                {
                case P: break;
                case H: break;
                case S : read+=ce.getLength(); break;
                case D: case N: ref+=ce.getLength();break;
                case I: read+=ce.getLength();break;
                case EQ:case X: case M:
                    {
                    for(int i=0;i< ce.getLength() ;i++)
                        {
                        if(ref==pos)
                            {
                            base = R.getReadString().charAt(read);
                            break;
                            }
                        if(ref>pos) break;
                        read++;
                        ref++;
                        }
                    break;
                    }
                default:break;
                }
            if(base!='.')break;
            }
        print(base);
        }
    println();
});

Example

print information about reads overlaping the base ‘rotavirus:200’ and the tag ‘XS’

$ samtools view -h ~/src/gatk-ui/testdata/S1.bam "rotavirus:200-200" | java -jar dist/jvarkit.jar bioalcidaejdk -F SAM -e 'stream().forEach(record->{final String contig= "rotavirus"; final int mutpos = 200; if(record.getReadUnmappedFlag()) return ;if(!record.getContig().equals(contig)) return ;if(record.getEnd() < mutpos) return ;if(record.getStart() > mutpos) return ; int readpos = record.getReadPositionAtReferencePosition(mutpos); if(readpos<1) return ; readpos--; final byte[] bases= record.getReadBases();println(record.getReadName()+" "+record.getContig()+" "+record.getStart()+" "+record.getEnd()+" base["+mutpos+"]="+(char)bases[readpos]+" XS="+record.getAttribute("XS"));});'

Example

$ java -jar dist/jvarkit.jar bioalcidaejdk -e 'stream().forEach(V->{println(V.getContig()+"\t"+V.getStart()+"\t"+V.getReference().getDisplayString()+"\t"+V.getAlternateAlleles().stream().map(A->A.getDisplayString()).collect(Collectors.joining(","))+"\t"+V.getGenotypes().stream().map(G->G.getSampleName()+"="+G.getAlleles().stream().filter(A->A.isCalled()).map(A->A.getDisplayString()).collect(Collectors.joining("/"))).collect(Collectors.joining("\t")));});' src/test/resources/test_vcf01.vcf

 956852  C   T   S1= S2=T/T  S3=C/T  S4=C/T  S5=T/T  S6=C/T
 959155  G   A   S1= S2=G/A  S3=G/A  S4=G/A  S5=A/A  S6=G/A
 959169  G   C   S1= S2=G/C  S3=G/C  S4=G/C  S5=C/C  S6=G/C
 959231  G   A   S1= S2=G/A  S3=G/A  S4=G/A  S5=A/A  S6=G/A
 960409  G   C   S1= S2=G/C  S3=G/C  S4=G/C  S5=C/C  S6=G/C
 962210  A   G   S1=A/G  S2=A/G  S3=A/G  S4=A/G  S5= S6=A/G

Example

‘Merging of sequence chunks’ https://www.biostars.org/p/288324/#288466

final Map<String,List<FastaSequence>> id2seqs= new HashMap<>();
stream().forEach(S->{
    String n = S.getName();
    if(!id2seqs.containsKey(n)) id2seqs.put(n,new ArrayList<>());
    id2seqs.get(n).add(S);
    });

for(final List<FastaSequence> seqs: id2seqs.values())
{
println(">"+seqs.get(0).getName());
int len = seqs.stream().mapToInt(S->S.length()).max().getAsInt();
for(int i=0;i< len;i++)
    {
    char c='\0';
    for(FastaSequence s:seqs)
        {
        if(i>=s.length() || s.charAt(i)=='-'  ||  c==s.charAt(i)) continue;
        if(c=='\0')
            {
            c=s.charAt(i);
            }
        else
            {   
            c='X';
            }
        }
    print(c=='\0'?'-':c);
    }
println();
}

Example

SNP,INDEL counting per chromosomes in vcf

stream().
  map(V->V.getType().name()+" "+V.getContig()).
  collect(Collectors.groupingBy(Function.identity(), Collectors.counting())).forEach((K,V)->{println(K+" : "+V);});

Example

Print Minimum Depth from a VCF

$ java -jar dist/jvarkit.jar bioalcidaejdk -e 'stream().forEach(V->{println(V.getContig()+":"+V.getStart()+":"+V.getReference().getDisplayString()+"\t"+V.getGenotypes().stream().filter(G->G.hasDP()).mapToInt(G->G.getDP()).min().orElse(-1));});' input.vcf

Example

remove sequences with non-canonical nucleotides from fasta file

$ java -jar dist/jvarkit.jar bioalcidaejdk -e 'stream().filter(F->java.util.regex.Pattern.matches("^[ATGCatgc]+$",F)).forEach(S->println(">"+S.getName()+"\n"+S));' input.fa

Example

Parse BAM by insertion size and get genomic coordinates

https://bioinformatics.stackexchange.com/questions/3492/parse-bam-by-insertion-size-and-get-genomic-coordinates

$ java -jar dist/jvarkit.jar bioalcidaejdk -e 'stream().filter(R->!R.getReadUnmappedFlag() && R.getCigar().getCigarElements().stream().anyMatch(C->C.getLength()>100 && (C.getOperator().equals(CigarOperator.N) || C.getOperator().equals(CigarOperator.D)))).forEach(R->println(R.getContig()+"\t"+R.getStart()+"\t"+R.getEnd()));' input.bam

Example

Print BED file with NM attributes

java -jar dist/jvarkit.jar bioalcidaejdk -e 'stream().filter(R->!R.getReadUnmappedFlag() && R.hasAttribute("NM")).forEach(R->println(R.getContig()+"\t"+(R.getStart()-1)+"\t"+R.getEnd()+"\t"+R.getIntegerAttribute("NM")));'

Example

Extract all genotype counts from phased data in vcf files

$ wget -q -O - "http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz" |\
gunzip -c |\
java -jar dist/jvarkit.jar bioalcidaejdk -F VCF -e 'stream().forEach(V->println(V.getContig()+" "+V.getStart()+" " +V.getReference().getDisplayString()+" "+V.getGenotypes().stream().map(G->G.getAlleles().stream().map(A->String.valueOf(V.getAlleleIndex(A))).collect(Collectors.joining(G.isPhased()?"|":"/"))).collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))));'

16050075 A {0|0=2503, 0|1=1}
16050115 G {0|0=2472, 1|0=16, 0|1=16}
16050213 C {0|0=2467, 1|0=18, 0|1=18, 1|1=1}
16050319 C {0|0=2503, 1|0=1}
16050527 C {0|0=2503, 0|1=1}
16050568 C {0|0=2502, 0|1=2}
16050607 G {0|0=2499, 1|0=3, 0|1=2}
16050627 G {0|0=2502, 1|0=2}
16050646 G {0|0=2503, 1|0=1}
16050654 A {0|0=1834, 1|0=3, 0|1=6, 0|2=48, 2|0=37, 0|3=255, 3|0=258, 4|0=6, 0|4=12, 2|3=1, 3|2=1, 3|3=41, 4|3=2}
16050655 G {0|0=2503, 1|0=1}
16050678 C {0|0=2502, 0|1=2}
16050679 G {0|0=2503, 1|0=1}
16050688 C {0|0=2503, 1|0=1}
16050732 C {0|0=2503, 0|1=1}
16050739 TA {0|0=2467, 1|0=18, 0|1=18, 1|1=1}
16050758 T {0|0=2503, 0|1=1}
16050783 A {0|0=2466, 0|1=13, 1|0=24, 1|1=1}
16050840 C {0|0=2478, 1|0=11, 0|1=15}
16050847 T {0|0=2498, 1|0=2, 0|1=4}

Example

How to get percent identity from bam file ?

using the edit distance ‘NM’

$ java -jar dist/jvarkit.jar bioalcidaejdk -e 'stream().map(R->R.getReadUnmappedFlag()?0:(int)(100.0*(R.getReadLength()-R.getIntegerAttribute("NM"))/(double)R.getReadLength())).collect(Collectors.groupingBy(Function.identity(), Collectors.counting())).forEach((K,V)->{println(K+"\t"+V);});' input.bam

Example

sliding windows of 1000 variant, shift by 500 variants

java -jar jvarkit.jar bioalcidaejdk -e 'String prevContig=null; Consumer<List<VariantContext>> dump = (L)->{if(L.isEmpty()) return;System.out.println(L.get(0).getContig()+"\t"+(L.get(0).getStart()-1)+"\t"+L.get(L.size()-1).getEnd());}; final List<VariantContext> buffer=new ArrayList<>();while(iter.hasNext()) {VariantContext vc=iter.next();if(!vc.getContig().equals(prevContig)) {dump.accept(buffer);buffer.clear();prevContig=vc.getContig();} buffer.add(vc);if(buffer.size()>=1000) {dump.accept(buffer);buffer.subList(0,500).clear();}} dump.accept(buffer);' in.vcf

number of homozygous and heterozygous non-reference from a multi sample VCF file?

java -jar dist/jvarkit.jar bioalcidaejdk -e 'stream().flatMap(V->V.getGenotypes().stream()).filter(G->!G.isHomRef()).map(G->G.getSampleName()+"\t"+G.getType().name()).collect(Collectors.groupingBy(Function.identity(), Collectors.counting())).forEach((K,V)->println(K+"\t"+V));' src/test/resources/rotavirus_rf.vcf.gz

Counting soft clipped bases and reads

final long counts[]=new long[4];
stream().forEach(R->{
	counts[0]++;
	if(R.getReadUnmappedFlag()) return;
	Cigar cigar = R.getCigar();
	if(cigar==null || cigar.isEmpty()) return;
	counts[1]++;
	counts[2] += cigar.getReadLength();
	counts[3] += cigar.
		getCigarElements().
		stream().
		filter(C->C.getOperator().isClipping()).
		mapToInt(C->C.getLength()).
		sum();
	});

println("NUM-READS:"+counts[0]);
println("NUM-MAPPED-READS:"+counts[1]);
println("SUM-MAPPED-READS-LENGTH:"+counts[2]);
println("SUM-CLIPPING:"+counts[3]);

print the first exon as BED of each transcript in a gtf file

$ java -jar dist/jvarkit.jar bioalcidaejdk -e 'stream().flatMap(G->G.getTranscripts().stream().filter(T->T.hasStrand() && T.hasExon())).map(T->T.isPositiveStrand()?T.getExon(0):T.getExon(T.getExonCount()-1)).forEach(E->println(E.getContig()+"\t"+(E.getStart()-1)+"\t"+E.getEnd()+"\t"+E.getName()));' src/test/resources/Homo_sapiens.GRCh37.87.gtf.gz


120611947	120612240	ENST00000256646.Exon1
120466259	120466528	ENST00000493703.Exon1
120479904	120480086	ENST00000478864.Exon1
120611947	120612240	ENST00000479412.Exon1
120596733	120596839	ENST00000602566.Exon1
120572528	120572586	ENST00000489731.Exon1
41697525	41697776	ENST00000352645.Exon1
41697718	41697776	ENST00000486331.Exon1
41716664	41716717	ENST00000351589.Exon1
38674525	38674840	ENST00000414099.Exon1
38691021	38691163	ENST00000425664.Exon1
38691021	38691163	ENST00000443581.Exon1
38691021	38691164	ENST00000451551.Exon1
38691021	38691163	ENST00000413689.Exon1
38674525	38674853	ENST00000423572.Exon1
38691021	38691119	ENST00000333535.Exon1
38674525	38674823	ENST00000455624.Exon1
38674525	38674840	ENST00000450102.Exon1
38674525	38674807	ENST00000449557.Exon1
38595769	38596040	ENST00000464652.Exon1
38691021	38691164	ENST00000491944.Exon1
38674525	38674711	ENST00000476683.Exon1
38687181	38687267	ENST00000327956.Exon1

convert GTF to BED of introns

$ wget -q  -O - "ftp://ftp.ensemblgenomes.org/pub/release-44/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.44.gtf.gz" |\
gunzip -c |\
java -jar dist/jvarkit.jar bioalcidaejdk -F GTF -e 'stream().flatMap(G->G.getTranscripts().stream()).flatMap(T->T.getIntrons().stream()).forEach(I->println(I.getContig()+"\t"+(I.getStart()-1)+"\t"+I.getEnd()+"\t"+I.getTranscript().getGene().getGeneName()));' |\
sort -t $'\t' -k1,1 -k2,2n | uniq > introns.bed

$ head introns.bed
1   3913    3995    NAC001
1   4276    4485    NAC001
1   4605    4705    NAC001
1   5095    5173    NAC001
1   5326    5438    NAC001
1   7069    7156    ARV1
1   7232    7383    ARV1
1   7450    7563    ARV1
1   7649    7761    ARV1
1   7649    8235    ARV1

mean intron length

https://www.biostars.org/p/397168/

$ wget -O - -q "http://ftp.ensembl.org/pub/release-97/gtf/mus_musculus/Mus_musculus.GRCm38.97.gtf.gz" | gunzip -c | java -jar dist/jvarkit.jar bioalcidaejdk -F GTF -e 'stream().forEach(G->println(G.getId()+"\t"+G.getGeneName()+"\t"+G.getTranscripts().stream().flatMap(T->T.getIntrons().stream()).mapToInt(I->I.getLengthOnReference()).average().orElse(-99) ));'

introns, exons and cds in a gtf

stream().
    flatMap(GENE->GENE.getTranscripts().stream()).
    flatMap(TRANSCRIPT->{
        final List<Interval> L = new ArrayList<>();
        TRANSCRIPT.getExons().stream().forEach(E->L.add(E.toInterval()));
        TRANSCRIPT.getIntrons().stream().forEach(I->L.add(I.toInterval()));
        TRANSCRIPT.getUTRs().stream().forEach(U->L.add(U.toInterval()));
        return L.stream();
        }).forEach(R->println(R.getContig()+"\t"+(R.getStart()-1)+"\t"+R.getEnd()+"\t"+R.getStrand()+"\t"+R.getName()));

Cited in

Hickman, A.R., Selee, B., Pauly, R. et al. Discovery of eQTL Alleles Associated with Autism Spectrum Disorder: A Case-Control Study. J Autism Dev Disord (2022). https://doi.org/10.1007/s10803-022-05631-x

This site is open source. Improve this page.

jvarkit

BioAlcidaeJdk

Usage

Keywords

See also in Biostars

Creation Date

Source code

Unit Tests

Contribute

License

Citing

Why this name, ‘BioAlcidae’ ?

History

Base classes

VCF

SAM

FASTQ

Fasta

GTF

GFF

Examples

Example:

Example:

Example:

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

number of homozygous and heterozygous non-reference from a multi sample VCF file?

Counting soft clipped bases and reads

print the first exon as BED of each transcript in a gtf file

convert GTF to BED of introns

mean intron length

introns, exons and cds in a gtf

Cited in