Takes a UCSC genePred file, scans the 5’ UTRs and generates a GFF3 file containing upstream ORFs (uORFs). Inspired by https://github.com/ImperialCardioGenetics/uORFs
This program is now part of the main jvarkit tool. See jvarkit for compiling.
Usage: java -jar dist/jvarkit.jar gff3upstreamorf [options] Files
Usage: gff3upstreamorf [options] Files
-o, --output
Output file. Optional . Default: stdout
* -r, -R, --reference
Indexed fasta Reference file. This file must be indexed with samtools
faidx and with picard/gatk CreateSequenceDictionary or samtools dict
only accept events that are greater or equal to this Kozak strength.
Default: nil
Possible Values: [Strong, Moderate, Weak, nil]
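The Kozak strength levels listed above can be illustrated with a minimal Python sketch. It assumes the commonly used rule that inspects position -3 (a purine, A or G) and position +4 (a G) around the ATG. This rule is an assumption for illustration and may not match the tool's exact implementation. The function name `kozak_strength` is hypothetical, and the `nil` value is not modeled here.

```python
def kozak_strength(seq: str, atg_pos: int) -> str:
    """Classify the Kozak context of the ATG starting at atg_pos (0-based).

    Assumed rule (not taken from the tool's source):
      Strong   = purine at -3 AND G at +4
      Moderate = exactly one of the two
      Weak     = neither
    """
    seq = seq.upper()
    if seq[atg_pos:atg_pos + 3] != "ATG":
        raise ValueError("no ATG at position %d" % atg_pos)
    # -3 is three bases upstream of the A; +4 is the base right after ATG.
    minus3 = seq[atg_pos - 3] if atg_pos >= 3 else None
    plus4 = seq[atg_pos + 3] if atg_pos + 3 < len(seq) else None
    purine_at_minus3 = minus3 in ("A", "G")
    g_at_plus4 = plus4 == "G"
    if purine_at_minus3 and g_at_plus4:
        return "Strong"
    if purine_at_minus3 or g_at_plus4:
        return "Moderate"
    return "Weak"


if __name__ == "__main__":
    # ATG at index 6, A at -3 and G at +4
    print(kozak_strength("GCCACCATGG", 6))
```

With such a classification, a "greater or equal" filter would order the levels as Strong, then Moderate, then Weak.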
The project is licensed under the MIT license.
The current reference is:
Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030
Part of this code was inspired by: https://github.com/ImperialCardioGenetics/uORFs/blob/master/5primeUTRannotator/five_prime_UTR_annotator.pm
An Upstream Open Reading Frame (uORF) is an open reading frame (ORF) within the 5’ untranslated region (5’UTR) of an mRNA. uORFs can regulate eukaryotic gene expression. Translation of the uORF typically inhibits downstream expression of the primary ORF. In bacteria, uORFs are called leader peptides, and were originally discovered on the basis of their impact on the regulation of genes involved in the synthesis or transport of amino acids.
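As a rough illustration of the definition above, the following Python sketch scans an already spliced 5’ UTR sequence for ATG codons and looks for the first in-frame stop codon. It is a simplified sketch, not the tool's algorithm: the function name `find_uorfs` is hypothetical, and a uORF without a stop codon inside the UTR is only reported as open-ended (`None`) rather than followed into the main coding sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(utr5: str):
    """Return a list of (start, end) for every ATG found in a 5' UTR sequence.

    start is the 0-based index of the A of the ATG.
    end is the exclusive index just after the first in-frame stop codon,
    or None when no in-frame stop occurs before the end of the UTR.
    """
    utr5 = utr5.upper()
    found = []
    for start in range(len(utr5) - 2):
        if utr5[start:start + 3] != "ATG":
            continue
        end = None
        # Walk codon by codon, in frame with the ATG.
        for pos in range(start + 3, len(utr5) - 2, 3):
            if utr5[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                break
        found.append((start, end))
    return found


if __name__ == "__main__":
    for start, end in find_uorfs("CCATGAAATAGCC"):
        print(start, end)
```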
Input is a UCSC “genePred/knownGene” file. If you only have a GFF/GFF3 file, you can use gff2kg to create one.
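To make the input format concrete, here is a minimal Python sketch that extracts the genomic 5’ UTR blocks from one genePred line. It assumes the standard UCSC column order (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, ...), 0-based half-open coordinates, and a leading `bin` column as found in UCSC database table dumps. The function name `five_prime_utr_blocks` and the sample line are hypothetical.

```python
def five_prime_utr_blocks(line: str, has_bin: bool = True):
    """Return the 5' UTR exon blocks of a genePred line as (start, end) tuples."""
    fields = line.rstrip("\n").split("\t")
    if has_bin:
        fields = fields[1:]  # drop the leading 'bin' column
    strand = fields[2]
    cds_start, cds_end = int(fields[5]), int(fields[6])
    starts = [int(x) for x in fields[8].split(",") if x]
    ends = [int(x) for x in fields[9].split(",") if x]
    blocks = []
    if cds_start == cds_end:
        return blocks  # non-coding transcript: no UTR to report
    for s, e in zip(starts, ends):
        if strand == "+":
            # Keep the part of the exon located before cdsStart.
            e = min(e, cds_start)
        else:
            # On the minus strand the 5' UTR lies after cdsEnd.
            s = max(s, cds_end)
        if s < e:
            blocks.append((s, e))
    return blocks


if __name__ == "__main__":
    # Made-up transcript with two exons, for illustration only.
    sample = "0\tTX1\tchr1\t+\t100\t500\t250\t450\t2\t100,200,\t150,500,\t0\tGENE1\tcmpl\tcmpl\t-1,0,"
    print(five_prime_utr_blocks(sample))
```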
wget -O - "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/wgEncodeGencodeBasicV47.txt.gz" |\
gunzip -c |\
java -jar dist/jvarkit.jar gff3upstreamorf -R GRCh38.fa > uorf.gff3
bcftools csq --ncsq 1000 -l -f GRCh38.fa -g uorf.gff3 in.bcf |\
bcftools annotate --rename-annots <(echo -e 'INFO/BCSQ\tUTR_BCSQ')
note to self: test ENSG00000141736 https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1003529