Convert GFF3 format to UCSC knownGene format.
This program is now part of the main jvarkit tool. See jvarkit for compilation instructions.
Usage: java -jar dist/jvarkit.jar gff2kg [options] Files
Usage: gff2kg [options] Files
Options:
-bed12, --bed12
Output bed.
Default: false
-bin, --bin
Insert UCSC 'bin' column as the first column.
Default: false
-h, --help
print help and exit
--helpFormat
What kind of help. One of [usage,markdown,xml].
-o, --output
Output file. Optional. Default: stdout
--version
print version and exit
Since: 20160404
The project is licensed under the MIT license.
Should you cite gff2kg? See https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md
The current reference is:
http://dx.doi.org/10.6084/m9.figshare.1425030
Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030
$ curl -s "ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz" |\
gunzip -c |\
java -jar dist/jvarkit.jar gff2kg
(...)
1826 ENST00000367917.3 chr1 + 162760522 162782607 162760590 162782210 8 162760522,162762448,162766374,162767591,162769532,162774056,162775183,162782087 162760625,162762652,162766467,162767706,162769727,162774113,162775282,162782607 gene_id=ENSG00000132196.9;transcript_id=ENST00000367917.3;gene_type=protein_coding;gene_status=KNOWN;gene_name=HSD17B7;transcript_type=protein_coding;transcript_name=HSD17B7-201;protein_id=ENSP00000356894.3;havana_gene=OTTHUMG00000034420.6; ENST00000367917.3
(...)
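The first column of the record above (1826) is consistent with the standard UCSC binning scheme applied to txStart and txEnd. The following is a minimal Python sketch of that scheme, given only to show how such a value can be re-derived. It is not taken from the jvarkit source, and the function name `ucsc_bin` is hypothetical.

```python
# Standard UCSC binning scheme (assumption: gff2kg follows it for the
# 'bin' column; this sketch is not copied from jvarkit).
# Offsets of the first bin at each level, finest (128 kb) to coarsest (512 Mb).
BIN_OFFSETS = (512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0)
BIN_FIRST_SHIFT = 17  # 2**17 = 128 kb, size of the smallest bins
BIN_NEXT_SHIFT = 3    # each level up is 8 times larger


def ucsc_bin(start: int, end: int) -> int:
    """Return the smallest bin containing the 0-based half-open range [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        # The range spans two bins at this level: try the next, coarser level.
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError(f"range {start}-{end} is out of bounds for the standard scheme")


# txStart and txEnd of ENST00000367917.3 from the record above
print(ucsc_bin(162760522, 162782607))  # prints 1826
```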
The same transcript in the UCSC database (the table structure is not that of knownGene, but the intervals can be validated):
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -D hg19 -e 'select * from wgEncodeGencodeBasicV19 where name="ENST00000367917.3"' | cat
bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds score name2 cdsStartStat cdsEndStat exonFrames
1826 ENST00000367917.3 chr1 + 162760522 162782607 162760590 162782210 8 162760522,162762448,162766374,162767591,162769532,162774056,162775183,162782087, 162760625,162762652,162766467,162767706,162769727,162774113,162775282,162782607, 0 HSD17B7 cmpl cmpl 0,2,2,2,0,0,0,0,
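The interval check can also be scripted. The sketch below is a hedged illustration in Python: it assumes that both rows share the same first eleven whitespace-separated columns (bin through exonEnds), as they do in the two records shown here, and that the only formatting difference is the trailing comma UCSC appends to exonStarts and exonEnds. The names `parse_row`, `GFF2KG_ROW` and `UCSC_ROW` are hypothetical, and the rows are truncated to those eleven columns.

```python
# First 11 columns of the gff2kg record and of the UCSC row shown above.
GFF2KG_ROW = (
    "1826 ENST00000367917.3 chr1 + 162760522 162782607 162760590 162782210 8 "
    "162760522,162762448,162766374,162767591,162769532,162774056,162775183,162782087 "
    "162760625,162762652,162766467,162767706,162769727,162774113,162775282,162782607"
)
UCSC_ROW = (
    "1826 ENST00000367917.3 chr1 + 162760522 162782607 162760590 162782210 8 "
    "162760522,162762448,162766374,162767591,162769532,162774056,162775183,162782087, "
    "162760625,162762652,162766467,162767706,162769727,162774113,162775282,162782607,"
)


def parse_row(line: str) -> dict:
    """Extract the transcript, CDS and exon intervals from one row."""
    cols = line.split()
    # 'if x' drops the empty item produced by a trailing comma.
    starts = [int(x) for x in cols[9].split(",") if x]
    ends = [int(x) for x in cols[10].split(",") if x]
    if not (len(starts) == len(ends) == int(cols[8])):
        raise ValueError("exonCount does not match exonStarts/exonEnds")
    return {
        "bin": int(cols[0]),
        "name": cols[1],
        "chrom": cols[2],
        "strand": cols[3],
        "tx": (int(cols[4]), int(cols[5])),
        "cds": (int(cols[6]), int(cols[7])),
        "exons": list(zip(starts, ends)),
    }


# True when both sources describe the same intervals.
print(parse_row(GFF2KG_ROW) == parse_row(UCSC_ROW))  # prints True
```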
$ wget -O - "ftp://ftp.ensembl.org/pub/grch37/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh37.82.gtf.gz" |\
gunzip -c |\
java -jar dist/jvarkit.jar gff2kg