Last commit

Extracting reads from a regular expression in a bam file


This program is now part of the main jvarkit tool. See jvarkit for compiling.

Usage: java -jar dist/jvarkit.jar biostar9462889  [options] Files

Usage: biostar9462889 [options] Files
      Compression Level. 0: no compression. 9: max compression;
      Default: 5
      overwrite existing files
      Default: false
    -h, --help
      print help and exit
      What kind of help. One of [usage,markdown,xml].
    -M, --manifest
      Manifest file describing the generated files. Optional
  * -o, --output
      (prefix) output directory
      Output file prefix
      Default: split
    -R, --reference
      Indexed fasta Reference file. This file must be indexed with samtools 
      faidx and with picard/gatk CreateSequenceDictionary or samtools dict
  * --regex, -regex
      Regular expression that can be used to parse read names in the incoming 
      SAM file. Regex groups are used to classify the reads.
      Default: <empty string>
      Limit analysis to this interval. A source of intervals. The following 
      suffixes are recognized: vcf, vcf.gz bed, bed.gz, gtf, gff, gff.gz, 
      gtf.gz.Otherwise it could be an empty string (no interval) or a list of 
      plain interval separated by '[ \t\n;,]'
      Sam output format.
      Default: BAM
      Possible Values: [BAM, SAM, CRAM]
      SAM Reader Validation Stringency
      Default: LENIENT
      Possible Values: [STRICT, LENIENT, SILENT]
      print version and exit


See also in Biostars

Creation Date


Source code




The project is licensed under the MIT license.


Should you cite biostar9462889 ? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md

The current reference is:


Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030


$ java -jar dist/biostar9462889.jar --manifest jeter.mf -o TMP --regex '^(RF[0-9]+)_' src/test/resources/S1.bam
WARNING: BAM index file /home/lindenb/src/jvarkit/src/test/resources/S1.bam.bai is older than BAM /home/lindenb/src/jvarkit/src/test/resources/S1.bam
[INFO][Biostar9462889]Creating output for "RF01" N=1
[INFO][Biostar9462889]Creating output for "RF02" N=2
[INFO][Biostar9462889]Creating output for "RF03" N=3
[INFO][Biostar9462889]Creating output for "RF04" N=4
[INFO][Biostar9462889]Creating output for "RF05" N=5
[INFO][Biostar9462889]Creating output for "RF06" N=6
[INFO][Biostar9462889]Creating output for "RF07" N=7
[INFO][Biostar9462889]Creating output for "RF08" N=8
[INFO][Biostar9462889]Creating output for "RF09" N=9
[INFO][Biostar9462889]Creating output for "RF10" N=10
[INFO][Biostar9462889]Creating output for "RF11" N=11
[WARN][Biostar9462889]0 read(s) where lost because the regex '^(RF[0-9]+)_' failed.

$ ls TMP/*.bam
TMP/split.000001.bam  TMP/split.000004.bam  TMP/split.000007.bam  TMP/split.000010.bam
TMP/split.000002.bam  TMP/split.000005.bam  TMP/split.000008.bam  TMP/split.000011.bam
TMP/split.000003.bam  TMP/split.000006.bam  TMP/split.000009.bam

$ cat jeter.mf
RF01	TMP/split.000001.bam
RF02	TMP/split.000002.bam
RF03	TMP/split.000003.bam
RF04	TMP/split.000004.bam
RF05	TMP/split.000005.bam
RF06	TMP/split.000006.bam
RF07	TMP/split.000007.bam
RF08	TMP/split.000008.bam
RF09	TMP/split.000009.bam
RF10	TMP/split.000010.bam
RF11	TMP/split.000011.bam