GMAP for long read RNA

GMAP is an old (circa 2005) program for aligning long sequences. Its use case is mapping RNA reads back to a genome. It has found new life in the world of long-read RNA-seq, for example with PacBio reads.

Perhaps because of its age and architecture, it has some quirks and dependencies that seem odd to the modern bioinformatician.

Download & untar

Get the code from http://research-pub.gene.com/gmap/. It is versioned by date.

Extract the files using

tar zxvf gmap-2018-05-30.tar.gz

Change into the source directory and edit the config.site file. config.site
sets some system-wide parameters; the most important is the path to your
GMAP database, which is where the index, once you create it, will be placed.
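
As a rough illustration, the relevant lines of config.site might look like the sketch below. The variable names prefix and with_gmapdb are assumed to match the commented-out entries in the shipped file, so confirm them against your own copy; the paths are placeholders I made up.

```shell
# Sketch of config.site settings (paths are placeholders, not defaults).
# Assumption: these variable names match the commented-out entries in the
# config.site shipped with your GMAP release.
prefix="$HOME/gmap"              # binaries are installed under $prefix/bin
with_gmapdb="$prefix/gmapdb"     # directory where genome indexes are stored
```

If you would rather not edit the file, the same two settings can usually be passed to configure directly as --prefix and --with-gmapdb.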

Then compile from the command line


./configure
make
make install

It helps to read the README and INSTALL notes for issues specific to particular architectures and operating systems. Because of the many optimizations, some deep diving may be required.

Building an index

There might be some prebuilt indexes, but I couldn’t find them. To build your own, you need to specify one or more FASTA files: one per chromosome, one for the whole genome, a set of contigs, and so on.


# nameOfGenome is the name you choose for the index; in this case, I called it hg19
gmap_build -d nameOfGenome /path/to/genome_fasta_files

The index files will be placed in the directory you set in config.site, or in the one you give with the -D option on the gmap_build command line.
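
To make the directory choice explicit, here is a minimal sketch of a wrapper that always passes -D. The function name build_index and the GMAP_BUILD override variable are hypothetical conveniences of mine, not part of GMAP.

```shell
# Hypothetical wrapper around gmap_build that always names the index directory.
# GMAP_BUILD may be set to the full path of the binary if it is not on PATH.
# usage: build_index <index_dir> <index_name> <fasta>...
build_index() {
    index_dir=$1
    index_name=$2
    shift 2
    # -D: directory to write the index into; -d: name of the index.
    # The remaining arguments are the FASTA files to index.
    "${GMAP_BUILD:-gmap_build}" -D "$index_dir" -d "$index_name" "$@"
}
```

For example: build_index /path/to/gmapdb hg19 /path/to/genome_fasta_files/*.fa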

Align a cDNA

To align, there are a few options.

  1. map
  2. map and align (show alignment)
  3. align only
  4. batch mode

Since I had several reads in one FASTQ file, I needed to convert FASTQ to FASTA.

Using seqtk:
seqtk seq -a input.fastq > output.fasta
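
If seqtk is not installed, a plain awk fallback does the same job for simple files. This is a sketch that assumes every record is exactly four lines (no wrapped sequences); the function name fastq_to_fasta is my own.

```shell
# Convert FASTQ on stdin to FASTA on stdout.
# Assumes each record is exactly four lines (header, sequence, "+", qualities).
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # "@name" becomes ">name"
         NR % 4 == 2 { print }                      # sequence line, unchanged
        '
}
```

For example: fastq_to_fasta < input.fastq > output.fasta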

And align

# this maps
gmap -d hg19 myfile.fasta
# this maps and aligns (shows alignment)
gmap -d hg19 -A myfile.fasta
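
For a file with many reads, you will probably want several threads and a machine-readable output rather than the text alignment. The sketch below wraps that in a small function; gmap_to_sam and the GMAP override variable are hypothetical names of mine, and the default of 4 threads is arbitrary.

```shell
# Hypothetical helper: map and align a FASTA file with several threads and
# write SAM to stdout. GMAP may be set to the full path of the gmap binary.
# usage: gmap_to_sam <index_name> <reads.fasta> [threads]
gmap_to_sam() {
    # -d: index name; -t: worker threads; -f samse: single-end SAM output
    "${GMAP:-gmap}" -d "$1" -t "${3:-4}" -f samse "$2"
}
```

For example: gmap_to_sam hg19 myfile.fasta 8 > myfile.sam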

Examining output

The output has the genome sequence on the top row and the cDNA on the bottom, with the cDNA in its forward direction.
The number below an intron indicates its length in nucleotides.
GMAP uses these symbols:


| match
> canonical intron
- gap
= non-canonical intron

Figure 4 in the Wu & Watanabe paper is helpful in understanding the output. [Wu & Watanabe. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics, May 2005]