GMAP is an old (circa 2005) program for long-read alignment. Its main use case is mapping RNA reads back to a genome, and it has found new life in long-read RNA-seq, for example with PacBio reads.
Perhaps because of its age and architecture, it has some quirks and dependencies that seem odd to the modern bioinformatician.
Download & untar
Get the code from http://research-pub.gene.com/gmap/. It is versioned by date.
Extract the files using
tar zxvf gmap-2018-05-30.tar.gz
Change into the source directory and edit the config.site file. config.site sets some system-wide parameters; the most important is the path to your GMAP database, which is where the index, once you create it, will be placed.
Then compile from the command line
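The GMAP README describes the usual configure-and-make sequence, which can be sketched roughly as follows. The two paths are placeholders of my own choosing, and passing --with-gmapdb to configure is an alternative to editing config.site; check README and INSTALL for the options your version supports.

```shell
# Sketch of the build, meant to be run from inside the extracted source directory.
# PREFIX and GMAPDB are placeholder locations (assumptions); adjust for your system.
PREFIX="/opt/gmap"
GMAPDB="/opt/gmapdb"

if [ -x ./configure ]; then
    # --prefix sets where the binaries go, --with-gmapdb where indexes are stored
    ./configure --prefix="$PREFIX" --with-gmapdb="$GMAPDB"
    make
    make install
else
    echo "configure script not found: cd into the gmap source directory first"
fi
```

Installing under a prefix you own avoids needing root for the make install step.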
It helps to read the README and INSTALL notes for issues specific to different architectures, operating systems, etc. Because of the many optimizations, some deep diving may be required.
Building an index
There might be some prebuilt indexes, but I couldn’t find them. To build your own, you need to specify FASTA file(s). These can be one file per chromosome, a single file for the whole genome, a set of contigs, or whatever you have.
# nameOfGenome is the name you choose for the index; in this case, I called it hg19
gmap_build -d nameOfGenome /path/to/genome_fasta_files
The index files will be placed in the directory you set in config.site or gave on the gmap_build command line.
Align a cDNA
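As a sketch of the command-line route, gmap_build accepts a -D option naming the destination directory for the index. The directory and index name below are placeholders, and the FASTA path is the same stand-in used above.

```shell
# Placeholder values (assumptions): hg19 is the index name, GMAPDB the directory to hold it.
GENOME="hg19"
GMAPDB="/opt/gmapdb"

if command -v gmap_build >/dev/null 2>&1; then
    # -D overrides the database directory chosen at configure time
    gmap_build -D "$GMAPDB" -d "$GENOME" /path/to/genome_fasta_files
else
    echo "gmap_build not found on PATH"
fi
```

If you build the index in a non-default directory, the same -D value has to be given to gmap later so it can find the index.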
To align, there are a few options.
- map and align (show alignment)
- align only
- batch mode
Since I had several reads in one FASTQ file, I needed to convert it from FASTQ to FASTA first.
seqtk seq -a input.fastq > output.fasta
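If seqtk is not installed, the same conversion can be sketched with awk. This assumes strict four-line FASTQ records (no wrapped sequence lines); the function name fastq_to_fasta is my own and not part of any tool.

```shell
# Convert four-line FASTQ records on stdin to FASTA on stdout.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }  # header line: swap leading @ for >
         NR % 4 == 2 { print }'                   # sequence line: copy unchanged
}

# Tiny demonstration with a made-up read
printf '@read1\nACGT\n+\nIIII\n' | fastq_to_fasta
```

For a real file this would be run as fastq_to_fasta < input.fastq > output.fasta.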
# this maps (prints a summary of where each sequence lands in the genome)
gmap -d hg19 myfile.fasta
# this maps and aligns (shows the full alignment)
gmap -d hg19 -A myfile.fasta
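The other two options from the list can be sketched along the same lines. File names here are placeholders, and the flags (-g for a genomic segment, -B for batch mode, -t for threads) are as I understand them from gmap --help, so double-check them against your version.

```shell
# File names are placeholders (assumptions); verify the flags with `gmap --help`.
if command -v gmap >/dev/null 2>&1; then
    # alignment only: align against a genomic segment supplied as FASTA, no index needed
    gmap -g genome_segment.fasta -A myfile.fasta > segment_alignment.txt

    # batch mode: -B controls how much of the index is loaded into memory
    # rather than memory-mapped, and -t sets the number of worker threads
    gmap -d hg19 -B 4 -t 4 -A myfile.fasta > batch_alignment.txt
else
    echo "gmap not found on PATH"
fi
```

Higher -B values trade memory for speed, which tends to pay off only when aligning many sequences in one run.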
The output has the genome sequence on the top row and the cDNA on the bottom, with the cDNA in its forward direction.
Numbers below introns indicate their lengths in nucleotides.
GMAP uses these symbols:
> canonical intron
= non-canonical intron
Figure 4 in the Wu and Watanabe paper is helpful in understanding the output. [Wu & Watanabe. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics, May 2005]