Assembling reads with velvet

Velvet is a hash-based deBrujin assembler. I occasionally use it to build contigs from short read RNAseq data, although it can be used for long reads, metagenomics, etc.

There are two parts to the program:
1. velveth – prepare the sequences
2. velvetg – build the graph and output the contig.fa and some summary statistics


# prepare sequences
velveth contig/ 21 -raw dmd_kmer_matches.txt
# build the graph
velvetg contig/ -min_contig_lgth 100

There are several parameters you can tune.
– kmer length
– coverage cutoff (cov_cutoff)
– minumum contig length (min_contig_lgth)

It can take as input paired end or single end, FASTQ, FASTA and other formats. It is not restricted to illumina output.

velvetg will produce a fasta file of contigs and summary stats.

GMAP for long read RNA

GMAP is an old (circa 2006) software for long read alignment. Its use case is for mapping RNA reads back to a genome. It has found new life in the world of long read RNAseq such as from Pacbio reads.

Perhaps because of its age and architecture, it has some quirks and dependencies that seem odd to the modern bioinformatician.

Download & untar

Get the code from http://research-pub.gene.com/gmap/. It is versioned by date.

Extract the files using

tar zxvf gmap-2018-05-30.tar.gz

change into the source directory and edit the config.site file. config.site
sets some system wide parameters the most important will be the path to your
GMAP database, which is where the index, one you create it, will be placed.

Then compile from the command line


./configure
make
make install

It will help to read the notes README and INSTALL for particular issues with different architectures, operating systems, etc. Because of the many optimizations, some deep diving may be required.

Building an index

There might be some prebuilt indexes, but I couldn’t find them. To build your own you need to specify fasta file(s). One per chromosome or one for the whole genome or contigs or whatever.


# in this case, i called it hg19
gmap_build -d nameOfGenome /path/to/genome_fasta_files

The index files will be placed in whatever you set in config.site or on the gmap_build command line options.

Align a cdna

To align, there are a few options.

  1. map
  2. map and align (show alignment)
  3. align only
  4. batch mode

Since I had several fastqs in one file, I needed to convert FASTQ to FASTA.

Using seqtk
seqtk seq -a input.fastq > output.fasta

And align

# this aligns
gmap -d hg19 myfile.fasta
# this aligns and maps (shows alignment)
gmap -d hg19 -A myfile.fasta

Examining output

The output has genome sequence on the top row and the cDNA on the bottom with the cDNA in its forward direction.
Numbers below introns indicate its length in nucleotides.
GMAP uses these symbols:


| match
> canonical intron
- gap
= non-canonical intron

Figure 4 in the Wu et al paper is helpful in understanding the output. [Wu & Watanabe. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics May 2005]
 

Using liftover

The LiftOver tool is essentially a way for you to take a location or set of locations in one build of a genome and map it to a different build or genome. For instance, say you have a locus chr17:1000000-1000100 in human genome build hg17 and you want to know where it resides in the hg18 build. You can use LiftOver for this.

If you have a relatively small number of locations you can use the web version of LiftOver (http://genome.ucsc.edu/cgi-bin/hgLiftOver LiftOver).

Here we’ll use the command line tool to lift over coordinates.

Grab the executable you need from http://hgdownload.cse.ucsc.edu/admin/exe/, browser over to your operating system (for instance linux.x86_64) and download the file liftOver to your local machine. This file is an executable, so you don’t need to compile and it should just work. You’ll probably need to make it executable by changing the permissions

chmod 744 liftOver

The basic synatax is this:

liftOver fileOfOldPositions map.chain fileOfNewPositions fileOfUnmappedPositions

The map.chain files are the ones that allow you to get from one build (or species) to another. These are found on the UCSC sequence and annotation downloads page (http://hgdownload.cse.ucsc.edu/downloads.html#terms) for the organism and genome build of interest. For each species and build, there’s a “LiftOver files” link. Click that to get to the chain files.

The chain files are named as explained below:

The file names reflect the assembly conversion data contained within
in the format To.over.chain.gz. For example, a file named
hg15Tohg16.zip file contains the liftOver data needed to convert hg15
(Human Build 33) coordinates to hg16 (Human Build 34). If no file is
available for the assembly in which you’re interested, please send a
request to the genome mailing list (genome@soe.ucsc.edu) and we will
attempt to provide you with one.

So if you want to liftOver hg17 positions to hg18 positions in human, you’ll want a file named hg17ToHg18.over.chain.gz. Chain files are basically the result of whole genome-to-genome alignments.

Unzip the chain file
gunzip hg17ToHg18.over.chain.gz

You’ll also need your list of positions in BED format, which is basically a tab separated file of positions like so:

chrN startpos endpos

Then all you have to do is run liftOver. Not every position will be found in the newer build because merges, changes or revisions of the build so unmapped locations will be written out to the unmapped file. You can also use gff/gtf files as input. The only option I’ve really messed with is the minMatch parameter. For human to human, the recommended setting is 0.95 (95% of bases in your given interval match). For cross-species, say human to mouse, I commonly used 0.10 (10%) which is the recommended setting.